<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://nockawa.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://nockawa.github.io/" rel="alternate" type="text/html" /><updated>2026-04-12T21:36:55+00:00</updated><id>https://nockawa.github.io/feed.xml</id><title type="html">Nockawa’s Blog</title><subtitle>Hello there, I&apos;m Loïc Baumann&lt;br/&gt; Welcome to my site&lt;br/&gt; I talk about programming</subtitle><author><name>Loïc Baumann</name></author><entry><title type="html">Microsecond Latency in a Managed Language: The Performance Philosophy Behind Typhon</title><link href="https://nockawa.github.io/blog/microsecond-latency-managed-language/" rel="alternate" type="text/html" title="Microsecond Latency in a Managed Language: The Performance Philosophy Behind Typhon" /><published>2026-04-12T00:00:00+00:00</published><updated>2026-04-12T00:00:00+00:00</updated><id>https://nockawa.github.io/blog/microsecond-latency-managed-language</id><content type="html" xml:base="https://nockawa.github.io/blog/microsecond-latency-managed-language/"><![CDATA[<blockquote>
  <p>💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.<br />
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.</p>
</blockquote>

<blockquote>
  <p><strong>Series: A Database That Thinks Like a Game Engine</strong></p>
  <ol>
    <li><a href="https://nockawa.github.io/blog/why-building-database-engine-in-csharp/">Why I’m Building a Database Engine in C#</a></li>
    <li><a href="https://nockawa.github.io/blog/what-game-engines-know-about-data/">What Game Engines Know About Data That Databases Forgot</a></li>
    <li><strong>Microsecond Latency in a Managed Language</strong> <em>(this post)</em></li>
    <li>Deadlock-Free by Construction <em>(coming soon)</em></li>
  </ol>
</blockquote>

<blockquote>
  <p><img class="emoji" src="https://github.githubassets.com/images/icons/emoji/octocat.png" alt="Octocat" height="20" width="20" /> <a href="https://github.com/nockawa/Typhon">GitHub repo</a>  •  📬 <a href="https://nockawa.github.io/feed.xml">Subscribe via RSS</a></p>
</blockquote>

<p>The first two posts in this series covered the <em>why</em> and the <em>what</em>. Why C# for a database engine. What happens when you combine ECS storage with database guarantees.</p>

<p>This post is the <em>how</em>. Specifically: the five design principles that guide every performance decision in Typhon. Not a bag of tricks — a philosophy. Individual optimizations come and go as the engine evolves, but these principles are stable. They’re what let a managed language deliver sub-microsecond transaction latency.</p>

<p>When your tick budget is 16 milliseconds and you have 100,000 entities to process, every nanosecond of per-entity cost matters. And most of that cost comes from decisions made at design time, not runtime.</p>

<h2 id="principle-1-control-memory-layout">Principle 1: Control Memory Layout</h2>

<p>Performance starts at the struct definition, not the algorithm. If your data layout causes cache misses, no algorithm can save you.</p>

<p>The most dramatic example: Typhon recently moved from per-entity hash-table lookups to cluster-based Structure of Arrays (SoA) storage. Same data, same queries, different memory layout. Measured on a Ryzen 9 7950X:</p>

<table>
  <thead>
    <tr>
      <th>Path</th>
      <th>ns / entity</th>
      <th>vs baseline</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Standard EntityAccessor</td>
      <td>139 ns</td>
      <td>1.0x</td>
    </tr>
    <tr>
      <td>ArchetypeAccessor (cached)</td>
      <td>94 ns</td>
      <td>1.5x</td>
    </tr>
    <tr>
      <td><strong>Cluster iteration</strong></td>
      <td><strong>2.5 ns</strong></td>
      <td><strong>55x</strong></td>
    </tr>
  </tbody>
</table>

<p>That’s a 55x improvement from changing memory layout alone. The reason: clusters pack N entities (8 to 64, auto-computed per archetype) in contiguous SoA memory. All positions together, all health values together. Every cache line the CPU loads is 100% useful data. For 100K entities, the working set dropped from scattered L3/DRAM access to ~2.5 MB that fits entirely in L2 cache — and L2 is 3x faster than L3 on Zen 4.</p>

<p>The cluster size isn’t a magic constant. An auto-tuning algorithm evaluates every N from 8 to 64 and picks the one that maximizes entities per 8 KB page for a given archetype’s component schema. Non-power-of-2 sizes often pack better: N=14 can yield 28 entities per page vs N=16 yielding only 16. The capacity is derived from the data, not from convention.</p>

<p><strong>False sharing</strong> is the other side of layout control. When multiple threads write to adjacent fields, the CPU bounces the shared cache line between cores — a 40-60 cycle penalty per bounce. Typhon wraps mutable per-thread state in 64-byte padded structs. The WAL commit buffer goes further: explicit padding fields isolating the producer’s <code class="language-plaintext highlighter-rouge">_tailPosition</code> and the consumer’s <code class="language-plaintext highlighter-rouge">_drainPosition</code> onto separate cache lines. Seven unused <code class="language-plaintext highlighter-rouge">long</code> fields between them, suppressed with <code class="language-plaintext highlighter-rouge">#pragma warning</code>, because the correct layout matters more than the linter’s opinion.</p>
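<p>A minimal sketch of that padding idea, using explicit layout. The struct and field names here are illustrative (the post names the real fields <code>_tailPosition</code> and <code>_drainPosition</code>); Typhon's actual WAL buffer layout may differ:</p>

```csharp
using System.Runtime.InteropServices;

// Hypothetical sketch: producer and consumer positions forced onto separate
// 64-byte cache lines, mirroring the "seven unused longs" padding described
// above. Total size: two full cache lines.
[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct CommitBufferPositions
{
    // Producer-owned. Bytes 8..63 of its cache line stay unused on purpose.
    [FieldOffset(0)]  public long TailPosition;

    // Consumer-owned, one full cache line away: writes to TailPosition can
    // no longer invalidate the line holding DrainPosition, and vice versa.
    [FieldOffset(64)] public long DrainPosition;
}
```

<p>The cost is 112 wasted bytes per instance — cheap insurance against a 40-60 cycle bounce on every cross-core write.</p>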

<p>The same hardware awareness drives B+Tree node sizing:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="nf">StructLayout</span><span class="p">(</span><span class="n">LayoutKind</span><span class="p">.</span><span class="n">Sequential</span><span class="p">,</span> <span class="n">Pack</span> <span class="p">=</span> <span class="m">4</span><span class="p">)]</span>
<span class="k">public</span> <span class="k">unsafe</span> <span class="k">struct</span> <span class="nc">Index32Chunk</span>
<span class="p">{</span>
    <span class="c1">// 256 bytes — fills four cache lines. Adjacent Line Prefetcher (ALP) on</span>
    <span class="c1">// Zen 4+/recent Intel automatically fetches paired 64-byte lines within</span>
    <span class="c1">// 128-byte regions, so two ALP triggers cover the full node.</span>

    <span class="k">public</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">Capacity</span> <span class="p">=</span> <span class="m">29</span><span class="p">;</span>

    <span class="k">public</span> <span class="kt">int</span> <span class="n">Control</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">OlcVersion</span><span class="p">;</span>       <span class="c1">// bit 0 = locked, bit 1 = obsolete, bits 2-31 = version</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">PrevChunk</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">NextChunk</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">LeftValue</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">HighKey</span><span class="p">;</span>          <span class="c1">// B-link upper bound</span>
    <span class="k">public</span> <span class="k">fixed</span> <span class="kt">int</span> <span class="n">Values</span><span class="p">[</span><span class="n">Capacity</span><span class="p">];</span>  <span class="c1">// 29 × 4 = 116 bytes</span>
    <span class="k">public</span> <span class="k">fixed</span> <span class="kt">int</span> <span class="n">Keys</span><span class="p">[</span><span class="n">Capacity</span><span class="p">];</span>    <span class="c1">// 29 × 4 = 116 bytes</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This struct is exactly 256 bytes because of the CPU’s prefetcher. The Adjacent Line Prefetcher on modern x86 fetches paired 64-byte lines within 128-byte aligned regions — so two ALP triggers cover the full node. A 256-byte node costs effectively the same as a 128-byte node in terms of memory access, but holds nearly twice the keys.</p>

<p>The capacity of 29 keys isn’t a round number because it isn’t derived from the algorithm. It’s derived from the hardware: 256 bytes of budget minus 24 bytes of header, divided across Keys and Values arrays. Typhon has three B+Tree variants — 16-bit, 32-bit, and 64-bit keys — and all three hit exactly 256 bytes with different capacities (38, 29, and 19 keys respectively). Post #1 mentioned 128-byte nodes. We’ve since moved to 256 bytes after measuring ALP behavior on Zen 4 — capacity went up, lookup latency stayed flat.</p>

<h2 id="principle-2-eliminate-allocations-on-hot-paths">Principle 2: Eliminate Allocations on Hot Paths</h2>

<p>In .NET, every allocation is a future GC event. On hot paths, the cost isn’t the allocation itself (~5 ns) — it’s the Gen0/Gen1 collection later that pauses unrelated threads. The discipline is simple: allocate nothing in steady state.</p>

<p><code class="language-plaintext highlighter-rouge">ref struct</code> is the primary weapon. A <code class="language-plaintext highlighter-rouge">ref struct</code> lives on the stack, dies when the scope ends, and the GC never knows it existed. Post #1 showed <code class="language-plaintext highlighter-rouge">EntityRef</code> (96 bytes, inline component cache). But ref structs are a systematic discipline in Typhon, not a one-off optimization:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">OlcLatch</code></strong>: wraps a single <code class="language-plaintext highlighter-rouge">ref int</code> — the B+Tree node’s version field. The entire optimistic lock coupling protocol (read version, validate, try-write-lock) in a struct that’s basically a typed pointer. Allocated millions of times per second during tree traversal, at zero GC cost.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">EpochGuard</code></strong>: <a href="https://en.wikipedia.org/wiki/Resource_acquisition_is_initialization">RAII</a> scope for epoch-based page protection. Enter and exit in 3.3 ns. Because it’s a <code class="language-plaintext highlighter-rouge">ref struct</code>, it can’t be boxed, captured in a closure, or passed to async code — exactly the constraints you want for a scope guard.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">WalClaim</code></strong>: a Write-Ahead Log buffer claim containing a <code class="language-plaintext highlighter-rouge">Span&lt;byte&gt;</code> that points directly into native WAL memory. Can’t escape to the heap by construction — the Span field makes it a <code class="language-plaintext highlighter-rouge">ref struct</code> automatically.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PointInTimeAccessor</code></strong>: a reusable snapshot attached to parallel workers. One per worker, stored in a flat array indexed by worker ID. Zero per-entity dictionary overhead — no <code class="language-plaintext highlighter-rouge">Dictionary&lt;EntityId, T&gt;</code> on the hot path.</li>
</ul>
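<p>The scope-guard idea behind <code>EpochGuard</code> can be sketched with a C# 11 <code>ref</code> field. Everything below is a hypothetical simplification — the real guard's fields and protocol differ:</p>

```csharp
using System.Threading;

// Stack-only guard: stamps an epoch slot on construction, clears it on Dispose.
// Being a ref struct, it cannot be boxed, captured by a lambda, or stored on
// the heap — the compiler enforces the scope discipline for free.
public ref struct EpochScope
{
    private readonly ref int _slot;
    private bool _active;

    public EpochScope(ref int slot, int epoch)
    {
        _slot = ref slot;
        Volatile.Write(ref _slot, epoch); // stamp: pages in this epoch are protected
        _active = true;
    }

    public void Dispose()
    {
        if (!_active) return;
        Volatile.Write(ref _slot, 0);     // clear: pages become evictable again
        _active = false;
    }
}
```

<p>Usage is a plain <code>using</code> statement — <code>using var scope = new EpochScope(ref slot, currentEpoch);</code> — and the protection window is exactly the lexical scope.</p>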

<p>For short-lived buffers, <code class="language-plaintext highlighter-rouge">stackalloc</code> with a threshold pattern: stack-allocate when the array is small (under 64 elements), fall back to the heap otherwise. Most arrays stay small, so they never touch the allocator.</p>
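<p>The pattern, sketched (the threshold matches the post; the method name is illustrative):</p>

```csharp
using System;

// Small inputs take a stack buffer the GC never sees; large inputs fall back
// to a single heap array. C# allows stackalloc inside a conditional expression
// when the target type is Span<T>.
static int SumIds(ReadOnlySpan<int> source)
{
    Span<int> buffer = source.Length <= 64
        ? stackalloc int[64]          // hot case: zero allocation
        : new int[source.Length];     // rare case: one Gen0 array

    source.CopyTo(buffer);

    int sum = 0;
    foreach (var v in buffer[..source.Length])
        sum += v;
    return sum;
}
```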

<p>For larger long-lived buffers, the Pinned Object Heap: <code class="language-plaintext highlighter-rouge">GC.AllocateArray&lt;byte&gt;(capacity, pinned: true)</code>. Pre-zeroed by the OS, never compacted by the GC, stable pointer for direct access. Typhon’s HashMap uses this for its entire entry array.</p>

<p>For medium reusable buffers, <code class="language-plaintext highlighter-rouge">ArrayPool&lt;T&gt;.Shared</code>. FPI compression rents 9 KB buffers, returns them in a <code class="language-plaintext highlighter-rouge">finally</code> block. Query execution rents stream arrays sized for the common case (8 slots), doubles if needed.</p>
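<p>A sketch of that rent/return discipline (the 9 KB size comes from the FPI example; the work inside is a placeholder, and the sketch assumes the input fits the rental):</p>

```csharp
using System;
using System.Buffers;

// Rents a scratch buffer from the shared pool, uses it, and returns it in a
// finally block — a leaked rental silently shrinks the pool's capacity.
static int ChecksumWithRentedBuffer(ReadOnlySpan<byte> input)
{
    byte[] rented = ArrayPool<byte>.Shared.Rent(9 * 1024);
    try
    {
        // Rent may hand back a larger array than requested; slice to what we use.
        // (Assumes input.Length <= the 9 KB rental.)
        Span<byte> scratch = rented.AsSpan(0, input.Length);
        input.CopyTo(scratch);

        int sum = 0;
        foreach (var b in scratch) sum += b;
        return sum;
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(rented);
    }
}
```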

<p>Four strategies — ref struct for scoped access, stackalloc for small temporaries, POH for large long-lived buffers, ArrayPool for medium reusable buffers. The result: zero hot-path allocations in steady state.</p>

<h2 id="principle-3-reduce-memory-indirections">Principle 3: Reduce Memory Indirections</h2>

<p>Every pointer chase is a potential cache miss. An L3 hit costs ~100 cycles; an access that misses all the way out to DRAM costs ~200+. The goal: minimize the number of hops from “I want this data” to “here’s the data.”</p>

<p>Post #1 showed the flagship example — the <a href="https://nockawa.github.io/blog/why-building-database-engine-in-csharp/">SIMD chunk accessor</a> with its 3-tier lookup (MRU check, Vector256 search, clock-hand eviction). Each tier reduces indirection compared to the next.</p>

<p><strong>Epoch-based page protection</strong> eliminates another class of indirection. The traditional approach: atomic increment on page access, atomic decrement on release. For N page accesses in a transaction, that’s 2N atomic operations — each one a potential cache-line bounce. Typhon uses epoch-based protection instead: one stamp when entering a transaction scope, one clear when exiting. Pages accessed within an active epoch can’t be evicted. Cost: 2 operations per transaction, regardless of how many pages are touched.</p>

<p><strong>Zone maps</strong> eliminate entire clusters of indirection. Each indexed field maintains per-cluster min/max bounds. A range query like <code class="language-plaintext highlighter-rouge">WHERE Level &gt;= 50</code> checks two integers per cluster — if the cluster’s maximum is below 50, skip every entity in it without loading a single component byte. The impact at different selectivities, measured on 100K entities:</p>

<table>
  <thead>
    <tr>
      <th>Selectivity</th>
      <th>Without zone maps</th>
      <th>With zone maps</th>
      <th>Speedup</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>100%</td>
      <td>13.4 ms</td>
      <td>1.3 ms</td>
      <td>10x</td>
    </tr>
    <tr>
      <td>50%</td>
      <td>13.4 ms</td>
      <td>0.65 ms</td>
      <td>21x</td>
    </tr>
    <tr>
      <td>10%</td>
      <td>13.4 ms</td>
      <td>0.16 ms</td>
      <td>84x</td>
    </tr>
    <tr>
      <td>1%</td>
      <td>13.4 ms</td>
      <td>0.05 ms</td>
      <td>268x</td>
    </tr>
  </tbody>
</table>

<p>The float ordering trick makes this work for non-integer types: an IEEE 754 sign-flip converts floats to a representation where integer comparison order equals numeric order, enabling the same two-comparison interval overlap check regardless of field type.</p>
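<p>One common form of that sign-flip, shown as a sketch — Typhon's exact bit manipulation may differ:</p>

```csharp
using System;

// Maps a float to a uint whose unsigned ordering matches the float's numeric
// ordering: negatives get all bits flipped (their raw bit pattern descends as
// the value grows), non-negatives get only the sign bit flipped (lifting them
// above every negative).
static uint OrderableBits(float value)
{
    uint bits = BitConverter.SingleToUInt32Bits(value);
    return (bits & 0x8000_0000u) != 0
        ? ~bits
        : bits ^ 0x8000_0000u;
}
```

<p>With this mapping a zone map stores two <code>uint</code>s per cluster, and the interval-overlap check stays a pair of integer comparisons regardless of the field type.</p>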

<p>At the other end of the scale, division elimination saves cycles on every single chunk lookup:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Field: precomputed at segment creation</span>
<span class="c1">// Replaces expensive division (~20-80 cycles) with multiply+shift (~3-4 cycles)</span>
<span class="k">private</span> <span class="k">readonly</span> <span class="kt">ulong</span> <span class="n">_divMagic</span><span class="p">;</span>

<span class="c1">// Constructor: compute magic multiplier once</span>
<span class="n">_divMagic</span> <span class="p">=</span> <span class="p">(</span><span class="m">0x1</span><span class="n">_0000_0000UL</span> <span class="p">+</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span><span class="n">_otherChunkCount</span> <span class="p">-</span> <span class="m">1</span><span class="p">)</span> <span class="p">/</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span><span class="n">_otherChunkCount</span><span class="p">;</span>

<span class="c1">// Hot path: every chunk lookup uses this instead of idiv</span>
<span class="kt">var</span> <span class="n">pageIndex</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)((</span><span class="n">adjusted</span> <span class="p">*</span> <span class="n">_divMagic</span><span class="p">)</span> <span class="p">&gt;&gt;</span> <span class="m">32</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">offset</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)(</span><span class="n">adjusted</span> <span class="p">-</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)(</span><span class="n">pageIndex</span> <span class="p">*</span> <span class="n">_otherChunkCount</span><span class="p">));</span>
</code></pre></div></div>

<p>Integer division (<code class="language-plaintext highlighter-rouge">idiv</code> on x64) is notoriously slow — 20 to 80 cycles depending on operand size. The magic multiplier replaces it with a multiply and a shift: 3-4 cycles. The precomputation happens once when a segment is created; the benefit repeats on every one of the millions of chunk lookups that follow. Six lines of math, 20x speedup on a hot path. This is a classic systems programming trick that most managed-language developers have never needed — but when your per-entity budget is 2.5 nanoseconds, you need it.</p>

<h2 id="principle-4-let-the-jit-help">Principle 4: Let the JIT Help</h2>

<p>The JIT compiler is your optimization partner, not your enemy. Write code in patterns it can optimize, and it does work for you that you’d have to do manually in C or Rust.</p>

<p><strong>Constrained generics</strong> give you monomorphization. When you write <code class="language-plaintext highlighter-rouge">where TMask : struct, IArchetypeMask&lt;TMask&gt;</code>, the JIT generates a separate native code path for each concrete type. <code class="language-plaintext highlighter-rouge">ArchetypeMask256</code> (four <code class="language-plaintext highlighter-rouge">ulong</code> fields, bitwise operations) gets fully inlined — no vtable, no virtual dispatch. This is the same optimization Rust gets from generics, but opt-in through the <code class="language-plaintext highlighter-rouge">struct</code> constraint.</p>
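<p>A stripped-down sketch of the pattern — the constraint shape matches the post, but the interface body and mask internals are simplified stand-ins:</p>

```csharp
// Self-referencing constraint: each mask type implements the interface over itself.
public interface IArchetypeMask<TSelf> where TSelf : struct, IArchetypeMask<TSelf>
{
    bool Overlaps(in TSelf other);
}

public readonly struct ArchetypeMask256 : IArchetypeMask<ArchetypeMask256>
{
    private readonly ulong _a, _b, _c, _d;

    public ArchetypeMask256(ulong a, ulong b, ulong c, ulong d)
        => (_a, _b, _c, _d) = (a, b, c, d);

    public bool Overlaps(in ArchetypeMask256 o)
        => ((_a & o._a) | (_b & o._b) | (_c & o._c) | (_d & o._d)) != 0;
}

public static class MaskOps
{
    // Because TMask is constrained to struct, the JIT emits a specialized native
    // body per concrete mask type: Overlaps becomes a direct, inlinable call —
    // no boxing, no vtable, no virtual dispatch.
    public static bool AnyMatch<TMask>(in TMask query, in TMask entity)
        where TMask : struct, IArchetypeMask<TMask>
        => query.Overlaps(in entity);
}
```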

<p><strong><code class="language-plaintext highlighter-rouge">sealed</code></strong> enables devirtualization. <code class="language-plaintext highlighter-rouge">DirtyBitmap</code> and <code class="language-plaintext highlighter-rouge">ArchetypeClusterInfo</code> are both on hot paths and both sealed. The JIT knows no subclass can exist, so it converts virtual calls to direct calls and can inline them.</p>

<p><strong><code class="language-plaintext highlighter-rouge">[AggressiveInlining]</code></strong> eliminates call overhead on micro-operations. B+Tree binary search, transaction state validation, every lock acquire/release — the overhead of a method call (save registers, set up stack frame, restore) is 2-5 ns. On a path called millions of times, that compounds.</p>

<p><strong>SoA layout enables auto-vectorization.</strong> When a cluster is fully occupied (all N slots in use), the iteration loop becomes a simple sequential walk over contiguous SoA arrays with no branches. The JIT can auto-vectorize this on AVX2 — processing 8 floats per SIMD instruction. The SoA layout isn’t just about cache locality; it’s about giving the JIT a pattern it can vectorize.</p>
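<p>The shape of that loop, sketched over two hypothetical component arrays:</p>

```csharp
using System;

// A fully occupied cluster's SoA arrays walked sequentially with a branch-free
// body — the pattern the JIT's loop vectorizer can turn into 8-wide AVX2 ops.
// Array names are illustrative; assumes damage has at least health.Length lanes.
static void ApplyDamage(Span<float> health, ReadOnlySpan<float> damage)
{
    for (int i = 0; i < health.Length; i++)
        health[i] -= damage[i];
}
```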

<p>But the most surprising JIT trick is dead-code elimination through <code class="language-plaintext highlighter-rouge">static readonly</code> fields:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// TelemetryConfig.cs — field declarations</span>
<span class="c1">/// &lt;summary&gt;</span>
<span class="c1">/// static readonly fields allow the JIT to eliminate disabled telemetry code paths</span>
<span class="c1">/// entirely. When a readonly field is false, the JIT treats guarded blocks as dead</span>
<span class="c1">/// code and removes them completely in Tier 1 compilation.</span>
<span class="c1">/// &lt;/summary&gt;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">readonly</span> <span class="kt">bool</span> <span class="n">Enabled</span><span class="p">;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">readonly</span> <span class="kt">bool</span> <span class="n">EcsEnabled</span><span class="p">;</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">readonly</span> <span class="kt">bool</span> <span class="n">EcsActive</span><span class="p">;</span>    <span class="c1">// Combined: Enabled &amp;&amp; EcsEnabled</span>

<span class="c1">// Static constructor — computed once at startup</span>
<span class="k">static</span> <span class="nf">TelemetryConfig</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">var</span> <span class="n">section</span> <span class="p">=</span> <span class="n">config</span><span class="p">.</span><span class="nf">GetSection</span><span class="p">(</span><span class="s">"Typhon:Telemetry"</span><span class="p">);</span>
    <span class="n">Enabled</span> <span class="p">=</span> <span class="n">section</span><span class="p">.</span><span class="nf">GetValue</span><span class="p">(</span><span class="s">"Enabled"</span><span class="p">,</span> <span class="k">false</span><span class="p">);</span>
    <span class="n">EcsEnabled</span> <span class="p">=</span> <span class="n">ecsSection</span><span class="p">.</span><span class="nf">GetValue</span><span class="p">(</span><span class="s">"Enabled"</span><span class="p">,</span> <span class="k">false</span><span class="p">);</span>
    <span class="n">EcsActive</span> <span class="p">=</span> <span class="n">Enabled</span> <span class="p">&amp;&amp;</span> <span class="n">EcsEnabled</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// EcsQuery.cs — usage on hot path</span>
<span class="k">if</span> <span class="p">(</span><span class="n">TelemetryConfig</span><span class="p">.</span><span class="n">EcsActive</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">activity</span> <span class="p">=</span> <span class="n">TyphonActivitySource</span><span class="p">.</span><span class="nf">StartActivity</span><span class="p">(</span><span class="s">"ECS.Query.Execute"</span><span class="p">);</span>
    <span class="n">activity</span><span class="p">?.</span><span class="nf">SetTag</span><span class="p">(</span><span class="n">TyphonSpanAttributes</span><span class="p">.</span><span class="n">EcsArchetype</span><span class="p">,</span> <span class="k">typeof</span><span class="p">(</span><span class="n">TArchetype</span><span class="p">).</span><span class="n">Name</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">EcsActive</code> is <code class="language-plaintext highlighter-rouge">false</code>, the JIT doesn’t just short-circuit the branch — it <strong>eliminates the entire <code class="language-plaintext highlighter-rouge">if</code> block</strong> from the generated native code. No branch instruction, no condition check, zero cost. The <code class="language-plaintext highlighter-rouge">static readonly</code> field, initialized in a static constructor, is treated as a constant after Tier 1 JIT compilation. The dead branch and everything inside it vanish.</p>

<p>This gives you zero-cost observability. Full OpenTelemetry tracing when enabled; literally nothing — not even a branch — when disabled. Most C# developers don’t know the JIT does this. It’s worth structuring your telemetry and feature flags around this pattern.</p>

<h2 id="principle-5-design-for-the-hardware">Principle 5: Design for the Hardware</h2>

<p>The CPU manual is a requirements document. Cache-line size, SIMD register width, TLB coverage, memory bandwidth — these aren’t abstract numbers. They drive struct sizing, batch sizes, and allocation strategy.</p>

<p><strong>Cache-line size (64 bytes on x86, 128 bytes on Apple Silicon)</strong> drives <code class="language-plaintext highlighter-rouge">CacheLinePaddedInt</code> sizing, B+Tree node alignment, and SoA array alignment. The ViewDeltaRingBuffer aligns each sub-buffer to 64-byte boundaries so that the hardware prefetcher doesn’t waste bandwidth loading adjacent unrelated data.</p>

<p><strong>SIMD width</strong> determines batch sizes. Typhon’s <code class="language-plaintext highlighter-rouge">SimdPredicateEvaluator</code> uses three-tier CPU dispatch for filtering entities by field values: AVX-512 processes 16 integer comparisons per instruction, AVX2 processes 8, with a scalar fallback for older hardware. The AVX-512 path uses a workaround — .NET doesn’t expose 512-bit gather intrinsics, so it performs two 256-bit AVX2 gathers and combines them into a <code class="language-plaintext highlighter-rouge">Vector512</code> for the comparison step. The JIT emits a native <code class="language-plaintext highlighter-rouge">vpcmpd</code> instruction for the 16-wide comparison. On Zen 4 (which double-pumps 512-bit operations), throughput matches two AVX2 iterations but with half the loop overhead.</p>
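<p>A simplified single-tier version of that comparison step, using the cross-platform <code>Vector256</code> API. The real evaluator adds the gathers, the AVX-512 tier, and the scalar fallback; the method name here is illustrative:</p>

```csharp
using System;
using System.Runtime.Intrinsics;

// Compares 8 int lanes against a threshold in one shot and collapses the
// result to a bitmask: bit i set means lane i matched. Requires exactly
// 8 elements (one AVX2 register's worth); falls back to software emulation
// transparently on hardware without AVX2.
static uint MatchMask(ReadOnlySpan<int> block8, int threshold)
{
    var values  = Vector256.Create(block8);   // load 8 lanes
    var matches = Vector256.GreaterThanOrEqual(values, Vector256.Create(threshold));
    return matches.ExtractMostSignificantBits();
}
```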

<p><strong>Software prefetch</strong> hides memory latency where it matters most. During HashMap resize, speculative prefetch computes the <em>future</em> entry’s position in the resized table and issues <code class="language-plaintext highlighter-rouge">Sse.Prefetch0</code> to start loading that cache line while the current entry is being processed. The JIT translates this to a <code class="language-plaintext highlighter-rouge">prefetcht0</code> instruction — essentially free to issue, and it hides 100+ cycles of latency per entry.</p>

<p><strong>BMI2 instructions</strong> accelerate spatial indexing. Morton key encoding (Z-order curves) uses <code class="language-plaintext highlighter-rouge">Bmi2.ParallelBitDeposit</code> to interleave X/Y coordinates in ~1 cycle. The scalar fallback costs ~10 cycles. Morton ordering places spatially adjacent grid cells at nearby array indices, improving cache locality during neighbor queries.</p>
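<p>The encode step with its fallback, sketched. The masks are the standard even/odd interleave; Typhon's actual key layout may differ:</p>

```csharp
using System.Runtime.Intrinsics.X86;

// Interleaves x into even bit positions and y into odd ones (Z-order / Morton).
// PDEP does each deposit in ~1 cycle; the scalar fallback spreads bits in ~10.
static uint MortonEncode2D(ushort x, ushort y)
{
    if (Bmi2.IsSupported)
        return Bmi2.ParallelBitDeposit(x, 0x5555_5555u)
             | Bmi2.ParallelBitDeposit(y, 0xAAAA_AAAAu);

    // Classic bit-spread: doubles the gap between bits in four halving steps.
    static uint Spread(uint v)
    {
        v = (v | (v << 8)) & 0x00FF_00FFu;
        v = (v | (v << 4)) & 0x0F0F_0F0Fu;
        v = (v | (v << 2)) & 0x3333_3333u;
        v = (v | (v << 1)) & 0x5555_5555u;
        return v;
    }
    return Spread(x) | (Spread(y) << 1);
}
```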

<p><strong>TLB coverage</strong> constrains working set design. Without 2 MB huge pages, x86 L2 TLB covers only 8-12 MB. Every access beyond that risks a 15-20 ns page walk penalty on top of the data access itself. Typhon’s cluster storage keeps 100K entities in ~2.5 MB — comfortably within L2 TLB coverage even without huge pages. For larger datasets, the page cache’s 8 KB pages and sequential access patterns keep the hardware prefetcher effective.</p>

<p><strong>Memory bandwidth (~50 GB/s on Zen 4)</strong> is the ceiling for bulk scans. If your SoA component scan isn’t approaching this number, something is leaving performance on the table — unnecessary indirection, poor alignment, or branches that defeat the prefetcher.</p>

<p>All measurements in this post were taken on an AMD Ryzen 9 7950X with .NET 10, BenchmarkDotNet, release configuration.</p>

<h2 id="the-numbers">The Numbers</h2>

<p>Individual principles are nice. What matters is how they compound. Here’s what the engine actually delivers:</p>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cluster iteration (per entity)</td>
      <td><strong>2.5 ns</strong></td>
    </tr>
    <tr>
      <td>CRUD lifecycle (spawn, read, update, destroy, commit)</td>
      <td><strong>2.95 μs</strong></td>
    </tr>
    <tr>
      <td>Transaction create-read-commit (100 entities)</td>
      <td><strong>3.6 μs</strong></td>
    </tr>
    <tr>
      <td>B+Tree point lookup (10K entries)</td>
      <td><strong>191 ns</strong></td>
    </tr>
    <tr>
      <td>Component read (1 MVCC version)</td>
      <td><strong>703 ns</strong></td>
    </tr>
    <tr>
      <td>Component read (50 MVCC versions)</td>
      <td><strong>720 ns</strong></td>
    </tr>
    <tr>
      <td>Uncontended RW lock acquire</td>
      <td><strong>7.5 ns</strong></td>
    </tr>
    <tr>
      <td>Page cache hit</td>
      <td><strong>5.5 ns</strong></td>
    </tr>
    <tr>
      <td>Chunk accessor MRU hit</td>
      <td><strong>1.1 ns</strong></td>
    </tr>
    <tr>
      <td>Epoch enter/exit</td>
      <td><strong>3.3 ns</strong></td>
    </tr>
    <tr>
      <td>Cascade delete 10K entities</td>
      <td><strong>7.6 μs</strong></td>
    </tr>
  </tbody>
</table>

<p>The version invariance number deserves a callout: reading a component with 50 MVCC revisions costs the same as reading one with a single revision. 703 ns vs 720 ns — within measurement noise. The revision chain design works.</p>

<p>These principles also scale to parallel execution:</p>

<table>
  <thead>
    <tr>
      <th>Workers</th>
      <th>Tick time</th>
      <th>Speedup</th>
      <th>Efficiency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>~37 ms</td>
      <td>1.0x</td>
      <td>100%</td>
    </tr>
    <tr>
      <td>2</td>
      <td>~18 ms</td>
      <td>2.1x</td>
      <td>104%</td>
    </tr>
    <tr>
      <td>4</td>
      <td>~10 ms</td>
      <td>3.8x</td>
      <td>95%</td>
    </tr>
    <tr>
      <td>8</td>
      <td>~5.3 ms</td>
      <td>7.1x</td>
      <td>89%</td>
    </tr>
  </tbody>
</table>

<p>89% parallel efficiency on 8 workers. The 16-worker result (6.7x, 42% efficiency) hits the L3 cache / CCD boundary on the 7950X — a hardware wall, not a software one.</p>

<p>To put these numbers in perspective, here’s the concurrency cost hierarchy that drives Typhon’s design decisions:</p>

<table>
  <thead>
    <tr>
      <th>Level</th>
      <th>Cost</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0: Thread-local</td>
      <td>~2 ns</td>
      <td>TLS counter, local variable</td>
    </tr>
    <tr>
      <td>1: Uncontended atomic</td>
      <td>5-10 ns</td>
      <td>AccessControl read latch</td>
    </tr>
    <tr>
      <td>2: Contended atomic</td>
      <td>20-140 ns</td>
      <td>Multiple writers, same lock</td>
    </tr>
    <tr>
      <td>3: System call</td>
      <td>500-1000 ns</td>
      <td>Timestamp via syscall</td>
    </tr>
    <tr>
      <td>4: Context switch</td>
      <td>~10,000 ns</td>
      <td>Blocking lock, futex wait</td>
    </tr>
    <tr>
      <td>5: Oversubscription</td>
      <td>100,000+ ns</td>
      <td>More threads than cores</td>
    </tr>
  </tbody>
</table>

<p>Each level is roughly 10x more expensive than the previous one. Typhon’s <code class="language-plaintext highlighter-rouge">AdaptiveWaiter</code> (spin → yield → sleep progression) keeps most contention at Level 2, avoiding the 100x jump to Level 4. The cache-line padding from Principle 1 keeps parallel workers from bouncing each other between Level 1 and Level 2. Every design decision maps to staying as low in this hierarchy as possible.</p>
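<p>The progression itself is simple enough to sketch — the thresholds below are hypothetical; the real <code>AdaptiveWaiter</code> tunes them:</p>

```csharp
using System;
using System.Threading;

// Escalating backoff: stay at Level 2 (on-core spinning) as long as possible,
// then yield the timeslice, and only as a last resort pay Level 4's context
// switch via Sleep.
public struct AdaptiveWaiter
{
    private int _attempts;

    public void Wait()
    {
        if (_attempts < 20)
            Thread.SpinWait(1 << Math.Min(_attempts, 10)); // busy-spin, growing
        else if (_attempts < 40)
            Thread.Yield();                                 // let another thread run
        else
            Thread.Sleep(1);                                // OS sleep: Level 4 cost
        _attempts++;
    }

    public void Reset() => _attempts = 0;                   // call after acquiring
}
```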

<h2 id="trade-offs">Trade-offs</h2>

<p><strong>Unsafe is unsafe.</strong> These techniques require <code class="language-plaintext highlighter-rouge">unsafe</code> code — pointer arithmetic, raw memory access, manual layout control. One bug can corrupt the page cache. Roslyn analyzers catch some classes of errors at compile time, but not all. The safety net has holes.</p>

<p><strong>Complexity budget.</strong> Magic multipliers, SIMD evaluators, epoch-based protection, zone maps — each one is simple in isolation. The combination creates a codebase that demands systems-level understanding to navigate. There’s no shortcut around understanding the hardware.</p>

<p><strong>Not all of this transfers.</strong> Most .NET applications don’t need microsecond latency. Using <code class="language-plaintext highlighter-rouge">CacheLinePaddedInt</code> in a web API is premature optimization. These techniques are for when you’ve measured, profiled, and confirmed that memory access patterns are your bottleneck — not before.</p>
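<p>For readers who haven’t met the pattern, this is roughly what a cache-line-padded counter in the spirit of <code class="language-plaintext highlighter-rouge">CacheLinePaddedInt</code> can look like — a sketch under a 64-byte-line assumption, not Typhon’s actual type:</p>

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch of a cache-line-padded counter. Forcing the struct to 64 bytes means
// two adjacent counters in an array never share a cache line, so parallel
// writers stop invalidating each other's caches (false sharing).
[StructLayout(LayoutKind.Explicit, Size = 64)]
public struct CacheLinePaddedInt
{
    [FieldOffset(0)] public int Value;   // payload; bytes 4..63 are padding
}

public static class Demo
{
    public static void Main()
    {
        var counters = new CacheLinePaddedInt[4];   // each element on its own line
        counters[1].Value = 42;
        Console.WriteLine(Marshal.SizeOf<CacheLinePaddedInt>()); // 64
    }
}
```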

<h2 id="whats-next">What’s Next</h2>

<p>The next post dives into concurrency: “Deadlock-Free by Construction: How Typhon Eliminates Deadlocks Instead of Detecting Them.” Most databases treat deadlocks as a runtime problem — detect the cycle, abort a transaction, retry. Typhon makes deadlocks structurally impossible through a three-pillar mathematical argument. No detection, no timeouts, no retries.</p>]]></content><author><name>Loïc Baumann</name></author><category term="csharp" /><category term="dotnet" /><category term="performance" /><category term="database" /><category term="typhon" /><summary type="html"><![CDATA[Five design principles that let a C#/.NET database engine hit sub-microsecond transaction latency — from cache-line-aware structs to JIT-eliminated dead code.]]></summary></entry><entry><title type="html">What Game Engines Know About Data That Databases Forgot</title><link href="https://nockawa.github.io/blog/what-game-engines-know-about-data/" rel="alternate" type="text/html" title="What Game Engines Know About Data That Databases Forgot" /><published>2026-04-05T00:00:00+00:00</published><updated>2026-04-05T00:00:00+00:00</updated><id>https://nockawa.github.io/blog/what-game-engines-know-about-data</id><content type="html" xml:base="https://nockawa.github.io/blog/what-game-engines-know-about-data/"><![CDATA[<blockquote>
  <p>💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.<br />
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.</p>
</blockquote>

<blockquote>
  <p><strong>Series: A Database That Thinks Like a Game Engine</strong></p>
  <ol>
    <li><a href="https://nockawa.github.io/blog/why-building-database-engine-in-csharp/">Why I’m Building a Database Engine in C#</a></li>
    <li><strong>What Game Engines Know About Data That Databases Forgot</strong> <em>(this post)</em></li>
    <li><a href="https://nockawa.github.io/blog/microsecond-latency-managed-language/">Microsecond Latency in a Managed Language</a></li>
    <li>Deadlock-Free by Construction <em>(coming soon)</em></li>
  </ol>
</blockquote>

<blockquote>
  <p><img class="emoji" src="https://github.githubassets.com/images/icons/emoji/octocat.png" alt="Octocat" height="20" width="20" /> <a href="https://github.com/nockawa/Typhon">GitHub repo</a>  •  📬 <a href="https://nockawa.github.io/feed.xml">Subscribe via RSS</a></p>
</blockquote>

<p>Game servers sit at an uncomfortable intersection. They need the raw throughput of a game engine — tens of thousands of entities updated every tick. But they also need what databases provide: transactions that don’t corrupt state, queries that don’t scan everything, and durability that survives crashes.</p>

<p>Today, game server teams pick one side and hack around the other. An <a href="https://en.wikipedia.org/wiki/Entity_component_system">Entity-Component-System</a> framework for speed, with manual serialization to a database for persistence. Or a database for safety, with an impedance mismatch every time they touch game state.</p>

<p>Typhon draws from both traditions. It’s a database engine that stores data the way game engines do — and provides the guarantees that game servers need. Here’s why those two worlds aren’t as far apart as they look.</p>

<h2 id="two-fields-one-problem">Two Fields, One Problem</h2>

<p>ECS architecture evolved in game engines. Relational databases evolved in enterprise software. They never talked to each other. But look at what they built:</p>

<table>
  <thead>
    <tr>
      <th>ECS Concept</th>
      <th>Database Concept</th>
      <th>Shared Principle</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Archetype</td>
      <td>Table</td>
      <td>Homogeneous, fixed-schema storage</td>
    </tr>
    <tr>
      <td>Component</td>
      <td>Column</td>
      <td>Typed, blittable, bulk-iterable data</td>
    </tr>
    <tr>
      <td>Entity</td>
      <td>Row</td>
      <td>Identity with dynamic composition</td>
    </tr>
    <tr>
      <td>System</td>
      <td>Query</td>
      <td>Process all records matching a signature</td>
    </tr>
    <tr>
      <td>Frame Budget (16ms)</td>
      <td>Latency SLA</td>
      <td>Hard real-time deadline</td>
    </tr>
  </tbody>
</table>

<p>An ECS “archetype” is a table. A “component” is a column. A “system” is a query. The vocabulary is different; the underlying structure is the same. Two fields, separated by decades and industry boundaries, converged on structurally identical solutions because they were solving the same fundamental problem: managing structured data under performance constraints.</p>

<p>This convergence is why a synthesis is possible at all. It’s not an accident — it’s driven by the same physics. Data must be laid out for the CPU cache. Access patterns must be predictable. Latency budgets are real.</p>

<h2 id="what-we-learned-from-game-engines">What We Learned From Game Engines</h2>

<p>ECS taught the database world something important about how data should be stored. Three lessons Typhon draws directly from game engine architecture:</p>

<p><strong>Cache locality by default.</strong> In a traditional row store, reading all player positions means loading entire rows — names, inventories, health, everything. Most of those bytes are wasted. In ECS, components are stored per type: all positions contiguous, all health values contiguous. Reading 10,000 positions is a linear memory scan where every byte is useful.</p>

<p>This matters more than most developers realize. An L1 cache hit costs roughly 1 nanosecond. A DRAM miss costs 60-70 ns — a <strong>65x penalty</strong>. When your database layout forces cache misses, no amount of algorithmic cleverness can save you.</p>
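<p>A minimal illustration of the layout difference — the types here are hypothetical, not Typhon’s schema:</p>

```csharp
using System;

// Why columnar storage wins: summing positions from a row layout drags every
// other field through the cache; a per-component (columnar) layout touches
// only the bytes the query needs.
public struct PlayerRow          // "row store": everything per entity
{
    public long Id;
    public float X, Y, Z;
    public int Health;
    public long InventoryRef;    // ...dozens more fields in a real schema
}

public static class Demo
{
    public static float SumXRows(PlayerRow[] rows)
    {
        float sum = 0;
        foreach (var r in rows) sum += r.X;   // loads whole 32+ byte rows
        return sum;
    }

    public static float SumXColumns(float[] xs)
    {
        float sum = 0;
        foreach (var x in xs) sum += x;       // linear scan, every byte useful
        return sum;
    }

    public static void Main()
    {
        var rows = new PlayerRow[3];
        var xs = new float[3];
        for (int i = 0; i < 3; i++) { rows[i].X = i; xs[i] = i; }
        // Same result — but the columnar scan moved a fraction of the bytes.
        Console.WriteLine(SumXRows(rows) == SumXColumns(xs)); // True
    }
}
```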

<p><a href="https://nockawa.github.io/assets/posts/typhon-ecs-vs-rowstore.svg" target="_blank" style="display:block; text-align:center">
  <img src="https://nockawa.github.io/assets/posts/typhon-ecs-vs-rowstore.png" alt="Storage layout comparison — traditional row store vs Typhon's component store" style="max-width:420px; width:100%" />
</a></p>

<p><strong>Zero-copy is the default, not the optimization.</strong> In a traditional database, reading a record means deserializing from a storage page into a language-level object. In ECS, a component is already in memory in its final layout — you just hand back a pointer. Typhon preserves this: components are blittable <code class="language-plaintext highlighter-rouge">unmanaged</code> structs read directly from pinned memory pages. No serialization, no managed heap allocation, no GC involvement.</p>
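<p>A sketch of the zero-copy idea using <code class="language-plaintext highlighter-rouge">MemoryMarshal</code> from the BCL — the <code class="language-plaintext highlighter-rouge">Position</code> type and the page layout here are illustrative, not Typhon’s actual format:</p>

```csharp
using System;
using System.Runtime.InteropServices;

// Zero-copy read sketch: reinterpret a raw page buffer as typed components
// without deserializing anything.
public struct Position { public float X, Y, Z; }

public static class Demo
{
    public static void Main()
    {
        byte[] page = new byte[4096];   // stand-in for a pinned storage page
        Span<Position> positions = MemoryMarshal.Cast<byte, Position>(page);

        // Writes go straight into the page bytes — no intermediate object.
        positions[0] = new Position { X = 1f, Y = 2f, Z = 3f };

        // Reading back is just reinterpreting the same bytes — no copy, no GC.
        float y = MemoryMarshal.Cast<byte, Position>(page)[0].Y;
        Console.WriteLine(y); // 2
    }
}
```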

<p><strong>Entity as pure identity.</strong> In ECS, an entity is just an ID — a 64-bit number with no inherent structure. All data lives externally in component tables. This is the opposite of ORM thinking where the object <em>is</em> the entity. Typhon inherits this: <code class="language-plaintext highlighter-rouge">EntityId</code> is a lightweight value type, all state lives in typed component storage. This separation is what makes the rest of the architecture possible — per-component versioning, per-component storage modes, independent indexes per component type.</p>

<h2 id="what-we-learned-from-databases">What We Learned From Databases</h2>

<p>Traditional databases solved problems that ECS never had to face. Four capabilities Typhon draws from database architecture:</p>

<p><strong>ACID transactions with per-component MVCC.</strong> Game engines typically have no isolation. Two systems modifying the same entity in the same tick is a race condition — and in a single-process game, you control the execution order so you can manage it. On a game server with concurrent player sessions, you can’t.</p>

<p>Databases solved this decades ago with MVCC: snapshot isolation where readers never block writers, with conflict detection at commit time. Typhon brings this in — but with a twist. Traditional databases version entire rows. Typhon versions each component independently. An entity’s <code class="language-plaintext highlighter-rouge">PositionComponent</code> and <code class="language-plaintext highlighter-rouge">InventoryComponent</code> each maintain their own revision chain: a circular buffer of 12-byte revision entries, each stamped with a 48-bit transaction sequence number.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Simplified: finding the visible revision for a snapshot</span>
<span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">rev</span> <span class="k">in</span> <span class="nf">WalkRevisions</span><span class="p">(</span><span class="n">entityId</span><span class="p">))</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">rev</span><span class="p">.</span><span class="n">IsolationFlag</span> <span class="p">&amp;&amp;</span> <span class="n">rev</span><span class="p">.</span><span class="n">TSN</span> <span class="p">!=</span> <span class="n">myTransactionTSN</span><span class="p">)</span>
        <span class="k">continue</span><span class="p">;</span>  <span class="c1">// Skip uncommitted revisions from other transactions</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">rev</span><span class="p">.</span><span class="n">TSN</span> <span class="p">&lt;=</span> <span class="n">snapshotTSN</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">rev</span><span class="p">;</span> <span class="c1">// Most recent revision visible to our snapshot</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This means a transaction reading a player’s position sees a consistent frozen point-in-time across <em>all</em> component types simultaneously — without locking any of them. Writers never block readers. And because revisions are per-component rather than per-entity, updating a player’s position doesn’t create a new version of their inventory. Less data copied, less garbage to collect.</p>
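<p>Here’s a self-contained, runnable version of that visibility rule. The field names mirror the snippet above, but the surrounding types are illustrative:</p>

```csharp
using System;
using System.Collections.Generic;

// Runnable version of the visibility walk: revisions come newest-first,
// uncommitted writes from other transactions are skipped, and the newest
// revision at or below the reader's snapshot TSN wins.
public struct Revision
{
    public ulong TSN;            // transaction sequence number that wrote it
    public bool IsolationFlag;   // true while the writing transaction is uncommitted
    public int Payload;
}

public static class Mvcc
{
    public static Revision? FindVisible(IEnumerable<Revision> revisions,
                                        ulong snapshotTSN, ulong myTSN)
    {
        foreach (var rev in revisions)   // must be ordered newest-first
        {
            if (rev.IsolationFlag && rev.TSN != myTSN)
                continue;                // uncommitted write from someone else
            if (rev.TSN <= snapshotTSN)
                return rev;              // newest revision our snapshot can see
        }
        return null;                     // entity didn't exist at snapshot time
    }

    public static void Main()
    {
        var chain = new[]
        {
            new Revision { TSN = 9, IsolationFlag = true,  Payload = 3 }, // in-flight writer
            new Revision { TSN = 7, IsolationFlag = false, Payload = 2 },
            new Revision { TSN = 4, IsolationFlag = false, Payload = 1 },
        };
        // A reader at snapshot TSN 8 skips the uncommitted TSN-9 write, sees TSN 7.
        Console.WriteLine(FindVisible(chain, snapshotTSN: 8, myTSN: 5)?.Payload); // 2
    }
}
```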

<p><strong>Indexed selective access.</strong> This is the big one. ECS systems iterate <em>everything</em> matching a component signature every tick. That works brilliantly for particle simulations where every particle needs updating. But game servers often don’t need all of them:</p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Total Entities</th>
      <th>Processed Per Tick</th>
      <th>Useful Work</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Battle royale (per-client relevancy)</td>
      <td>50,000 actors</td>
      <td>500–2,000</td>
      <td><strong>1–4%</strong></td>
    </tr>
    <tr>
      <td>MMO area of interest</td>
      <td>100,000</td>
      <td>200–1,000</td>
      <td><strong>0.2–1%</strong></td>
    </tr>
    <tr>
      <td>Physics (awake bodies only)</td>
      <td>All rigidbodies</td>
      <td>Awake subset</td>
      <td><strong>5–20%</strong></td>
    </tr>
  </tbody>
</table>

<p>When you’re processing 1–4% of your entities, scanning everything is doing 25–100x more work than necessary. ECS frameworks recognized this — Unity DOTS added enableable components, Flecs added <code class="language-plaintext highlighter-rouge">group_by</code>, Unreal MassEntity added LOD tiers. These are all clever workarounds for the same underlying issue: ECS was designed for bulk iteration, not selective access.</p>

<p>Databases solved this with indexes. B+Trees for value-based lookups, spatial trees for area-of-interest queries, selectivity estimation to decide when to scan versus when to seek. Typhon brings these into the component storage model — not as bolted-on workarounds, but as first-class citizens.</p>

<p><strong>Spatial partitioning.</strong> For spatial access patterns specifically — the #1 selective access need in game servers — Typhon integrates a two-layer spatial index directly into the component storage:</p>

<ul>
  <li><strong>Layer 1: Sparse hash map</strong> — maps coarse grid cells to entity counts. O(1) rejection of empty regions before the tree is even touched.</li>
  <li><strong>Layer 2: Page-backed R-Tree</strong> — AABB, radius, ray, frustum, and kNN queries. Same optimistic-lock-coupling (OLC) latches and structure-of-arrays (SOA) node layout as the B+Trees.</li>
</ul>

<p>Both layers run inside the same transactional model as everything else. No external spatial hash bolted on alongside your ECS. No cache locality destroyed by chasing pointers into a separate data structure.</p>
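<p>A sketch of what a Layer-1 coarse grid can look like — the cell size and key packing here are illustrative choices, not Typhon’s:</p>

```csharp
using System;
using System.Collections.Generic;

// Layer-1 sketch: a sparse hash map from coarse grid cell to entity count,
// used to reject empty regions in O(1) before the R-Tree is touched.
public sealed class SparseGridLayer
{
    private const float CellSize = 64f;                   // illustrative cell size
    private readonly Dictionary<long, int> _counts = new();

    private static long Key(float x, float y)
    {
        int cx = (int)MathF.Floor(x / CellSize);
        int cy = (int)MathF.Floor(y / CellSize);
        return ((long)cx << 32) | (uint)cy;               // pack both coords in one key
    }

    public void Add(float x, float y)
    {
        long k = Key(x, y);
        _counts[k] = _counts.TryGetValue(k, out int c) ? c + 1 : 1;
    }

    // O(1) pre-check: is there anything at all in this cell?
    public bool MightContain(float x, float y)
        => _counts.TryGetValue(Key(x, y), out int c) && c > 0;
}

public static class Demo
{
    public static void Main()
    {
        var grid = new SparseGridLayer();
        grid.Add(10f, 10f);
        Console.WriteLine(grid.MightContain(10f, 10f));   // True  → go ask the R-Tree
        Console.WriteLine(grid.MightContain(500f, 500f)); // False → skip the tree entirely
    }
}
```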

<p><strong>Durability.</strong> A game client can afford to lose state on crash — reload the level. A game server cannot. Player inventories, economy state, progression data — all must survive process restarts and crashes. WAL-based crash recovery, checkpointing, configurable fsync — these are database fundamentals that game servers need but ECS frameworks never provided.</p>

<p><strong>Query planning.</strong> When you have both indexes and sequential storage, someone needs to decide which access path to use. Databases have decades of work on cost-based query optimization — selectivity estimation, histogram statistics, index selection. Typhon brings a query planner into the ECS world: given a predicate on a component field, it automatically chooses full scan or B+Tree seek based on estimated selectivity.</p>
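<p>A toy sketch of that decision, under a uniform-distribution assumption. The 10% crossover threshold and the names are illustrative, not Typhon’s:</p>

```csharp
using System;

// Planner sketch: estimate the fraction of rows a predicate keeps, and seek
// through the index only when that fraction is small.
public enum AccessPath { FullScan, IndexSeek }

public static class Planner
{
    // Selectivity of `field >= lowerBound` over values known to span [min, max],
    // assuming a uniform distribution (real planners use histograms).
    public static double EstimateSelectivity(double lowerBound, double min, double max)
    {
        if (lowerBound <= min) return 1.0;
        if (lowerBound >= max) return 0.0;
        return (max - lowerBound) / (max - min);
    }

    public static AccessPath Choose(double selectivity)
        => selectivity < 0.10 ? AccessPath.IndexSeek   // few matches: B+Tree seek wins
                              : AccessPath.FullScan;   // many matches: linear scan wins

    public static void Main()
    {
        // Predicate: Level >= 95, with levels known to span 1..100.
        Console.WriteLine(Planner.Choose(EstimateSelectivity(95, 1, 100))); // IndexSeek
        Console.WriteLine(Planner.Choose(EstimateSelectivity(10, 1, 100))); // FullScan
    }
}
```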

<h2 id="purpose-built-for-game-servers">Purpose-Built for Game Servers</h2>

<p>Typhon doesn’t glue ECS and database concepts together with duct tape. It synthesizes them into a single model designed for game server workloads.</p>

<p>A component in Typhon is simultaneously an ECS component and a database schema:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">Component</span><span class="p">]</span>
<span class="k">public</span> <span class="k">struct</span> <span class="nc">PlayerComponent</span>
<span class="p">{</span>
    <span class="p">[</span><span class="n">Field</span><span class="p">]</span>
    <span class="k">public</span> <span class="n">String64</span> <span class="n">Name</span><span class="p">;</span>

    <span class="p">[</span><span class="n">Field</span><span class="p">]</span>
    <span class="p">[</span><span class="n">Index</span><span class="p">]</span>                    <span class="c1">// B+Tree for fast lookups</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">AccountId</span><span class="p">;</span>

    <span class="p">[</span><span class="n">Field</span><span class="p">]</span>
    <span class="k">public</span> <span class="kt">float</span> <span class="n">Experience</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Blittable, unmanaged, fixed-size, stored contiguously per type — that’s the ECS side. Typed fields with automatic B+Tree indexes on marked fields — that’s the database side. One declaration, both worlds.</p>

<p>The query API makes the synthesis concrete:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">topPlayers</span> <span class="p">=</span> <span class="n">db</span><span class="p">.</span><span class="n">Query</span><span class="p">&lt;</span><span class="n">Player</span><span class="p">&gt;()</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">p</span> <span class="p">=&gt;</span> <span class="n">p</span><span class="p">.</span><span class="n">Level</span> <span class="p">&gt;=</span> <span class="m">50</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">OrderByDescending</span><span class="p">(</span><span class="n">p</span> <span class="p">=&gt;</span> <span class="n">p</span><span class="p">.</span><span class="n">Level</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">Take</span><span class="p">(</span><span class="m">10</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">ExecuteOrdered</span><span class="p">(</span><span class="n">tx</span><span class="p">);</span>
</code></pre></div></div>

<p>ECS-style typed component access. Database-style predicate filtering with automatic index selection. Inside a transaction with snapshot isolation. The query planner chooses scan vs B+Tree based on selectivity — the developer doesn’t have to.</p>

<p><a href="https://nockawa.github.io/assets/posts/typhon-query-flow.svg" target="_blank" style="display:block; text-align:center">
  <img src="https://nockawa.github.io/assets/posts/typhon-query-flow.png" alt="How a typed query flows through Typhon — from lambda expression to archetype mask filtering, selectivity estimation, and component reads" style="max-width:360px; width:100%" />
</a></p>

<p>And because game servers have different durability needs for different operations, Typhon lets you choose per unit of work:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Position ticks: game-engine speed, batched durability</span>
<span class="k">using</span> <span class="nn">var</span> <span class="n">uow</span> <span class="p">=</span> <span class="n">dbe</span><span class="p">.</span><span class="nf">CreateUnitOfWork</span><span class="p">(</span><span class="n">DurabilityMode</span><span class="p">.</span><span class="n">Deferred</span><span class="p">);</span>

<span class="c1">// Legendary item drop: database safety, immediate fsync</span>
<span class="k">using</span> <span class="nn">var</span> <span class="n">uow</span> <span class="p">=</span> <span class="n">dbe</span><span class="p">.</span><span class="nf">CreateUnitOfWork</span><span class="p">(</span><span class="n">DurabilityMode</span><span class="p">.</span><span class="n">Immediate</span><span class="p">);</span>
</code></pre></div></div>

<p>Same engine, same API. <code class="language-plaintext highlighter-rouge">Deferred</code> mode gives game-engine-class commit latency for position updates that can be re-simulated on crash. <code class="language-plaintext highlighter-rouge">Immediate</code> mode gives database-class guarantees for a transaction that grants a rare item worth real money. The game server decides per operation — not globally.</p>

<h3 id="storage-modes-not-all-data-is-equal">Storage Modes: Not All Data Is Equal</h3>

<p>A game server doesn’t treat all data the same. Player positions change 60 times per second and can be re-simulated on crash. Inventory mutations are rare but must never be lost. AI runtime state — current targets, threat scores, pathfinding waypoints — is recomputed every tick and worthless after a restart.</p>

<p>Traditional databases treat all data identically. Traditional ECS keeps everything in memory with no durability distinction. Typhon lets you choose per component type:</p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>MVCC History</th>
      <th>Persisted</th>
      <th>Change Tracking</th>
      <th>Best For</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Versioned</strong></td>
      <td>Full revision chains</td>
      <td>Yes (WAL + checkpoint)</td>
      <td>Via MVCC</td>
      <td>Inventory, economy, progression</td>
    </tr>
    <tr>
      <td><strong>SingleVersion</strong></td>
      <td>Current state only</td>
      <td>Yes (WAL + checkpoint)</td>
      <td>DirtyBitmap</td>
      <td>Positions, health, frequently-updated state</td>
    </tr>
    <tr>
      <td><strong>Transient</strong></td>
      <td>Current state only</td>
      <td>No</td>
      <td>DirtyBitmap</td>
      <td>AI blackboard, threat scores, pathfinding scratch</td>
    </tr>
  </tbody>
</table>

<p>SingleVersion components skip the revision chain overhead entirely — no circular buffer, no per-write allocation. They track changes through a DirtyBitmap instead: one bit per entity, flipped on write, scanned on tick fence. This is how game engines track what changed, and it’s the right model for data that updates every tick.</p>
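<p>A minimal sketch of the DirtyBitmap idea — one bit per entity slot packed into <code class="language-plaintext highlighter-rouge">ulong</code> words. The word layout here is the obvious one, not necessarily Typhon’s:</p>

```csharp
using System;
using System.Numerics;

// DirtyBitmap sketch: MarkDirty flips one bit on write; DrainDirty visits every
// dirty slot at the tick fence and clears the map. Clean 64-entity spans cost
// a single ulong compare, so scanning 100k entities is cheap.
public sealed class DirtyBitmap
{
    private readonly ulong[] _words;

    public DirtyBitmap(int capacity) => _words = new ulong[(capacity + 63) / 64];

    public void MarkDirty(int slot) => _words[slot >> 6] |= 1UL << (slot & 63);

    public int DrainDirty(Action<int> visit)
    {
        int count = 0;
        for (int w = 0; w < _words.Length; w++)
        {
            ulong bits = _words[w];
            while (bits != 0)
            {
                int bit = BitOperations.TrailingZeroCount(bits);
                visit(w * 64 + bit);
                bits &= bits - 1;          // clear lowest set bit
                count++;
            }
            _words[w] = 0;
        }
        return count;
    }
}

public static class Demo
{
    public static void Main()
    {
        var bitmap = new DirtyBitmap(100_000);
        bitmap.MarkDirty(3);
        bitmap.MarkDirty(70_001);
        Console.WriteLine(bitmap.DrainDirty(slot => Console.WriteLine(slot))); // 2
    }
}
```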

<p>Versioned components get full MVCC with snapshot isolation — readers see consistent historical state, writers don’t block readers, conflicts are detected at commit time. This is how databases protect critical data, and it’s the right model for things that must never be corrupted.</p>

<p>Transient components never touch disk at all — no WAL, no checkpoint, no recovery. Pure in-memory storage with the same query and indexing API as everything else. AI blackboard data that’s recomputed every tick has no business paying persistence overhead.</p>

<p>The same engine, the same transaction API, but the storage layer does exactly what each component type needs. This is what “purpose-built for game servers” means in practice.</p>

<h3 id="views-the-bridge-between-ecs-systems-and-database-queries">Views: The Bridge Between ECS Systems and Database Queries</h3>

<p>In ECS, a “system” runs every tick, processing all matching entities. In a database, a “materialized view” maintains a cached result set and refreshes it incrementally. Typhon’s Views are both:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="nn">var</span> <span class="n">view</span> <span class="p">=</span> <span class="n">db</span><span class="p">.</span><span class="n">Query</span><span class="p">&lt;</span><span class="n">ItemData</span><span class="p">&gt;()</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">i</span> <span class="p">=&gt;</span> <span class="n">i</span><span class="p">.</span><span class="n">Rarity</span> <span class="p">&gt;=</span> <span class="m">3</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">ToView</span><span class="p">();</span>

<span class="c1">// Game loop</span>
<span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">using</span> <span class="nn">var</span> <span class="n">tx</span> <span class="p">=</span> <span class="n">dbe</span><span class="p">.</span><span class="nf">CreateQuickTransaction</span><span class="p">();</span>
    <span class="n">view</span><span class="p">.</span><span class="nf">Refresh</span><span class="p">(</span><span class="n">tx</span><span class="p">);</span>  <span class="c1">// Microsecond incremental refresh</span>

    <span class="c1">// React to changes — like an ECS system, but only for what changed</span>
    <span class="kt">var</span> <span class="n">delta</span> <span class="p">=</span> <span class="n">view</span><span class="p">.</span><span class="nf">GetDelta</span><span class="p">();</span>
    <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">pk</span> <span class="k">in</span> <span class="n">delta</span><span class="p">.</span><span class="n">Added</span><span class="p">)</span>   <span class="nf">SpawnVisual</span><span class="p">(</span><span class="n">pk</span><span class="p">);</span>
    <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">pk</span> <span class="k">in</span> <span class="n">delta</span><span class="p">.</span><span class="n">Removed</span><span class="p">)</span> <span class="nf">DespawnVisual</span><span class="p">(</span><span class="n">pk</span><span class="p">);</span>
    <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">pk</span> <span class="k">in</span> <span class="n">delta</span><span class="p">.</span><span class="n">Modified</span><span class="p">)</span> <span class="nf">UpdateVisual</span><span class="p">(</span><span class="n">pk</span><span class="p">);</span>
    <span class="n">view</span><span class="p">.</span><span class="nf">ClearDelta</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The initial <code class="language-plaintext highlighter-rouge">ToView()</code> runs a full query. After that, <code class="language-plaintext highlighter-rouge">Refresh()</code> drains a lock-free ring buffer of changes pushed by the commit path — only entities whose indexed fields actually changed are re-evaluated. If 100,000 entities match your view but only 12 changed since last refresh, you do 12 evaluations, not 100,000.</p>

<p>This is the iterate-everything problem solved from the database side: don’t re-scan, track deltas.</p>
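<p>To sketch the mechanism: the commit path pushes changed keys into a ring buffer, and the view drains it instead of re-scanning. This is a minimal single-producer/single-consumer ring for illustration — far simpler than Typhon’s actual lock-free buffer:</p>

```csharp
using System;
using System.Threading;

// Delta-pipeline sketch: writers push changed keys, the view drains them.
public sealed class ChangeRing
{
    private readonly long[] _slots;
    private long _head, _tail;     // head: next write, tail: next read

    public ChangeRing(int capacityPow2) => _slots = new long[capacityPow2];

    public bool TryPush(long key)  // called by the committing writer
    {
        long head = Volatile.Read(ref _head);
        if (head - Volatile.Read(ref _tail) == _slots.Length) return false; // full
        _slots[head & (_slots.Length - 1)] = key;
        Volatile.Write(ref _head, head + 1);
        return true;
    }

    public int Drain(Action<long> onChanged)  // called at refresh time
    {
        int n = 0;
        long tail = Volatile.Read(ref _tail);
        while (tail < Volatile.Read(ref _head))
        {
            onChanged(_slots[tail & (_slots.Length - 1)]);
            tail++; n++;
        }
        Volatile.Write(ref _tail, tail);
        return n;
    }
}

public static class Demo
{
    public static void Main()
    {
        var ring = new ChangeRing(capacityPow2: 1024);
        ring.TryPush(42);           // commit path: entity 42's indexed field changed
        ring.TryPush(99);
        int reevaluated = ring.Drain(key => { /* re-check the predicate for this key only */ });
        Console.WriteLine(reevaluated); // 2 — not a 100,000-entity re-scan
    }
}
```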

<h2 id="trade-offs">Trade-offs</h2>

<p>Specializing for game servers means giving things up.</p>

<p><strong>Blittable components only.</strong> No <code class="language-plaintext highlighter-rouge">string</code>, no object references, no variable-length arrays inside components. Text uses fixed-size types like <code class="language-plaintext highlighter-rouge">String64</code>. This is the price of zero-copy reads and cache-friendly storage — and it’s a constraint game developers are already familiar with from ECS frameworks.</p>

<p><strong>Entity-centric relationships, not SQL JOINs.</strong> Typhon supports navigation links, 1:N and N:M relationships — but they follow entity references, closer to a graph database than a traditional SQL one. This matches how game servers naturally think about data (an entity <em>has</em> components, a guild <em>contains</em> members), but if your mental model is <code class="language-plaintext highlighter-rouge">SELECT ... FROM a JOIN b ON a.x = b.y</code>, it’s a different paradigm.</p>

<p><strong>Schema in code, not SQL.</strong> Components are C# structs with attributes, not DDL statements. Natural for game developers, unfamiliar territory for database administrators. If your team thinks in SQL, this is a paradigm shift.</p>

<h2 id="whats-next">What’s Next</h2>

<p>In the next post, I’ll go deeper into the performance philosophy that makes all of this actually fast — data-oriented design, cache-line awareness, and zero-allocation hot paths. The principles that let a managed language hit microsecond-latency transactions.</p>]]></content><author><name>Loïc Baumann</name></author><category term="csharp" /><category term="database" /><category term="ecs" /><category term="gamedev" /><category term="typhon" /><summary type="html"><![CDATA[Game engines and databases solved the same problem independently. Typhon draws from both traditions to build a database engine purpose-built for game servers.]]></summary></entry><entry><title type="html">Why I’m Building a Database Engine in C#</title><link href="https://nockawa.github.io/blog/why-building-database-engine-in-csharp/" rel="alternate" type="text/html" title="Why I’m Building a Database Engine in C#" /><published>2026-03-28T00:00:00+00:00</published><updated>2026-03-28T00:00:00+00:00</updated><id>https://nockawa.github.io/blog/why-building-database-engine-in-csharp</id><content type="html" xml:base="https://nockawa.github.io/blog/why-building-database-engine-in-csharp/"><![CDATA[<blockquote>
  <p>💡Typhon is an embedded, persistent, ACID database engine written in .NET that speaks the native language of game servers and real-time simulations: entities, components, and systems.<br />
It delivers full transactional safety with MVCC snapshot isolation at sub-microsecond latency, powered by cache-line-aware storage, zero-copy access, and configurable durability.</p>
</blockquote>

<blockquote>
  <p><strong>Series: A Database That Thinks Like a Game Engine</strong></p>
  <ol>
    <li><strong>Why I’m Building a Database Engine in C#</strong> <em>(this post)</em></li>
    <li><a href="https://nockawa.github.io/blog/what-game-engines-know-about-data/">What Game Engines Know About Data That Databases Forgot</a></li>
    <li><a href="https://nockawa.github.io/blog/microsecond-latency-managed-language/">Microsecond Latency in a Managed Language</a></li>
    <li>Deadlock-Free by Construction <em>(coming soon)</em></li>
  </ol>
</blockquote>

<blockquote>
  <p><img class="emoji" src="https://github.githubassets.com/images/icons/emoji/octocat.png" alt="Octocat" height="20" width="20" /> <a href="https://github.com/nockawa/Typhon">GitHub repo</a>  •  📬 <a href="https://nockawa.github.io/feed.xml">Subscribe via RSS</a></p>
</blockquote>

<p>When I tell people I’m building an ACID database engine in C#, the first reaction is always the same: <em>“But what about GC pauses?”</em></p>

<p>It’s a fair question. Nobody builds high-performance database engines in .NET. The assumption is that you need C, C++, or Rust for this class of software — that managed languages are fundamentally disqualified from the microsecond-latency club.</p>

<p>After 30 years of building real-time 3D engines and systems software, I chose C# anyway. The project is called <strong>Typhon</strong>: an embedded ACID database engine targeting 1–2 microsecond transaction commits. And the reasons behind that choice might change how you think about what C# can do.</p>

<h2 id="the-case-against-c-lets-steel-man-it">The Case Against C# (Let’s Steel-Man It)</h2>

<p>Before I make my case, let me honestly lay out every argument against choosing C# for this. These are real concerns, not strawmen.</p>

<p><strong>The GC is non-deterministic.</strong> It can pause all your threads whenever it wants. For a database engine that promises microsecond latency, a 10ms Gen2 collection is catastrophic — that’s 10,000x your latency budget.</p>

<p><strong>You don’t control memory layout.</strong> The managed heap decides where objects live. The GC can move them around during compaction. You can’t guarantee that your B+Tree nodes sit on cache-line boundaries, or that your page cache buffer won’t get relocated mid-transaction.</p>

<p><strong>JIT warmup is real.</strong> The first call to any method pays the compilation cost. In a database engine, the first transaction after startup shouldn’t be 100x slower than the steady state.</p>

<p><strong>Virtual dispatch and bounds checking add overhead.</strong> Every array access has a hidden bounds check. Every interface call goes through a vtable. In a hot loop processing millions of entities, these nanoseconds compound.</p>

<p>These are all legitimate problems. I won’t pretend they aren’t. But here’s what most people miss: <strong>modern C# has answers for every single one of them.</strong></p>

<h2 id="what-most-people-dont-know-about-c">What Most People Don’t Know About C#</h2>

<p>The C# that most developers know — classes, garbage collection, LINQ — is only half the language. There’s a whole other side that the .NET runtime team has been quietly building for a decade, and it looks nothing like what you’d expect.</p>

<p><strong><code class="language-plaintext highlighter-rouge">unsafe</code> gives you C-level control.</strong> Raw pointers, pointer arithmetic, <code class="language-plaintext highlighter-rouge">stackalloc</code> for stack buffers, <code class="language-plaintext highlighter-rouge">fixed</code>-size arrays — the JIT generates the same <code class="language-plaintext highlighter-rouge">mov</code>/<code class="language-plaintext highlighter-rouge">cmp</code>/<code class="language-plaintext highlighter-rouge">jne</code> instructions you’d get from C. Not “close to C.” The same instructions.</p>
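<p>A small taste of that toolbox — a stack buffer via <code class="language-plaintext highlighter-rouge">stackalloc</code> and C-style pointer arithmetic over it (requires compiling with unsafe blocks enabled):</p>

```csharp
using System;

public static class Demo
{
    public static unsafe int SumFirst(int count)
    {
        int* buf = stackalloc int[16];            // stack buffer: no heap, no GC
        for (int i = 0; i < 16; i++) buf[i] = i;

        int sum = 0;
        for (int* p = buf; p < buf + count; p++)  // pointer arithmetic, C-style
            sum += *p;
        return sum;
    }

    public static void Main() => Console.WriteLine(SumFirst(4)); // 0+1+2+3 = 6
}
```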

<p><strong><code class="language-plaintext highlighter-rouge">GCHandle.Alloc(Pinned)</code> makes the GC irrelevant where it matters.</strong> You can pin byte arrays so the GC never moves them. Typhon’s entire page cache is pinned memory — the GC doesn’t touch it, doesn’t scan it, doesn’t move it. It’s just raw bytes at a fixed address, exactly like <code class="language-plaintext highlighter-rouge">malloc</code> in C.</p>
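<p>The pattern, as a minimal sketch (illustrative, not Typhon's actual code): allocate a managed buffer, pin it, and treat it as raw memory at a stable address.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Runtime.InteropServices;

static unsafe void PinnedPageDemo()
{
    var buffer = new byte[8192];                              // one 8 KB page
    var handle = GCHandle.Alloc(buffer, GCHandleType.Pinned); // GC won't move it
    try
    {
        byte* page = (byte*)handle.AddrOfPinnedObject();      // stable address
        page[0] = 0x42;                                       // raw write; buffer[0] is now 0x42
    }
    finally
    {
        handle.Free();                                        // unpin when done
    }
}
</code></pre></div></div>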

<p><strong><a href="https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/ref-struct"><code class="language-plaintext highlighter-rouge">ref struct</code></a> eliminates heap allocations on hot paths.</strong> A <code class="language-plaintext highlighter-rouge">ref struct</code> can never escape to the heap. It lives on the stack, dies when the scope ends, and the GC never knows it existed. Typhon’s entity accessor (<code class="language-plaintext highlighter-rouge">EntityRef</code>) is a 96-byte <code class="language-plaintext highlighter-rouge">ref struct</code> — zero allocation, zero GC pressure.</p>

<p><strong>Constrained generics give you true monomorphization.</strong> When you write <code class="language-plaintext highlighter-rouge">where T : unmanaged</code>, the JIT generates a separate native code path for each type parameter. <code class="language-plaintext highlighter-rouge">sizeof(T)</code> becomes a constant. Dead branches get eliminated. It’s the same optimization Rust gets from generics — not a runtime dispatch, but compile-time specialization.</p>

<p><strong>Hardware intrinsics are first-class.</strong> <a href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics"><code class="language-plaintext highlighter-rouge">System.Runtime.Intrinsics</code></a> gives you <code class="language-plaintext highlighter-rouge">Vector256</code>, <code class="language-plaintext highlighter-rouge">Sse42.Crc32</code>, <code class="language-plaintext highlighter-rouge">BitOperations.TrailingZeroCount</code> — the same SIMD instructions available in C/C++, with the same performance, and runtime feature detection so you can fall back gracefully.</p>

<p><strong><code class="language-plaintext highlighter-rouge">[StructLayout(Explicit)]</code> gives you exact memory layout.</strong> Field offsets, padding, size — you control every byte. Cache-line alignment, false-sharing prevention, bit-packing — it’s all there.</p>
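<p>A hedged sketch of that control (an invented type, not one of Typhon's): a header padded to exactly one 64-byte cache line, with every field offset pinned down.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System.Runtime.InteropServices;

// Illustrative only: a 64-byte, cache-line-sized header with explicit offsets.
[StructLayout(LayoutKind.Explicit, Size = 64)]
public struct PageSlotHeader
{
    [FieldOffset(0)]  public long   PageIndex;
    [FieldOffset(8)]  public uint   Version;
    [FieldOffset(12)] public ushort Flags;
    // Offsets 14..63 are deliberate padding: the next header starts on a
    // fresh cache line, so two headers updated by different threads never
    // share a line (false-sharing prevention).
}
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Size = 64</code> guarantees the struct occupies exactly one cache line, padding included, regardless of what the runtime would have chosen on its own.</p>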

<p>This isn’t “C# trying to be C.” It’s C# providing a genuine systems programming layer on top of a best-in-class managed ecosystem.</p>

<h2 id="what-typhon-actually-looks-like">What Typhon Actually Looks Like</h2>

<p><a href="https://nockawa.github.io/assets/posts/typhon-blog-architecture.svg" target="_blank" style="display:block">
  <img src="https://nockawa.github.io/assets/posts/typhon-blog-architecture.png" alt="Typhon Engine architecture — five layers from API to Concurrency, with components discussed in this post highlighted with ★" style="width:100%" />
</a></p>

<p>Theory is nice; now let’s look at real code.</p>


<h3 id="hardware-accelerated-wal-checksums">Hardware-accelerated WAL checksums</h3>

<p>Every page written to the Write-Ahead Log needs a CRC32C checksum. Here’s what that looks like in C# — calling CPU instructions by name:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">private</span> <span class="k">static</span> <span class="kt">uint</span> <span class="nf">ComputePartial</span><span class="p">(</span><span class="kt">uint</span> <span class="n">crc</span><span class="p">,</span> <span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">Sse42</span><span class="p">.</span><span class="n">X64</span><span class="p">.</span><span class="n">IsSupported</span><span class="p">)</span>   <span class="k">return</span> <span class="nf">ComputeSse42X64</span><span class="p">(</span><span class="n">crc</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">Sse42</span><span class="p">.</span><span class="n">IsSupported</span><span class="p">)</span>       <span class="k">return</span> <span class="nf">ComputeSse42X32</span><span class="p">(</span><span class="n">crc</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ArmCrc32</span><span class="p">.</span><span class="n">Arm64</span><span class="p">.</span><span class="n">IsSupported</span><span class="p">)</span> <span class="k">return</span> <span class="nf">ComputeArm64</span><span class="p">(</span><span class="n">crc</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
    <span class="k">return</span> <span class="nf">ComputeSoftware</span><span class="p">(</span><span class="n">crc</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">private</span> <span class="k">static</span> <span class="kt">uint</span> <span class="nf">ComputeSse42X64</span><span class="p">(</span><span class="kt">uint</span> <span class="n">crc</span><span class="p">,</span> <span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ulong</span> <span class="n">crc64</span> <span class="p">=</span> <span class="n">crc</span><span class="p">;</span>
    <span class="k">ref</span> <span class="kt">byte</span> <span class="n">ptr</span> <span class="p">=</span> <span class="k">ref</span> <span class="n">MemoryMarshal</span><span class="p">.</span><span class="nf">GetReference</span><span class="p">(</span><span class="n">data</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">offset</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">aligned</span> <span class="p">=</span> <span class="n">data</span><span class="p">.</span><span class="n">Length</span> <span class="p">&amp;</span> <span class="p">~</span><span class="m">7</span><span class="p">;</span>

    <span class="k">while</span> <span class="p">(</span><span class="n">offset</span> <span class="p">&lt;</span> <span class="n">aligned</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">crc64</span> <span class="p">=</span> <span class="n">Sse42</span><span class="p">.</span><span class="n">X64</span><span class="p">.</span><span class="nf">Crc32</span><span class="p">(</span><span class="n">crc64</span><span class="p">,</span> <span class="n">Unsafe</span><span class="p">.</span><span class="n">ReadUnaligned</span><span class="p">&lt;</span><span class="kt">ulong</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">Unsafe</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="k">ref</span> <span class="n">ptr</span><span class="p">,</span> <span class="n">offset</span><span class="p">)));</span>
        <span class="n">offset</span> <span class="p">+=</span> <span class="m">8</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kt">uint</span> <span class="n">crc32</span> <span class="p">=</span> <span class="p">(</span><span class="kt">uint</span><span class="p">)</span><span class="n">crc64</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">offset</span> <span class="p">&lt;</span> <span class="n">data</span><span class="p">.</span><span class="n">Length</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">crc32</span> <span class="p">=</span> <span class="n">Sse42</span><span class="p">.</span><span class="nf">Crc32</span><span class="p">(</span><span class="n">crc32</span><span class="p">,</span> <span class="n">Unsafe</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="k">ref</span> <span class="n">ptr</span><span class="p">,</span> <span class="n">offset</span><span class="p">));</span>
        <span class="n">offset</span><span class="p">++;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">crc32</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Sse42.X64.Crc32()</code> compiles to a single x86 <code class="language-plaintext highlighter-rouge">crc32</code> instruction. The runtime detects the CPU capabilities, the JIT eliminates the dead branches, and what executes is the same code a C programmer would write — but with automatic fallback on platforms without SSE4.2. Result: <strong>~1.3 µs per 8 KB page</strong>.</p>

<h3 id="the-simd-chunk-accessor">The SIMD chunk accessor</h3>

<p>This is Typhon’s page cache hot path — a 16-slot cache that finds your data in one of three tiers:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// === ULTRA FAST PATH: MRU check ===</span>
<span class="kt">var</span> <span class="n">mru</span> <span class="p">=</span> <span class="n">_mruSlot</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">_pageIndices</span><span class="p">[</span><span class="n">mru</span><span class="p">]</span> <span class="p">==</span> <span class="n">pageIndex</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">var</span> <span class="n">headerOffset</span> <span class="p">=</span> <span class="n">pageIndex</span> <span class="p">==</span> <span class="m">0</span> <span class="p">?</span> <span class="n">_rootHeaderOffset</span> <span class="p">:</span> <span class="n">_otherHeaderOffset</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">byte</span><span class="p">*)</span><span class="n">_baseAddresses</span><span class="p">[</span><span class="n">mru</span><span class="p">]</span> <span class="p">+</span> <span class="n">headerOffset</span> <span class="p">+</span> <span class="n">offset</span> <span class="p">*</span> <span class="n">_stride</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// === FAST PATH: SIMD search through all 16 cached slots ===</span>
<span class="k">fixed</span> <span class="p">(</span><span class="kt">int</span><span class="p">*</span> <span class="n">indices</span> <span class="p">=</span> <span class="n">_pageIndices</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">var</span> <span class="n">target</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="n">pageIndex</span><span class="p">);</span>

    <span class="kt">var</span> <span class="n">v0</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Load</span><span class="p">(</span><span class="n">indices</span><span class="p">);</span>
    <span class="kt">var</span> <span class="n">mask0</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Equals</span><span class="p">(</span><span class="n">v0</span><span class="p">,</span> <span class="n">target</span><span class="p">).</span><span class="nf">ExtractMostSignificantBits</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">mask0</span> <span class="p">!=</span> <span class="m">0</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">slot</span> <span class="p">=</span> <span class="n">BitOperations</span><span class="p">.</span><span class="nf">TrailingZeroCount</span><span class="p">(</span><span class="n">mask0</span><span class="p">);</span>
        <span class="k">return</span> <span class="nf">GetFromSlot</span><span class="p">(</span><span class="n">slot</span><span class="p">,</span> <span class="n">pageIndex</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">dirty</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="kt">var</span> <span class="n">v1</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Load</span><span class="p">(</span><span class="n">indices</span> <span class="p">+</span> <span class="m">8</span><span class="p">);</span>
    <span class="kt">var</span> <span class="n">mask1</span> <span class="p">=</span> <span class="n">Vector256</span><span class="p">.</span><span class="nf">Equals</span><span class="p">(</span><span class="n">v1</span><span class="p">,</span> <span class="n">target</span><span class="p">).</span><span class="nf">ExtractMostSignificantBits</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">mask1</span> <span class="p">!=</span> <span class="m">0</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">slot</span> <span class="p">=</span> <span class="m">8</span> <span class="p">+</span> <span class="n">BitOperations</span><span class="p">.</span><span class="nf">TrailingZeroCount</span><span class="p">(</span><span class="n">mask1</span><span class="p">);</span>
        <span class="k">return</span> <span class="nf">GetFromSlot</span><span class="p">(</span><span class="n">slot</span><span class="p">,</span> <span class="n">pageIndex</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">dirty</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">_pageIndices</code> array is a <code class="language-plaintext highlighter-rouge">fixed int[16]</code> — 64 bytes, one cache line, packed for SIMD. One <code class="language-plaintext highlighter-rouge">Vector256.Equals</code> compares 8 page indices in a single instruction. The MRU fast path handles the common case (repeated access to the same page) with a single branch — branch predictor friendly, near-zero cost.</p>

<h3 id="zero-copy-entity-reads">Zero-copy entity reads</h3>

<p><code class="language-plaintext highlighter-rouge">EntityRef</code> is a <code class="language-plaintext highlighter-rouge">ref struct</code> — stack-only, 96 bytes, with an inline fixed array caching component locations:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">unsafe</span> <span class="k">ref</span> <span class="k">struct</span> <span class="nc">EntityRef</span>
<span class="p">{</span>
    <span class="k">internal</span> <span class="k">readonly</span> <span class="n">EntityId</span> <span class="n">_id</span><span class="p">;</span>
    <span class="k">internal</span> <span class="k">readonly</span> <span class="n">ArchetypeMetadata</span> <span class="n">_archetype</span><span class="p">;</span>
    <span class="k">internal</span> <span class="k">readonly</span> <span class="n">ArchetypeEngineState</span> <span class="n">_engineState</span><span class="p">;</span>
    <span class="k">internal</span> <span class="k">readonly</span> <span class="n">Transaction</span> <span class="n">_tx</span><span class="p">;</span>
    <span class="k">internal</span> <span class="kt">ushort</span> <span class="n">_enabledBits</span><span class="p">;</span>
    <span class="k">internal</span> <span class="k">readonly</span> <span class="kt">bool</span> <span class="n">_writable</span><span class="p">;</span>
    <span class="k">private</span> <span class="k">fixed</span> <span class="kt">int</span> <span class="n">_locations</span><span class="p">[</span><span class="m">16</span><span class="p">];</span>  <span class="c1">// inline component chunk IDs</span>

    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">AggressiveInlining</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">ref</span> <span class="k">readonly</span> <span class="n">T</span> <span class="n">Read</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">Comp</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">comp</span><span class="p">)</span> <span class="k">where</span> <span class="n">T</span> <span class="p">:</span> <span class="n">unmanaged</span>
    <span class="p">{</span>
        <span class="kt">byte</span> <span class="n">slot</span> <span class="p">=</span> <span class="n">_archetype</span><span class="p">.</span><span class="nf">GetSlot</span><span class="p">(</span><span class="n">comp</span><span class="p">.</span><span class="n">_componentTypeId</span><span class="p">);</span>
        <span class="kt">int</span> <span class="n">chunkId</span> <span class="p">=</span> <span class="n">_locations</span><span class="p">[</span><span class="n">slot</span><span class="p">];</span>
        <span class="kt">var</span> <span class="n">table</span> <span class="p">=</span> <span class="n">_engineState</span><span class="p">.</span><span class="n">SlotToComponentTable</span><span class="p">[</span><span class="n">slot</span><span class="p">];</span>
        <span class="k">return</span> <span class="k">ref</span> <span class="n">_tx</span><span class="p">.</span><span class="n">ReadEcsComponentData</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">table</span><span class="p">,</span> <span class="n">chunkId</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That <code class="language-plaintext highlighter-rouge">Read&lt;T&gt;</code> call goes from method call → slot lookup → chunk ID → page cache → pointer arithmetic → <code class="language-plaintext highlighter-rouge">ref readonly T</code> pointing directly into a pinned memory page. Zero copies. Zero allocations. Zero GC involvement. The <code class="language-plaintext highlighter-rouge">where T : unmanaged</code> constraint means the JIT knows the exact layout — it compiles to pointer arithmetic, nothing more.</p>

<h3 id="jit-specialized-hash-functions">JIT-specialized hash functions</h3>

<p>Even the hash functions exploit the JIT. Since <code class="language-plaintext highlighter-rouge">sizeof(TKey)</code> is a compile-time constant for constrained generics, the dead branches vanish:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">AggressiveInlining</span><span class="p">)]</span>
<span class="k">internal</span> <span class="k">static</span> <span class="kt">uint</span> <span class="n">ComputeHash</span><span class="p">&lt;</span><span class="n">TKey</span><span class="p">&gt;(</span><span class="n">TKey</span> <span class="n">key</span><span class="p">)</span> <span class="k">where</span> <span class="n">TKey</span> <span class="p">:</span> <span class="n">unmanaged</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">TKey</span><span class="p">)</span> <span class="p">==</span> <span class="m">4</span><span class="p">)</span> <span class="k">return</span> <span class="nf">FastHash32</span><span class="p">(</span><span class="n">Unsafe</span><span class="p">.</span><span class="n">As</span><span class="p">&lt;</span><span class="n">TKey</span><span class="p">,</span> <span class="kt">uint</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">key</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">TKey</span><span class="p">)</span> <span class="p">==</span> <span class="m">8</span><span class="p">)</span> <span class="k">return</span> <span class="nf">XxHash32_8Bytes</span><span class="p">(</span><span class="n">Unsafe</span><span class="p">.</span><span class="n">As</span><span class="p">&lt;</span><span class="n">TKey</span><span class="p">,</span> <span class="kt">long</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">key</span><span class="p">));</span>
    <span class="k">return</span> <span class="nf">XxHash32_Bytes</span><span class="p">((</span><span class="kt">byte</span><span class="p">*)</span><span class="n">Unsafe</span><span class="p">.</span><span class="nf">AsPointer</span><span class="p">(</span><span class="k">ref</span> <span class="n">key</span><span class="p">),</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">TKey</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When you call <code class="language-plaintext highlighter-rouge">ComputeHash&lt;int&gt;(42)</code>, the JIT generates <em>just</em> the 4-byte path. The other two branches are completely eliminated. This is real monomorphization, not runtime dispatch.</p>

<h2 id="the-productivity-argument">The Productivity Argument</h2>

<p>A database engine is more than its hot path. Around the core engine sits a large shell of infrastructure: configuration management, structured logging, telemetry, dependency injection, testing, benchmarking.</p>

<p>In C or Rust, you’d build much of this yourself or stitch together crates/libraries of varying quality. In .NET, this is production-grade and free: <code class="language-plaintext highlighter-rouge">ILogger</code> and <a href="https://opentelemetry.io/docs/languages/net/">OpenTelemetry</a> for observability, <a href="https://github.com/dotnet/BenchmarkDotNet">BenchmarkDotNet</a> for rigorous micro-benchmarks, NUnit for testing, <code class="language-plaintext highlighter-rouge">IConfiguration</code> for settings. All well-documented, all interoperable, all maintained by Microsoft or battle-tested OSS communities.</p>

<p>For a solo developer building a database engine, this is a genuine competitive advantage. I spend my time on concurrency primitives and page cache eviction, not on reinventing a logging framework.</p>

<h2 id="its-the-memory-layout-not-the-language">It’s the Memory Layout, Not the Language</h2>

<p>Here’s the insight that years of real-time 3D engines taught me: <strong>the bottleneck in a database engine is memory access patterns, not instruction throughput.</strong></p>

<p>A cache miss to DRAM on a Ryzen 7950X costs 61–73 nanoseconds. That’s ~250 CPU cycles doing <em>nothing</em>, waiting for data. A CAS operation hitting L1 costs 1.4 nanoseconds. The ratio is <strong>50:1</strong>.</p>

<p>No amount of “zero-cost abstractions” in your language can save you if your data structures cause cache misses. Conversely, if your data layout is cache-friendly — contiguous, aligned, predictable access patterns — the language barely matters. C# with <code class="language-plaintext highlighter-rouge">unsafe</code> generates identical machine code to C on hot paths. The JIT is that good.</p>

<p>What matters is:</p>
<ul>
  <li><strong>Cache-line awareness</strong>: Typhon’s B+Tree nodes are 128 bytes — two cache lines. The stride prefetcher on Zen4 covers the second line automatically. This alone cut insert latency by <strong>53%</strong> and lookup latency by <strong>30%</strong> versus 64-byte nodes.</li>
  <li><strong>Data-oriented design</strong>: Structure of Arrays over Array of Structures. SIMD-friendly layouts. Blittable types only.</li>
  <li><strong>Minimizing indirections</strong>: Every pointer chase is a potential cache miss. The SIMD chunk accessor’s MRU hit avoids the chase entirely.</li>
</ul>
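<p>The data-oriented point is easiest to see in code. A hedged sketch with invented types (not Typhon's): the same entity data laid out as Array-of-Structures versus Structure-of-Arrays.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// AoS: fields of each entity are interleaved; scanning one field drags
// every other field through the cache with it.
struct EntityAoS { public float X, Y, Z; public int Health; }

// SoA: each field is contiguous; a scan over Health streams linearly
// through memory and is trivially SIMD-friendly.
sealed class EntitiesSoA
{
    public float[] X      = new float[1024];
    public float[] Y      = new float[1024];
    public float[] Z      = new float[1024];
    public int[]   Health = new int[1024];

    public int CountAlive()
    {
        int alive = 0;
        foreach (var h in Health)   // linear, prefetcher-friendly access
            if (h &gt; 0) alive++;
        return alive;
    }
}
</code></pre></div></div>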

<p>The language you write in matters far less than the memory layout you design.</p>

<h2 id="the-numbers">The Numbers</h2>

<p>All measurements on a Ryzen 9 7950X, .NET 10.0, BenchmarkDotNet, release configuration.</p>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Latency</th>
      <th>Throughput</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CRUD lifecycle MVCC (spawn, read, update, destroy, commit)</td>
      <td><strong>1.2 µs</strong></td>
      <td>830K ops/sec</td>
    </tr>
    <tr>
      <td>90 reads/10 updates workload (100 ops per tx, MVCC)</td>
      <td><strong>22 µs</strong></td>
      <td>~4.5M entity-ops/sec</td>
    </tr>
    <tr>
      <td>B+Tree lookup (hit)</td>
      <td><strong>267 ns</strong></td>
      <td>3.7M ops/sec</td>
    </tr>
    <tr>
      <td>B+Tree sequential scan (per key)</td>
      <td><strong>2.1 ns</strong></td>
      <td>479M keys/sec</td>
    </tr>
    <tr>
      <td>Uncontended lock acquire</td>
      <td><strong>7.8 ns</strong></td>
      <td>128M ops/sec</td>
    </tr>
    <tr>
      <td>Page cache hit</td>
      <td><strong>5.3 ns</strong></td>
      <td>—</td>
    </tr>
  </tbody>
</table>

<p>Context: an uncontended CAS on Zen4 costs 1.4 ns. A DRAM round-trip costs 61–73 ns. Typhon’s lock acquire (7.8 ns) is about 5 CAS operations — tight, considering it handles shared/exclusive arbitration with waiter tracking. The 267 ns B+Tree lookup implies 6–7 memory accesses, which matches a tree traversal through L2/L3 cache.</p>

<p>These are early alpha numbers. There’s room to improve. But they validate the core thesis: <strong>C# is not the bottleneck.</strong></p>

<h2 id="trade-offs">Trade-offs</h2>

<p>No choice is without cost. Here’s what I’d tell someone considering the same path.</p>

<p><strong>Memory safety is on you.</strong> In <code class="language-plaintext highlighter-rouge">unsafe</code> blocks, you can corrupt memory, dereference bad pointers, overflow buffers — the compiler won’t save you. <a href="https://learn.microsoft.com/en-us/dotnet/api/system.span-1"><code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code></a> is a slightly slower but totally safe alternative.</p>
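<p>For contrast, here's the same trivial fill written both ways (a sketch, not Typhon code): the pointer version has no safety net, while the <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code> version keeps its bounds checks and throws instead of corrupting memory.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Unsafe: no bounds check, and nothing stops p[i] from running off the buffer.
unsafe static void FillUnsafe(byte* p, int length, byte value)
{
    for (int i = 0; i &lt; length; i++)
        p[i] = value;
}

// Safe: every access is bounds-checked; an out-of-range index throws
// IndexOutOfRangeException instead of scribbling over adjacent memory.
static void FillSafe(Span&lt;byte&gt; buffer, byte value)
{
    for (int i = 0; i &lt; buffer.Length; i++)
        buffer[i] = value;
}
</code></pre></div></div>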

<p><strong>The GC hasn’t been a problem — but it could be.</strong> By pinning the page cache and using <code class="language-plaintext highlighter-rouge">ref struct</code> on hot paths, Gen2 collections are rare and cheap. But I won’t pretend this is guaranteed. A workload that allocates heavily in managed code between transactions could still see pauses. The answer is discipline: <strong>don’t allocate on hot paths</strong>. The language lets you — it just doesn’t force you.</p>

<p><strong>“But Rust would give you compile-time safety.”</strong> True — the borrow checker catches ownership and lifetime bugs that <code class="language-plaintext highlighter-rouge">unsafe</code> C# can’t. But C# has a trick Rust doesn’t: <strong><a href="https://learn.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/tutorials/how-to-write-csharp-analyzer-code-fix">Roslyn analyzers</a></strong>. I wrote a custom analyzer suite (TYPHON001–007) that enforces domain-specific safety rules as compiler errors:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">[NoCopy]</code> attribute + analyzer: performance-critical structs like <code class="language-plaintext highlighter-rouge">ChunkAccessor</code> <strong>cannot be passed by value</strong> — the compiler errors if you forget <code class="language-plaintext highlighter-rouge">ref</code>. This is the same guarantee Rust’s borrow checker gives for move semantics, but scoped to the types that actually matter.</li>
  <li>Ownership tracking: if you create a <code class="language-plaintext highlighter-rouge">ChunkAccessor</code> or <code class="language-plaintext highlighter-rouge">Transaction</code> and don’t dispose it, that’s a <strong>compiler error</strong> — not a runtime leak. The analyzer tracks ownership transfers through assignments, returns, and <code class="language-plaintext highlighter-rouge">ref</code>/<code class="language-plaintext highlighter-rouge">out</code> parameters; annotating a method with <code class="language-plaintext highlighter-rouge">[return: TransfersOwnership]</code> expresses the transfer explicitly so the analyzer can act on it.</li>
  <li>Disposal completeness: if your type holds a critical disposable field and your <code class="language-plaintext highlighter-rouge">Dispose()</code> method misses it or has an early return that skips it — compiler error.</li>
</ul>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// This is a compile-time error in Typhon — TYPHON001</span>
<span class="k">void</span> <span class="nf">Process</span><span class="p">(</span><span class="n">ChunkAccessor</span> <span class="n">accessor</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span>  <span class="c1">// ✗ Error: must be passed by ref</span>

<span class="k">void</span> <span class="nf">Process</span><span class="p">(</span><span class="k">ref</span> <span class="n">ChunkAccessor</span> <span class="n">accessor</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span>  <span class="c1">// ✓ OK</span>
</code></pre></div></div>

<p>You don’t get Rust’s safety for free in C#. But you can <strong>build the exact subset you need</strong> as compiler errors, tailored to your domain. And unlike Rust’s borrow checker, these rules carry domain context in the diagnostics: “causes page cache deadlock” is more actionable than “value moved here.”</p>

<p>Rust’s ecosystem for the surrounding infrastructure (logging, DI, configuration, testing) is also less mature than .NET’s, and as a solo developer, my velocity matters. I chose the language where I ship faster.</p>

<p><strong>JIT warmup is real but manageable.</strong> The first few transactions after cold start are slower. For an embedded engine (no separate server process), this is acceptable — the host application typically has its own warmup. For a server database, you’d want tiered compilation or AOT.</p>

<h2 id="whats-next">What’s Next</h2>

<p>In the next post, I’ll explain why an ACID database engine borrows its storage architecture from game engines — specifically the Entity-Component-System pattern. Game engines and databases are solving the same fundamental problem: managing structured data with extreme performance constraints. They just evolved completely different solutions.</p>]]></content><author><name>Loïc Baumann</name></author><category term="csharp" /><category term="dotnet" /><category term="database" /><category term="performance" /><category term="typhon" /><summary type="html"><![CDATA[Everyone says you need C, C++, or Rust for a high-performance database engine. I chose C# — here's why that's not as crazy as it sounds.]]></summary></entry><entry><title type="html">Introduction of working with struct</title><link href="https://nockawa.github.io/introduction-of-working-with-struct/" rel="alternate" type="text/html" title="Introduction of working with struct" /><published>2018-04-04T18:24:33+00:00</published><updated>2025-07-18T01:00:00+00:00</updated><id>https://nockawa.github.io/introduction-of-working-with-struct</id><content type="html" xml:base="https://nockawa.github.io/introduction-of-working-with-struct/"><![CDATA[<h3 id="introduction">Introduction</h3>

<p>Before C# 7.2 and .NET Core 2.1, you could improve .NET performance only with a good dose of conscious effort, relying on code that was not necessarily nice to look at (and certainly not maintainable). Microsoft made several improvements to ensure you can design &amp; write faster code without giving up good practices.</p>

<h3 id="struct-struct-and-more-struct">Struct, struct and more struct!</h3>

<p>It is important to get rid of this reflex of choosing the <code class="language-plaintext highlighter-rouge">class</code> keyword every time you design a new type.</p>

<p>Ask yourself whether object-oriented programming is really necessary, or whether another, more data-driven paradigm would serve you better.</p>

<p>Using <code class="language-plaintext highlighter-rouge">struct</code> has a game-changing advantage: <strong>you don’t allocate directly on the heap, so you’re not stressing the GC</strong>.</p>

<p>You can design a memory-friendly layout for your type, avoiding the many memory indirections that increase the chance of cache misses!</p>

<p>Before C# 7.2, relying on <code class="language-plaintext highlighter-rouge">struct</code> was not necessarily a performance win: each time you passed or returned a <code class="language-plaintext highlighter-rouge">struct</code>-based object, <strong>a copy was made</strong>. On the stack, yes, but a copy is still a copy: it takes time!</p>

<p>It is now possible to pass and return <code class="language-plaintext highlighter-rouge">struct</code>-based objects by reference to the initial object, avoiding an unnecessary and costly copy.</p>

<p>Two language keywords, <code class="language-plaintext highlighter-rouge">ref</code> and <code class="language-plaintext highlighter-rouge">in</code>, enable many new patterns to speed things up!</p>
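<p>As a quick taste (the <code class="language-plaintext highlighter-rouge">Big</code> type and method names below are made up for illustration), here is a minimal sketch of the difference between passing a struct by value and by read-only reference:</p>

```csharp
using System;

public struct Big
{
    // 32 bytes of payload: copying this on every call adds up
    public double A, B, C, D;
}

public static class Demo
{
    // Passed by value: the whole 32-byte struct is copied at each call
    public static double SumByValue(Big b) => b.A + b.B + b.C + b.D;

    // Passed by read-only reference (C# 7.2 'in'): no copy, no mutation allowed
    public static double SumByIn(in Big b) => b.A + b.B + b.C + b.D;

    public static void Main()
    {
        var big = new Big { A = 1, B = 2, C = 3, D = 4 };
        Console.WriteLine(SumByValue(big)); // 10
        Console.WriteLine(SumByIn(in big)); // 10, without the copy
    }
}
```

<p>Both calls compute the same result; the second one simply avoids copying the argument.</p>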

<p>Relying on <code class="language-plaintext highlighter-rouge">struct</code> also enables a linear memory layout for your data, making things far more CPU-cache friendly.</p>

<p>Let’s take an example:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">A</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">val1</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">val2</span><span class="p">;</span>
<span class="p">}</span>
 
<span class="k">public</span> <span class="k">class</span> <span class="nc">B</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">float</span> <span class="n">f1</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">float</span> <span class="n">f2</span><span class="p">;</span>
 
    <span class="k">public</span> <span class="n">A</span> <span class="n">a1</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">A</span><span class="p">();</span>   <span class="c1">// Point to another object: another memory location</span>
    <span class="k">public</span> <span class="n">A</span> <span class="n">a2</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">A</span><span class="p">();</span>   <span class="c1">// Same here</span>
<span class="p">}</span>

<span class="c1">// Allocate an array of 256 pointers to 256 distinct instances of B</span>
<span class="kt">var</span> <span class="n">data</span> <span class="p">=</span> <span class="k">new</span> <span class="n">B</span><span class="p">[</span><span class="m">256</span><span class="p">];</span>

</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">data</code> is one object allocated on the heap (GC); once each slot is filled with a <code class="language-plaintext highlighter-rouge">new B()</code>, it references 256 instances of <code class="language-plaintext highlighter-rouge">B</code>, each also allocated on the heap. Each instance of <code class="language-plaintext highlighter-rouge">B</code> in turn references two instances of <code class="language-plaintext highlighter-rouge">A</code>, also on the heap.</p>

<p>So we have a total of 1 + 256 + 2*256 objects allocated on the heap: 769 objects, each located somewhere in memory, all eventually garbage collected when no longer needed.</p>

<p>Things to note:</p>

<ol>
  <li>You stress the GC. That can be fine if these objects live a long time, close to static. But if you’re in high-frequency code and you allocate <code class="language-plaintext highlighter-rouge">data</code> hundreds or thousands of times per second, it will hurt performance.</li>
  <li>Let’s pretend you want to access all fields (direct and indirect) for <code class="language-plaintext highlighter-rouge">data[0]</code> and <code class="language-plaintext highlighter-rouge">data[1]</code>. You will have to fetch 7 separate memory locations (the <code class="language-plaintext highlighter-rouge">data</code> array, <code class="language-plaintext highlighter-rouge">data[0]</code>, <code class="language-plaintext highlighter-rouge">data[0].a1</code>, <code class="language-plaintext highlighter-rouge">data[0].a2</code>, <code class="language-plaintext highlighter-rouge">data[1]</code>, <code class="language-plaintext highlighter-rouge">data[1].a1</code>, <code class="language-plaintext highlighter-rouge">data[1].a2</code>).</li>
</ol>

<p>Let’s make the following changes: we no longer use <code class="language-plaintext highlighter-rouge">class</code>, but <code class="language-plaintext highlighter-rouge">struct</code> instead.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">struct</span> <span class="nc">A</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">val1</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">val2</span><span class="p">;</span>
<span class="p">}</span>
 
<span class="k">public</span> <span class="k">struct</span> <span class="nc">B</span>         <span class="c1">// Size of the type: 24 bytes</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">float</span> <span class="n">f1</span><span class="p">;</span>    <span class="c1">// Offset 0</span>
    <span class="k">public</span> <span class="kt">float</span> <span class="n">f2</span><span class="p">;</span>    <span class="c1">// Offset 4</span>
 
    <span class="k">public</span> <span class="n">A</span> <span class="n">a1</span><span class="p">;</span>        <span class="c1">// Offset 8</span>
    <span class="k">public</span> <span class="n">A</span> <span class="n">a2</span><span class="p">;</span>        <span class="c1">// Offset 16</span>
<span class="p">}</span>

<span class="c1">// One single memory block of 256 * 24 bytes</span>
<span class="kt">var</span> <span class="n">data</span> <span class="p">=</span> <span class="k">new</span> <span class="n">B</span><span class="p">[</span><span class="m">256</span><span class="p">];</span>

</code></pre></div></div>

<p>Ok, this is a naive explanation (internally .NET does things a bit differently), but you get the point:</p>

<ul>
  <li>We now have <strong>1</strong> object allocated on the heap (<code class="language-plaintext highlighter-rouge">data</code>), which holds <strong>one contiguous memory block</strong> that sequentially stores all instances of <code class="language-plaintext highlighter-rouge">B</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">B</code> no longer references other objects: the <code class="language-plaintext highlighter-rouge">a1</code> and <code class="language-plaintext highlighter-rouge">a2</code> fields are <strong>part of <code class="language-plaintext highlighter-rouge">B</code></strong>, not referenced by <code class="language-plaintext highlighter-rouge">B</code>.</li>
</ul>
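<p>You can verify the 24-byte figure from the comments above yourself. Here is a small sketch, assuming <code class="language-plaintext highlighter-rouge">Unsafe.SizeOf&lt;T&gt;</code> is available (it is in-box on modern .NET, and via the <code class="language-plaintext highlighter-rouge">System.Runtime.CompilerServices.Unsafe</code> package on older runtimes):</p>

```csharp
using System;
using System.Runtime.CompilerServices;

public struct A
{
    public int val1;
    public int val2;
}

public struct B
{
    public float f1;   // offset 0
    public float f2;   // offset 4
    public A a1;       // offset 8, stored inline
    public A a2;       // offset 16, stored inline
}

public static class SizeCheck
{
    public static void Main()
    {
        // a1 and a2 are part of B, so B is 4 + 4 + 8 + 8 = 24 bytes
        Console.WriteLine(Unsafe.SizeOf<B>());        // 24
        // An array of 256 B values is one contiguous block
        Console.WriteLine(Unsafe.SizeOf<B>() * 256);  // 6144 bytes
    }
}
```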

<p>A <code class="language-plaintext highlighter-rouge">foreach</code> on the <code class="language-plaintext highlighter-rouge">class</code> version accessing all the fields would have to deal with 769 distinct memory locations, and the CPU would have a hard time prefetching to reduce data-access time.</p>

<p>A <code class="language-plaintext highlighter-rouge">foreach</code> on the <code class="language-plaintext highlighter-rouge">struct</code> version accessing all the fields is about as fast as it can be: there is one memory block, the CPU quickly understands that we are <strong>sequentially</strong> accessing the data, so prefetching and cache loads are very efficient, because everything was designed for this!</p>
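<p>To make the comparison concrete, here is a stripped-down sketch of both traversals (the <code class="language-plaintext highlighter-rouge">CA</code>/<code class="language-plaintext highlighter-rouge">SA</code> types are simplified stand-ins for the ones above; actual timings depend on your machine):</p>

```csharp
using System;

public class CA { public int val1, val2; }   // reference type: one heap object per element
public struct SA { public int val1, val2; }  // value type: stored inline in the array

public static class Traverse
{
    public static long SumClasses(CA[] items)
    {
        long sum = 0;
        // Each iteration dereferences a pointer to a separate heap object
        for (int i = 0; i < items.Length; i++)
            sum += items[i].val1 + items[i].val2;
        return sum;
    }

    public static long SumStructs(SA[] items)
    {
        long sum = 0;
        // Elements are contiguous: the hardware prefetcher can stay ahead of the loop
        for (int i = 0; i < items.Length; i++)
            sum += items[i].val1 + items[i].val2;
        return sum;
    }

    public static void Main()
    {
        var classes = new CA[256];
        var structs = new SA[256];
        for (int i = 0; i < 256; i++)
        {
            classes[i] = new CA { val1 = i, val2 = i };
            structs[i] = new SA { val1 = i, val2 = i };
        }
        Console.WriteLine(SumClasses(classes)); // 65280
        Console.WriteLine(SumStructs(structs)); // 65280: same result, better locality
    }
}
```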

<h3 id="benchmark-of-class-versus-struct">Benchmark of class versus struct</h3>

<p>I’ve created a small project to demonstrate what was explained above; you can grab it and play with it, or just keep reading.</p>

<p>There are two implementations of a simple program dealing with a financial Stock, which contains a list of Trades; each Trade in turn contains a list of Tickets.</p>

<h4 id="diagram-of-the-class-version">Diagram of the class version</h4>

<p><img src="/assets/uploads/2018/04/Working-with-structClassDiagram-6.png" alt="" /></p>

<h4 id="diagram-of-the-struct-version">Diagram of the struct version</h4>

<p><img src="/assets/uploads/2018/04/Working-with-structStructDiagram-4.png" alt="" /></p>

<p>(don’t mind about the TradeType enum, it’s not important here)</p>

<h4 id="the-program">The program</h4>

<p>The program file is fairly simple:</p>

<ul>
  <li>It creates one Stock.</li>
  <li>Generates 1000 Trades to buy or sell some quantity of this Stock.</li>
  <li>Each Trade results in one or more Tickets, each with a given quantity at a given price. The sum of all the Tickets’ quantities matches the quantity requested by the Trade.</li>
</ul>

<p>The program creates the <code class="language-plaintext highlighter-rouge">class</code> version and the <code class="language-plaintext highlighter-rouge">struct</code> one.</p>

<p>We are going to bench an operation that will compute the average buy price and average sell price for all the Tickets.</p>

<p>So basically:</p>

<ul>
  <li>We parse all the tickets of all the trades</li>
  <li>Multiply their price by their quantity</li>
  <li>Divide the total buy price by the total buy quantity, same for sell.</li>
</ul>

<p>In other words, we walk the whole tree of instances and perform a basic computation on it.</p>
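<p>The computation above can be sketched as follows (these <code class="language-plaintext highlighter-rouge">Trade</code>/<code class="language-plaintext highlighter-rouge">Ticket</code> types are simplified stand-ins for the ones in the benchmark project):</p>

```csharp
using System;

public enum TradeType { Buy, Sell }

public struct Ticket
{
    public int Quantity;
    public double Price;
}

public struct Trade
{
    public TradeType Type;
    public Ticket[] Tickets;
}

public static class AveragePrice
{
    // Walks every Ticket of every Trade and returns the quantity-weighted
    // average buy price and sell price.
    public static (double AvgBuy, double AvgSell) Compute(Trade[] trades)
    {
        double buyTotal = 0, sellTotal = 0;
        long buyQty = 0, sellQty = 0;

        for (int i = 0; i < trades.Length; i++)
        {
            var tickets = trades[i].Tickets;
            for (int j = 0; j < tickets.Length; j++)
            {
                // Weight each ticket's price by its quantity
                double amount = tickets[j].Price * tickets[j].Quantity;
                if (trades[i].Type == TradeType.Buy)
                {
                    buyTotal += amount;
                    buyQty += tickets[j].Quantity;
                }
                else
                {
                    sellTotal += amount;
                    sellQty += tickets[j].Quantity;
                }
            }
        }
        return (buyQty == 0 ? 0 : buyTotal / buyQty,
                sellQty == 0 ? 0 : sellTotal / sellQty);
    }
}
```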

<p>Here is a result of the benchmark comparing the <code class="language-plaintext highlighter-rouge">class</code> version against the <code class="language-plaintext highlighter-rouge">struct</code> (using <a href="http://benchmarkdotnet.org/">BenchmarkDotNet</a>)</p>

<p><img src="/assets/uploads/2018/04/Working-with-structBenchStructVsClass01-2.png" alt="" /></p>

<p>A few facts:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">struct</code> version is <strong>4 times faster</strong> than the <code class="language-plaintext highlighter-rouge">class</code> one. Take a look at the Scaled column: the <code class="language-plaintext highlighter-rouge">class</code> version is the baseline, so its value is <code class="language-plaintext highlighter-rouge">1.00</code>, and the <code class="language-plaintext highlighter-rouge">struct</code> version runs at <code class="language-plaintext highlighter-rouge">0.24</code> of the baseline.</li>
  <li>There’s no Garbage Collection on the <code class="language-plaintext highlighter-rouge">struct</code> version, for a pretty obvious reason.</li>
  <li>The <code class="language-plaintext highlighter-rouge">class</code> version has some Garbage Collection and extra allocated memory.</li>
</ul>

<p><strong>Let’s be clear:</strong> this benchmark does <strong>not</strong> include the construction of the objects; that is done in a setup phase that is not benchmarked. Here, we are only profiling the computation of the average prices.</p>

<p>So why is it 4 times faster, considering we’re not creating objects, only traversing them? The reason is the one explained in the <a href="http://loicbaumann.fr/en/2018/04/02/how-to-optimize-net-development-using-net-core-2-1-and-c-7-2/">first post</a> of the series: <code class="language-plaintext highlighter-rouge">struct</code>-based objects are more memory friendly.</p>

<h3 id="lets-explain-a-bit">Let’s explain a bit</h3>

<h4 id="memory-layout-for-the-struct-version">Memory layout for the <code class="language-plaintext highlighter-rouge">struct</code> version</h4>

<p>In the diagram above, each color represents a memory location. <img src="/assets/uploads/2018/04/Working-with-structStructInMemory-1.png" alt="" /></p>

<p>What is important to understand is:</p>

<ul>
  <li>All Trade objects (Tr1…Tr6) are stored in an array <strong>(stored, not referenced!)</strong>, so they occupy a <strong>contiguous memory zone</strong>. A for loop over them is pretty efficient, as the CPU will already be fetching <em>Trade n+1</em> while we’re processing <em>Trade n</em>.</li>
  <li>Same thing for the Tickets, but only for the ones that are owned by the same Trade: each Trade has an array containing the Tickets it owns.</li>
</ul>

<p>In our case, 1000 Trade objects sit in one contiguous memory location: this is very memory friendly!</p>

<p>In the program there are, on average, 5 Tickets per Trade, which is apparently also enough to be memory friendly.</p>

<p>We could push things further and store all Tickets of all Trades in a single array, but that would make things a bit more complicated; let’s keep it simple for now.</p>

<h4 id="memory-layout-for-the-class-version">Memory layout for the <code class="language-plaintext highlighter-rouge">class</code> version</h4>

<p>Well, no need for colors this time: each object is stored in a distinct memory location, determined by the heap manager of the .NET CLR.</p>

<p><img src="/assets/uploads/2018/04/Working-with-structClassInMemory-1.png" alt="" /></p>

<p>What is important to understand here is:</p>

<ul>
  <li>You have no control over (or guarantee of) where objects are stored relative to one another, which is not good when you care about performance.</li>
  <li>Each object has a distinct lifetime, which is good, but it comes at a price.</li>
</ul>

<h4 id="a-design-choice-to-make">A design choice to make</h4>

<p>Again, there is no silver bullet: to gain something, you have to give up something else in return.</p>

<p>In our case it is more about a design decision to make:</p>

<ul>
  <li>You can easily store everything as objects on the heap; this is easy and very <em>“C#”</em>, but performance will be what it is: average for .NET.</li>
  <li>Or you can decide from the start how your objects will be stored, to improve performance, at the expense of some programming flexibility/simplicity.</li>
</ul>

<p>There’s a saying out there which warns every programmer:</p>

<blockquote>
  <p>“Early optimization is the root of all evil.”</p>
</blockquote>

<p>This is a simplified version of a quote from the great <a href="https://en.wikipedia.org/wiki/Donald_Knuth">Donald Knuth</a>:</p>

<blockquote>
  <p>“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”</p>
</blockquote>

<p>Early optimization is <strong>not</strong> always the root of all evil, though most of the time it will be. Optimizing something that isn’t worth it is one of the biggest mistakes we have all made (and still make, because, you know, it’s fun, it’s challenging!).</p>

<p>However, there are some <strong>profound design choices</strong> that have to be made <strong>from the start</strong>, because after, <strong>it will be too late!</strong></p>

<p>Ok, that’s all for this post. In the next one we will take a closer look at the code and how to design and program things to achieve better performance!</p>

<h3 id="update-1-on-april-the-5th">UPDATE #1 on April the 5th</h3>

<p>As Marko Lahma pointed out in the comments, the class/struct benchmark is not a fair one: I relied on foreach for the classes, because, well, daily habits. This is what generated the 2040 B of allocations and the Gen 0 GC. The speed difference was bigger than I expected, mainly because the test does pretty much nothing inside the nested loops (and the GC surely impacts overall performance).</p>

<p>Here are the results.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">Method</th>
      <th style="text-align: right">Mean</th>
      <th style="text-align: right">Error</th>
      <th style="text-align: right">StdDev</th>
      <th style="text-align: right">Scaled</th>
      <th style="text-align: right">Gen 0</th>
      <th style="text-align: right">Allocated</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">ComputeAveragePriceOnClass</td>
      <td style="text-align: right">4,083.4 ns</td>
      <td style="text-align: right">13.766 ns</td>
      <td style="text-align: right">12.877 ns</td>
      <td style="text-align: right">1.00</td>
      <td style="text-align: right">0.4807</td>
      <td style="text-align: right">2040 B</td>
    </tr>
    <tr>
      <td style="text-align: right">ComputeAveragePriceOnClassNoEnumerator</td>
      <td style="text-align: right">2,807.9 ns</td>
      <td style="text-align: right">27.279 ns</td>
      <td style="text-align: right">25.517 ns</td>
      <td style="text-align: right">0.69</td>
      <td style="text-align: right">–</td>
      <td style="text-align: right">0 B</td>
    </tr>
    <tr>
      <td style="text-align: right">ComputeAveragePriceOnStruct</td>
      <td style="text-align: right">850.5 ns</td>
      <td style="text-align: right">4.484 ns</td>
      <td style="text-align: right">3.500 ns</td>
      <td style="text-align: right">0.21</td>
      <td style="text-align: right">–</td>
      <td style="text-align: right">0 B</td>
    </tr>
  </tbody>
</table>]]></content><author><name>Loïc Baumann</name></author><category term=".net" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Working with struct, a closer look</title><link href="https://nockawa.github.io/working-with-struct-a-closer-look/" rel="alternate" type="text/html" title="Working with struct, a closer look" /><published>2018-04-03T08:25:34+00:00</published><updated>2018-04-03T08:25:34+00:00</updated><id>https://nockawa.github.io/working-with-struct-a-closer-look</id><content type="html" xml:base="https://nockawa.github.io/working-with-struct-a-closer-look/"><![CDATA[<h3 id="introduction">Introduction</h3>

<p>It’s time to take a closer look at the code and the core mechanics of working with struct in C# 7.2 and .NET Core 2.1.</p>

<p>First, we will make a quick recap of the new <code class="language-plaintext highlighter-rouge">ref</code> and <code class="language-plaintext highlighter-rouge">in</code> keywords.</p>

<p>Then, we will look at a class used to easily store and retrieve the objects, and see how we use it to manipulate them.</p>

<p>Finally we will see some pitfalls to avoid.</p>

<h3 id="quick-recap-of-the-ref-and-in-keywords">Quick recap of the <code class="language-plaintext highlighter-rouge">ref</code> and <code class="language-plaintext highlighter-rouge">in</code> keywords</h3>

<p>For those who are not familiar with the new feature of C# 7.2, let’s make a quick recap of the <code class="language-plaintext highlighter-rouge">ref</code> and <code class="language-plaintext highlighter-rouge">in</code> keywords. You can also find the full documentation about this <a href="https://docs.microsoft.com/en-us/dotnet/csharp/reference-semantics-with-value-types">here</a>.</p>

<p>Say you have this struct:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">partial</span> <span class="k">struct</span> <span class="nc">Vector3</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">double</span> <span class="n">X</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">double</span> <span class="n">Y</span><span class="p">;</span>
    <span class="k">public</span> <span class="kt">double</span> <span class="n">Z</span><span class="p">;</span>
<span class="p">}</span>

</code></pre></div></div>

<p>Now you want to code a method that adds two vectors:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">partial</span> <span class="k">struct</span> <span class="nc">Vector3</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="n">Vector3</span> <span class="nf">Add</span><span class="p">(</span><span class="n">Vector3</span> <span class="n">a</span><span class="p">,</span> <span class="n">Vector3</span> <span class="n">b</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">return</span> <span class="k">new</span> <span class="n">Vector3</span>
        <span class="p">{</span>
            <span class="n">X</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">X</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">X</span><span class="p">,</span>
            <span class="n">Y</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">Y</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">Y</span><span class="p">,</span>
            <span class="n">Z</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">Z</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">Z</span>
        <span class="p">};</span>
    <span class="p">}</span>
<span class="p">}</span>


</code></pre></div></div>

<p>In this implementation we have two similar problems:</p>

<ol>
  <li>When you call the <code class="language-plaintext highlighter-rouge">Add()</code> method, the <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> objects you pass are <strong>copied</strong>: that’s the basic behavior of value types. You may argue it’s not a big deal for such a small type, considering a 64-bit pointer is already a third of its size, but that’s not the point right now.</li>
  <li>You also have to create a new instance to store the result and return it to the caller. This instance will also be copied at the call site.</li>
</ol>

<p>So we are potentially dealing with 4 copies for a simple addition. These copies are made on the stack rather than the heap because we’re dealing with value types, but nevertheless: it’s not the fastest way.</p>

<p>Now let’s take a look at a different implementation of the <code class="language-plaintext highlighter-rouge">Add()</code> method.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">partial</span> <span class="k">struct</span> <span class="nc">Vector3</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">AddByRef</span><span class="p">(</span><span class="k">ref</span> <span class="n">Vector3</span> <span class="n">a</span><span class="p">,</span> <span class="k">ref</span> <span class="n">Vector3</span> <span class="n">b</span><span class="p">,</span> <span class="k">out</span> <span class="n">Vector3</span> <span class="n">res</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">res</span><span class="p">.</span><span class="n">X</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">X</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">X</span><span class="p">;</span>
        <span class="n">res</span><span class="p">.</span><span class="n">Y</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">Y</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">Y</span><span class="p">;</span>
        <span class="n">res</span><span class="p">.</span><span class="n">Z</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">Z</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">Z</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>

<p>Small changes, but big differences:</p>

<ul>
  <li>Adding the <code class="language-plaintext highlighter-rouge">ref</code> keyword no longer copies the objects passed to the method but passes <strong>a reference</strong> to them.</li>
  <li>The <code class="language-plaintext highlighter-rouge">out</code> keyword, which has existed for quite some time, avoids another copy by storing the result directly in the destination object.</li>
</ul>

<p>We got rid of these 4 copies fairly easily. The arguable trade-off is not returning the result but using an <code class="language-plaintext highlighter-rouge">out</code> parameter, which is less convenient to use, but again, fast.</p>

<p>This implementation is still not quite right: the <code class="language-plaintext highlighter-rouge">ref</code> keyword allows the <code class="language-plaintext highlighter-rouge">AddByRef()</code> method to modify the contents of <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> (remember, they are references now), which is not appropriate in our case. This is why we should rely on the new <code class="language-plaintext highlighter-rouge">in</code> keyword instead, which passes a <strong>read-only reference</strong> to the object.</p>

<p>The correct implementation should be:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">partial</span> <span class="k">struct</span> <span class="nc">Vector3</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">AddByRef</span><span class="p">(</span><span class="k">in</span> <span class="n">Vector3</span> <span class="n">a</span><span class="p">,</span> <span class="k">in</span> <span class="n">Vector3</span> <span class="n">b</span><span class="p">,</span> <span class="k">out</span> <span class="n">Vector3</span> <span class="n">res</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">res</span><span class="p">.</span><span class="n">X</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">X</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">X</span><span class="p">;</span>
        <span class="n">res</span><span class="p">.</span><span class="n">Y</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">Y</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">Y</span><span class="p">;</span>
        <span class="n">res</span><span class="p">.</span><span class="n">Z</span> <span class="p">=</span> <span class="n">a</span><span class="p">.</span><span class="n">Z</span> <span class="p">+</span> <span class="n">b</span><span class="p">.</span><span class="n">Z</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>

<p>This is not the place for an in-depth explanation of how the <code class="language-plaintext highlighter-rouge">in</code> keyword behaves, but be aware that you may not always get a performance improvement, because of the so-called <a href="https://blogs.msdn.microsoft.com/seteplia/2018/03/07/the-in-modifier-and-the-readonly-structs-in-c/">defensive copy</a> mechanism.</p>
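<p>One way to sidestep defensive copies, when you can make the type immutable, is to declare it as a <code class="language-plaintext highlighter-rouge">readonly struct</code> (also introduced in C# 7.2): the compiler then knows that calling a member on an <code class="language-plaintext highlighter-rouge">in</code> parameter cannot mutate it, so no copy is needed. A minimal sketch (the type and method names are made up for illustration):</p>

```csharp
using System;

// 'readonly struct' guarantees no member mutates the instance, so calling
// methods or properties on an 'in' parameter requires no defensive copy.
public readonly struct ReadOnlyVector3
{
    public readonly double X, Y, Z;

    public ReadOnlyVector3(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    public double Length() => Math.Sqrt(X * X + Y * Y + Z * Z);
}

public static class Norms
{
    // With a non-readonly struct, each a.Length()/b.Length() call on an 'in'
    // parameter would force the compiler to copy the argument first; here it does not.
    public static double SumOfLengths(in ReadOnlyVector3 a, in ReadOnlyVector3 b)
        => a.Length() + b.Length();
}
```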

<h3 id="the-refarrayt-class">The <code class="language-plaintext highlighter-rouge">RefArray&lt;T&gt;</code> class</h3>

<p>I’ve quickly developed a small class, <code class="language-plaintext highlighter-rouge">RefArray&lt;T&gt;</code>, that wraps an array and allows access through the new <code class="language-plaintext highlighter-rouge">ref</code> keyword.</p>

<p>Here is the implementation:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">RefArray</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="k">where</span> <span class="n">T</span> <span class="p">:</span> <span class="k">struct</span>
<span class="p">{</span>
    <span class="nc">public</span> <span class="nf">RefArray</span> <span class="p">(</span><span class="kt">int</span> <span class="n">initialSize</span> <span class="p">=</span> <span class="m">16</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">Count</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
        <span class="n">_data</span> <span class="p">=</span> <span class="k">new</span> <span class="n">T</span><span class="p">[</span><span class="n">initialSize</span><span class="p">];</span>
        <span class="n">_dataLength</span> <span class="p">=</span> <span class="n">initialSize</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">public</span> <span class="kt">int</span> <span class="n">Count</span> <span class="p">{</span> <span class="k">get</span> <span class="p">;</span> <span class="k">private</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>

    <span class="k">public</span> <span class="kt">int</span> <span class="nf">Add</span><span class="p">(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">data</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="c1">// Check grow</span>
        <span class="nf">CheckGrow</span><span class="p">();</span>

        <span class="n">_data</span><span class="p">[</span><span class="n">Count</span><span class="p">]</span> <span class="p">=</span> <span class="n">data</span><span class="p">;</span>

        <span class="k">return</span> <span class="n">Count</span><span class="p">++;</span>
    <span class="p">}</span>

    <span class="k">public</span> <span class="k">ref</span> <span class="n">T</span> <span class="k">this</span><span class="p">[</span><span class="kt">int</span> <span class="n">index</span><span class="p">]</span>
    <span class="p">{</span>
        <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">AggressiveInlining</span><span class="p">)]</span>
        <span class="k">get</span>
        <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">index</span> <span class="p">&lt;</span> <span class="m">0</span> <span class="p">||</span> <span class="n">index</span> <span class="p">&gt;=</span> <span class="n">Count</span><span class="p">)</span>  <span class="c1">// bound by Count, not capacity: unused slots stay hidden</span>
            <span class="p">{</span>
                <span class="k">throw</span> <span class="k">new</span> <span class="nf">IndexOutOfRangeException</span><span class="p">();</span>
            <span class="p">}</span>

            <span class="k">return</span> <span class="k">ref</span> <span class="n">_data</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">private</span> <span class="k">void</span> <span class="nf">CheckGrow</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">Count</span> <span class="p">==</span> <span class="n">_dataLength</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="c1">// Grow by 1.5x; the +1 guarantees progress when the length is 0 or 1</span>
            <span class="kt">var</span> <span class="n">newLength</span> <span class="p">=</span> <span class="n">Math</span><span class="p">.</span><span class="nf">Max</span><span class="p">(</span><span class="n">_dataLength</span> <span class="p">+</span> <span class="m">1</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)(</span><span class="n">_data</span><span class="p">.</span><span class="n">Length</span> <span class="p">*</span> <span class="m">1.5f</span><span class="p">));</span>
            <span class="kt">var</span> <span class="n">newArray</span> <span class="p">=</span> <span class="k">new</span> <span class="n">T</span><span class="p">[</span><span class="n">newLength</span><span class="p">];</span>
            <span class="n">_data</span><span class="p">.</span><span class="nf">CopyTo</span><span class="p">(</span><span class="n">newArray</span><span class="p">,</span> <span class="m">0</span><span class="p">);</span>
            <span class="n">_data</span> <span class="p">=</span> <span class="n">newArray</span><span class="p">;</span>
            <span class="n">_dataLength</span> <span class="p">=</span> <span class="n">newLength</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">private</span> <span class="n">T</span><span class="p">[]</span> <span class="n">_data</span><span class="p">;</span>
    <span class="k">private</span> <span class="kt">int</span> <span class="n">_dataLength</span><span class="p">;</span>
<span class="p">}</span>

</code></pre></div></div>

<p>The code is fairly simple: internally it’s an array of T, and you have two methods to interact with it:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">public int Add(ref T data)</code> to add an item to the array.</li>
  <li><code class="language-plaintext highlighter-rouge">public ref T this[int index]</code> to retrieve a reference to the item (to access or modify it).</li>
</ul>

<p>Let’s take a closer look at the array accessor implementation:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">ref</span> <span class="n">T</span> <span class="k">this</span><span class="p">[</span><span class="kt">int</span> <span class="n">index</span><span class="p">]</span>
<span class="p">{</span>
    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">AggressiveInlining</span><span class="p">)]</span>
    <span class="k">get</span>
    <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">index</span> <span class="p">&lt;</span> <span class="m">0</span> <span class="p">||</span> <span class="n">index</span> <span class="p">&gt;=</span> <span class="n">Count</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="k">throw</span> <span class="k">new</span> <span class="nf">IndexOutOfRangeException</span><span class="p">();</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="k">ref</span> <span class="n">_data</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>

<p>You may notice something that isn’t obvious at first glance: there’s only a getter, no setter!</p>

<p>The reason is simple: since a reference to the object is returned, you don’t need a setter; you modify the object directly through the reference.</p>
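<p>The pattern can be sketched in a few self-contained lines (the <code class="language-plaintext highlighter-rouge">Point</code> and <code class="language-plaintext highlighter-rouge">RefDemo</code> names are illustrative, not part of <code class="language-plaintext highlighter-rouge">RefArray&lt;T&gt;</code>):</p>

```csharp
// Minimal, self-contained sketch of the ref-return pattern described above.
// A setter is unnecessary: the caller mutates the element in place
// through the returned reference.
public struct Point
{
    public int X;
    public int Y;
}

public class RefDemo
{
    private Point[] _points = new Point[4];

    // Returns a reference to the array slot, not a copy of the struct.
    public ref Point At(int index) => ref _points[index];

    public static void Main()
    {
        var demo = new RefDemo();
        demo.At(0).X = 42;              // writes directly into the array slot
        ref Point p = ref demo.At(0);
        p.Y = 7;                        // same slot, still no copy
        System.Console.WriteLine($"{demo.At(0).X},{demo.At(0).Y}");  // prints 42,7
    }
}
```

<p>Without the <code class="language-plaintext highlighter-rouge">ref</code> return, <code class="language-plaintext highlighter-rouge">demo.At(0).X = 42</code> would assign to a temporary copy and the write would be lost.</p>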

<p><code class="language-plaintext highlighter-rouge">RefArray&lt;T&gt;</code> is the class I used in the benchmark of post #2 of this series. You could elaborate something more feature-complete, but it serves the primary purpose.</p>

<h4 id="a-concrete-example-of-using-the-arrayt-class">A concrete example of using the <code class="language-plaintext highlighter-rouge">RefArray&lt;T&gt;</code> class</h4>

<p>The struct version of the Stock type uses the <code class="language-plaintext highlighter-rouge">RefArray&lt;T&gt;</code> class to store all the trades the stock owns.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">struct</span> <span class="nc">StockStruct</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="nf">StockStruct</span><span class="p">(</span><span class="kt">string</span> <span class="n">name</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">Name</span> <span class="p">=</span> <span class="n">name</span><span class="p">;</span>
        <span class="n">_tradeArray</span> <span class="p">=</span> <span class="k">new</span> <span class="n">RefArray</span><span class="p">&lt;</span><span class="n">TradeStruct</span><span class="p">&gt;();</span>
    <span class="p">}</span>

    <span class="k">public</span> <span class="kt">string</span> <span class="n">Name</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span>  <span class="p">}</span>

    <span class="k">private</span> <span class="k">readonly</span> <span class="n">RefArray</span><span class="p">&lt;</span><span class="n">TradeStruct</span><span class="p">&gt;</span> <span class="n">_tradeArray</span><span class="p">;</span>
    
    <span class="k">public</span> <span class="kt">int</span> <span class="n">TradeCount</span> <span class="p">=&gt;</span> <span class="n">_tradeArray</span><span class="p">.</span><span class="n">Count</span><span class="p">;</span>

    <span class="k">public</span> <span class="k">ref</span> <span class="n">TradeStruct</span> <span class="nf">GetTrade</span><span class="p">(</span><span class="kt">int</span> <span class="n">index</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">return</span> <span class="k">ref</span> <span class="n">_tradeArray</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
    <span class="p">}</span>

    <span class="k">public</span> <span class="kt">int</span> <span class="nf">AddTrade</span><span class="p">(</span><span class="k">ref</span> <span class="n">TradeStruct</span> <span class="n">trade</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">return</span> <span class="n">_tradeArray</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="k">ref</span> <span class="n">trade</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>

<p>As you can see, the code is fairly simple: the <code class="language-plaintext highlighter-rouge">RefArray&lt;T&gt;</code> class is encapsulated, and we make sure to use the <code class="language-plaintext highlighter-rouge">ref</code> keyword to add/get trades.</p>

<p>It’s worth mentioning that <code class="language-plaintext highlighter-rouge">RefArray&lt;T&gt;</code> is a class, so it’s stored on the heap, which is fine: what matters is that all the struct instances are stored sequentially in the <code class="language-plaintext highlighter-rouge">private T[] _data;</code> field, and that sequential layout is what speeds things up.</p>

<h3 id="the-ref-readonly-pitfall">The <code class="language-plaintext highlighter-rouge">ref readonly</code> pitfall</h3>

<p>When passing or returning a reference to an object, you can add the <code class="language-plaintext highlighter-rouge">readonly</code> modifier to make sure the receiving code won’t be able to modify the instance.</p>

<p>This works fine with structs that expose plain public fields, as we demonstrated with the <code class="language-plaintext highlighter-rouge">Vector3</code> struct. In that case the compiler can detect any attempt to modify a field and report an error at compile time.</p>

<p>If your struct uses properties, things get trickier: internally a property is a method, so you could modify the content of an instance even when accessing a property through its getter.</p>

<p>As of today, the compiler reacts to such cases with a pretty brutal approach: it makes a defensive copy of your readonly object and hands that to the receiving code, to guarantee the original object won’t be modified.</p>
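<p>Here is a minimal sketch of that behavior (the types are illustrative). It uses an <code class="language-plaintext highlighter-rouge">in</code> parameter, but the same defensive-copy mechanism applies to <code class="language-plaintext highlighter-rouge">ref readonly</code> returns:</p>

```csharp
// Sketch of the defensive-copy behavior described above.
// MutableCounter is NOT declared 'readonly struct', so the compiler cannot
// prove that Increment() leaves the instance untouched.
public struct MutableCounter
{
    public int Value;                 // public field, freely mutable
    public void Increment() => Value++;
}

public static class DefensiveCopyDemo
{
    // 'in' passes by readonly reference: the method call on 'c' below is
    // performed on a hidden defensive copy of the struct.
    public static int TryIncrement(in MutableCounter c)
    {
        c.Increment();                // mutates the copy, not the caller's struct
        return c.Value;               // plain field read: still sees the original 0
    }

    public static void Main()
    {
        var counter = new MutableCounter();
        int seen = TryIncrement(in counter);
        // counter.Value is still 0: the increment happened on a throwaway copy.
        System.Console.WriteLine($"{counter.Value} {seen}");  // prints 0 0
    }
}
```

<p>The mutation silently disappears, and you also pay for the extra copy; that cost is what the benchmark below measures.</p>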

<p>As we can see below, the benchmark run with a readonly version of the struct access is slower than both the by-ref access and the plain struct copy!</p>

<p><img src="/assets/uploads/2018/04/Working-with-struct-a-closer-lookBench01.png" alt="" /></p>

<h3 id="general-rules">General rules</h3>

<ol>
  <li>Use a struct to store plain, publicly exposed data; if you need a more complex type with properties, a struct may not be the best fit for you.</li>
  <li>If you want to expose read-only objects, consider declaring the type with the <code class="language-plaintext highlighter-rouge">readonly struct</code> keywords: it will be considered immutable, so the compiler will stay away from defensive copies, ensuring the best performance.</li>
  <li><strong>Profile!</strong> The theory is what it is: theory. It won’t replace the reality of a profiled piece of code! Sometimes copying a struct will be faster than passing a reference, especially for small objects.</li>
</ol>
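<p>Rule #2 can be sketched as follows; this <code class="language-plaintext highlighter-rouge">Vector3</code> is a minimal stand-in, not necessarily identical to the one shown earlier:</p>

```csharp
// Declaring the type 'readonly struct': the compiler now knows no member can
// mutate the instance, so 'in' parameters and 'ref readonly' returns are used
// directly, without defensive copies.
public readonly struct Vector3
{
    public readonly float X, Y, Z;

    public Vector3(float x, float y, float z) { X = x; Y = y; Z = z; }

    // Safe to call through a readonly reference: no hidden copy is made,
    // because the compiler has proven this method cannot mutate the struct.
    public float LengthSquared() => X * X + Y * Y + Z * Z;
}

public static class ReadonlyStructDemo
{
    // 'in' here costs nothing extra: no defensive copy for a readonly struct.
    public static float Length2(in Vector3 v) => v.LengthSquared();
}
```

<p>Trying to add a mutable field or a mutating method to a <code class="language-plaintext highlighter-rouge">readonly struct</code> is a compile-time error, which is exactly the guarantee the compiler needs.</p>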

<p>Immutability in C# is not handled as well as it is in C++, for instance. If you consider advanced scenarios with structs and the <code class="language-plaintext highlighter-rouge">readonly</code> or <code class="language-plaintext highlighter-rouge">in</code> keywords, I strongly encourage you to thoroughly read the official documentation about the <code class="language-plaintext highlighter-rouge">ref</code>/<code class="language-plaintext highlighter-rouge">in</code>/<code class="language-plaintext highlighter-rouge">readonly struct</code> <a href="https://blogs.msdn.microsoft.com/seteplia/2018/03/07/the-in-modifier-and-the-readonly-structs-in-c/">keywords</a>.</p>]]></content><author><name>Loïc Baumann</name></author><category term=".net" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">How to optimize .net development using .net Core 2.1 and C# 7.2</title><link href="https://nockawa.github.io/how-to-optimize-net-development-using-net-core-2-1-and-c-7-2/" rel="alternate" type="text/html" title="How to optimize .net development using .net Core 2.1 and C# 7.2" /><published>2018-04-02T15:29:45+00:00</published><updated>2018-04-02T15:29:45+00:00</updated><id>https://nockawa.github.io/how-to-optimize-net-development-using-net-core-2-1-and-c-7-2</id><content type="html" xml:base="https://nockawa.github.io/how-to-optimize-net-development-using-net-core-2-1-and-c-7-2/"><![CDATA[<h3 id="forewords">Forewords</h3>

<p>This is the first blog post of a series about understanding how to improve performance when developing with .net core and C# 7.2.</p>

<p>Some parts are pure theory, not about .net or C# 7.2 specifically; that’s mostly the case for this first post, which is meant to give the reader the basics of CPU and memory.</p>

<p>This series is intended for all kinds of readers, especially the ones who are not familiar with the topic and are willing to understand the basics.</p>

<p>Experts on the matter may find these posts lacking depth, but that’s on purpose: the goal is not to explain everything thoroughly, which would be too long and would end up confusing most readers, but to explain what matters, why it matters and how to deal with it.</p>

<ol>
  <li>Understanding the memory.</li>
  <li>The benefits of working with <code class="language-plaintext highlighter-rouge">struct</code>.</li>
  <li>Working with Data Stores.</li>
  <li>Working with <code class="language-plaintext highlighter-rouge">Memory&lt;T&gt;</code> and <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code>.</li>
</ol>

<p>If you have remarks, typo corrections, or simply read posts still in progress, you can check my dedicated <a href="https://github.com/nockawa/BlogPosts/tree/Optimize.net/Optimize%20.net">GitHub repo</a>.</p>

<h3 id="introduction">Introduction</h3>

<p>It’s no secret that Microsoft decided to focus on improving performance for the 2.1 release of .net core.</p>

<p>The main driver is improving asp.net core, but that doesn’t mean the new features only target the web server. Most of the time, when performance is concerned, you have to dig down to the lowest layer in order to bring game changers, and this time was no exception.</p>

<p>What is interesting, from my point of view, is that we’re starting to see features that bring us closer to low-level/high-performance languages such as C++.</p>

<p>The goal of this post series is to:</p>

<ol>
  <li>Explain what matters when we’re dealing with optimization.</li>
  <li>Show how you can use the new features (and also some of the old ones) to improve code speed while keeping things clean and well designed.</li>
</ol>

<p>C# is about writing clean code to achieve high maintainability and meet good programming practices/standards. Writing optimized code often drives you away from these principles; finding the right balance is definitely a key skill for the programmer.</p>

<h3 id="why-net-is-slower-than-c-">Why is .net slower than C++?</h3>

<p>Well, there are many reasons and I won’t detail all of them, mostly because I couldn’t, but there are some we can focus on:</p>

<ol>
  <li>Seamless control over object lifetime through garbage collection, which scares people who are performance/real-time driven.</li>
  <li>No direct memory access through pointers, and bounds checks on every array access (we are not considering unsafe .net, of course).</li>
  <li>It is easy to not pay attention to the layout of the data.</li>
  <li>A lot of implicit memory copies. Things are easy to develop, but under the hood you don’t realize how much memory bandwidth is consumed.</li>
  <li>A JIT that doesn’t generate code as efficient as a pre-compiled language does.</li>
</ol>

<p>C# is a pretty high-level programming language, and it’s pretty easy/safe to use; that’s why you have things like bullets #1 to #4 above. On the other hand, it’s also easy not to be aware of what matters to speed things up.</p>

<p>Let’s not focus on #5, because there’s little we can do about it. If we take a close look at #1 to #4, we’ll see there’s a common theme: memory!</p>

<p>Is memory important? <strong>Yes, you bet!</strong></p>

<h4 id="a-bit-of-talk-about-memory">A bit of talk about memory</h4>

<p>CPUs are getting more and more powerful as the years pass by, but we don’t see the same trend for memory, as shown below:</p>

<p><img src="https://assets.bitbashing.io/images/mem_gap.png" alt="Processor vs. memory speeds" /></p>

<p><em><cite>Computer Architecture: A Quantitative Approach</cite> by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau</em></p>

<p>It means that in order to keep the CPU busy, we have to design our code &amp; data in a memory-friendly way, because fetching data directly from main memory costs more than you may think!</p>

<p>There’s a very good analogy that you can <a href="http://www.prowesscorp.com/computer-latency-at-a-human-scale/">read here</a> that basically gives you crucial information.</p>

<p><strong>Let’s summarize it.</strong></p>

<p>Today, most CPU instructions that don’t involve memory access or very complex computation take one cycle to execute; with a 4GHz CPU that’s 4 billion instructions per second per logical core (so 32 billion for a hyper-threaded quad-core).</p>

<p>Let’s scale things to understand their impact better:</p>

<table>
  <thead>
    <tr>
      <th>Access type</th>
      <th>Real duration</th>
      <th>Scaled duration</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>One CPU Cycle</td>
      <td>0.4ns</td>
      <td>1 second</td>
    </tr>
    <tr>
      <td>Cache L1 Access</td>
      <td>0.9ns</td>
      <td>2 seconds</td>
    </tr>
    <tr>
      <td>Cache L2 Access</td>
      <td>2.8ns</td>
      <td>7 seconds</td>
    </tr>
    <tr>
      <td>Cache L3 Access</td>
      <td>28ns</td>
      <td>1 minute</td>
    </tr>
    <tr>
      <td>Main memory Access</td>
      <td>~100ns</td>
      <td>4 minutes</td>
    </tr>
  </tbody>
</table>

<p>Compared to one CPU cycle:</p>

<ul>
  <li>L1 access is 2x slower</li>
  <li>L2 access is 7x slower</li>
  <li>L3 access is 70x slower</li>
  <li>Memory access is <strong>250x</strong> slower!</li>
</ul>

<p>So yes, you can see that the more memory-friendly you are (we’ll explain roughly what that implies), the better your chances of hitting the CPU cache, getting you a significant performance boost!</p>

<p>To put it differently, compared to a main memory access:</p>

<ul>
  <li>An L1 access is about 110 times faster.</li>
  <li>An L2 access is about 36 times faster.</li>
  <li>An L3 access is about 3.6 times faster.</li>
</ul>

<p>So the JIT not being fast enough may not be the main issue: you can make a big difference yourself by being aware of what the CPU needs to execute as fast as possible.</p>

<h3 id="being-memory-friendly-means-being-cache-friendly">Being memory friendly, means being cache friendly!</h3>

<p>There are a lot of good, in-depth articles/posts out there explaining why the CPU cache is important and how to work with it. This topic can get really complex very quickly, here, again, we will try to keep things simple.</p>

<p><img src="/assets/uploads/2018/03/CPU-Z.png" alt="CPU Info" /></p>

<p>A few explanations/remarks:</p>

<ul>
  <li>The Level 1 cache has dedicated storage for Data and Instructions (running assembly code). This is important because we don’t want one to compete against the other.</li>
  <li><code class="language-plaintext highlighter-rouge">4 x 32KBytes</code>: the ‘4 x’ means there’s a dedicated cache for each core of the CPU. That’s right: L1/L2 have dedicated caches for each CPU core. ‘32KBytes’ is the size of each one, per core.</li>
  <li><code class="language-plaintext highlighter-rouge">8-way</code> is about <a href="https://en.wikipedia.org/wiki/CPU_cache#Associativity">‘associativity’</a>, which is a rather complex topic. Follow the link if you’re curious and brave!</li>
  <li>Data in a CPU cache is organized in ‘Lines’ (or Blocks), which nowadays are most of the time 64 bytes wide. It means that whatever you do, when data is loaded into the cache it fills a whole 64-byte line, and the starting address is also a multiple of 64 bytes (hence the importance of allocating memory at addresses that are multiples of 64 bytes).</li>
  <li>The CPU likes to prefetch data. Prefetching means it reads ahead of the data you’re accessing, hoping that you will access your data <strong>sequentially</strong>. This is why it’s a good idea to pack the data you often access together in the same memory zone.</li>
</ul>

<p>More about <a href="https://en.wikipedia.org/wiki/CPU_cache">how a CPU cache works</a>.</p>
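<p>A quick way to feel the prefetcher at work is to sum the same array sequentially and then with a large stride: both loops touch every element exactly once, yet the strided version defeats the prefetcher and pays a cache miss on almost every element. This is a rough sketch, not a rigorous benchmark (use BenchmarkDotNet for real measurements):</p>

```csharp
using System;
using System.Diagnostics;

public static class CacheDemo
{
    // 16M ints = 64MB, far larger than any L1/L2/L3 cache.
    const int Size = 16 * 1024 * 1024;

    // Visits every element exactly once, in 'stride'-spaced passes.
    public static long Sum(int[] data, int stride)
    {
        long sum = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < data.Length; i += stride)
                sum += data[i];
        return sum;
    }

    public static void Main()
    {
        var data = new int[Size];
        for (int i = 0; i < data.Length; i++) data[i] = 1;

        var sw = Stopwatch.StartNew();
        long a = Sum(data, 1);      // sequential: the prefetcher sees the pattern
        long seqMs = sw.ElapsedMilliseconds;

        sw.Restart();
        long b = Sum(data, 4096);   // strided: roughly one cache miss per element
        long strideMs = sw.ElapsedMilliseconds;

        Console.WriteLine($"same sum: {a == b}, sequential: {seqMs} ms, strided: {strideMs} ms");
    }
}
```

<p>On a typical machine the strided pass is several times slower despite doing the exact same amount of arithmetic; the gap is pure memory latency.</p>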

<h3 id="enough-of-the-theory-how-could-we-make-things-faster-in-net">Enough of the theory, how could we make things faster in .net?</h3>

<h4 id="minimizing-the-usage-of-the-garbage-collection">Minimizing the usage of the Garbage Collection</h4>

<p>Yes, the GC is a very nice and handy feature, but like every feature, it’s not a silver bullet: it’s not something you have to rely on 100% of the time, <strong>and definitely not in .net!</strong> The GC is only involved with <code class="language-plaintext highlighter-rouge">class</code>-based types; <code class="language-plaintext highlighter-rouge">struct</code>-based ones are not managed by it. So yes, there are ways to minimize pressure on the GC, and you should know about them!</p>
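<p>As a small illustration (the types are mine, a sketch rather than a benchmark): an array of structs is a single heap allocation with the data stored inline, while an array of class instances is one allocation per element, each of which the GC must track:</p>

```csharp
public class PointClass  { public int X, Y; }   // reference type: every instance is a heap object
public struct PointStruct { public int X, Y; }  // value type: lives inline where it is declared

public static class GcPressureDemo
{
    public static void Main()
    {
        // 1,001 heap allocations: the array itself plus one object per element.
        // Each element is a pointer the GC has to trace on every collection.
        var classes = new PointClass[1_000];
        for (int i = 0; i < classes.Length; i++) classes[i] = new PointClass();

        // Exactly 1 heap allocation: the array. The 1,000 structs are stored
        // inline and sequentially inside it; there is nothing for the GC to trace.
        var structs = new PointStruct[1_000];
        structs[0].X = 1;   // direct write, no indirection, no boxing

        System.Console.WriteLine(classes.Length + structs.Length);  // prints 2000
    }
}
```

<p>The struct array is also the cache-friendly layout from the previous section: the same choice helps the GC and the prefetcher at once.</p>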

<h4 id="directfast-memory-access-avoiding-copies">Direct/fast memory access, avoiding copies</h4>

<p>It’s easy to copy data, to isolate it for the sake of a good design (or easy, readable code). It may not hurt when the size is small and the frequency of the operation is low, but when either of these two factors increases, things amplify and performance drops.</p>

<p>One of the best examples is the <code class="language-plaintext highlighter-rouge">String</code> class: it’s allocated on the heap and it’s immutable, which means every method that “changes” the string actually returns a new object! That’s a lot of memory traffic, and most of the time the developer isn’t aware of it.</p>
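<p>A tiny sketch of that hidden traffic; <code class="language-plaintext highlighter-rouge">StringBuilder</code> is the classic way out, and the span-based APIs covered later in the series go further:</p>

```csharp
// Every "mutating" string method returns a brand-new heap object.
using System;
using System.Text;

public static class StringDemo
{
    public static void Main()
    {
        string s = "hello";
        string upper = s.ToUpper();                   // new string, original untouched
        Console.WriteLine(ReferenceEquals(s, upper)); // prints False: two heap objects

        // Naive concatenation in a loop allocates a new, larger string each turn:
        string bad = "";
        for (int i = 0; i < 5; i++) bad += i;         // 5 intermediate strings created

        // StringBuilder mutates one internal buffer instead:
        var sb = new StringBuilder();
        for (int i = 0; i < 5; i++) sb.Append(i);
        string good = sb.ToString();                  // one final allocation

        Console.WriteLine(bad == good);               // prints True: same content,
                                                      // very different memory traffic
    }
}
```
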

<p>Luckily for us, we have new weapons to improve things in this area.</p>

<h4 id="designing-the-data-in-a-more-memory-friendly-way">Designing the data in a more memory friendly way</h4>

<p>C# is a high-level language: we don’t pay attention to how we define the data in the types we design, and that’s a big mistake when we want things to be driven by performance. Again, this is more about convenience, because the language doesn’t prevent you from improving things: you just don’t know/care to do it.</p>
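<p>One concrete aspect of layout is field ordering: the runtime pads fields to their natural alignment, so the same fields in a different order can change the struct’s size. The sizes in the comments are typical for 64-bit with sequential layout and may vary by runtime:</p>

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Padded              // byte, long, byte: lots of padding
{
    public byte A;                // 1 byte, then 7 bytes padding to align B
    public long B;                // 8 bytes
    public byte C;                // 1 byte, then 7 bytes trailing padding
}                                 // total: typically 24 bytes

[StructLayout(LayoutKind.Sequential)]
public struct Compact             // same fields, grouped by size
{
    public long B;                // 8 bytes
    public byte A;                // 1 byte
    public byte C;                // 1 byte, then 6 bytes trailing padding
}                                 // total: typically 16 bytes

public static class LayoutDemo
{
    public static void Main()
    {
        Console.WriteLine(Marshal.SizeOf<Padded>());   // typically 24
        Console.WriteLine(Marshal.SizeOf<Compact>());  // typically 16
    }
}
```

<p>A third of the <code class="language-plaintext highlighter-rouge">Padded</code> version is pure padding: wasted cache-line space on every single load.</p>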

<h2 id="to-be-followed-">To be continued!</h2>

<p>This was just the first post of the series and we talked mostly about theory; it was important to lay these foundations for the posts to come.</p>

<p>Starting with the next post, we’ll get into concrete stuff, with examples.</p>]]></content><author><name>Loïc Baumann</name></author><category term=".net" /><summary type="html"><![CDATA[Forewords]]></summary></entry><entry><title type="html">Microservice or not microservice…</title><link href="https://nockawa.github.io/microservice-or-not-microservice/" rel="alternate" type="text/html" title="Microservice or not microservice…" /><published>2018-01-28T09:44:21+00:00</published><updated>2018-01-28T09:44:21+00:00</updated><id>https://nockawa.github.io/microservice-or-not-microservice</id><content type="html" xml:base="https://nockawa.github.io/microservice-or-not-microservice/"><![CDATA[<p>It is always a good thing to benefit from the point of view of others; things are never either black or white, and finding your way in the grey area that you’ll have to define is certainly not easy.<br />
I really enjoyed reading <a href="http://www.dwmkerr.com/the-death-of-microservice-madness-in-2018/">this article</a> from <a href="https://twitter.com/dwmkerr">@dwmkerr</a> because it highlights many good points; the article’s title was carefully chosen to generate some “hype”, and it certainly delivered.</p>

<p>But I have to say that after this, I saw a wave of negative opinions toward microservices (even if many people defended the concept by writing comments on Dave’s blog), and I thought it could be useful to share my experience on the matter.</p>

<p>The trigger for me to get back to blogging after so many years is this <a href="https://twitter.com/KatNovakovic/status/957427701555527682">twitter post</a> from Katrina Novakovic, which basically summarizes Dave’s key arguments:</p>

<ul>
  <li>Complexity for developers, operators and devops</li>
  <li>Requires expertise</li>
  <li>Poorly defined boundaries of real world systems</li>
  <li>Complexities of state and communication often ignored</li>
  <li>Versioning can be hard</li>
  <li>Monoliths in disguise</li>
  <li>Distributed Transactions</li>
</ul>

<p>Looking at it this way, it would be hard to find a more extreme point of view. Summarized like this, microservices sound scary for sure!</p>

<h3 id="few-facts">A few facts</h3>

<ul>
  <li><strong>Silver bullets don’t exist</strong>, in the real world and also in the programming/architecture world. Microservices won’t be the solution to everything! Every time there’s hype around something, some people become “experts” at it and then try to push the concept to solve everything. It leads less experienced people to believe that a given pattern is the <strong>ultimate</strong> one, a mere dream… always ending the same way.</li>
  <li>
    <p><strong>Doing complex stuff is easy, while achieving simplicity is very hard</strong>. One of my all-time favorite quotes comes from Leonardo da Vinci:</p>

    <p><em>“Simplicity is the ultimate sophistication.”</em> It became one of my mantras in my daily work, because where programming/architecture is concerned, something simple is really hard to achieve, and when you create something complex you have to realize something: you’ve done it wrong.<br />
  Of course, we are not perfect, we will always make mistakes we won’t have the time to correct, but just acknowledging this is very important to improve yourself for the next opportunity you get. Otherwise you will go deeper into complexity, even embracing it, because you will feel “superior” to the other “mere mortals” who can’t grasp what you’ve done.</p>
  </li>
  <li>
    <p>If you transition directly from a monolithic architecture to a microservice one: <strong>you will suffer!</strong> These are almost two extremes; would you really think switching would be that easy? Hell no!</p>

    <p>Unfortunately a lot of people make this mistake, and most of the time they don’t have a choice: they are young professionals who inherited an old monolithic architecture, and when they finally get the chance to start from a clean slate, they go toward the opposite extreme, not realizing the consequences, which leads them to a very complex solution. For this reason…</p>
  </li>
  <li><strong>Shifting to microservices is hard.</strong> Rome wasn’t built in a day, and neither will your expertise on the matter be, although reading up on the topic may save you from some mistakes.</li>
</ul>

<h3 id="things-to-realize">Things to realize</h3>

<p>A microservice architecture requires a lot of tooling and best practices to operate correctly; if you’re not experienced and ready in all of them, you will suffer:</p>

<ul>
  <li>A <strong>Continuous Delivery Chain</strong> (CDC) is required: if you don’t commit/build/test/push packages in an automated and versioned fashion, you will certainly fail.</li>
  <li>
    <p>One of the key principles to respect in architecture is “<strong>low coupling of components</strong>”.</p>

    <p>You know, the thing that didn’t exist in your monolithic architecture, which made you want to die many times because of the famous butterfly effect: you touch one little thing in a given place and <strong>bam</strong>, you have regressions in other parts you never thought were related.<br />
  If you don’t have experience designing and writing loosely coupled architecture and code, then you will certainly fail.</p>
  </li>
  <li>I’m not a pattern freak, someone who lives by dogma at all costs, but my advice is: try to learn <a href="https://en.wikipedia.org/wiki/Domain-driven_design"><strong>Domain Driven Design</strong></a>. If you don’t get it, it will be hard to switch to microservices. The very fine-grained approach of microservices requires most of the same constraints, so DDD will certainly be a good starting point to familiarize yourself with them.</li>
  <li>Having a Continuous Delivery Chain is a start, but it won’t be enough: an orchestrator to deploy automatically, and components to monitor the health and load of each service instance, are almost mandatory.<br />
  Otherwise the operating cost (and complexity) will certainly be unbearable. As Dave Kerr mentioned in his article, there is a correlation between microservices and DevOps. I would put it more precisely: you can’t do microservices if you’re not already good at DevOps.</li>
</ul>

<h3 id="my-point-of-view-more-detailed">My point of view, more detailed</h3>

<p>Here is what I have to say on each of the main points of Dave’s article:</p>

<ul>
  <li><strong>Complexity for developers, operators and #devops</strong>: nothing is complex once you master it. As I stated above, you have to be good at DevOps principles if you want to have a chance. You won’t build a microservice architecture in one day.</li>
  <li><strong>Requires expertise</strong>: well, another open door blasted through… Everything requires expertise, even a monolithic architecture; it’s just that some things are easier to gain expertise in than others.</li>
  <li><strong>Poorly defined boundaries of real world systems</strong>: this is not specific to microservices; look at DDD first…</li>
  <li><strong>Complexities of state and communication often ignored</strong>: this one deserves its own explanation below.</li>
  <li><strong>Versioning can be hard</strong>: because versioning is ever easy? More on this below.</li>
  <li><strong>Monoliths in disguise</strong>: your microservice architecture certainly doesn’t sound like mine, but I believe that failing to design loosely coupled services, a lack of knowledge of DDD, and versioning issues lead to an architecture of many services that end up tied to one another, forcing you to update most of your architecture every time you make a minor upgrade. The microservice architecture didn’t fail you; you failed it.</li>
  <li><strong>Distributed Transactions</strong>: more below…</li>
</ul>

<h3 id="complexities-of-state-and-communication">Complexities of state and communication</h3>

<p>For me, most of your microservice architecture has to be stateless. Easier said than done, I agree, but you have to employ at least two layers in your architecture:</p>

<ol>
  <li>The top-level one, which answers requests from your clients, whether a rich desktop app, the web or a third-party service. Here you will take care of things like authentication, authorization, session state management (using Redis, NCache, Geode or whatever suits you) and security-based cross-cutting concerns. This layer will be the visible surface of your architecture; you certainly don’t have to expose the whole architecture to the rest of the world (yes, your rich desktop and web clients are part of “another world”, for the sake of achieving low coupling).</li>
  <li>The rest of your architecture will be stateless services which can communicate freely, without fear of security issues, always carrying the minimal state (which will <strong>always</strong> be a subset of the actual client’s state) needed to perform the operation.</li>
</ol>

<h3 id="versioning-can-be-hard">Versioning can be hard</h3>

<p>Yes, it always is… That’s why we have ALM, DevOps and tons of disciplines I won’t mention (everyone has his/her favorite) to deal with this. But whatever you do, you have to realize/accept a few things:</p>

<ul>
  <li><strong>Backward compatibility is a must</strong>: it must be maintained at the service’s interface declaration, at run-time level. If you’re from the .net world, <a href="https://stackoverflow.com/questions/1456785/a-definitive-guide-to-api-breaking-changes-in-net">this post</a> will be helpful. If you fail to do so, yes, you will end up with “monoliths in disguise”, but the same goes for everything else: if you need to recompile a client when you upgrade an interface it uses, you failed. You don’t need microservices to fail at that; two DLLs talking to each other are enough.</li>
  <li>So please respect the <a href="https://semver.org/"><strong>SemVer</strong></a> <strong>principles</strong> and don’t be afraid of introducing a brand-new interface as a new major version when you break backward compatibility. It should not harm your microservice architecture if you have a CDC and everything that monitors and handles scaling up <strong>and down</strong> through automated deployment (and cleanup). If nobody uses the V1 of a given service anymore, it will end up running a couple of instances for almost nothing, eventually being decommissioned by a human or a machine.</li>
  <li>Again, low coupling and DDD will be essential to success.</li>
</ul>

<h3 id="distributed-transactions">Distributed Transactions</h3>

<p>I don’t see why distributed transactions would be a requirement for doing microservices; I never used them and always found my way of doing things.</p>

<p>One of the key fundamentals of microservices is a very fine-grained solution: a service operation shouldn’t take forever to execute. Hence, when you really require a transaction spanning many nested calls, if you rely on synchronous calls, with each operation declaring its own transaction and reporting, as it should, the success or failure of its execution, then the caller can commit or roll back its own transaction: no big deal.</p>
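<p>A minimal sketch of that pattern (in C++ for illustration; <code>LocalTransaction</code>, <code>RunOperation</code> and <code>RunCallerTransaction</code> are all hypothetical names, and a real implementation would sit behind your service interfaces): each operation commits or rolls back its own transaction and reports the outcome, and the caller commits only if every nested call succeeded.</p>

```cpp
#include <functional>
#include <vector>

// Hypothetical local transaction: each service operation opens its own,
// then reports success or failure to its caller.
struct LocalTransaction {
    bool committed  = false;
    bool rolledBack = false;
    void Commit()   { committed = true; }
    void Rollback() { rolledBack = true; }
};

// A service operation runs its work inside its own transaction and
// reports the outcome synchronously.
bool RunOperation(const std::function<bool(LocalTransaction&)>& work) {
    LocalTransaction tx;
    if (work(tx)) { tx.Commit(); return true; }
    tx.Rollback();
    return false;
}

// The caller chains synchronous operations; it commits its own
// transaction only if every nested operation succeeded.
bool RunCallerTransaction(
        const std::vector<std::function<bool(LocalTransaction&)>>& ops) {
    LocalTransaction callerTx;
    for (const auto& op : ops) {
        if (!RunOperation(op)) { callerTx.Rollback(); return false; }
    }
    callerTx.Commit();
    return true;
}
```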

<p>If you go async everywhere, even when it’s not needed, then OK, things are going to be tougher…</p>

<h3 id="conclusion">Conclusion</h3>

<p>For me, server-side architecture and development is definitely challenging. You can build a very simple (and working, to some extent) architecture for on-premise solutions, but when you go for SaaS you will definitely stumble upon things like high availability, scaling, mutualized architecture and PaaS solutions, and it will be a whole different world.</p>

<p>Whatever architecture you employ, if you don’t do it the right way, you will fail because the result will be overly complex. I agree that microservices are, so far, a complex architecture, so jumping in should be done with extreme caution. Is it a silver bullet? Nope, it won’t be the solution to everything; big companies rely on it because they have the skills, they can afford it and, above all, they need it.</p>

<p>That being said, I don’t rule out this architecture for smaller companies. It will certainly be harder for them, but there are more and more solutions that will assist you (<a href="https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-overview-microservices">read this</a> for instance).</p>

<p>I always welcome all points of view because, even if the tone of this post doesn’t suggest it <em>at all</em>, I will always be open-minded and learn from others. The hype around new tech in our industry brings us as many good things as bad ones. One day a given tech is the solution to everything; the day after, it is crucified by everyone, most of the time for the sake of a newer tech.</p>

<p>Microservices are being crucified and we’re entering the era of serverless architecture: the true, first, and only silver bullet that will ease all our pain!</p>

<p>Does it look like something we’ve already seen before? Hum… 🙂</p>]]></content><author><name>Loïc Baumann</name></author><category term="Architecture" /><summary type="html"><![CDATA[It is always a good thing to benefit from the point of view of others, things are never either black or white and to find your way in this grey area that you’ll have to define is certainly not easy. I really enjoyed reading this article from @dwmkerr because it highlights many good points, the article’s title was carefully chosen to generate some “hype”, and it certainly delivered.]]></summary></entry><entry><title type="html">Two months later…</title><link href="https://nockawa.github.io/two-months-later/" rel="alternate" type="text/html" title="Two months later…" /><published>2004-10-19T19:24:00+00:00</published><updated>2025-07-18T01:00:00+00:00</updated><id>https://nockawa.github.io/two-months-later</id><content type="html" xml:base="https://nockawa.github.io/two-months-later/"><![CDATA[<p>I was working on something else (and took holidays), so I didn’t have time to go back to the renderer until three weeks ago. <br />
At first I wasn’t considering these three weeks’ work as part of the SM3 Renderer, so I didn’t want to update this page.</p>

<p>But well, even if it’s not talking about a cool rendering technique, it’s still part of this project, and this is something I’d like to share too.</p>

<h3 id="here-we-go-lets-catching-up-with-my-new-in-viewport-gui">Here we go, let’s catch up with my new in-viewport GUI</h3>

<h4 id="windowing-system-and-redraw">Windowing System and redraw</h4>

<p>There were three criteria to pay attention to: fast display of windows, good use of alpha, and keeping the whole system as flexible as possible.</p>

<p>The GUI system is like the others: you have windows organized in a hierarchical way. There are notions of active window, focused window and “hovered” window. You can capture mouse events (and stack the captures). There’s a global alpha constant for the GUI and one for each top-level window, used by all the low-level drawing methods (DrawRect, FillRect, DrawLineList, DrawMesh, DrawTexture, etc.) for fading effects. Redraw had to be optimal, so I’m also using clipping regions (via the D3D scissor rect).</p>
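<p>To make the clipping concrete, here is a small sketch (assumed names, not the engine’s actual API) of the rectangle intersection a scissor-rect-based clipper boils down to: a child window’s rectangle is intersected with its parent’s clip rectangle before drawing, and an empty result means there is nothing to draw at all.</p>

```cpp
#include <algorithm>

// Hypothetical clip-rect type mirroring a D3D scissor rect.
struct Rect { int left, top, right, bottom; };

// Intersect a child window's rect with its parent's clip rect; the
// result is what would be handed to the scissor rect before drawing.
Rect ClipRect(const Rect& parent, const Rect& child) {
    return Rect{
        std::max(parent.left,   child.left),
        std::max(parent.top,    child.top),
        std::min(parent.right,  child.right),
        std::min(parent.bottom, child.bottom)
    };
}

// An empty clip rect means the child is fully outside its parent.
bool IsEmpty(const Rect& r) { return r.right <= r.left || r.bottom <= r.top; }
```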

<p>Rendering the windows’ content was obviously a big concern, and not an easy task when you want things to be fast, flexible and with transparency. The most important feature of this system is that you can decide, for EACH window (even child ones), whether you want it to be cached in a texture. This way, if the window’s content doesn’t change, the cached texture will be used instead of redrawing everything. When a window is cached in a texture, it will also render into this cache the content of its child windows, as long as the given child is not itself a cached window. You can imagine the benefits of having a hierarchical caching system.</p>
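<p>The hierarchical caching logic can be sketched like this (a toy model with hypothetical names; the real cache is a texture, reduced here to a redraw counter so the control flow stands on its own): a cached window redraws only when dirty, and a cached child manages its own texture instead of being drawn into its parent’s.</p>

```cpp
#include <memory>
#include <vector>

// Toy model of the hierarchical window cache (hypothetical names).
struct Wnd {
    bool cached  = false;  // keep this window's content in a texture?
    bool dirty   = true;   // must the cached texture be rebuilt?
    int  redraws = 0;      // how many times the content was actually drawn
    std::vector<std::unique_ptr<Wnd>> children;

    // Draw this window's content; cached children blit (or rebuild) their
    // own texture, uncached ones are drawn straight into ours.
    void DrawContent() {
        ++redraws;
        for (auto& c : children) {
            if (c->cached) c->Render();
            else           c->DrawContent();
        }
    }

    void Render() {
        if (!cached) { DrawContent(); return; }      // uncached: always redraw
        if (dirty)   { DrawContent(); dirty = false; }
        // otherwise the cached texture is reused as-is
    }
};
```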

<p>Redraw requests for transparent windows are kept minimal by computing the transparent regions across the whole hierarchy.</p>

<h4 id="draw-text-font-and-stuffs">Draw text, font and stuffs</h4>

<p>I had to extend the font system I had (which was fairly simple). More font styles are supported, and there’s now a font pool to avoid redundant creation of identical font pages. I also recorded properties such as font ascenders, descenders, spacing, etc., and added methods to compute the size taken by a given letter, word, line or text (bounded by a logical area).</p>

<p>Drawing text is now more complete: you can specify a bounding zone, an alignment, auto wrapping, and the automatic display of an ellipsis when the line is truncated.</p>
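<p>The ellipsis behaviour can be sketched as follows (hypothetical function, with a stand-in width callback in place of the real per-glyph metrics the font system records): reserve the width of the ellipsis, then keep as many characters as still fit.</p>

```cpp
#include <functional>
#include <string>

// Truncate a line to maxWidth, appending "..." when it does not fit.
// charWidth is a stand-in for the real per-glyph metrics.
std::string TruncateWithEllipsis(const std::string& line, int maxWidth,
                                 const std::function<int(char)>& charWidth) {
    const std::string ellipsis = "...";
    int ellipsisW = 0;
    for (char c : ellipsis) ellipsisW += charWidth(c);

    int total = 0;
    for (char c : line) total += charWidth(c);
    if (total <= maxWidth) return line;   // fits: no truncation needed

    // Keep as many characters as fit once the ellipsis width is reserved.
    std::string out;
    int used = 0;
    for (char c : line) {
        if (used + charWidth(c) + ellipsisW > maxWidth) break;
        out += c;
        used += charWidth(c);
    }
    return out + ellipsis;
}
```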

<h4 id="html-document-display">HTML Document display</h4>

<p>OK, I have to admit, it was not a necessary thing, but well, I thought it would be a good test for the GUI (and also a challenge for me). At first I only wanted to do a multi-line edit control, but then I wanted to display more complex text formatting (color, underline, bold, font change), so I looked at the Rich Text Format. When I realized it was messier than HTML, I chose HTML (also because it’s way more popular now).</p>

<p>I won’t explain the structure in detail, but I’m kind of proud of it. It’s very efficient and flexible (and later I’ll be able to upgrade it to edit HTML content). Of course I don’t support all the HTML tags (far from it), but the structure is open and it’s easy to add support for new ones.</p>

<h4 id="other-controls">Other controls</h4>
<p>So I have now the following high level classes:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">BaseWnd</code>: the abstract class that all other windows are derived from.</li>
  <li><code class="language-plaintext highlighter-rouge">Window</code>: a top level window.</li>
  <li><code class="language-plaintext highlighter-rouge">Control</code>: abstract class for control typed windows.</li>
  <li><code class="language-plaintext highlighter-rouge">Menu</code>: to display a menu.</li>
  <li><code class="language-plaintext highlighter-rouge">ObjIcon</code>: displays an icon of a given object; the icon is a 3D render of a mesh lit with a global light. The mesh is chosen based on the type of the object being viewed.</li>
  <li><code class="language-plaintext highlighter-rouge">ObjExplorer</code>: a little object browser to walk through a given object database (a 3D scene, the SM3 rendering architecture, the IML Framework, for instance).</li>
  <li><code class="language-plaintext highlighter-rouge">EditCtrl</code>: single/multi line display, HTML display, encoding from raw or C style text, raw text editing (stored in HTML).</li>
</ul>

<p>The ObjExplorer class is still not finished, I’m currently working on the generic Drag n Drop system (which will be heavily used).</p>

<h3 id="screenshots">Screenshots:</h3>

<p>This is how it looks when I start my Test3DE.exe now.</p>

<p><img src="/assets/fromcs/GUI_1.jpg" alt="" /></p>

<p>The Object explorer with a nice tool-tip that displays the content of a DirectX Texture.</p>

<p><img src="/assets/fromcs/GUI_2.jpg" alt="" /></p>

<p>The object explorer displays the content of a Resource Pack (the main one of the scene).</p>

<p><img src="/assets/fromcs/GUI_3.jpg" alt="" /></p>

<p>The tool-tip displays the content of a DirectX texture which is… the IML Console’s one.</p>

<p><img src="/assets/fromcs/GUI_4.jpg" alt="" /></p>

<p>Just to show what a menu looks like</p>

<p><img src="/assets/fromcs/GUI_5.jpg" alt="" /></p>]]></content><author><name>Loïc Baumann</name></author><category term="3D Programming" /><summary type="html"><![CDATA[After a short break and holidays, let's resume the work on the renderer]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://nockawa.github.io/assets/fromcs/GUI_2.jpg" /><media:content medium="image" url="https://nockawa.github.io/assets/fromcs/GUI_2.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">New rendering features !</title><link href="https://nockawa.github.io/new-rendering-features/" rel="alternate" type="text/html" title="New rendering features !" /><published>2004-08-24T19:25:00+00:00</published><updated>2025-07-18T01:00:00+00:00</updated><id>https://nockawa.github.io/new-rendering-features</id><content type="html" xml:base="https://nockawa.github.io/new-rendering-features/"><![CDATA[<h3 id="i-added-gamma-correction-bumpnormal-mapping-and-depth-of-field">I added Gamma Correction, bump/normal mapping, and Depth of Field.</h3>

<h3 id="i-also-fixed-few-bugs">I also fixed a few bugs.</h3>

<h3 id="screenshots-of-gamma-correction">ScreenShots of Gamma Correction</h3>

<p>It’s brighter where it should be, and still dark where it should be too.</p>

<p>The picture was taken from ATI’s sRGB sample.</p>

<p>No correction</p>

<p><img src="/assets/fromcs/PicNoGammaCorrected.jpg" alt="" /></p>

<p>Gamma corrected</p>

<p><img src="/assets/fromcs/PicGammaCorrected.jpg" alt="" /></p>

<h3 id="screenshots-of-normal-mapping">ScreenShots of Normal Mapping</h3>

<p>The left sphere is the high poly one (40K faces). The right is the low poly version (960 faces) with the normal map applied. <br />
The normal map was created with our 3D Studio Max Bump-o-matic plugin.</p>

<p><img src="/assets/fromcs/NormalMap_Solid.jpg" alt="" /></p>

<p>Wire version of the first screenshot.
<img src="/assets/fromcs/NormalMap_Wire.jpg" alt="" /></p>

<p>Rendering of the normals.
<img src="/assets/fromcs/NormalMap_Normals.jpg" alt="" /></p>

<h3 id="screenshots-of-depth-of-field">ScreenShots of Depth of Field</h3>

<p>The white AABBs symbolize the Plane in Focus. Check their intersection with the scene to get a better idea of their position.</p>

<p><img src="/assets/fromcs/DepthOfField_1.jpg" alt="" /></p>

<p><img src="/assets/fromcs/DepthOfField_2.jpg" alt="" /></p>

<p><img src="/assets/fromcs/DepthOfField_3.jpg" alt="" /></p>

<h3 id="more-about-depth-of-field">More about depth-of-field:</h3>

<p>I read many things about Depth of Field, the article in GPU Gems for instance, and saw many formulas without really knowing how to practically implement them.</p>

<p>So I came up with an in-house one, really simple:<br />
 <strong>Df</strong> = <strong>DP</strong> * abs(<strong>PosZ</strong> – <strong>PiF</strong>) / <strong>PosZ</strong>.<br />
 <strong>DP</strong> is the Depth of Field Power. 0 to disable it, 1 for standard result, &gt;1 to get something really blurry.<br />
 <strong>PosZ</strong> is the position in camera space of the pixel to compute.<br />
 <strong>PiF</strong> is the Plane in Focus position in camera space.<br />
 <strong>Df</strong> is the result, I clamp it to <code class="language-plaintext highlighter-rouge">[0,1]</code> and use it in the lerp from the accumulation buffer and the blurred one during the ToneMapping.</p>]]></content><author><name>Loïc Baumann</name></author><category term="3D Programming" /><summary type="html"><![CDATA[Adding normal mapping, tone mapping to the renderer.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://nockawa.github.io/assets/fromcs/NormalMap_Solid.jpg" /><media:content medium="image" url="https://nockawa.github.io/assets/fromcs/NormalMap_Solid.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Parallax mapping, more ambient occlusion and stuffs</title><link href="https://nockawa.github.io/parallax-mapping-more-ambient-occlusion-n-stuffs/" rel="alternate" type="text/html" title="Parallax mapping, more ambient occlusion and stuffs" /><published>2004-08-16T19:26:00+00:00</published><updated>2025-07-18T01:00:00+00:00</updated><id>https://nockawa.github.io/parallax-mapping-more-ambient-occlusion-n-stuffs</id><content type="html" xml:base="https://nockawa.github.io/parallax-mapping-more-ambient-occlusion-n-stuffs/"><![CDATA[<h3 id="parallax-mapping-is-finished">Parallax mapping is finished</h3>

<p>The whole production pipeline is now ready for that technique. The 3D Studio MAX plugin now computes the correct scale/bias and can also display the result in a custom view.</p>

<h3 id="screenshots">Screenshots</h3>

<p><img src="/assets/fromcs/sc_para_nobump.jpg" alt="" /></p>

<p>As you can see, the specular highlight is not ‘real’ for that kind of material (supposed to be rocks…)</p>

<p><img src="/assets/fromcs/sc_para_with.jpg" alt="" /></p>

<p>Wireframe mode!</p>

<p><img src="/assets/fromcs/sc_para_wire.jpg" alt="" /></p>

<h3 id="i-added-a-new-parameter-in-the-ambient-occlusion-map-creation">I added a new parameter in the Ambient Occlusion Map creation</h3>

<p>which is the length of the rays used to perform the occlusion test. This way the occlusion map builder can now produce maps for indoor meshes.</p>

<h3 id="screenshots-1">Screenshots</h3>

<p>Ambient occlusion off</p>

<p><img src="/assets/fromcs/sc_ambocc_off.jpg" alt="" /></p>

<p>Ambient occlusion on</p>

<p><img src="/assets/fromcs/sc_ambocc_on.jpg" alt="" /></p>

<p>Ambient occlusion off</p>

<p><img src="/assets/fromcs/sc_ambocc2_off.jpg" alt="" /></p>

<p>Ambient occlusion on
<img src="/assets/fromcs/sc_ambocc2_on.jpg" alt="" /></p>

<p>Ambient occlusion map
<img src="/assets/fromcs/GrRoomAOM.jpg" alt="" /></p>

<p>3DS Max UVW unwrap modifier
<img src="/assets/fromcs/sc_ambocc_max_uvw.jpg" alt="" /></p>

<p>The original mesh of the room wasn’t mapped, so I used the flatten mapping of the UVW Unwrap modifier of 3DS MAX to generate mapping coordinates, then used the Bump-o-matic plugin to generate the Ambient Occlusion Map.</p>

<p><img src="/assets/fromcs/sc_ambocc_bumpo.jpg" alt="" /></p>

<p>The result speaks for itself.</p>

<h3 id="light-volume-rendering">Light volume rendering</h3>

<p>Before, each light was lighting every pixel of the viewport, which was quite slow and wasteful. Now, for point and spot lights, their bounding volume is rendered to perform the lighting; as you can guess, this is much faster for lights covering a small area.</p>
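<p>One way to size such a bounding volume (a sketch under an assumed inverse-square falloff, not necessarily the attenuation model used here): pick an intensity threshold below which the light’s contribution is invisible, and solve for the distance where the falloff reaches it; the resulting sphere bounds every pixel worth lighting.</p>

```cpp
#include <cmath>

// Assuming a 1/d^2 falloff, solve intensity / d^2 = threshold for d:
// a sphere of that radius bounds every pixel the light visibly touches,
// so only its projection needs to run the lighting pass.
float LightVolumeRadius(float intensity, float threshold) {
    return std::sqrt(intensity / threshold);
}
```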

<h4 id="screenshots-2">Screenshots</h4>

<p>Without</p>

<p><img src="/assets/fromcs/Indoor_17.jpg" alt="" /></p>

<p>With</p>

<p><img src="/assets/fromcs/Indoor_17V.jpg" alt="" /></p>

<p>Without</p>

<p><img src="/assets/fromcs/sc_lightvol_off.jpg" alt="" /></p>

<p>With</p>

<p><img src="/assets/fromcs/sc_lightvol_on.jpg" alt="" /></p>

<h3 id="i-added-an-iml-console-right-in-the-viewport">I added an IML Console right in the viewport</h3>

<p>Having more and more rendering parameters I’d like to tweak in real time, I’ve decided to take advantage of the whole IML architecture to interact with the renderer (and the 3D scene) at run-time.</p>

<h4 id="screenshots-3">Screenshots</h4>

<p><img src="/assets/fromcs/sc_IMLConsole.jpg" alt="" /></p>

<p><strong>More about Ambient Occlusion builder:</strong></p>

<p>For each pixel of the map being created, its position on the mesh is located, and a series of rays is cast to perform occlusion tests (intersections) with other parts of the mesh itself.</p>

<p>The problem with indoor environments is that an intersection is always found (because the mesh is closed), making it impossible to produce an accurate map.</p>

<p>By letting the artist set a length for the rays that are cast, the occlusion test can be performed over a limited area, producing the expected result.</p>
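<p>The per-texel test can be sketched like this (hypothetical names, with the mesh intersection abstracted behind a callback): cast the sample rays from the texel’s position and count only the hits closer than the artist-set maximum ray length.</p>

```cpp
#include <functional>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical intersection callback: returns the distance to the first
// hit along the ray, or a negative value when nothing is hit.
using RayHitFn = std::function<float(const Vec3& origin, const Vec3& dir)>;

// Occlusion for one texel: count only the hits closer than maxRayLength,
// the artist-set length that makes closed (indoor) meshes workable.
// Returns 0 (fully open) .. 1 (fully occluded).
float TexelOcclusion(const Vec3& origin, const std::vector<Vec3>& rays,
                     float maxRayLength, const RayHitFn& intersect) {
    int occluded = 0;
    for (const Vec3& dir : rays) {
        float t = intersect(origin, dir);
        if (t >= 0.0f && t <= maxRayLength)
            ++occluded;
    }
    return rays.empty() ? 0.0f : float(occluded) / float(rays.size());
}
```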

<p><strong>More about IML:</strong></p>

<p>IML stands for <em>Irion Micro Language</em>; it’s a run-time wrapper for the C++ components. For each Irion component you develop, you can create an IML Class that exposes the component to the IML Framework.</p>

<p>Using IML via an IML Console, you can create new components or edit/delete existing ones. For instance, I developed an IML Class to wrap the SM3Viewport C++ class and exposed a set of properties (rendering modes, rendering attributes, stats display, etc.) that can later be modified via an IML Console or script.</p>]]></content><author><name>Loïc Baumann</name></author><category term="3D Programming" /><summary type="html"><![CDATA[Parallax mapping is done, improving occlusion mapping]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://nockawa.github.io/assets/fromcs/sc_para_with.jpg" /><media:content medium="image" url="https://nockawa.github.io/assets/fromcs/sc_para_with.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>