<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged optimization at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/optimization/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/optimization/feed/"/>
  <updated>2026-04-09T13:25:45Z</updated>
  <id>urn:uuid:6022e81a-277c-4339-b204-59145c368baf</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>An easy-to-implement, arena-friendly hash map</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/09/30/"/>
    <id>urn:uuid:4a457832-7d23-4dab-80f2-31f683379d7b</id>
    <updated>2023-09-30T23:18:40Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>My last article had <a href="/blog/2023/09/27/">tips for arena allocation</a>. This next
article demonstrates a technique for building bespoke hash maps that
compose nicely with arena allocation. In addition, they’re fast, simple,
and automatically scale to any problem that could reasonably be solved
with an in-memory hash map. To avoid resizing — both to better support
arenas and to simplify implementation — they have slightly above average
memory requirements. The design, which we’re calling a <em>hash-trie</em>, is the
result of <a href="https://nrk.neocities.org/articles/hash-trees-and-tries">fruitful collaboration with NRK</a>, whose sibling article
includes benchmarks. It’s my new favorite data structure, and has proven
incredibly useful. With a couple well-placed acquire/release atomics, we
can even turn it into a <em>lock-free concurrent hash map</em>.</p>

<p>I’ve written before about <a href="/blog/2022/08/08/">MSI hash tables</a>, a simple, <em>very</em> fast
map that can be quickly implemented from scratch as needed, tailored to
the problem at hand. The trade-off is that one must know the upper bound
<em>a priori</em> in order to size the base array. Scaling up requires resizing
the array — an impedance mismatch with arena allocation. Search trees
scale better, as there’s no underlying array, but tree balancing tends to
be finicky and complex, unsuitable for rapid, on-demand implementation.
<strong>We want the ease of an MSI hash table with the scaling of a tree.</strong></p>

<p>I’ll motivate the discussion with example usage. Suppose we have an array
of pointer+length strings, as defined last time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">uint8_t</span>  <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">str</span><span class="p">;</span>
</code></pre></div></div>

<p>And we need a function that removes duplicates in place, but (for the
moment) we’re not worried about preserving order. This could be done
naively in quadratic time. Smarter is to sort, then look for runs.
Instead, I’ve used a hash map to track seen strings. It maps <code class="language-plaintext highlighter-rouge">str</code> to
<code class="language-plaintext highlighter-rouge">bool</code>, and it is represented as type <code class="language-plaintext highlighter-rouge">strmap</code> and one insert+lookup
function, <code class="language-plaintext highlighter-rouge">upsert</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Insert/get bool value for given str key.</span>
<span class="n">bool</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">strmap</span> <span class="o">**</span><span class="p">,</span> <span class="n">str</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">ptrdiff_t</span> <span class="nf">unique</span><span class="p">(</span><span class="n">str</span> <span class="o">*</span><span class="n">strings</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">strmap</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">count</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">bool</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">seen</span><span class="p">,</span> <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">b</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// previously seen (discard)</span>
            <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">strings</span><span class="p">[</span><span class="o">--</span><span class="n">len</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="c1">// newly-seen (keep)</span>
            <span class="n">count</span><span class="o">++</span><span class="p">;</span>
            <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In particular, note:</p>

<ul>
  <li>
    <p>A null pointer is an empty hash map and initialization is trivial. As
discussed in the last article, one of my arena allocation principles is
default zero-initialization. Put together, that means any data structure
containing a map comes with a ready-to-use, empty map.</p>
  </li>
  <li>
    <p>The map is allocated out of the scratch arena so it’s automatically
freed upon any return. It’s as care-free as garbage collection.</p>
  </li>
  <li>
    <p>The map directly uses strings in the input array as keys, without making
copies or worrying about ownership. Arenas own objects, not references.
If I wanted to carve out some fixed keys ahead of time, I could even
insert static strings.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">upsert</code> returns a pointer to a value. That is, a pointer into the map.
This is not strictly required, but usually makes for a simple interface.
When an entry is new, this value will be false (zero-initialized).</p>
  </li>
</ul>

<p>So, what is this wonderful data structure? Here’s the basic shape:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">hashmap</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">keytype</span>  <span class="n">key</span><span class="p">;</span>
    <span class="n">valtype</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">}</span> <span class="n">hashmap</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">child</code> and <code class="language-plaintext highlighter-rouge">key</code> fields are essential to the map. Adding a <code class="language-plaintext highlighter-rouge">child</code>
to any data structure turns it into a hash map over whatever field you
choose as the key. In other words, a hash-trie can serve as an <em>intrusive
hash map</em>. In several programs I’ve combined intrusive lists and hash maps
to create an insert-ordered hash map. Going the other direction, omitting
<code class="language-plaintext highlighter-rouge">value</code> turns it into a hash set. (Which is what <code class="language-plaintext highlighter-rouge">unique</code> <em>really</em> needs!)</p>

<p>As you probably guessed, this hash-trie is a 4-ary tree. It can easily be
2-ary (leaner but slower) or 8-ary (bigger and usually no faster), but
4-ary strikes a good balance, if a bit bulky. In the example above,
<code class="language-plaintext highlighter-rouge">keytype</code> would be <code class="language-plaintext highlighter-rouge">str</code> and <code class="language-plaintext highlighter-rouge">valtype</code> would be <code class="language-plaintext highlighter-rouge">bool</code>. The most general
form of <code class="language-plaintext highlighter-rouge">upsert</code> looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">valtype</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">hashmap</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">keytype</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">m</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">perm</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">hashmap</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This will take some unpacking. The first argument is a pointer to a
pointer. That’s the destination for any newly-allocated element. As it
travels down the tree, this points into the parent’s <code class="language-plaintext highlighter-rouge">child</code> array. If
it points to null, then it’s an empty tree which, by definition, does not
contain the key.</p>

<p>We need two “methods” for keys: <code class="language-plaintext highlighter-rouge">hash</code> and <code class="language-plaintext highlighter-rouge">equals</code>. The hash function
should return a uniformly distributed integer. As is usually the case,
less uniform fast hashes generally do better than highly-uniform slow
hashes. For hash maps under ~100K elements a 32-bit hash is fine, but
larger maps should use a 64-bit hash state and result. Hash collisions
revert to linear, linked list performance and, per the birthday paradox,
that will happen often with 32-bit hashes on large hash maps.</p>

<p>If you’re worried about pathological inputs, add a seed parameter to
<code class="language-plaintext highlighter-rouge">upsert</code> and <code class="language-plaintext highlighter-rouge">hash</code>. Or maybe even use the address <code class="language-plaintext highlighter-rouge">m</code> as a seed. The
specifics depend on your security model. It’s not an issue for most hash
maps, so I don’t demonstrate it here.</p>
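As a minimal sketch of the seeded variant, using the FNV-style hash that appears later in this article, one option is folding the seed into the initial state. How the seed is mixed in is my illustration, not a fixed recipe:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// Same FNV-style hash, but the seed perturbs the initial state so an
// attacker cannot predict branch choices without knowing the seed.
uint64_t hash_seeded(str s, uint64_t seed)
{
    uint64_t h = 0x100 ^ seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}
```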

<p>The top two bits of the hash are used to select a branch. These tend to be
higher quality for <a href="/blog/2018/07/31/">multiplicative hash functions</a>. At each level
two bits are shifted out. This is what gives it its name: a <em>trie</em> of the
<em>hash bits</em>. Though it’s un-trie-like in the way it deposits elements at
the first empty spot. To make it 2-ary or 8-ary, use 1 or 3 bits at a
time.</p>

<p>I initially tried a <a href="/blog/2019/11/19/">Multiplicative Congruential Generator</a> (MCG) to
select the next branch at each trie level, instead of bit shifting, but
NRK noticed it was consistently slower than shifting.</p>

<p>While “delete” could be handled with tombstones, workloads with many
deletes would not work well. After all, the underlying allocator is an
arena. A combination
of uniformly distributed branching and no deletion means that rebalancing
is unnecessary. This is what grants it its simplicity!</p>

<p>If no arena is provided, it reverts to a lookup and returns null when the
key is not found. This lets one function flexibly serve both modes. In
<code class="language-plaintext highlighter-rouge">unique</code>, pure lookups are unneeded, so this condition could be skipped in
its <code class="language-plaintext highlighter-rouge">strmap</code>.</p>

<p>Sometimes it’s useful to return the entire <code class="language-plaintext highlighter-rouge">hashmap</code> object itself rather
than an internal pointer, particularly when it’s intrusive. Use whichever
works best for the situation. Regardless, exploit zero-initialization to
detect newly-allocated elements when possible.</p>

<p>In some cases we may deep copy the key in its arena before inserting it
into the map. The provided key may be a temporary (e.g. <code class="language-plaintext highlighter-rouge">sprintf</code>) which
the map outlives, and the caller doesn’t want to allocate a longer-lived
key unless it’s needed. It’s all part of tailoring the map to the problem,
which we can do because it’s so short and simple!</p>
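Concretely, the copy could happen just before the key is stored, via a hypothetical arena helper like the one below. It assumes a bump arena of the beg/end form from the previous article; the helper name is mine:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// Minimal bump arena for illustration.
typedef struct {
    char *beg;
    char *end;
} arena;

// Deep-copy a possibly-temporary key into the arena so the map can
// outlive it. A real version would check capacity; byte data needs no
// alignment, so a plain bump suffices.
str clonestr(str s, arena *perm)
{
    str copy = {(uint8_t *)perm->beg, s.len};
    memcpy(copy.data, s.data, (size_t)s.len);
    perm->beg += s.len;
    return copy;
}
```

The insertion path in upsert would then store clonestr(key, perm) rather than the caller's pointer, and only on the branch that creates a new entry.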

<h3 id="fleshing-it-out">Fleshing it out</h3>

<p>Putting it all together, <code class="language-plaintext highlighter-rouge">unique</code> could look like the following, with
<code class="language-plaintext highlighter-rouge">strmap</code>/<code class="language-plaintext highlighter-rouge">upsert</code> renamed to <code class="language-plaintext highlighter-rouge">strset</code>/<code class="language-plaintext highlighter-rouge">ismember</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash</span><span class="p">(</span><span class="n">str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mh">0x100</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span> <span class="o">^=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="n">h</span> <span class="o">*=</span> <span class="mi">1111111111111111111u</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">bool</span> <span class="nf">equals</span><span class="p">(</span><span class="n">str</span> <span class="n">a</span><span class="p">,</span> <span class="n">str</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="o">==</span><span class="n">b</span><span class="p">.</span><span class="n">len</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">memcmp</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">strset</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">str</span>     <span class="n">key</span><span class="p">;</span>
<span class="p">}</span> <span class="n">strset</span><span class="p">;</span>

<span class="n">bool</span> <span class="nf">ismember</span><span class="p">(</span><span class="n">strset</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">str</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">m</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">strset</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">ptrdiff_t</span> <span class="nf">unique</span><span class="p">(</span><span class="n">str</span> <span class="o">*</span><span class="n">strings</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">strset</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">count</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">ismember</span><span class="p">(</span><span class="o">&amp;</span><span class="n">seen</span><span class="p">,</span> <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">strings</span><span class="p">[</span><span class="o">--</span><span class="n">len</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">count</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The FNV hash multiplier is 19 ones, my favorite prime. I don’t bother with
an xorshift finalizer because the bits are used most-significant first.
Exercise for the reader: Support retaining the original input order using
an intrusive linked list on <code class="language-plaintext highlighter-rouge">strset</code>.</p>

<h3 id="relative-pointers">Relative pointers?</h3>

<p>As mentioned, four pointers per entry — 32 bytes on 64-bit hosts — makes
these hash-tries a bit heavier than average. It’s not an issue for smaller
hash maps, but has practical consequences for huge hash maps.</p>

<p>In an attempt to address this, I experimented with <a href="https://www.youtube.com/watch?v=Z0tsNFZLxSU">relative pointers</a>
(example: <a href="https://github.com/skeeto/scratch/blob/master/misc/markov.c"><code class="language-plaintext highlighter-rouge">markov.c</code></a>). That is, instead of pointers I use signed
integers whose value indicates an offset <em>relative to itself</em>. Because
relative pointers can only refer to nearby memory, a custom allocator is
imperative, and arenas fit the bill perfectly. Range can be extended by
exploiting memory alignment. In particular, 32-bit relative pointers can
reference up to 8GiB in either direction. Zero is reserved to represent a
null pointer, and relative pointers cannot refer to themselves.</p>
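The core mechanism can be sketched in a few lines. The helper names are my own, and this plain int32_t offset reaches only ±2GiB; the full 8GiB range requires the alignment scaling described above:

```c
#include <stddef.h>
#include <stdint.h>

// Self-relative 32-bit pointer: zero encodes null, any other value is
// a byte offset from the field's own address. A field can never point
// at itself, since that offset would be the null encoding.
typedef int32_t relptr;

void relptr_set(relptr *field, void *target)
{
    *field = target ? (int32_t)((char *)target - (char *)field) : 0;
}

void *relptr_get(relptr *field)
{
    return *field ? (char *)field + *field : NULL;
}
```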

<p>As a bonus, data structures built out of relative pointers are <em>position
independent</em>. A collection of them — perhaps even a whole arena — can be
dumped out to, say, a file, loaded back at a different position, then
continue to operate as-is. Very cool stuff.</p>

<p>Using 32-bit relative pointers on 64-bit hosts cuts the hash-trie overhead
in half, to 16 bytes. With an arena no larger than 8GiB, such pointers are
guaranteed to work. No object is ever too far away. It’s a compounding
effect, too. Smaller map nodes means a larger number of them are in reach
of a relative pointer. Also very cool.</p>

<p>However, as far as I know, no generally available programming language
implementation supports this concept well enough to put into practice. You
could implement relative pointers with language extension facilities, such
as C++ operator overloads, but <em>no tools will understand them</em> — a major
bummer. You can no longer use a debugger to examine such structures, and
it’s just not worth that cost. If only arena allocation were more popular…</p>

<h3 id="as-a-concurrent-hash-map">As a concurrent hash map</h3>

<p>For the finale, let’s convert <code class="language-plaintext highlighter-rouge">upsert</code> into a concurrent, lock-free hash
map. That is, multiple threads can call upsert concurrently on the same
map. Each thread must still have its own arena, probably a per-thread
arena, so allocation requires no implicit locking.</p>

<p>The structure itself requires no changes! Instead we need two atomic
operations: atomic load (acquire), and atomic compare-and-exchange
(acquire/release). They operate only on <code class="language-plaintext highlighter-rouge">child</code> array elements and the
tree root. To illustrate I will use <a href="https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html">GCC atomics</a>, also supported by
Clang.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">valtype</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">map</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">keytype</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">map</span> <span class="o">*</span><span class="n">n</span> <span class="o">=</span> <span class="n">__atomic_load_n</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">__ATOMIC_ACQUIRE</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">n</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">perm</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="n">arena</span> <span class="n">rollback</span> <span class="o">=</span> <span class="o">*</span><span class="n">perm</span><span class="p">;</span>
            <span class="n">map</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">map</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
            <span class="n">new</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="kt">int</span> <span class="n">pass</span> <span class="o">=</span> <span class="n">__ATOMIC_RELEASE</span><span class="p">;</span>
            <span class="kt">int</span> <span class="n">fail</span> <span class="o">=</span> <span class="n">__ATOMIC_ACQUIRE</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">__atomic_compare_exchange_n</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">n</span><span class="p">,</span> <span class="n">new</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">pass</span><span class="p">,</span> <span class="n">fail</span><span class="p">))</span> <span class="p">{</span>
                <span class="k">return</span> <span class="o">&amp;</span><span class="n">new</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="o">*</span><span class="n">perm</span> <span class="o">=</span> <span class="n">rollback</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">n</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">n</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="n">n</span><span class="o">-&gt;</span><span class="n">child</span> <span class="o">+</span> <span class="p">(</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First an atomic load retrieves the current node. If there is no such node,
then attempt to insert one using atomic compare-and-exchange. The <a href="/blog/2014/09/02/">ABA
problem</a> is not an issue thanks again to lack of deletion: Once set,
a pointer never changes. Before allocating a node, take a snapshot of the
arena so that the allocation can be reverted on failure. If another thread
got there first, continue tumbling down the tree <em>as though a null was
never observed</em>.</p>

<p>On compare-and-swap failure, it turns into an acquire load, just as it
began. On success, it’s a release store, synchronizing with acquire loads
on other threads.</p>

<p>The <code class="language-plaintext highlighter-rouge">key</code> field does not require atomics because it’s synchronized by the
compare-and-swap. That is, the assignment will happen before the node is
inserted, and keys do not change after insertion. The same goes for any
zeroing done by the arena.</p>

<p><strong>Loads and stores through the returned pointer are the caller’s
responsibility.</strong> These likely require further synchronization. If
<code class="language-plaintext highlighter-rouge">valtype</code> is a shared counter then an atomic increment is sufficient. In
other cases, <code class="language-plaintext highlighter-rouge">upsert</code> should probably be modified to accept an initial
value to be assigned alongside the key so that the entire key/value pair is
inserted atomically. Alternatively, <a href="/blog/2022/05/14/">break it into two steps</a>. The
details depend on the needs of the program.</p>

<p>On small trees there will be much contention near the root during
inserts. Fortunately, a contentious tree will not stay small for long! The
hash function will spread threads around a large tree, generally keeping
them off each other’s toes.</p>

<p>A complete demo you can try yourself: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/concurrent-hash-trie.c"><code class="language-plaintext highlighter-rouge">concurrent-hash-trie.c</code></a></strong>.
It returns a value pointer like above, and store/load is synchronized by
the thread join. Each thread is given a per-thread subarena allocated out
of the main arena, and the final tree is built from these subarenas.</p>

<p>For a practical example: a <a href="https://github.com/skeeto/scratch/blob/master/misc/rainbow.c"><strong>multithreaded rainbow table</strong></a> to find
hash function collisions. Threads are synchronized solely through atomics
in the shared hash-trie.</p>

<p>A complete, fast, concurrent, lock-free hash map in under 30 lines of C
sounds like a sweet deal to me!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Solving "Two Sum" in C with a tiny hash table</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/06/26/"/>
    <id>urn:uuid:5d15318f-6915-4f72-8690-74a84d43d2f7</id>
    <updated>2023-06-26T19:38:18Z</updated>
    <category term="c"/><category term="go"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I came across a question: How does one efficiently solve <a href="https://leetcode.com/problems/two-sum/">Two Sum</a> in C?
There’s a naive quadratic time solution, but also an amortized linear time
solution using a hash table. Without a built-in or standard library hash
table, the latter sounds onerous. However, a <a href="/blog/2022/08/08/">mask-step-index table</a>,
a hash table construction suitable for many problems, requires only a few
lines of code. This approach is useful even when a standard hash table is
available, because by <a href="https://vimeo.com/644068002">exploiting the known problem constraints</a>, it
beats typical generic hash table performance by an order of magnitude
(<a href="https://gist.github.com/skeeto/7119cf683662deae717c0d4e79ebf605">demo</a>).</p>

<p>The Two Sum exercise, restated:</p>

<blockquote>
  <p>Given an integer array and target, return the distinct indices of two
elements that sum to the target.</p>
</blockquote>

<p>In particular, the solution doesn’t find elements, but their indices. The
exercise also constrains input ranges — important but easy to overlook:</p>

<ul>
  <li>2 &lt;= <code class="language-plaintext highlighter-rouge">count</code> &lt;= 10<sup>4</sup></li>
  <li>-10<sup>9</sup> &lt;= <code class="language-plaintext highlighter-rouge">nums[i]</code> &lt;= 10<sup>9</sup></li>
  <li>-10<sup>9</sup> &lt;= <code class="language-plaintext highlighter-rouge">target</code> &lt;= 10<sup>9</sup></li>
</ul>

<p>Notably, indices fit in a 16-bit integer with lots of room to spare. In
fact, they fit in a 14-bit address space (16,384) with still plenty of
headroom. Elements fit in a signed 32-bit integer, and we can add and
subtract elements without overflow, if just barely. The last constraint
isn’t redundant, but it’s not readily exploitable either.</p>

<p>The naive solution is to linearly search the array for the complement.
With nested loops, it’s obviously quadratic time. At 10k elements, we
expect an abysmal 25M comparisons on average.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int16_t</span> <span class="n">count</span> <span class="o">=</span> <span class="p">...;</span>
<span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span> <span class="o">=</span> <span class="p">...;</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">target</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// found</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">nums</code> array is “keyed” by index. It would be better to also have the
inverse mapping: key on elements to obtain the <code class="language-plaintext highlighter-rouge">nums</code> index. Then for each
element we could compute the complement and find its index, if any, using
this second mapping.</p>

<p>The input range is finite, so an inverse map is simple. Allocate an array,
one element per integer in range, and store the index there. However, the
input range is 2 billion, and even with 16-bit indices that’s a 4GB array.
Feasible on 64-bit hosts, but wasteful. The exercise is certainly designed
to make it so. This array would be very sparse, with less than half a
percent of its elements populated. That’s a hint: Associative arrays are
far more appropriate for representing such sparse mappings. That is, a
hash table.</p>

<p>Using Go’s built-in hash table:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TwoSumWithMap</span><span class="p">(</span><span class="n">nums</span> <span class="p">[]</span><span class="kt">int32</span><span class="p">,</span> <span class="n">target</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">seen</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="kt">int32</span><span class="p">]</span><span class="kt">int16</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">num</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nums</span> <span class="p">{</span>
        <span class="n">complement</span> <span class="o">:=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">num</span>
        <span class="k">if</span> <span class="n">j</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">seen</span><span class="p">[</span><span class="n">complement</span><span class="p">];</span> <span class="n">ok</span> <span class="p">{</span>
            <span class="k">return</span> <span class="kt">int</span><span class="p">(</span><span class="n">j</span><span class="p">),</span> <span class="n">i</span><span class="p">,</span> <span class="no">true</span>
        <span class="p">}</span>
        <span class="n">seen</span><span class="p">[</span><span class="n">num</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int16</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In essence, the hash table folds the sparse 2 billion element array onto a
smaller array, with collision resolution when elements inevitably land in
the same slot. For this exercise, that small array could be as small as
10,000 elements because that’s the most we’d ever need to track. For
folding the large key space onto the smaller, we could use modulo. For
collision resolution, we could keep walking the table.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">10000</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>

<span class="c1">// Find or insert nums[index].</span>
<span class="kt">int16_t</span> <span class="nf">lookup</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">index</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">nums</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">%</span> <span class="mi">10000</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// empty slot</span>
            <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// insert biased index</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">nums</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">j</span><span class="p">;</span>  <span class="c1">// match found</span>
        <span class="p">}</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10000</span><span class="p">;</span>  <span class="c1">// keep looking</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Take note of a few details:</p>

<ol>
  <li>
    <p>An empty slot is zero, and an empty table is a zero-initialized array.
Since zero is a valid index, and all indices are non-negative, indices are
stored biased by 1 in the table.</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">nums</code> array is part of the table structure, necessary for lookups.
<strong>The two mappings — element-by-index and index-by-element — share
structure.</strong></p>
  </li>
  <li>
    <p>It uses <em>open addressing</em> with <em>linear probing</em>, and so walks the table
until it either finds the element or hits an empty slot.</p>
  </li>
  <li>
    <p>The “hash” function is modulo. If inputs are not random, they’ll tend
to bunch up in the table. Combined with linear probing, that makes for lots
of collisions. For the worst case, imagine sequentially ordered inputs.</p>
  </li>
  <li>
    <p>Sometimes the table will almost completely fill, and lookups will be no
better than the linear scans of the naive solution.</p>
  </li>
  <li>
    <p>Most subtle of all: This hash table is not enough for the exercise. The
keyed-on element may not even be in <code class="language-plaintext highlighter-rouge">nums</code>, and when lookup fails, that
element is not inserted in the table. Instead, a different element is
inserted. The conventional solution has at least two hash table
lookups. <strong>In the Go code, it’s <code class="language-plaintext highlighter-rouge">seen[complement]</code> for lookups and
<code class="language-plaintext highlighter-rouge">seen[num]</code> for inserts.</strong></p>
  </li>
</ol>

<p>To solve (4) we’ll use a hash function to more uniformly distribute
elements in the table. We’ll also probe the table in a random-ish order
that depends on the key. In practice there will be little bunching even
for non-random inputs.</p>

<p>To solve (5) we’ll use a larger table: 2<sup>14</sup> or 16,384 elements.
This has breathing room, and with a power of two we can use a fast mask
instead of a slow division (though in practice, compilers usually
replace division by a constant with a multiply-and-shift sequence anyway).</p>

<p>To solve (6) we’ll map each element and its complement to the same key. It looks
for the complement, but on failure it inserts the current element in the
empty slot. In other words, <strong>this solution will only need a single hash
table lookup per element!</strong></p>

<p>Laying down some groundwork:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int16_t</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">ok</span><span class="p">;</span>
<span class="p">}</span> <span class="n">TwoSum</span><span class="p">;</span>

<span class="n">TwoSum</span> <span class="nf">twosum</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TwoSum</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">seen</code> array is a 32KiB hash table large enough for all inputs, small
enough that it can be a local variable. In the loop:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="kt">int32_t</span> <span class="n">complement</span> <span class="o">=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">key</span> <span class="o">=</span> <span class="n">complement</span><span class="o">&gt;</span><span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">?</span> <span class="n">complement</span> <span class="o">:</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">key</span> <span class="o">*</span> <span class="mi">489183053u</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">mask</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">13</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
</code></pre></div></div>

<p>Compute the complement, then apply a “max” operation to derive a key. Any
commutative operation works, though obviously addition would be a poor
choice. XOR is similar enough to cause many collisions. Multiplication
works well, and is probably better if the ternary produces a branch.</p>

<p>The hash function is multiplication with <a href="/blog/2019/11/19/">a randomly-chosen prime</a>.
As we’ll see in a moment, <code class="language-plaintext highlighter-rouge">step</code> will also add-shift the hash before use.
The initial index will be the bottom 14 bits of this hash. For <code class="language-plaintext highlighter-rouge">step</code>,
recall from the MSI article that it must be odd so that every slot is
eventually probed. I shift out 13 bits and then override the 14th bit, so
<code class="language-plaintext highlighter-rouge">step</code> effectively skips over the 14 bits used for the initial table
index.</p>

<p>I used <code class="language-plaintext highlighter-rouge">unsigned</code> because I don’t really care about the width of the hash
table index, but more importantly, I want defined overflow from all the
bit twiddling, even in the face of implicit promotion. As a bonus, it can
help in reasoning about indirection: <code class="language-plaintext highlighter-rouge">seen</code> indices are <code class="language-plaintext highlighter-rouge">unsigned</code>, <code class="language-plaintext highlighter-rouge">nums</code>
indices are <code class="language-plaintext highlighter-rouge">int16_t</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// bias and insert</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">j</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
</code></pre></div></div>

<p>The step is added before using the index the first time, helping to
scatter the start point and reduce collisions. If it’s an empty slot,
insert the <em>current</em> element, not the complement — which wouldn’t be
possible anyway. Unlike conventional solutions, this doesn’t require
another hash and lookup. If it finds the complement, problem solved,
otherwise keep going.</p>

<p>Putting it all together, it’s only slightly longer than solutions using a
generic hash table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TwoSum</span> <span class="nf">twosum</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TwoSum</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">complement</span> <span class="o">=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">key</span> <span class="o">=</span> <span class="n">complement</span><span class="o">&gt;</span><span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">?</span> <span class="n">complement</span> <span class="o">:</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">key</span> <span class="o">*</span> <span class="mi">489183053u</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">mask</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">13</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// bias and insert</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">j</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Applying this technique to Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TwoSumWithBespoke</span><span class="p">(</span><span class="n">nums</span> <span class="p">[]</span><span class="kt">int32</span><span class="p">,</span> <span class="n">target</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">var</span> <span class="n">seen</span> <span class="p">[</span><span class="m">1</span> <span class="o">&lt;&lt;</span> <span class="m">14</span><span class="p">]</span><span class="kt">int16</span>
    <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">num</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nums</span> <span class="p">{</span>
        <span class="n">complement</span> <span class="o">:=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">num</span>
        <span class="n">hash</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="n">num</span> <span class="o">*</span> <span class="n">complement</span> <span class="o">*</span> <span class="m">489183053</span><span class="p">)</span>
        <span class="n">mask</span> <span class="o">:=</span> <span class="nb">len</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span>
        <span class="n">step</span> <span class="o">:=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="m">13</span> <span class="o">|</span> <span class="m">1</span>
        <span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="n">hash</span><span class="p">;</span> <span class="p">;</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span>
            <span class="n">j</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="c">// unbias</span>
            <span class="k">if</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="m">0</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int16</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">+</span> <span class="m">1</span> <span class="c">// bias</span>
                <span class="k">break</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span> <span class="p">{</span>
                <span class="k">return</span> <span class="n">j</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="no">true</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With Go 1.20 this is an order of magnitude faster than <code class="language-plaintext highlighter-rouge">map[int32]int16</code>,
which isn’t surprising. I used multiplication as the key operator because,
in my first take, Go produced a branch for the “max” operation — at a 25%
performance penalty on random inputs.</p>

<p>A full-featured, generic hash table may be overkill for your problem, and
a bit of hashed indexing with collision resolution over a small array
might be sufficient. The problem constraints might open up such shortcuts.</p>

]]>
    </content>
  </entry>

  <entry>
    <title>Practical libc-free threading on Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/03/23/"/>
    <id>urn:uuid:631a8107-2eef-420b-9594-752e6f013048</id>
    <updated>2023-03-23T05:32:41Z</updated>
    <category term="c"/><category term="optimization"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re <a href="/blog/2023/02/15/">not using a C runtime</a> on Linux, and instead you’re
programming against its system call API. It’s long-term and stable after
all. <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">Memory management</a> and <a href="/blog/2023/02/13/">buffered I/O</a> are easily
solved, but a lot of software benefits from concurrency. It would be nice
to also have thread spawning capability. This article will demonstrate a
simple, practical, and robust approach to spawning and managing threads
using only raw system calls. It only takes about a dozen lines of C,
including a few inline assembly instructions.</p>

<p>The catch is that there’s no way to avoid using a bit of assembly. Neither
the <code class="language-plaintext highlighter-rouge">clone</code> nor <code class="language-plaintext highlighter-rouge">clone3</code> system calls have threading semantics compatible
with C, so you’ll need to paper over it with a bit of inline assembly per
architecture. This article will focus on x86-64, but the basic concept
should work on all architectures supported by Linux. The <a href="https://man7.org/linux/man-pages/man2/clone.2.html">glibc <code class="language-plaintext highlighter-rouge">clone(2)</code>
wrapper</a> fits a C-compatible interface on top of the raw system call,
but we won’t be using it here.</p>

<p>Before diving in, the complete, working demo: <a href="https://github.com/skeeto/scratch/blob/master/misc/stack_head.c"><strong><code class="language-plaintext highlighter-rouge">stack_head.c</code></strong></a></p>

<h3 id="the-clone-system-call">The clone system call</h3>

<p>On Linux, threads are spawned using the <code class="language-plaintext highlighter-rouge">clone</code> system call with semantics
like the classic unix <code class="language-plaintext highlighter-rouge">fork(2)</code>. One process goes in, two processes come
out in nearly the same state. For threads, those processes share almost
everything and differ only by two registers: the return value — zero in
the new thread — and stack pointer. Unlike typical thread spawning APIs,
the application does not supply an entry point. It only provides a stack
for the new thread. The simple form of the raw clone API looks something
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">clone</span><span class="p">(</span><span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>
</code></pre></div></div>

<p>Sounds kind of elegant, but it has an annoying problem: The new thread
begins life in the <em>middle</em> of a function without any established stack
frame. Its stack is a blank slate. It’s not ready to do anything except
jump to a function prologue that will set up a stack frame. So besides the
assembly for the system call itself, it also needs more assembly to get
the thread into a C-compatible state. In other words, <strong>a generic system
call wrapper cannot reliably spawn threads</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">brokenclone</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">threadentry</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">long</span> <span class="n">r</span> <span class="o">=</span> <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_clone</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">stack</span><span class="p">);</span>
    <span class="c1">// DANGER: new thread may access non-existent stack frame here</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">threadentry</span><span class="p">(</span><span class="n">arg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For odd historical reasons, each architecture’s <code class="language-plaintext highlighter-rouge">clone</code> has a slightly
different interface. The newer <code class="language-plaintext highlighter-rouge">clone3</code> unifies these differences, but it
suffers from the same thread spawning issue above, so it’s not helpful
here.</p>
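<p>For perspective, here is a sketch of that <code class="language-plaintext highlighter-rouge">clone3</code> interface, using the original (version 0) fields of <code class="language-plaintext highlighter-rouge">struct clone_args</code> from <code class="language-plaintext highlighter-rouge">linux/sched.h</code>. Even though the caller passes the stack explicitly, the child still wakes up mid-function on a blank stack:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: clone3 takes a versioned, extensible argument struct.
struct clone_args {
    unsigned long long flags;
    unsigned long long pidfd;
    unsigned long long child_tid;
    unsigned long long parent_tid;
    unsigned long long exit_signal;
    unsigned long long stack;       // lowest address of the new stack
    unsigned long long stack_size;
    unsigned long long tls;
};
// long r = syscall(SYS_clone3, &amp;args, sizeof(args));
// As with clone, r is zero in the child, which has no stack frame.
</code></pre></div></div>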

<h3 id="the-stack-header">The stack “header”</h3>

<p>I <a href="/blog/2015/05/15/">figured out a neat trick eight years ago</a> which I continue to use
today. The parent and child threads are in nearly identical states when
the new thread starts, but the immediate goal is to diverge. As noted, one
difference is their stack pointers. To diverge their execution, we could
make their execution depend on the stack. An obvious choice is to push
different return pointers on their stacks, then let the <code class="language-plaintext highlighter-rouge">ret</code> instruction
do the work.</p>

<p>Carefully preparing the new stack ahead of time is the key to everything,
and there’s a straightforward technique that I like to call the <code class="language-plaintext highlighter-rouge">stack_head</code>,
a structure placed at the high end of the new stack. Its first element
must be the entry point pointer, and this entry point will receive a
pointer to its own <code class="language-plaintext highlighter-rouge">stack_head</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>The structure must have 16-byte alignment on all architectures. I used an
attribute to keep this straight, which also helps when using <code class="language-plaintext highlighter-rouge">sizeof</code>
to place the structure, as I’ll demonstrate later.</p>

<p>Now for the cool part: The <code class="language-plaintext highlighter-rouge">...</code> can be anything you want! Use that area
to seed the new stack with whatever thread-local data is necessary. It’s a
neat feature you don’t get from standard thread spawning interfaces. If I
plan to “join” a thread later — wait until it’s done with its work — I’ll
put a join futex in this space:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">join_futex</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>More details on that futex shortly.</p>

<h3 id="the-clone-wrapper">The clone wrapper</h3>

<p>I call the <code class="language-plaintext highlighter-rouge">clone</code> wrapper <code class="language-plaintext highlighter-rouge">newthread</code>. It has the inline assembly for the
system call, and since it includes a <code class="language-plaintext highlighter-rouge">ret</code> to diverge the threads, it’s a
“naked” function <a href="/blog/2023/02/12/">just like with <code class="language-plaintext highlighter-rouge">setjmp</code></a>. The compiler will
generate no prologue or epilogue, and the function body is limited to
inline assembly without input/output operands. It cannot even reliably
reference its parameters by name. Like <code class="language-plaintext highlighter-rouge">clone</code>, it doesn’t accept a thread
entry point. Instead it accepts a <code class="language-plaintext highlighter-rouge">stack_head</code> seeded with the entry
point. The whole wrapper is just six instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="kr">naked</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">long</span> <span class="nf">newthread</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"mov  %%rdi, %%rsi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// arg2 = stack</span>
        <span class="s">"mov  $0x50f00, %%edi</span><span class="se">\n</span><span class="s">"</span>  <span class="c1">// arg1 = clone flags</span>
        <span class="s">"mov  $56, %%eax</span><span class="se">\n</span><span class="s">"</span>       <span class="c1">// SYS_clone</span>
        <span class="s">"syscall</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov  %%rsp, %%rdi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// entry point argument</span>
        <span class="s">"ret</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="o">:</span> <span class="o">:</span> <span class="s">"rax"</span><span class="p">,</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"rsi"</span><span class="p">,</span> <span class="s">"rdi"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On x86-64, both function calls and system calls use <code class="language-plaintext highlighter-rouge">rdi</code> and <code class="language-plaintext highlighter-rouge">rsi</code> for
their first two parameters. Per the reference <code class="language-plaintext highlighter-rouge">clone(2)</code> prototype above:
the first system call argument is <code class="language-plaintext highlighter-rouge">flags</code> and the second argument is the
new <code class="language-plaintext highlighter-rouge">stack</code>, which will point directly at the <code class="language-plaintext highlighter-rouge">stack_head</code>. However, the
stack pointer arrives in <code class="language-plaintext highlighter-rouge">rdi</code>. So I copy <code class="language-plaintext highlighter-rouge">stack</code> into the second argument
register, <code class="language-plaintext highlighter-rouge">rsi</code>, then load the flags (<code class="language-plaintext highlighter-rouge">0x50f00</code>) into the first argument
register, <code class="language-plaintext highlighter-rouge">rdi</code>. The system call number goes in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<p>Where does that <code class="language-plaintext highlighter-rouge">0x50f00</code> come from? That’s the bare minimum thread spawn
flag set in hexadecimal. If any flag is missing then threads will not
spawn reliably — as discovered the hard way by trial and error across
different system configurations, not from documentation. It’s computed
normally like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">long</span> <span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FILES</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FS</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SIGHAND</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SYSVSEM</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_THREAD</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_VM</span><span class="p">;</span>
</code></pre></div></div>
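<p>As a sanity check, plugging in the standard flag values from <code class="language-plaintext highlighter-rouge">linux/sched.h</code> confirms the sum:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define CLONE_VM      0x00000100  // share address space
#define CLONE_FS      0x00000200  // share cwd, root, umask
#define CLONE_FILES   0x00000400  // share file descriptor table
#define CLONE_SIGHAND 0x00000800  // share signal handlers
#define CLONE_THREAD  0x00010000  // join the parent's thread group
#define CLONE_SYSVSEM 0x00040000  // share System V semaphores
// 0x00f00 | 0x50000 == 0x50f00
</code></pre></div></div>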

<p>When the system call returns, it copies the stack pointer into <code class="language-plaintext highlighter-rouge">rdi</code>, the
first argument for the entry point. In the new thread the stack pointer
will be the same value as <code class="language-plaintext highlighter-rouge">stack</code>, of course. In the old thread this is a
harmless no-op because <code class="language-plaintext highlighter-rouge">rdi</code> is a volatile register in this ABI. Finally,
<code class="language-plaintext highlighter-rouge">ret</code> pops the address at the top of the stack and jumps. In the old
thread this returns to the caller with the system call result, either an
error (<a href="/blog/2016/09/23/">negative errno</a>) or the new thread ID. In the new thread
<strong>it pops the first element of <code class="language-plaintext highlighter-rouge">stack_head</code></strong> which, of course, is the
entry point. That’s why it must be first!</p>

<p>The thread has nowhere to return from the entry point, so when it’s done
it must either block indefinitely or use the <code class="language-plaintext highlighter-rouge">exit</code> (<em>not</em> <code class="language-plaintext highlighter-rouge">exit_group</code>)
system call to terminate itself.</p>

<h3 id="caller-point-of-view">Caller point of view</h3>

<p>The caller side looks something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">threadentry</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do work ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
    <span class="n">futex_wake</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">);</span>
    <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span> <span class="o">=</span> <span class="n">newstack</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">16</span><span class="p">);</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">entry</span> <span class="o">=</span> <span class="n">threadentry</span><span class="p">;</span>
    <span class="c1">// ... assign other thread data ...</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">newthread</span><span class="p">(</span><span class="n">stack</span><span class="p">);</span>

    <span class="c1">// ... do work ...</span>

    <span class="n">futex_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">exit_group</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Despite the minimalist, 6-instruction clone wrapper, this is taking the
shape of a conventional threading API. It would only take a bit more to
hide the futex, too. Speaking of which, what’s going on there? The <a href="/blog/2022/10/05/">same
principle as a WaitGroup</a>. The futex, an integer, is zero-initialized,
indicating the thread is running (“not done”). The joiner tells the kernel
to wait until the integer is non-zero, which it may already be since I
don’t bother to check first. When the child thread is done, it atomically
sets the futex to non-zero and wakes all waiters, which might be nobody.</p>

<p>Caveat: It’s not safe to free/reuse the stack after a successful join. It
only indicates the thread is done with its work, not that it exited. You’d
need to wait for its <code class="language-plaintext highlighter-rouge">SIGCHLD</code> (or use <code class="language-plaintext highlighter-rouge">CLONE_CHILD_CLEARTID</code>). If this
sounds like a problem, consider <a href="https://vimeo.com/644068002">your context</a> more carefully: Why do
you feel the need to free the stack? It will be freed when the process
exits. Worried about leaking stacks? Why are you starting and exiting an
unbounded number of threads? In the worst case park the thread in a thread
pool until you need it again. Only worry about this sort of thing if
you’re building a general purpose threading API like pthreads. I know it’s
tempting, but avoid doing that unless you absolutely must.</p>

<p>What’s with the <code class="language-plaintext highlighter-rouge">force_align_arg_pointer</code>? Linux doesn’t align the stack
for the process entry point like a System V ABI function call. Processes
begin life with an unaligned stack. This attribute tells GCC to fix up the
stack alignment in the entry point prologue, <a href="/blog/2023/02/15/#stack-alignment-on-32-bit-x86">just like on Windows</a>.
If you want to access <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, and <code class="language-plaintext highlighter-rouge">envp</code> you’ll need <a href="/blog/2022/02/18/">more
assembly</a>. (I wish doing <em>really basic things</em> without libc on Linux
didn’t require so much assembly.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__asm</span> <span class="p">(</span>
    <span class="s">".global _start</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"_start:</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  (%rsp), %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsp), %rsi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsi,%rdi,8), %rdx</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   call  main</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  %eax, %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  $60, %eax</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   syscall</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">envp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Getting back to the example usage, it has some regular-looking system call
wrappers. Where do those come from? Start with this 6-argument generic
system call wrapper.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">syscall6</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="n">b</span><span class="p">,</span> <span class="kt">long</span> <span class="n">c</span><span class="p">,</span> <span class="kt">long</span> <span class="n">d</span><span class="p">,</span> <span class="kt">long</span> <span class="n">e</span><span class="p">,</span> <span class="kt">long</span> <span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r10</span> <span class="n">asm</span><span class="p">(</span><span class="s">"r10"</span><span class="p">)</span> <span class="o">=</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r8</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r8"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">e</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r9</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r9"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">ret</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r10</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r8</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r9</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I could define <code class="language-plaintext highlighter-rouge">syscall5</code>, <code class="language-plaintext highlighter-rouge">syscall4</code>, etc. but instead I’ll just wrap it
in macros. The former would be more efficient since the latter wastes
instructions zeroing registers for no reason, but for now I’m focused on
compacting the implementation source.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))
</span></code></pre></div></div>

<p>Now we can have some exits:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit_group</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit_group</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Simplified futex wrappers:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">expect</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL4</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">expect</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL3</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="mh">0x7fffffff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And so on.</p>
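<p>These wrappers lean on a handful of constants that, without libc, must be defined by hand. For x86-64 they are (from <code class="language-plaintext highlighter-rouge">asm/unistd_64.h</code> and <code class="language-plaintext highlighter-rouge">linux/futex.h</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// x86-64 system call numbers
#define SYS_mmap         9
#define SYS_clone       56
#define SYS_exit        60
#define SYS_futex      202
#define SYS_exit_group 231

// futex operations
#define FUTEX_WAIT       0
#define FUTEX_WAKE       1
</code></pre></div></div>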

<p>Finally I can talk about that <code class="language-plaintext highlighter-rouge">newstack</code> function. It’s just a wrapper
around an anonymous memory map allocating pages from the kernel. I’ve
hardcoded the constants for the standard mmap allocation since they’re
nothing special or unusual. The return value check is a little tricky
since a large portion of the negative range is valid, so I only want to
check for a small range of negative errnos. (Allocating an arena looks
basically the same.)</p>
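<p>Decoded, the hardcoded <code class="language-plaintext highlighter-rouge">3</code> and <code class="language-plaintext highlighter-rouge">0x22</code> in the mmap call below are just the usual protection and mapping flags:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define PROT_READ     0x01
#define PROT_WRITE    0x02
#define MAP_PRIVATE   0x02
#define MAP_ANONYMOUS 0x20
// prot  = PROT_READ|PROT_WRITE      ==    3
// flags = MAP_PRIVATE|MAP_ANONYMOUS == 0x22
</code></pre></div></div>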

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="nf">newstack</span><span class="p">(</span><span class="kt">long</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">p</span> <span class="o">=</span> <span class="n">SYSCALL6</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x22</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="kt">long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">size</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span><span class="p">);</span>
    <span class="k">return</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span> <span class="o">+</span> <span class="n">count</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">aligned</code> attribute comes into play here: I treat the result like an
array of <code class="language-plaintext highlighter-rouge">stack_head</code> and return the last element. The attribute ensures
each individual element is aligned.</p>

<p>That’s it! There’s not much to it other than a few thoughtful assembly
instructions. It took doing this a few times in a few different programs
before I noticed how simple it can be.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>I solved the Dandelions paper-and-pencil game</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/12/"/>
    <id>urn:uuid:14edf491-dcdd-4c2f-a75f-5e89838e6b40</id>
    <updated>2022-10-12T03:02:27Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I’ve been reading <a href="https://mathwithbaddrawings.com/2022/01/19/math-games-with-bad-drawings-2/"><em>Math Games with Bad Drawings</em></a>, a great book
well-aligned to my interests. It’s given me a lot of new, interesting
programming puzzles to consider. The first to truly nerd snipe me was
<a href="https://mathwithbaddrawings.com/dandelions/">Dandelions</a> (<a href="https://mathwithbaddrawings.com/wp-content/uploads/2020/06/game-5-dandelions-1.pdf">full rules</a>), an asymmetric paper-and-pencil game
invented by the book’s author, Ben Orlin. Just as with <a href="/blog/2020/10/19/">British Square two
years ago</a> — and essentially following the same technique — I wrote a
program that explores the game tree sufficiently to play either side
perfectly, “solving” the game in its standard 5-by-5 configuration.</p>

<p>The source: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/dandelions.c"><code class="language-plaintext highlighter-rouge">dandelions.c</code></a></strong></p>

<p>The game is played on a 5-by-5 grid where one player plays the dandelions,
the other plays the wind. Players alternate, dandelions placing flowers
and wind blowing in one of the eight directions, spreading seeds from all
flowers along the direction of the wind. Each side gets seven moves, and
the wind cannot blow in the same direction twice. The dandelions’ goal is
to fill the grid with seeds, and the wind’s goal is to prevent this.</p>

<p>Try playing a few rounds with a friend, and you will probably find that
dandelions is difficult, at least in your first games, as though it cannot
be won. However, my engine proves the opposite: <strong>The dandelions always
win with perfect play.</strong> In fact, it’s so lopsided that the dandelions’
first move is irrelevant. Every first move is winnable. If the dandelions
blunder, typically wind has one narrow chance to seize control, after
which wind probably wins with any (or almost any) move.</p>

<p>For reasons I’ll discuss later, I only solved the 5-by-5 game, and the
situation may be different for the 6-by-6 variant. Also, unlike British
Square, my engine does not exhaustively explore the entire game tree
because it’s far too large. Instead it does a minimax search to the bottom
of the tree and stops when it finds a branch where all leaves are wins for
the current player. Because of this, it cannot maximize the outcome —
winning as early as possible as dandelions or maximizing the number of
empty grid spaces as wind. I also can’t quantify the exact size of the tree.</p>

<p>Like with British Square, my game engine only has a crude user interface
for interactively exploring the game tree. While you can “play” it in a
sense, it’s not intended to be played. It also takes a few seconds to
initially explore the game tree, so wait for the <code class="language-plaintext highlighter-rouge">&gt;&gt;</code> prompt.</p>

<h3 id="bitboard-seeding">Bitboard seeding</h3>

<p>I used <a href="https://www.chessprogramming.org/Bitboards">bitboards</a> of course: a 25-bit bitboard for flowers, a 25-bit
bitboard for seeds, and an 8-bit set to track which directions the wind
has blown. It’s especially well-suited for this game since seeds can be
spread in parallel using bitwise operations. Shift the flower bitboard in
the direction of the wind four times, ORing it into the seeds bitboard
on each shift:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int wind;
uint32_t seeds, flowers;

flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
</code></pre></div></div>

<p>Of course it’s a little more complicated than this. The flowers must be
masked to keep them from wrapping around the grid, and wind may require
shifting in the other direction. In order to “negative shift” I actually
use a rotation (notated with <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code> below). Consider, to rotate an N-bit
integer <em>left</em> by R, one can <em>right</em>-rotate it by <code class="language-plaintext highlighter-rouge">N-R</code> — ex. on a 32-bit
integer, a left-rotate by 1 is the same as a right-rotate by 31. So for a
negative <code class="language-plaintext highlighter-rouge">wind</code> that goes in the other direction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>flowers &gt;&gt;&gt; (wind &amp; 31);
</code></pre></div></div>

<p>With such a “programmable shift” I can implement the bulk of the game
rules using a couple of tables and no branches:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// clockwise, east is zero
static int8_t rot[] = {-1, -6, -5, -4, +1, +6, +5, +4};
static uint32_t mask[] = {
    0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
    0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
</code></pre></div></div>

<p>The masks clear out the column/row about to be shifted “out” so that it
doesn’t wrap around. Viewed in base-2, they’re 5-bit patterns repeated 5
times.</p>

<h3 id="bitboard-packing-and-canonicalization">Bitboard packing and canonicalization</h3>

<p>The entire game state is two 25-bit bitboards and an 8-bit set. That’s 58
bits, which fits in a 64-bit integer with bits to spare. How incredibly
convenient! So I represent the game state using a 64-bit integer, using a
packing like I did with British Square. The bottom 25 bits are the seeds,
the next 25 bits are the flowers, and the next 8 bits are the wind set.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>000000 WWWWWWWW FFFFFFFFFFFFFFFFFFFFFFFFF SSSSSSSSSSSSSSSSSSSSSSSSS
</code></pre></div></div>

<p>Even more convenient, I could reuse my bitboard canonicalization code from
British Square, also a 5-by-5 grid packed in the same way, saving me the
trouble of working out all the bit sieves. I only had to figure out how to
transpose and flip the wind bitset. Turns out that’s pretty easy, too.
Here’s how I represent the 8 wind directions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>567
4 0
321
</code></pre></div></div>

<p>Flipping this vertically I get:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>321
4 0
567
</code></pre></div></div>

<p>Unroll these to show how old maps onto new:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>old: 01234567
new: 07654321
</code></pre></div></div>

<p>The new is just the old rotated and reversed. Transposition is the same
story, just a different rotation. I use a small lookup table to reverse
the bits, and then an 8-bit rotation. (See <code class="language-plaintext highlighter-rouge">revrot</code>.)</p>

<p>To determine how many moves have been made, popcount the flower bitboard
and wind bitset.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int moves = POPCOUNT64(g &amp; 0x3fffffffe000000);
</code></pre></div></div>

<p>To test if dandelions have won:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int win = (g&amp;0x1ffffff) == 0x1ffffff;
</code></pre></div></div>

<p>Since the plan is to store all the game states in a big hash table — an
<a href="/blog/2022/08/08/">MSI double hash</a> in this case — I’d like to reserve the zero value
as a “null” board state. This lets me zero-initialize the hash table. To
do this, I invert the wind bitset such that a 1 indicates the direction is
still available. So the initial game state looks like this (in the real
program this is accounted for in the previously-discussed turn popcount):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define GAME_INIT ((uint64_t)255 &lt;&lt; 50)
</span></code></pre></div></div>

<p>The remaining 6 bits can be used to cache information about the rest of
tree under this game state, namely who wins from this position, and this
serves as the “value” in the hash table. Turns out the bitboards are
already noisy enough that a <a href="/blog/2018/07/31/">single xorshift</a> makes for a great hash
function. The hash table, including hash function, is under a dozen lines
of code.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Find the hash table slot for the given game state.</span>
<span class="kt">uint64_t</span> <span class="o">*</span><span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ht</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">g</span> <span class="o">^</span> <span class="n">g</span><span class="o">&gt;&gt;</span><span class="mi">32</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1L</span> <span class="o">&lt;&lt;</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="mh">0x3ffffffffffffff</span><span class="p">)</span> <span class="o">==</span> <span class="n">g</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">ht</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To explore a 6-by-6 grid I’d need to change my representation, which is
part of why I didn’t do it. I can’t fit two 36-bit bitboards in a 64-bit
integer, so I’d need to double my storage requirements, which are already
strained.</p>

<h3 id="computational-limitations">Computational limitations</h3>

<p>Due to the way seeds spread, game states resulting from different moves
rarely converge back to a common state later in the tree, so the hash
table isn’t doing much deduplication. Exhaustively exploring the entire
game tree, even cutting it down to an 8th using canonicalization, requires
substantial computing resources, more than I personally have available for
this project. So I had to stop at the slightly weaker form, find a winning
branch rather than maximizing a “score.”</p>

<p>I configure the program to allocate 2GiB for the hash table, but if you
run just a few dozen games off the same table (same program instance),
each exploring different parts of the game tree, you’ll exhaust this
table. A 6-by-6 doubles the memory requirements just to represent the
game, but it also slows the search and substantially increases the width
of the tree, which grows 44% faster. I’m sure it can be done, but it’s
just beyond the resources available to me.</p>

<h3 id="dandelion-puzzles">Dandelion Puzzles</h3>

<p>As a side effect, I wrote a small routine to randomly play out games in
search of “mate-in-two”-style puzzles. The dandelions have two flowers to
place and can force a win with two specific placements — and only those
two placements — regardless of how the wind blows. Here are two of the
better ones, each involving a small trick that I won’t give away here
(note: arrowheads indicate directions wind can still blow):</p>

<p><img src="/img/dandelions/puzzle1.svg" alt="" /></p>

<p><img src="/img/dandelions/puzzle2.svg" alt="" /></p>

<p>There are a variety of potential single-player puzzles of this form.</p>

<ul>
  <li>Cooperative: place a dandelion <em>and</em> pick the wind direction</li>
  <li>Avoidance: <em>don’t</em> seed a particular tile</li>
  <li>Hard ground: certain tiles can’t grow flowers (but still get seeded)</li>
  <li>Weeding: as wind, figure out which flower to remove before blowing</li>
</ul>

<p>There could be a whole “crossword book” of such dandelion puzzles.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>The quick and practical "MSI" hash table</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/08/08/"/>
    <id>urn:uuid:4a7d8c3d-3bcf-4b10-b50a-64227c02b254</id>
    <updated>2022-08-08T23:57:08Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Follow-up: <a href="/blog/2023/06/26/">Solving “Two Sum” in C with a tiny hash table</a></em></p>

<p>I <a href="https://skeeto.s3.amazonaws.com/share/onward17-essays2.pdf">generally prefer C</a>, so I’m accustomed to building whatever I need
on the fly, such as heaps, <a href="/blog/2022/05/22/#inverting-the-tree-links">linked lists</a>, and especially hash
tables. Few programs use more than a small subset of a data structure’s
features, making their implementation smaller, simpler, and <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">more
efficient</a> than the general case, which must handle every edge
case. A typical hash table tutorial will describe a relatively lengthy
program, but in practice, bespoke hash tables are <a href="/blog/2020/10/19/#hash-table-memoization">only a few lines of
code</a>. Over the years I’ve worked out some basic principles for hash
table construction that aid in quick and efficient implementation. This
article covers the technique and philosophy behind what I’ve come to call
the “mask-step-index” (MSI) hash table, which is my standard approach.</p>

<!--more-->

<p>MSI hash tables are nothing novel, just a <a href="https://en.wikipedia.org/wiki/Double_hashing">double hashed</a>, <a href="https://en.wikipedia.org/wiki/Open_addressing">open
address</a> hash table layered generically atop an external array. It’s
best regarded as a kind of database index — <em>a lookup index over an
existing array</em>. The array exists independently, and the hash table
provides an efficient lookup into that array over some property of its
entries.</p>

<p>The core of the MSI hash table is this iterator function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Compute the next candidate index. Initialize idx to the hash.</span>
<span class="kt">int32_t</span> <span class="nf">ht_lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">hash</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">idx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="kt">uint32_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">idx</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The name should now make sense. I literally sound it out in my head when I
type it, like a mnemonic. Compute a mask, then a step size, finally an
index. The <code class="language-plaintext highlighter-rouge">exp</code> parameter is a power-of-two exponent for the hash table
size, <a href="/blog/2022/05/14/">which may look familiar</a>. I’ve used <code class="language-plaintext highlighter-rouge">int32_t</code> for the index,
but it’s easy to substitute, say, <code class="language-plaintext highlighter-rouge">size_t</code>. I try to optimize for the
common case, where a 31-bit index is more than sufficient, and I use a signed
type since <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf">subscripts should be signed</a>. Internally it uses unsigned
types since overflow is both expected and harmless thanks to the
power-of-two hash table size.</p>

<p>It’s the caller’s responsibility to compute the hash, and the MSI iterator
tells the caller <em>where to look next</em>. For insertion, the caller (maybe)
looks either for an existing entry to override, or an empty slot. For
lookup, the caller looks for a matching entry, giving up as soon as it
finds an empty slot. An insertion loop looks like this string intern table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EXP 15
</span>
<span class="c1">// Initialize all slots to an "empty" value (null)</span>
<span class="cp">#define HT_INIT { {0}, 0 }
</span><span class="k">struct</span> <span class="n">ht</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">};</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">key</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">EXP</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="c1">// empty, insert here</span>
            <span class="k">if</span> <span class="p">((</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="mi">1</span> <span class="o">==</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
            <span class="p">}</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">key</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="c1">// found, return canonical instance</span>
            <span class="k">return</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The caller initializes the iterator to the hash result. This will probably
be out of range, even negative, but that doesn’t matter. The iterator
function will turn it into a valid index before use. This detail is key to
<em>double hashing</em>: The low bits of the hash tell it where to start, and the
high bits tell it how to step. The hash table size is a power of two, and
the step size is forced to an odd number (via <code class="language-plaintext highlighter-rouge">| 1</code>), so it’s guaranteed
to visit each slot in the table exactly once before restarting. It’s
important that the search halts before looping, such as by guaranteeing
the existence of an empty slot (i.e. the “out of memory” check).</p>

<p>Note: The example out of memory check pushes the hash table to the
absolute limit, and in practice you’d want to stop at a smaller load
factor — perhaps even as low as 50% since that’s simple and fast.
Otherwise it degrades into a linear search as the table approaches
capacity.</p>

<p>Even if two keys start or land at the same place, they’ll quickly diverge
due to differing steps. For a while I used plain linear probing — i.e.
<code class="language-plaintext highlighter-rouge">step=1</code> — but double hashing came out ahead every time I benchmarked,
steering me towards this “MSI” construction. Ideally <code class="language-plaintext highlighter-rouge">ht_lookup</code> would be
placed so that it’s inlined — e.g. in the same translation unit — so that
the mask and step are not actually recomputed each iteration.</p>

<h3 id="deletion">Deletion</h3>

<p>What about deletion? First, consider how infrequently you delete entries
from a hash table. When was the last time you used <code class="language-plaintext highlighter-rouge">del</code> on a dictionary
in Python, or <code class="language-plaintext highlighter-rouge">delete</code> on a <code class="language-plaintext highlighter-rouge">map</code> in Go? This operation is rarely needed.
However, when you <em>do</em> need it, reserve a gravestone value in addition to
the empty value.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">char</span> <span class="n">gravestone</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"(deleted)"</span><span class="p">;</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">**</span><span class="n">dest</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// ...</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="c1">// ...</span>
            <span class="n">dest</span> <span class="o">=</span> <span class="n">dest</span> <span class="o">?</span> <span class="n">dest</span> <span class="o">:</span> <span class="o">&amp;</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="o">*</span><span class="n">dest</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">key</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">gravestone</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">dest</span> <span class="o">=</span> <span class="n">dest</span> <span class="o">?</span> <span class="n">dest</span> <span class="o">:</span> <span class="o">&amp;</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(...))</span> <span class="p">{</span>
            <span class="c1">// ...</span>
        <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">unintern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">gravestone</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// skip over</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(...))</span> <span class="p">{</span>
            <span class="kt">char</span> <span class="o">*</span><span class="n">old</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">gravestone</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">old</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When searching, skip over gravestones. Note that gravestones are compared
with <code class="language-plaintext highlighter-rouge">==</code> (identity), so this does not preclude a string <code class="language-plaintext highlighter-rouge">"(deleted)"</code>.
When inserting, use the first gravestone found if no entry was found.</p>

<h3 id="as-a-database-index">As a database index</h3>

<p>Iterating over the example string intern table is simple: Iterate over the
underlying array, skipping empty slots (and maybe gravestones). Entries
will be in a random order rather than, say, insertion order. This is a
useful introductory example, but it isn’t where MSI shines most. As
mentioned, it’s best when treated like a database index.</p>

<p>Let’s take a step back and consider the caller of <code class="language-plaintext highlighter-rouge">intern</code>. How does it
allocate these strings? Perhaps they’re <a href="/blog/2022/05/22/">appended to a buffer</a>, and
<code class="language-plaintext highlighter-rouge">intern</code> indicates whether or not the string is unique so far.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="c1">// lookup table over the buffer</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ht</span><span class="p">;</span>

    <span class="c1">// a collection of strings</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">BUFLEN</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Strings are only appended to the buffer when unique, and the hash table
can make that determination in constant time.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="nf">buf_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">len</span> <span class="o">&gt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
    <span class="p">}</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">candidate</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">+</span> <span class="n">buf</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">candidate</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">==</span> <span class="n">candidate</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// string is unique, keep it</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In my first example, <code class="language-plaintext highlighter-rouge">EXP</code> was fixed. This could be converted into a
dynamic allocation and the hash table resized as needed. Here’s a new
constructor, which I’m including since I think it’s instructive:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">ht</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">exp</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">**</span><span class="n">ht</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">ht</span>
<span class="nf">ht_new</span><span class="p">(</span><span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ht</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">exp</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="n">assert</span><span class="p">(</span><span class="n">exp</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">exp</span> <span class="o">&gt;=</span> <span class="mi">32</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">ht</span><span class="p">;</span>  <span class="c1">// request too large</span>
    <span class="p">}</span>

    <span class="n">ht</span><span class="p">.</span><span class="n">ht</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">exp</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="k">return</span> <span class="n">ht</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">intern</code> fails, the hash table can be replaced with a new table twice
as large, and since, like a database index, its contents are entirely
redundant, <em>the hash table can be discarded and rebuilt from scratch</em>. The
new and old table don’t need to exist simultaneously. Here’s a routine to
populate an empty hash table from the buffer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">buf_rehash</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">off</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">off</span> <span class="o">&lt;</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">+</span> <span class="n">off</span><span class="p">;</span>
        <span class="kt">int32_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">off</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
        <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
                <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
                <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note how this iterates in insertion order, which may be useful in other
cases, too. On the rehash it doesn’t need to check for existing entries,
as all entries are already known to be unique. Later when <code class="language-plaintext highlighter-rouge">intern</code> hits
its capacity:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">result</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">);</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span> <span class="o">=</span> <span class="n">ht_new</span><span class="p">(</span><span class="n">ht</span><span class="p">.</span><span class="n">exp</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
        <span class="p">}</span>
        <span class="n">buf_rehash</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>  <span class="c1">// cannot fail</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>I freed and reallocated the table, but it would be trivial to use a
<code class="language-plaintext highlighter-rouge">realloc</code> instead, unlike the case where the old table <em>isn’t</em> redundant.</p>

<h3 id="multimaps">Multimaps</h3>

<p>An MSI hash table is trivially converted into a multimap, a hash table
with multiple values per key. Callers just make one small change: <em>Don’t
stop searching until an empty slot is found</em>. Each match is an additional
multimap value. The “value array” is stored within the hash table itself,
in insertion order, without additional allocations.</p>

<p>For example, imagine the strings in the string buffer have a namespace
prefix, delimited by a colon, like <code class="language-plaintext highlighter-rouge">city:Austin</code> and <code class="language-plaintext highlighter-rouge">state:Texas</code>. We’d
like a fast lookup of all strings under a particular namespace. The
solution is to add another hash table as you would an index to a database
table.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="c1">// ..</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ns</span><span class="p">;</span>
    <span class="c1">// ..</span>
<span class="p">};</span>
</code></pre></div></div>

<p>When a unique string is appended it’s also registered in the namespace
multimap. It doesn’t check for an existing key, only for an empty slot,
since it’s a multimap:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// Check outside the loop since it always inserts.</span>
    <span class="k">if</span> <span class="p">(</span><span class="cm">/* ... ns multimap lacks capacity ... */</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ... grow+rehash ns mutilmap ...</span>
    <span class="p">}</span>

    <span class="kt">int32_t</span> <span class="n">nslen</span> <span class="o">=</span> <span class="n">strcspn</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">":"</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">nslen</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
            <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>It includes the <code class="language-plaintext highlighter-rouge">:</code> as a terminator which simplifies lookups. Here’s a
lookup loop to print all strings under a namespace (includes terminal <code class="language-plaintext highlighter-rouge">:</code>
in the key):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="o">*</span><span class="n">ns</span> <span class="o">=</span> <span class="s">"city:"</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">nslen</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">ns</span><span class="p">);</span>
    <span class="c1">// ...</span>

    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span> <span class="n">nslen</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">ns</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">ns</span><span class="p">,</span> <span class="n">nslen</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">puts</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">nslen</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>An alternative approach to multimaps is to additionally key over a value
subscript. For example, the first city is keyed <code class="language-plaintext highlighter-rouge">{"city", 0}</code>, the next
<code class="language-plaintext highlighter-rouge">{"city", 1}</code>, etc. The value subscript could be mixed into the string
hash with an <a href="/blog/2018/07/31/">integer permutation</a> (more on this below):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">val_idx</span> <span class="o">^</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">nslen</span><span class="p">));</span>
</code></pre></div></div>

<p>The lookup loop would compare both the string and the value subscript, and
stop when it finds a match. The underlying hash table is not truly a
multimap, but rather a plain hash table with a larger key. This requires
extra bookkeeping — tracking individual subscripts and the number of
values per key — but provides constant time random access on the multimap
value array.</p>
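
<p>Here’s a sketch of that scheme with hypothetical names. The entry struct and
the per-key value counts are assumptions;
<code class="language-plaintext highlighter-rouge">hash</code> and
<code class="language-plaintext highlighter-rouge">ht_lookup</code> follow the
article, and <code class="language-plaintext highlighter-rouge">hash64</code> is
a multiply-xorshift permutation:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXP 6

// Hypothetical entry: the "larger key" is a string plus a subscript.
struct entry {
    char   *key;
    int32_t idx;  // value subscript: 0 for the first value, 1 for the next
    char   *val;
};

struct map {
    struct entry *ht[1<<EXP];
};

// FNV-1a-style string hash, as defined in the article
static uint64_t hash(char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}

// Integer permutation (multiply-xorshift), mixing in the subscript
static uint64_t hash64(uint64_t x)
{
    x ^= x >> 32;
    x *= 1111111111111111111u;
    x ^= x >> 32;
    return x;
}

static int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (64 - exp)) | 1;
    return (int32_t)((idx + step) & mask);
}

// Find the slot for {key, idx}: either the matching entry or the empty
// slot where it would go. Compares both the string and the subscript.
struct entry **lookup(struct map *m, char *key, int32_t idx)
{
    uint64_t h = hash64((uint64_t)idx ^ hash(key, (int32_t)strlen(key)));
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, EXP, i);
        struct entry *e = m->ht[i];
        if (!e || (e->idx == idx && !strcmp(e->key, key))) {
            return &m->ht[i];
        }
    }
}
```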

<h3 id="hash-functions">Hash functions</h3>

<p>The MSI iterator leaves hashing up to the caller, who has better knowledge
about the input and how to hash it, though it requires knowing a bit about
building hash functions. The good news is that it’s easy, and less
is more. Better to do too little than too much, and a faster, weaker hash
function is worth a few extra collisions.</p>

<p>The first rule is to never lose sight of the goal: The purpose of the hash
function is to uniformly distribute entries over a table. The better you
know and exploit your input, the less you need to do in the hash function.
Sometimes your keys already contain random data, and so your hash function
can be the identity function! For example, if your keys are <a href="https://www.rfc-editor.org/rfc/rfc4122#section-4.4">“version 4”
UUIDs</a>, don’t waste time hashing them, just load a few bytes from the
end as an integer and you’re done.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// "Hash" a v4 UUID</span>
<span class="kt">uint64_t</span> <span class="nf">uuid4_hash</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">uuid</span><span class="p">[</span><span class="mi">16</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">h</span><span class="p">,</span> <span class="n">uuid</span><span class="o">+</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A reasonable start for strings is <a href="https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function">FNV-1a</a>, such as this possible
implementation for my <code class="language-plaintext highlighter-rouge">hash()</code> function above:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mh">0x100</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mi">255</span><span class="p">;</span>
        <span class="n">h</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">h</span> <span class="o">^</span> <span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">32</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The hash state is initialized to a <em>basis</em>, some arbitrary value. This is a
useful place to introduce a seed or hash key. It’s best that at least one
bit above the low mix-in bits is set so that it’s not trivially stuck at
zero. Above, I’ve chosen the most trivial basis with reasonable results,
though often I’ll use the digits of π.</p>

<p>Next XOR some input into the low bits. This could be a byte, a Unicode
code point, etc. More is better, since otherwise you’re stuck doing more
work per unit, the main weakness of FNV-1a. Carefully note the byte mask,
<code class="language-plaintext highlighter-rouge">&amp; 255</code>, which inhibits sign extension. <strong>Do not mix sign-extended inputs
into FNV-1a</strong> — a widespread implementation mistake.</p>
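
<p>To see the hazard concretely, consider a byte of 0x80 or higher arriving
through a (typically signed) <code class="language-plaintext highlighter-rouge">char</code>.
A small illustration, not from the article:</p>

```c
#include <assert.h>
#include <stdint.h>

// One FNV-1a round with and without the byte mask. Without it, a char
// that is negative on this platform sign-extends to 64 bits, flipping
// the upper 56 bits of the hash state before the multiply.
uint64_t mix_masked(uint64_t h, char c)
{
    h ^= c & 255;  // input confined to the low 8 bits
    return h * 1111111111111111111;
}

uint64_t mix_unmasked(uint64_t h, char c)
{
    h ^= (uint64_t)c;  // sign-extends when char is signed and c < 0
    return h * 1111111111111111111;
}
```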

<p>Multiply by a large, odd random-ish integer. A prime is a reasonable
choice, and I usually pick my favorite prime, shown above: 19 ones in base
10.</p>

<p>Finally, my own touch, an xorshift finalizer. The high bits are much
better mixed than the low bits, so this improves the overall quality.
Though if you take time to benchmark, you might find that this finalizer
isn’t necessary. Remember, do <em>just</em> enough work to keep the number of
collisions low — not <em>lowest</em> — and no more.</p>

<p>If your input is made of integers, or is a short, fixed length, use an
<a href="/blog/2018/07/31/">integer permutation</a>, particularly multiply-xorshift. It takes very
little to get a sufficient distribution. Sometimes one multiplication does
the trick. Fixed-sized, integer-permutation hashes tend to be the fastest,
easily beating fancier SIMD-based hashes, <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">including AES-NI</a>. For
example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Hash a timestamp-based, version 1 UUID</span>
<span class="kt">uint64_t</span> <span class="nf">uuid1_hash</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">uuid</span><span class="p">[</span><span class="mi">16</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">uuid</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="mh">0x3243f6a8885a308d</span><span class="p">;</span>  <span class="c1">// digits of pi</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">33</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">33</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I benchmarked this in a real program, I would probably cut it down even
further, deleting hash operations one at a time and measuring the overall
hash table performance. This <code class="language-plaintext highlighter-rouge">memcpy</code> trick works well with floats, too,
especially packing two single-precision floats into one 64-bit integer.</p>
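
<p>For instance, a hypothetical hash for a point made of two single-precision
floats, using the multiply-xorshift permutation described above:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Pack two floats' bit patterns into one 64-bit integer, then apply a
// multiply-xorshift permutation. A sketch: beware that +0.0f and -0.0f
// have distinct bit patterns and so hash differently.
uint64_t point_hash(float x, float y)
{
    float xy[2] = {x, y};
    uint64_t h;
    memcpy(&h, xy, 8);         // bitwise pack, no numeric conversion
    h *= 1111111111111111111;  // multiply by a large, odd constant
    h ^= h >> 33;              // xorshift finalizer
    return h;
}
```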

<p>If you ever <a href="https://mort.coffee/home/tar/">hesitate to build a hash table</a> when the situation
calls, I hope the MSI technique will make the difference next time. I have
more hash table tricks up my sleeve, but since they’re not specific to MSI
I’ll save them for a future article.</p>

<h3 id="benchmarks">Benchmarks</h3>

<p>There have been objections to my claims about performance, so <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">I’ve
assembled some benchmarks</a>. These demonstrate that:</p>

<ul>
  <li>AES-NI is slower than an integer permutation, at least for short keys.</li>
  <li>A custom, 10-line MSI hash table is easily an order of magnitude faster
than a typical generic hash table from your language’s standard library.
This isn’t because the standard hash table is inferior, but because <a href="https://vimeo.com/644068002">it
wasn’t written for your specific problem</a>.</li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>My take on "where's all the code"</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/05/22/"/>
    <id>urn:uuid:2eb07dcf-0d4c-44e7-9133-fd9cf8e83227</id>
    <updated>2022-05-22T23:59:59Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://lobste.rs/s/ny4ymx">on Lobsters</a>.</em></p>

<p>Earlier this month Ted Unangst researched <a href="https://flak.tedunangst.com/post/compiling-an-openbsd-kernel-50-faster">compiling the OpenBSD kernel
50% faster</a>, which involved stubbing out the largest, extraneous
branches of the source tree. To find the lowest-hanging fruit, he <a href="https://flak.tedunangst.com/post/watc">wrote a
tool</a> called <a href="https://humungus.tedunangst.com/r/watc">watc</a> — <em>where’s all the code</em> — that displays an
interactive “usage” summary of a source tree oriented around line count. A
followup post <a href="https://flak.tedunangst.com/post/parallel-tree-running">about exploring the tree in parallel</a> got me thinking
about the problem, especially since <a href="/blog/2022/05/14/">I had just written about a concurrent
queue</a>. Turning it over in my mind, I saw opportunities for interesting
data structures and memory management, and so I wanted to write my own
version of the tool, <a href="https://github.com/skeeto/scratch/blob/master/windows/watc.c"><strong><code class="language-plaintext highlighter-rouge">watc.c</code></strong></a>, which is the subject of this
article.</p>

<!--more-->

<p>The original <code class="language-plaintext highlighter-rouge">watc</code> is interactive and written in idiomatic Go. My version
is non-interactive, written in C, and currently only supports Windows. Not
only do I prefer batch programs generally, building an interactive user
interface would be complicated and distract from the actual problem I
wanted to tackle. As for the platform restriction, it has some convenient
constraints (for implementers), and my projects are often about shooting
multiple birds with one stone:</p>

<ul>
  <li>
    <p>The longest path is <code class="language-plaintext highlighter-rouge">MAX_PATH</code>, a meager 260 pseudo-UTF-16 code points,
which is nice and short. Technically users can now opt-in to a <a href="https://docs.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation">maximum path
length of 32,767</a>, but so little software supports it, including
much of Windows itself, that it’s not worth considering. Even with the
upper limit, each path component is still restricted by <code class="language-plaintext highlighter-rouge">MAX_PATH</code>. I
can rely on this platform restriction in my design.</p>
  </li>
  <li>
    <p>Symbolic links, an annoying edge case, are outside of consideration.
Technically Windows has them, but they’re sufficiently locked away that
they don’t come up in practice.</p>
  </li>
  <li>
    <p>After years of deliberating, I <a href="https://www.youtube.com/watch?v=r9eQth4Q5jg">was finally convinced</a> to buy and
try <a href="https://remedybg.handmade.network/">RemedyBG</a>, a super slick Windows debugger. I especially wanted
to try out its multi-threading support, and I knew I’d be using multiple
threads in this project. Since it’s incompatible with <a href="/blog/2020/05/15/">my development
kit</a>, my program also supports the MSVC compiler.</p>
  </li>
  <li>
    <p>The very same day I <a href="https://github.com/skeeto/w64devkit/commit/1513aa7">improved GDB support</a> in my development kit,
and this was a great opportunity to dogfood the changes. I’ve used my
kit <em>so much</em> these past two years, especially since both it and I have
matured enough that I’m nearly as productive in it as I am on Linux.</p>
  </li>
  <li>
    <p>It’s practice and experience with <a href="/blog/2021/12/30/">the wide API</a>, and the tool
fully supports Unicode paths. Perhaps a bit unnecessary considering how
few source trees stray beyond ASCII, even just in source text, since too
many things go wrong otherwise.</p>
  </li>
</ul>

<p>Running my tool on nearly the same source tree as the original example
yields:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:\openbsd&gt;watc sys
. 6.89MLOC 364.58MiB
├─dev 5.69MLOC 332.75MiB
│ ├─pci 4.46MLOC 293.80MiB
│ │ ├─drm 3.99MLOC 280.25MiB
│ │ │ ├─amd 3.33MLOC 261.24MiB
│ │ │ │ ├─include 2.61MLOC 238.48MiB
│ │ │ │ │ ├─asic_reg 2.53MLOC 235.07MiB
│ │ │ │ │ │ ├─nbio 689.56kLOC 69.33MiB
│ │ │ │ │ │ ├─dcn 583.67kLOC 58.60MiB
│ │ │ │ │ │ ├─gc 290.26kLOC 28.90MiB
│ │ │ │ │ │ ├─dce 210.16kLOC 16.81MiB
│ │ │ │ │ │ ├─mmhub 155.60kLOC 16.03MiB
│ │ │ │ │ │ ├─dpcs 123.90kLOC 12.97MiB
│ │ │ │ │ │ ├─gca 105.91kLOC 5.87MiB
│ │ │ │ │ │ ├─bif 71.45kLOC 4.41MiB
│ │ │ │ │ │ ├─gmc 64.24kLOC 3.41MiB
│ │ │ │ │ │ └─(other) 230.99kLOC 18.73MiB
│ │ │ │ │ └─(other) 2.10kLOC 139.29kiB
│ │ │ │ └─(other) 718.93kLOC 22.76MiB
│ │ │ └─(other) 583.63kLOC 16.86MiB
│ │ └─(other) 8.53kLOC 259.07kiB
│ └─(other) 1.20MLOC 38.34MiB
└─(other) 1.20MLOC 31.83MiB
</code></pre></div></div>

<p>In place of interactivity it has <code class="language-plaintext highlighter-rouge">-n</code> (lines) and <code class="language-plaintext highlighter-rouge">-d</code> (depth) switches to
control tree pruning, where branches are summarized as <code class="language-plaintext highlighter-rouge">(other)</code> entries.
My idea is for users to run the tool repeatedly with different cutoffs and
filters to get a feel for <em>where’s all the code</em>. (It could really use
more such knobs.) Repeated counting makes performance all the more
important. On my machine, with a hot cache, the above takes ~180ms to count
those 6.89 million lines of code across 8,607 source files.</p>

<p>Each directory is treated like one big source file of its recursively
concatenated contents, so the tool only needs to track directories. Each
directory entry comprises a variable-length string name, line and byte
totals, and tree linkage such that it can be later navigated for sorting
and printing. That linkage has a clever solution, which I’ll get to later.
First, let’s deal with strings.</p>

<h3 id="string-management">String management</h3>

<p>It’s important to get out of the null-terminated string business early,
only reverting to their use at system boundaries, such as constructing
paths for the operating system. Better to handle strings as offset/length
pairs into a buffer. Definitely avoid silly things like <a href="https://www.youtube.com/watch?v=f4ioc8-lDc0&amp;t=4407s">allocating many
individual strings</a>, as encouraged by <code class="language-plaintext highlighter-rouge">strdup</code> — and most other
programming language idioms — and certainly avoid <a href="/blog/2021/07/30/">useless functions like
<code class="language-plaintext highlighter-rouge">strcpy</code></a>.</p>

<p>When the operating system provides a path component that I need to track
for later, I intern it into a single, large buffer. That buffer looks like
so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define BUF_MAX  (1 &lt;&lt; 22)
</span><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">wchar_t</span> <span class="n">buf</span><span class="p">[</span><span class="n">BUF_MAX</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Empirically I determined that even large source trees cumulatively total
on the order of 10,000 characters of directory names. The OpenBSD kernel
source tree is only 2,992 characters of names.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find sys -type d -printf %f | wc -c
2992
</code></pre></div></div>

<p>The biggest I found was the LLVM source tree at 121,720 characters, not
only because of its sheer volume but also because it generally has
relatively long names. So for my maximum buffer size I just maxed it out
(explained in a moment) and called it good. Even with UTF-16, that’s only
8MiB which is perfectly reasonable to allocate all at once up front. Since
my <a href="https://floooh.github.io/2018/06/17/handles-vs-pointers.html">string handles</a> don’t contain pointers, this buffer could be freely
relocated in the case of <code class="language-plaintext highlighter-rouge">realloc</code>.</p>

<p>The operating system provides a null-terminated string. The buffer makes a
copy and returns a handle. A handle is a 32-bit integer encoding offset
and length.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">buf_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">off</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">wcslen</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">len</span> <span class="o">&gt;</span> <span class="n">BUF_MAX</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>  <span class="c1">// out of memory</span>
    <span class="p">}</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="o">+</span><span class="n">off</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">s</span><span class="p">));</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">len</span><span class="o">&lt;&lt;</span><span class="mi">22</span> <span class="o">|</span> <span class="n">off</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The negative range is reserved for errors, leaving 31 bits. I allocate 9
to the length — enough for <code class="language-plaintext highlighter-rouge">MAX_PATH</code> of 260 — and the remaining 22 bits
for the buffer offset, exactly matching the range of my <code class="language-plaintext highlighter-rouge">BUF_MAX</code>.
Splitting on a nibble boundary would have displayed more nicely in
hexadecimal during debugging, but oh well.</p>

<p>A couple of helper functions are in order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>     <span class="nf">str_len</span><span class="p">(</span><span class="kt">int32_t</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">22</span><span class="p">;</span>      <span class="p">}</span>
<span class="kt">int32_t</span> <span class="nf">str_off</span><span class="p">(</span><span class="kt">int32_t</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0x3fffff</span><span class="p">;</span> <span class="p">}</span>
</code></pre></div></div>

<p>Rather than allocate the string buffer on the heap, it’s a <code class="language-plaintext highlighter-rouge">static</code> (read:
too big for the stack) scoped to <code class="language-plaintext highlighter-rouge">main</code>. I consistently call it <code class="language-plaintext highlighter-rouge">b</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">buf</span> <span class="n">b</span><span class="p">;</span>
</code></pre></div></div>

<p>That’s string management solved efficiently in a dozen lines of code. I
briefly considered a hash table to de-duplicate strings in the buffer, but
real source trees aren’t redundant enough to make up for the hash table
itself, plus there’s no reason here to make that sort of time/memory
trade-off.</p>

<h3 id="directory-entries">Directory entries</h3>

<p>I settled on 24-byte directory entries:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">dir</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">nbytes</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">nlines</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">name</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">link</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">nsubdirs</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">nbytes</code> I teetered between 32 bits and 64 bits. No
source tree I found overflows an unsigned 32-bit integer, but LLVM comes
close, just barely overflowing a signed 31-bit integer as of this year.
Since I wanted 10x over the worst case I could find, that left me with a
64-bit integer for bytes.</p>

<p>For <code class="language-plaintext highlighter-rouge">nlines</code>, 32 bits has plenty of headroom. More importantly, this field
is updated concurrently and atomically by multiple threads — line counting
is parallelized — and I want this program to work on 32-bit hosts limited
to 32-bit atomics.</p>

<p>The <code class="language-plaintext highlighter-rouge">name</code> is the string handle for that directory’s name.</p>

<p>The <code class="language-plaintext highlighter-rouge">link</code> and <code class="language-plaintext highlighter-rouge">nsubdirs</code> are the tree linkage. The <code class="language-plaintext highlighter-rouge">link</code> field is an
index, and serves two different purposes at different times. Initially it
will identify the directory’s parent directory, and I had originally named
it <code class="language-plaintext highlighter-rouge">parent</code>. <code class="language-plaintext highlighter-rouge">nsubdirs</code> is the number of subdirectories, but there is
initially no link to a directory’s children.</p>

<p>Like with the buffer, I pre-allocate all the directory entries I’ll need:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define DIRS_MAX  (1 &lt;&lt; 17)
</span><span class="kt">int32_t</span> <span class="n">ndirs</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">dir</span> <span class="n">dirs</span><span class="p">[</span><span class="n">DIRS_MAX</span><span class="p">];</span>
</code></pre></div></div>

<p>A directory handle is just an index into <code class="language-plaintext highlighter-rouge">dirs</code>. The <code class="language-plaintext highlighter-rouge">link</code> field is one
such handle. Like string handles, directory entries contain no pointers,
and so this <code class="language-plaintext highlighter-rouge">dirs</code> buffer could be freely relocated, <em>a la</em> <code class="language-plaintext highlighter-rouge">realloc</code>, if
the context called for such flexibility. In my program, rather than
allocate this on the heap, it’s just a <code class="language-plaintext highlighter-rouge">static</code> (read: too big for the
stack) scoped to <code class="language-plaintext highlighter-rouge">main</code>.</p>

<p>For <code class="language-plaintext highlighter-rouge">DIRS_MAX</code>, I again looked at the worst case I could find, LLVM, which
requires 12,163 entries. I had hoped for 16-bit directory handles, but
that would limit source trees to 32,768 directories — not quite 10x over
the worst case. I settled on 131,072 entries: 3MiB. At only 11MiB total so
far, in the very worst case, it hardly matters that I couldn’t shave off
these extra few bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find llvm-project -type d | wc -l
12163
</code></pre></div></div>

<p>Allocating a directory entry is just a matter of bumping the <code class="language-plaintext highlighter-rouge">ndirs</code>
counter. Reading a directory into <code class="language-plaintext highlighter-rouge">dirs</code> looks roughly like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="n">glob</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">L"*"</span><span class="p">);</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">dir</span> <span class="n">dirs</span><span class="p">[</span><span class="n">DIRS_MAX</span><span class="p">];</span>

<span class="kt">int32_t</span> <span class="n">parent</span> <span class="o">=</span> <span class="p">...;</span>  <span class="c1">// an existing directory handle</span>
<span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
<span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">glob</span><span class="p">);</span>

<span class="n">WIN32_FIND_DATAW</span> <span class="n">fd</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">FindFirstFileW</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fd</span><span class="p">);</span>

<span class="k">do</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">FILE_ATTRIBUTE_DIRECTORY</span> <span class="o">&amp;</span> <span class="n">fd</span><span class="p">.</span><span class="n">dwFileAttributes</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">fd</span><span class="p">.</span><span class="n">cFileName</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">name</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">ndirs</span> <span class="o">==</span> <span class="n">DIRS_MAX</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// out of memory</span>
        <span class="p">}</span>
        <span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">ndirs</span><span class="o">++</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="n">parent</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nsubdirs</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="c1">// ... process file ...</span>
    <span class="p">}</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">FindNextFileW</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fd</span><span class="p">));</span>

<span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>Mentally bookmark that “process file” part. It will be addressed later.</p>

<p>The <code class="language-plaintext highlighter-rouge">buildpath</code> function walks the <code class="language-plaintext highlighter-rouge">link</code> fields, copying (<code class="language-plaintext highlighter-rouge">memcpy</code>) path
components from the string buffer into the <code class="language-plaintext highlighter-rouge">path</code>, separated by
backslashes.</p>

<h3 id="breadth-first-tree-traversal">Breadth-first tree traversal</h3>

<p>At the top level the program must first traverse a tree. There are two
strategies for traversing a tree (or any graph):</p>

<ul>
  <li>Depth-first: stack-oriented (lends to recursion)</li>
  <li>Breadth-first: queue-oriented</li>
</ul>

<p>Recursion makes me nervous, and besides, a queue is already a natural
fit for this problem. The tree I build in <code class="language-plaintext highlighter-rouge">dirs</code> is also the breadth-first
processing queue. (Note: This is entirely distinct from the <em>message</em>
queue that I’ll introduce later, and is not a concurrent queue.) Further,
building the tree in <code class="language-plaintext highlighter-rouge">dirs</code> via breadth-first traversal will have useful
properties later.</p>

<p>The queue is initialized with the root directory, then iterated over until
the iterator reaches the end. Additional directories may be added during
iteration, per the last section.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="n">root</span> <span class="o">=</span> <span class="n">ndirs</span><span class="o">++</span><span class="p">;</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">root</span><span class="p">].</span><span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">L"."</span><span class="p">);</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">root</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>  <span class="c1">// terminator</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">parent</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">parent</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">parent</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ... FindFirstFileW / FindNextFileW ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When the loop exits, the program has traversed the full tree. Counts are
now propagated up the tree using the <code class="language-plaintext highlighter-rouge">link</code> field, pointing from leaves to
root. In this direction it’s just a linked list. Propagation starts at the
root and works towards leaves to avoid multiple-counting, and the
breadth-first <code class="language-plaintext highlighter-rouge">dirs</code> is already ordered for this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span><span class="p">;</span> <span class="n">j</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">link</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nbytes</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">nlines</span> <span class="o">+=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nlines</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since this is really another traversal, it could be done during the
first traversal. However, line counting will be done concurrently, and
it’s easier, and probably more efficient, to propagate concurrent results
after the concurrent part of the code is complete.</p>

<h3 id="inverting-the-tree-links">Inverting the tree links</h3>

<p>Printing the graph will require a depth-first traversal. Given an entry,
the program will iterate over its children. However, the tree links are
currently backwards, pointing from child to parent:</p>

<p><a href="/img/diagram/bfs0.dot"><img src="/img/diagram/bfs0.png" alt="" /></a></p>

<p>To traverse from root to leaves, those links will need to be inverted:</p>

<p><a href="/img/diagram/bfs1.dot"><img src="/img/diagram/bfs1.png" alt="" /></a></p>

<p>There’s only one <code class="language-plaintext highlighter-rouge">link</code> on each node, but potentially multiple
children. The breadth-first traversal comes to the rescue: All child nodes
for a given directory are adjacent in <code class="language-plaintext highlighter-rouge">dirs</code>. If <code class="language-plaintext highlighter-rouge">link</code> points to the
first child, finding the rest is trivial. There’s an implicit link between
siblings by virtue of position:</p>

<p><a href="/img/diagram/bfs2.dot"><img src="/img/diagram/bfs2.png" alt="" /></a></p>

<p>An entry’s first child immediately follows the previous entry’s last
child. So to flip the links around, manually establish the root’s <code class="language-plaintext highlighter-rouge">link</code>
field, then walk the tree breadth-first and hook <code class="language-plaintext highlighter-rouge">link</code> up to each entry’s
children based on the previous entry’s <code class="language-plaintext highlighter-rouge">link</code> and <code class="language-plaintext highlighter-rouge">nsubdirs</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dirs</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">link</span> <span class="o">+</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The tree is now restructured for sorting and depth-first traversal.</p>

<h3 id="sort-by-line-count">Sort by line count</h3>

<p>I won’t include it here, but I have a <code class="language-plaintext highlighter-rouge">qsort</code>-compatible comparison
function, <code class="language-plaintext highlighter-rouge">dircmp</code>, that compares by line count descending, then by name
ascending. Since this is a file system tree, siblings cannot have equal names.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">dircmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Since child entries are adjacent, it’s trivial to <code class="language-plaintext highlighter-rouge">qsort</code> each entry’s
children. A loop sorts the whole tree:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">dir</span> <span class="o">*</span><span class="n">beg</span> <span class="o">=</span> <span class="n">dirs</span> <span class="o">+</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span><span class="p">;</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">beg</span><span class="p">,</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">dirs</span><span class="p">),</span> <span class="n">dircmp</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We’re almost to the finish line.</p>

<h3 id="depth-first-traversal">Depth-first traversal</h3>

<p>As I said, recursion makes me nervous, so I took the slightly more
complicated route of an explicit stack. Path components must be separated
by a backslash delimiter, so the deepest possible stack is <code class="language-plaintext highlighter-rouge">MAX_PATH/2</code>.
Each stack element tracks a directory handle (<code class="language-plaintext highlighter-rouge">d</code>) and a subdirectory
index (<code class="language-plaintext highlighter-rouge">i</code>).</p>

<p>I have a <code class="language-plaintext highlighter-rouge">printstat</code> function to output an entry. It takes an entry, the string
buffer, and a depth for indentation level.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">printstat</span><span class="p">(</span><span class="k">struct</span> <span class="n">dir</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">int</span> <span class="n">depth</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s a simplified depth-first traversal calling <code class="language-plaintext highlighter-rouge">printstat</code>. (The real
one has to make decisions about when to stop and summarize, and it’s
dominated by edge cases.) I initialize the stack with the root directory,
then loop until it’s empty.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// top of stack</span>
<span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">d</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span> <span class="n">stack</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="o">/</span><span class="mi">2</span><span class="p">];</span>

<span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">printstat</span><span class="p">(</span><span class="n">dirs</span><span class="o">+</span><span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>

<span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">d</span> <span class="o">=</span> <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span><span class="o">++</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">d</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">n</span><span class="o">--</span><span class="p">;</span>  <span class="c1">// pop</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">d</span><span class="p">].</span><span class="n">link</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">printstat</span><span class="p">(</span><span class="n">dirs</span><span class="o">+</span><span class="n">cur</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
        <span class="n">n</span><span class="o">++</span><span class="p">;</span>  <span class="c1">// push</span>
        <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
        <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="concurrency">Concurrency</h3>

<p>At this point the “process file” part of traversal was straightforward: a
<code class="language-plaintext highlighter-rouge">CreateFile</code>, a <code class="language-plaintext highlighter-rouge">ReadFile</code> loop, then <code class="language-plaintext highlighter-rouge">CloseHandle</code>. I suspected it spent most of
its time in the loop counting newlines since I didn’t do anything special,
<a href="/blog/2021/12/04/">like SIMD</a>, aside from <a href="/blog/2019/12/09/">not over-constraining code
generation</a>.</p>

<p>However, after taking some measurements, I found the program was spending
99.9% of its time waiting on Win32 functions. <code class="language-plaintext highlighter-rouge">CreateFile</code> was the most
expensive at nearly 50% of the total run time, and even <code class="language-plaintext highlighter-rouge">CloseHandle</code> was
a substantial blocker. These two alone meant overlapped I/O wouldn’t help
much, and threads were necessary to run these Win32 blockers concurrently.
Counting newlines, even over gigabytes of data, was practically free, and
so required no further attention.</p>

<p>So I set up <a href="/blog/2022/05/14/">my lock-free work queue</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define QUEUE_LEN (1&lt;&lt;15)
</span><span class="k">struct</span> <span class="n">queue</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">q</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">d</span><span class="p">[</span><span class="n">QUEUE_LEN</span><span class="p">];</span>
    <span class="kt">int32_t</span> <span class="n">f</span><span class="p">[</span><span class="n">QUEUE_LEN</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>As before, <code class="language-plaintext highlighter-rouge">q</code> here is the atomic. A max-size queue of <code class="language-plaintext highlighter-rouge">QUEUE_LEN</code> elements worked
best in my tests: it was rarely full, or empty, except at startup and
shutdown. Queue elements are a pair of directory handle (<code class="language-plaintext highlighter-rouge">d</code>)
and file string handle (<code class="language-plaintext highlighter-rouge">f</code>), stored in separate arrays.</p>

<p>I didn’t need to push the file name strings into the string buffer before,
but now it’s a great way to supply strings to other threads. I push the
string into the buffer, then send the handle through the queue. The
recipient re-constructs the path on its end using the directory tree and
this file name. Unfortunately this puts more stress on the string buffer,
which is why I had to max out the size, but it’s worth it.</p>
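<p>The string-handle scheme can be sketched like so. This is my guess at the
shape of <code class="language-plaintext highlighter-rouge">buf_push</code>, whose definition isn’t shown here: the struct layout and
behavior are assumptions, with only the name and its usage taken from the
surrounding code.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

// Hypothetical flat string buffer: a "handle" is just the offset at
// which a NUL-terminated wide string begins. (Layout assumed, not
// from the article.)
struct buf {
    int32_t len;
    wchar_t buf[1<<20];
};

// Append a NUL-terminated wide string and return its offset handle.
// Error handling for a full buffer is omitted from this sketch.
static int32_t buf_push(struct buf *b, const wchar_t *s)
{
    int32_t handle = b->len;
    do {
        b->buf[b->len++] = *s;
    } while (*s++);
    return handle;
}
```

<p>A recipient can later recover the string as <code class="language-plaintext highlighter-rouge">b-&gt;buf + handle</code>, which is
how a worker would rebuild a path from the directory tree plus this file
name.</p>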

<p>The “process files” part now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="n">fd</span><span class="p">.</span><span class="n">nFileSizeLow</span><span class="p">;</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">fd</span><span class="p">.</span><span class="n">nFileSizeHigh</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span><span class="p">;</span>

<span class="kt">int32_t</span> <span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">fd</span><span class="p">.</span><span class="n">cFileName</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">queue</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">name</span><span class="p">))</span> <span class="p">{</span>
    <span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
    <span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">buf</span><span class="p">.</span><span class="n">buf</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="n">processfile</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">queue_send()</code> returns false then the queue is full, so it processes
the job itself. There might be room later for the next file.</p>
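<p>Here’s a plausible shape for <code class="language-plaintext highlighter-rouge">queue_send</code>, built on the push half of the
lock-free queue, reproduced from its own article. The wrapper itself is an
assumption: only its name, arguments, and false-on-full behavior appear
above, and <code class="language-plaintext highlighter-rouge">q</code> is declared <code class="language-plaintext highlighter-rouge">_Atomic</code> here so the sketch stands alone.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_EXP 15              // QUEUE_LEN == 1<<15

struct queue {
    _Atomic uint32_t q;           // head/tail pair
    int32_t d[1<<QUEUE_EXP];      // directory handles
    int32_t f[1<<QUEUE_EXP];      // file string handles
};

// Push half of the 32-bit lock-free queue (from the queue article).
static int queue_push(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;
    int mask = (1u << exp) - 1;
    int head = r       & mask;
    int tail = r >> 16 & mask;
    int next = (head + 1u) & mask;
    if (r & 0x8000) {             // avoid overflow on commit
        *q &= ~0x8000;
    }
    return next == tail ? -1 : head;
}

static void queue_push_commit(_Atomic uint32_t *q)
{
    *q += 1;
}

// Non-blocking send of a (directory, filename) pair. Returns false
// when the queue is full, so the caller processes the file itself.
static _Bool queue_send(struct queue *q, int32_t d, int32_t f)
{
    int i = queue_push(&q->q, QUEUE_EXP);
    if (i < 0) {
        return 0;
    }
    q->d[i] = d;
    q->f[i] = f;
    queue_push_commit(&q->q);
    return 1;
}
```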

<p>Worker threads look similar, spinning until an item arrives in the queue:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">d</span><span class="p">;</span>
        <span class="kt">int32_t</span> <span class="n">name</span><span class="p">;</span>
        <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_recv</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">d</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">name</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">d</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
        <span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
        <span class="n">processfile</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">d</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>A special directory entry handle of -1 tells the worker to exit. When
traversal completes, the main thread becomes a worker until the queue
empties, pushes one termination handle for each worker thread, then joins
the worker threads — a synchronization point that indicates all work is
complete, and the main thread can move on to propagation and sorting.</p>

<p>This was a substantial performance boost. At least on my system, running
just 4 threads total is enough to saturate the Win32 interface, and
additional threads do not make the program faster despite more available
cores.</p>

<p>Aside from the lack of portability, I’m quite happy with the results.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A lock-free, concurrent, generic queue in 32 bits</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/05/14/"/>
    <id>urn:uuid:b5a6b85a-19af-4f2f-8a32-0098f6e87edb</id>
    <updated>2022-05-14T04:22:24Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=31384602">on Hacker News</a>.</em></p>

<p>While considering concurrent queue design I came up with a generic,
lock-free queue that fits in a 32-bit integer. The queue is “generic” in
that a single implementation supports elements of any arbitrary type,
despite an implementation in C. It’s lock-free in that there is guaranteed
system-wide progress. It can store up to 32,767 elements at a time — more
than enough for message queues, which <a href="/blog/2020/05/24/">must always be bounded</a>. I
will first present a single-consumer, single-producer queue, then expand
support to multiple consumers at a cost. Like <a href="/blog/2022/03/13/">my lightweight barrier</a>,
I’m not presenting this as a packaged solution, but rather as a technique
you can apply when circumstances call.</p>

<!--more-->

<p>How can the queue store so many elements when it’s just 32 bits? It only
handles the indexes of a circular buffer. The <a href="/blog/2018/06/10/">caller is responsible</a>
for allocating and manipulating the queue’s storage, which, in the
single-consumer case, doesn’t require anything fancy. Synchronization is
managed by the queue.</p>

<p>Like a typical circular buffer, it has a head index and a tail index. The
head is the next element to be pushed, and the tail is the next element to
be popped. The queue storage must have a power-of-two length, but the
capacity is one less than the length: if the head and tail are equal then
the queue is empty, which “wastes” one element. So already there are some
notable constraints imposed by this design, but I believe the main use
case for such a queue — a job queue for CPU-bound jobs — has no problem
with these constraints.</p>

<p>Since this is a concurrent queue it’s worth noting “ownership” of storage
elements. The consumer owns elements from the tail up to, but excluding,
the head. The producer owns everything else. Both pushing and popping
involve a “commit” step that transfers ownership of an element to the
other thread. No elements are accessed concurrently, which makes things
easy for either caller.</p>

<h3 id="queue-usage">Queue usage</h3>

<p>Pushing (to the front) and popping (from the back) are each a three-step
process:</p>

<ol>
  <li>Obtain the element index</li>
  <li>Access that element</li>
  <li>Commit the operation</li>
</ol>

<p>I’ll be using C11 atomics for my implementation, but it should be easy to
translate into whatever programming language you’re using. As
I mentioned, the queue fits in a 32-bit integer, and so it’s represented
by an <code class="language-plaintext highlighter-rouge">_Atomic uint32_t</code>. Here’s the entire interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">queue_pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">queue_pop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">);</span>

<span class="kt">int</span>  <span class="nf">queue_push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">queue_push_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">);</span>
</code></pre></div></div>

<p>Both <code class="language-plaintext highlighter-rouge">queue_pop</code> and <code class="language-plaintext highlighter-rouge">queue_push</code> return -1 if the queue is empty or full, respectively.</p>

<p>To create a queue, initialize an atomic 32-bit integer to zero. Also
choose a size exponent and allocate some storage. Here’s a 63-element
queue of jobs:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EXP 6  // note: 2**6 == 64
</span><span class="k">struct</span> <span class="n">job</span> <span class="n">slots</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
<span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="n">q</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>Rather than a length, the queue functions accept a base-2 exponent, which
is why I’ve defined <code class="language-plaintext highlighter-rouge">EXP</code>. If you don’t like this, you can just accept a
length in your own implementation, though remember it’s constrained to
powers of two. The producer might look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">queue_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while full</span>
    <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">job_create</span><span class="p">();</span>
    <span class="n">queue_push_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is a busy-wait loop, which makes for a simple illustration but isn’t
ideal. In a <a href="/blog/2022/05/22/">real program</a> I’d have the producer run a job while it
waits for a queue slot, or just have it turn into a consumer (if this
wasn’t a single-consumer queue). Similarly, if the queue is empty, then
maybe a consumer turns into the producer. It all depends on the context.</p>

<p>The consumer might look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">queue_pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while empty</span>
    <span class="k">struct</span> <span class="n">job</span> <span class="n">job</span> <span class="o">=</span> <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="n">queue_pop_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">);</span>
    <span class="n">job_run</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In either case it’s important that neither touches the element after
committing since that transfers ownership away.</p>

<h3 id="pop-operation">Pop operation</h3>

<p>The queue is actually a pair of 16-bit integers, head and tail, each
stored in the low and high halves of the 32-bit integer. So the first
thing to do is atomically load the integer, then extract these “fields.”</p>

<p>If for some reason a capacity of 32,767 is insufficient, you can trivially
upgrade your queue to an Enterprise Queue: a 64-bit integer with a
capacity of over 2 billion elements. I’m going to stick with the 32-bit
queue.</p>
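<p>For concreteness, here’s what the pop half of that 64-bit variant might
look like. The <code class="language-plaintext highlighter-rouge">equeue</code> names are mine, not from the article; it’s the same
logic with the shift widened from 16 to 32.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// Sketch of a 64-bit "Enterprise" variant: 32-bit head/tail fields.
static int64_t equeue_pop(_Atomic uint64_t *q, int exp)
{
    uint64_t r = *q;
    uint64_t mask = ((uint64_t)1 << exp) - 1;
    uint64_t head = r       & mask;
    uint64_t tail = r >> 32 & mask;
    return head == tail ? -1 : (int64_t)tail;
}

static void equeue_pop_commit(_Atomic uint64_t *q)
{
    *q += (uint64_t)1 << 32;  // bump the tail field
}
```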

<p>Starting with the pop operation since it’s simpler:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>  <span class="c1">// consider "acquire"</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the indexes are equal, the queue is empty. Otherwise return the tail
field. The <code class="language-plaintext highlighter-rouge">*q</code> is an atomic load since it’s qualified <code class="language-plaintext highlighter-rouge">_Atomic</code>. The load
might be more efficient if this were an explicit “acquire” operation,
which is what I used in some of my tests.</p>

<p>To complete the pop, atomically increment the tail index so that the
element falls out of the range of elements owned by the consumer. The tail
is the high half of the integer so add <code class="language-plaintext highlighter-rouge">0x10000</code> rather than just 1.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">queue_pop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">q</span> <span class="o">+=</span> <span class="mh">0x10000</span><span class="p">;</span>  <span class="c1">// consider "release"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s harmless if this overflows since the 16-bit wraparound is congruent
modulo the power-of-two storage length, and an overflow won’t affect the head index. The increment
might be more efficient if this were an explicit “release” operation,
which, again, is what I used in some of my tests.</p>
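<p>The overflow claim is easy to check directly. Here are the two functions
again, exactly as defined above, reproduced only so the wraparound can be
exercised in isolation:</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// queue_pop and queue_pop_commit as defined in the article.
static int queue_pop(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;
    int mask = (1u << exp) - 1;
    int head = r       & mask;
    int tail = r >> 16 & mask;
    return head == tail ? -1 : tail;
}

static void queue_pop_commit(_Atomic uint32_t *q)
{
    *q += 0x10000;
}
```

<p>With the raw tail counter at 0xFFFF and the head at 0, the queue holds one
element at index 63 (for an exponent of 6). Committing the pop overflows the
32-bit integer, wrapping the tail counter to 0 without disturbing the head
field, after which the queue correctly reads as empty.</p>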

<h3 id="push-operation">Push operation</h3>

<p>Pushing is a little more complex. As is typical with circular buffers,
before doing anything it must check that the queue isn’t full: advancing the
head onto the tail would make a full queue indistinguishable from an empty
one.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>  <span class="c1">// consider "acquire"</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">next</span> <span class="o">=</span> <span class="p">(</span><span class="n">head</span> <span class="o">+</span> <span class="mi">1u</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&amp;</span> <span class="mh">0x8000</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// avoid overflow on commit</span>
        <span class="o">*</span><span class="n">q</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="mh">0x8000</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">next</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s important that incrementing the head field won’t overflow into the
tail field, so it atomically clears the high bit if set, giving the
increment headroom into which it can overflow.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">queue_push_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">q</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// consider "release"</span>
<span class="p">}</span>
</code></pre></div></div>
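<p>That bit trick can be spot-checked: clearing bit 15 subtracts 0x8000,
which is a multiple of any power-of-two storage length with an exponent of
at most 15, so the masked head index is unchanged. A minimal check, with a
helper name of my own choosing:</p>

```c
#include <assert.h>
#include <stdint.h>

// The head index as queue_push computes it from the raw head field.
static uint32_t masked(uint32_t head, int exp)
{
    return head & ((1u << exp) - 1);
}
```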

<h3 id="multiple-consumers">Multiple-consumers</h3>

<p>The single producer and single consumer required neither locks nor atomic
accesses to the storage array since the queue guaranteed that accesses at
the specified index were not concurrent. However, this is not the case
with multiple-consumers. Consumers race when popping. The loser’s access
might occur after the winner’s commit, making its access concurrent with
the producer. Both producer and consumers must account for this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">job</span> <span class="n">slots</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
</code></pre></div></div>

<p>To prepare for multiple consumers, the array now has an atomic qualifier:
one of the costs of multiple consumers. Fortunately these new atomic
accesses can use a “relaxed” ordering since there are no required ordering
constraints. Even if it wasn’t atomic, and <a href="https://lwn.net/Articles/793253/">the load was torn</a>, we’d
detect it when attempting to commit. It’s simply against the rules to have
a data race, and I don’t know how else to avoid it other than dropping
into assembly.</p>

<p>The next cost is that committing can fail. Another consumer might have won
the race, which means you must start over. Here’s my multiple-consumer
interface, which I’ve uncreatively called <code class="language-plaintext highlighter-rouge">mpop</code> (“multiple-consumer
pop”). Besides a <code class="language-plaintext highlighter-rouge">_Bool</code> for indicating failure, the main change is a new
<code class="language-plaintext highlighter-rouge">save</code> parameter:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>   <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">save</span><span class="p">);</span>
<span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">);</span>
</code></pre></div></div>

<p>The caller must carry some temporary state (<code class="language-plaintext highlighter-rouge">save</code>), which is how failures
are detected, ultimately communicated by that <code class="language-plaintext highlighter-rouge">_Bool</code> return.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">job</span> <span class="n">job</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="k">do</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="n">queue_mpop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while empty</span>
        <span class="n">job</span> <span class="o">=</span> <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_mpop_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">save</span><span class="p">));</span>
    <span class="n">job_run</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s important that the consumer doesn’t attempt to use <code class="language-plaintext highlighter-rouge">job</code> until a
successful commit, since it might not be valid. As noted, that load could
be relaxed (what a mouthful):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">job</span> <span class="o">=</span> <span class="n">atomic_load_explicit</span><span class="p">(</span><span class="n">slots</span><span class="o">+</span><span class="n">i</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s the pop implementation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">save</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far it’s exactly the same, except it stores a full snapshot of the
queue state in <code class="language-plaintext highlighter-rouge">*save</code>. This is needed for a compare-and-swap (CAS) in the
commit, which checks that the queue hasn’t been modified concurrently
(i.e. by another consumer):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_compare_exchange_strong</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">,</span> <span class="n">save</span><span class="o">+</span><span class="mh">0x10000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As always with CAS, we must be wary of <a href="/blog/2014/09/02/">the ABA problem</a>. Imagine
that between starting to pop and this CAS that the producer and another
consumer looped over the entire queue and ended up back at exactly the
same spot as where we started. The queue would look like we expect, and
the commit would “succeed” despite reading a garbage value.</p>
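<p>To see the wraparound concretely (a sketch of my own, separate from the article’s code): each successful pop commit adds <code class="language-plaintext highlighter-rouge">0x10000</code> to the packed state, so after exactly 2<sup>16</sup> commits the 32-bit word is bit-for-bit the saved snapshot again, and a stale CAS would succeed. (A real ABA run would also need interleaved pushes to keep the queue non-empty; those are omitted here.)</p>

```c
#include <stdint.h>

// Illustration only: replay n pop commits against a packed 32-bit
// queue state. Each commit adds 0x10000, exactly what the CAS in
// queue_mpop_commit installs, so 1<<16 commits wrap back around.
static uint32_t after_commits(uint32_t state, long n)
{
    while (n--) {
        state += 0x10000;  // one committed pop
    }
    return state;          // == original state when n was 1<<16
}
```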

<p>Fortunately the CAS compares against the entire 32-bit state, so a small
queue capacity does not raise the risk. The tail counter is always 16 bits,
and the head counter is 15 bits (the 16th bit is kept clear to catch
overflow). The chance of both counters landing on exactly the same values is
low. If those odds still aren’t low enough, as mentioned you can always
upgrade to the 64-bit Enterprise Queue with larger counters.</p>

<p>There’s a notable performance defect in this particular design. If the
producer concurrently pushes a new value, the commit fails even though there
was no real race: only the head field changed. It would be better if the head
field were isolated from the tail field…</p>

<h3 id="a-less-cheeky-design">A less cheeky design</h3>

<p>You might have noticed that there’s little reason to pack two 16-bit
counters into a 32-bit integer. These could just be fields in a structure:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">queue</span> <span class="p">{</span>
    <span class="k">_Atomic</span> <span class="kt">uint16_t</span> <span class="n">head</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="kt">uint16_t</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>While this entire structure can be atomically loaded just like the 32-bit
integer, C11 (and later) do not permit non-atomic accesses to these atomic
fields in an unshared copy loaded from an atomic. So I’d either use
compiler-specific built-ins for atomics — much more flexible, and what I
prefer anyway — or just load them individually:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">struct</span> <span class="n">queue</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="o">*</span><span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">q</span><span class="o">-&gt;</span><span class="n">head</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">save</span> <span class="o">=</span> <span class="n">q</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Technically with two loads this could extract a <code class="language-plaintext highlighter-rouge">head</code>/<code class="language-plaintext highlighter-rouge">tail</code> pair that
were never contemporaneous. The worst case is the queue appears empty even
if it was never actually empty.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">struct</span> <span class="n">queue</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_compare_exchange_strong</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">,</span> <span class="n">save</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the head index isn’t part of the CAS, the producer can’t interfere
with the commit. (Though there’s still certainly false sharing happening.)</p>

<h3 id="real-implementation-and-tests">Real implementation and tests</h3>

<p>If you want to try it out, especially with my tests: <a href="https://github.com/skeeto/scratch/blob/master/misc/queue.c"><strong>queue.c</strong></a>.
It has both single-consumer and multiple-consumer queues, and supports at
least:</p>

<ul>
  <li>atomics: C11, GNU, MSC</li>
  <li>threads: pthreads, win32</li>
  <li>compilers: GCC, Clang, MSC</li>
  <li>hosts: Linux, Windows, BSD</li>
</ul>

<p>I wanted to test across a variety of implementations, especially
under Thread Sanitizer (TSan). On a similar note, I also implemented a
concurrent queue shared between C and Go: <a href="https://github.com/skeeto/scratch/blob/master/misc/queue.go"><strong>queue.go</strong></a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Luhn algorithm using SWAR and SIMD</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/04/30/"/>
    <id>urn:uuid:2bb8fbd6-4197-4799-8258-861d316a7086</id>
    <updated>2022-04-30T17:53:05Z</updated>
    <category term="c"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Ever been so successful that credit card processing was your bottleneck?
Perhaps you’ve wondered, “If only I could compute check digits three times
faster using the same hardware!” Me neither. But if that ever happens
someday, then this article is for you. I will show how to compute the
<a href="https://en.wikipedia.org/wiki/Luhn_algorithm">Luhn algorithm</a> in parallel using <em>SIMD within a register</em>, or
SWAR.</p>

<p>If you want to skip ahead, here’s the full source, tests, and benchmark:
<a href="https://github.com/skeeto/scratch/blob/master/misc/luhn.c"><code class="language-plaintext highlighter-rouge">luhn.c</code></a></p>

<p>The Luhn algorithm isn’t just for credit card numbers, but they do make a
nice target for a SWAR approach. The major payment processors use <a href="https://www.paypalobjects.com/en_GB/vhelp/paypalmanager_help/credit_card_numbers.htm">16
digit numbers</a> — i.e. 16 ASCII bytes — and typical machines today have
8-byte registers, so the input fits into two machine registers. In this
context, the algorithm works like so:</p>

<ol>
  <li>
    <p>Consider the number as an array of digits, and double every other digit
starting with the first. For example, 6543 becomes 12, 5, 8, 3.</p>
  </li>
  <li>
    <p>Sum individual digits in each element. The example becomes 3 (i.e.
1+2), 5, 8, 3.</p>
  </li>
  <li>
    <p>Sum the array mod 10. Valid inputs sum to zero. The example sums to 9.</p>
  </li>
</ol>
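<p>As a reference point, the straightforward digit-by-digit version (a minimal sketch of the baseline the SWAR code is later compared against) looks like this:</p>

```c
#include <stdint.h>

// Scalar reference: one digit at a time, following the three steps
// above. Assumes s points to exactly 16 ASCII digits.
static int luhn_scalar(const char *s)
{
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        int d = s[i] - '0';
        if (i%2 == 0) {          // double every other digit, first included
            d *= 2;
            d = d/10 + d%10;     // sum the digits of the doubled value
        }
        sum += d;
    }
    return sum % 10;             // zero for valid numbers
}
```

The classic test number 4111111111111111 sums to zero, for example.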

<p>I will implement this algorithm in C with this prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>It assumes the input is 16 bytes and only contains digits, and it will
return the Luhn sum. Callers either validate a number by comparing the
result to zero, or use it to compute a check digit when generating a
number. (Read: You could use SWAR to rapidly generate valid numbers.)</p>

<p>The plan is to process the 16-digit number in two halves, and so first
load the halves into 64-bit registers, which I’m calling <code class="language-plaintext highlighter-rouge">hi</code> and <code class="language-plaintext highlighter-rouge">lo</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">hi</span> <span class="o">=</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">0</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">1</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">2</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">3</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">4</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">5</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">6</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">7</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">lo</span> <span class="o">=</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">8</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">9</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">10</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">11</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">13</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">14</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">15</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">;</span>
</code></pre></div></div>

<p>This looks complicated and possibly expensive, but it’s really just an
idiom for loading a little endian 64-bit integer from a buffer. Breaking
it down:</p>

<ul>
  <li>
    <p>The input, <code class="language-plaintext highlighter-rouge">*s</code>, is <code class="language-plaintext highlighter-rouge">char</code>, which may be signed on some architectures. I
chose this type since it’s the natural type for strings. However, I do
not want sign extension, so I mask the low byte of the possibly-signed
result by ANDing with 255. It’s as though <code class="language-plaintext highlighter-rouge">*s</code> was <code class="language-plaintext highlighter-rouge">unsigned char</code>.</p>
  </li>
  <li>
    <p>The shifts assemble the 64-bit result in little endian byte order
<a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">regardless of the host machine byte order</a>. In other words, this
will produce correct results even on big endian hosts.</p>
  </li>
  <li>
    <p>I chose little endian since it’s the natural byte order for all the
architectures I care about. Big endian hosts may pay a cost on this load
(byte swap instruction, etc.). The rest of the function could just as
easily be computed over a big endian load if I was primarily targeting a
big endian machine instead.</p>
  </li>
  <li>
    <p>I could have used <code class="language-plaintext highlighter-rouge">unsigned long long</code> (i.e. <em>at least</em> 64 bits) since
no part of this function requires <em>exactly</em> 64 bits. I chose <code class="language-plaintext highlighter-rouge">uint64_t</code>
since it’s succinct, and in practice, every implementation supporting
<code class="language-plaintext highlighter-rouge">long long</code> also defines <code class="language-plaintext highlighter-rouge">uint64_t</code>.</p>
  </li>
</ul>

<p>Both GCC and Clang figure this all out and produce perfect code. On
x86-64, just one instruction for each statement:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">0</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">rdx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Or, more impressively, loading both using a <em>single instruction</em> on ARM64:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">ldp</span>  <span class="nv">x0</span><span class="p">,</span> <span class="nv">x1</span><span class="p">,</span> <span class="p">[</span><span class="nv">x0</span><span class="p">]</span>
</code></pre></div></div>

<p>The next step is to decode ASCII into numeric values. This is <a href="https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/">trivial and
common</a> in SWAR, and only requires subtracting <code class="language-plaintext highlighter-rouge">'0'</code> (<code class="language-plaintext highlighter-rouge">0x30</code>). So long
as there is no overflow, this can be done lane-wise.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">-=</span> <span class="mh">0x3030303030303030</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">-=</span> <span class="mh">0x3030303030303030</span><span class="p">;</span>
</code></pre></div></div>

<p>Each byte of the register now contains values in 0–9. Next, double every
other digit. Multiplication in SWAR is not easy, but doubling just means
adding the odd lanes to themselves. I can mask out the lanes that are not
doubled. Regarding the mask, recall that the least significant byte is the
first byte (little endian).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&amp;</span> <span class="mh">0x00ff00ff00ff00ff</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">+=</span> <span class="n">lo</span> <span class="o">&amp;</span> <span class="mh">0x00ff00ff00ff00ff</span><span class="p">;</span>
</code></pre></div></div>

<p>Each byte of the register now contains values in 0–18. Now for the tricky
problem of folding the tens place into the ones place. Unlike 8 or 16, 10
is not a particularly convenient base for computers, especially since SWAR
lacks lane-wide division or modulo. Perhaps a lane-wise <a href="https://en.wikipedia.org/wiki/Binary-coded_decimal">binary-coded
decimal</a> could solve this. However, I have a better trick up my
sleeve.</p>

<p>Consider that the tens place is either 0 or 1. In other words, we really
only care if the value in the lane is greater than 9. If I add 6 to each
lane, the 5th bit (value 16) will definitely be set in any lanes that were
previously at least 10. I can use that bit as the tens place.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="p">(</span><span class="n">hi</span> <span class="o">+</span> <span class="mh">0x0006000600060006</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="mi">4</span> <span class="o">&amp;</span> <span class="mh">0x0001000100010001</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">+=</span> <span class="p">(</span><span class="n">lo</span> <span class="o">+</span> <span class="mh">0x0006000600060006</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="mi">4</span> <span class="o">&amp;</span> <span class="mh">0x0001000100010001</span><span class="p">;</span>
</code></pre></div></div>

<p>This code adds 6 to the doubled lanes, shifts the 5th bit to the least
significant position in the lane, masks for just that bit, and adds it
lane-wise to the total. Only applying this to doubled lanes is a style
decision, and I could have applied it to all lanes for free.</p>

<p>The astute might notice I’ve strayed from the stated algorithm. A lane
that was holding, say, 12 now holds 13 rather than 3. Since the final
result of the algorithm is modulo 10, leaving the tens place alone is
harmless, so this is fine.</p>

<p>At this point each lane contains values in 0–19. Now that the tens
processing is done, I can combine the halves into one register with a
lane-wise sum:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">lo</span><span class="p">;</span>
</code></pre></div></div>

<p>Each lane contains values in 0–38. I would have preferred to do this
sooner, but that would have complicated tens place handling. Even if I had
rotated the doubled lanes in one register to even out the sums, some lanes
may still have had a 2 in the tens place.</p>

<p>The final step is a horizontal sum reduction using the typical SWAR
approach. Add the top half of the register to the bottom half, then the
top half of what’s left to the bottom half, etc.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
</code></pre></div></div>

<p>Before the sum I said each lane was 0–38, so couldn’t this sum be as high
as 304 (8x38)? It would overflow the lane, giving an incorrect result.
Fortunately the actual range is 0–18 for normal lanes and 0–38 for doubled
lanes. That’s a maximum of 224, which fits in the result lane without
overflow. Whew! I’ve been tracking the range all along to guard against
overflow like this.</p>
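<p>Spelling out that arithmetic: four lanes came from undoubled digits (at most 9+9=18 each after combining) and four from doubled digits (at most 19+19=38 each):</p>

```c
// Worked check of the overflow bound above: the horizontal reduction
// accumulates four normal lanes and four doubled lanes into one byte.
enum { LUHN_MAX_SUM = 4*18 + 4*38 };  // 224, fits in a byte lane
```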

<p>Finally mask the result lane and return it modulo 10:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="p">(</span><span class="n">hi</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<p>On my machine, SWAR is around 3x faster than a straightforward
digit-by-digit implementation.</p>
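<p>For convenience, here are the pieces above assembled into a single function (the article’s own construction; the linked <code class="language-plaintext highlighter-rouge">luhn.c</code> has the tested version):</p>

```c
#include <stdint.h>

// The SWAR steps assembled: little endian loads, ASCII decode, doubling,
// tens-place fold, combining the halves, and the horizontal sum.
static int luhn_swar(const char *s)
{
    uint64_t hi =
        (uint64_t)(s[ 0]&255) <<  0 | (uint64_t)(s[ 1]&255) <<  8 |
        (uint64_t)(s[ 2]&255) << 16 | (uint64_t)(s[ 3]&255) << 24 |
        (uint64_t)(s[ 4]&255) << 32 | (uint64_t)(s[ 5]&255) << 40 |
        (uint64_t)(s[ 6]&255) << 48 | (uint64_t)(s[ 7]&255) << 56;
    uint64_t lo =
        (uint64_t)(s[ 8]&255) <<  0 | (uint64_t)(s[ 9]&255) <<  8 |
        (uint64_t)(s[10]&255) << 16 | (uint64_t)(s[11]&255) << 24 |
        (uint64_t)(s[12]&255) << 32 | (uint64_t)(s[13]&255) << 40 |
        (uint64_t)(s[14]&255) << 48 | (uint64_t)(s[15]&255) << 56;
    hi -= 0x3030303030303030;           // decode ASCII
    lo -= 0x3030303030303030;
    hi += hi & 0x00ff00ff00ff00ff;      // double every other digit
    lo += lo & 0x00ff00ff00ff00ff;
    hi += (hi + 0x0006000600060006)>>4 & 0x0001000100010001;  // tens place
    lo += (lo + 0x0006000600060006)>>4 & 0x0001000100010001;
    hi += lo;                           // combine halves
    hi += hi >> 32;                     // horizontal sum reduction
    hi += hi >> 16;
    hi += hi >>  8;
    return (hi&255) % 10;
}
```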

<h3 id="usage-examples">Usage examples</h3>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">is_valid</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">luhn</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">random_credit_card</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"%015llu0"</span><span class="p">,</span> <span class="n">rand64</span><span class="p">()</span><span class="o">%</span><span class="mi">1000000000000000</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">15</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">+</span> <span class="p">(</span><span class="mi">10</span> <span class="o">-</span> <span class="n">luhn</span><span class="p">(</span><span class="n">s</span><span class="p">))</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="simd">SIMD</h3>

<p>Conveniently, all the SWAR operations translate directly into SSE2
instructions. If you understand the SWAR version, then this is easy to
follow:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__m128i</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_loadu_si128</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">);</span>

    <span class="c1">// decode ASCII</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sub_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mh">0x30</span><span class="p">));</span>

    <span class="c1">// double every other digit</span>
    <span class="n">__m128i</span> <span class="n">m</span> <span class="o">=</span> <span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="mh">0x00ff</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">m</span><span class="p">));</span>

    <span class="c1">// extract and add tens digit</span>
    <span class="n">__m128i</span> <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="mh">0x0006</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_srai_epi32</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>

    <span class="c1">// horizontal sum</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sad_epu8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_set1_epi32</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi32</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_shuffle_epi32</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">_mm_cvtsi128_si32</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On my machine, the SIMD version is around another 3x increase over SWAR,
and so nearly an order of magnitude faster than a digit-by-digit
implementation.</p>

<p><em>Update</em>: Const-me on Hacker News <a href="https://news.ycombinator.com/item?id=31320853">suggests a better option</a> for
handling the tens digit in the function above, shaving off 7% of the
function’s run time on my machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// if (digit &gt; 9) digit -= 9</span>
    <span class="n">__m128i</span> <span class="n">nine</span> <span class="o">=</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mi">9</span><span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">gt</span> <span class="o">=</span> <span class="n">_mm_cmpgt_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">nine</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sub_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">gt</span><span class="p">,</span> <span class="n">nine</span><span class="p">));</span>
</code></pre></div></div>
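<p>The scalar identity the vector correction exploits, shown as a small
sketch (the helper name is illustrative): for a doubled Luhn digit,
subtracting 9 whenever the product exceeds 9 gives the same result as
summing its decimal digits.</p>

```c
// In the Luhn algorithm, every other digit is doubled and the digits
// of the product are summed. For a product 2*d with d in 0..9, that
// digit sum equals "subtract 9 when greater than 9", which is exactly
// what the branchless vector correction above computes.
static int luhn_fold(int doubled)
{
    return doubled > 9 ? doubled - 9 : doubled;
}
```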

<p><em>Update</em>: u/aqrit on reddit has come up with a more optimized SSE2
solution, 12% faster than mine on my machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__m128i</span> <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_loadu_si128</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">m</span> <span class="o">=</span> <span class="n">_mm_cmpgt_epi8</span><span class="p">(</span><span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="sc">'5'</span><span class="p">),</span> <span class="n">v</span><span class="p">);</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_slli_epi16</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">8</span><span class="p">));</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span>  <span class="c1">// subtract 1 if less than 5</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_sad_epu8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_setzero_si128</span><span class="p">());</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi32</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_shuffle_epi32</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">_mm_cvtsi128_si32</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">-</span> <span class="mi">4</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
    <span class="c1">// (('0' * 24) - 8) % 10 == 4</span>
<span class="p">}</span>
</code></pre></div></div>

]]>
    </content>
  </entry>
  <entry>
    <title>A flexible, lightweight, spin-lock barrier</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/03/13/"/>
    <id>urn:uuid:5a72d27a-60f4-4b52-a4c2-f1c3b72e6c85</id>
    <updated>2022-03-13T23:55:08Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=30671979">on Hacker News</a>.</em></p>

<p>The other day I wanted to try the famous <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">memory reordering experiment</a>
for myself. It’s the double-slit experiment of concurrency, where a
program can observe an <a href="https://research.swtch.com/hwmm">“impossible” result</a> on common hardware, as
though a thread had time-traveled. While getting thread timing as tight as
possible, I designed a possibly-novel thread barrier. It’s purely
spin-locked, the entire footprint is a zero-initialized integer, it
automatically resets, it can be used across processes, and the entire
implementation is just three to four lines of code.</p>

<!--more-->

<p>Here’s the entire barrier implementation for two threads in C11.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Spin-lock barrier for two threads. Initialize *barrier to zero.</span>
<span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">v</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="mi">2</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or in Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">BarrierWait</span><span class="p">(</span><span class="n">barrier</span> <span class="o">*</span><span class="kt">uint32</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">v</span> <span class="o">:=</span> <span class="n">atomic</span><span class="o">.</span><span class="n">AddUint32</span><span class="p">(</span><span class="n">barrier</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">v</span><span class="o">&amp;</span><span class="m">1</span> <span class="o">==</span> <span class="m">1</span> <span class="p">{</span>
        <span class="n">v</span> <span class="o">&amp;=</span> <span class="m">2</span>
        <span class="k">for</span> <span class="n">atomic</span><span class="o">.</span><span class="n">LoadUint32</span><span class="p">(</span><span class="n">barrier</span><span class="p">)</span><span class="o">&amp;</span><span class="m">2</span> <span class="o">==</span> <span class="n">v</span> <span class="p">{</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Even more, these two implementations are compatible with each other. C
threads and Go goroutines can synchronize on a common barrier using these
functions. Also note how it only uses two bits.</p>

<p>When I was done with my experiment, I did a quick search online for other
spin-lock barriers to see if anyone came up with the same idea. I found a
couple of <a href="https://web.archive.org/web/20151109230817/https://stackoverflow.com/questions/33598686/spinning-thread-barrier-using-atomic-builtins">subtly-incorrect</a> spin-lock barriers, and some
straightforward barrier constructions using a mutex spin-lock.</p>

<p>Before diving into how this works, and how to generalize it, let’s discuss
the circumstance that led to its design.</p>

<h3 id="experiment">Experiment</h3>

<p>Here’s the setup for the memory reordering experiment, where <code class="language-plaintext highlighter-rouge">w0</code> and <code class="language-plaintext highlighter-rouge">w1</code>
are initialized to zero.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>thread#1    thread#2
w0 = 1      w1 = 1
r1 = w1     r0 = w0
</code></pre></div></div>

<p>Considering all the possible orderings, it would seem that at least one of
<code class="language-plaintext highlighter-rouge">r0</code> or <code class="language-plaintext highlighter-rouge">r1</code> is 1. There seems to be no ordering where <code class="language-plaintext highlighter-rouge">r0</code> and <code class="language-plaintext highlighter-rouge">r1</code> could
both be 0. However, if raced precisely, this is a frequent or possibly
even majority occurrence on common hardware, including x86 and ARM.</p>

<p>How to go about running this experiment? These are concurrent loads and
stores, so it’s tempting to use <code class="language-plaintext highlighter-rouge">volatile</code> for <code class="language-plaintext highlighter-rouge">w0</code> and <code class="language-plaintext highlighter-rouge">w1</code>. However,
this would constitute a data race — undefined behavior in at least C and
C++ — and so we couldn’t really reason much about the results, at least
not without first verifying the compiler’s assembly. These are variables
in a high-level language, not architecture-level stores/loads, even with
<code class="language-plaintext highlighter-rouge">volatile</code>.</p>

<p>So my first idea was to use a bit of inline assembly for all accesses that
would otherwise be data races. x86-64:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r1</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"movl  $1, %1</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"movl  %2, %0</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="s">"=m"</span><span class="p">(</span><span class="o">*</span><span class="n">w0</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"m"</span><span class="p">(</span><span class="o">*</span><span class="n">w1</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>ARM64 (to try on my Raspberry Pi):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"str  %w0, %1</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"ldr  %w0, %2</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="s">"+r"</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="s">"=m"</span><span class="p">(</span><span class="o">*</span><span class="n">w0</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"m"</span><span class="p">(</span><span class="o">*</span><span class="n">w1</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is from the point-of-view of thread#1, but I can swap the arguments
for thread#2. I’m expecting this to be inlined, and encouraging it with
<code class="language-plaintext highlighter-rouge">static</code>.</p>

<p>Alternatively, I could use C11 atomics with a relaxed memory order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="k">_Atomic</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">atomic_store_explicit</span><span class="p">(</span><span class="n">w0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">atomic_load_explicit</span><span class="p">(</span><span class="n">w1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since this is a <em>race</em> and I want both threads to run their two experiment
instructions as simultaneously as possible, it would be wise to use some
sort of <em>starting barrier</em>… exactly the purpose of a thread barrier! It
will hold the threads back until they’re both ready.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">w0</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">r0</span><span class="p">,</span> <span class="n">r1</span><span class="p">;</span>

<span class="c1">// thread#1                   // thread#2</span>
<span class="n">w0</span> <span class="o">=</span> <span class="n">w1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">BARRIER</span><span class="p">;</span>                      <span class="n">BARRIER</span><span class="p">;</span>
<span class="n">r1</span> <span class="o">=</span> <span class="n">experiment</span><span class="p">(</span><span class="o">&amp;</span><span class="n">w0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">w1</span><span class="p">);</span>    <span class="n">r0</span> <span class="o">=</span> <span class="n">experiment</span><span class="p">(</span><span class="o">&amp;</span><span class="n">w1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">w0</span><span class="p">);</span>
<span class="n">BARRIER</span><span class="p">;</span>                      <span class="n">BARRIER</span><span class="p">;</span>

<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r0</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">r1</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"impossible!"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The second thread goes straight into the barrier, but the first thread
does a little more work to initialize the experiment and a little more at
the end to check the result. The second barrier ensures they’re both done
before checking.</p>

<p>Running this only once isn’t so useful, so each thread loops a few million
times, hence the re-initialization in thread#1. The barriers keep them
in lockstep.</p>

<h3 id="barrier-selection">Barrier selection</h3>

<p>On my first attempt, I made the obvious decision for the barrier: I used
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_barrier_wait.html"><code class="language-plaintext highlighter-rouge">pthread_barrier_t</code></a>. I was already using pthreads for spawning the
extra thread, including <a href="/blog/2020/05/15/">on Windows</a>, so this was convenient.</p>

<p>However, my initial results were disappointing. I only observed an
“impossible” result around one in a million trials. With some debugging I
determined that the pthreads barrier was just too damn slow, throwing off
the timing. This was especially true with winpthreads, bundled with
Mingw-w64, which in addition to the per-barrier mutex, grabs a <em>global</em>
lock <em>twice</em> per wait to manage the barrier’s reference counter.</p>

<p>All pthreads implementations I used were quick to yield to the system
scheduler. The first thread to arrive at the barrier would go to sleep,
the second thread would wake it up, and it was rare they’d actually race
on the experiment. This is perfectly reasonable for a pthreads barrier
designed for the general case, but I really needed a <em>spin-lock barrier</em>.
That is, the first thread to arrive spins in a loop until the second
thread arrives, and it never interacts with the scheduler. This happens so
frequently and quickly that it should only spin for a few iterations.</p>

<h3 id="barrier-design">Barrier design</h3>

<p>Spin locking means atomics. By default, atomics have sequentially
consistent ordering and will provide the necessary synchronization for the
non-atomic experiment variables. Stores (e.g. to <code class="language-plaintext highlighter-rouge">w0</code>, <code class="language-plaintext highlighter-rouge">w1</code>) made before
the barrier will be visible to all other threads upon passing through the
barrier. In other words, the initialization will propagate before either
thread exits the first barrier, and results propagate before either thread
exits the second barrier.</p>

<p>I know statically that there are only two threads, simplifying the
implementation. The plan: When threads arrive, they atomically increment a
shared variable to indicate such. The first to arrive will see an odd
number, telling it to atomically read the variable in a loop until the
other thread changes it to an even number.</p>

<p>At first, with just two threads, it might seem like a single bit would
suffice. If the bit is set, the other thread hasn’t arrived. If clear,
both threads have arrived.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">broken_wait1</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Or to avoid an extra load, use the result directly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">broken_wait2</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">++*</span><span class="n">barrier</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Neither of these work correctly, and the other mutex-free barriers I found
all have the same defect. Consider the broader picture: Between atomic
loads in the first thread spin-lock loop, suppose the second thread
arrives, passes through the barrier, does its work, hits the next barrier,
and increments the counter. Both threads see an odd counter simultaneously
and deadlock. No good.</p>

<p>To fix this, the wait function must also track the <em>phase</em>. The first
barrier is the first phase, the second barrier is the second phase, etc.
Conveniently <strong>the rest of the integer acts like a phase counter</strong>!
Writing this out more explicitly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">observed</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="n">thread_count</span> <span class="o">=</span> <span class="n">observed</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">thread_count</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// not last arrival, watch for phase change</span>
        <span class="kt">unsigned</span> <span class="n">init_phase</span> <span class="o">=</span> <span class="n">observed</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
            <span class="kt">unsigned</span> <span class="n">current_phase</span> <span class="o">=</span> <span class="o">*</span><span class="n">barrier</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">current_phase</span> <span class="o">!=</span> <span class="n">init_phase</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key: When the last thread arrives, it overflows the thread counter to
zero and increments the phase counter in one operation.</p>
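<p>Concretely, here is a short standalone trace of the two-thread barrier
word (a sketch; bit 0 is the arrival count and the remaining bits are the
phase counter):</p>

```c
#include <assert.h>

// Trace of the two-thread barrier word. Bit 0 counts arrivals and the
// higher bits count phases, so the second arrival's increment carries
// out of bit 0, clearing the count and bumping the phase in one step.
static unsigned trace(void)
{
    unsigned barrier = 0;
    barrier++;   // first arrival:  0b001, phase 0, spins
    assert((barrier & 1) == 1 && (barrier >> 1) == 0);
    barrier++;   // second arrival: 0b010, phase 1, both released
    assert((barrier & 1) == 0 && (barrier >> 1) == 1);
    barrier++;   // next barrier:   0b011, still phase 1, spins
    barrier++;   // last arrival:   0b100, phase 2, both released
    assert((barrier & 1) == 0 && (barrier >> 1) == 2);
    return barrier;
}
```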

<p>By the way, I’m using <code class="language-plaintext highlighter-rouge">unsigned</code> since it may eventually overflow, and
even <code class="language-plaintext highlighter-rouge">_Atomic int</code> overflow is undefined for the <code class="language-plaintext highlighter-rouge">++</code> operator. However,
if you use <code class="language-plaintext highlighter-rouge">atomic_fetch_add</code> or C++ <code class="language-plaintext highlighter-rouge">std::atomic</code> then overflow is
defined and you can use <code class="language-plaintext highlighter-rouge">int</code>.</p>
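<p>A sketch of that alternative for the two-thread case (the function name
is illustrative): <code>atomic_fetch_add</code> returns the value before the
increment, so the first arrival sees an even value and spins.</p>

```c
#include <stdatomic.h>

// Two-thread barrier via atomic_fetch_add, whose overflow is defined
// even for signed types, so a plain int counter is fine. fetch_add
// returns the value before the increment, so the first arrival sees
// an even value and spins until the phase bit (bit 1) flips.
void barrier_wait_fa(_Atomic int *barrier)
{
    int v = atomic_fetch_add(barrier, 1);
    if (!(v & 1)) {
        for (v &= 2; (atomic_load(barrier) & 2) == v;) {}
    }
}
```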

<p>Threads can never be more than one phase apart by definition, so only one
bit is needed for the phase counter, making this effectively a two-phase,
two-bit barrier. In my final implementation, rather than shift (<code class="language-plaintext highlighter-rouge">&gt;&gt;</code>), I
mask (<code class="language-plaintext highlighter-rouge">&amp;</code>) the phase bit with 2.</p>

<p>With this spin-lock barrier, the experiment observes <code class="language-plaintext highlighter-rouge">r0 = r1 = 0</code> in ~10%
of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.</p>

<h3 id="generalizing-to-more-threads">Generalizing to more threads</h3>

<p>Two threads required two bits. This generalizes to <code class="language-plaintext highlighter-rouge">log2(n)+1</code> bits for
<code class="language-plaintext highlighter-rouge">n</code> threads, where <code class="language-plaintext highlighter-rouge">n</code> is a power of two. You may have already figured out
how to support more threads: spend more bits on the thread counter.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Spin-lock barrier for n threads, where n is a power of two.</span>
<span class="c1">// Initialize *barrier to zero.</span>
<span class="kt">void</span> <span class="nf">barrier_waitn</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">v</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="n">n</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="n">n</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note: <strong>It never makes sense for <code class="language-plaintext highlighter-rouge">n</code> to exceed the logical core count!</strong>
If it does, then at least one thread must not be actively running. The
spin-lock ensures it does not get scheduled promptly, and the barrier will
waste lots of resources doing nothing in the meantime.</p>

<p>If the barrier is used infrequently enough that the barrier integer never
overflows (maybe just use a <code class="language-plaintext highlighter-rouge">uint64_t</code>), an implementation could
support arbitrary thread counts on the same principle, using the modulo
operation instead of the <code class="language-plaintext highlighter-rouge">&amp;</code> operator. The divisor is ideally a
compile-time constant in order to avoid paying for division in the
spin-lock loop.</p>
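<p>A minimal sketch of that arbitrary-count variant (illustrative names; it
assumes the counter never wraps, hence the wide type):</p>

```c
#include <stdint.h>

// Spin-lock barrier for an arbitrary thread count n. The quotient
// counter/n is the phase number and the remainder is the arrival
// count, so the last arrival's increment advances the phase. Assumes
// the 64-bit counter never wraps over the program's lifetime.
void barrier_waitn_any(_Atomic uint64_t *barrier, uint64_t n)
{
    uint64_t v = ++*barrier;
    if (v % n) {                        // not the last arrival
        uint64_t phase = v / n;         // phase being waited out
        while (*barrier / n == phase) {}
    }
}
```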

<p>While C11 <code class="language-plaintext highlighter-rouge">_Atomic</code> seems like it would be useful, unsurprisingly it is
not supported by one major, <a href="/blog/2021/12/30/">stubborn</a> implementation. If you’re
using C++11 or later, then go ahead and use <code class="language-plaintext highlighter-rouge">std::atomic&lt;int&gt;</code> since it’s
well-supported. In real, practical C programs, I will continue using dual
implementations: interlocked functions on MSVC, and GCC built-ins (also
supported by Clang) everywhere else.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if __GNUC__
#  define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
#  define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
#  define BARRIER_INC(x) _InterlockedIncrement(x)
#  define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif
</span>
<span class="c1">// Spin-lock barrier for n threads, where n is a power of two.</span>
<span class="c1">// Initialize *barrier to zero.</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">barrier</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="n">BARRIER_INC</span><span class="p">(</span><span class="n">barrier</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="n">n</span><span class="p">;</span> <span class="p">(</span><span class="n">BARRIER_GET</span><span class="p">(</span><span class="n">barrier</span><span class="p">)</span><span class="o">&amp;</span><span class="n">n</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This has the nice bonus that the interface does not have the <code class="language-plaintext highlighter-rouge">_Atomic</code>
qualifier, nor <code class="language-plaintext highlighter-rouge">std::atomic</code> template. It’s just a plain old <code class="language-plaintext highlighter-rouge">int</code>, making
the interface simpler and easier to use. It’s something I’ve grown to
appreciate from Go.</p>

<p>If you’d like to try the experiment yourself: <a href="https://gist.github.com/skeeto/c63b9ddf2c599eeca86356325b93f3a7"><code class="language-plaintext highlighter-rouge">reorder.c</code></a>. If
you’d like to see a test of Go and C sharing a thread barrier:
<a href="https://gist.github.com/skeeto/bdb5a0d2aa36b68b6f66ca39989e1444"><code class="language-plaintext highlighter-rouge">coop.go</code></a>.</p>

<p>I’m intentionally not providing the spin-lock barrier as a library. First,
it’s too trivial and small for that, and second, I believe <a href="https://vimeo.com/644068002">context is
everything</a>. Now that you understand the principle, you can whip up
your own, custom-tailored implementation when the situation calls for it,
just as the one in my experiment is hard-coded for exactly two threads.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Fast CSV processing with SIMD</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/12/04/"/>
    <id>urn:uuid:ba6e0ccf-1e11-4c5d-bc53-dd11fbc6da6c</id>
    <updated>2021-12-04T01:13:33Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=29439403">on Hacker News</a>.</em></p>

<p>I recently learned of <a href="https://github.com/dbro/csvquote">csvquote</a>, a tool that encodes troublesome
<a href="https://datatracker.ietf.org/doc/html/rfc4180">CSV</a> characters such that unix tools can correctly process them. It
reverses the encoding at the end of the pipeline, recovering the original
input. The original implementation handles CSV quotes using the
straightforward, naive method. However, there’s a better approach that is
not only simpler, but around 3x faster on modern hardware. Even more,
there’s yet another approach using SIMD intrinsics, plus some bit
twiddling tricks, which increases the processing speed by an order of
magnitude. <a href="https://github.com/skeeto/scratch/tree/master/csvquote"><strong>My csvquote implementation</strong></a> includes both
approaches.</p>

<!--more-->

<h3 id="background">Background</h3>

<p>Records in CSV data are separated by line feeds, and fields are separated
by commas. Fields may be quoted.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaa,bbb,ccc
xxx,"yyy",zzz
</code></pre></div></div>

<p>Fields containing a line feed (U+000A), quotation mark (U+0022), or comma
(U+002C) must be quoted, otherwise they would be ambiguous with the CSV
formatting itself. A quotation mark inside a quoted field is escaped by
doubling it. For example, here are two records with two fields apiece:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1919–1921, 1923, 1926"
"Frankenstein;
or, The Modern Prometheus",Mary Shelley
</code></pre></div></div>

<p>A CSV-unaware tool splitting on commas and line feeds (ex. <code class="language-plaintext highlighter-rouge">awk</code>) would
process these records improperly. So csvquote translates quoted line feeds
into record separators (U+001E) and commas into unit separators (U+001F).
These control characters rarely appear in normal text data, and can be
trivially processed in UTF-8-encoded text without decoding or encoding.
The above records become:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1919–1921\x1f 1923\x1f 1926"
"Frankenstein;\x1eor\x1f The Modern Prometheus",Mary Shelley
</code></pre></div></div>

<p>I’ve used <code class="language-plaintext highlighter-rouge">\x1e</code> and <code class="language-plaintext highlighter-rouge">\x1f</code> here to illustrate the control characters.</p>

<p>The data is exactly the same length since it’s a straight byte-for-byte
replacement. Quotes are left entirely untouched. The challenge is parsing
the quotes to track whether the two special characters fall inside or
outside pairs of quotes.</p>

<h3 id="state-machine-improvements">State machine improvements</h3>

<p>The original csvquote walks the input a byte at a time and is in one of
three states:</p>

<ol>
  <li>Outside quotes (initial state)</li>
  <li>Inside quotes</li>
  <li>On a possibly “escaped” quote (the first <code class="language-plaintext highlighter-rouge">"</code> in a <code class="language-plaintext highlighter-rouge">""</code>)</li>
</ol>

<p>Since <a href="/blog/2020/12/31/">I love state machines so much</a>, here it is translated into a
switch-based state machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Return the next state given an input character.</span>
<span class="kt">int</span> <span class="nf">next</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">3</span> <span class="o">:</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">3</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><a href="/img/csv/csv3.dot"><img src="/img/csv/csv3.png" alt="" /></a></p>

<p>The real program also has more conditions for potentially making a
replacement. It’s an awful lot of <a href="/blog/2017/10/06/">performance-killing branching</a>.</p>

<p>However, in this <a href="https://vimeo.com/644068002">context</a> the goal is finding “in” and “out” — not
validating the CSV — so the “escape” state is unnecessary. I need only
match up pairs of quotes: an “escaped” quote can be treated as terminating
a quoted region and immediately starting a new one. That leaves just the
first two states in a trivial arrangement:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">next</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><a href="/img/csv/csv2.dot"><img src="/img/csv/csv2.png" alt="" /></a></p>

<p>Since the text can be processed as bytes, there are only 256 possible
inputs. With 2 states and 256 inputs, this state machine, <em>with</em>
replacement machinery, can be implemented with a 512-byte table and <em>no
branches</em>. Here’s the table initialization:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">table</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">256</span><span class="p">];</span>

<span class="kt">void</span> <span class="nf">init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">256</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">table</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="sc">'\n'</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x1e</span><span class="p">;</span>
    <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="sc">','</span><span class="p">]</span>  <span class="o">=</span> <span class="mh">0x1f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the first state, characters map onto themselves. In the second state,
characters map onto their replacements. This is the <em>entire</em> encoder and
decoder:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">encode</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span> <span class="o">^=</span> <span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'"'</span><span class="p">);</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Well, strictly speaking, the decoder need not process quotes. By my
benchmark (<code class="language-plaintext highlighter-rouge">csvdump</code> in my implementation) this processes at ~1 GiB/s on
my laptop — 3x faster than the original. However, there’s still
low-hanging fruit to be picked!</p>
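<p>For reference, here is that encoder restated as a self-contained sketch
alongside a decoder. The <code class="language-plaintext highlighter-rouge">decode</code> function is my own addition: it needs
no quote tracking at all, since the control bytes can only have come from
the encoder and map straight back:</p>

```c
#include <stddef.h>

static unsigned char table[2][256];

static void init(void)
{
    /* State 0: identity. State 1 (inside quotes): replace the two
     * troublesome characters with control characters. */
    for (int i = 0; i < 256; i++) {
        table[0][i] = i;
        table[1][i] = i;
    }
    table[1]['\n'] = 0x1e;
    table[1][','] = 0x1f;
}

static void encode(unsigned char *buf, size_t len)
{
    int state = 0;
    for (size_t i = 0; i < len; i++) {
        state ^= (buf[i] == '"');       /* toggle on each quote */
        buf[i] = table[state][buf[i]];  /* branchless replacement */
    }
}

/* My own addition, not from the article: decoding is unconditional
 * because the control bytes appear only where the encoder put them. */
static void decode(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x1e) buf[i] = '\n';
        if (buf[i] == 0x1f) buf[i] = ',';
    }
}
```

<p>Encoding <code class="language-plaintext highlighter-rouge">xxx,"y,y",zzz</code> turns the quoted comma into 0x1f while the
field-separating commas survive, and decoding restores the original
bytes.</p>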

<h3 id="simd-and-twos-complement">SIMD and two’s complement</h3>

<p>Any decent SIMD implementation is going to make use of masking. Find the
quotes, compute a mask over quoted regions, compute another mask for
replacement matches, combine the masks, then use that mask to blend the
input with the replacements. Roughly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quotes    = find_quoted_regions(input)
linefeeds = input == '\n'
commas    = input == ','
output    = blend(input, '\n', quotes &amp; linefeeds)
output    = blend(output, ',', quotes &amp; commas)
</code></pre></div></div>

<p>The hard part is computing the quote mask, somehow handling quoted
regions that straddle SIMD chunks (not pictured), <em>and</em> doing all that
without resorting to slow byte-at-a-time operations. Fortunately there are
some bitwise tricks that can resolve each issue.</p>

<p>Imagine I load 32 bytes into a SIMD register (e.g. AVX2), and I compute a
32-bit mask where each bit corresponds to one byte. If that byte contains
a quote, the corresponding bit is set.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001010
</code></pre></div></div>

<p>That last/lowest 1 corresponds to the beginning of a quoted region. For my
mask, I’d like to set all bits following that bit. I can do this by
subtracting 1.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001001
</code></pre></div></div>

<p>Using the <a href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan">Kernighan technique</a> I can also remove this bit from the
original input by ANDing them together.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001000
</code></pre></div></div>

<p>Now I’m left with a new bottom bit. If I repeat this, I build up layers of
masks, one for each input quote.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>10000000000000011000011000001001
10000000000000011000011000000111
10000000000000011000010111111111
10000000000000011000001111111111
10000000000000010111111111111111
10000000000000001111111111111111
01111111111111111111111111111111
</code></pre></div></div>

<p>Remember how I use XOR in the state machine above to toggle between
states? If I XOR all these together, I toggle the quotes on and off,
building up quoted regions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
01111111111111100111100111110001
</code></pre></div></div>

<p>However, for reasons I’ll explain shortly, it’s critical that the opening
quote is included in this mask. If I XOR the pre-subtracted value with the
mask when I compute the mask, I can toggle the remaining quotes on and off
such that the opening quotes are included. Here’s my function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">find_quoted_regions</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="n">x</span><span class="p">;</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">x</span> <span class="o">&amp;=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which gives me exactly what I want:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
11111111111111101111101111110011
</code></pre></div></div>

<p>It’s important that the opening quote is included because it means a
region that begins on the last byte will have that last bit set. I can use
that last bit to determine if the next chunk begins in a quoted state. If
a region begins in a quoted state, I need only NOT the whole result to
reverse the quoted regions.</p>

<p>How can I “sign extend” a 1 into all bits set, or do nothing for zero?
Negate it!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">uint32_t</span> <span class="n">carry</span>  <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="n">prev</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">quotes</span> <span class="o">=</span> <span class="n">find_quoted_regions</span><span class="p">(</span><span class="n">input</span><span class="p">)</span> <span class="o">^</span> <span class="n">carry</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="n">prev</span> <span class="o">=</span> <span class="n">quotes</span><span class="p">;</span>
</code></pre></div></div>

<p>That takes care of computing quoted regions and chaining them between
chunks. The loop will unfortunately cause branch prediction penalties if
the input has lots of quotes, but I couldn’t find a way around this.</p>
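<p>As a sanity check, here is the mask function with that carry logic
wrapped around it, in the idealized orientation (first byte in the highest
bit). The wrapper name <code class="language-plaintext highlighter-rouge">quoted_mask</code> is my own; it can be exercised on
the 32-byte example above and on regions that straddle chunks:</p>

```c
#include <stdint.h>

/* The article's mask function: XOR together a "bits at and below" mask
 * for each set bit, so opening quotes land inside the result. */
static uint32_t find_quoted_regions(uint32_t x)
{
    uint32_t r = 0;
    while (x) {
        r ^= x;
        r ^= x - 1;
        x &= x - 1;
    }
    return r;
}

/* My own wrapper: chain chunks by carrying the low bit (the last
 * byte's quoted state) into the next chunk as an all-ones XOR mask. */
static uint32_t quoted_mask(uint32_t quotes, uint32_t *prev)
{
    uint32_t carry = -(*prev & 1);
    uint32_t r = find_quoted_regions(quotes) ^ carry;
    *prev = r;
    return r;
}
```

<p>The first assertion below is the quote mask for
<code class="language-plaintext highlighter-rouge">"George Herman ""Babe"" Ruth","1</code> and its expected quoted-region mask.</p>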

<p>However, I’ve made a serious mistake. I’m using <code class="language-plaintext highlighter-rouge">_mm256_movemask_epi8</code> and
it puts the first byte in the lowest bit. Doh! That means it looks like
this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1","htuR ""ebaB"" namreH egroeG"
01010000011000011000000000000001
</code></pre></div></div>

<p>There’s no efficient way to flip the bits around, so I just need to find a
way to work in the other direction. To flip the bits to the left of a set
bit, negate it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000000000000000000010000000 = +0x00000080
11111111111111111111111110000000 = -0x00000080
</code></pre></div></div>

<p>Unlike before, this keeps the original bit set, so I need to XOR the
original value into the input to flip the quotes. This is as simple as
initializing to the input rather than zero. The new loop:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">find_quoted_regions</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="o">-</span><span class="n">x</span> <span class="o">^</span> <span class="n">x</span><span class="p">;</span>
        <span class="n">x</span> <span class="o">&amp;=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1","htuR ""ebaB"" namreH egroeG"
11001111110111110111111111111111
</code></pre></div></div>

<p>The carry now depends on the high bit rather than the low bit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uint32_t carry = -(prev &gt;&gt; 31);
</code></pre></div></div>

<h3 id="reversing-movemask">Reversing movemask</h3>

<p>The next problem: for reasons I don’t understand, AVX2 does not include
the inverse of <code class="language-plaintext highlighter-rouge">_mm256_movemask_epi8</code>. Converting the bit-mask back into a
byte-mask requires some clever shuffling. Fortunately <a href="https://web.archive.org/web/20150506071030/https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb">I’m not the first
to have this problem</a>, and so I didn’t have to figure it out from
scratch.</p>

<p>First fill the 32-byte register with repeated copies of the 32-bit mask.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abcdabcdabcdabcdabcdabcdabcdabcd
</code></pre></div></div>

<p>Shuffle the bytes so that the first 8 register bytes each hold a copy of
the first bit-mask byte, the next 8 a copy of the second, and so on.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaaaaaaabbbbbbbbccccccccdddddddd
</code></pre></div></div>

<p>In byte 0, I care only about bit 0, in byte 1 only about bit 1, … in
byte N only about bit <code class="language-plaintext highlighter-rouge">N%8</code>. I can pre-compute a mask to isolate each
of these bits and produce a proper byte-wise mask from the bit-mask.
Fortunately all this isn’t too bad: four instructions instead of the one
I had wanted.</p>
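<p>A scalar model (my own, standing in for the vector instructions) makes
the mapping concrete: byte <code class="language-plaintext highlighter-rouge">i</code> reads byte <code class="language-plaintext highlighter-rouge">i/8</code> of the bit-mask (the
shuffle), isolates bit <code class="language-plaintext highlighter-rouge">i%8</code> (the AND with a precomputed mask), and
widens the result to all-ones or all-zeros (the compare):</p>

```c
#include <stdint.h>

/* Scalar model (mine, not the vectorized code) of the inverse
 * movemask: expand a 32-bit mask into 32 bytes of 0x00 or 0xff. */
static void inverse_movemask(uint32_t mask, unsigned char out[32])
{
    for (int i = 0; i < 32; i++) {
        unsigned char byte = mask >> (i / 8 * 8);   /* the shuffle */
        unsigned char bit  = 1 << (i % 8);          /* isolation mask */
        out[i] = (byte & bit) == bit ? 0xff : 0x00; /* the compare */
    }
}
```

<p>Each of the three steps in the loop body corresponds to one of the
shuffle/AND/compare instructions in the vector version.</p>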

<h3 id="results">Results</h3>

<p>In my benchmark, which includes randomly occurring quoted fields, the SIMD
version processes at ~4 GiB/s — 10x faster than the original. I haven’t
profiled, but I expect mispredictions on the bit-mask loop are the main
obstacle preventing the hypothetical 32x speedup.</p>

<p>My version also optionally rejects inputs containing the two special
control characters since the encoding would be irreversible. This is
implemented in SIMD when available, and it slows processing by around 10%.</p>

<h3 id="followup-pclmulqdq">Followup: PCLMULQDQ</h3>

<p>Geoff Langdale and others have <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCABwTFSrDpNkmJs6TpkAfofcZq6e8YWaJUur20xZBz7mDBnvQ2w%40mail.gmail.com%3E">graciously pointed out PCLMULQDQ</a>,
which can <a href="https://wunkolo.github.io/post/2020/05/pclmulqdq-tricks/">compute the quote masks using carryless multiplication</a>
(<a href="https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/">also</a>) entirely in SIMD and without a loop. I haven’t yet worked out
exactly how to apply it, but it should be much faster.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Billions of Code Name Permutations in 32 bits</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/09/14/"/>
    <id>urn:uuid:bc17a779-bee1-4a60-80d1-5c5cfd8fd638</id>
    <updated>2021-09-14T21:06:59Z</updated>
    <category term="c"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>My friend over at Possibly Wrong <a href="https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/">created a code name generator</a>. By
coincidence I happened to be thinking about code names myself while
recently replaying <a href="https://en.wikipedia.org/wiki/XCOM:_Enemy_Within"><em>XCOM: Enemy Within</em></a> (2012/2013). The game
generates a random code name for each mission, and I wondered how often it
repeats. The <a href="https://www.ufopaedia.org/index.php/Mission_Names_(EU2012)">UFOpaedia page on the topic</a> gives the word lists: 53
adjectives and 76 nouns, for a total of 4028 possible code names. A
typical game has around 60 missions, and if code names are generated
naively on the fly, then per the birthday paradox around half of all games
will see a repeated mission code name! Fortunately this is easy to avoid,
and the particular configuration here lends itself to an interesting
implementation.</p>

<p>Mission code names are built using “<em>adjective</em> <em>noun</em>”. Some examples
from the game’s word list:</p>

<ul>
  <li>Fading Hammer</li>
  <li>Fallen Jester</li>
  <li>Hidden Crown</li>
</ul>

<p>To generate a code name, we could select a random adjective and a random
noun, but as discussed it wouldn’t take long for a collision. The naive
approach is to keep a database of previously-generated names, and to
consult this database when generating new names. That works, but there’s
an even better solution: use a random permutation. Done well, we don’t
need to keep track of previous names, and the generator won’t repeat until
it’s exhausted all possibilities.</p>

<p>Further, the total number of possible code names, 4,028, is suspiciously
shy of 4,096, a power of two (<code class="language-plaintext highlighter-rouge">2**12</code>). That makes designing and
implementing an efficient permutation that much easier.</p>

<h3 id="a-linear-congruential-generator">A linear congruential generator</h3>

<p>A classic, obvious solution is a <a href="/blog/2019/11/19/">linear congruential generator</a>
(LCG). A full-period, 12-bit LCG is nothing more than a permutation of the
numbers 0 to 4,095. When generating names, we can skip over the extra 68
values and pretend it’s a permutation of 4,028 elements. An LCG is
constructed like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(n) = (f(n-1)*A + C) % M
</code></pre></div></div>

<p>Typically the seed is used for <code class="language-plaintext highlighter-rouge">f(0)</code>. M is selected based on the problem
space or implementation efficiency, and is usually a power of two. In this
case it will be 4,096. Then there are some rules for choosing A and C.</p>

<p>Simply choosing a random <code class="language-plaintext highlighter-rouge">f(0)</code> per game isn’t great. The code name order
will always be the same, and we’re only choosing where in the cycle to
start. It would be better to vary the permutation itself, which we can do
by also choosing unique A and C constants per game.</p>

<p>Choosing C is easy: it must be relatively prime with M, i.e. it must be
odd. Because addition is modulo M, there’s no reason to choose <code class="language-plaintext highlighter-rouge">C &gt;= M</code>:
the results are identical to a smaller C. If we think of C as a
12-bit integer, 1 bit is locked in, and the other 11 bits are free to
vary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxxxx1
</code></pre></div></div>

<p>Choosing A is more complicated: it must be odd, <code class="language-plaintext highlighter-rouge">A-1</code> must be divisible
by 4, and <code class="language-plaintext highlighter-rouge">A-1</code> should <em>not</em> be divisible by 8 (better results), i.e.
<code class="language-plaintext highlighter-rouge">A % 8 == 5</code>. Again, thinking of this in terms of a 12-bit number, this
locks in 3 bits and leaves 9 bits free:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxx101
</code></pre></div></div>

<p>This ensures all the <em>must</em> and <em>should</em> properties of A.</p>
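<p>These properties are easy to check empirically with a throwaway sketch
(mine, not from the article): count how many steps an LCG takes to return
to its seed. Constants respecting the rules above always yield the full
4,096-step period:</p>

```c
/* Count iterations until a 12-bit LCG returns to its seed. Any odd A
 * makes the map a bijection, so the loop always terminates. */
static int lcg_period(long a, long c, long seed)
{
    long s = seed;
    int n = 0;
    do {
        s = (s*a + c) & 0xfff;
        n++;
    } while (s != seed);
    return n;
}
```

<p>An A of the form xxxxxxxxx101 with any odd C gives period 4,096 from
any seed, while an A with <code class="language-plaintext highlighter-rouge">A-1</code> not divisible by 4 falls short.</p>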

<p>Finally <code class="language-plaintext highlighter-rouge">0 &lt;= f(0) &lt; M</code>. Because of modular arithmetic, larger values are
redundant, and all possible values are valid since the LCG, being
full-period, will cycle through all of them. This is just choosing the
starting point in a particular permutation cycle. As a 12-bit number, all
12 bits are free:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxxxxx
</code></pre></div></div>

<p>That’s <code class="language-plaintext highlighter-rouge">9 + 11 + 12 = 32</code> free bits to fill randomly: again, how
incredibly convenient! Every 32-bit integer defines some unique code name
permutation… <em>almost</em>. Any 32-bit descriptor where <code class="language-plaintext highlighter-rouge">f(0) &gt;= 4028</code> will
collide with at least one other due to skipping, and so around 1.7% of the
state space is redundant. A small loss that should shrink with slightly
better word list planning. I don’t think anyone will notice.</p>

<h3 id="slice-and-dice">Slice and dice</h3>

<p><a href="/blog/2020/12/31/">I love compact state machines</a>, and this is an opportunity to put one
to good use. My code name generator will be just one function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>This takes one of those 32-bit permutation descriptors, writes the first
code name to <code class="language-plaintext highlighter-rouge">buf</code>, and returns a descriptor for another permutation that
starts with the next name. All we have to do is keep track of that 32-bit
number and we’ll never need to worry about repeating code names until all
have been exhausted.</p>

<p>First, let’s extract A, C, and <code class="language-plaintext highlighter-rouge">f(0)</code>, which I’m calling S. The low bits
are A, middle bits are C, and high bits are S. Note the OR with 1 and 5 to
lock in the hard-set bits.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&lt;&lt;</span>  <span class="mi">3</span> <span class="o">|</span> <span class="mi">5</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">//  9 bits</span>
<span class="kt">long</span> <span class="n">c</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">// 11 bits</span>
<span class="kt">long</span> <span class="n">s</span> <span class="o">=</span>  <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>               <span class="c1">// 12 bits</span>
</code></pre></div></div>
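<p>As a quick sanity check (my own, not part of the article), any 32-bit
descriptor unpacks to constants satisfying the earlier rules:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Restates the extraction above and asserts the LCG constraints: A is
 * locked to the ...101 pattern, C is forced odd, S fits in 12 bits. */
static void check(uint32_t state)
{
    long a = (state <<  3 | 5) & 0xfff;
    long c = (state >>  8 | 1) & 0xfff;
    long s =  state >> 20;
    assert(a % 8 == 5);   /* odd, A-1 divisible by 4 but not 8 */
    assert(c % 2 == 1);   /* relatively prime with 4096 */
    assert(s < 4096);
}
```

<p>Because the locked bits are OR’d in after masking, no choice of
descriptor can produce an invalid A or C.</p>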

<p>Next iterate the LCG until we have a number in range:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">do</span> <span class="p">{</span>
    <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;=</span> <span class="mi">4028</span><span class="p">);</span>
</code></pre></div></div>

<p>Once we have an appropriate LCG state, compute the adjective/noun indexes
and build a code name:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">s</span> <span class="o">%</span> <span class="mi">53</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">s</span> <span class="o">/</span> <span class="mi">53</span><span class="p">;</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
</code></pre></div></div>

<p>Finally assemble the next 32-bit state. Since A and C don’t change, these
are passed through while the old S is masked out and replaced with the new
S.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
</code></pre></div></div>

<p>Putting it all together:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">adjvs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">nouns</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>

<span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&lt;&lt;</span>  <span class="mi">3</span> <span class="o">|</span> <span class="mi">5</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">//  9 bits</span>
    <span class="kt">long</span> <span class="n">c</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">// 11 bits</span>
    <span class="kt">long</span> <span class="n">s</span> <span class="o">=</span>  <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>               <span class="c1">// 12 bits</span>

    <span class="k">do</span> <span class="p">{</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;=</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">)</span><span class="o">*</span><span class="n">COUNTOF</span><span class="p">(</span><span class="n">nouns</span><span class="p">));</span>

    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">s</span> <span class="o">%</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">s</span> <span class="o">/</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The caller just needs to generate an initial 32-bit integer. Any 32-bit
integer is valid — even zero — so this could just be, say, the unix epoch
(<code class="language-plaintext highlighter-rouge">time(2)</code>), but adjacent values will have similar-ish permutations. I
intentionally placed S in the high bits, which are least likely to vary,
since it only affects where the cycle begins, while A and C have a much
more dramatic impact and so are placed at more variable locations.</p>

<p>Regardless, it would be better to hash such an input so that adjacent time
values map to distant states. It also helps hide poorer (less random)
choices for A multipliers. I happen to have <a href="/blog/2018/07/31/">designed some great functions
for exactly this purpose</a>. Here’s one of my best:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">+=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd168aaadU</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xaf723597U</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This would be perfectly reasonable for generating all possible names in a
random order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="n">state</span> <span class="o">=</span> <span class="n">hash32</span><span class="p">(</span><span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">4028</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
    <span class="n">state</span> <span class="o">=</span> <span class="n">codename</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
    <span class="n">puts</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To further help cover up poorer A multipliers, it’s better for the word
list to be pre-shuffled in its static storage. If that underlying order
happens to show through, at least it will be less obvious (i.e. not in
alphabetical order). Shuffling the string list in my source is just a few
keystrokes in Vim, so this is easy enough.</p>

<h3 id="robustness">Robustness</h3>

<p>If you’re set on making the <code class="language-plaintext highlighter-rouge">codename</code> function easier to use such that
consumers don’t need to think about hashes, you could “encode” and
“decode” the descriptor going in and out of the function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">state</span> <span class="o">+=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">*=</span> <span class="mh">0x9e485565U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">*=</span> <span class="mh">0xef1d6b47U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>

    <span class="c1">// ...</span>

    <span class="n">state</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">state</span> <span class="o">*=</span> <span class="mh">0xeb00ce77U</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">state</span> <span class="o">*=</span> <span class="mh">0x88ccd46dU</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span> <span class="n">state</span> <span class="o">-=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">state</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This permutes the state coming in, and reverses that permutation on the
way out (read: inverse hash). This breaks up similar starting points.</p>

<h3 id="a-random-access-code-name-permutation">A random-access code name permutation</h3>

<p>Of course this isn’t the only way to build a permutation. I recently
picked up another trick: <a href="https://andrew-helmer.github.io/permute/">Kensler permutation</a>. The key insight
is cycle-walking, which allows random access to a permutation of a smaller
domain (e.g. 4,028 elements) through a permutation of a larger domain (e.g.
4,096 elements).</p>

<p>Here’s such a code name generator built around a bespoke 12-bit
xorshift-multiply permutation. I used 4 “rounds” since xorshift-multiply
is less effective the smaller the permutation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Generate the nth code name for this seed.</span>
<span class="kt">void</span> <span class="nf">codename_n</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">seed</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x325</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x3f5</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0xa89</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x85b</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">)</span><span class="o">*</span><span class="n">COUNTOF</span><span class="p">(</span><span class="n">nouns</span><span class="p">));</span>

    <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="n">i</span> <span class="o">%</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">i</span> <span class="o">/</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">a</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">b</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>While this is more flexible, avoids poorer permutations, and doesn’t have
state space collisions, I still have a soft spot for my LCG-based state
machine generator.</p>

<h3 id="source-code">Source code</h3>

<p>You can find the complete, working source code with both generators here:
<a href="https://github.com/skeeto/scratch/tree/master/misc/codename.c"><strong><code class="language-plaintext highlighter-rouge">codename.c</code></strong></a>. I used <a href="https://en.wikipedia.org/wiki/Secret_Service_code_name">real US Secret Service code names</a> for
my word list. Some sample outputs:</p>

<ul>
  <li>PLASTIC HUMMINGBIRD</li>
  <li>BLACK VENUS</li>
  <li>SILENT SUNBURN</li>
  <li>BRONZE AUTHOR</li>
  <li>FADING MARVEL</li>
</ul>

]]>
    </content>
  </entry>
  <entry>
    <title>State machines are wonderful tools</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/12/31/"/>
    <id>urn:uuid:c93d7a7b-6ae0-4b7e-afa6-424ef40b9d9c</id>
    <updated>2020-12-31T22:48:13Z</updated>
    <category term="compsci"/><category term="c"/><category term="python"/><category term="lua"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25601821">on Hacker News</a>.</em></p>

<p>I love when my current problem can be solved with a state machine. They’re
fun to design and implement, and I have high confidence about correctness.
They tend to:</p>

<ol>
  <li>Present <a href="/blog/2018/06/10/">minimal, tidy interfaces</a></li>
  <li>Require few, fixed resources</li>
  <li>Hold no opinions about input and output</li>
  <li>Have a compact, concise implementation</li>
  <li>Be easy to reason about</li>
</ol>

<p>State machines are perhaps one of those concepts you heard about in
college but never put into practice. Maybe you use them regularly.
Regardless, you certainly run into them regularly, from <a href="https://swtch.com/~rsc/regexp/">regular
expressions</a> to traffic lights.</p>

<!--more-->

<h3 id="morse-code-decoder-state-machine">Morse code decoder state machine</h3>

<p>Inspired by <a href="https://possiblywrong.wordpress.com/2020/11/21/among-us-morse-code-puzzle/">a puzzle</a>, I came up with this deterministic state
machine for decoding <a href="https://en.wikipedia.org/wiki/Morse_code">Morse code</a>. It accepts a dot (<code class="language-plaintext highlighter-rouge">'.'</code>), dash
(<code class="language-plaintext highlighter-rouge">'-'</code>), or terminator (0) one at a time, advancing through a state
machine step by step:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">morse_decode</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">t</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mh">0x03</span><span class="p">,</span> <span class="mh">0x3f</span><span class="p">,</span> <span class="mh">0x7b</span><span class="p">,</span> <span class="mh">0x4f</span><span class="p">,</span> <span class="mh">0x2f</span><span class="p">,</span> <span class="mh">0x63</span><span class="p">,</span> <span class="mh">0x5f</span><span class="p">,</span> <span class="mh">0x77</span><span class="p">,</span> <span class="mh">0x7f</span><span class="p">,</span> <span class="mh">0x72</span><span class="p">,</span>
        <span class="mh">0x87</span><span class="p">,</span> <span class="mh">0x3b</span><span class="p">,</span> <span class="mh">0x57</span><span class="p">,</span> <span class="mh">0x47</span><span class="p">,</span> <span class="mh">0x67</span><span class="p">,</span> <span class="mh">0x4b</span><span class="p">,</span> <span class="mh">0x81</span><span class="p">,</span> <span class="mh">0x40</span><span class="p">,</span> <span class="mh">0x01</span><span class="p">,</span> <span class="mh">0x58</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x68</span><span class="p">,</span> <span class="mh">0x51</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x88</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x8c</span><span class="p">,</span> <span class="mh">0x92</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x02</span><span class="p">,</span>
        <span class="mh">0x03</span><span class="p">,</span> <span class="mh">0x18</span><span class="p">,</span> <span class="mh">0x14</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x0c</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x08</span><span class="p">,</span> <span class="mh">0x1c</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x24</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x28</span><span class="p">,</span> <span class="mh">0x04</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x30</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x33</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span>
        <span class="mh">0x36</span><span class="p">,</span> <span class="mh">0x37</span><span class="p">,</span> <span class="mh">0x38</span><span class="p">,</span> <span class="mh">0x39</span><span class="p">,</span> <span class="mh">0x41</span><span class="p">,</span> <span class="mh">0x42</span><span class="p">,</span> <span class="mh">0x43</span><span class="p">,</span> <span class="mh">0x44</span><span class="p">,</span> <span class="mh">0x45</span><span class="p">,</span> <span class="mh">0x46</span><span class="p">,</span>
        <span class="mh">0x47</span><span class="p">,</span> <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x49</span><span class="p">,</span> <span class="mh">0x4a</span><span class="p">,</span> <span class="mh">0x4b</span><span class="p">,</span> <span class="mh">0x4c</span><span class="p">,</span> <span class="mh">0x4d</span><span class="p">,</span> <span class="mh">0x4e</span><span class="p">,</span> <span class="mh">0x4f</span><span class="p">,</span> <span class="mh">0x50</span><span class="p">,</span>
        <span class="mh">0x51</span><span class="p">,</span> <span class="mh">0x52</span><span class="p">,</span> <span class="mh">0x53</span><span class="p">,</span> <span class="mh">0x54</span><span class="p">,</span> <span class="mh">0x55</span><span class="p">,</span> <span class="mh">0x56</span><span class="p">,</span> <span class="mh">0x57</span><span class="p">,</span> <span class="mh">0x58</span><span class="p">,</span> <span class="mh">0x59</span><span class="p">,</span> <span class="mh">0x5a</span>
    <span class="p">};</span>
    <span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="o">-</span><span class="n">state</span><span class="p">];</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mh">0x00</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span> <span class="o">?</span> <span class="n">t</span><span class="p">[(</span><span class="n">v</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="mi">63</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="mh">0x2e</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span>  <span class="mi">2</span> <span class="o">?</span> <span class="n">state</span><span class="o">*</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="mh">0x2d</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span>  <span class="mi">1</span> <span class="o">?</span> <span class="n">state</span><span class="o">*</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="nl">default:</span>   <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It typically compiles to under 200 bytes (table included), requires only a
few bytes of memory to operate, and will fit on even the smallest of
microcontrollers. The full source listing, documentation, and
comprehensive test suite:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c">https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c</a></p>

<p>The state machine is trie-shaped, and the 100-byte table <code class="language-plaintext highlighter-rouge">t</code> is the static
<a href="/blog/2016/11/15/">encoding of the Morse code trie</a>:</p>

<p><a href="/img/diagram/morse.dot"><img src="/img/diagram/morse.svg" alt="" /></a></p>

<p>Dots traverse left, dashes right, terminals emit the character at the
current node (terminal state). Stopping on red nodes, or attempting to
take an unlisted edge is an error (invalid input).</p>

<p>Each node in the trie is a byte in the table. Dot and dash each have a bit
indicating if their edge exists. The remaining bits index into a 1-based
character table (at the end of <code class="language-plaintext highlighter-rouge">t</code>), and a 0 “index” indicates an empty
(red) node. The nodes themselves are laid out as <a href="https://en.wikipedia.org/wiki/Binary_heap#Heap_implementation">a binary heap in an
array</a>: the left and right children of the node at <code class="language-plaintext highlighter-rouge">i</code> are found at
<code class="language-plaintext highlighter-rouge">i*2+1</code> and <code class="language-plaintext highlighter-rouge">i*2+2</code>. No need to <a href="/blog/2020/10/19/#minimax-costs">waste memory storing edges</a>!</p>

<p>Since C sadly does not have multiple return values, I’m using the sign bit
of the return value to create a kind of sum type. A negative return value
is a state — which is why the state is negated internally before use. A
positive result is a character output. If zero, the input was invalid.
Only the initial state is non-negative (zero), which is fine since it’s,
by definition, not possible to traverse to the initial state. No <code class="language-plaintext highlighter-rouge">c</code> input
will produce a bad state.</p>

<p>In the original problem the terminals were missing. Despite being a <em>state
machine</em>, <code class="language-plaintext highlighter-rouge">morse_decode</code> is a pure function. The caller can save their
position in the trie by saving the state integer and trying different
inputs from that state.</p>

<h3 id="utf-8-decoder-state-machine">UTF-8 decoder state machine</h3>

<p>The classic UTF-8 decoder state machine is <a href="https://bjoern.hoehrmann.de/utf-8/decoder/dfa/">Bjoern Hoehrmann’s Flexible
and Economical UTF-8 Decoder</a>. It packs the entire state machine into
a relatively small table using clever tricks. It’s easily my favorite
UTF-8 decoder.</p>

<p>I wanted to try my own hand at it, so I re-derived the same canonical
UTF-8 automaton:</p>

<p><a href="/img/diagram/utf8.dot"><img src="/img/diagram/utf8.svg" alt="" /></a></p>

<p>Then I encoded this diagram directly into a much larger (2,064-byte), less
elegant table, too large to display inline here:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c">https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c</a></p>

<p>However, the trade-off is that the executable code is smaller, faster, and
<a href="/blog/2017/10/06/">branchless again</a> (by accident, I swear!):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">utf8_decode</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">cp</span><span class="p">,</span> <span class="kt">int</span> <span class="n">byte</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">signed</span> <span class="kt">char</span> <span class="n">table</span><span class="p">[</span><span class="mi">8</span><span class="p">][</span><span class="mi">256</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">masks</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="kt">int</span> <span class="n">next</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">byte</span><span class="p">];</span>
    <span class="o">*</span><span class="n">cp</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">cp</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">byte</span> <span class="o">&amp;</span> <span class="n">masks</span><span class="p">[</span><span class="o">!</span><span class="n">state</span><span class="p">][</span><span class="n">next</span><span class="o">&amp;</span><span class="mi">7</span><span class="p">]);</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like Bjoern’s decoder, there’s a code point accumulator. The <em>real</em> state
machine has 1,109,950 terminal states, and many more edges and nodes. The
accumulator is an optimization to track exactly which edge was taken to
which node without having to represent such a monstrosity.</p>

<p>Despite the huge table I’m pretty happy with it.</p>

<h3 id="word-count-state-machine">Word count state machine</h3>

<p>Here’s another state machine I came up with a while back for counting words
one Unicode code point at a time while accounting for Unicode’s various
kinds of whitespace. If your input is bytes, run it through the above
UTF-8 state machine first to convert those bytes into code points! This one uses a
switch instead of a lookup table since the table would be sparse (i.e.
<a href="/blog/2019/12/09/">let the compiler figure it out</a>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* State machine counting words in a sequence of code points.
 *
 * The current word count is the absolute value of the state, so
 * the initial state is zero. Code points are fed into the state
 * machine one at a time, each call returning the next state.
 */</span>
<span class="kt">long</span> <span class="nf">word_count</span><span class="p">(</span><span class="kt">long</span> <span class="n">state</span><span class="p">,</span> <span class="kt">long</span> <span class="n">codepoint</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">codepoint</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mh">0x0009</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000a</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000b</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000c</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000d</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x0020</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x0085</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x00a0</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x1680</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2000</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2001</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2002</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2003</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2004</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2005</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2006</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2007</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2008</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2009</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x200a</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2028</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2029</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x202f</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x205f</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x3000</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">state</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">?</span> <span class="o">-</span><span class="n">state</span> <span class="o">:</span> <span class="n">state</span><span class="p">;</span>
    <span class="nl">default:</span>
        <span class="k">return</span> <span class="n">state</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">?</span> <span class="n">state</span> <span class="o">:</span> <span class="o">-</span><span class="mi">1</span> <span class="o">-</span> <span class="n">state</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’m particularly happy with the <em>edge-triggered</em> state transition
mechanism. The sign of the state tracks whether the “signal” is “high”
(inside of a word) or “low” (outside of a word), and so it counts rising
edges.</p>

<p><a href="/img/diagram/wordcount.dot"><img src="/img/diagram/wordcount.svg" alt="" /></a></p>

<p>The counter is not <em>technically</em> part of the state machine — though it
eventually overflows for practical reasons, it isn’t really “finite” — but
is rather an external count of the times the state machine transitions
from low to high, which is the actual, useful output.</p>

<p><em>Reader challenge</em>: Find a slick, efficient way to encode all those code
points as a table rather than rely on whatever the compiler generates for
the <code class="language-plaintext highlighter-rouge">switch</code> (chain of branches, jump table?).</p>

<h3 id="coroutines-and-generators-as-state-machines">Coroutines and generators as state machines</h3>

<p>In languages that support them, state machines can be implemented using
coroutines, including generators. I do particularly like the idea of
<a href="/blog/2018/05/31/">compiler-synthesized coroutines</a> as state machines, though this is a
rare treat. The state is implicit in the coroutine at each yield, so the
programmer doesn’t have to manage it explicitly. (Though often that
explicit control is powerful!)</p>

<p>Unfortunately in practice it always feels clunky. The following implements
the word count state machine (albeit in a rather un-Pythonic way). The
generator returns the current count and is continued by sending it another
code point:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WHITESPACE</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mh">0x0009</span><span class="p">,</span> <span class="mh">0x000a</span><span class="p">,</span> <span class="mh">0x000b</span><span class="p">,</span> <span class="mh">0x000c</span><span class="p">,</span> <span class="mh">0x000d</span><span class="p">,</span>
    <span class="mh">0x0020</span><span class="p">,</span> <span class="mh">0x0085</span><span class="p">,</span> <span class="mh">0x00a0</span><span class="p">,</span> <span class="mh">0x1680</span><span class="p">,</span> <span class="mh">0x2000</span><span class="p">,</span>
    <span class="mh">0x2001</span><span class="p">,</span> <span class="mh">0x2002</span><span class="p">,</span> <span class="mh">0x2003</span><span class="p">,</span> <span class="mh">0x2004</span><span class="p">,</span> <span class="mh">0x2005</span><span class="p">,</span>
    <span class="mh">0x2006</span><span class="p">,</span> <span class="mh">0x2007</span><span class="p">,</span> <span class="mh">0x2008</span><span class="p">,</span> <span class="mh">0x2009</span><span class="p">,</span> <span class="mh">0x200a</span><span class="p">,</span>
    <span class="mh">0x2028</span><span class="p">,</span> <span class="mh">0x2029</span><span class="p">,</span> <span class="mh">0x202f</span><span class="p">,</span> <span class="mh">0x205f</span><span class="p">,</span> <span class="mh">0x3000</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">def</span> <span class="nf">wordcount</span><span class="p">():</span>
    <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># low signal
</span>            <span class="n">codepoint</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">count</span>
            <span class="k">if</span> <span class="n">codepoint</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">WHITESPACE</span><span class="p">:</span>
                <span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
                <span class="k">break</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># high signal
</span>            <span class="n">codepoint</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">count</span>
            <span class="k">if</span> <span class="n">codepoint</span> <span class="ow">in</span> <span class="n">WHITESPACE</span><span class="p">:</span>
                <span class="k">break</span>
</code></pre></div></div>

<p>However, the generator ceremony dominates the interface, so you’d probably
want to wrap it in something nicer — at which point there’s really no
reason to use the generator in the first place:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wc</span> <span class="o">=</span> <span class="n">wordcount</span><span class="p">()</span>
<span class="nb">next</span><span class="p">(</span><span class="n">wc</span><span class="p">)</span>  <span class="c1"># prime the generator
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">'A'</span><span class="p">))</span>  <span class="c1"># =&gt; 1
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span>  <span class="c1"># =&gt; 1
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">'B'</span><span class="p">))</span>  <span class="c1"># =&gt; 2
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span>  <span class="c1"># =&gt; 2
</span></code></pre></div></div>

<p>Same idea in Lua, which famously has full coroutines:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">local</span> <span class="n">WHITESPACE</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">[</span><span class="mh">0x0009</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000a</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000b</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000c</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x000d</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x0020</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x0085</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x00a0</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x1680</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2000</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2001</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2002</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2003</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2004</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2005</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2006</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2007</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2008</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2009</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x200a</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2028</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2029</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x202f</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x205f</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x3000</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span>
<span class="p">}</span>

<span class="k">function</span> <span class="nf">wordcount</span><span class="p">()</span>
    <span class="kd">local</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
        <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
            <span class="c1">-- low signal</span>
            <span class="kd">local</span> <span class="n">codepoint</span> <span class="o">=</span> <span class="nb">coroutine.yield</span><span class="p">(</span><span class="n">count</span><span class="p">)</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">WHITESPACE</span><span class="p">[</span><span class="n">codepoint</span><span class="p">]</span> <span class="k">then</span>
                <span class="n">count</span> <span class="o">=</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span>
                <span class="k">break</span>
            <span class="k">end</span>
        <span class="k">end</span>
        <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
            <span class="c1">-- high signal</span>
            <span class="kd">local</span> <span class="n">codepoint</span> <span class="o">=</span> <span class="nb">coroutine.yield</span><span class="p">(</span><span class="n">count</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">WHITESPACE</span><span class="p">[</span><span class="n">codepoint</span><span class="p">]</span> <span class="k">then</span>
                <span class="k">break</span>
            <span class="k">end</span>
        <span class="k">end</span>
    <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Aside from the initial priming call, <code class="language-plaintext highlighter-rouge">coroutine.wrap()</code> at least
hides the fact that it’s a coroutine.</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wc</span> <span class="o">=</span> <span class="nb">coroutine.wrap</span><span class="p">(</span><span class="n">wordcount</span><span class="p">)</span>
<span class="n">wc</span><span class="p">()</span>  <span class="c1">-- prime the coroutine</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">'A'</span><span class="p">))</span>  <span class="c1">-- =&gt; 1</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>  <span class="c1">-- =&gt; 1</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">'B'</span><span class="p">))</span>  <span class="c1">-- =&gt; 2</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>  <span class="c1">-- =&gt; 2</span>
</code></pre></div></div>

<h3 id="extra-examples">Extra examples</h3>

<p>Finally, a couple more examples not worth describing in detail here. First
a Unicode case folding state machine:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/misc/casefold.c">https://github.com/skeeto/scratch/blob/master/misc/casefold.c</a></p>

<p>It’s just an interface to do a lookup into the <a href="https://www.unicode.org/Public/13.0.0/ucd/CaseFolding.txt">official case folding
table</a>. It was an experiment, and I <em>probably</em> wouldn’t use it in a
real program.</p>

<p>Second, I’ve mentioned <a href="https://github.com/skeeto/utf-7">my UTF-7 encoder and decoder</a> before. It’s
not obvious from the interface, but internally it’s just a state machine
for both encoder and decoder, which is what allows it to “pause”
between any pair of input/output bytes.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Improving on QBasic's Random Number Generator</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/11/17/"/>
    <id>urn:uuid:9aba5382-01e4-41fc-bc27-b996b3c17f07</id>
    <updated>2020-11-17T02:51:23Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25120083">on Hacker News</a>.</em></p>

<p><a href="https://www.pixelships.com/">Pixelmusement</a> produces videos about <a href="/blog/2020/10/19/">MS-DOS games</a> and software.
Each video ends with a short, randomly-selected listing of financial
backers. In <a href="https://www.youtube.com/watch?v=YVV9bkbpaPY">ADG Filler #57</a>, Kris revealed the selection process,
and it absolutely fits the channel’s core theme: a <a href="https://en.wikipedia.org/wiki/QBasic">QBasic</a> program.
His program relies on QBasic’s built-in pseudo random number generator
(PRNG). Even accounting for the platform’s limitations, the PRNG is much
poorer quality than it could be. Let’s discuss these weaknesses and figure
out how to make the selection more fair.</p>

<!--more-->

<p>Kris’s program seeds the PRNG with the system clock (<code class="language-plaintext highlighter-rouge">RANDOMIZE TIMER</code>, a
QBasic idiom), populates an array with the backers represented as integers
(indices), continuously shuffles the list until the user presses a key, then
finally prints out a random selection from the array. Here’s a simplified
version of the program (note: QBasic comments start with apostrophe <code class="language-plaintext highlighter-rouge">'</code>):</p>

<pre><code class="language-qbasic">CONST ntickets = 203  ' input parameter
CONST nresults = 12

RANDOMIZE TIMER

DIM tickets(0 TO ntickets - 1) AS LONG
FOR i = 0 TO ntickets - 1
    tickets(i) = i
NEXT

CLS
PRINT "Press any key to stop shuffling..."
DO
    i = INT(RND * ntickets)
    j = INT(RND * ntickets)
    SWAP tickets(i), tickets(j)
LOOP WHILE INKEY$ = ""

FOR i = 0 TO nresults - 1
    PRINT tickets(i)
NEXT
</code></pre>

<p>This should be readable even if you don’t know QBasic. Note: In the real
program, backers at higher tiers get multiple tickets in order to weight
the results. This is accounted for in the final loop such that nobody
appears more than once. It’s mostly irrelevant to the discussion here, so
I’ve omitted it.</p>

<p>The final result is ultimately a function of just three inputs:</p>

<ol>
  <li>The system clock (<code class="language-plaintext highlighter-rouge">TIMER</code>)</li>
  <li>The total number of tickets</li>
  <li>The number of loop iterations until a key press</li>
</ol>

<p>The second item has the nice property that by becoming a backer you influence
the result.</p>

<h3 id="qbasic-rnd">QBasic RND</h3>

<p>QBasic’s PRNG is this 24-bit <a href="https://en.wikipedia.org/wiki/Linear_congruential_generator">Linear Congruential Generator</a> (LCG):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">rnd24</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The result is the entire 24-bit state. <code class="language-plaintext highlighter-rouge">RND</code> divides this by 2^24 and
returns it as a single precision float, so the caller receives a value in
the half-open interval [0, 1).</p>

<p>Needless to say, this is a very poor PRNG. The <a href="/blog/2019/11/19/">LCG constants are
<em>reasonable</em></a>, but the choice to limit the state to 24 bits is
strange. According to the <a href="https://www.qb64.org/forum/index.php?topic=1414.0">QBasic 16-bit assembly</a> (note: the LCG
constants listed here <a href="http://www.qb64.net/forum/index_topic_10727-0/">are wrong</a>), the implementation is a full
32-bit multiply using 16-bit limbs, and it allocates and writes a full 32
bits when storing the state. As expected for the 8086, there was nothing
gained by using only the lower 24 bits.</p>

<p>To illustrate how poor it is, here’s a <a href="https://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html">randogram</a> for this PRNG,
which shows obvious structure. (This is a small slice of a 4096x4096
randogram where each of the 2^23 24-bit samples is plotted as two 12-bit
coordinates.)</p>

<p><img src="/img/qbasic/rnd-thumb.png" alt="" /></p>

<p>Admittedly this far <a href="https://www.pcg-random.org/paper.html"><em>overtaxes</em></a> the PRNG. With a 24-bit state, it’s
only good for 4,096 (2^12) outputs, after which it no longer follows the
<a href="/blog/2019/07/22/">birthday paradox</a>: No outputs are repeated even though we should
start seeing some. However, as I’ll soon show, this doesn’t actually
matter.</p>

<p>Instead of discarding the high 8 bits — the highest quality output bits —
QBasic’s designers should have discarded the <em>low</em> 8 bits for the output,
turning it into a <em>truncated 32-bit LCG</em>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">rnd32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This LCG would have the same performance, but significantly better
quality. Here’s the randogram for this PRNG, which is <em>also</em> heavily
overtaxed (well past its 65,536, i.e. 2^16, useful outputs).</p>

<p><img src="/img/qbasic/rnd32-thumb.png" alt="" /></p>

<p>It’s a solid upgrade, <em>completely for free</em>!</p>

<h3 id="qbasic-randomize">QBasic RANDOMIZE</h3>

<p>That’s not the end of our troubles. The <code class="language-plaintext highlighter-rouge">RANDOMIZE</code> statement accepts a
double precision (i.e. 64-bit) seed. The high 16 bits of its IEEE 754
binary representation are XORed with the next highest 16 bits. The high 16
bits of the PRNG state are set to this result. The lowest 8 bits are
preserved.</p>

<p>To make this clearer, here’s a C implementation, verified against QBasic
7.1:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="n">s</span><span class="p">;</span>

<span class="kt">void</span>
<span class="nf">randomize</span><span class="p">(</span><span class="kt">double</span> <span class="n">seed</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">seed</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">24</span> <span class="o">^</span> <span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">40</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffff00</span> <span class="o">|</span> <span class="p">(</span><span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0xff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In other words, <strong><code class="language-plaintext highlighter-rouge">RANDOMIZE</code> only sets the PRNG to one of 65,536 possible
states</strong>.</p>

<p>As the final piece, here’s how <code class="language-plaintext highlighter-rouge">RND</code> is implemented, also verified against
QBasic 7.1:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">rnd</span><span class="p">(</span><span class="kt">float</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">arg</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">arg</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">s</span> <span class="o">/</span> <span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="mh">0x1000000</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="system-clock-seed">System clock seed</h3>

<p>The <a href="https://www.qb64.org/wiki/TIMER"><code class="language-plaintext highlighter-rouge">TIMER</code> function</a> returns the single precision number of
seconds since midnight with ~55ms precision (i.e. the 18.2Hz timer
interrupt counter). This is strictly time of day, and the current date is
not part of the result, unlike, say, the unix epoch.</p>

<p>This means there are only 1,572,480 distinct values returned by <code class="language-plaintext highlighter-rouge">TIMER</code>.
That’s small even before considering that these map onto only 65,536
possible seeds with <code class="language-plaintext highlighter-rouge">RANDOMIZE</code> — all of which <em>are</em> fortunately
realizable via <code class="language-plaintext highlighter-rouge">TIMER</code>.</p>

<p>Of the three inputs to random selection, this first one is looking pretty
bad.</p>

<h3 id="loop-iterations">Loop iterations</h3>

<p>Kris’s idea of continuously mixing the array until he presses a key makes
up for many of the QBasic PRNG’s weaknesses. He lets it run for over 200,000
array swaps — traversing over 2% of the PRNG’s period — and the array
itself acts like an extended PRNG state, supplementing the 24-bit <code class="language-plaintext highlighter-rouge">RND</code>
state.</p>

<p>Since iterations fly by quickly, the exact number of iterations becomes
another <a href="/blog/2019/04/30/">source of entropy</a>. The results will be quite different if it
runs 214,600 iterations versus 273,500 iterations.</p>

<p>Possible improvement: Only exit the loop when a certain key is pressed. If
any other key is pressed then that input and the <code class="language-plaintext highlighter-rouge">TIMER</code> are mixed into
the PRNG state. Mashing the keyboard during the loop introduces more
entropy.</p>

<h3 id="replacing-the-prng">Replacing the PRNG</h3>

<p>Since the built-in PRNG is so poor, we could improve the situation by
implementing a <a href="/blog/2017/09/21/">new one</a> in QBasic itself. The challenge is that
QBasic has no unsigned integers, nor even unsigned integer operators (e.g.
Java and JavaScript’s <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code>), and signed overflow is a run-time error. We
can’t even re-implement QBasic’s own LCG without doing long multiplication
in software, since the intermediate result overflows its 32-bit <code class="language-plaintext highlighter-rouge">LONG</code>.</p>
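<p>To make the overflow concrete, here is that LCG in C with a 64-bit
intermediate. The multiplier and increment are the commonly cited values
for QBasic’s <code class="language-plaintext highlighter-rouge">RND</code> (treat them
as an assumption); the point is that the product alone needs about 48
bits, which no QBasic integer type can hold.</p>

```c
#include <assert.h>
#include <stdint.h>

// Reportedly QBasic's RND recurrence (constants taken on faith):
//   state' = (state * &HFD43FD + &HC39EC3) MOD 2^24
// The 24-bit by 24-bit product reaches ~2^48, overflowing a 32-bit
// LONG, so C needs the 64-bit intermediate that QBasic lacks.
static uint32_t lcg24(uint32_t state)
{
    return (uint32_t)(((uint64_t)state * 0xFD43FD + 0xC39EC3) & 0xFFFFFF);
}
```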

<p>Popular choices under these constraints are the <a href="https://en.wikipedia.org/wiki/Lehmer_random_number_generator">Park–Miller generator</a> (as
we saw <a href="/blog/2018/12/25/">in Bash</a>) or a <a href="https://en.wikipedia.org/wiki/Lagged_Fibonacci_generator">lagged Fibonacci generator</a> (as used by
Emacs, which was for a long time constrained to 29-bit integers).</p>

<p>However, I have a better idea: a PRNG based on <a href="https://en.wikipedia.org/wiki/RC4">RC4</a>. Specifically,
my own design called <a href="https://github.com/skeeto/scratch/tree/master/sp4"><strong>Sponge4</strong></a>, a <a href="https://en.wikipedia.org/wiki/Sponge_function">sponge construction</a>
built atop RC4. In short: Mixing in more input is just a matter of running
the key schedule again. Implementing this PRNG requires just two simple
operations: modular addition over 2^8, and array swap. QBasic has a <code class="language-plaintext highlighter-rouge">SWAP</code>
statement, so it’s a natural fit!</p>

<p>Sponge4 (RC4) has much higher quality output than the 24-bit LCG, and I
can mix in more sources of entropy. With its 1,700-bit state, it can
absorb quite a bit of entropy without loss.</p>

<h4 id="learning-qbasic">Learning QBasic</h4>

<p>Until this past weekend, I had not touched QBasic for about 23 years and
had to learn it essentially from scratch. Though within a couple of hours
I probably already understood it better than I ever had. That’s in large
part because I’m far more experienced, but also probably because QBasic
tutorials are universally awful. Not surprisingly they’re written for
beginners, but they also seem to all be written <em>by</em> beginners. I
soon got the impression that the QBasic community has usually been another
case of <a href="/blog/2019/09/25/">the blind leading the blind</a>.</p>

<p>There’s little direct information for experienced programmers, and even
the official documentation tends to be thin in important places. I wanted
documentation that started with the core language semantics:</p>

<ul>
  <li>
    <p>The basic types are INTEGER (int16), LONG (int32), SINGLE (float32),
DOUBLE (float64), and two flavors of STRING, fixed-width and
variable-width. Late versions also had incomplete support for a 64-bit,
10,000x fixed-point CURRENCY type.</p>
  </li>
  <li>
    <p>Variables are SINGLE by default and do not need to be declared ahead of
time. Arrays have 11 elements by default.</p>
  </li>
  <li>
    <p>Variables, constants, and functions may have a suffix if their type is
not SINGLE: INTEGER <code class="language-plaintext highlighter-rouge">%</code>, LONG <code class="language-plaintext highlighter-rouge">&amp;</code>, SINGLE <code class="language-plaintext highlighter-rouge">!</code>, DOUBLE <code class="language-plaintext highlighter-rouge">#</code>, STRING <code class="language-plaintext highlighter-rouge">$</code>,
and CURRENCY <code class="language-plaintext highlighter-rouge">@</code>. For functions, this is the return type.</p>
  </li>
  <li>
    <p>Each variable type has its own namespace, i.e. <code class="language-plaintext highlighter-rouge">i%</code> is distinct from
<code class="language-plaintext highlighter-rouge">i&amp;</code>. Arrays are also their own namespace, i.e. <code class="language-plaintext highlighter-rouge">i%</code> is distinct from
<code class="language-plaintext highlighter-rouge">i%(0)</code> is distinct from <code class="language-plaintext highlighter-rouge">i&amp;(0)</code>.</p>
  </li>
  <li>
    <p>Variables may be declared explicitly with <code class="language-plaintext highlighter-rouge">DIM</code>. Declaring a variable
with <code class="language-plaintext highlighter-rouge">DIM</code> allows the suffix to be omitted. It also locks that name out
of the other type namespaces, i.e. <code class="language-plaintext highlighter-rouge">DIM i AS LONG</code> makes any use of <code class="language-plaintext highlighter-rouge">i%</code>
invalid in that scope. Though arrays and scalars can still have the same
name even with <code class="language-plaintext highlighter-rouge">DIM</code> declarations.</p>
  </li>
  <li>
    <p>Numeric operations with mixed types implicitly promote like C.</p>
  </li>
  <li>
    <p>Functions and subroutines have a single, common namespace regardless of
function suffix. As a result, the suffix can (usually) be omitted at
function call sites. Built-in functions are special in this case.</p>
  </li>
  <li>
    <p>Despite initial appearances, QBasic is statically-typed.</p>
  </li>
  <li>
    <p>The default is pass-by-reference. Use <code class="language-plaintext highlighter-rouge">BYVAL</code> to pass by value.</p>
  </li>
  <li>
    <p>In array declarations, the parameter is not the <em>size</em> but the largest
index. Multidimensional arrays are supported. Arrays need not be indexed
starting at zero (e.g. <code class="language-plaintext highlighter-rouge">(x TO y)</code>), though this is the default.</p>
  </li>
  <li>
    <p>Strings are not arrays, but their own special thing with special
accessor statements and functions.</p>
  </li>
  <li>
    <p>Scopes are module, subroutine, and function. “Global” variables must be
declared with <code class="language-plaintext highlighter-rouge">SHARED</code>.</p>
  </li>
  <li>
    <p>Users can define custom structures with <code class="language-plaintext highlighter-rouge">TYPE</code>. Functions cannot return
user-defined types and instead rely on pass-by-reference.</p>
  </li>
  <li>
    <p>A crude kind of dynamic allocation is supported with <code class="language-plaintext highlighter-rouge">REDIM</code> to resize
<code class="language-plaintext highlighter-rouge">$DYNAMIC</code> arrays at run-time. <code class="language-plaintext highlighter-rouge">ERASE</code> frees allocations.</p>
  </li>
</ul>

<p><em>These</em> are the semantics I wanted to know getting started. Throw in some
illustrative examples, and then it’s a tutorial for experienced
developers. (Future article perhaps?) Anyway, that’s enough to follow
along below.</p>

<h4 id="implementing-sponge4">Implementing Sponge4</h4>

<p>Like RC4, I need a 256-element byte array, and two 1-byte indices, <code class="language-plaintext highlighter-rouge">i</code> and
<code class="language-plaintext highlighter-rouge">j</code>. Sponge4 also keeps a third 1-byte counter, <code class="language-plaintext highlighter-rouge">k</code>, to count input.</p>

<pre><code class="language-qbasic">TYPE sponge4
    i AS INTEGER
    j AS INTEGER
    k AS INTEGER
    s(0 TO 255) AS INTEGER
END TYPE
</code></pre>

<p>QBasic doesn’t have a “byte” type. A fixed-size 256-byte string would
normally be a good match here, but since they’re not arrays, strings are
not compatible with <code class="language-plaintext highlighter-rouge">SWAP</code> and are not indexed efficiently. So instead I
accept some wasted space and use 16-bit integers for everything.</p>

<p>There are four “methods” for this structure. Three are subroutines since
they don’t return a value, but mutate the sponge. The last, <code class="language-plaintext highlighter-rouge">squeeze</code>,
returns the next byte as an INTEGER (<code class="language-plaintext highlighter-rouge">%</code>).</p>

<pre><code class="language-qbasic">DECLARE SUB init (r AS sponge4)
DECLARE SUB absorb (r AS sponge4, b AS INTEGER)
DECLARE SUB absorbstop (r AS sponge4)
DECLARE FUNCTION squeeze% (r AS sponge4)
</code></pre>

<p>Initialization follows RC4:</p>

<pre><code class="language-qbasic">SUB init (r AS sponge4)
    r.i = 0
    r.j = 0
    r.k = 0
    FOR i% = 0 TO 255
        r.s(i%) = i%
    NEXT
END SUB
</code></pre>

<p>Absorbing a byte means running the RC4 key schedule one step. Absorbing a
“stop” symbol, for separating inputs, transforms the state in a way that
absorbing a byte cannot.</p>

<pre><code class="language-qbasic">SUB absorb (r AS sponge4, b AS INTEGER)
    r.j = (r.j + r.s(r.i) + b) MOD 256
    SWAP r.s(r.i), r.s(r.j)
    r.i = (r.i + 1) MOD 256
    r.k = (r.k + 1) MOD 256
END SUB

SUB absorbstop (r AS sponge4)
    r.j = (r.j + 1) MOD 256
END SUB
</code></pre>

<p>Squeezing a byte may involve mixing the state first, then it runs the RC4
generator normally.</p>

<pre><code class="language-qbasic">FUNCTION squeeze% (r AS sponge4)
    IF r.k &gt; 0 THEN
        absorbstop r
        DO WHILE r.k &gt; 0
            absorb r, r.k
        LOOP
    END IF

    r.j = (r.j + r.i) MOD 256
    r.i = (r.i + 1) MOD 256
    SWAP r.s(r.i), r.s(r.j)
    squeeze% = r.s((r.s(r.i) + r.s(r.j)) MOD 256)
END FUNCTION
</code></pre>

<p>That’s the entire generator in QBasic! A couple more helper functions will
be useful, though. One absorbs entire strings, and the second emits 24-bit
results.</p>

<pre><code class="language-qbasic">SUB absorbstr (r AS sponge4, s AS STRING)
    FOR i% = 1 TO LEN(s)
        absorb r, ASC(MID$(s, i%))
    NEXT
END SUB

FUNCTION squeeze24&amp; (r AS sponge4)
    b0&amp; = squeeze%(r)
    b1&amp; = squeeze%(r)
    b2&amp; = squeeze%(r)
    squeeze24&amp; = b2&amp; * &amp;H10000 + b1&amp; * &amp;H100 + b0&amp;
END FUNCTION
</code></pre>

<p>QBasic doesn’t have bit-shift operations, so we must make do with
multiplication. The <code class="language-plaintext highlighter-rouge">&amp;H</code> is hexadecimal notation.</p>
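<p>Since the generator needs only modular addition and swaps, it also
translates almost line for line into C, which is handy for testing
outside DOSBox. This is an illustrative mirror of the QBasic code above,
not the canonical implementation (that lives in the sp4 repository):</p>

```c
#include <assert.h>

struct sponge4 {
    int i, j, k;
    int s[256];
};

static void init(struct sponge4 *r)
{
    r->i = r->j = r->k = 0;
    for (int i = 0; i < 256; i++) r->s[i] = i;
}

static void absorb(struct sponge4 *r, int b)
{
    r->j = (r->j + r->s[r->i] + b) % 256;
    int t = r->s[r->i]; r->s[r->i] = r->s[r->j]; r->s[r->j] = t;  // SWAP
    r->i = (r->i + 1) % 256;
    r->k = (r->k + 1) % 256;
}

static void absorbstop(struct sponge4 *r)
{
    r->j = (r->j + 1) % 256;
}

static int squeeze(struct sponge4 *r)
{
    if (r->k > 0) {                        // pending input: pad and mix
        absorbstop(r);
        while (r->k > 0) absorb(r, r->k);  // k wraps to 0 at 256
    }
    r->j = (r->j + r->i) % 256;
    r->i = (r->i + 1) % 256;
    int t = r->s[r->i]; r->s[r->i] = r->s[r->j]; r->s[r->j] = t;  // SWAP
    return r->s[(r->s[r->i] + r->s[r->j]) % 256];
}

static void absorbstr(struct sponge4 *r, const char *s)
{
    for (; *s; s++) absorb(r, (unsigned char)*s);
}

static long squeeze24(struct sponge4 *r)
{
    long b0 = squeeze(r), b1 = squeeze(r), b2 = squeeze(r);
    return b2*0x10000L + b1*0x100L + b0;
}
```

<p>Two sponges fed identical input must produce identical streams, and
every output stays within its advertised range; that determinism makes
the port easy to check.</p>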

<h4 id="putting-the-sponge-to-use">Putting the sponge to use</h4>

<p>One of the problems with the original program is that only the time of day
was a seed. Even were it mixed better, if we run the program at exactly
the same instant on two different days, we get the same seed. The <code class="language-plaintext highlighter-rouge">DATE$</code>
function returns the current date, which we can absorb into the sponge to
make the whole date part of the input.</p>

<pre><code class="language-qbasic">DIM sponge AS sponge4
init sponge
absorbstr sponge, DATE$
absorbstr sponge, MKS$(TIMER)
absorbstr sponge, MKI$(ntickets)
</code></pre>

<p>I follow this up with the timer. It’s converted to a string with <code class="language-plaintext highlighter-rouge">MKS$</code>,
which returns the little-endian, single precision binary representation as
a 4-byte string. <code class="language-plaintext highlighter-rouge">MKI$</code> does the same for INTEGER, as a 2-byte string.</p>

<p>One of the problems with the original program was bias: Multiplying <code class="language-plaintext highlighter-rouge">RND</code>
by a constant, then truncating the result to an integer is not uniform in
most cases. Some numbers are selected slightly more often than others
because 2^24 inputs cannot map uniformly onto, say, 10 outputs. With all
the shuffling in the original it probably doesn’t make a practical
difference, but I’d like to avoid it.</p>

<p>In my program I account for it by generating another number if it happens
to fall into that extra “tail” part of the input distribution (very
unlikely for small <code class="language-plaintext highlighter-rouge">ntickets</code>). The <code class="language-plaintext highlighter-rouge">squeezen</code> function uniformly
generates a number from 0 (inclusive) to N (exclusive).</p>

<pre><code class="language-qbasic">FUNCTION squeezen% (r AS sponge4, n AS INTEGER)
    DO
       x&amp; = squeeze24&amp;(r) - &amp;H1000000 MOD n
    LOOP WHILE x&amp; &lt; 0
    squeezen% = x&amp; MOD n
END FUNCTION
</code></pre>
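<p>To see why the rejection removes the bias, scale the same scheme down
to an 8-bit generator so the entire input space can be enumerated (an
illustrative analog, not the program’s code). Discarding draws from the
2^8 MOD n “tail” leaves every residue with exactly the same count:</p>

```c
#include <assert.h>

// Apply the squeezen%-style rejection rule to every possible 8-bit
// draw; return the per-residue count if the result is uniform, 0 if not.
static int draws_per_residue(int n)
{
    int count[256] = {0};
    int tail = 256 % n;             // size of the biased tail
    for (int v = 0; v < 256; v++) {
        int x = v - tail;
        if (x < 0) continue;        // reject: this draw would add bias
        count[x % n]++;
    }
    for (int i = 1; i < n; i++) {
        if (count[i] != count[0]) return 0;  // not uniform
    }
    return count[0];
}
```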

<p>Finally a Fisher–Yates shuffle, then print the first N elements:</p>

<pre><code class="language-qbasic">FOR i% = ntickets - 1 TO 1 STEP -1
    j% = squeezen%(sponge, i% + 1)
    SWAP tickets(i%), tickets(j%)
NEXT

FOR i% = 1 TO nresults
    PRINT tickets(i%)
NEXT
</code></pre>

<p>Though if you really love Kris’s loop idea:</p>

<pre><code class="language-qbasic">PRINT "Press Esc to finish, any other key for entropy..."
DO
    c&amp; = c&amp; + 1
    LOCATE 2, 1
    PRINT "cycles ="; c&amp;; "; keys ="; k%

    FOR i% = ntickets - 1 TO 1 STEP -1
        j% = squeezen%(sponge, i% + 1)
        SWAP tickets(i%), tickets(j%)
    NEXT

    k$ = INKEY$
    IF k$ = CHR$(27) THEN
        EXIT DO
    ELSEIF k$ &lt;&gt; "" THEN
        k% = k% + 1
        absorbstr sponge, k$
    END IF
    absorbstr sponge, MKS$(TIMER)
LOOP
</code></pre>

<p>If you want to try it out for yourself in, say, DOSBox, here’s the full
source: <a href="https://github.com/skeeto/scratch/blob/master/sp4/sponge4.bas"><strong><code class="language-plaintext highlighter-rouge">sponge4.bas</code></strong></a></p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>I Solved British Square</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/10/19/"/>
    <id>urn:uuid:c500b91a-046f-4320-8eff-9bc8f8443ef3</id>
    <updated>2020-10-19T19:32:52Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update</em>: I <a href="/blog/2022/10/12/">solved another game</a> using essentially the same
technique.</p>

<p><a href="https://boardgamegeek.com/boardgame/3719/british-square">British Square</a> is a 1978 abstract strategy board game which I
recently discovered <a href="https://www.youtube.com/watch?v=PChKZbut3lM&amp;t=10m">from a YouTube video</a>. It’s well-suited to play
by pencil-and-paper, so my wife and I played a few rounds to try it out.
Curious about strategies, I searched online for analysis and found
nothing whatsoever, meaning I’d have to discover strategies for myself.
This is <em>exactly</em> the sort of problem that <a href="https://xkcd.com/356/">nerd snipes</a>, and so I
sunk a couple of evenings building an analysis engine in C — enough to
fully solve the game and play <em>perfectly</em>.</p>

<p><strong>Repository</strong>: <a href="https://github.com/skeeto/british-square"><strong>British Square Analysis Engine</strong></a>
(and <a href="https://github.com/skeeto/british-square/releases">prebuilt binaries</a>)</p>

<p><a href="/img/british-square/british-square.jpg"><img src="/img/british-square/british-square-thumb.jpg" alt="" /></a>
<!-- Photo credit: Kelsey Wellons --></p>

<!--more-->

<p>The game is played on a 5-by-5 grid with two players taking turns
placing pieces of their color. Pieces may not be placed on tiles
4-adjacent to an opposing piece, and as a special rule, the first player
may not play the center tile on the first turn. Players pass when they
have no legal moves, and the game ends when both players pass. The score
is the difference between the piece counts for each player.</p>

<p>In the default configuration, my engine takes a few seconds to explore
the full game tree, then presents the <a href="https://en.wikipedia.org/wiki/Minimax">minimax</a> values for the
current game state along with the list of perfect moves. The UI allows
manually exploring down the game tree. It’s intended for analysis, but
there’s enough UI present to “play” against the AI should you so wish.
For some of my analysis I made small modifications to the program to
print or count game states matching certain conditions.</p>

<h3 id="game-analysis">Game analysis</h3>

<p>Not accounting for symmetries, there are 4,233,789,642,926,592 possible
playouts. In these playouts, the first player wins 2,179,847,574,830,592
(~51%), the second player wins 1,174,071,341,606,400 (~28%), and the
remaining 879,870,726,489,600 (~21%) are ties. It’s immediately obvious
the first player has a huge advantage.</p>

<p>Accounting for symmetries, there are 8,659,987 total game states. Of
these, 6,955 are terminal states, of which the first player wins 3,599
(~52%) and the second player wins 2,506 (~36%). This small number of
states is what allows the engine to fully explore the game tree in a few
seconds.</p>

<p>Most importantly: <strong>The first player can always win by two points.</strong> In
other words, it’s <em>not</em> like Tic-Tac-Toe where perfect play by both
players results in a tie. Due to the two-point margin, the first player
also has more room for mistakes and usually wins even without perfect
play. There are fewer opportunities to blunder, and a single blunder
usually results in a lower win score. The second player has a narrow
lane of perfect play, making it easy to blunder.</p>

<p>Below is the minimax analysis for the first player’s options. The number
is the first player’s score given perfect play from that point — i.e.
perfect play starts on the tiles marked “2”, and the tiles marked “0”
are blunders that lead to ties.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10-01
12021
11111
</code></pre></div></div>

<p>The special center rule probably exists to reduce the first player’s
obvious advantage, but in practice it makes little difference. Without
the rule, the first player has an additional (fifth) branch for a win by
two points:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10201
12021
11111
</code></pre></div></div>

<p>Improved alternative special rule: <strong>Bias the score by two in favor of
the second player.</strong> This fully eliminates the first player’s advantage,
perfect play by both sides results in a tie, and both players have a
narrow lane of perfect play.</p>

<p>The four tie openers are interesting because the reasoning does not
require computer assistance. If the first player opens on any of those
tiles, the second player can mirror each of the first player’s moves,
guaranteeing a tie. Note: The first player can still make mistakes that
result in a second player win <em>if</em> the second player knows when to stop
mirroring.</p>

<p>One of my goals was to develop a heuristic so that even human players
can play perfectly from memory, as in Tic-Tac-Toe. Unfortunately I was
not able to develop any such heuristic, though I <em>was</em> able to prove
that <strong>a greedy heuristic — always claim as much territory as possible —
is often incorrect</strong> and, in some cases, leads to blunders.</p>

<h3 id="engine-implementation">Engine implementation</h3>

<p>As <a href="/blog/2017/04/27/">I’ve done before</a>, my engine represents the game using
<a href="https://www.chessprogramming.org/Bitboards">bitboards</a>. Each player has a 25-bit bitboard representing their
pieces. To make move validation more efficient, it also sometimes tracks
a “mask” bitboard where invalid moves have been masked. Updating all
bitboards is cheap (<code class="language-plaintext highlighter-rouge">place()</code>, <code class="language-plaintext highlighter-rouge">mask()</code>), as is validating moves
against the mask (<code class="language-plaintext highlighter-rouge">valid()</code>).</p>
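<p>For a feel of how such a mask might be built (a hypothetical sketch;
the engine’s actual <code class="language-plaintext highlighter-rouge">place()</code> and <code class="language-plaintext highlighter-rouge">mask()</code> are in the repository), a piece
on tile <code class="language-plaintext highlighter-rouge">t</code> rules out <code class="language-plaintext highlighter-rouge">t</code> and its 4-neighbors, clipped at the
board edges:</p>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical sketch, not the engine's code: bits made illegal for
// the opponent by a piece on tile t (0-24, row-major on the 5x5 board).
static uint64_t tilemask(int t)
{
    int r = t / 5, c = t % 5;
    uint64_t m = 1ULL << t;           // the tile itself
    if (r > 0) m |= 1ULL << (t - 5);  // north
    if (r < 4) m |= 1ULL << (t + 5);  // south
    if (c > 0) m |= 1ULL << (t - 1);  // west
    if (c < 4) m |= 1ULL << (t + 1);  // east
    return m;
}
```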

<p>The longest possible game is 32 moves. This would <em>just</em> fit in 5 bits,
except that I needed a special “invalid” turn, making for a total of 33
distinct values. So I use 6 bits to store the turn counter.</p>

<p>Besides generally being unnecessary, the validation masks can be derived
from the main bitboards, so I don’t need to store them in the game tree.
That means I need 25 bits per player, and 6 bits for the counter: <strong>56
bits total</strong>. I pack these into a 64-bit integer. The first player’s
bitboard goes in the bottom 25 bits, the second player in the next 25
bits, and the turn counter in the topmost 6 bits. The turn counter
starts at 1, so an all zero state is invalid. I exploit this in the hash
table so that zeroed slots are empty (more on this later).</p>

<p>In other words, the <em>empty</em> state is <code class="language-plaintext highlighter-rouge">0x4000000000000</code> (<code class="language-plaintext highlighter-rouge">INIT</code>) and zero
is the null (invalid) state.</p>
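<p>The layout is easy to verify: with the turn counter in the top 6 bits
(50–55) and starting at 1, the empty state works out to exactly 2^50.
The accessor names below are mine, but the bit positions come straight
from the description above:</p>

```c
#include <assert.h>
#include <stdint.h>

// One game state in 56 bits of a 64-bit integer:
//   bits  0-24  first player's bitboard
//   bits 25-49  second player's bitboard
//   bits 50-55  turn counter (starts at 1, so all-zero means "null")
#define INIT 0x4000000000000ULL  // empty board, turn counter = 1

static int      turn(uint64_t b) { return (int)(b >> 50 & 0x3f); }
static uint64_t p1(uint64_t b)   { return b & 0x1ffffff; }
static uint64_t p2(uint64_t b)   { return b >> 25 & 0x1ffffff; }
```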

<p>Since the state is so small, rather than passing a pointer to a state to
be acted upon, bitboard functions return a new bitboard with the
requested changes… functional style.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// Compute bitboard+mask where first play is tile 6</span>
    <span class="c1">// -----</span>
    <span class="c1">// -X---</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="kt">uint64_t</span> <span class="n">b</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">place</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">mask</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
</code></pre></div></div>

<h4 id="minimax-costs">Minimax costs</h4>

<p>The engine uses minimax to propagate information up the tree. Since the
search extends to the very bottom of the tree, the minimax “heuristic”
evaluation function is the actual score, not an approximation, which is
why it’s able to play perfectly.</p>

<p>When <a href="/blog/2010/10/17/">I’ve used minimax before</a>, I built an actual tree data
structure in memory, linking states by pointer / reference. In this
engine there is no such linkage, and instead the links are computed
dynamically via the validation masks. Storing the pointers is more
expensive than computing their equivalents on the fly, <em>so I don’t store
them</em>. Therefore my game tree only requires 56 bits per node — or 64
bits in practice since I’m using a 64-bit integer. With only 8,659,987
nodes to store, that’s a mere 66MiB of memory! This analysis could have
easily been done on commodity hardware two decades ago.</p>

<p>What about the minimax values? Game scores range from -10 to 11: 22
distinct values. (That the first player can score up to 11 and the
second player at most 10 is another advantage to going first.) That’s 5
bits of information. However, I didn’t have this information up front,
and so I assumed a range from -25 to 25, which requires 6 bits.</p>

<p>There are still 8 spare bits left in the 64-bit integer, so I use 6 of
them for the minimax score. Rather than worry about two’s complement, I
bias the score to eliminate negative values before storing it. So the
minimax score rides along for free above the state bits.</p>

<h4 id="hash-table-memoization">Hash table (memoization)</h4>

<p>The vast majority of game tree branches are redundant. Even without
taking symmetries into account, nearly all states are reachable from
multiple branches. Exploring all these redundant branches would take
centuries. If I run into a state I’ve seen before, I don’t want to
recompute it.</p>

<p>Once I’ve computed a result, I store it in a hash table so that I can
find it later. Since the state is just a 64-bit integer, I use <a href="/blog/2018/07/31/">an
integer hash function</a> to compute a starting index from which to
linearly probe an open addressing hash table. The <em>entire</em> hash table
implementation is literally a dozen lines of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="o">*</span>
<span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">bitboard</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">table</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="mh">0xffffffffffffff</span><span class="p">;</span> <span class="c1">// sans minimax</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">bitboard</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">*=</span> <span class="mh">0xcca1cee435c5048f</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">^=</span> <span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span> <span class="o">%</span> <span class="n">N</span><span class="p">;</span> <span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="n">bitboard</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the bitboard is not found, it returns a pointer to the (zero-valued)
slot where it should go so that the caller can fill it in.</p>

<h4 id="canonicalization">Canonicalization</h4>

<p>Memoization eliminates nearly all redundancy, but there’s still a major
optimization left. Many states are equivalent by symmetry or reflection.
Taking that into account, about 7/8th of the remaining work can still be
eliminated.</p>

<p>Multiple different states that are identical by symmetry must be
somehow “folded” into a single, <em>canonical</em> state to represent them all.
I do this by visiting all 8 rotations and reflections and choosing the
one with the smallest 64-bit integer representation.</p>

<p>I only need two operations to visit all 8 symmetries, and I chose
transpose (flip around the diagonal) and vertical flip. Alternating
between these operations visits each symmetry. Since they’re bitboards,
transforms can be implemented using <a href="https://www.chessprogramming.org/Flipping_Mirroring_and_Rotating">fancy bit-twiddling hacks</a>.
Chess boards, with their power-of-two dimensions, have useful properties
which these British Square boards lack, so this is the best I could come
up with:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Transpose a board or mask (flip along the diagonal).</span>
<span class="kt">uint64_t</span>
<span class="nf">transpose</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000020000010</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000410000208</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00008208004104</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00104104082082</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfe082083041041</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x01041040820820</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00820800410400</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00410000208000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00200000100000</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// Flip a board or mask vertically.</span>
<span class="kt">uint64_t</span>
<span class="nf">flipv</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0000003e00001f</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x000007c00003e0</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfc00f800007c00</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x001f00000f8000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x03e00001f00000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These transform both players’ bitboards in parallel while leaving the
turn counter intact. The logic here is quite simple: Shift the bitboard
a little bit at a time while using a mask to deposit bits in their new
home once they’re lined up. It’s like a coin sorter. Vertical flip is
analogous to byte-swapping, though with 5-bit “bytes”.</p>

<p>Canonicalizing a bitboard now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">canonicalize</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">c</span> <span class="o">=</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Callers need only use <code class="language-plaintext highlighter-rouge">canonicalize()</code> on values they pass to <code class="language-plaintext highlighter-rouge">lookup()</code>
or store in the table (via the returned pointer).</p>

<h3 id="developing-a-heuristic">Developing a heuristic</h3>

<p>If you can come up with a perfect play heuristic, especially one that
can be reasonably performed by humans, I’d like to hear it. My engine
has a built-in heuristic tester, so I can test it against perfect play
at all possible game positions to check that it actually works. It’s
currently programmed to test the greedy heuristic and print out the
millions of cases where it fails. Even a heuristic that fails in only a
small number of cases would be pretty reasonable.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>When Parallel: Pull, Don't Push</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/04/30/"/>
    <id>urn:uuid:ac12ef1d-299f-4edb-9eb1-5ed4dac1219c</id>
    <updated>2020-04-30T22:35:51Z</updated>
    <category term="optimization"/><category term="interactive"/><category term="javascript"/><category term="opengl"/><category term="media"/><category term="webgl"/><category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=23089729">on Hacker News</a>.</em></p>

<p>I’ve noticed a small pattern across a few of my projects where I had
vectorized and parallelized some code. The original algorithm had a
“push” approach, the optimized version instead took a “pull” approach.
In this article I’ll describe what I mean, though it’s mostly just so I
can show off some pretty videos, pictures, and demos.</p>

<!--more-->

<h3 id="sandpiles">Sandpiles</h3>

<p>A good place to start is the <a href="https://en.wikipedia.org/wiki/Abelian_sandpile_model">Abelian sandpile model</a>, which, like
many before me, completely <a href="https://xkcd.com/356/">captured</a> my attention for a while.
It’s a cellular automaton where each cell is a pile of grains of sand —
a sandpile. At each step, any sandpile with four or more grains of
sand spills one grain into each of its four 4-connected neighbors,
regardless of the number of grains in those neighboring cells. Cells at
the edge spill
their grains into oblivion, and those grains no longer exist.</p>

<p>With excess sand falling over the edge, the model eventually hits a
stable state where all piles have three or fewer grains. However, until
it reaches stability, all sorts of interesting patterns ripple through
the cellular automaton. In certain cases, the final pattern itself is
beautiful and interesting.</p>

<p>Numberphile has a great video describing how to <a href="https://www.youtube.com/watch?v=1MtEUErz7Gg">form a group over
recurrent configurations</a> (<a href="https://www.youtube.com/watch?v=hBdJB-BzudU">also</a>). In short, for any given grid
size, there’s a stable <em>identity</em> configuration that, when “added” to
any other element in the group, stabilizes back to that element. The
identity configuration is a fractal itself, and has been a focus of
study on its own.</p>

<p>Computing the identity configuration is really just about running the
simulation to completion a couple times from certain starting
configurations. Here’s an animation of the process for computing the
64x64 identity configuration:</p>

<p><a href="https://nullprogram.com/video/?v=sandpiles-64"><img src="/img/identity-64-thumb.png" alt="" /></a></p>

<p>As a fractal, the larger the grid, the more self-similar patterns there
are to observe. There are lots of samples online, and the biggest I
could find was <a href="https://commons.wikimedia.org/wiki/File:Sandpile_group_identity_on_3000x3000_grid.png">this 3000x3000 on Wikimedia Commons</a>. But I wanted
to see one <em>that’s even bigger, damnit</em>! So, skipping to the end, I
eventually computed this 10000x10000 identity configuration:</p>

<p><a href="/img/identity-10000.png"><img src="/img/identity-10000-thumb.png" alt="" /></a></p>

<p>This took 10 days to compute using my optimized implementation:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c">https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c</a></p>

<p>I picked an algorithm described <a href="https://codegolf.stackexchange.com/a/106990">in a code golf challenge</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(ones(n)*6 - f(ones(n)*6))
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">f()</code> is the function that runs the simulation to a stable state.</p>
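<p>As a concrete sketch of that formula — not the parallel implementation linked below — here's a minimal single-threaded version in C. The grid size <code>N</code> is an illustrative assumption kept tiny so it finishes instantly:</p>

```c
#define N 8

/* Run the simulation to a stable state: repeatedly topple any pile
 * with four or more grains, one grain to each 4-connected neighbor.
 * Grains that fall off the edge simply vanish. */
void stabilize(int g[N][N])
{
    for (int done = 0; !done;) {
        done = 1;
        for (int y = 0; y < N; y++) {
            for (int x = 0; x < N; x++) {
                if (g[y][x] >= 4) {
                    done = 0;
                    g[y][x] -= 4;
                    if (y > 0)   g[y-1][x]++;
                    if (y < N-1) g[y+1][x]++;
                    if (x > 0)   g[y][x-1]++;
                    if (x < N-1) g[y][x+1]++;
                }
            }
        }
    }
}

/* identity = f(6 - f(6)), where "6" is the all-sixes grid */
void identity(int g[N][N])
{
    int t[N][N];
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            t[y][x] = 6;
    stabilize(t);
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            g[y][x] = 6 - t[y][x];
    stabilize(g);
}
```

<p>A quick sanity check of the result: “adding” the identity to itself and stabilizing yields the identity again.</p>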

<p>I used <a href="/blog/2015/07/10/">OpenMP to parallelize across cores, and SIMD to parallelize
within a thread</a>. Each thread operates on 32 sandpiles at a time.
To compute the identity sandpile, each sandpile only needs 3 bits of
state, so this could potentially be increased to 85 sandpiles at a time
on the same hardware. The output format is my old mainstay, Netpbm,
<a href="/blog/2017/11/03/">including the video output</a>.</p>

<h4 id="sandpile-push-and-pull">Sandpile push and pull</h4>

<p>So, what do I mean about pushing and pulling? The naive approach to
simulating sandpiles looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each i in sandpiles {
    if input[i] &lt; 4 {
        output[i] = input[i]
    } else {
        output[i] = input[i] - 4
        for each j in neighbors {
            output[j] = output[j] + 1
        }
    }
}
</code></pre></div></div>

<p>As the algorithm examines each cell, it <em>pushes</em> results into
neighboring cells. If we’re using concurrency, that means multiple
threads of execution may be mutating the same cell, which requires
synchronization — locks, <a href="/blog/2014/09/02/">atomics</a>, etc. That much
synchronization is the death knell of performance. The threads will
spend all their time contending for the same resources, even if it’s
just false sharing.</p>

<p>The solution is to <em>pull</em> grains from neighbors:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each i in sandpiles {
    if input[i] &lt; 4 {
        output[i] = input[i]
    } else {
        output[i] = input[i] - 4
    }
    for each j in neighbors {
        if input[j] &gt;= 4 {
            output[i] = output[i] + 1
        }
    }
}
</code></pre></div></div>

<p>Each thread only modifies one cell — the cell it’s in charge of updating
— so no synchronization is necessary. It’s shader-friendly and should
sound familiar if you’ve seen <a href="/blog/2014/06/10/">my WebGL implementation of Conway’s Game
of Life</a>. It’s essentially the same algorithm. If you chase down
the various Abelian sandpile references online, you’ll eventually come
across a 2017 paper by Cameron Fish about <a href="http://people.reed.edu/~davidp/homepage/students/fish.pdf">running sandpile simulations
on GPUs</a>. He cites my WebGL Game of Life article, bringing
everything full circle. We had spoken by email at the time, and he
<a href="https://people.reed.edu/~davidp/web_sandpiles/">shared his <strong>interactive simulation</strong> with me</a>.</p>

<p>Vectorizing this algorithm is straightforward: Load multiple piles at
once, one per SIMD channel, and use masks to implement the branches. In
my code I’ve also unrolled the loop. To avoid bounds checking in the
SIMD code, I pad the state data structure with zeros so that the edge
cells have static neighbors and are no longer special.</p>
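<p>Here's a scalar sketch of that padded pull update — the real code is SIMD and unrolled, and the dimensions and 8-bit cell type here are illustrative assumptions:</p>

```c
#define W 64
#define H 64

/* One pull step over a zero-padded grid. Real cell (y, x) lives at
 * padded index (y+1, x+1), so even edge cells have four readable
 * neighbors and the inner loop needs no bounds checks. */
void step(unsigned char in[H+2][W+2], unsigned char out[H+2][W+2])
{
    for (int y = 1; y <= H; y++) {
        for (int x = 1; x <= W; x++) {
            unsigned char c = in[y][x];
            c = c < 4 ? c : c - 4;   /* topple self if needed */
            c += in[y-1][x] >= 4;    /* pull a grain from each */
            c += in[y+1][x] >= 4;    /* toppling neighbor      */
            c += in[y][x-1] >= 4;
            c += in[y][x+1] >= 4;
            out[y][x] = c;
        }
    }
}
```

<p>The branchless neighbor pulls are exactly what a masked SIMD comparison computes, one lane per pile.</p>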

<h3 id="webgl-fire">WebGL Fire</h3>

<p>Back in the old days, one of the <a href="http://fabiensanglard.net/doom_fire_psx/">cool graphics tricks was fire
animations</a>. It was so easy to implement on limited hardware. In
fact, the most obvious way to compute it was directly in the
framebuffer, such as in <a href="/blog/2014/12/09/">the VGA buffer</a>, with no outside state.</p>

<p>There’s a heat source at the bottom of the screen, and the algorithm
runs from bottom up, propagating that heat upwards randomly. Here’s the
algorithm using traditional screen coordinates (top-left corner origin):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func rand(min, max) // random integer in [min, max]

for each x, y from bottom {
    buf[y-1][x+rand(-1, 1)] = buf[y][x] - rand(0, 1)
}
</code></pre></div></div>

<p>As a <em>push</em> algorithm it works fine with a single thread, but
it doesn’t translate well to modern video hardware. So convert it to a
<em>pull</em> algorithm!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each x, y {
    sx = x + rand(-1, 1)
    sy = y + rand(1, 2)
    output[y][x] = input[sy][sx] - rand(0, 1)
}
</code></pre></div></div>

<p>Cells pull the fire upward from the bottom. Though this time there’s a
catch: <em>This algorithm will have subtly different results.</em></p>

<ul>
  <li>
    <p>In the original, there’s a single state buffer and so a flame could
propagate upwards multiple times in a single pass. I’ve compensated
here by allowing flames to propagate further at once.</p>
  </li>
  <li>
    <p>In the original, a flame only propagates to one other cell. In this
version, two cells might pull from the same flame, cloning it.</p>
  </li>
</ul>

<p>In the end it’s hard to tell the difference, so this works out.</p>
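<p>Fleshing the pull pseudocode out into C might look like the following sketch. The dimensions are illustrative, and the stand-in for <code>rand()</code> is a per-cell integer hash (the same contention-free trick discussed at the end of this section):</p>

```c
#include <stdint.h>

#define FW 40   /* framebuffer width  (illustrative) */
#define FH 24   /* framebuffer height (illustrative) */

/* Stand-in for rand(): hash x, y, and the frame number into 32
 * independent-looking bits so every cell draws its own random
 * values with no shared state. */
uint32_t cell_bits(uint32_t x, uint32_t y, uint32_t frame)
{
    uint32_t h = x*0x9e3779b9u ^ y*0x85ebca6bu ^ frame*0xc2b2ae35u;
    h ^= h >> 16;  h *= 0x7feb352du;
    h ^= h >> 15;  h *= 0x846ca68bu;
    h ^= h >> 16;
    return h;
}

/* One pull pass: each cell samples heat from a cell 1-2 rows below,
 * jittered horizontally, and randomly decays it. The bottom row is
 * the heat source and is copied through unchanged. */
void fire_step(uint8_t in[FH][FW], uint8_t out[FH][FW], uint32_t frame)
{
    for (int y = 0; y < FH - 1; y++) {
        for (int x = 0; x < FW; x++) {
            uint32_t r = cell_bits(x, y, frame);
            int sx = x + (int)(r % 3) - 1;        /* rand(-1, +1) */
            int sy = y + 1 + (int)(r >> 2 & 1);   /* rand(+1, +2) */
            sx = sx < 0 ? 0 : sx >= FW ? FW - 1 : sx;
            sy = sy > FH - 1 ? FH - 1 : sy;
            uint8_t v = in[sy][sx];
            out[y][x] = v && (r >> 3 & 1) ? v - 1 : v;  /* rand(0, 1) */
        }
    }
    for (int x = 0; x < FW; x++) {
        out[FH-1][x] = in[FH-1][x];
    }
}
```

<p>Each output cell is written exactly once, by exactly one iteration, so the passes can run in parallel — or as a fragment shader — with no synchronization.</p>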

<p><a href="https://nullprogram.com/webgl-fire/"><img src="/img/fire-thumb.png" alt="" /></a></p>

<p><a href="https://github.com/skeeto/webgl-fire/">source code and instructions</a></p>

<p>There’s still potentially contention in that <code class="language-plaintext highlighter-rouge">rand()</code> function, but this
can be resolved <a href="https://www.shadertoy.com/view/WttXWX">with a hash function</a> that takes <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> as
inputs.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Purgeable Memory Allocations for Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/29/"/>
    <id>urn:uuid:50300bbe-0939-4bcf-96ff-8fb96a9b12d5</id>
    <updated>2019-12-29T00:25:49Z</updated>
    <category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I saw (part of) a video, <a href="https://www.youtube.com/watch?v=9l0nWEUpg7s">OS hacking: Purgeable memory</a>, by
Andreas Kling who’s writing an operating system called <a href="https://github.com/SerenityOS/serenity">Serenity</a>
and recording videos of his progress. In the video he implements
<em>purgeable memory</em> as <a href="https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/CachingandPurgeableMemory.html">found on some Apple platforms</a> by adding
special support in the kernel. A process tells the kernel that a
particular range of memory isn’t important, and so the kernel can
reclaim it if the system is under memory pressure — the memory is
purgeable.</p>

<p>Linux has a mechanism like this, <a href="http://man7.org/linux/man-pages/man2/madvise.2.html"><code class="language-plaintext highlighter-rouge">madvise(2)</code></a>, that allows
processes to provide hints to the kernel on how memory is expected to be
used. The flag of interest is <code class="language-plaintext highlighter-rouge">MADV_FREE</code>:</p>

<blockquote>
  <p>The application no longer requires the pages in the range specified by
<code class="language-plaintext highlighter-rouge">addr</code> and <code class="language-plaintext highlighter-rouge">len</code>. The kernel can thus free these pages, but the
freeing could be delayed until memory pressure occurs. For each of the
pages that has been marked to be freed but has not yet been freed, the
free operation will be canceled if the caller writes into the page.</p>
</blockquote>

<p>So, given this, I built a proof of concept / toy on top of <code class="language-plaintext highlighter-rouge">MADV_FREE</code>
that provides this functionality for Linux:</p>

<p><strong><a href="https://github.com/skeeto/purgeable">https://github.com/skeeto/purgeable</a></strong></p>

<p>It <a href="/blog/2018/11/15/">allocates anonymous pages</a> using <code class="language-plaintext highlighter-rouge">mmap(2)</code>. When the allocation
is “unlocked” — i.e. the process isn’t actively using it — its pages are
marked with <code class="language-plaintext highlighter-rouge">MADV_FREE</code> so that the kernel can reclaim them at any time.
To lock the allocation so that the process can safely make use of them,
the <code class="language-plaintext highlighter-rouge">MADV_FREE</code> is canceled. This is all a little trickier than it sounds,
and that’s the subject of this article.</p>

<p>Note: There’s also <code class="language-plaintext highlighter-rouge">MADV_DONTNEED</code> which seems like it would fit the
bill, but <a href="https://www.youtube.com/watch?v=bg6-LVCHmGM#t=58m23s">it’s implemented incorrectly in Linux</a>. It <em>immediately</em>
frees the pages, and so it’s useless for implementing purgeable memory.</p>

<h3 id="purgeable-api">Purgeable API</h3>

<p>Before diving into the implementation, here’s the API. It’s <a href="/blog/2018/06/10/">just four
functions</a> with no structure definitions. The pointer used by the
API is the memory allocation itself. All the bookkeeping <a href="/blog/2017/01/08/">associated
with that pointer</a> is hidden away, out of sight from the API’s
consumer. The full documentation is in <a href="https://github.com/skeeto/purgeable/blob/master/purgeable.h"><code class="language-plaintext highlighter-rouge">purgeable.h</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">purgeable_alloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">purgeable_unlock</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">purgeable_lock</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">purgeable_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The semantics are much like a C++ <code class="language-plaintext highlighter-rouge">weak_ptr</code> in that locking both
validates that the allocation is still available and creates a “strong”
reference to it that prevents it from being purged. Though unlike a weak
reference, the allocation is stickier. It will remain until the system is
actually under pressure, not just when the garbage collector happens to
run or the last strong reference is gone.</p>

<p>Here’s how it might be used to, say, store decoded PNG data that can
be decompressed again if needed:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">texture</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">png</span> <span class="o">*</span><span class="n">png</span> <span class="o">=</span> <span class="n">png_load</span><span class="p">(</span><span class="s">"texture.png"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">png</span><span class="p">)</span> <span class="n">die</span><span class="p">();</span>

<span class="cm">/* ... */</span>

<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">texture</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">texture</span> <span class="o">=</span> <span class="n">purgeable_alloc</span><span class="p">(</span><span class="n">png</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">*</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">*</span> <span class="mi">4</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">texture</span><span class="p">)</span> <span class="n">die</span><span class="p">();</span>
        <span class="n">png_decode_rgba</span><span class="p">(</span><span class="n">png</span><span class="p">,</span> <span class="n">texture</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">purgeable_lock</span><span class="p">(</span><span class="n">texture</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">purgeable_free</span><span class="p">(</span><span class="n">texture</span><span class="p">);</span>
        <span class="n">texture</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">continue</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">glTexImage2D</span><span class="p">(</span>
        <span class="n">GL_TEXTURE_2D</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">GL_RGBA</span><span class="p">,</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">width</span><span class="p">,</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">height</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">GL_RGBA</span><span class="p">,</span> <span class="n">GL_UNSIGNED_BYTE</span><span class="p">,</span> <span class="n">texture</span>
    <span class="p">);</span>
    <span class="n">purgeable_unlock</span><span class="p">(</span><span class="n">texture</span><span class="p">);</span>
    <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Memory is allocated in a locked state since it’s very likely to be
immediately filled with data. The application should unlock it before
moving on with other tasks. The purgeable memory must always be freed
using <code class="language-plaintext highlighter-rouge">purgeable_free()</code>, even if <code class="language-plaintext highlighter-rouge">purgeable_lock()</code> failed. This not only
frees the bookkeeping, but also releases the now-zero pages and the
mapping itself. Originally I had <code class="language-plaintext highlighter-rouge">purgeable_lock()</code> free the purgeable
memory on failure, but I felt this was clearer. There’s no technical
reason it couldn’t, though.</p>

<h3 id="purgeable-implementation">Purgeable Implementation</h3>

<p>The main challenge is that the kernel doesn’t necessarily treat the
<code class="language-plaintext highlighter-rouge">MADV_FREE</code> range as a unit. It might reclaim just some pages, and do
so in an arbitrary order. In order to lock the region, each page must be
handled individually. Per the man page quoted above, reversing
<code class="language-plaintext highlighter-rouge">MADV_FREE</code> requires a write to each page — to either trigger a page
fault or set <a href="https://en.wikipedia.org/wiki/Dirty_bit">a dirty bit</a>.</p>

<p>The only way to tell if a page has been purged is to check if it’s been
filled with zeros. That’s easy if we’re sure a particular byte in the
page should be zero, but, since this is a library, the caller might just
store <em>anything</em> on these pages.</p>

<p>So here’s my solution: To unlock a page, look at the first byte on the
page. Remember whether or not it’s zero. If it’s zero, write a 1 into
that byte. Once this has been done for all pages, use <code class="language-plaintext highlighter-rouge">madvise(2)</code> to
mark them all <code class="language-plaintext highlighter-rouge">MADV_FREE</code>.</p>

<p>With this approach, the library only needs to track one bit of information
per page regardless of the page’s contents. Assuming 4kB pages, each 32kB
of allocation has 1 byte of overhead (amortized) — or ~0.003% overhead.
Not too bad!</p>

<p>Locking purgeable memory is a little trickier. Again, each page must be
visited in turn, and if any page was purged, then the whole allocation is
considered lost. If the first byte was non-zero when unlocking, the
library checks that it’s still non-zero. If the first byte was zero when
unlocking, then it prepares to write a zero back into that byte, which
must currently be non-zero.</p>

<p>In either case, the <code class="language-plaintext highlighter-rouge">MADV_FREE</code> needs to be canceled using a write, so
the library <a href="/blog/2014/09/02/">does an atomic compare-and-swap</a> (CAS) to write the
correct byte into the page, <em>even if it’s the same value</em> in the
non-zero case. The atomic CAS is essential because <strong>it ensures the page
wasn’t purged between the check and the write, as both are done
together, atomically</strong>. If every page has the expected first byte, and
every CAS succeeded, then the purgeable memory has been successfully
locked.</p>
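<p>A sketch of the locking pass using C11 atomics, again with an illustrative <code>bits</code> table rather than the library's exact layout:</p>

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* For each page: load the first byte, fail if it was purged to zero,
 * then CAS the correct value back into place. The CAS is the write
 * that cancels MADV_FREE, and its success proves the page wasn't
 * purged between the load and the store. */
bool try_lock(unsigned char *buf, size_t numpages,
              size_t pagesize, const unsigned char *bits)
{
    for (size_t i = 0; i < numpages; i++) {
        _Atomic unsigned char *first =
            (_Atomic unsigned char *)(buf + i*pagesize);
        unsigned char c = atomic_load(first);
        if (!c) {
            return false;               /* page was purged */
        }
        /* a set bit means the byte was originally zero: restore it */
        unsigned char want = bits[i/8] >> (i%8) & 1 ? 0 : c;
        if (!atomic_compare_exchange_strong(first, &c, want)) {
            return false;               /* purged during the CAS */
        }
    }
    return true;
}
```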

<p>As an optimization, the library could consider more than just the first
byte, and look at, say, the first <code class="language-plaintext highlighter-rouge">long int</code> on each page. The library
does less work when the page contains a non-zero value, and the chance of
an arbitrary 8-byte value being zero is much lower. However, I wanted to
avoid <a href="/blog/2018/07/20/#strict-aliasing">potential aliasing issues</a>, especially if this library were
to be embedded, so I passed on the idea.</p>

<h4 id="bookkeeping">Bookkeeping</h4>

<p>The bookkeeping data is stored just before the buffer returned as the
purgeable memory, and it’s never marked with <code class="language-plaintext highlighter-rouge">MADV_FREE</code>. Assuming 4kB
pages, for each 128MB of purgeable memory the library allocates one extra
anonymous page to track it. The number of pages in the allocation is
stored just before the purgeable memory as a <code class="language-plaintext highlighter-rouge">size_t</code>, and the rest is the
per-page bit table described above.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">purgeable_alloc</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">);</span>
<span class="kt">size_t</span> <span class="n">numpages</span> <span class="o">=</span> <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
</code></pre></div></div>

<p>So the library can immediately find it starting from the purgeable memory
address. Here’s an illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      ,--- p
      |
      v
----------------------------------------------
|...Z|    |    |    |    |    |    |    |    |
----------------------------------------------
 ^  ^
 |  |
 |  `--- size_t numpages
 |
 `--- bit table
</code></pre></div></div>

<p>The downside is that buffer underflows in the application would easily
trample the <code class="language-plaintext highlighter-rouge">numpages</code> value because it’s located immediately adjacent. It
would be safer to move it to the <em>beginning</em> of the first page before the
purgeable memory, but this would have made bit table access more
complicated. While the region is locked, the contents of the bit table
don’t matter, so it won’t be damaged by an underflow. Another idea: put a
checksum alongside <code class="language-plaintext highlighter-rouge">numpages</code>. It could just be a simple <a href="/blog/2018/07/31/">integer
hash</a>.</p>

<p>This makes for a really slick API since the consumer doesn’t need to track
anything more than a single pointer, the address of the purgeable memory
allocation itself.</p>

<h3 id="worth-using">Worth using?</h3>

<p>I’m not quite sure how often I’d actually use purgeable memory in real
programs, especially in software intended to be portable. Each operating
system needs its own implementation, and this library is not portable
since it relies on interfaces and behaviors specific to Linux.</p>

<p>It also has a not-so-unlikely pathological case: Imagine a program that
makes two purgeable memory allocations, and they’re large enough that one
always evicts the other. The program would thrash back and forth
fighting itself as it used each allocation. Detecting this situation
might be difficult, especially as the number of purgeable memory
allocations increases.</p>

<p>Regardless, it’s another tool for my software toolbelt.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Efficient Alias of a Built-In Emacs Lisp Function</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/10/"/>
    <id>urn:uuid:15421609-2681-4b75-99b2-b2d6aaa835fe</id>
    <updated>2019-12-10T02:32:04Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Suppose you don’t like the names <code class="language-plaintext highlighter-rouge">car</code> and <code class="language-plaintext highlighter-rouge">cdr</code>, the traditional
identifiers for two halves of a lisp cons cell. <a href="https://irreal.org/blog/?p=8500">This is
misguided.</a> A cons is really just a 2-tuple, and the halves
don’t have any particular meaning on their own, even as “head” and
“tail.” However, maybe this is really important to you so you want to
do it anyway. What’s the best way to go about it?</p>

<h3 id="defalias">defalias</h3>

<p>Emacs Lisp has a built-in function just for this, <code class="language-plaintext highlighter-rouge">defalias</code>, which
is the obvious choice.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defalias</span> <span class="ss">'car-alias</span> <span class="nf">#'</span><span class="nb">car</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">car</code> built-in function is so fundamental to the language that <a href="/blog/2014/01/04/">it
gets its own byte-code opcode</a>. When you call <code class="language-plaintext highlighter-rouge">car</code> in your code,
the byte-compiler doesn’t generate a function call, but instead uses a
single instruction. For example, here’s an <code class="language-plaintext highlighter-rouge">add</code> function that sums
the <code class="language-plaintext highlighter-rouge">car</code> of its two arguments. I’ve followed the definition with its
disassembly (Emacs 26.3, <a href="/blog/2016/12/22/">lexical scope</a>):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">add</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">b</span><span class="p">)))</span>
<span class="c1">;; 0       stack-ref 1</span>
<span class="c1">;; 1       car</span>
<span class="c1">;; 2       stack-ref 1</span>
<span class="c1">;; 3       car</span>
<span class="c1">;; 4       plus</span>
<span class="c1">;; 5       return</span>
</code></pre></div></div>

<p>There are zero function calls because of the dedicated <code class="language-plaintext highlighter-rouge">car</code> opcode, and
it has the optimal six byte-code instructions.</p>

<p>The problem with <code class="language-plaintext highlighter-rouge">defalias</code> is that the definition is permitted to change
— or <a href="/blog/2013/01/22/">be advised</a> — and that robs the byte-compiler of
optimization opportunities. It’s <a href="/blog/2019/12/09/">a constraint</a>. When the
byte-code compiler sees <code class="language-plaintext highlighter-rouge">car-alias</code>, it <em>must</em> emit a function call:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">add-alias</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nv">car-alias</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nv">car-alias</span> <span class="nv">b</span><span class="p">)))</span>
<span class="c1">;; 0       constant  car-alias</span>
<span class="c1">;; 1       stack-ref 2</span>
<span class="c1">;; 2       call      1</span>
<span class="c1">;; 3       constant  car-alias</span>
<span class="c1">;; 4       stack-ref 2</span>
<span class="c1">;; 5       call      1</span>
<span class="c1">;; 6       plus</span>
<span class="c1">;; 7       return</span>
</code></pre></div></div>

<p>This has two function calls and eight byte-code instructions. Those
function calls are significantly more expensive than a <code class="language-plaintext highlighter-rouge">car</code>
instruction, which will show in the benchmark later.</p>

<h3 id="defsubst">defsubst</h3>

<p>An alternative is <code class="language-plaintext highlighter-rouge">defsubst</code>, an inlined function definition, which
will inline an actual <code class="language-plaintext highlighter-rouge">car</code>. The semantics for <code class="language-plaintext highlighter-rouge">defsubst</code> are, like
macros, explicit that re-definitions may not affect previous uses, so
the constraint is gone. Unfortunately <a href="/blog/2019/02/24/">the byte-code compiler is
pretty dumb</a>, and does a poor job inlining <code class="language-plaintext highlighter-rouge">car-subst</code>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defsubst</span> <span class="nv">car-subst</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">car</span> <span class="nv">x</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">add-subst</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nv">car-subst</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nv">car-subst</span> <span class="nv">b</span><span class="p">)))</span>
<span class="c1">;; 0       stack-ref 1</span>
<span class="c1">;; 1       dup</span>
<span class="c1">;; 2       car</span>
<span class="c1">;; 3       stack-set 1</span>
<span class="c1">;; 5       stack-ref 1</span>
<span class="c1">;; 6       dup</span>
<span class="c1">;; 7       car</span>
<span class="c1">;; 8       stack-set 1</span>
<span class="c1">;; 10      plus</span>
<span class="c1">;; 11      return</span>
</code></pre></div></div>

<p>There are zero function calls and ten byte-code instructions. The
<code class="language-plaintext highlighter-rouge">car</code> opcode <em>is</em> in use, but there are four unnecessary instructions.
This is still faster than making the function calls, though. If the
byte-code compiler was just a little smarter and could compile this to
the ideal case, then this would be the end of the discussion.</p>

<h3 id="cl-first">cl-first</h3>

<p>The built-in <code class="language-plaintext highlighter-rouge">cl-lib</code> package has a <code class="language-plaintext highlighter-rouge">cl-first</code> alias for <code class="language-plaintext highlighter-rouge">car</code>. This was
written by someone with intimate knowledge of Emacs Lisp, so how
well did they do?</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">require</span> <span class="ss">'cl-lib</span><span class="p">)</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">add-cl-first</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nv">cl-first</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nv">cl-first</span> <span class="nv">b</span><span class="p">)))</span>
<span class="c1">;; 0       stack-ref 1</span>
<span class="c1">;; 1       car</span>
<span class="c1">;; 2       stack-ref 1</span>
<span class="c1">;; 3       car</span>
<span class="c1">;; 4       plus</span>
<span class="c1">;; 5       return</span>
</code></pre></div></div>

<p>It’s just like plain old <code class="language-plaintext highlighter-rouge">car</code>! How did they manage this? By using a
byte-compiler hint:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defalias</span> <span class="ss">'cl-first</span> <span class="ss">'car</span><span class="p">)</span>
<span class="p">(</span><span class="nv">put</span> <span class="ss">'cl-first</span> <span class="ss">'byte-optimizer</span> <span class="ss">'byte-compile-inline-expand</span><span class="p">)</span>
</code></pre></div></div>

<p>They used <code class="language-plaintext highlighter-rouge">defalias</code>, but they also manually told the byte-compiler to
inline the definition like <code class="language-plaintext highlighter-rouge">defsubst</code>. In fact, <code class="language-plaintext highlighter-rouge">defsubst</code> expands to an
expression that sets this same <code class="language-plaintext highlighter-rouge">byte-optimizer</code> property, but, as seen above,
the inlined function’s argument-handling overhead gets copied in rather than eliminated.</p>

<h3 id="benchmark">Benchmark</h3>

<p>So how do the alternatives perform? (<a href="https://gist.github.com/skeeto/36baa3b1493f53eab4e082b449448a96">benchmark source</a>)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add           (0.594811299 0 0.0)
add-alias     (1.232037132 0 0.0)
add-subst     (0.700044324 0 0.0)
add-cl-first  (0.58332882 0 0.0)
</code></pre></div></div>

<p>(The <code class="language-plaintext highlighter-rouge">car</code> of the list is the running time.) Since <code class="language-plaintext highlighter-rouge">add</code> and
<code class="language-plaintext highlighter-rouge">add-cl-first</code> have the same byte-codes, we shouldn’t, and didn’t, see
a significant difference. The simple use of <code class="language-plaintext highlighter-rouge">defalias</code> doubles the
running time, and using <code class="language-plaintext highlighter-rouge">defsubst</code> is about 18% slower.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Chunking Optimizations: Let the Knife Do the Work</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/09/"/>
    <id>urn:uuid:961086fa-46af-42d4-bd69-6f4a326a1505</id>
    <updated>2019-12-09T22:37:55Z</updated>
    <category term="c"/><category term="cpp"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>There’s an old saying, <a href="https://www.youtube.com/watch?v=bTee6dKpDB0"><em>let the knife do the work</em></a>. Whether
preparing food in the kitchen or whittling a piece of wood, don’t push
your weight into the knife. Not only is it tiring, you’re much more
likely to hurt yourself. Use the tool properly and little force will be
required.</p>

<p>The same advice also often applies to compilers.</p>

<p>Suppose you need to XOR two, non-overlapping 64-byte (512-bit) blocks of
data. The simplest approach would be to do it a byte at a time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* XOR src into dst */</span>
<span class="kt">void</span>
<span class="nf">xor512a</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">64</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Maybe you benchmark it or you look at the assembly output, and the
results are disappointing. Your compiler did <em>exactly</em> what you asked
of it and produced code that performs 64 single-byte XOR operations
(GCC 9.2.0, x86-64, <code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512a:</span>
        <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nl">.L0:</span>    <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="nb">rax</span><span class="p">]</span>
        <span class="nf">xor</span>    <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="nb">rax</span><span class="p">],</span> <span class="nb">cl</span>
        <span class="nf">inc</span>    <span class="nb">rax</span>
        <span class="nf">cmp</span>    <span class="nb">rax</span><span class="p">,</span> <span class="mi">64</span>
        <span class="nf">jne</span>    <span class="nv">.L0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The target architecture has wide registers so it could be doing <em>at
least</em> 8 bytes at a time. Since your compiler isn’t doing it, you
decide to chunk the work into 8-byte blocks yourself, a manual
<em>chunking operation</em>. Here’s some <a href="https://old.reddit.com/r/C_Programming/comments/e83jzk/strange_gcc_compiler_bug_when_using_o2_or_higher/">real world
code</a> that does so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* WARNING: Broken, do not use! */</span>
<span class="kt">void</span>
<span class="nf">xor512b</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You check the assembly output of this function, and it looks much
better. It’s now processing 8 bytes at a time, so it should be about 8
times faster than before.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512b:</span>
        <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nl">.L0:</span>    <span class="nf">mov</span>    <span class="nb">rcx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="nb">rax</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
        <span class="nf">xor</span>    <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="nb">rax</span><span class="o">*</span><span class="mi">8</span><span class="p">],</span> <span class="nb">rcx</span>
        <span class="nf">inc</span>    <span class="nb">rax</span>
        <span class="nf">cmp</span>    <span class="nb">rax</span><span class="p">,</span> <span class="mi">8</span>
        <span class="nf">jne</span>    <span class="nv">.L0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Still, this machine has 16-byte wide registers (SSE2 <code class="language-plaintext highlighter-rouge">xmm</code>), so there
could be another doubling in speed. Oh well, this is good enough, so you
plug it into your program. But something strange happens: <strong>The output
is now wrong!</strong></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">dst</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span>
        <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">16</span>
    <span class="p">};</span>
    <span class="kt">uint32_t</span> <span class="n">src</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">25</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">49</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span>
        <span class="mi">81</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="mi">144</span><span class="p">,</span> <span class="mi">169</span><span class="p">,</span> <span class="mi">196</span><span class="p">,</span> <span class="mi">225</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span>
    <span class="p">};</span>
    <span class="n">xor512b</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="n">src</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">dst</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Your program prints 1..16 as if <code class="language-plaintext highlighter-rouge">xor512b()</code> was never called. You check
over everything a dozen times, and you can’t find anything wrong. Even
crazier, if you disable optimizations then the bug goes away. It must be
some kind of compiler bug!</p>

<p>Investigating a bit more, you learn that the <code class="language-plaintext highlighter-rouge">-fno-strict-aliasing</code>
option also fixes the bug. That’s because this program violates C strict
aliasing rules. An array of <code class="language-plaintext highlighter-rouge">uint32_t</code> was accessed as a <code class="language-plaintext highlighter-rouge">uint64_t</code>. As
an <a href="/blog/2018/07/20/#strict-aliasing">important optimization</a>, compilers are allowed to assume such
variables do not alias and generate code accordingly. Otherwise every
memory store could potentially modify any variable, which limits the
compiler’s ability to produce decent code.</p>

<p>The original version is fine because <code class="language-plaintext highlighter-rouge">char *</code>, including both <code class="language-plaintext highlighter-rouge">signed</code>
and <code class="language-plaintext highlighter-rouge">unsigned</code>, has a special exemption and may alias with anything. For
the same reason, using <code class="language-plaintext highlighter-rouge">char *</code> unnecessarily can also make your
programs slower.</p>

<p>What could you do to keep the chunking operation while not running afoul
of strict aliasing? Counter-intuitively, you could use <code class="language-plaintext highlighter-rouge">memcpy()</code>. Copy
the chunks into legitimate, local <code class="language-plaintext highlighter-rouge">uint64_t</code> variables, do the work, and
copy the result back out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xor512c</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">uint64_t</span> <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">dst</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">src</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">memcpy</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">dst</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">memcpy()</code> is a built-in function, your compiler knows its
semantics and can ultimately elide all that copying. The assembly
listing for <code class="language-plaintext highlighter-rouge">xor512c</code> is identical to <code class="language-plaintext highlighter-rouge">xor512b</code>, but it won’t go haywire
when integrated into a real program.</p>

<p>It works and it’s correct, but you can still do much better than this!</p>

<h3 id="letting-your-compiler-do-the-work">Letting your compiler do the work</h3>

<p>The problem is you’re forcing the knife and not letting it do the work.
There’s a constraint on your compiler that hasn’t been considered: It
must work correctly for overlapping inputs.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">74</span><span class="p">]</span> <span class="o">=</span> <span class="p">{...};</span>
<span class="n">xor512a</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">buf</span> <span class="o">+</span> <span class="mi">10</span><span class="p">);</span>
</code></pre></div></div>

<p>In this situation, the byte-by-byte and chunked versions of the function
will have different results. That’s exactly why your compiler can’t do
the chunking operation itself. However, <em>you don’t care about this
situation</em> because the inputs never overlap.</p>

<p>Let’s revisit the first, simple implementation, but this time being
smarter about it. The <code class="language-plaintext highlighter-rouge">restrict</code> keyword indicates that the inputs
will not overlap, freeing your compiler of this unwanted concern.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xor512d</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">64</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(Side note: Adding <code class="language-plaintext highlighter-rouge">restrict</code> to the manually chunked function,
<code class="language-plaintext highlighter-rouge">xor512b()</code>, will not fix it. Using <code class="language-plaintext highlighter-rouge">restrict</code> can never make an
incorrect program correct.)</p>

<p>Compiled with GCC 9.2.0 and <code class="language-plaintext highlighter-rouge">-O3</code>, the resulting unrolled code
processes 16-byte chunks at a time (<code class="language-plaintext highlighter-rouge">pxor</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512d:</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm2</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm3</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm4</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">]</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm2</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm3</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm4</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Compiled with Clang 9.0.0 with AVX-512 enabled in the target
(<code class="language-plaintext highlighter-rouge">-mavx512bw</code>), <em>it does the entire operation in a single, big chunk!</em></p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512d:</span>
        <span class="nf">vmovdqu64</span>   <span class="nv">zmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">vpxorq</span>      <span class="nv">zmm0</span><span class="p">,</span> <span class="nv">zmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
        <span class="nf">vmovdqu64</span>   <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">zmm0</span>
        <span class="nf">vzeroupper</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>“Letting the knife do the work” means writing a correct program and
lifting unnecessary constraints so that the compiler can use whatever
chunk size is appropriate for the target.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>On-the-fly Linear Congruential Generator Using Emacs Calc</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/11/19/"/>
    <id>urn:uuid:13e56720-ef3a-4fa4-a4ff-0a6fef914504</id>
    <updated>2019-11-19T01:17:50Z</updated>
    <category term="emacs"/><category term="crypto"/><category term="optimization"/><category term="c"/><category term="java"/><category term="javascript"/>
    <content type="html">
      <![CDATA[<p>I regularly make throwaway “projects” and do a surprising amount of
programming in <code class="language-plaintext highlighter-rouge">/tmp</code>. For Emacs Lisp, the equivalent is the
<code class="language-plaintext highlighter-rouge">*scratch*</code> buffer. These are places where I can make a mess, and the
mess usually gets cleaned up before it becomes a problem. A lot of my
established projects (<a href="/blog/2019/03/22/">ex</a>.) start out in volatile storage and
only graduate to more permanent storage once the concept has proven
itself.</p>

<p>Throughout my whole career, this sort of throwaway experimentation has
been an important part of my personal growth, and I try to <a href="/blog/2016/09/02/">encourage it
in others</a>. Even if the idea I’m trying doesn’t pan out, I usually
learn something new, and occasionally it translates into an article here.</p>

<p>I also enjoy small programming challenges. One of the most abused
tools in my mental toolbox is the Monte Carlo method, and I readily
apply it to solve toy problems. Even beyond this, random number
generators are frequently a useful tool (<a href="/blog/2017/04/27/">1</a>, <a href="/blog/2019/07/22/">2</a>), so I
find myself reaching for one all the time.</p>

<p>Nearly every programming language comes with a pseudo-random number
generation function or library. Unfortunately the language’s standard
PRNG is usually a poor choice (C, <a href="https://arvid.io/2018/06/30/on-cxx-random-number-generator-quality/">C++</a>, <a href="https://lowleveldesign.org/2018/08/15/randomness-in-net/">C#</a>, <a href="https://grokbase.com/t/gg/golang-nuts/155f6kbb7a/go-nuts-why-are-high-bits-used-by-math-rand-helpers-instead-of-low-ones">Go</a>).
It’s probably mediocre quality, <a href="/blog/2018/05/27/">slower than it needs to be</a>
(<a href="https://grokbase.com/t/gg/golang-nuts/155f6kbb7a/go-nuts-why-are-high-bits-used-by-math-rand-helpers-instead-of-low-ones">also</a>), <a href="https://lists.freebsd.org/pipermail/svn-src-head/2013-July/049068.html">lacks reliable semantics or behavior between
implementations</a>, or is missing some other property I want. So I’ve
long been a fan of <em>BYOPRNG:</em> Bring Your Own Pseudo-random Number
Generator. Just embed a generator with the desired properties directly
into the program. The <a href="/blog/2017/09/21/">best non-cryptographic PRNGs today</a> are
tiny and exceptionally friendly to embedding. Though, depending on what
you’re doing, you might <a href="/blog/2019/04/30/">need to be creative about seeding</a>.</p>

<h3 id="crafting-a-prng">Crafting a PRNG</h3>

<p>On occasion I don’t have an established, embeddable PRNG in reach, and
I have yet to commit xoshiro256** to memory. Or maybe I want to use
a totally unique PRNG for a particular project. In these cases I make
one up. With just a bit of know-how it’s not too difficult.</p>

<p>Probably the easiest decent PRNG to code from scratch is the venerable
<a href="https://en.wikipedia.org/wiki/Linear_congruential_generator">Linear Congruential Generator</a> (LCG). It’s a simple recurrence
relation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x[1] = (x[0] * A + C) % M
</code></pre></div></div>

<p>That’s trivial to remember once you know the details. You only need to
choose appropriate values for <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">C</code>, and <code class="language-plaintext highlighter-rouge">M</code>. Done correctly, it
will be a <em>full-period</em> generator — one that visits every number
from 0 to <code class="language-plaintext highlighter-rouge">M - 1</code> exactly once, in some permuted order, before
repeating. The seed — the value of <code class="language-plaintext highlighter-rouge">x[0]</code> — chooses a starting
position in this (looping) permutation.</p>
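<p>To make “full period” concrete, here is a toy check of my own (not from
the original article) at <code class="language-plaintext highlighter-rouge">M = 8</code>, small enough to verify
exhaustively:</p>

```c
/* Returns 1 if x -> (x*a + c) % 8 visits all 8 residues before
 * repeating, i.e. the LCG is full-period for M = 8. */
int full_period8(unsigned a, unsigned c)
{
    unsigned seen = 0;
    unsigned x = 0;  /* any seed gives the same answer */
    for (int i = 0; i < 8; i++) {
        seen |= 1u << x;
        x = (x*a + c) % 8;
    }
    return seen == 0xFFu;
}
```

<p>With <code class="language-plaintext highlighter-rouge">A = 5</code> and <code class="language-plaintext highlighter-rouge">C = 1</code> the generator is full-period, while
<code class="language-plaintext highlighter-rouge">A = 3</code> (where <code class="language-plaintext highlighter-rouge">A - 1</code> isn’t divisible by four) cycles early.</p>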

<p><code class="language-plaintext highlighter-rouge">M</code> has a natural, obvious choice: a power of two matching the range of
operands, such as 2^32 or 2^64. With this the modulo operation is free
as a natural side effect of the computer architecture.</p>
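<p>In C, for instance, the reduction takes no code at all. A small
illustration of my own: unsigned arithmetic is defined to wrap, so the
modulo by 2^64 happens automatically.</p>

```c
#include <stdint.h>

/* One LCG step at M = 2^64. The "% M" is implicit: uint64_t
 * arithmetic wraps modulo 2^64 by definition. */
uint64_t lcg_step(uint64_t x, uint64_t a, uint64_t c)
{
    return x*a + c;
}
```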

<p>Choosing <code class="language-plaintext highlighter-rouge">C</code> also isn’t difficult. It must be co-prime with <code class="language-plaintext highlighter-rouge">M</code>, and
since <code class="language-plaintext highlighter-rouge">M</code> is a power of two, any odd number is valid. Even 1. In
theory choosing a small value like 1 is faster since the compiler
won’t need to embed a large integer in the code, but this difference
doesn’t show up in any micro-benchmarks I tried. If you want a cool,
unique generator, then choose a large random integer. More on that
below.</p>

<p>The tricky value is <code class="language-plaintext highlighter-rouge">A</code>, and getting it right is the linchpin of the
whole LCG. It must be coprime with <code class="language-plaintext highlighter-rouge">M</code> (i.e. not even), and, for a
full-period generator, <code class="language-plaintext highlighter-rouge">A-1</code> must be divisible by four. For better
results, <code class="language-plaintext highlighter-rouge">A-1</code> should not be divisible by 8. A good choice is a prime
number that satisfies these properties.</p>
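<p>For <code class="language-plaintext highlighter-rouge">M</code> = 2^64, all of these conditions collapse into a one-line
screen. This is my own sketch, not from the article:</p>

```c
#include <stdint.h>

/* For M = 2^64: A must be odd, with A-1 divisible by 4 but ideally
 * not by 8. Together these reduce to A % 8 == 5. */
int good_multiplier(uint64_t a)
{
    return a % 8 == 5;
}
```

<p>This is also why the Calc recipe below looks for a trailing hex digit of
<code class="language-plaintext highlighter-rouge">5</code> or <code class="language-plaintext highlighter-rouge">D</code>: both are congruent to 5 modulo 8.</p>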

<p>If your operands are 64-bit integers, or larger, how are you going to
generate a prime number?</p>

<h4 id="primes-from-emacs-calc">Primes from Emacs Calc</h4>

<p>Emacs Calc can solve this problem. I’ve <a href="/blog/2009/06/23/">noted before</a> how
featureful it is. It has arbitrary precision, random number
generation, and primality testing. It’s everything we need to choose
<code class="language-plaintext highlighter-rouge">A</code>. (In fact, this is nearly identical to <a href="/blog/2015/10/30/">the process I used to
implement RSA</a>.) For this example I’m going to generate a 64-bit
LCG for the C programming language, but it’s easy to use whatever
width you like and mostly whatever language you like. If you wanted a
<a href="http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html">minimal standard 128-bit LCG</a>, this will still work.</p>

<p>Start by opening up Calc with <code class="language-plaintext highlighter-rouge">M-x calc</code>, then:</p>

<ol>
  <li>Push <code class="language-plaintext highlighter-rouge">2</code> on the stack</li>
  <li>Push <code class="language-plaintext highlighter-rouge">64</code> on the stack</li>
  <li>Press <code class="language-plaintext highlighter-rouge">^</code>, computing 2^64 and pushing it on the stack</li>
  <li>Press <code class="language-plaintext highlighter-rouge">k r</code> to generate a random number in this range</li>
  <li>Press <code class="language-plaintext highlighter-rouge">d r 16</code> to switch to hexadecimal display</li>
  <li>Press <code class="language-plaintext highlighter-rouge">k n</code> to find the next prime following the random value</li>
  <li>Repeat step 6 until you get a number that ends with <code class="language-plaintext highlighter-rouge">5</code> or <code class="language-plaintext highlighter-rouge">D</code></li>
  <li>Press <code class="language-plaintext highlighter-rouge">k p</code> a few times to avoid false positives</li>
</ol>

<p>What’s left on the stack is your <code class="language-plaintext highlighter-rouge">A</code>! If you want a random value for
<code class="language-plaintext highlighter-rouge">C</code>, you can follow a similar process. Heck, make it prime, too!</p>

<p>The reason for using hexadecimal (step 5) and looking for <code class="language-plaintext highlighter-rouge">5</code> or <code class="language-plaintext highlighter-rouge">D</code>
(step 7) is that such numbers satisfy both of the important properties
for <code class="language-plaintext highlighter-rouge">A-1</code>.</p>

<p>Calc doesn’t try to factor your random integer. Instead it uses the
<a href="https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test">Miller–Rabin primality test</a>, a probabilistic test that, itself,
requires random numbers. It has false positives but no false negatives.
The false positives can be mitigated by repeating the test multiple
times, hence step 8.</p>
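<p>The test itself is compact enough to sketch from memory. Here is a
minimal version of my own, assuming the GCC/Clang
<code class="language-plaintext highlighter-rouge">unsigned __int128</code> extension for overflow-free modular
multiplication, and using fixed small witnesses where a real implementation
would draw random ones:</p>

```c
#include <stdint.h>

/* (a * b) % m without overflow, via a 128-bit intermediate. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)((unsigned __int128)a * b % m);
}

/* (b ^ e) % m by square-and-multiply. */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = mulmod(r, b, m);
        b = mulmod(b, b, m);
    }
    return r;
}

/* One Miller-Rabin round: 0 means n is certainly composite, 1 means
 * n is probably prime with respect to witness a. */
static int mr_round(uint64_t n, uint64_t a)
{
    uint64_t d = n - 1;
    int s = 0;
    while (!(d & 1)) { d >>= 1; s++; }
    uint64_t x = powmod(a, d, n);
    if (x == 1 || x == n - 1) return 1;
    for (int i = 1; i < s; i++) {
        x = mulmod(x, x, n);
        if (x == n - 1) return 1;
    }
    return 0;
}

int is_probable_prime(uint64_t n, int rounds)
{
    static const uint64_t witness[] = {2, 3, 5, 7, 11, 13, 17};
    if (n < 4) return n == 2 || n == 3;
    if (!(n & 1)) return 0;
    for (int i = 0; i < rounds && i < 7; i++) {
        uint64_t a = witness[i] % n;
        if (a != 0 && !mr_round(n, a)) return 0;
    }
    return 1;
}
```

<p>Calc’s own test is more sophisticated, but the structure is the same:
repeat the round until the residual false-positive probability is
negligible.</p>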

<p>Trying this all out right now, I got this implementation (in C):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">lcg1</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, we can still do a little better. Outputting the entire state
doesn’t produce great results, so instead it’s better to create a
<em>truncated</em> LCG and return only a portion of the most significant
bits.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">lcg2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This won’t quite pass <a href="http://simul.iro.umontreal.ca/testu01/tu01.html">BigCrush</a> in 64-bit form, but the results
are pretty reasonable for most purposes.</p>

<p>But we can still do better without needing to remember much more than
this.</p>

<h3 id="appending-permutation">Appending permutation</h3>

<p>A <a href="http://www.pcg-random.org/">Permuted Congruential Generator</a> (PCG) is really just a
truncated LCG with a permutation applied to its output. Like LCGs
themselves, there are arbitrarily many variations. The “official”
implementation has a <a href="/blog/2018/02/07/">data-dependent shift</a>, for which I can
never remember the details. Fortunately a couple of simple, easy-to-remember
transformations are sufficient: basically anything I used
<a href="/blog/2018/07/31/">while prospecting for hash functions</a>. I love xorshifts, so
let’s add one of those:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">pcg1</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is a big improvement, but it still fails one BigCrush test. As
they say, when xorshift isn’t enough, use xorshift-multiply! Below I
generated a 32-bit prime for the multiply, but any odd integer is a
valid permutation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">pcg2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">*=</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0x60857ba9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This passes BigCrush, and I can reliably build a new one entirely from
scratch using Calc any time I need it.</p>

<h3 id="bonus-adapting-to-other-languages">Bonus: Adapting to other languages</h3>

<p>Sometimes it’s not so straightforward to adapt this technique to other
languages. For example, JavaScript has limited support for 32-bit
integer operations (enough for a poor 32-bit LCG) and no 64-bit
integer operations. Though <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt">BigInt</a> is now a thing, and should
make a great 96- or 128-bit LCG easy to build.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">lcg</span><span class="p">(</span><span class="nx">seed</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">let</span> <span class="nx">s</span> <span class="o">=</span> <span class="nx">BigInt</span><span class="p">(</span><span class="nx">seed</span><span class="p">);</span>
    <span class="k">return</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
        <span class="nx">s</span> <span class="o">*=</span> <span class="mh">0xef725caa331524261b9646cd</span><span class="nx">n</span><span class="p">;</span>
        <span class="nx">s</span> <span class="o">+=</span> <span class="mh">0x213734f2c0c27c292d814385</span><span class="nx">n</span><span class="p">;</span>
        <span class="nx">s</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffffffffffffffffff</span><span class="nx">n</span><span class="p">;</span>
        <span class="k">return</span> <span class="nb">Number</span><span class="p">(</span><span class="nx">s</span> <span class="o">&gt;&gt;</span> <span class="mi">64</span><span class="nx">n</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Java doesn’t have unsigned integers, so how could you build the above
PCG in Java? Easy! First, remember that Java has two’s complement
semantics, including wrap around, and that two’s complement doesn’t
care about unsigned or signed for multiplication (or addition, or
subtraction). The result is identical. Second, the oft-forgotten <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code>
operator does an unsigned right shift. With these two tips:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>

<span class="kt">int</span> <span class="nf">pcg2</span><span class="o">()</span> <span class="o">{</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="mh">0x7c3c3267d015ceb5</span><span class="no">L</span> <span class="o">+</span> <span class="mh">0x24bd2d95276253a9</span><span class="no">L</span><span class="o">;</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">(</span><span class="kt">int</span><span class="o">)(</span><span class="n">s</span> <span class="o">&gt;&gt;&gt;</span> <span class="mi">32</span><span class="o">);</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;&gt;</span> <span class="mi">16</span><span class="o">;</span>
    <span class="n">r</span> <span class="o">*=</span> <span class="mh">0x60857ba9</span><span class="o">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>

<p>So, in addition to the Calc step list above, you may need to know some
of the finer details of your target language.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Legitimate-ish Use of alloca()</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/10/28/"/>
    <id>urn:uuid:ce906d6f-b228-4dc6-bd02-34b845d3c5e2</id>
    <updated>2019-10-28T00:42:23Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=21374863">on Hacker News</a></em>.</p>

<p>Yesterday <a href="/blog/2019/10/27/">I wrote about a legitimate use for variable length
arrays</a>. While recently discussing this topic with <a href="/blog/2016/09/02/">a
co-worker</a>, I also thought of a semi-legitimate use for
<a href="http://man7.org/linux/man-pages/man3/alloca.3.html"><code class="language-plaintext highlighter-rouge">alloca()</code></a>, a non-standard “function” for dynamically allocating
memory on the stack.</p>

<!--more-->

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">alloca</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>I say “function” in quotes because it’s not truly a function and cannot
be implemented as a function or by a library. It’s implemented in the
compiler and is essentially part of the language itself. It’s a tool
allowing a function to manipulate its own stack frame.</p>

<p>Like VLAs, it has the problem that if you’re able to use <code class="language-plaintext highlighter-rouge">alloca()</code>
safely, then you really don’t need it in the first place. Allocation
failures are undetectable and once they happen it’s already too late.</p>

<h3 id="opaque-structs">Opaque structs</h3>

<p>To set the scene, let’s talk about opaque structs. Suppose you’re
writing a C library with <a href="/blog/2018/06/10/">a clean interface</a>. It’s set up so that
changing your struct fields won’t break the Application Binary Interface
(ABI), and callers are largely unable to depend on implementation
details, even by accident. To achieve this, it’s likely you’re making
use of <em>opaque structs</em> in your interface. Callers only ever receive
pointers to library structures, which are handed back into the interface
when they’re used. The internal details are hidden away.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* opaque float stack API */</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span>          <span class="nf">stack_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">int</span>           <span class="nf">stack_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">,</span> <span class="kt">float</span> <span class="n">v</span><span class="p">);</span>
<span class="kt">float</span>         <span class="nf">stack_pop</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Callers can use the API above without ever knowing the layout or even
the size of <code class="language-plaintext highlighter-rouge">struct stack</code>. Only a pointer to the struct is ever needed.
However, in order for this to work, the library must allocate the struct
itself. If this is a concern, then the library will typically allow the
caller to supply an allocator via function pointers. To see a really
slick version of this in practice, check out <a href="https://www.lua.org/manual/5.3/manual.html#lua_Alloc">Lua’s <code class="language-plaintext highlighter-rouge">lua_Alloc</code></a>, a
single function allocator API.</p>
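<p>Such an interface fits in a few lines. The following is my own
illustration in the spirit of <code class="language-plaintext highlighter-rouge">lua_Alloc</code>, not its exact
definition: a single function covers allocate, resize, and free depending
on its arguments:</p>

```c
#include <stddef.h>
#include <stdlib.h>

/* Single-function allocator: nsize == 0 frees ptr; otherwise ptr is
 * (re)allocated to nsize bytes (ptr == NULL means a fresh allocation).
 * ud is an opaque context pointer supplied by the caller. */
typedef void *(*allocator_fn)(void *ud, void *ptr, size_t osize, size_t nsize);

/* A malloc-backed default implementation. */
void *default_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
{
    (void)ud;
    (void)osize;  /* malloc/realloc track block sizes internally */
    if (nsize == 0) {
        free(ptr);
        return NULL;
    }
    return realloc(ptr, nsize);
}
```

<p>A library that accepts such a function pointer, plus the opaque
<code class="language-plaintext highlighter-rouge">ud</code> context, never needs to call
<code class="language-plaintext highlighter-rouge">malloc()</code> itself.</p>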

<p>Suppose we wanted to support something simpler: The library will
advertise the size of the struct so the caller can allocate it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* API additions */</span>
<span class="kt">size_t</span> <span class="nf">stack_sizeof</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span>   <span class="nf">stack_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// like stack_create()</span>
<span class="kt">void</span>   <span class="nf">stack_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// like stack_destroy()</span>
</code></pre></div></div>

<p>The implementation of <code class="language-plaintext highlighter-rouge">stack_sizeof()</code> would literally just be <code class="language-plaintext highlighter-rouge">return
sizeof(struct stack)</code>. The caller might use it like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
    <span class="n">stack_free</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, that’s still a heap allocation. If this wasn’t an opaque
struct, the caller could very naturally use automatic (i.e. stack)
allocation, which is likely even preferred in this case. Is this still
possible? Idea: Allocate it via a generic <code class="language-plaintext highlighter-rouge">char</code> array (VLA in this
case).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>However, this is technically undefined behavior. While a <code class="language-plaintext highlighter-rouge">char</code> pointer
is special and permitted to alias with anything, the inverse isn’t true.
Pointers to other types don’t get a free pass to alias with a <code class="language-plaintext highlighter-rouge">char</code>
array. Accessing a <code class="language-plaintext highlighter-rouge">char</code> value as if it were a different type just
isn’t allowed. Why? Because the standard says so. If you want one of the
practical reasons: the alignment might be incorrect.</p>

<p>Hmmm, is there another option? Maybe with <code class="language-plaintext highlighter-rouge">alloca()</code>!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">len</code> is expected to be small, it’s not any less safe than the
non-opaque alternative. It doesn’t undermine the type system, either,
since <code class="language-plaintext highlighter-rouge">alloca()</code> has the same semantics as <code class="language-plaintext highlighter-rouge">malloc()</code>. The downsides
are:</p>

<ul>
  <li>It’s not portable: <code class="language-plaintext highlighter-rouge">alloca()</code> is only a common extension, never
standardized, and for good reason.</li>
  <li>This is still a dynamic stack allocation, so, like I showed in the
last article, the function making this allocation becomes more
complex. It must manage its own stack frame dynamically.</li>
</ul>

<h3 id="optimizing-out-alloca">Optimizing out <code class="language-plaintext highlighter-rouge">alloca()</code>?</h3>

<p>The second issue can possibly be resolved if the size is available as a
compile time constant. This starts to break the abstraction provided by
opaque structs, but they’re still <em>mostly</em> opaque. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* API additions */</span>
<span class="cp">#define STACK_SIZE 24
</span>
<span class="cm">/* In practice, this would likely be horrific #ifdef spaghetti! */</span>
</code></pre></div></div>

<p>The caller might use it like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">STACK_SIZE</span><span class="p">);</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>Now the compiler can see the allocation size, and potentially optimize
away the <code class="language-plaintext highlighter-rouge">alloca()</code>. As of this writing, Clang (all versions) can
optimize these fixed-size <code class="language-plaintext highlighter-rouge">alloca()</code> usages, but GCC (9.2) still does
not. Here’s a simple example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;alloca.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="cp">#ifdef ALLOCA
</span>    <span class="k">volatile</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="mi">64</span><span class="p">);</span>
<span class="cp">#else
</span>    <span class="k">volatile</span> <span class="kt">char</span> <span class="n">s</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="cp">#endif
</span>    <span class="n">s</span><span class="p">[</span><span class="mi">63</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With the <code class="language-plaintext highlighter-rouge">char</code> array version, both GCC and Clang produce optimal code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;foo&gt;:
   0:	c6 44 24 f8 00       	mov    BYTE PTR [rsp-0x1],0x0
   5:	c3                   	ret
</code></pre></div></div>

<p>Side note: This is on x86-64 Linux, which uses the System V ABI. The
entire array falls within the <a href="https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/">red zone</a>, so it doesn’t need to be
explicitly allocated.</p>

<p>With <code class="language-plaintext highlighter-rouge">-DALLOCA</code>, Clang does the same, but GCC does the allocation
inefficiently as if it were dynamic:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;foo&gt;:
   0:	55                   	push   rbp
   1:	48 89 e5             	mov    rbp,rsp
   4:	48 83 ec 50          	sub    rsp,0x50
   8:	48 8d 44 24 0f       	lea    rax,[rsp+0xf]
   d:	48 83 e0 f0          	and    rax,0xfffffffffffffff0
  11:	c6 40 3f 00          	mov    BYTE PTR [rax+0x3f],0x0
  15:	c9                   	leave
  16:	c3                   	ret
</code></pre></div></div>

<p>It would make a slightly better case for <code class="language-plaintext highlighter-rouge">alloca()</code> here if GCC were
better about optimizing it. Regardless, this is another neat little
trick that I probably wouldn’t use in practice.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Legitimate Use of Variable Length Arrays</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/10/27/"/>
    <id>urn:uuid:acf6af69-f18c-49a6-b3ae-a23ae537da6d</id>
    <updated>2019-10-27T19:58:00Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=21375580">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/dz1fau/variable_length_arrays_in_c_are_nearly_always_the/">on reddit</a></em>.</p>

<p>The C99 standard (ISO/IEC 9899:1999) introduced a new, powerful
feature called Variable Length Arrays (VLAs). The size of an array with
automatic storage duration (i.e. stack allocated) can be determined at
run time. Each instance of the array may even have a different length.
Unlike <code class="language-plaintext highlighter-rouge">alloca()</code>, they’re a sanctioned form of dynamic stack
allocation.</p>

<!--more-->

<p>At first glance, VLAs seem convenient, useful, and efficient. Heap
allocations have a small cost because the allocator needs to do some
work to find or request some free memory, and typically the operation
must be synchronized since there may be other threads also making
allocations. Stack allocations are trivial and fast by comparison:
Allocation is a matter of bumping the stack pointer, and no
synchronization is needed.</p>

<p>For example, here’s a function that non-destructively finds the median
of a buffer of floats:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* note: nmemb must be non-zero */</span>
<span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It uses a VLA, <code class="language-plaintext highlighter-rouge">copy</code>, as a temporary copy of the input for sorting. The
function doesn’t know at compile time how big the input will be, so it
cannot just use a fixed size. With a VLA, it efficiently allocates
exactly as much memory as needed on the stack.</p>

<p>Well, sort of. If <code class="language-plaintext highlighter-rouge">nmemb</code> is too large, then the VLA will <em>silently</em>
overflow the stack. By silent I mean that the program has no way to
detect it and avoid it. In practice, it can be a lot louder, from a
segmentation fault in the best case, to an exploitable vulnerability in
the worst case: <a href="/blog/2017/06/21/"><strong>stack clashing</strong></a>. If an attacker can control
<code class="language-plaintext highlighter-rouge">nmemb</code>, they might choose a value that causes <code class="language-plaintext highlighter-rouge">copy</code> to overlap with
other allocations, giving them control over those values as well.</p>

<p>If there’s any risk that <code class="language-plaintext highlighter-rouge">nmemb</code> is too large, it must be guarded.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define COPY_MAX 4096
</span>
<span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>  <span class="cm">/* or whatever */</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, if <code class="language-plaintext highlighter-rouge">median</code> is expected to safely accommodate <code class="language-plaintext highlighter-rouge">COPY_MAX</code>
elements, it may as well <em>always</em> allocate an array of this size. If it
can’t, then that’s not a safe maximum.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">COPY_MAX</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And rather than abort, you might still want to support arbitrary input
sizes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">buf</span><span class="p">[</span><span class="n">COPY_MAX</span><span class="p">];</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">copy</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span>
        <span class="n">copy</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">result</span> <span class="o">=</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">copy</span> <span class="o">!=</span> <span class="n">buf</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then small inputs are fast, but large inputs still work correctly. This
is called <a href="/blog/2016/10/07/"><strong>small size optimization</strong></a>.</p>

<p>If the correct solution ultimately didn’t use a VLA, then what good are
they? In general, VLAs are not useful. They’re <a href="https://www.phoronix.com/scan.php?page=news_item&amp;px=Linux-Kills-The-VLA">time bombs</a>. <strong>VLAs
are nearly always the wrong choice.</strong> You must be careful to check that
they don’t exceed some safe maximum, and there’s no reason not to always
use the maximum. This problem was realized for the C11 standard (ISO/IEC
9899:2011) where VLAs were made optional. A program containing a VLA
will not necessarily compile on a C11 compiler.</p>

<p>Some purists also object to a special exception required for VLAs: The
<code class="language-plaintext highlighter-rouge">sizeof</code> operator may evaluate its operand, and so it does not always
evaluate to a compile-time constant. If the operand contains a VLA, then
the result depends on a run-time value.</p>

<p>Because they’re optional, it’s best to avoid even <em>trivial</em> VLAs like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">max</span> <span class="o">=</span> <span class="mi">4096</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">max</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">max</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s easy to prove that the array length is always 4096, but technically
this is still a VLA. That would still be true even if <code class="language-plaintext highlighter-rouge">max</code> were <code class="language-plaintext highlighter-rouge">const
int</code>, because the array length still isn’t an integer constant
expression.</p>

<h3 id="vla-overhead">VLA overhead</h3>

<p>Finally, there’s also the problem that VLAs just aren’t as efficient as
you might hope. A function that does dynamic stack allocation requires
additional stack management. It must track additional memory addresses
and will require extra instructions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">fixed</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">];</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">dynamic</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled with <code class="language-plaintext highlighter-rouge">gcc -Os</code> and viewed with <code class="language-plaintext highlighter-rouge">objdump -d -Mintel</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;fixed&gt;:
   0:	81 ff 00 40 00 00    	cmp    edi,0x4000
   6:	7f 19                	jg     21 &lt;fixed+0x21&gt;
   8:	ff cf                	dec    edi
   a:	48 81 ec 88 3f 00 00 	sub    rsp,0x3f88
  11:	48 63 ff             	movsxd rdi,edi
  14:	c6 44 3c 88 00       	mov    BYTE PTR [rsp+rdi*1-0x78],0x0
  19:	48 81 c4 88 3f 00 00 	add    rsp,0x3f88
  20:	c3                   	ret    
  21:	c3                   	ret    

0000000000000022 &lt;dynamic&gt;:
  22:	81 ff 00 40 00 00    	cmp    edi,0x4000
  28:	7f 23                	jg     4d &lt;dynamic+0x2b&gt;
  2a:	55                   	push   rbp
  2b:	48 63 c7             	movsxd rax,edi
  2e:	ff cf                	dec    edi
  30:	48 83 c0 0f          	add    rax,0xf
  34:	48 63 ff             	movsxd rdi,edi
  37:	48 83 e0 f0          	and    rax,0xfffffffffffffff0
  3b:	48 89 e5             	mov    rbp,rsp
  3e:	48 89 e2             	mov    rdx,rsp
  41:	48 29 c4             	sub    rsp,rax
  44:	c6 04 3c 00          	mov    BYTE PTR [rsp+rdi*1],0x0
  48:	48 89 d4             	mov    rsp,rdx
  4b:	c9                   	leave  
  4c:	c3                   	ret    
  4d:	c3                   	ret    
</code></pre></div></div>

<p>Note the use of a base pointer, <code class="language-plaintext highlighter-rouge">rbp</code> and <code class="language-plaintext highlighter-rouge">leave</code>, in the second
function in order to dynamically track the stack frame. (Hmm, in both
cases GCC could easily shave off the extra <code class="language-plaintext highlighter-rouge">ret</code> at the end of each
function. Missed optimization?)</p>

<p>The story is even worse when stack clash protection is enabled
(<code class="language-plaintext highlighter-rouge">-fstack-clash-protection</code>). The compiler generates extra code to probe
every page of allocation in case one of those pages is a guard page.
That’s also more complex when the allocation is dynamic. The VLA version
more than doubles in size (from 44 bytes to 101 bytes)!</p>

<h3 id="safe-and-useful-variable-length-arrays">Safe and useful variable length arrays</h3>

<p>There is one convenient, useful, and safe form of VLAs: a pointer to a
VLA. It’s convenient and useful because it makes some expressions
simpler. It’s safe because there’s no arbitrary stack allocation.</p>

<p>Pointers to arrays are a rare sight in C code, whether variable length
or not. That’s because, the vast majority of the time, C programmers
implicitly rely on <em>array decay</em>: arrays quietly “decay” into pointers
to their first element the moment you do almost anything with them. Also
because they’re really awkward to use.</p>

<p>For example, the function <code class="language-plaintext highlighter-rouge">sum3</code> takes a pointer to an array of exactly
three elements.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">sum3</span><span class="p">(</span><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">3</span><span class="p">])</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The parentheses are necessary because, without them, <code class="language-plaintext highlighter-rouge">array</code> would be an
array of pointers — a type far more common than a pointer to an array.
To index into the array, first the pointer to the array must be
dereferenced to the array value itself, then this intermediate array is
indexed, triggering array decay. Conceptually there’s quite a bit to it,
but, in practice, it’s all as efficient as the conventional approach to
<code class="language-plaintext highlighter-rouge">sum3</code> that accepts a plain <code class="language-plaintext highlighter-rouge">int *</code>.</p>

<p>The caller must take the address of an array of exactly the right
length:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">buf</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">};</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">sum3</span><span class="p">(</span><span class="o">&amp;</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>Or if dynamically allocating the array:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">));</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">sum3</span><span class="p">(</span><span class="n">array</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">array</span><span class="p">);</span>
</code></pre></div></div>

<p>The mandatory parentheses and strict type requirements make this awkward
and rarely useful. However, with VLAs perhaps it’s worth the trouble!
Consider an NxN matrix expressed using a pointer to a VLA:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="cm">/* run-time value */</span><span class="p">;</span>
<span class="cm">/* TODO: Check for integer overflow. See note. */</span>
<span class="kt">float</span> <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">n</span><span class="p">][</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">y</span><span class="p">][</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When indexing, the parentheses are weird, but the indices have the
convenient <code class="language-plaintext highlighter-rouge">[y][x]</code> format. The non-VLA alternative is to compute a 1D
index manually from 2D indices (<code class="language-plaintext highlighter-rouge">y*n+x</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="cm">/* run-time value */</span><span class="p">;</span>
<span class="cm">/* TODO: Check for integer overflow. */</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">identity</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)</span> <span class="o">*</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">identity</span><span class="p">[</span><span class="n">y</span><span class="o">*</span><span class="n">n</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note: What’s the behavior in the VLA version when <code class="language-plaintext highlighter-rouge">n</code> is so large that
<code class="language-plaintext highlighter-rouge">sizeof(*identity)</code> doesn’t fit in a <code class="language-plaintext highlighter-rouge">size_t</code>? I couldn’t find anything
in the standard about it, though I bet it’s undefined behavior. Neither
GCC nor Clang checks for overflow and, when it occurs, the overflow is
silent. Neither the undefined behavior sanitizer nor address sanitizer
complain when this happens.</p>

<p><strong>Update</strong>: <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCAP-ht1CQKVByZt1EXOb3J7TF%3DMcCKi%3DEtzjEH+CaEsPtvY5%3Djg%40mail.gmail.com%3E">bru del pointed out</a> that these multi-dimensional
VLAs can be simplified such that the parentheses may be omitted when
indexing. The trick is to omit the first dimension from the VLA
expression:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)</span> <span class="o">*</span> <span class="n">n</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">identity</span><span class="p">[</span><span class="n">y</span><span class="p">][</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So VLAs <em>might</em> be worth the trouble when using pointers to
multi-dimensional, dynamically-allocated arrays. However, I’m still
judicious about their use due to reduced portability. As a practical
example, MSVC famously does not, and likely never will, support
VLAs.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>The CPython Bytecode Compiler is Dumb</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/02/24/"/>
    <id>urn:uuid:4348d611-858b-4f48-a6f5-6e4b93f71a34</id>
    <updated>2019-02-24T21:56:35Z</updated>
    <category term="python"/><category term="lua"/><category term="lang"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was <a href="https://news.ycombinator.com/item?id=19241545">discussed on Hacker News</a>.</em></p>

<p>Due to sheer coincidence of several unrelated tasks converging on
Python at work, I recently needed to brush up on my Python skills. So
far for me, Python has been little more than <a href="/blog/2017/05/15/">a fancy extension
language for BeautifulSoup</a>, though I also used it to participate
in the recent tradition of <a href="https://github.com/skeeto/qualbum">writing one’s own static site
generator</a>, in this case for <a href="http://photo.nullprogram.com/">my wife’s photo blog</a>.
I’ve been reading through <em>Fluent Python</em> by Luciano Ramalho, and it’s
been quite effective at getting me up to speed.</p>

<!--more-->

<p>As I write Python, <a href="/blog/2014/01/04/">like with Emacs Lisp</a>, I can’t help but
consider what exactly is happening inside the interpreter. I wonder if
the code I’m writing is putting undue constraints on the bytecode
compiler and limiting its options. Ultimately I’d like the code I
write <a href="/blog/2017/01/30/">to drive the interpreter efficiently and effectively</a>.
<a href="https://www.python.org/dev/peps/pep-0020/">The Zen of Python</a> says there should be “only one obvious way to do
it,” but in practice there’s a lot of room for expression. Given
multiple ways to express the same algorithm or idea, I tend to prefer
the one that compiles to the more efficient bytecode.</p>

<p>Fortunately CPython, the main and most widely used implementation of
Python, is very transparent about its bytecode, making it easy to
inspect and reason about. The disassembly listing is simple to read
and understand, and I can follow it without consulting the
documentation. This contrasts sharply with modern JavaScript engines
and their opaque use of JIT compilation, where performance is guided
by obeying certain patterns (<a href="https://www.youtube.com/watch?v=UJPdhx5zTaw">hidden classes</a>, etc.), helping the
compiler <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">understand my program’s types</a>, and being careful
not to unnecessarily constrain the compiler.</p>

<p>So, besides just catching up with Python the language, I’ve been
studying the bytecode disassembly of the functions that I write. One
fact has become quite apparent: <strong>the CPython bytecode compiler is
pretty dumb</strong>. With a few exceptions, it’s a very literal translation
of a Python program, and there is almost <a href="https://legacy.python.org/workshops/1998-11/proceedings/papers/montanaro/montanaro.html">no optimization</a>.
Below I’ll demonstrate a case where it’s possible to detect one of the
missed optimizations without inspecting the bytecode disassembly
thanks to a small abstraction leak in the optimizer.</p>

<p>To be clear: This isn’t to say CPython is bad, or even that it should
necessarily change. In fact, as I’ll show, <strong>dumb bytecode compilers
are par for the course</strong>. In the past I’ve lamented how the Emacs Lisp
compiler could do a better job, but CPython and Lua are operating at
the same level. There are benefits to a dumb and straightforward
bytecode compiler: the compiler itself is simpler, easier to maintain,
and more amenable to modification (e.g. as Python continues to
evolve). It’s also easier to debug Python (<code class="language-plaintext highlighter-rouge">pdb</code>) because it’s such a
close match to the source listing.</p>

<p><em>Update</em>: <a href="https://codewords.recurse.com/issues/seven/dragon-taming-with-tailbiter-a-bytecode-compiler">Darius Bacon points out</a> that Guido van Rossum
himself said, “<a href="https://books.google.com/books?id=bIxWAgAAQBAJ&amp;pg=PA26&amp;lpg=PA26&amp;dq=%22Python+is+about+having+the+simplest,+dumbest+compiler+imaginable.%22&amp;source=bl&amp;ots=2OfDoWX321&amp;sig=ACfU3U32jKZBE3VkJ0gvkKbxRRgD0bnoRg&amp;hl=en&amp;sa=X&amp;ved=2ahUKEwjZ1quO89bgAhWpm-AKHfckAxUQ6AEwAHoECAkQAQ#v=onepage&amp;q=%22Python%20is%20about%20having%20the%20simplest%2C%20dumbest%20compiler%20imaginable.%22&amp;f=false">Python is about having the simplest, dumbest compiler
imaginable.</a>” So this is all very much by design.</p>

<p>The consensus seems to be that if you want or need better performance,
use something other than Python. (And if you can’t do that, at least use
<a href="https://pypy.org/">PyPy</a>.) That’s a fairly reasonable and healthy goal. Still, if
I’m writing Python, I’d like to do the best I can, which means
exploiting the optimizations that <em>are</em> available when possible.</p>

<h3 id="disassembly-examples">Disassembly examples</h3>

<p>I’m going to compare three bytecode compilers in this article: CPython
3.7, Lua 5.3, and Emacs 26.1. Each of these languages is dynamically
typed, executes primarily on a bytecode virtual machine, and makes its
disassembly listing easy to access. One caveat: CPython and Emacs
use a stack-based virtual machine while Lua uses a register-based
virtual machine.</p>

<p>For CPython I’ll be using the <code class="language-plaintext highlighter-rouge">dis</code> module. For Emacs Lisp I’ll use <code class="language-plaintext highlighter-rouge">M-x
disassemble</code>, and all code will use lexical scoping. In Lua I’ll use
<code class="language-plaintext highlighter-rouge">lua -l</code> on the command line.</p>

<h3 id="local-variable-elimination">Local variable elimination</h3>

<p>Will the bytecode compiler eliminate local variables? Keeping the
variable around potentially involves allocating memory for it, assigning
to it, and accessing it. Take this example:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">y</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>

<p>This function is equivalent to:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="mi">0</span>
</code></pre></div></div>

<p>Despite this, CPython completely misses this optimization for both <code class="language-plaintext highlighter-rouge">x</code>
and <code class="language-plaintext highlighter-rouge">y</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               0 (x)
  3           4 LOAD_CONST               2 (1)
              6 STORE_FAST               1 (y)
  4           8 LOAD_FAST                0 (x)
             10 RETURN_VALUE
</code></pre></div></div>

<p>It assigns both variables, and even loads again from <code class="language-plaintext highlighter-rouge">x</code> for the return.
Missed optimizations, but, as I said, by keeping these variables around,
debugging is more straightforward. Users can always inspect variables.</p>

<p>How about Lua?</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="kd">local</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="kd">local</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">x</span>
<span class="k">end</span>
</code></pre></div></div>

<p>It also misses this optimization, though it matters a little less due to
its architecture (the return instruction references a register
regardless of whether or not that register is allocated to a local
variable):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     LOADK           0 -1    ; 0
        2       [3]     LOADK           1 -2    ; 1
        3       [4]     RETURN          0 2
        4       [5]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp also misses it:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">x</span> <span class="mi">0</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">y</span> <span class="mi">1</span><span class="p">))</span>
    <span class="nv">x</span><span class="p">))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  0
1	constant  1
2	stack-ref 1
3	return
</code></pre></div></div>

<p>All three are on the same page.</p>

<h3 id="constant-folding">Constant folding</h3>

<p>Does the bytecode compiler evaluate simple constant expressions at
compile time? This is simple and everyone does it.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">/</span> <span class="mi">4</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (2.5)
              2 RETURN_VALUE
</code></pre></div></div>

<p>Lua:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">/</span> <span class="mi">4</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     LOADK           0 -1    ; 2.5
        2       [2]     RETURN          0 2
        3       [3]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">+</span> <span class="mi">1</span> <span class="p">(</span><span class="nb">/</span> <span class="p">(</span><span class="nb">*</span> <span class="mi">2</span> <span class="mi">3</span><span class="p">)</span> <span class="mf">4.0</span><span class="p">))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  2.5
1	return
</code></pre></div></div>

<p>That’s something we can count on so long as the operands are all
numeric literals (or also, for Python, string literals) that are
visible to the compiler. Don’t count on your operator overloads to
work here, though.</p>
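<p>A sketch of how to observe the folding with <code>dis</code>: a folded constant
shows up directly in the instruction stream, while an expression
involving a variable does not (again, exact opcodes vary by CPython
version):</p>

```python
import dis

def folded():
    return 1 + 2 * 3 / 4        # all literals: folded to 2.5

def unfolded(n):
    return n + 2 * 3 / 4        # only the literal subexpression folds

folded_vals = [ins.argval for ins in dis.get_instructions(folded)]
unfolded_vals = [ins.argval for ins in dis.get_instructions(unfolded)]
print(2.5 in folded_vals)       # the whole expression became a constant
print(1.5 in unfolded_vals)     # 2 * 3 / 4 folded to 1.5 on its own
print(2.5 in unfolded_vals)     # the addition itself is not folded
```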

<h3 id="allocation-optimization">Allocation optimization</h3>

<p>Optimizers often perform <em>escape analysis</em>, to determine if objects
allocated in a function ever become visible outside of that function. If
they don’t then these objects could potentially be stack-allocated
(instead of heap-allocated) or even be eliminated entirely.</p>

<p>None of the bytecode compilers are this sophisticated. However CPython
does have a trick up its sleeve: tuple optimization. Since tuples are
immutable, in certain circumstances CPython will reuse them and avoid
both the constructor and the allocation.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>

<p>Check it out, the tuple is used as a constant:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 ((1, 2, 3))
              2 RETURN_VALUE
</code></pre></div></div>

<p>We can detect this by evaluating <code class="language-plaintext highlighter-rouge">foo() is foo()</code>, which is <code class="language-plaintext highlighter-rouge">True</code>.
Deviate from this pattern too much, though, and the optimization is
disabled. Remember how CPython can’t optimize away variables, and how
they break constant folding? They break this, too:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (1)
              2 STORE_FAST               0 (x)
  3           4 LOAD_FAST                0 (x)
              6 LOAD_CONST               2 (2)
              8 LOAD_CONST               3 (3)
             10 BUILD_TUPLE              3
             12 RETURN_VALUE
</code></pre></div></div>

<p>This function might document that it always returns a simple tuple,
but we can tell whether it’s being optimized using <code class="language-plaintext highlighter-rouge">is</code> as before:
<code class="language-plaintext highlighter-rouge">foo() is foo()</code> is now <code class="language-plaintext highlighter-rouge">False</code>! In some future version of Python with
a cleverer bytecode compiler, that expression might evaluate to
<code class="language-plaintext highlighter-rouge">True</code>. (Unless the <a href="https://docs.python.org/3/reference/">Python language specification</a> is specific
about this case, which I didn’t check.)</p>
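<p>This abstraction leak is easy to poke at directly. A small sketch, with
behavior as observed under current CPython (other implementations or
future versions may differ):</p>

```python
def const_tuple():
    return (1, 2, 3)        # stored as a single shared constant

def built_tuple():
    x = 1
    return (x, 2, 3)        # BUILD_TUPLE constructs a fresh tuple

print(const_tuple() is const_tuple())   # True: same cached object
print(built_tuple() is built_tuple())   # False: new object per call
```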

<p>Note: Curiously PyPy replicates this exact behavior when examined with
<code class="language-plaintext highlighter-rouge">is</code>. Was that deliberate? I’m impressed that PyPy matches CPython’s
semantics so closely here.</p>

<p>Putting a mutable value, such as a list, in the tuple will also break
this optimization. But that’s not the compiler being dumb. That’s a
hard constraint on the compiler: the caller might change the mutable
component of the tuple, so it must always return a fresh copy.</p>

<p>Neither Lua nor Emacs Lisp has a language-level equivalent of an
immutable tuple, so there’s nothing to compare.</p>

<p>Other than the tuple optimization in CPython, none of the bytecode
compilers eliminate unnecessary intermediate objects.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">[</span><span class="mi">1024</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (1024)
              2 BUILD_LIST               1
              4 LOAD_CONST               2 (0)
              6 BINARY_SUBSCR
              8 RETURN_VALUE
</code></pre></div></div>

<p>Lua:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">({</span><span class="mi">1024</span><span class="p">})[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     NEWTABLE        0 1 0
        2       [2]     LOADK           1 -1    ; 1024
        3       [2]     SETLIST         0 1 1   ; 1
        4       [2]     GETTABLE        0 0 -2  ; 1
        5       [2]     RETURN          0 2
        6       [3]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">car</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1024</span><span class="p">)))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  1024
1	list1
2	car
3	return
</code></pre></div></div>

<h3 id="dont-expect-too-much">Don’t expect too much</h3>

<p>I could go on with lots of examples, looking at loop optimizations and
so on, and each case is almost certainly unoptimized. The general rule
of thumb is to simply not expect much from these bytecode compilers.
They’re very literal in their translation.</p>

<p>Working so much in C has put me in the habit of expecting all obvious
optimizations from the compiler. This frees me to be more expressive
in my code. Lots of things are cost-free thanks to these
optimizations, such as breaking a complex expression up into several
variables, naming my constants, or not using a local variable to
manually cache memory accesses. I’m confident the compiler will
optimize away my expressiveness. The catch is that <a href="/blog/2018/05/01/">clever compilers
can take things too far</a>, so I’ve got to be mindful of how it might
undermine my intentions — i.e. when I’m doing something unusual or not
strictly permitted.</p>

<p>These bytecode compilers will never truly surprise me. The cost is
that being more expressive in Python, Lua, or Emacs Lisp may reduce
performance at run time, because the extra work shows up directly in
the bytecode. Usually this doesn’t matter, but sometimes it does.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Prospecting for Hash Functions</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/07/31/"/>
    <id>urn:uuid:e865266a-2896-30c5-3f7d-cfad767b1ae2</id>
    <updated>2018-07-31T22:32:45Z</updated>
    <category term="c"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update 2022</em>: <a href="https://github.com/skeeto/hash-prospector/issues/19">TheIronBorn has found even better permutations</a> using
a smarter technique. That thread completely eclipses my efforts in this
article.</p>

<p>I recently got an itch to design my own non-cryptographic integer hash
function. Firstly, I wanted to <a href="/blog/2017/09/15/">better understand</a> how hash
functions work, and the best way to learn is to do. For years I’d been
treating them like magic, shoving input into it and seeing
<a href="/blog/2018/02/07/">random-looking</a>, but deterministic, output come out the other
end. Just how is the avalanche effect achieved?</p>

<p>Secondly, could I apply my own particular strengths to craft a hash
function better than the handful of functions I could find online?
Especially the classic ones from <a href="https://gist.github.com/badboy/6267743">Thomas Wang</a> and <a href="http://burtleburtle.net/bob/hash/integer.html">Bob
Jenkins</a>. Instead of struggling with the mathematics, maybe I
could software engineer my way to victory, working from the advantage
of access to the excessive computational power of today.</p>

<p>Suppose, for example, I wrote a tool to generate a <strong>random hash
function definition</strong>, then <strong>JIT compile it</strong> to a native function in
memory, then execute that function across various inputs to <strong>evaluate
its properties</strong>. My tool could rapidly repeat this process in a loop
until it stumbled upon an incredible hash function the world had never
seen. That’s what I actually did. I call it the <strong>Hash Prospector</strong>:</p>

<p><strong><a href="https://github.com/skeeto/hash-prospector">https://github.com/skeeto/hash-prospector</a></strong></p>

<p>It only works on x86-64 because it uses the same <a href="/blog/2015/03/19/">JIT compiling
technique I’ve discussed before</a>: allocate a page of memory, write
some machine instructions into it, set the page to executable, cast the
page pointer to a function pointer, then call the generated code through
the function pointer.</p>
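<p>For the curious, the same trick can be sketched from Python itself
using <code>ctypes</code> and an executable <code>mmap</code> buffer. This is only a toy
illustration for x86-64 Unix systems, not the prospector’s actual C
implementation:</p>

```python
import ctypes
import mmap

# x86-64 System V machine code for: mov eax, edi; ret
# i.e. a function that returns its first 32-bit argument unchanged
code = bytes([0x89, 0xf8, 0xc3])

# Allocate one readable, writable, executable page and copy the code in
page = mmap.mmap(-1, mmap.PAGESIZE,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
page.write(code)

# Cast the page address to a function pointer and call the machine code
ftype = ctypes.CFUNCTYPE(ctypes.c_uint32, ctypes.c_uint32)
func = ftype(ctypes.addressof(ctypes.c_char.from_buffer(page)))
print(func(42))
```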

<h3 id="generating-a-hash-function">Generating a hash function</h3>

<p>My focus is on integer hash functions: a function that accepts an
<em>n</em>-bit integer and returns an <em>n</em>-bit integer. One of the important
properties of an <em>integer</em> hash function is that it maps its inputs to
outputs 1:1. In other words, there are <strong>no collisions</strong>. If there’s a
collision, then some outputs aren’t possible, and the function isn’t
making efficient use of its entropy.</p>

<p>This is actually a lot easier than it sounds. As long as every <em>n</em>-bit
integer operation used in the hash function is <em>reversible</em>, then the
hash function has this property. An operation is reversible if, given
its output, you can unambiguously compute its input.</p>

<p>For example, XOR with a constant is trivially reversible: XOR the
output with the same constant to reverse it. Addition with a constant
is reversed by subtraction with the same constant. Since the integer
operations are modular arithmetic, modulo 2^n for <em>n</em>-bit integers,
multiplication by an <em>odd</em> number is reversible. Odd numbers are
coprime with the power-of-two modulus, so there is some <em>modular
multiplicative inverse</em> that reverses the operation.</p>

<p><a href="http://papa.bretmulvey.com/post/124027987928/hash-functions">Bret Mulvey’s hash function article</a> provides a convenient list
of some reversible operations available for constructing integer hash
functions. This list was the catalyst for my little project. Here are
the ones used by the hash prospector:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span>  <span class="o">=</span> <span class="o">~</span><span class="n">x</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">*=</span> <span class="n">constant</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// e.g. only odd constants</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">-=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">&lt;&lt;&lt;=</span> <span class="n">constant</span><span class="p">;</span> <span class="c1">// left rotation</span>
</code></pre></div></div>

<p>I’ve come across a couple more useful operations while studying existing
integer hash functions, but I didn’t put these in the prospector.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hash</span> <span class="o">+=</span> <span class="o">~</span><span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">);</span>
<span class="n">hash</span> <span class="o">-=</span> <span class="o">~</span><span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">);</span>
</code></pre></div></div>

<p>The prospector picks some operations at random and fills in their
constants randomly within their proper constraints. For example,
here’s an awful hash function I made it generate as an example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// do NOT use this!</span>
<span class="kt">uint32_t</span>
<span class="nf">badhash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x1eca7d79U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>
    <span class="n">x</span>  <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">);</span>
    <span class="n">x</span>  <span class="o">=</span> <span class="o">~</span><span class="n">x</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">5</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">+=</span> <span class="mh">0x10afe4e7U</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That function is reversible, and it would be <a href="https://naml.us/post/inverse-of-a-hash-function/">relatively
straightforward</a> to <a href="http://c42f.github.io/2015/09/21/inverting-32-bit-wang-hash.html">define its inverse</a>. However, it has
awful biases and poor avalanche. How do I know this?</p>

<h3 id="the-measure-of-a-hash-function">The measure of a hash function</h3>

<p>There are two key properties I’m looking for in randomly generated hash
functions.</p>

<ol>
  <li>
    <p>High avalanche effect. When I flip one input bit, the output bits
should each flip with a 50% chance.</p>
  </li>
  <li>
    <p>Low bias. Ideally there is no correlation between which output bits
flip for a particular flipped input bit.</p>
  </li>
</ol>

<p>Initially I screwed up and only measured the first property. This led
to some hash functions that <em>seemed</em> to be amazing before close
inspection, since, for a 32-bit hash function, it was flipping over 15
output bits on average. However, the particular bits being flipped
were heavily biased, resulting in obvious patterns in the output.</p>

<p>For example, when hashing a counter starting from zero, the high bits
would follow a regular pattern. 15 to 16 bits were being flipped each
time, but it was always the same bits.</p>

<p>Conveniently it’s easy to measure both properties at the same time. For
an <em>n</em>-bit integer hash function, create an <em>n</em> by <em>n</em> table initialized
to zero. The rows are input bits and the columns are output bits. The
element at the <em>i</em>th row and <em>j</em>th column tracks the correlation between
the <em>i</em>th input bit and the <em>j</em>th output bit.</p>

<p>Then exhaustively iterate over all 2^n inputs, and flip each bit one at
a time. Increment the appropriate element in the table if the output bit
flips.</p>

<p>When you’re done, ideally each element in the table is exactly 2^(n-1).
That is, each output bit was flipped exactly half the time by each input
bit. Therefore the <em>bias</em> of the hash function is the distance (the
error) of the computed table from the ideal table.</p>

<p>For example, the ideal bias table for an 8-bit hash function would be:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
</code></pre></div></div>

<p>The hash prospector computes the standard deviation in order to turn
this into a single, normalized measurement. Lower scores are better.</p>
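<p>The whole measurement fits in a few lines. Here’s a sketch for a toy
8-bit hash, which is small enough to test exhaustively. The hash itself
is made up for illustration and is not one of the prospector’s
candidates:</p>

```python
import statistics

N = 8                        # bits; 2^8 inputs is trivially exhaustive

def hash8(x):
    # A made-up reversible 8-bit hash, purely for demonstration
    x ^= x >> 4
    x = (x * 0x9b) & 0xff    # odd multiplier, hence reversible
    x ^= x >> 3
    return x

# table[i][j]: how often flipping input bit i flipped output bit j
table = [[0] * N for _ in range(N)]
for x in range(1 << N):
    h = hash8(x)
    for i in range(N):
        diff = h ^ hash8(x ^ (1 << i))
        for j in range(N):
            table[i][j] += (diff >> j) & 1

# The ideal count is 2^(n-1) = 128 everywhere; score the deviation
bias = statistics.pstdev(count for row in table for count in row)
print(bias)                  # lower is better; 0.0 is a perfect table
```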

<p>However, there’s still one problem: the input space for a 32-bit hash
function is over 4 billion values. The full test takes my computer about
an hour and a half. Evaluating a 64-bit hash function is right out.</p>

<p>Again, <a href="/blog/2017/09/21/">Monte Carlo to the rescue</a>! Rather than sample the entire
space, just sample a random subset. This provides a good estimate in
less than a second, allowing lots of terrible hash functions to be
discarded early. The full test can be saved only for the known good
32-bit candidates. 64-bit functions will only ever receive the estimate.</p>
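<p>The estimator is the same loop run over a random sample instead of the
full input space. A sketch, where the hash under test is a hypothetical
stand-in and the returned score is only an estimate:</p>

```python
import random

MASK = 0xffffffff

def toy_hash32(x):
    # A hypothetical stand-in for a candidate 32-bit hash function
    x ^= x >> 16
    x = (x * 0x45d9f3b) & MASK
    x ^= x >> 16
    return x

def estimate_bias32(h, samples=10000):
    n = 32
    table = [[0] * n for _ in range(n)]
    for _ in range(samples):
        x = random.getrandbits(32)
        y = h(x)
        for i in range(n):
            diff = y ^ h(x ^ (1 << i))
            for j in range(n):
                table[i][j] += (diff >> j) & 1
    # Ideal count per cell is samples/2; score the root-mean-square error
    ideal = samples / 2
    var = sum((c - ideal) ** 2 for row in table for c in row) / (n * n)
    return var ** 0.5

print(estimate_bias32(toy_hash32, samples=2000))
```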

<h3 id="what-did-i-find">What did I find?</h3>

<p>Once I got the bias issue sorted out, and after hours and hours of
running, followed up with some manual tweaking on my part, the
<strong>prospector stumbled across this little gem</strong>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// DO use this one!</span>
<span class="kt">uint32_t</span>
<span class="nf">prospector32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x2c1b3c6dU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x297a2d39U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>According to a full (i.e. not estimated) bias evaluation, this function
beats <em>the snot</em> out of most of the 32-bit hash functions I could find. It
even comes out ahead of this well known hash function that I <em>believe</em>
originates from the H2 SQL Database. (Update: Thomas Mueller has
confirmed that, indeed, this is his hash function.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s still an excellent hash function, just slightly more biased than
mine.</p>

<p>Very briefly, <code class="language-plaintext highlighter-rouge">prospector32()</code> was the best 32-bit hash function I could
find, and I thought I had a major breakthrough. Then I noticed the
finalizer function for <a href="https://en.wikipedia.org/wiki/MurmurHash#Algorithm">the 32-bit variant of MurmurHash3</a>. It’s
also a 32-bit hash function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">murmurhash32_mix32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x85ebca6bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">13</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xc2b2ae35U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This one is just <em>barely</em> less biased than mine. So I still haven’t
discovered the best 32-bit hash function, only the <em>second</em> best one.
:-)</p>

<h3 id="a-pattern-emerges">A pattern emerges</h3>

<p>If you’re paying close enough attention, you may have noticed that all
three functions above have the same structure. The prospector had
stumbled upon it all on its own without knowledge of the existing
functions. It may not be so obvious for the second function, but here it
is refactored:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I hadn’t noticed this until after the prospector had come across it on
its own. The pattern for all three is XOR-right-shift, multiply,
XOR-right-shift, multiply, XOR-right-shift. There’s something
particularly useful about this <a href="http://www.pcg-random.org/posts/developing-a-seed_seq-alternative.html#multiplyxorshift">multiply-xorshift construction</a>
(<a href="http://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/#designing-a-diffusion-function--by-example">also</a>). The XOR-right-shift diffuses bits rightward and the
multiply diffuses bits leftward. I like to think it’s “sloshing” the
bits right, left, right, left.</p>

<p>It seems that multiplication is particularly good at diffusion, so it
makes perfect sense to exploit it in non-cryptographic hash functions,
especially since modern CPUs are so fast at it. Despite this, it’s not
used much in cryptography due to <a href="http://cr.yp.to/snuffle/design.pdf">issues with completing it in constant
time</a>.</p>

<p>I like to think of this construction in terms of a five-tuple. For the
three functions it’s the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(15, 0x2c1b3c6d, 12, 0x297a2d39, 15)  // prospector32()
(16, 0x045d9f3b, 16, 0x045d9f3b, 16)  // hash32()
(16, 0x85ebca6b, 13, 0xc2b2ae35, 16)  // murmurhash32_mix32()
</code></pre></div></div>

<p>The prospector actually found lots of decent functions following this
pattern, especially where the middle shift is smaller than the outer
shift. Thinking of it in terms of this tuple, I specifically directed
it to try different tuple constants. That’s what I meant by
“tweaking.” Eventually my new function popped out with its really low
bias.</p>

<p>The prospector has a template option (<code class="language-plaintext highlighter-rouge">-p</code>) if you want to try it
yourself:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr,mul,xorr,mul,xorr
</code></pre></div></div>

<p>If you really have your heart set on certain constants, such as my
specific selection of shifts, you can lock those in while randomizing
the other constants:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr:15,mul,xorr:12,mul,xorr:15
</code></pre></div></div>

<p>Or the other way around:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr,mul:2c1b3c6d,xorr,mul:297a2d39,xorr
</code></pre></div></div>

<p>My function may seem a little strange in using shifts of 15 bits rather
than a nice, round 16 bits. However, changing those constants to 16
increases the bias. Similarly, neither of the two 32-bit constants is
a prime number, but <strong>nudging those constants to the nearest prime
increases the bias</strong>. These parameters really do seem to be a local
minimum in the bias, and using prime numbers isn’t important.</p>

<h3 id="what-about-64-bit-integer-hash-functions">What about 64-bit integer hash functions?</h3>

<p>So far I haven’t been able to improve on 64-bit hash functions. The main
function to beat is SplittableRandom / <a href="http://xoshiro.di.unimi.it/splitmix64.c">SplitMix64</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">splittable64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xbf58476d1ce4e5b9U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x94d049bb133111ebU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">31</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here’s its inverse since it’s sometimes useful:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">splittable64_r</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">31</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">62</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x319642b2d24d8ec3U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">54</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x96de1b173f119089U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">60</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I also came across <a href="https://gist.github.com/degski/6e2069d6035ae04d5d6f64981c995ec2">this function</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">hash64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd6e8feb86659fd93U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd6e8feb86659fd93U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Again, these follow the same construction as before. There really is
something special about it, and many other people have noticed, too.</p>

<p>Both functions have about the same bias. (Remember, I can only estimate
the bias for 64-bit hash functions.) The prospector has found lots of
functions with about the same bias, but nothing provably better. Until
it does, I have no new 64-bit integer hash functions to offer.</p>

<h3 id="beyond-random-search">Beyond random search</h3>

<p>Right now the prospector does a completely random, unstructured search
hoping to stumble upon something good by chance. Perhaps it would be
worth using a genetic algorithm to breed those 5-tuples toward an
optimum? Others have had <a href="https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html">success in this area with simulated
annealing</a>.</p>

<p>There’s probably more to exploit from the multiply-xorshift construction
that keeps popping up. If anything, the prospector is searching too
broadly, looking at constructions that could never really compete no
matter what the constants. In addition to everything above, I’ve been
looking for good 32-bit hash functions that don’t use any 32-bit
constants, but I’m really not finding any with a competitively low bias.</p>

<h3 id="update-after-one-week">Update after one week</h3>

<p>About one week after publishing this article I found an even better hash
function. I believe <strong>this is the least biased 32-bit integer hash
function <em>of this form</em> ever devised</strong>. It’s even less biased than the
MurmurHash3 finalizer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// exact bias: 0.17353355999581582</span>
<span class="kt">uint32_t</span>
<span class="nf">lowbias32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x7feb352dU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x846ca68bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// inverse</span>
<span class="kt">uint32_t</span>
<span class="nf">lowbias32_r</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x43021123U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x1d69e2a5U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you’re willing to use an additional round of multiply-xorshift, this
next function actually reaches the theoretical bias limit (bias =
~0.021) as exhibited by a perfect integer hash function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// exact bias: 0.020888578919738908</span>
<span class="kt">uint32_t</span>
<span class="nf">triple32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xed5ad4bbU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xac4c1b51U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x31848babU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">14</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><del>It’s statistically indistinguishable from a random permutation of all
32-bit integers.</del>(<em>Update 2025</em>: Peter Schmidt-Nielsen has provided a
second-order characteristic test that quickly identifies statistically
significant biases in <code class="language-plaintext highlighter-rouge">triple32</code>.)</p>

<h3 id="update-february-2020">Update, February 2020</h3>

<p>Some people have been experimenting with using my hash functions in GLSL
shaders, and the results are looking good:</p>

<ul>
  <li><a href="https://www.shadertoy.com/view/WttXWX">https://www.shadertoy.com/view/WttXWX</a></li>
  <li><a href="https://www.shadertoy.com/view/ttVGDV">https://www.shadertoy.com/view/ttVGDV</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>The Value of Undefined Behavior</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/07/20/"/>
    <id>urn:uuid:9758e9ea-46b6-3904-5166-52c7e6922892</id>
    <updated>2018-07-20T21:31:18Z</updated>
    <category term="c"/><category term="cpp"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In several places, the C and C++ language specifications use a
curious, and fairly controversial, phrase: <em>undefined behavior</em>. For
certain program constructs, the specification prescribes no specific
behavior, instead allowing <a href="http://www.catb.org/jargon/html/N/nasal-demons.html">anything to happen</a>. Such constructs
are considered erroneous, and so the result depends on the particulars
of the platform and implementation. The original purpose of undefined
behavior was for implementation flexibility. In other words, it’s
slack that allows a compiler to produce appropriate and efficient code
for its target platform.</p>

<p>Specifying a particular behavior would have put unnecessary burden on
implementations — especially in the earlier days of computing — making
for inefficient programs on some platforms. For example, if the result
of dereferencing a null pointer was defined to trap — to cause the
program to halt with an error — then platforms that do not have
hardware trapping, such as those without virtual memory, would be
required to instrument, in software, each pointer dereference.</p>

<p>In the 21st century, undefined behavior has taken on a somewhat
different meaning. Optimizers use it — or <em>ab</em>use it depending on your
point of view — to lift <a href="/blog/2016/12/22/">constraints</a> that would otherwise
inhibit more aggressive optimizations. It’s not so much a
fundamentally different application of undefined behavior, but it does
take the concept to an extreme.</p>

<p>The reasoning works like this: A program that evaluates a construct
whose behavior is undefined cannot, by definition, have any meaningful
behavior, and so that program would be useless. As a result,
<a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">compilers assume programs never invoke undefined behavior</a> and
use those assumptions to prove their optimizations.</p>

<p>Under this newer interpretation, mistakes involving undefined behavior
are more <a href="https://kristerw.blogspot.com/2017/09/why-undefined-behavior-may-call-never.html">punishing</a> and <a href="/blog/2018/05/01/">surprising</a> than before. Programs
that <em>seem</em> to make some sense when run on a particular architecture may
actually compile into a binary with a security vulnerability due to
conclusions reached from an analysis of its undefined behavior.</p>

<p>This can be frustrating if your programs are intended to run on a very
specific platform. In this situation, all behavior really <em>could</em> be
locked down and specified in a reasonable, predictable way. Such a
language would be like an extended, less portable version of C or C++.
But your toolchain still insists on running your program on the
<em>abstract machine</em> rather than the hardware you actually care about.
However, <strong>even in this situation undefined behavior can still be
desirable</strong>. I will provide a couple of examples in this article.</p>

<h3 id="signed-integer-overflow">Signed integer overflow</h3>

<p>To start things off, let’s look at one of my all time favorite examples
of useful undefined behavior, a situation involving signed integer
overflow. The result of a signed integer overflow isn’t just
unspecified, it’s undefined behavior. Full stop.</p>

<p>This goes beyond a simple matter of whether or not the underlying
machine uses a two’s complement representation. From the perspective of
the abstract machine, just the act of a signed integer overflowing is
enough to throw everything out the window, even if the overflowed result
is never actually used in the program.</p>

<p>On the other hand, unsigned integer overflow is defined — or, more
accurately, defined to wrap, <em>not</em> overflow. Both the undefined signed
overflow and defined unsigned overflow are useful in different
situations.</p>

<p>For example, here’s a fairly common situation, much like what <a href="https://www.youtube.com/watch?v=yG1OZ69H_-o&amp;t=38m18s">actually
happened in bzip2</a>. Consider these two functions that perform substring
comparison:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">cmp_signed</span><span class="p">(</span><span class="kt">int</span> <span class="n">i1</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i2</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i1</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">c2</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i2</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">c1</span> <span class="o">-</span> <span class="n">c2</span><span class="p">;</span>
        <span class="n">i1</span><span class="o">++</span><span class="p">;</span>
        <span class="n">i2</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">cmp_unsigned</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">i1</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">i2</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i1</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">c2</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i2</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">c1</span> <span class="o">-</span> <span class="n">c2</span><span class="p">;</span>
        <span class="n">i1</span><span class="o">++</span><span class="p">;</span>
        <span class="n">i2</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In these functions, the indices <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> will always be small,
non-negative values. Since they’re non-negative, they should be <code class="language-plaintext highlighter-rouge">unsigned</code>,
right? Not necessarily. That puts an extra constraint on code generation
and, at least on x86-64, makes for a less efficient function. Most of
the time you actually <em>don’t</em> want overflow to be defined, and instead
allow the compiler to assume it just doesn’t happen.</p>

<p>The constraint is that <strong>the behavior of <code class="language-plaintext highlighter-rouge">i1</code> or <code class="language-plaintext highlighter-rouge">i2</code> overflowing as an
unsigned integer is defined, and the compiler is obligated to implement
that behavior.</strong> On x86-64, where <code class="language-plaintext highlighter-rouge">int</code> is 32 bits, the result of the
operation must be truncated to 32 bits one way or another, requiring
extra instructions inside the loop.</p>

<p>In the signed case, incrementing the integers cannot overflow since that
would be undefined behavior. This permits the compiler to perform the
increment only in 64-bit precision without truncation if it would be
more efficient, which, in this case, it is.</p>

<p>Here’s the output of Clang 6.0.0 with <code class="language-plaintext highlighter-rouge">-Os</code> on x86-64. Pay close
attention to the main loop, which I named <code class="language-plaintext highlighter-rouge">.loop</code>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">cmp_signed:</span>
        <span class="nf">movsxd</span> <span class="nb">rdi</span><span class="p">,</span> <span class="nb">edi</span>             <span class="c1">; use i1 as a 64-bit integer</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">movsxd</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">esi</span>             <span class="c1">; use i2 as a 64-bit integer</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rsi</span><span class="p">]</span>
        <span class="nf">jmp</span>    <span class="nv">.check</span>

<span class="nl">.loop:</span>  <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rdi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rsi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="nf">inc</span>    <span class="nb">rdx</span>                  <span class="c1">; increment only the base pointer</span>
<span class="nl">.check:</span> <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">je</span>     <span class="nv">.loop</span>

        <span class="nf">movzx</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">al</span>
        <span class="nf">movzx</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">sub</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ecx</span>             <span class="c1">; return c1 - c2</span>
        <span class="nf">ret</span>

<span class="nl">cmp_unsigned:</span>
        <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">edi</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rax</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">ecx</span><span class="p">,</span> <span class="nb">esi</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rcx</span><span class="p">]</span>
        <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">jne</span>    <span class="nv">.ret</span>
        <span class="nf">inc</span>    <span class="nb">edi</span>
        <span class="nf">inc</span>    <span class="nb">esi</span>

<span class="nl">.loop:</span>  <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">edi</span>             <span class="c1">; truncated i1 overflow</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rax</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">ecx</span><span class="p">,</span> <span class="nb">esi</span>             <span class="c1">; truncated i2 overflow</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rcx</span><span class="p">]</span>
        <span class="nf">inc</span>    <span class="nb">edi</span>                  <span class="c1">; increment i1</span>
        <span class="nf">inc</span>    <span class="nb">esi</span>                  <span class="c1">; increment i2</span>
        <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">je</span>     <span class="nv">.loop</span>

<span class="nl">.ret:</span>   <span class="nf">movzx</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">al</span>
        <span class="nf">movzx</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">sub</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ecx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>As unsigned values, <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> can overflow independently, so they
have to be handled as independent 32-bit unsigned integers. As signed
values they can’t overflow, so they’re treated as if they were 64-bit
integers and, instead, the pointer, <code class="language-plaintext highlighter-rouge">buf</code>, is incremented without
concern for overflow. The signed loop is much more efficient (5
instructions versus 8).</p>

<p>The signed integer helps to communicate the <em>narrow contract</em> of the
function — the limited range of <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> — to the compiler. In a
variant of C where signed integer overflow is defined (i.e. <code class="language-plaintext highlighter-rouge">-fwrapv</code>),
this capability is lost. In fact, using <code class="language-plaintext highlighter-rouge">-fwrapv</code> deoptimizes the signed
version of this function.</p>

<p>Side note: Using <code class="language-plaintext highlighter-rouge">size_t</code> (an unsigned integer) is even better on x86-64
for this example since it’s already 64 bits and the function doesn’t
need the initial sign/zero extension. However, this might simply move
the sign extension out to the caller.</p>
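<p>For illustration, here's a hypothetical C rendering of the kind of
comparison loop compiled above — the name and signature are
illustrative, not the original source:</p>

```c
/* Illustrative sketch: compare the bytes starting at two offsets into
 * buf. Narrow contract: the runs must eventually differ and the indices
 * must stay in bounds. Because signed overflow is undefined, the
 * compiler may widen i1 and i2 to 64-bit pointer offsets. */
static int
cmp_at(const unsigned char *buf, int i1, int i2)
{
    while (buf[i1] == buf[i2]) {
        i1++;
        i2++;
    }
    return buf[i1] - buf[i2];
}
```

<p>With <code>unsigned</code> indices instead, each increment must wrap
modulo 2<sup>32</sup>, producing the longer loop shown above.</p>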

<h3 id="strict-aliasing">Strict aliasing</h3>

<p>Another controversial undefined behavior is <a href="https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8"><em>strict aliasing</em></a>.
This particular term doesn’t actually appear anywhere in the C
specification, but it’s the popular name for C’s aliasing rules. In
short, variables with types that aren’t compatible are not allowed to
alias through pointers.</p>

<p>Here’s the classic example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>    <span class="c1">// store</span>
    <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>    <span class="c1">// store</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">b</span><span class="p">;</span> <span class="c1">// load</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Naively one might assume the <code class="language-plaintext highlighter-rouge">return *b</code> could be optimized to a simple
<code class="language-plaintext highlighter-rouge">return 0</code>. However, since <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> have the same type, the compiler
must consider the possibility that they alias — that they point to the
same place in memory — and must generate code that works correctly under
these conditions.</p>

<p>If <code class="language-plaintext highlighter-rouge">foo</code> has a narrow contract that forbids <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> to alias, we
have a couple of options for helping our compiler.</p>

<p>First, we could manually resolve the aliasing issue by returning 0
explicitly. In more complicated functions this might mean making local
copies of values, working only with those local copies, then storing the
results back before returning. Then aliasing would no longer matter.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Second, C99 introduced a <code class="language-plaintext highlighter-rouge">restrict</code> qualifier to communicate to the
compiler that pointers passed to functions cannot alias. For example,
the pointers to <code class="language-plaintext highlighter-rouge">memcpy()</code> are qualified with <code class="language-plaintext highlighter-rouge">restrict</code> as of C99.
Passing aliasing pointers through <code class="language-plaintext highlighter-rouge">restrict</code> parameters is undefined
behavior, i.e. as far as the compiler is concerned, it never
happens.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">b</span><span class="p">);</span>
</code></pre></div></div>
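<p>Under that no-alias contract, <code>restrict</code> lets the compiler
fold the final load in the earlier example (a sketch of the expected
optimization; actual codegen varies by compiler):</p>

```c
int
foo(int *restrict a, int *restrict b)
{
    *b = 0;
    *a = 1;
    return *b;  /* a and b cannot alias, so this may compile to "return 0" */
}
```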

<p>The third option is to design an interface that uses incompatible
types, exploiting strict aliasing. This happens all the time, usually
by accident. For example, <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">long</code> are never compatible even
when they have the same representation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>If you use an extended or modified version of C without strict
aliasing (<code class="language-plaintext highlighter-rouge">-fno-strict-aliasing</code>), then the compiler must assume
everything aliases all the time, generating a lot more precautionary
loads than necessary.</p>

<p>What <a href="https://lkml.org/lkml/2003/2/26/158">irritates</a> a lot of people is that compilers will still
apply the strict aliasing rule even when it’s trivial for the compiler
to prove that aliasing is occurring:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* note: forbidden */</span>
<span class="kt">long</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">a</span><span class="p">;</span>
</code></pre></div></div>

<p>It’s not just a simple matter of making exceptions for these cases.
The language specification would need to define all the rules about
when and where incompatible types are permitted to alias, and
developers would have to understand all these rules if they wanted to
take advantage of the exceptions. It can’t just come down to trusting
that the compiler is smart enough to see the aliasing when it’s
sufficiently simple. It would need to be carefully defined.</p>

<p>Besides, there are probably <a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">conforming, portable solutions</a>
that, with contemporary compilers, will safely compile to the efficient
code you actually want anyway.</p>

<p>There <em>is</em> one special exception for strict aliasing: <code class="language-plaintext highlighter-rouge">char *</code> is
allowed to alias with anything. This is important to keep in mind both
when you intentionally want aliasing and when you want to avoid
it. Writing through a <code class="language-plaintext highlighter-rouge">char *</code> pointer could force the compiler to
generate additional, unnecessary loads.</p>
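<p>For instance, in this hypothetical function the compiler must assume
<code>dst</code> might alias <code>*len</code>, so it may reload
<code>*len</code> on every iteration:</p>

```c
/* Hypothetical example: because dst is a char *, the stores through it
 * may alias *len, forcing a reload of *len each time around the loop. */
void
zero_fill(char *dst, int *len)
{
    for (int i = 0; i < *len; i++) {
        dst[i] = 0;
    }
}
```

<p>Copying <code>*len</code> into a local variable up front removes the
reload.</p>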

<p>In fact, there’s a whole dimension to strict aliasing that, even today,
no compiler yet exploits: <code class="language-plaintext highlighter-rouge">uint8_t</code> is not necessarily <code class="language-plaintext highlighter-rouge">unsigned char</code>.
That’s just one possible <code class="language-plaintext highlighter-rouge">typedef</code> definition for it. It could instead
<code class="language-plaintext highlighter-rouge">typedef</code> to, say, some internal <code class="language-plaintext highlighter-rouge">__byte</code> type.</p>

<p>In other words, technically speaking, <code class="language-plaintext highlighter-rouge">uint8_t</code> does not have the strict
aliasing exemption. If you wanted to write bytes to a buffer without
the compiler worrying about aliasing with other pointers, this would
be the tool to accomplish it. Unfortunately, so much existing code
violates this part of strict aliasing that no toolchain is
<a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110">willing to exploit it</a> for optimization purposes.</p>

<h3 id="other-undefined-behaviors">Other undefined behaviors</h3>

<p>Some kinds of undefined behavior don’t have performance or portability
benefits. They’re only there to make the compiler’s job a little
simpler. Today, most of these are caught trivially at compile time as
syntax or semantic issues (e.g. a pointer cast to a float).</p>

<p>Some others are obvious about their performance benefits and don’t
require much explanation. For example, it’s undefined behavior to
index out of bounds (with some special exceptions for one past the
end), meaning compilers are not obligated to generate those checks,
instead relying on the programmer to arrange, by whatever means, that
it doesn’t happen.</p>
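<p>The one-past-the-end exception is what makes the usual iteration
idiom legal — computing the end pointer is fine so long as it is never
dereferenced (a minimal sketch):</p>

```c
/* Computing a pointer one past the end of an array is defined behavior;
 * dereferencing it is not. */
static void
clear4(int a[4])
{
    int *end = a + 4;                 /* OK: one past the end */
    for (int *p = a; p < end; p++) {
        *p = 0;                       /* never dereferences end itself */
    }
}
```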

<p>Undefined behavior is like nitro, a dangerous, volatile substance that
makes things go really, really fast. You could argue that it’s <em>too</em>
dangerous to use in practice, but the aggressive use of undefined
behavior is <a href="http://thoughtmesh.net/publish/367.php">not without merit</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>When FFI Function Calls Beat Native C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/27/"/>
    <id>urn:uuid:cb339e3b-382e-3762-4e5c-10cf049f7627</id>
    <updated>2018-05-27T20:03:15Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update: There’s a good discussion on <a href="https://news.ycombinator.com/item?id=17171252">Hacker News</a>.</em></p>

<p>Over on GitHub, David Yu has an interesting performance benchmark for
function calls of various Foreign Function Interfaces (<a href="https://en.wikipedia.org/wiki/Foreign_function_interface">FFI</a>):</p>

<p><a href="https://github.com/dyu/ffi-overhead">https://github.com/dyu/ffi-overhead</a></p>

<p>He created a shared object (<code class="language-plaintext highlighter-rouge">.so</code>) file containing a single, simple C
function. Then for each FFI he wrote a bit of code to call this function
many times, measuring how long it took.</p>

<p>For the C “FFI” he used standard dynamic linking, not <code class="language-plaintext highlighter-rouge">dlopen()</code>. This
distinction is important, since it really makes a difference in the
benchmark. There’s a potential argument about whether or not this is a
fair comparison to an actual FFI, but, regardless, it’s still
interesting to measure.</p>

<p>The most surprising result of the benchmark is that
<strong><a href="http://luajit.org/">LuaJIT’s</a> FFI is substantially faster than C</strong>. It’s about
25% faster than a native C function call to a shared object function.
How could a weakly and dynamically typed scripting language come out
ahead on a benchmark? Is this accurate?</p>

<p>It’s actually quite reasonable. The benchmark was run on Linux, so the
performance penalty we’re seeing comes from the <em>Procedure Linkage Table</em>
(PLT). I’ve put together a really simple experiment to demonstrate the
same effect in plain old C:</p>

<p><a href="https://github.com/skeeto/dynamic-function-benchmark">https://github.com/skeeto/dynamic-function-benchmark</a></p>

<p>Here are the results on an Intel i7-6700 (Skylake):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call
</code></pre></div></div>

<p>These are three different types of function calls:</p>

<ol>
  <li>Through the PLT</li>
  <li>An indirect function call (via <code class="language-plaintext highlighter-rouge">dlsym(3)</code>)</li>
  <li>A direct function call (via a JIT-compiled function)</li>
</ol>

<p>As shown, the last one is the fastest. It’s typically not an option
for C programs, but it’s natural in the presence of a JIT compiler,
including, apparently, LuaJIT.</p>

<p>In my benchmark, the function being called is named <code class="language-plaintext highlighter-rouge">empty()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">empty</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="p">}</span>
</code></pre></div></div>

<p>And to compile it into a shared object:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -fPIC -Os -o empty.so empty.c
</code></pre></div></div>

<p>Just as in my <a href="/blog/2017/09/21/">PRNG shootout</a>, the benchmark calls this function
repeatedly as many times as possible before an alarm goes off.</p>

<h3 id="procedure-linkage-tables">Procedure Linkage Tables</h3>

<p>When a program or library calls a function in another shared object,
the compiler cannot know where that function will be located in
memory. That information isn’t known until run time, after the program
and its dependencies are loaded into memory. These are usually at
randomized locations — e.g. <em>Address Space Layout Randomization</em>
(ASLR).</p>

<p>How is this resolved? Well, there are a couple of options.</p>

<p>One option is to make a note about each such call in the binary’s
metadata. The run-time dynamic linker can then <em>patch</em> in the correct
address at each call site. How exactly this would work depends on the
particular <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">code model</a> used when compiling the binary.</p>

<p>The downsides to this approach are slower loading, larger binaries, and
less <a href="/blog/2016/04/10/">sharing of code pages</a> between different processes. It’s
slower loading because every dynamic call site needs to be patched
before the program can begin execution. The binary is larger because
each of these call sites needs an entry in the relocation table. And the
lack of sharing is due to the code pages being modified.</p>

<p>On the other hand, the overhead for dynamic function calls would be
eliminated, giving JIT-like performance as seen in the benchmark.</p>

<p>The second option is to route all dynamic calls through a table. The
original call site calls into a stub in this table, which jumps to the
actual dynamic function. With this approach the code does not need to
be patched, meaning it’s <a href="/blog/2016/12/23/">trivially shared</a> between processes.
Only one place needs to be patched per dynamic function: the entries
in the table. Even more, these patches can be performed <em>lazily</em>, on
the first function call, making the load time even faster.</p>

<p>On systems using ELF binaries, this table is called the Procedure
Linkage Table (PLT). The PLT itself doesn’t actually get patched —
it’s mapped read-only along with the rest of the code. Instead the
<em>Global Offset Table</em> (GOT) gets patched. The PLT stub fetches the
dynamic function address from the GOT and <em>indirectly</em> jumps to that
address. To lazily load function addresses, these GOT entries are
initialized with an address of a function that locates the target
symbol, updates the GOT with that address, and then jumps to that
function. Subsequent calls use the lazily discovered address.</p>

<p><img src="/img/diagram/plt.svg" alt="" /></p>

<p>The downside of a PLT is extra overhead per dynamic function call,
which is what shows up in the benchmark. Since the benchmark <em>only</em>
measures function calls, this appears to be pretty significant, but in
practice it’s usually drowned out in noise.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Cleared by an alarm signal. */</span>
<span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">plt_benchmark</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">empty</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">empty()</code> is in the shared object, that call goes through the PLT.</p>

<h3 id="indirect-dynamic-calls">Indirect dynamic calls</h3>

<p>Another way to dynamically call functions is to bypass the PLT and
fetch the target function address within the program, e.g. via
<code class="language-plaintext highlighter-rouge">dlsym(3)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">h</span> <span class="o">=</span> <span class="n">dlopen</span><span class="p">(</span><span class="s">"path/to/lib.so"</span><span class="p">,</span> <span class="n">RTLD_NOW</span><span class="p">);</span>
<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">)</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"f"</span><span class="p">);</span>
<span class="n">f</span><span class="p">();</span>
</code></pre></div></div>

<p>Once the function address is obtained, the overhead is smaller than
function calls routed through the PLT. There’s no intermediate stub
function and no GOT access. (Caveat: If the program has a PLT entry for
the given function then <code class="language-plaintext highlighter-rouge">dlsym(3)</code> may actually return the address of
the PLT stub.)</p>

<p>However, this is still an <em>indirect</em> function call. On conventional
architectures, <em>direct</em> function calls have an immediate relative
address. That is, the target of the call is some hard-coded offset from
the call site. The CPU can see well ahead of time where the call is
going.</p>

<p>An indirect function call has more overhead. First, the address has to
be stored somewhere. Even if that somewhere is just a register, it
increases register pressure by using up a register. Second, it
provokes the CPU’s branch predictor since the call target isn’t
static, making for extra bookkeeping in the CPU. In the worst case the
function call may even cause a pipeline stall.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">indirect_benchmark</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">f</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function passed to this benchmark is fetched with <code class="language-plaintext highlighter-rouge">dlsym(3)</code> so the
compiler can’t <a href="/blog/2018/05/01/">do something tricky</a> like convert that indirect
call back into a direct call.</p>

<p>If the body of the loop was complicated enough that there was register
pressure, thereby requiring the address to be spilled onto the stack,
this benchmark might not fare as well against the PLT benchmark.</p>

<h3 id="direct-function-calls">Direct function calls</h3>

<p>The first two types of dynamic function calls are simple and easy to
use. <em>Direct</em> calls to dynamic functions are trickier business since they
require modifying code at run time. In my benchmark I put together a
<a href="/blog/2015/03/19/">little JIT compiler</a> to generate the direct call.</p>

<p>There’s a gotcha to this: on x86-64 direct jumps are limited to a 2GB
range due to a signed 32-bit immediate. This means the JIT code has to
be placed virtually near the target function, <code class="language-plaintext highlighter-rouge">empty()</code>. If the JIT
code needed to call two different dynamic functions separated by more
than 2GB, then it’s not possible for both to be direct.</p>

<p>To keep things simple, my benchmark isn’t precise or very careful
about picking the JIT code address. After being given the target
function address, it blindly subtracts 4MB, rounds down to the nearest
page, allocates some memory, and writes code into it. To do this
correctly would mean inspecting the program’s own memory mappings to
find space, and there’s no clean, portable way to do this. On Linux
this <a href="/blog/2016/09/03/">requires parsing virtual files under <code class="language-plaintext highlighter-rouge">/proc</code></a>.</p>

<p>Here’s what my JIT’s memory allocation looks like. It assumes
<a href="/blog/2016/05/30/">reasonable behavior for <code class="language-plaintext highlighter-rouge">uintptr_t</code> casts</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">jit_compile</span><span class="p">(</span><span class="k">struct</span> <span class="n">jit_func</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">empty</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">uintptr_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">desired</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="n">addr</span> <span class="o">-</span> <span class="n">SAFETY_MARGIN</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">PAGEMASK</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">desired</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It allocates two pages, one writable and the other containing
non-writable code. Similar to <a href="/blog/2017/01/08/">my closure library</a>, the lower
page is writable and holds the <code class="language-plaintext highlighter-rouge">running</code> variable that gets cleared by
the alarm. It needed to be near the JIT code so that accessing it is an
efficient RIP-relative load, just like in the other two benchmark
functions. The upper page contains this assembly:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">jit_benchmark:</span>
        <span class="nf">push</span>  <span class="nb">rbx</span>
        <span class="nf">xor</span>   <span class="nb">ebx</span><span class="p">,</span> <span class="nb">ebx</span>
<span class="nl">.loop:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">running</span><span class="p">]</span>
        <span class="nf">test</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">je</span>    <span class="nv">.done</span>
        <span class="nf">call</span>  <span class="nv">empty</span>
        <span class="nf">inc</span>   <span class="nb">ebx</span>
        <span class="nf">jmp</span>   <span class="nv">.loop</span>
<span class="nl">.done:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebx</span>
        <span class="nf">pop</span>   <span class="nb">rbx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">call empty</code> is the only instruction that is dynamically generated
— necessary to fill out the relative address appropriately (the minus
5 is because it’s relative to the <em>end</em> of the instruction):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// call empty</span>
    <span class="kt">uintptr_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">p</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="mh">0xe8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">empty()</code> wasn’t in a shared object and instead located in the same
binary, this is essentially the direct call that the compiler would have
generated for <code class="language-plaintext highlighter-rouge">plt_benchmark()</code>, assuming somehow it didn’t inline
<code class="language-plaintext highlighter-rouge">empty()</code>.</p>

<p>Ironically, calling the JIT-compiled code requires an indirect call
(e.g. via a function pointer), and there’s no way around this. What
are you going to do, JIT compile another function that makes the
direct call? Fortunately this doesn’t matter since the part being
measured in the loop is only a direct call.</p>

<h3 id="its-no-mystery">It’s no mystery</h3>

<p>Given these results, it’s really no mystery that LuaJIT can generate
more efficient dynamic function calls than a PLT, <em>even if they still
end up being indirect calls</em>. In my benchmark, the non-PLT indirect
calls were 28% faster than the PLT, and the direct calls 43% faster
than the PLT. That’s a small edge that JIT-enabled programs have over
plain old native programs, though it comes at the cost of absolutely
no code sharing between processes.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>When the Compiler Bites</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/01/"/>
    <id>urn:uuid:02b974e1-e25b-397d-a16f-c754338e9c1e</id>
    <updated>2018-05-01T23:28:06Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="ai"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p><em>Update: There are discussions <a href="https://old.reddit.com/r/cpp/comments/8gfhq3/when_the_compiler_bites/">on Reddit</a> and <a href="https://news.ycombinator.com/item?id=16974770">on Hacker
News</a>.</em></p>

<p>So far this year I’ve been bitten three times by compiler edge cases
in GCC and Clang, each time catching me totally by surprise. Two were
caused by historical artifacts, where an ambiguous specification led
to diverging implementations. The third was a compiler optimization
being far more clever than I expected, behaving almost like an
artificial intelligence.</p>

<p>In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.</p>

<h3 id="x86-64-abi-ambiguity">x86-64 ABI ambiguity</h3>

<p>The first time I was bit — or, well, narrowly avoided being bit — was
when I examined a missed floating point optimization in both Clang and
GCC. Consider this function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function multiplies its argument by zero and returns the result. Any
number multiplied by zero is zero, so this should always return zero,
right? Unfortunately, no. IEEE 754 floating point arithmetic supports
NaN, infinities, and signed zeros. This function can return NaN,
positive zero, or negative zero. (In some cases, the operation could
also potentially produce a hardware exception.)</p>

<p>As a result, both GCC and Clang perform the multiply:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorpd</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">mulsd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-ffast-math</code> option relaxes the C standard floating point rules,
permitting an optimization at the cost of conformance and
<a href="https://possiblywrong.wordpress.com/2017/09/12/floating-point-agreement-between-matlab-and-c/">consistency</a>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Side note: <code class="language-plaintext highlighter-rouge">-ffast-math</code> doesn’t necessarily mean “less precise.”
Sometimes it will actually <a href="https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add">improve precision</a>.</p>

<p>Here’s a modified version of the function that’s a little more
interesting. I’ve changed the argument to a <code class="language-plaintext highlighter-rouge">short</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply_short</span><span class="p">(</span><span class="kt">short</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s no longer possible for the argument to be one of those special
values. The <code class="language-plaintext highlighter-rouge">short</code> will be promoted to one of 65,536 possible <code class="language-plaintext highlighter-rouge">double</code>
values, each of which results in 0.0 when multiplied by 0.0. GCC misses
this optimization (<code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">movsx</span>     <span class="nb">edi</span><span class="p">,</span> <span class="nb">di</span>       <span class="c1">; sign-extend 16-bit argument</span>
    <span class="nf">xorps</span>     <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>    <span class="c1">; xmm1 = 0.0</span>
    <span class="nf">cvtsi2sd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nb">edi</span>     <span class="c1">; convert int to double</span>
    <span class="nf">mulsd</span>     <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Clang also misses this optimization:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">cvtsi2sd</span> <span class="nv">xmm1</span><span class="p">,</span> <span class="nb">edi</span>
    <span class="nf">xorpd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">mulsd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>But hang on a minute. This is shorter by one instruction. What
happened to the sign-extension (<code class="language-plaintext highlighter-rouge">movsx</code>)? Clang is treating that
<code class="language-plaintext highlighter-rouge">short</code> argument as if it were a 32-bit value. Why do GCC and Clang
differ? Is GCC doing something unnecessary?</p>

<p>It turns out that the <a href="https://www.uclibc.org/docs/psABI-x86_64.pdf">x86-64 ABI</a> didn’t specify what happens with
the upper bits in argument registers. Are they garbage? Are they zeroed?
GCC takes the conservative position of assuming the upper bits are
arbitrary garbage. Clang takes the boldest position of assuming
arguments smaller than 32 bits have been promoted to 32 bits by the
caller. This is what the ABI specification <em>should</em> have said, but
currently it does not.</p>

<p>Fortunately GCC is also conservative when passing arguments. It promotes
arguments to 32 bits as necessary, so there are no conflicts when
linking against Clang-compiled code. However, this is not true for
Intel’s ICC compiler: <a href="https://web.archive.org/web/20180908113552/https://stackoverflow.com/a/36760539"><strong>Clang and ICC are not ABI-compatible on
x86-64</strong></a>.</p>

<p>I don’t use ICC, so that particular issue wouldn’t bite me, <em>but</em> if I
was ever writing assembly routines that called Clang-compiled code, I’d
eventually get bit by this.</p>

<h3 id="floating-point-precision">Floating point precision</h3>

<p>Without looking it up or trying it, what does this function return?
Think carefully.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">float_compare</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Confident in your answer? This is a trick question, because it can
return either 0 or 1 depending on the compiler. Boy was I confused when
this comparison returned 0 in my real world code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1
</code></pre></div></div>

<p>So what’s going on here? The original ANSI C specification wasn’t
clear about how intermediate floating point values get rounded, and
implementations <a href="https://news.ycombinator.com/item?id=16974770">all did it differently</a>. The C99 specification
cleaned this all up and introduced <a href="https://en.wikipedia.org/wiki/C99#IEEE_754_floating_point_support"><code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD</code></a>.
Implementations can still differ, but at least you can now determine
at compile-time what the compiler would do by inspecting that macro.</p>

<p>Back in the late 1980’s or early 1990’s when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in <code class="language-plaintext highlighter-rouge">long double</code>
precision and truncated afterward (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 2</code>).</p>

<p>In <code class="language-plaintext highlighter-rouge">float_compare()</code> the left-hand side is truncated to a <code class="language-plaintext highlighter-rouge">float</code> by the
assignment, but the right-hand side, <em>despite being a <code class="language-plaintext highlighter-rouge">float</code> literal</em>,
is actually “1.3” at 80 bits of precision as far as GCC is concerned.
That’s pretty unintuitive!</p>

<p>The remnants of this high precision trend are still in JavaScript, where
all arithmetic is double precision (even if <a href="http://thibaultlaurens.github.io/javascript/2013/04/29/how-the-v8-engine-works/#more-example-on-how-v8-optimized-javascript-code">simulated using
integers</a>), and great pains have been taken <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">to work around</a>
the performance consequences of this. <a href="http://tirania.org/blog/archive/2018/Apr-11.html">Until recently</a>, Mono had
similar issues.</p>

<p>The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this
became the default (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 0</code>). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323">backwards compatible</a> GCC on the old x86.</p>

<p>I’m a little ashamed that I’m only finding out about this now. However,
by the time I was competent enough to notice and understand this issue,
I was already doing nearly all my programming on the x86-64.</p>

<h3 id="built-in-function-elimination">Built-in Function Elimination</h3>

<p>I’ve saved this one for last since it’s my favorite. Suppose we have
this little function, <code class="language-plaintext highlighter-rouge">new_image()</code>, that allocates a greyscale image
for, say, <a href="/blog/2017/11/03/">some multimedia library</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a static function because this would be part of some <a href="https://github.com/nothings/stb">slick
header library</a> (and, secretly, because it’s necessary for
illustrating the issue). Being a responsible citizen, the function
even <a href="/blog/2017/07/19/">checks for integer overflow</a> before allocating anything.</p>

<p>I write a unit test to make sure it detects overflow. This function
should return 0.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_overflow</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far my test passes. Good.</p>

<p>I’d also like to make sure it correctly returns NULL — or, more
specifically, that it doesn’t crash — if the allocation fails. But how
can I make <code class="language-plaintext highlighter-rouge">malloc()</code> fail? As a hack I can pass image dimensions that
I know cannot ever practically be allocated. Essentially I want to
force a <code class="language-plaintext highlighter-rouge">malloc(SIZE_MAX)</code>, e.g. allocate every available byte in my
virtual address space. For a conventional 64-bit machine, that’s 16
exbibytes of memory, and it leaves space for nothing else, including
the program itself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_oom</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I compile with GCC, test passes. I compile with Clang and the test
fails. That is, <strong>the test somehow managed to allocate 16 exbibytes of
memory, <em>and</em> initialize it</strong>. Wat?</p>

<p>Disassembling the test reveals what’s going on:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">test_new_image_overflow:</span>
    <span class="nf">xor</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
    <span class="nf">ret</span>

<span class="nl">test_new_image_oom:</span>
    <span class="nf">mov</span>  <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The first test is actually being evaluated at compile time by the
compiler. The function being tested was inlined into the unit test
itself. This permits the compiler to collapse the whole thing down to
a single instruction. The path with <code class="language-plaintext highlighter-rouge">malloc()</code> became dead code and
was trivially eliminated.</p>

<p>In the second test, Clang correctly determined that the image buffer is
not actually being used, despite the <code class="language-plaintext highlighter-rouge">memset()</code>, so it eliminated the
allocation altogether and then <em>simulated</em> a successful allocation
despite it being absurdly large. Allocating memory is not an observable
side effect as far as the language specification is concerned, so it’s
allowed to do this. My thinking was wrong, and the compiler outsmarted
me.</p>
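<p>One way I might defend such a test — a sketch of my own, not something from the original code — is to make the allocation observable by touching the buffer through a <code>volatile</code> pointer. A volatile access is a side effect the compiler must preserve, so it cannot simulate the allocation away:</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of new_image() hardened against elimination. */
static unsigned char *
new_image_checked(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p && w && h) {
            memset(p, shade, w * h);
            volatile unsigned char *q = p;
            (void)q[0]; // volatile read: the buffer must really exist
        }
    }
    return p;
}
```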

<p>I soon realized I can take this further and trick Clang into
performing an invalid optimization, <a href="https://bugs.llvm.org/show_bug.cgi?id=37304">revealing a bug</a>. Consider
this slightly-optimized version that uses <code class="language-plaintext highlighter-rouge">calloc()</code> when the shade is
zero (black). The <code class="language-plaintext highlighter-rouge">calloc()</code> function does its own overflow check, so
<code class="language-plaintext highlighter-rouge">new_image()</code> doesn’t need to do it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">shade</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// shortcut</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With this change, my overflow unit test is now also failing. The
situation is even worse than before. The <code class="language-plaintext highlighter-rouge">calloc()</code> is being
eliminated <em>despite the overflow</em>, and replaced with a simulated
success. This time it’s actually a bug in Clang. While failing a unit
test is mostly harmless, <strong>this could introduce a vulnerability in a
real program</strong>. The OpenBSD folks are so worried about this sort of
thing that <a href="https://marc.info/?l=openbsd-cvs&amp;m=150125592126437&amp;w=2">they’ve disabled this optimization</a>.</p>

<p>Here’s a slightly-contrived example of this. Imagine a program that
maintains a table of unsigned integers, and we want to keep track of
how many times the program has accessed each table entry. The “access
counter” table is initialized to zero, but the table of values need
not be initialized, since they’ll be written before first access (or
something like that).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">table</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">counter</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">table_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">table</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Overflow already tested above */</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">free</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">);</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// success</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function relies on the overflow test in <code class="language-plaintext highlighter-rouge">calloc()</code> for the second
<code class="language-plaintext highlighter-rouge">malloc()</code> allocation. However, this is a static function that’s
likely to get inlined, as we saw before. If the program doesn’t
actually make use of the <code class="language-plaintext highlighter-rouge">counter</code> table, and Clang is able to
statically determine this fact, it may eliminate the <code class="language-plaintext highlighter-rouge">calloc()</code>. This
would also <strong>eliminate the overflow test, introducing a
vulnerability</strong>. If an attacker can control <code class="language-plaintext highlighter-rouge">n</code>, then they can
overwrite arbitrary memory through that <code class="language-plaintext highlighter-rouge">values</code> pointer.</p>
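<p>If I were writing this defensively — a sketch, not code from any particular project — I’d repeat the overflow check explicitly instead of hoping <code>calloc()</code>’s internal test survives optimization:</p>

```c
#include <stdint.h>
#include <stdlib.h>

struct table {
    unsigned *counter;
    unsigned *values;
};

static int
table_init_checked(struct table *t, size_t n)
{
    /* Check overflow ourselves; don't rely on calloc()'s test,
     * which the compiler may optimize away along with the call. */
    if (n > SIZE_MAX / sizeof(*t->values)) {
        return 0; // fail: n * sizeof(*t->values) would overflow
    }
    t->counter = calloc(n, sizeof(*t->counter));
    if (!t->counter) {
        return 0; // fail
    }
    t->values = malloc(n * sizeof(*t->values));
    if (!t->values) {
        free(t->counter);
        return 0; // fail
    }
    return 1; // success
}
```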

<h3 id="the-takeaway">The takeaway</h3>

<p>Besides this surprising little bug, the main lesson for me is that I
should probably isolate unit tests from the code being tested. The
easiest solution is to put them in separate translation units and don’t
use link-time optimization (LTO). Allowing tested functions to be
inlined into the unit tests is probably a bad idea.</p>

<p>The unit test issues in my <em>real</em> program, which was <a href="https://github.com/skeeto/growable-buf">a bit more
sophisticated</a> than what was presented here, gave me artificial
intelligence vibes. It’s that situation where a computer algorithm did
something really clever and I felt it outsmarted me. It’s creepy to
consider <a href="https://wiki.lesswrong.com/wiki/Paperclip_maximizer">how far that can go</a>. I’ve gotten that even from
observing <a href="/blog/2017/04/27/">AI I’ve written myself</a>, and I know for sure no human
taught it some particularly clever trick.</p>

<p>My favorite AI story along these lines is about <a href="https://www.youtube.com/watch?v=xOCurBYI_gY">an AI that learned
how to play games on the Nintendo Entertainment System</a>. It
didn’t understand the games it was playing. Its optimization task was
simply to choose controller inputs that maximized memory values,
because that’s generally associated with doing well — higher scores,
more progress, etc. The most unexpected part came when playing Tetris.
Eventually the screen would fill up with blocks, and the AI would face
the inevitable situation of losing the game, with all that memory
being reinitialized to low values. So what did it do?</p>

<p>Just before the end it would pause the game and wait… forever.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Branchless UTF-8 Decoder</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/10/06/"/>
    <id>urn:uuid:d62a6a1f-0e34-325e-9196-d66a354bc9b1</id>
    <updated>2017-10-06T23:29:02Z</updated>
    <category term="c"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>This week I took a crack at writing a branchless <a href="https://tools.ietf.org/html/rfc3629">UTF-8</a> decoder:
a function that decodes a single UTF-8 code point from a byte stream
without any <code class="language-plaintext highlighter-rouge">if</code> statements, loops, short-circuit operators, or other
sorts of conditional jumps. You can find the source code here along
with a test suite and benchmark:</p>

<ul>
  <li><a href="https://github.com/skeeto/branchless-utf8">https://github.com/skeeto/branchless-utf8</a></li>
</ul>

<p>In addition to decoding the next code point, it detects any errors and
returns a pointer to the next code point. It’s the complete package.</p>

<p>Why branchless? Because high performance CPUs are pipelined. That is,
a single instruction is executed over a series of stages, and many
instructions are executed in overlapping time intervals, each at a
different stage.</p>

<p>The usual analogy is laundry. You can have more than one load of
laundry in process at a time because laundry is typically a pipelined
process. There’s a washing machine stage, dryer stage, and folding
stage. One load can be in the washer, a second in the dryer, and a
third being folded, all at once. This greatly increases throughput
because, under ideal circumstances with a full pipeline, an
instruction is completed each clock cycle despite any individual
instruction taking many clock cycles to complete.</p>

<p>Branches are the enemy of pipelines. The CPU can’t begin work on the
next instruction if it doesn’t know which instruction will be executed
next. It must finish computing the branch condition before it can
know. To deal with this, pipelined CPUs are also equipped with <em>branch
predictors</em>. The predictor guesses which branch will be taken and begins
executing instructions on that branch. The prediction is initially
made using static heuristics, and later those predictions are improved
<a href="http://www.agner.org/optimize/microarchitecture.pdf">by learning from previous behavior</a>. This even includes
predicting the number of iterations of a loop so that the final
iteration isn’t mispredicted.</p>

<p>A mispredicted branch has two dire consequences. First, all the
progress on the incorrect branch will need to be discarded. Second,
the pipeline will be flushed, and the CPU will be inefficient until
the pipeline fills back up with instructions on the correct branch.
With a sufficiently deep pipeline, it can easily be <strong>more efficient
to compute and discard an unneeded result than to avoid computing it
in the first place</strong>. Eliminating branches means eliminating the
hazards of misprediction.</p>
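<p>A tiny illustration of that trade-off — computing both outcomes and selecting without a jump — is the classic branchless minimum (a generic sketch, not part of the decoder itself):</p>

```c
#include <stdint.h>

/* Branchless min: -(a < b) is all ones when a < b and all zeros
 * otherwise, so the mask either selects a's bits or leaves b alone. */
static int32_t
min32(int32_t a, int32_t b)
{
    int32_t mask = -(a < b);
    return b ^ ((a ^ b) & mask);
}
```

Modern compilers often emit a conditional move for the naive ternary anyway, but the idiom makes the no-branch intent explicit.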

<p>Another hazard for pipelines is <em>dependencies</em>. If an instruction
depends on the result of a previous instruction, it may have to wait for
the previous instruction to make sufficient progress before it can
complete one of its stages. This is known as a <em>pipeline stall</em>, and it
is an important consideration in instruction set architecture (ISA)
design.</p>

<p>For example, on the x86-64 architecture, storing a 32-bit result in a
64-bit register will automatically clear the upper 32 bits of that
register. Any further use of that destination register cannot depend on
prior instructions since all bits have been set. This particular
optimization was missed in the design of the i386: Writing a 16-bit
result to a 32-bit register leaves the upper 16 bits intact, creating
false dependencies.</p>

<p>Dependency hazards are mitigated using <em>out-of-order execution</em>.
Rather than execute two dependent instructions back to back, which
would result in a stall, the CPU may instead execute an independent
instruction further away in between. A good compiler will also try to
spread out dependent instructions in its own instruction scheduling.</p>

<p>The effects of out-of-order execution are typically not visible to a
single thread, where everything will appear to have executed in order.
However, when multiple processes or threads can access the same memory
<a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act/">out-of-order execution can be observed</a>. It’s one of the
many <a href="/blog/2014/09/02/">challenges of writing multi-threaded software</a>.</p>

<p>The focus of my UTF-8 decoder was to be branchless, but there was one
interesting dependency hazard that neither GCC nor Clang were able to
resolve themselves. More on that later.</p>

<h3 id="what-is-utf-8">What is UTF-8?</h3>

<p>Without getting into the history of it, you can generally think of
<a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a> as a method for encoding a series of 21-bit integers
(<em>code points</em>) into a stream of bytes.</p>

<ul>
  <li>
    <p>Shorter integers encode to fewer bytes than larger integers. The
shortest available encoding must be chosen, meaning there is one
canonical encoding for a given sequence of code points.</p>
  </li>
  <li>
    <p>Certain code points are off limits: <em>surrogate halves</em>. These are
code points <code class="language-plaintext highlighter-rouge">U+D800</code> through <code class="language-plaintext highlighter-rouge">U+DFFF</code>. Surrogates are used in UTF-16
to represent code points above U+FFFF and serve no purpose in UTF-8.
This has <a href="https://simonsapin.github.io/wtf-8/">interesting consequences</a> for pseudo-Unicode
strings, such as “wide” strings in the Win32 API, where surrogates may
appear unpaired. Such sequences cannot legally be represented in
UTF-8.</p>
  </li>
</ul>

<p>Keeping in mind these two rules, the entire format is summarized by
this table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>length byte[0]  byte[1]  byte[2]  byte[3]
1      0xxxxxxx
2      110xxxxx 10xxxxxx
3      1110xxxx 10xxxxxx 10xxxxxx
4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">x</code> placeholders are the bits of the encoded code point.</p>

<p>UTF-8 has some really useful properties:</p>

<ul>
  <li>
    <p>It’s backwards compatible with ASCII, which never used the highest
bit.</p>
  </li>
  <li>
    <p>Sort order is preserved. Sorting a set of code point sequences has the
same result as sorting their UTF-8 encoding.</p>
  </li>
  <li>
    <p>No additional zero bytes are introduced. In C we can continue using
null terminated <code class="language-plaintext highlighter-rouge">char</code> buffers, often without even realizing they
hold UTF-8 data.</p>
  </li>
  <li>
    <p>It’s self-synchronizing. A leading byte will never be mistaken for a
continuation byte. This allows for byte-wise substring searches,
meaning UTF-8 unaware functions like <code class="language-plaintext highlighter-rouge">strstr(3)</code> continue to work
without modification (except for normalization issues). It also
allows for unambiguous recovery of a damaged stream.</p>
  </li>
</ul>

<p>A straightforward approach to decoding might look something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">utf8_simple</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mh">0x80</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xe0</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x1f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xf0</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xe0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x0f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">3</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xf8</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xf0</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mh">0xf4</span><span class="p">))</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">18</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">4</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// invalid</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// skip this byte</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span> <span class="o">&gt;=</span> <span class="mh">0xd800</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">c</span> <span class="o">&lt;=</span> <span class="mh">0xdfff</span><span class="p">)</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// surrogate half</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It branches off on the highest bits of the leading byte, extracts all of
those <code class="language-plaintext highlighter-rouge">x</code> bits from each byte, concatenates those bits, checks if it’s a
surrogate half, and returns a pointer to the next character. (This
implementation does <em>not</em> check that the highest two bits of each
continuation byte are correct.)</p>

<p>The CPU must correctly predict the length of the code point or else it
will suffer a hazard. An incorrect guess will stall the pipeline and
slow down decoding.</p>

<p>In real world text this is probably not a serious issue. For the
English language, the encoded length is nearly always a single byte.
However, even for non-English languages, text is <a href="http://utf8everywhere.org/">usually accompanied
by markup from the ASCII range of characters</a>, and, overall,
the encoded lengths will still be fairly consistent. As I said, the CPU
predicts branches based on the program’s previous behavior, so this
means it will temporarily learn some of the statistical properties of
the language being actively decoded. Pretty cool, eh?</p>

<p>Eliminating branches from the decoder side-steps any issues with
mispredicting encoded lengths. Only errors in the stream will cause
stalls. Since that’s probably the unusual case, the branch predictor
will be very successful by continually predicting success. That’s one
optimistic CPU.</p>

<h3 id="the-branchless-decoder">The branchless decoder</h3>

<p>Here’s the interface to my branchless decoder:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">utf8_decode</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">e</span><span class="p">);</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">void *</code> for the buffer so that it doesn’t care what type was
actually chosen to represent the buffer. It could be a <code class="language-plaintext highlighter-rouge">uint8_t</code>,
<code class="language-plaintext highlighter-rouge">char</code>, <code class="language-plaintext highlighter-rouge">unsigned char</code>, etc. Doesn’t matter. The decoder accesses it
only as bytes.</p>

<p>On the other hand, with this interface you’re forced to use <code class="language-plaintext highlighter-rouge">uint32_t</code>
to represent code points. You could always change the function to suit
your own needs, though.</p>

<p>Errors are returned in <code class="language-plaintext highlighter-rouge">e</code>. It’s zero for success and non-zero when an
error was detected, without any particular meaning for different values.
Error conditions are mixed into this integer, so a zero simply means the
absence of error.</p>

<p>This is where you could accuse me of “cheating” a little bit. The
caller probably wants to check for errors, and so <em>they</em> will have to
branch on <code class="language-plaintext highlighter-rouge">e</code>. It seems I’ve just smuggled the branches outside of the
decoder.</p>

<p>However, as I pointed out, unless you’re expecting lots of errors, the
real cost is branching on encoded lengths. Furthermore, the caller
could instead accumulate the errors: count them, or make the error
“sticky” by ORing all <code class="language-plaintext highlighter-rouge">e</code> values together. Neither of these require a
branch. The caller could decode a huge stream and only check for
errors at the very end. The only branch would be the main loop (“are
we done yet?”), which is trivial to predict with high accuracy.</p>

<p>The first thing the function does is extract the encoded length of the
next code point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="n">lengths</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
        <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span>
    <span class="p">};</span>

    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">lengths</span><span class="p">[</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">3</span><span class="p">];</span>
</code></pre></div></div>

<p>Looking back to the UTF-8 table above, only the highest 5 bits determine
the length. That’s 32 possible values. The zeros are for invalid
prefixes. This will later cause a bit to be set in <code class="language-plaintext highlighter-rouge">e</code>.</p>

<p>With the length in hand, it can compute the position of the next code
point in the buffer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="n">len</span> <span class="o">+</span> <span class="o">!</span><span class="n">len</span><span class="p">;</span>
</code></pre></div></div>

<p>Originally this expression was the return value, computed at the very
end of the function. However, after inspecting the compiler’s assembly
output, I decided to move it up, and the result was a solid performance
boost. That’s because it spreads out dependent instructions. With the
address of the next code point known so early, <a href="https://www.youtube.com/watch?v=2EWejmkKlxs">the instructions that
decode the next code point can get started early</a>.</p>

<p>The reason for the <code class="language-plaintext highlighter-rouge">!len</code> is so that the pointer is advanced one byte
even in the face of an error (length of zero). Adding that <code class="language-plaintext highlighter-rouge">!len</code> is
actually somewhat costly, though I couldn’t figure out why.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">masks</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x7f</span><span class="p">,</span> <span class="mh">0x1f</span><span class="p">,</span> <span class="mh">0x0f</span><span class="p">,</span> <span class="mh">0x07</span><span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">shiftc</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="o">*</span><span class="n">c</span>  <span class="o">=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">masks</span><span class="p">[</span><span class="n">len</span><span class="p">])</span> <span class="o">&lt;&lt;</span> <span class="mi">18</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">&gt;&gt;=</span> <span class="n">shiftc</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
</code></pre></div></div>

<p>This reads four bytes regardless of the actual length. Avoiding the
unneeded reads would require a branch, so it can’t be helped. The unneeded bits are
shifted out based on the length. That’s all it takes to decode UTF-8
without branching.</p>

<p>One important consequence of always reading four bytes is that <strong>the
caller <em>must</em> zero-pad the buffer to at least four bytes</strong>. In practice,
this means padding the entire buffer with three bytes in case the last
character is a single byte.</p>

<p>The padding must be zero in order to detect errors. Otherwise the
padding might look like legal continuation bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">mins</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">4194304</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">2048</span><span class="p">,</span> <span class="mi">65536</span><span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">shifte</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="o">*</span><span class="n">e</span>  <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span> <span class="o">&lt;</span> <span class="n">mins</span><span class="p">[</span><span class="n">len</span><span class="p">])</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">((</span><span class="o">*</span><span class="n">c</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x1b</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">7</span><span class="p">;</span>  <span class="c1">// surrogate half?</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>       <span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">^=</span> <span class="mh">0x2a</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">&gt;&gt;=</span> <span class="n">shifte</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
</code></pre></div></div>

<p>The first line checks if the shortest encoding was used, setting a bit
in <code class="language-plaintext highlighter-rouge">e</code> if it wasn’t. For a length of 0, this always fails.</p>

<p>The second line checks for a surrogate half by checking for a certain
prefix.</p>

<p>The next three lines accumulate the highest two bits of each
continuation byte into <code class="language-plaintext highlighter-rouge">e</code>. Each should be the bits <code class="language-plaintext highlighter-rouge">10</code>. These bits are
“compared” to <code class="language-plaintext highlighter-rouge">101010</code> (<code class="language-plaintext highlighter-rouge">0x2a</code>) using XOR. The XOR clears these bits as
long as they exactly match.</p>

<p><img src="/img/diagram/utf8-bits.svg" alt="" /></p>

<p>Finally the continuation prefix bits that don’t matter are shifted out.</p>

<h3 id="the-goal">The goal</h3>

<p>My primary — and totally arbitrary — goal was to beat the performance of
<a href="http://bjoern.hoehrmann.de/utf-8/decoder/dfa/">Björn Höhrmann’s DFA-based decoder</a>. Under favorable (and
artificial) benchmark conditions I had moderate success. You can try it
out on your own system by cloning the repository and running <code class="language-plaintext highlighter-rouge">make
bench</code>.</p>

<p>With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the
DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.</p>

<p><em>Update</em>: <a href="https://github.com/skeeto/branchless-utf8/issues/1">Björn pointed out</a> that his site includes a faster
variant of his DFA decoder. It is only 10% slower than the branchless
decoder with GCC, and it’s 20% faster than the branchless decoder with
Clang. So, in a sense, it’s still faster on average, even on a
benchmark that favors a branchless decoder.</p>

<p>The benchmark operates very similarly to <a href="/blog/2017/09/21/">my PRNG shootout</a> (e.g.
<code class="language-plaintext highlighter-rouge">alarm(2)</code>). First a buffer is filled with random UTF-8 data, then the
decoder decodes it again and again until the alarm fires. The
measurement is the number of bytes decoded.</p>

<p>The number of errors is printed at the end (always 0) in order to force
errors to actually get checked for each code point. Otherwise the sneaky
compiler omits the error checking from the branchless decoder, making it
appear much faster than it really is — a serious letdown once I noticed
my error. Since the other decoder is a DFA and error checking is built
into its graph, the compiler can’t really omit its error checking.</p>

<p>I called this “favorable” because the buffer being decoded isn’t
anything natural. Each time a code point is generated, first a length is
chosen uniformly: 1, 2, 3, or 4. Then a code point that encodes to that
length is generated. The <strong>even distribution of lengths greatly favors a
branchless decoder</strong>. The random distribution inhibits branch
prediction. Real text has a far more favorable distribution.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">randchar</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">rand32</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">r</span> <span class="o">&amp;</span> <span class="mh">0x3</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">&gt;&gt;=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">r</span> <span class="o">%</span> <span class="mi">128</span><span class="p">;</span>
        <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">128</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">2048</span> <span class="o">-</span> <span class="mi">128</span><span class="p">);</span>
        <span class="k">case</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">2048</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">65536</span> <span class="o">-</span> <span class="mi">2048</span><span class="p">);</span>
        <span class="k">case</span> <span class="mi">4</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">65536</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">131072</span> <span class="o">-</span> <span class="mi">65536</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">abort</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Given the odd input zero-padding requirement and the artificial
parameters of the benchmark, despite the supposed 20% speed boost
under GCC, my branchless decoder is not really any better than the DFA
decoder in practice. It’s just a different approach. In practice I’d
prefer Björn’s DFA decoder.</p>

<p><em>Update</em>: Bryan Donlan has followed up with <a href="https://github.com/bdonlan/branchless-utf8/commit/3802d3b0e10ea16810dd40f8116243971ff7603d">a SIMD UTF-8 decoder</a>.</p>

<p><em>Update 2024</em>: NRK has followed up with a <a href="https://nrk.neocities.org/articles/utf8-pext.html">parallel extract decoder</a>.</p>

<p><em>Update 2025</em>: Charles Eckman followed up by <a href="https://cceckman.com/writing/branchless-utf8-encoding/">sharing a branchless
encoder</a>, which inspired me to <a href="https://github.com/skeeto/scratch/blob/master/misc/utf8_branchless.c">give it a shot</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Finding the Best 64-bit Simulation PRNG</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/09/21/"/>
    <id>urn:uuid:637af55f-6e33-31e5-25fa-edb590a16d44</id>
    <updated>2017-09-21T21:25:00Z</updated>
    <category term="c"/><category term="compsci"/><category term="x86"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>August 2018 Update</strong>: <em>xoroshiro128+ fails <a href="http://pracrand.sourceforge.net/">PractRand</a> very
badly. Since this article was published, its authors have supplanted it
with <strong>xoshiro256**</strong>. It has essentially the same performance, but
better statistical properties. xoshiro256** is now my preferred PRNG.</em></p>

<p>I use pseudo-random number generators (PRNGs) a whole lot. They’re an
essential component in lots of algorithms and processes.</p>

<ul>
  <li>
    <p><strong>Monte Carlo simulations</strong>, where PRNGs are used to <a href="https://possiblywrong.wordpress.com/2015/09/15/kanoodle-iq-fit-and-dancing-links/">compute
numeric estimates</a> for problems that are difficult or impossible
to solve analytically.</p>
  </li>
  <li>
    <p><a href="/blog/2017/04/27/"><strong>Monte Carlo tree search AI</strong></a>, where massive numbers of games
are played out randomly in search of an optimal move. This is a
specific application of the previous item.</p>
  </li>
  <li>
    <p><a href="https://github.com/skeeto/carpet-fractal-genetics"><strong>Genetic algorithms</strong></a>, where a PRNG creates the initial
population, and then later guides in mutation and breeding of selected
solutions.</p>
  </li>
  <li>
    <p><a href="https://blog.cr.yp.to/20140205-entropy.html"><strong>Cryptography</strong></a>, where cryptographically-secure PRNGs
(CSPRNGs) produce output that is predictable for recipients who know
a particular secret, but not for anyone else. This article is only
concerned with plain PRNGs.</p>
  </li>
</ul>

<p>For the first three “simulation” uses, there are two primary factors
that drive the selection of a PRNG. These factors can be at odds with
each other:</p>

<ol>
  <li>
    <p>The PRNG should be <em>very</em> fast. The application should spend its
time running the actual algorithms, not generating random numbers.</p>
  </li>
  <li>
    <p>PRNG output should have robust statistical qualities. Bits should
appear to be independent and the output should closely follow the
desired distribution. Poor quality output will negatively affect
the algorithms using it. Also just as important is <a href="http://mumble.net/~campbell/2014/04/28/uniform-random-float">how you use
it</a>, but this article will focus only on generating bits.</p>
  </li>
</ol>

<p>In other situations, such as in cryptography or online gambling,
another important property is that an observer can’t learn anything
meaningful about the PRNG’s internal state from its output. For the
three simulation cases I care about, this is not a concern. Only speed
and quality properties matter.</p>

<p>Depending on the programming language, the PRNGs found in various
standard libraries may be of dubious quality. They’re slower than they
need to be, or have poorer quality than required. In some cases, such
as <code class="language-plaintext highlighter-rouge">rand()</code> in C, the algorithm isn’t specified, and you can’t rely on
it for anything outside of trivial examples. In other cases the
algorithm and behavior <em>is</em> specified, but you could easily do better
yourself.</p>

<p>My preference is to BYOPRNG: <em>Bring Your Own Pseudo-random Number
Generator</em>. You get reliable, identical output everywhere. Also, in
the case of C and C++ — and if you do it right — by embedding the PRNG
in your project, it will get inlined and unrolled, making it far more
efficient than a <a href="/blog/2016/10/27/">slow call into a dynamic library</a>.</p>

<p>A fast PRNG is going to be small, making it a great candidate for
embedding as, say, a header library. That leaves just one important
question, “Can the PRNG be small <em>and</em> have high quality output?” In
the 21st century, the answer to this question is an emphatic “yes!”</p>

<p>For the past few years my main go-to for a drop-in PRNG has been
<a href="https://en.wikipedia.org/wiki/Xorshift">xorshift*</a>. The body of the function is 6 lines of C, and its
entire state is a 64-bit integer, directly seeded. However, there are a
number of choices here, including other variants of Xorshift. How do I
know which one is best? The only way to know is to test it, hence my
64-bit PRNG shootout:</p>

<ul>
  <li><a href="https://github.com/skeeto/prng64-shootout"><strong>64-bit PRNG Shootout</strong></a></li>
</ul>

<p>Sure, there <a href="http://xoroshiro.di.unimi.it/">are other such shootouts</a>, but they’re all missing
something I want to measure. I also want to test in an environment very
close to how I’d use these PRNGs myself.</p>

<h3 id="shootout-results">Shootout results</h3>

<p>Before getting into the details of the benchmark and each generator,
here are the results. These tests were run on an i7-6700 (Skylake)
running Linux 4.9.0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                               Speed (MB/s)
PRNG           FAIL  WEAK  gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline          X     X      15000       13100
blowfishcbc16     0     1        169         157
blowfishcbc4      0     5        725         676
blowfishctr16     1     3        187         184
blowfishctr4      1     5        890        1000
mt64              1     7       1700        1970
pcg64             0     4       4150        3290
rc4               0     5        366         185
spcg64            0     8       5140        4960
xoroshiro128+     0     6       8100        7720
xorshift128+      0     2       7660        6530
xorshift64*       0     3       4990        5060
</code></pre></div></div>

<p><strong>The clear winner is <a href="http://xoroshiro.di.unimi.it/">xoroshiro128+</a></strong>, with a function body of
just 7 lines of C. It’s clearly the fastest, and the output had no
observed statistical failures. However, that’s not the whole story. A
couple of the other PRNGs have advantages that situationally make
them better suited than xoroshiro128+. I’ll go over these in the
discussion below.</p>

<p>These two versions of GCC and Clang were chosen because these are the
latest available in Debian 9 “Stretch.” It’s easy to build and run the
benchmark yourself if you want to try a different version.</p>

<h3 id="speed-benchmark">Speed benchmark</h3>

<p>In the speed benchmark, the PRNG is initialized, a 1-second <code class="language-plaintext highlighter-rouge">alarm(1)</code>
is set, then the PRNG fills a large <code class="language-plaintext highlighter-rouge">volatile</code> buffer of 64-bit unsigned
integers again and again as quickly as possible until the alarm fires.
The amount of memory written is measured as the PRNG’s speed.</p>

<p>The baseline “PRNG” writes zeros into the buffer. This represents the
absolute speed limit that no PRNG can exceed.</p>

<p>The purpose for making the buffer <code class="language-plaintext highlighter-rouge">volatile</code> is to force the entire
output to actually be “consumed” as far as the compiler is concerned.
Otherwise the compiler plays nasty tricks to make the program do as
little work as possible. Another way to deal with this would be to
<code class="language-plaintext highlighter-rouge">write(2)</code> the buffer, but of course I didn’t want to introduce
unnecessary I/O into a benchmark.</p>

<p>On Linux, SIGALRM was impressively consistent between runs, meaning it
was perfectly suitable for this benchmark. To account for any process
scheduling wonkiness, the benchmark was run 8 times and only the
fastest time was kept.</p>

<p>The SIGALRM handler sets a <code class="language-plaintext highlighter-rouge">volatile</code> global variable that tells the
generator to stop. The PRNG call was unrolled 8 times to keep the
alarm check from significantly impacting the benchmark. You can see
the effect for yourself by changing <code class="language-plaintext highlighter-rouge">UNROLL</code> to 1 (i.e. “don’t
unroll”) in the code. Unrolling beyond 8 times had no measurable
effect in my tests.</p>

<p>Due to the PRNGs being inlined, this unrolling makes the benchmark
less realistic, and it shows in the results. Using <code class="language-plaintext highlighter-rouge">volatile</code> for the
buffer helped to counter this effect and reground the results. This is
a fuzzy problem, and there’s not really any way to avoid it, but I
will also discuss this below.</p>

<h3 id="statistical-benchmark">Statistical benchmark</h3>

<p>To measure the statistical quality of each PRNG — mostly as a sanity
check — the raw binary output was run through <a href="http://webhome.phy.duke.edu/~rgb/General/dieharder.php">dieharder</a> 3.31.1:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prng | dieharder -g200 -a -m4
</code></pre></div></div>

<p>This statistical analysis has no timing characteristics and the
results should be the same everywhere. You would only need to re-run
it to test with a different version of dieharder, or a different
analysis tool.</p>

<p>There’s not much information to glean from this part of the shootout.
It mostly confirms that all of these PRNGs would work fine for
simulation purposes. The WEAK results are not very significant and are
only useful for breaking ties. Even a true RNG will get some WEAK
results. For example, the <a href="https://en.wikipedia.org/wiki/RdRand">x86 RDRAND</a> instruction (not
included in actual shootout) got 7 WEAK results in my tests.</p>

<p>The FAIL results are more significant, but a single failure doesn’t
mean much. A non-failing PRNG should be preferred to an otherwise
equal PRNG with a failure.</p>

<h3 id="individual-prngs">Individual PRNGs</h3>

<p>Admittedly the definition for “64-bit PRNG” is rather vague. My high
performance targets are all 64-bit platforms, so the highest PRNG
throughput will be built on 64-bit operations (<a href="/blog/2015/07/10/">if not wider</a>).
The original plan was to focus on PRNGs built from 64-bit operations.</p>

<p>Curiosity got the best of me, so I included some PRNGs that don’t use
<em>any</em> 64-bit operations. I just wanted to see how they stacked up.</p>

<h4 id="blowfish">Blowfish</h4>

<p>One of the reasons I <a href="/blog/2017/09/15/">wrote a Blowfish implementation</a> was to
evaluate its performance and statistical qualities, so naturally I
included it in the benchmark. It only uses 32-bit addition and 32-bit
XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit
integer. There are two different properties that combine to make four
variants in the benchmark: number of rounds and block mode.</p>

<p>Blowfish normally uses 16 rounds. This makes it a lot slower than a
non-cryptographic PRNG but gives it a <em>security margin</em>. I don’t care
about the security margin, so I included a 4-round variant. As
expected, it’s about four times faster.</p>

<p>The other feature I tested is the block mode: <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#CBC">Cipher Block
Chaining</a> (CBC) versus <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29">Counter</a> (CTR) mode. In CBC mode it
encrypts zeros as plaintext. This just means it’s encrypting its last
output. The ciphertext is the PRNG’s output.</p>

<p>In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster
than CBC in the 16-round variant and 23% faster in the 4-round variant.
The reason is simple, and it’s in part an artifact of unrolling the
generation loop in the benchmark.</p>

<p>In CBC mode, each output depends on the previous, but in CTR mode all
blocks are independent. Work can begin on the next output before the
previous output is complete. The x86 architecture uses out-of-order
execution to achieve many of its performance gains: Instructions may
be executed in a different order than they appear in the program,
though their observable effects must <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act/">generally be ordered
correctly</a>. Breaking dependencies between instructions allows
out-of-order execution to be fully exercised. It also gives the
compiler more freedom in instruction scheduling, though the <code class="language-plaintext highlighter-rouge">volatile</code>
accesses cannot be reordered with respect to each other (hence it
helping to reground the benchmark).</p>

<p>Statistically, the 4-round cipher was not significantly worse than the
16-round cipher. For simulation purposes the 4-round cipher would be
perfectly sufficient, though xoroshiro128+ is still more than 9 times
faster without sacrificing quality.</p>

<p>On the other hand, CTR mode had a single failure in both the 4-round
(dab_filltree2) and 16-round (dab_filltree) variants. Is there
something about CTR mode that makes it less suitable than CBC mode as
a PRNG, at least for Blowfish?</p>

<p>In the end Blowfish is too slow and too complicated to serve as a
simulation PRNG. This was entirely expected, but it’s interesting to see
how it stacks up.</p>

<h4 id="mersenne-twister-mt19937-64">Mersenne Twister (MT19937-64)</h4>

<p>Nobody ever got fired for choosing <a href="https://en.wikipedia.org/wiki/Mersenne_Twister">Mersenne Twister</a>. It’s the
classical choice for simulations, and is still usually recommended to
this day. However, Mersenne Twister’s best days are behind it. I
tested the 64-bit variant, MT19937-64, and there are four problems:</p>

<ul>
  <li>
    <p>It’s between 1/4 and 1/5 the speed of xoroshiro128+.</p>
  </li>
  <li>
    <p>It’s got a large state: 2,500 bytes, versus xoroshiro128+’s 16 bytes.</p>
  </li>
  <li>
    <p>Its implementation is three times bigger than xoroshiro128+, and much
more complicated.</p>
  </li>
  <li>
    <p>It had one statistical failure (dab_filltree2).</p>
  </li>
</ul>

<p>Curiously my implementation is 16% faster with Clang than GCC. Since
Mersenne Twister isn’t seriously in the running, I didn’t take time to
dig into why.</p>

<p>Ultimately I would never choose Mersenne Twister for anything anymore.
This was also not surprising.</p>

<h4 id="permuted-congruential-generator-pcg">Permuted Congruential Generator (PCG)</h4>

<p>The <a href="http://www.pcg-random.org/">Permuted Congruential Generator</a> (PCG) has some really
interesting history behind it, particularly with its somewhat <a href="http://www.pcg-random.org/paper.html">unusual
paper</a>, controversial for both its excessive length (58 pages)
and informal style. It’s in close competition with Xorshift and
xoroshiro128+. I was really interested in seeing how it stacked up.</p>

<p>PCG is really just a Linear Congruential Generator (LCG) that doesn’t
output the lowest bits (too poor quality), and has an extra
permutation step to make up for the LCG’s other weaknesses. I included
two variants in my benchmark: the official PCG and a “simplified” PCG
(sPCG) with a simple permutation step. sPCG is just the first PCG
presented in the paper (34 pages in!).</p>

<p>Here’s essentially what the simplified version looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">spcg32</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="n">shift</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The third line with the modular multiplication and addition is the
LCG. The bit shift is the permutation. This PCG uses the most
significant three bits of the result to determine which 32 bits to
output. That’s <em>the</em> novel component of PCG.</p>

<p>The two constants are entirely my own devising. It’s two 64-bit primes
generated using Emacs’ <code class="language-plaintext highlighter-rouge">M-x calc</code>: <code class="language-plaintext highlighter-rouge">2 64 ^ k r k n k p k p k p</code>.</p>

<p>Heck, that’s so simple that I could easily memorize this and code it
from scratch on demand. Key takeaway: This is <strong>one way that PCG is
situationally better than xoroshiro128+</strong>. In a pinch I could use Emacs
to generate a couple of primes and code the rest from memory. If you
participate in coding competitions, take note.</p>

<p>However, you probably also noticed PCG only generates 32-bit integers
despite using 64-bit operations. To properly generate a 64-bit value
we’d need 128-bit operations, which would need to be implemented in
software.</p>

<p>Instead, I doubled up on everything to run two PRNGs in parallel.
Despite the doubling in state size, the period doesn’t get any larger
since the PRNGs don’t interact with each other. We get something in
return, though. Remember what I said about out-of-order execution?
Except for the last step combining their results, since the two PRNGs
are independent, doubling up shouldn’t <em>quite</em> halve the performance,
particularly with the benchmark loop unrolling business.</p>

<p>Here’s my doubled-up version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">spcg64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span>  <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a0</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a1</span> <span class="o">=</span> <span class="mh">0x8b260b70b8e98891</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">p0</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">p1</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">p0</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a0</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p1</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">r0</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="n">p0</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">r1</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="n">p1</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="kt">uint64_t</span> <span class="n">high</span> <span class="o">=</span> <span class="n">p0</span> <span class="o">&gt;&gt;</span> <span class="n">r0</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">low</span>  <span class="o">=</span> <span class="n">p1</span> <span class="o">&gt;&gt;</span> <span class="n">r1</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">high</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span><span class="p">)</span> <span class="o">|</span> <span class="n">low</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The “full” PCG has some extra shifts that makes it 25% (GCC) to 50%
(Clang) slower than the “simplified” PCG, but it does halve the WEAK
results.</p>

<p>In this 64-bit form, both are significantly slower than xoroshiro128+.
However, if you find yourself only needing 32 bits at a time (always
throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is
faster than using xoroshiro128+ and throwing away half its output.</p>

<h4 id="rc4">RC4</h4>

<p>This is another CSPRNG where I was curious how it would stack up. It
only uses 8-bit operations, and it generates a 64-bit integer one byte
at a time. It’s the slowest after 16-round Blowfish and generally not
useful as a simulation PRNG.</p>

<h4 id="xoroshiro128">xoroshiro128+</h4>

<p>xoroshiro128+ is the obvious winner in this benchmark and it seems to be
the best 64-bit simulation PRNG available. If you need a fast, quality
PRNG, just drop these 11 lines into your C or C++ program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xoroshiro128plus</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">s0</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">s1</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">result</span> <span class="o">=</span> <span class="n">s0</span> <span class="o">+</span> <span class="n">s1</span><span class="p">;</span>
    <span class="n">s1</span> <span class="o">^=</span> <span class="n">s0</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">s0</span> <span class="o">&lt;&lt;</span> <span class="mi">55</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">s0</span> <span class="o">&gt;&gt;</span> <span class="mi">9</span><span class="p">))</span> <span class="o">^</span> <span class="n">s1</span> <span class="o">^</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&lt;&lt;</span> <span class="mi">14</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&lt;&lt;</span> <span class="mi">36</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&gt;&gt;</span> <span class="mi">28</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s one important caveat: <strong>That 16-byte state must be
well-seeded.</strong> Having lots of zero bytes will lead to <em>terrible</em> initial
output until the generator mixes it all up. Having all zero bytes will
completely break the generator. If you’re going to seed from, say, the
unix epoch, then XOR it with 16 static random bytes.</p>

<h4 id="xorshift128-and-xorshift64">xorshift128+ and xorshift64*</h4>

<p>These generators are closely related and, like I said, xorshift64* was
what I used for years. Looks like it’s time to retire it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xorshift64star</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">25</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x2545f4914f6cdd1d</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will
tolerate weak seeding so long as it’s not literally zero. Zero will also
break this generator.</p>

<p>If it weren’t for xoroshiro128+, then xorshift128+ would have been the
winner of the benchmark and my new favorite choice.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xorshift128plus</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">y</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">23</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">^</span> <span class="n">y</span> <span class="o">^</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">y</span> <span class="o">&gt;&gt;</span> <span class="mi">26</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a lot like xoroshiro128+, including the need to be well-seeded,
but it’s just slow enough to lose out. There’s no reason to use
xorshift128+ instead of xoroshiro128+.</p>

<h3 id="conclusion">Conclusion</h3>

<p>My own takeaway (until I re-evaluate some years in the future):</p>

<ul>
  <li>The best 64-bit simulation PRNG is xoroshiro128+.</li>
  <li>“Simplified” PCG can be useful in a pinch.</li>
  <li>When only 32-bit integers are necessary, use PCG.</li>
</ul>

<p>Things can change significantly between platforms, though. Here’s the
shootout on an ARM Cortex-A53:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                    Speed (MB/s)
PRNG         gcc-5.4.0   clang-3.8.0
------------------------------------
baseline          2560        2400
blowfishcbc16       36.5        45.4
blowfishcbc4       135         173
blowfishctr16       36.4        45.2
blowfishctr4       133         168
mt64               207         254
pcg64              980         712
rc4                 96.6        44.0
spcg64            1021         948
xoroshiro128+     2560        1570
xorshift128+      2560        1520
xorshift64*       1360        1080
</code></pre></div></div>

<p>LLVM is not as mature on this platform, but, with GCC, both
xoroshiro128+ and xorshift128+ matched the baseline! It seems memory
is the bottleneck.</p>

<p>So don’t necessarily take my word for it. You can run this shootout in
your own environment — perhaps even tossing in more PRNGs — to find
what’s appropriate for your own situation.</p>

]]>
    </content>
  </entry>
    
  <entry>
    <title>OpenMP and pwrite()</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/03/01/"/>
    <id>urn:uuid:dfdf8ca6-51aa-3a15-6bf0-98b39f20652a</id>
    <updated>2017-03-01T21:22:24Z</updated>
    <category term="c"/><category term="posix"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>The most common way I introduce multi-threading to <a href="/blog/2015/07/10/">small C
programs</a> is with OpenMP (Open Multi-Processing). It’s typically
used as compiler pragmas to parallelize computationally expensive
loops — iterations are processed by different threads in some
arbitrary order.</p>

<p>Here’s an example that computes the <a href="/blog/2011/11/28/">frames of a video</a> in
parallel. Despite being computed out of order, each frame is written
in order to a large buffer, then written to standard output all at
once at the end.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_frames</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">output</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">beta</span> <span class="o">=</span> <span class="n">DEFAULT_BETA</span><span class="p">;</span>

<span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="o">&amp;</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">output</span><span class="p">);</span>
</code></pre></div></div>

<p>Adding OpenMP to this program is much simpler than introducing
low-level threading semantics with, say, Pthreads. With care, there’s
often no need for explicit thread synchronization. It’s also fairly
well supported by many vendors, even Microsoft (up to OpenMP 2.0), so
a multi-threaded OpenMP program is quite portable without <code class="language-plaintext highlighter-rouge">#ifdef</code>.</p>

<p>There’s real value in this pragma API: <strong>The above example would still
compile and run correctly even when OpenMP isn’t available.</strong> The
pragma is ignored and the program just uses a single core like it
normally would. It’s a slick fallback.</p>
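
<p>A cheap way to see this fallback in action is the standard
<code class="language-plaintext highlighter-rouge">_OPENMP</code> macro, which compilers define only
when OpenMP is enabled (e.g. GCC’s <code class="language-plaintext highlighter-rouge">-fopenmp</code>).
A minimal sketch — the message strings are my own:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#ifdef _OPENMP
#  include &lt;omp.h&gt;
#endif

int main(void)
{
#ifdef _OPENMP
    /* OpenMP build: report how many threads the runtime may use */
    printf("OpenMP %d: up to %d threads\n", _OPENMP, omp_get_max_threads());
#else
    /* Pragmas were ignored; the program is still correct, just serial */
    puts("single-threaded fallback");
#endif
    return 0;
}
</code></pre></div></div>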

<p>When a program really <em>does</em> require synchronization there’s
<code class="language-plaintext highlighter-rouge">omp_lock_t</code> (mutex lock) and the expected set of functions to operate
on them. This doesn’t have the nice fallback, so I don’t like to use
it. Instead, I prefer <code class="language-plaintext highlighter-rouge">#pragma omp critical</code>. It nicely maintains the
OpenMP-unsupported fallback.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="cp">#pragma omp critical
</span>    <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="p">}</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This would append the output to some output file in an arbitrary
order. The critical section <a href="/blog/2016/08/03/">prevents interleaving of
outputs</a>.</p>

<p>There are a couple of problems with this example:</p>

<ol>
  <li>
    <p>Only one thread can write at a time. If the write takes too long,
other threads will queue up behind the critical section and wait.</p>
  </li>
  <li>
    <p>The output frames will be out of order, which is probably
inconvenient for consumers. If the output is seekable this can be
solved with <code class="language-plaintext highlighter-rouge">lseek()</code>, but that only makes the critical section
even more important.</p>
  </li>
</ol>

<p>There’s an easy fix for both problems, one that also eliminates the
need for a critical section: POSIX <code class="language-plaintext highlighter-rouge">pwrite()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">pwrite</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">off_t</span> <span class="n">offset</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s like <code class="language-plaintext highlighter-rouge">write()</code> but has an offset parameter. Unlike <code class="language-plaintext highlighter-rouge">lseek()</code>
followed by a <code class="language-plaintext highlighter-rouge">write()</code>, multiple threads and processes can, in
parallel, safely write to the same file descriptor at different file
offsets. The catch is that <strong>the output must be a file, not a pipe</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span> <span class="o">*</span> <span class="n">i</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s no critical section, the writes can interleave, and the output
is in order.</p>

<p>If you’re concerned about standard output not being seekable (it often
isn’t), keep in mind that it will work just fine when invoked like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./compute_frames &gt; frames.ppm
</code></pre></div></div>

<h3 id="windows-portability">Windows Portability</h3>

<p>I talked about OpenMP being really portable, then used POSIX
functions. Fortunately the Win32 <code class="language-plaintext highlighter-rouge">WriteFile()</code> function has an
“overlapped” parameter that works just like <code class="language-plaintext highlighter-rouge">pwrite()</code>. Typically
rather than call either directly, I’d wrap the write like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">out</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">written</span><span class="p">;</span>
    <span class="n">OVERLAPPED</span> <span class="n">offset</span> <span class="o">=</span> <span class="p">{.</span><span class="n">Offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">};</span>
    <span class="k">return</span> <span class="n">WriteFile</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">offset</span><span class="p">);</span>
<span class="p">}</span>

<span class="cp">#else </span><span class="cm">/* POSIX */</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="n">offset</span><span class="p">)</span> <span class="o">==</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#endif
</span></code></pre></div></div>

<p>Except for switching to <code class="language-plaintext highlighter-rouge">write_frame()</code>, the OpenMP part remains
untouched.</p>

<h3 id="real-world-example">Real World Example</h3>

<p>Here’s an example in a real program:</p>

<p><a href="https://gist.github.com/skeeto/d7e17bb2aa40907a3405c3933cb1f936" class="download">julia.c</a></p>

<p>Notice that, because <code class="language-plaintext highlighter-rouge">pwrite()</code> requires a
seekable output, there’s no piping directly into
<code class="language-plaintext highlighter-rouge">ppmtoy4m</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./julia &gt; output.ppm
$ ppmtoy4m -F 60:1 &lt; output.ppm &gt; output.y4m
$ x264 -o output.mp4 output.y4m
</code></pre></div></div>

<p><a href="/video/?v=julia-256" class="download">output.mp4</a></p>

<video src="https://skeeto.s3.amazonaws.com/share/julia-256.mp4" controls="" loop="" crossorigin="anonymous">
</video>

]]>
    </content>
  </entry>
  <entry>
    <title>How to Write Fast(er) Emacs Lisp</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/01/30/"/>
    <id>urn:uuid:cee07e3d-08cc-3465-1a29-c1e30b5bd0e2</id>
    <updated>2017-01-30T21:08:19Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Not everything written in Emacs Lisp needs to be fast. Most of Emacs
itself — around 82% — is written in Emacs Lisp <em>because</em> those parts
are generally not performance-critical. Otherwise these functions
would be built-ins written in C. Extensions to Emacs don’t have a
choice and — outside of a few exceptions like <a href="/blog/2016/11/05/">dynamic modules</a>
and inferior processes — must be written in Emacs Lisp, including
their performance-critical bits. Common performance hot spots are
automatic indentation, <a href="https://github.com/mooz/js2-mode">AST parsing</a>, and <a href="/blog/2016/12/11/">interactive
completion</a>.</p>

<p>Here are 5 guidelines, each very specific to Emacs Lisp, that will
result in faster code. The non-intrusive guidelines could be applied
at all times as a matter of style — choosing one equally expressive
and maintainable form over another just because it performs better.</p>

<p>There’s one caveat: These guidelines are focused on Emacs 25.1 and
“nearby” versions. Emacs is constantly evolving. Changes to the
<a href="/blog/2014/01/04/">virtual machine</a> and byte-code compiler may transform
currently-slow expressions into fast code, obsoleting some of these
guidelines. In the future I’ll add notes to this article for anything
that changes.</p>

<h3 id="1-use-lexical-scope">(1) Use lexical scope</h3>

<p>This guideline refers to the following being the first line of every
Emacs Lisp source file you write:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;;; -*- lexical-binding: t; -*-</span>
</code></pre></div></div>

<p>This point is worth mentioning again and again. Not only will <a href="/blog/2016/12/22/">your
code be more correct</a>, it will be measurably faster. Dynamic
scope is still opt-in through the explicit use of <em>special variables</em>,
so there’s absolutely no reason not to be using lexical scope. If
you’ve written clean, dynamic scope code, then switching to lexical
scope won’t have any effect on its behavior.</p>

<p>Along similar lines, special variables are a lot slower than local,
lexical variables. Only use them when necessary.</p>
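
<p>You can measure the gap yourself with <code class="language-plaintext highlighter-rouge">benchmark-run</code>.
A rough sketch with made-up names — byte-compile both and the lexical
version should come out well ahead:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;;; -*- lexical-binding: t; -*-
(defvar my-acc 0)  ; defvar makes this a special (dynamic) variable

(defun sum-special (n)
  (setq my-acc 0)
  (dotimes (i n my-acc)
    (setq my-acc (+ my-acc i))))

(defun sum-lexical (n)
  (let ((acc 0))  ; plain lexical local
    (dotimes (i n acc)
      (setq acc (+ acc i)))))

;; Compare:
;; (benchmark-run 100 (sum-special 10000))
;; (benchmark-run 100 (sum-lexical 10000))
</code></pre></div></div>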

<h3 id="2-prefer-built-in-functions">(2) Prefer built-in functions</h3>

<p>Built-in functions are written in C and are, as expected,
significantly faster than the equivalent written in Emacs Lisp.
Complete as much work as possible inside built-in functions, even if
it might mean taking more conceptual steps overall.</p>

<p>For example, what’s the fastest way to accumulate a list of items?
That is, new items go on the tail but, for algorithmic reasons, the list
must be constructed from the head.</p>

<p>You might be tempted to keep track of the tail of the list, appending
new elements directly to the tail with <code class="language-plaintext highlighter-rouge">setcdr</code> (via <code class="language-plaintext highlighter-rouge">setf</code> below).</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">fib-track-tail</span> <span class="p">(</span><span class="nv">n</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">a</span> <span class="mi">0</span><span class="p">)</span>
         <span class="p">(</span><span class="nv">b</span> <span class="mi">1</span><span class="p">)</span>
         <span class="p">(</span><span class="nv">head</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1</span><span class="p">))</span>
         <span class="p">(</span><span class="nv">tail</span> <span class="nv">head</span><span class="p">))</span>
    <span class="p">(</span><span class="nb">dotimes</span> <span class="p">(</span><span class="nv">_</span> <span class="nv">n</span> <span class="nv">head</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">psetf</span> <span class="nv">a</span> <span class="nv">b</span>
             <span class="nv">b</span> <span class="p">(</span><span class="nb">+</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">setf</span> <span class="p">(</span><span class="nb">cdr</span> <span class="nv">tail</span><span class="p">)</span> <span class="p">(</span><span class="nb">list</span> <span class="nv">b</span><span class="p">)</span>
            <span class="nv">tail</span> <span class="p">(</span><span class="nb">cdr</span> <span class="nv">tail</span><span class="p">)))))</span>

<span class="p">(</span><span class="nv">fib-track-tail</span> <span class="mi">8</span><span class="p">)</span>
<span class="c1">;; =&gt; (1 1 2 3 5 8 13 21 34)</span>
</code></pre></div></div>

<p>Actually, it’s much faster to construct the list in reverse, then
destructively reverse it at the end.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">fib-nreverse</span> <span class="p">(</span><span class="nv">n</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">a</span> <span class="mi">0</span><span class="p">)</span>
         <span class="p">(</span><span class="nv">b</span> <span class="mi">1</span><span class="p">)</span>
         <span class="p">(</span><span class="nb">list</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">dotimes</span> <span class="p">(</span><span class="nv">_</span> <span class="nv">n</span> <span class="p">(</span><span class="nb">nreverse</span> <span class="nb">list</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">psetf</span> <span class="nv">a</span> <span class="nv">b</span>
             <span class="nv">b</span> <span class="p">(</span><span class="nb">+</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">push</span> <span class="nv">b</span> <span class="nb">list</span><span class="p">))))</span>
</code></pre></div></div>

<p>It might not look it, but <code class="language-plaintext highlighter-rouge">nreverse</code> is <em>very</em> fast. Not only is it a
built-in, it’s got its own opcode. Using <code class="language-plaintext highlighter-rouge">push</code> in a loop, then
finishing with <code class="language-plaintext highlighter-rouge">nreverse</code> is the canonical and fastest way to
accumulate a list of items.</p>

<p>In <code class="language-plaintext highlighter-rouge">fib-track-tail</code>, the added complexity of tracking the tail in
Emacs Lisp is much slower than zipping over the entire list a second
time in C.</p>

<h3 id="3-avoid-unnecessary-lambda-functions">(3) Avoid unnecessary lambda functions</h3>

<p>I’m talking about <code class="language-plaintext highlighter-rouge">mapcar</code> and friends.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; Slower</span>
<span class="p">(</span><span class="nb">defun</span> <span class="nv">expt-list</span> <span class="p">(</span><span class="nb">list</span> <span class="nv">e</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">mapcar</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span> <span class="p">(</span><span class="nb">expt</span> <span class="nv">x</span> <span class="nv">e</span><span class="p">))</span> <span class="nb">list</span><span class="p">))</span>
</code></pre></div></div>

<p>Listen, I know you love <a href="https://github.com/magnars/dash.el">dash.el</a> and higher order functions,
but <em>this habit ain’t cheap</em>. The byte-code compiler does not know how
to inline these lambdas, so there’s an additional per-element function
call overhead.</p>

<p>Worse, if you’re using lexical scope like I told you, the above
example forms a <em>closure</em> over <code class="language-plaintext highlighter-rouge">e</code>. This means a new function object
is created (e.g. <code class="language-plaintext highlighter-rouge">make-byte-code</code>) each time <code class="language-plaintext highlighter-rouge">expt-list</code> is called. To
be clear, I don’t mean that the lambda is recompiled each time — the
same byte-code string is shared between all instances of the same
lambda. A unique function vector (<code class="language-plaintext highlighter-rouge">#[...]</code>) and constants vector are
allocated and initialized each time <code class="language-plaintext highlighter-rouge">expt-list</code> is invoked.</p>

<p>Related mini-guideline: Don’t create any more garbage than strictly
necessary in performance-critical code.</p>

<p>Compare to an implementation with an explicit loop, using the
<code class="language-plaintext highlighter-rouge">nreverse</code> list-accumulation technique.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">expt-list-fast</span> <span class="p">(</span><span class="nb">list</span> <span class="nv">e</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">result</span> <span class="p">()))</span>
    <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">x</span> <span class="nb">list</span> <span class="p">(</span><span class="nb">nreverse</span> <span class="nv">result</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">push</span> <span class="p">(</span><span class="nb">expt</span> <span class="nv">x</span> <span class="nv">e</span><span class="p">)</span> <span class="nv">result</span><span class="p">))))</span>
</code></pre></div></div>

<ul>
  <li>No unnecessary garbage is created.</li>
  <li>No unnecessary per-element function calls.</li>
</ul>

<p>This is the fastest possible definition for this function, and it’s
what you need to use in performance-critical code.</p>

<p>Personally I prefer the list comprehension approach, using <code class="language-plaintext highlighter-rouge">cl-loop</code>
from <code class="language-plaintext highlighter-rouge">cl-lib</code>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">expt-list-fast</span> <span class="p">(</span><span class="nb">list</span> <span class="nv">e</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">x</span> <span class="nv">in</span> <span class="nb">list</span>
           <span class="nv">collect</span> <span class="p">(</span><span class="nb">expt</span> <span class="nv">x</span> <span class="nv">e</span><span class="p">)))</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">cl-loop</code> macro will expand into essentially the previous
definition, making them practically equivalent. It takes some getting
used to, but writing efficient loops is a whole lot less tedious with
<code class="language-plaintext highlighter-rouge">cl-loop</code>.</p>

<p>In Emacs 24.4 and earlier, <code class="language-plaintext highlighter-rouge">catch</code>/<code class="language-plaintext highlighter-rouge">throw</code> is implemented by
converting the body of the <code class="language-plaintext highlighter-rouge">catch</code> into a lambda function and calling
it. If code inside the <code class="language-plaintext highlighter-rouge">catch</code> accesses a variable outside the <code class="language-plaintext highlighter-rouge">catch</code>
(very likely), then, in lexical scope, it turns into a closure,
resulting in the garbage function object like before.</p>

<p>In Emacs 24.5 and later, the byte-code compiler uses a new opcode,
<code class="language-plaintext highlighter-rouge">pushcatch</code>. It’s a whole lot more efficient, and there’s no longer a
reason to shy away from <code class="language-plaintext highlighter-rouge">catch</code>/<code class="language-plaintext highlighter-rouge">throw</code> in performance-critical code.
This is important because it’s often the only way to perform an early
bailout.</p>
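
<p>For instance, an early bailout from a list scan — an illustrative
sketch, not from the benchmarks above:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defun first-even (list)
  "Return the first even number in LIST, or nil if there is none."
  (catch 'found
    (dolist (x list)
      (when (zerop (% x 2))
        (throw 'found x)))))

;; (first-even '(1 3 4 7))  ; =&gt; 4
</code></pre></div></div>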

<h3 id="4-prefer-using-functions-with-dedicated-opcodes">(4) Prefer using functions with dedicated opcodes</h3>

<p>When following the guideline about using built-in functions, you might
have several to pick from. Some built-in functions have dedicated
virtual machine opcodes, making them much faster to invoke. Prefer
these functions when possible.</p>

<p>How can you tell when a function has an assigned opcode? Take a peek
at the <code class="language-plaintext highlighter-rouge">byte-defop</code> listings in <a href="https://github.com/emacs-mirror/emacs/blob/master/lisp/emacs-lisp/bytecomp.el">bytecomp.el</a>. Optimization often
involves getting into the weeds, so don’t be shy.</p>
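
<p>Another way to check, without reading the compiler sources, is to
disassemble a tiny test function and look for a named instruction
rather than a generic function call:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; If assq has its own opcode, the listing shows an assq
;; instruction instead of a call through the constants vector.
(disassemble (byte-compile (lambda (alist) (assq 'key alist))))
</code></pre></div></div>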

<p>For example, the <code class="language-plaintext highlighter-rouge">assq</code> and <code class="language-plaintext highlighter-rouge">assoc</code> functions search for a matching
key in an association list (alist). Both are built-in functions, and
the only difference is that the former compares keys with <code class="language-plaintext highlighter-rouge">eq</code> (e.g.
symbol or integer keys) and the latter with <code class="language-plaintext highlighter-rouge">equal</code> (typically string
keys). The difference in performance between <code class="language-plaintext highlighter-rouge">eq</code> and <code class="language-plaintext highlighter-rouge">equal</code> isn’t as
important as another factor: <code class="language-plaintext highlighter-rouge">assq</code> has its own opcode (158).</p>

<p>This means in performance-critical code you should prefer <code class="language-plaintext highlighter-rouge">assq</code>,
perhaps even going as far as restructuring your alists specifically to
have <code class="language-plaintext highlighter-rouge">eq</code> keys. That last step is probably a trade-off, which means
you’ll want to make some benchmarks to help with that decision.</p>

<p>Another example is <code class="language-plaintext highlighter-rouge">eq</code>, <code class="language-plaintext highlighter-rouge">=</code>, <code class="language-plaintext highlighter-rouge">eql</code>, and <code class="language-plaintext highlighter-rouge">equal</code>. Some macros and
functions use <code class="language-plaintext highlighter-rouge">eql</code>, especially <code class="language-plaintext highlighter-rouge">cl-lib</code> which inherits <code class="language-plaintext highlighter-rouge">eql</code> as a
default from Common Lisp. Take <code class="language-plaintext highlighter-rouge">cl-case</code>, which is like <code class="language-plaintext highlighter-rouge">switch</code> from
the C family of languages. It compares elements with <code class="language-plaintext highlighter-rouge">eql</code>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">op-apply</span> <span class="p">(</span><span class="nv">op</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-case</span> <span class="nv">op</span>
    <span class="p">(</span><span class="ss">:norm</span> <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">a</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">b</span> <span class="nv">b</span><span class="p">)))</span>
    <span class="p">(</span><span class="ss">:disp</span> <span class="p">(</span><span class="nb">abs</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)))</span>
    <span class="p">(</span><span class="ss">:isin</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">b</span> <span class="p">(</span><span class="nb">sin</span> <span class="nv">a</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">cl-case</code> expands into a <code class="language-plaintext highlighter-rouge">cond</code>. Since Emacs byte-code lacks
support for jump tables, there’s not much room for cleverness.</p>

<p><strong>Update</strong>: Emacs 26.1, released May 2018, introduced a jump table
opcode.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">op-apply</span> <span class="p">(</span><span class="nv">op</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">cond</span>
   <span class="p">((</span><span class="nb">eql</span> <span class="nv">op</span> <span class="ss">:norm</span><span class="p">)</span> <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">a</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">b</span> <span class="nv">b</span><span class="p">)))</span>
   <span class="p">((</span><span class="nb">eql</span> <span class="nv">op</span> <span class="ss">:disp</span><span class="p">)</span> <span class="p">(</span><span class="nb">abs</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)))</span>
   <span class="p">((</span><span class="nb">eql</span> <span class="nv">op</span> <span class="ss">:isin</span><span class="p">)</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">b</span> <span class="p">(</span><span class="nb">sin</span> <span class="nv">a</span><span class="p">)))))</span>
</code></pre></div></div>

<p>It turns out <code class="language-plaintext highlighter-rouge">eql</code> is pretty much always the worst choice for
<code class="language-plaintext highlighter-rouge">cl-case</code>. Of the four equality functions I listed, the only one
lacking an opcode is <code class="language-plaintext highlighter-rouge">eql</code>. A faster definition would use <code class="language-plaintext highlighter-rouge">eq</code>. (In
theory, <code class="language-plaintext highlighter-rouge">cl-case</code> <em>could</em> have done this itself because it knows all
the keys are symbols.)</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">op-apply</span> <span class="p">(</span><span class="nv">op</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">cond</span>
   <span class="p">((</span><span class="nb">eq</span> <span class="nv">op</span> <span class="ss">:norm</span><span class="p">)</span> <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">a</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">b</span> <span class="nv">b</span><span class="p">)))</span>
   <span class="p">((</span><span class="nb">eq</span> <span class="nv">op</span> <span class="ss">:disp</span><span class="p">)</span> <span class="p">(</span><span class="nb">abs</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)))</span>
   <span class="p">((</span><span class="nb">eq</span> <span class="nv">op</span> <span class="ss">:isin</span><span class="p">)</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">b</span> <span class="p">(</span><span class="nb">sin</span> <span class="nv">a</span><span class="p">)))))</span>
</code></pre></div></div>

<p>Fortunately <code class="language-plaintext highlighter-rouge">eq</code> can safely compare integers in Emacs Lisp. You only
need <code class="language-plaintext highlighter-rouge">eql</code> when comparing symbols, integers, and floats all at once,
which is unusual.</p>
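<p>A quick illustration of the distinction (my own sketch, not from the article):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; Symbols are interned, so eq compares them reliably.
(eq 'foo 'foo)    ; t
;; Fixnums also compare reliably with eq in Emacs Lisp.
(eq 100 100)      ; t
;; Floats are the case that genuinely needs eql; eq on floats
;; compares object identity and is unreliable.
(eql 1.5 1.5)     ; t
</code></pre></div></div>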

<h3 id="5-unroll-loops-using-andor">(5) Unroll loops using and/or</h3>

<p>Consider the following function which checks its argument against a
list of numbers, bailing out on the first match. I used <code class="language-plaintext highlighter-rouge">%</code> instead of
<code class="language-plaintext highlighter-rouge">mod</code> since the former has an opcode (166) and the latter does not.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">detect</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
  <span class="p">(</span><span class="k">catch</span> <span class="ss">'found</span>
    <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">f</span> <span class="o">'</span><span class="p">(</span><span class="mi">2</span> <span class="mi">3</span> <span class="mi">5</span> <span class="mi">7</span> <span class="mi">11</span> <span class="mi">13</span> <span class="mi">17</span> <span class="mi">19</span> <span class="mi">23</span> <span class="mi">29</span> <span class="mi">31</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="nv">f</span><span class="p">))</span>
        <span class="p">(</span><span class="k">throw</span> <span class="ss">'found</span> <span class="nv">f</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The byte-code compiler doesn’t know how to unroll loops. Fortunately
that’s something we can do for ourselves using <code class="language-plaintext highlighter-rouge">and</code> and <code class="language-plaintext highlighter-rouge">or</code>. The
compiler will turn this into clean, efficient jumps in the byte-code.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">detect-unrolled</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">or</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">2</span><span class="p">))</span> <span class="mi">2</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">3</span><span class="p">))</span> <span class="mi">3</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">5</span><span class="p">))</span> <span class="mi">5</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">7</span><span class="p">))</span> <span class="mi">7</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">11</span><span class="p">))</span> <span class="mi">11</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">13</span><span class="p">))</span> <span class="mi">13</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">17</span><span class="p">))</span> <span class="mi">17</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">19</span><span class="p">))</span> <span class="mi">19</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">23</span><span class="p">))</span> <span class="mi">23</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">29</span><span class="p">))</span> <span class="mi">29</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">31</span><span class="p">))</span> <span class="mi">31</span><span class="p">)))</span>
</code></pre></div></div>

<p>In Emacs 24.4 and earlier with the old-fashioned lambda-based <code class="language-plaintext highlighter-rouge">catch</code>,
the unrolled definition is seven times faster. With the faster
<code class="language-plaintext highlighter-rouge">pushcatch</code>-based <code class="language-plaintext highlighter-rouge">catch</code> it’s about twice as fast. This means the
loop overhead accounts for about half the work of the first definition
of this function.</p>

<p>Update: It was pointed out in the comments that this particular
example is equivalent to a <code class="language-plaintext highlighter-rouge">cond</code>. That’s literally true all the way
down to the byte-code, and it would be a clearer way to express the
unrolled code. In real code it’s often not <em>quite</em> equivalent.</p>
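<p>For reference, that <code class="language-plaintext highlighter-rouge">cond</code> version would look like this (abbreviated to a few divisors; the rest follow the same pattern):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defun detect-cond (x)
  (cond ((= 0 (% x 2)) 2)
        ((= 0 (% x 3)) 3)
        ((= 0 (% x 5)) 5)
        ;; ... remaining divisors elided ...
        ((= 0 (% x 31)) 31)))
</code></pre></div></div>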

<p>Unlike some of the other guidelines, this is certainly something you’d
only want to do in code you know for sure is performance-critical.
Maintaining unrolled code is tedious and error-prone.</p>

<p>I’ve had the most success with this approach not by unrolling these
loops myself, but by <a href="/blog/2016/12/27/">using a macro</a>, or <a href="/blog/2016/12/11/">similar</a>, to
generate the unrolled form.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defmacro</span> <span class="nv">with-detect</span> <span class="p">(</span><span class="nv">var</span> <span class="nb">list</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">e</span> <span class="nv">in</span> <span class="nb">list</span>
           <span class="nv">collect</span> <span class="o">`</span><span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="o">,</span><span class="nv">var</span> <span class="o">,</span><span class="nv">e</span><span class="p">))</span> <span class="o">,</span><span class="nv">e</span><span class="p">)</span> <span class="nv">into</span> <span class="nv">conditions</span>
           <span class="nv">finally</span> <span class="nb">return</span> <span class="o">`</span><span class="p">(</span><span class="nb">or</span> <span class="o">,@</span><span class="nv">conditions</span><span class="p">)))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">detect-unrolled</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">with-detect</span> <span class="nv">x</span> <span class="p">(</span><span class="mi">2</span> <span class="mi">3</span> <span class="mi">5</span> <span class="mi">7</span> <span class="mi">11</span> <span class="mi">13</span> <span class="mi">17</span> <span class="mi">19</span> <span class="mi">23</span> <span class="mi">29</span> <span class="mi">31</span><span class="p">)))</span>
</code></pre></div></div>

<h3 id="how-can-i-find-more-optimization-opportunities-myself">How can I find more optimization opportunities myself?</h3>

<p>Use <code class="language-plaintext highlighter-rouge">M-x disassemble</code> to inspect the byte-code for your own hot spots.
Observe how the byte-code changes in response to changes in your
functions. Take note of the sorts of forms that allow the byte-code
compiler to produce the best code, and then exploit it where you can.</p>
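<p>For example, to byte-compile and then inspect one of the functions above:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; Compile the function, then pop up a buffer listing its byte-code.
(byte-compile 'detect-unrolled)
(disassemble 'detect-unrolled)
</code></pre></div></div>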

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Domain-Specific Language Compilation in Elfeed</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/27/"/>
    <id>urn:uuid:6a6cd6a2-b44d-35b5-503c-c496d9094ac0</id>
    <updated>2016-12-27T21:46:30Z</updated>
    <category term="elfeed"/><category term="emacs"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Last night I pushed another performance enhancement for Elfeed, this
time reducing the time spent parsing feeds. It’s accomplished by
compiling, during macro expansion, a jQuery-like domain-specific
language within Elfeed.</p>

<h3 id="heuristic-parsing">Heuristic parsing</h3>

<p>Given the nature of the domain — <a href="/blog/2013/09/23/">an under-specified standard</a>
and a lack of robust adherence — feed parsing is much more heuristic
than strict. Sure, everyone’s feed XML is strictly conforming since
virtually no feed reader tolerates invalid XML (thank you, XML
libraries), but, for the schema, the situation resembles the <em>de
facto</em> looseness of HTML. Sometimes important or required information
is missing, or is only available in <a href="https://www.intertwingly.net/wiki/pie/DublinCore">a different namespace</a>.
Sometimes, especially in the case of timestamps, it’s in the wrong
format, or encoded incorrectly, or ambiguous. It’s real world data.</p>

<p>To get a particular piece of information, Elfeed looks in a number of
different places within the feed, starting with the preferred source
and stopping when the information is found. For example, to find the
date of an Atom entry, Elfeed first searches for elements in this
order:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">&lt;published&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;updated&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;date&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;modified&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;issued&gt;</code></li>
</ol>

<p>Failing to find any of these elements, or if no parsable date is
found, it settles on the current time. Only the <code class="language-plaintext highlighter-rouge">updated</code> element is
required, but <code class="language-plaintext highlighter-rouge">published</code> usually has the desired information, so it
goes first. The last three are only valid for another namespace, but
are useful fallbacks.</p>

<p>Before Elfeed even starts this search, the XML text is parsed into an
s-expression using <code class="language-plaintext highlighter-rouge">xml-parse-region</code> — a pure Elisp XML parser
included in Emacs. The search is made over the resulting s-expression.</p>

<p>For example, here’s a sample <a href="https://tools.ietf.org/html/rfc4287">from the Atom specification</a>.</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">&lt;?xml version="1.0" encoding="utf-8"?&gt;</span>
<span class="nt">&lt;feed</span> <span class="na">xmlns=</span><span class="s">"http://www.w3.org/2005/Atom"</span><span class="nt">&gt;</span>

  <span class="nt">&lt;title&gt;</span>Example Feed<span class="nt">&lt;/title&gt;</span>
  <span class="nt">&lt;link</span> <span class="na">href=</span><span class="s">"http://example.org/"</span><span class="nt">/&gt;</span>
  <span class="nt">&lt;updated&gt;</span>2003-12-13T18:30:02Z<span class="nt">&lt;/updated&gt;</span>
  <span class="nt">&lt;author&gt;</span>
    <span class="nt">&lt;name&gt;</span>John Doe<span class="nt">&lt;/name&gt;</span>
  <span class="nt">&lt;/author&gt;</span>
  <span class="nt">&lt;id&gt;</span>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6<span class="nt">&lt;/id&gt;</span>

  <span class="nt">&lt;entry&gt;</span>
    <span class="nt">&lt;title&gt;</span>Atom-Powered Robots Run Amok<span class="nt">&lt;/title&gt;</span>
    <span class="nt">&lt;link</span> <span class="na">rel=</span><span class="s">"alternate"</span> <span class="na">href=</span><span class="s">"http://example.org/2003/12/13/atom03"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;id&gt;</span>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a<span class="nt">&lt;/id&gt;</span>
    <span class="nt">&lt;updated&gt;</span>2003-12-13T18:30:02Z<span class="nt">&lt;/updated&gt;</span>
    <span class="nt">&lt;summary&gt;</span>Some text.<span class="nt">&lt;/summary&gt;</span>
  <span class="nt">&lt;/entry&gt;</span>

<span class="nt">&lt;/feed&gt;</span>
</code></pre></div></div>

<p>Which is parsed into this s-expression.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">((</span><span class="nv">feed</span> <span class="p">((</span><span class="nv">xmlns</span> <span class="o">.</span> <span class="s">"http://www.w3.org/2005/Atom"</span><span class="p">))</span>
       <span class="p">(</span><span class="nv">title</span> <span class="p">()</span> <span class="s">"Example Feed"</span><span class="p">)</span>
       <span class="p">(</span><span class="nv">link</span> <span class="p">((</span><span class="nv">href</span> <span class="o">.</span> <span class="s">"http://example.org/"</span><span class="p">)))</span>
       <span class="p">(</span><span class="nv">updated</span> <span class="p">()</span> <span class="s">"2003-12-13T18:30:02Z"</span><span class="p">)</span>
       <span class="p">(</span><span class="nv">author</span> <span class="p">()</span> <span class="p">(</span><span class="nv">name</span> <span class="p">()</span> <span class="s">"John Doe"</span><span class="p">))</span>
       <span class="p">(</span><span class="nv">id</span> <span class="p">()</span> <span class="s">"urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6"</span><span class="p">)</span>
       <span class="p">(</span><span class="nv">entry</span> <span class="p">()</span>
              <span class="p">(</span><span class="nv">title</span> <span class="p">()</span> <span class="s">"Atom-Powered Robots Run Amok"</span><span class="p">)</span>
              <span class="p">(</span><span class="nv">link</span> <span class="p">((</span><span class="nv">rel</span> <span class="o">.</span> <span class="s">"alternate"</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">href</span> <span class="o">.</span> <span class="s">"http://example.org/2003/12/13/atom03"</span><span class="p">)))</span>
              <span class="p">(</span><span class="nv">id</span> <span class="p">()</span> <span class="s">"urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a"</span><span class="p">)</span>
              <span class="p">(</span><span class="nv">updated</span> <span class="p">()</span> <span class="s">"2003-12-13T18:30:02Z"</span><span class="p">)</span>
              <span class="p">(</span><span class="nv">summary</span> <span class="p">()</span> <span class="s">"Some text."</span><span class="p">))))</span>
</code></pre></div></div>

<p>Each XML element is converted to a list. The first item is a symbol
that is the element’s name. The second item is an alist of attributes
— cons pairs of symbols and strings. And the rest are its children,
both string nodes and other elements. I’ve trimmed the extraneous
string nodes from the sample s-expression.</p>

<p>A subtle detail is that <code class="language-plaintext highlighter-rouge">xml-parse-region</code> doesn’t just return the
root element. It returns a <em>list of elements</em>, which always happens to
be a one-element list containing the root element. I don’t know why
this is, but I’ve built everything to assume this structure as input.</p>

<p>Elfeed strips all namespaces from both elements and attributes to
make parsing simpler. As I said, it’s heuristic rather than strict, so
namespaces are treated as noise.</p>

<h3 id="a-domain-specific-language">A domain-specific language</h3>

<p>Coding up Elfeed’s s-expression searches in straight Emacs Lisp would
be tedious, error-prone, and difficult to understand. It’s a lot of
loops, <code class="language-plaintext highlighter-rouge">assoc</code>, etc. So instead I invented a jQuery-like, CSS
selector-like, domain-specific language (DSL) to express these
searches concisely and clearly.</p>

<p>For example, all of the entry links are “selected” using this
expression:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">link</span> <span class="nv">[rel</span> <span class="s">"alternate"</span><span class="nv">]</span> <span class="ss">:href</span><span class="p">)</span>
</code></pre></div></div>

<p>Reading right-to-left, this matches every <code class="language-plaintext highlighter-rouge">href</code> attribute under every
<code class="language-plaintext highlighter-rouge">link</code> element with the <code class="language-plaintext highlighter-rouge">rel="alternate"</code> attribute, under every
<code class="language-plaintext highlighter-rouge">entry</code> element, under the <code class="language-plaintext highlighter-rouge">feed</code> root element. Symbols match element
names, two-element vectors match elements with a particular attribute
pair, and keywords (which must come last) narrow the selection to a
specific attribute value.</p>

<p>Imagine hand-writing the code to navigate all these conditions for
each piece of information that Elfeed requires. The RSS parser makes
up to 16 such queries, and the Atom parser makes as many as 24. That
would add up to a lot of tedious code.</p>

<p>The package (included with Elfeed) that executes this query is called
“xml-query.” It comes in two flavors: <code class="language-plaintext highlighter-rouge">xml-query</code> and <code class="language-plaintext highlighter-rouge">xml-query-all</code>.
The former returns just the first match, and the latter returns all
matches. The naming parallels the <code class="language-plaintext highlighter-rouge">querySelector()</code> and
<code class="language-plaintext highlighter-rouge">querySelectorAll()</code> DOM methods in JavaScript.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">xml</span> <span class="p">(</span><span class="nv">elfeed-xml-parse-region</span><span class="p">)))</span>
  <span class="p">(</span><span class="nv">xml-query-all</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">link</span> <span class="nv">[rel</span> <span class="s">"alternate"</span><span class="nv">]</span> <span class="ss">:href</span><span class="p">)</span> <span class="nv">xml</span><span class="p">))</span>

<span class="c1">;; =&gt; ("http://example.org/2003/12/13/atom03")</span>
</code></pre></div></div>

<p>That date search I mentioned before looks roughly like this. The <code class="language-plaintext highlighter-rouge">*</code>
matches text nodes within the selected element. It must come last just
like the keyword matcher.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">or</span> <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">published</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">updated</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">date</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">modified</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">issued</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">current-time</span><span class="p">))</span>
</code></pre></div></div>

<p>Over the past three years, Elfeed has gained more and more of these
selectors as it collects more and more information from feeds. Most
recently, Elfeed collects author and category information provided by
feeds. Each new query slows feed parsing a little bit, and it’s a
perfect example of a program slowing down as it gains more features
and capabilities.</p>

<p>But I don’t want Elfeed to slow down. I want it to get <em>faster</em>!</p>

<h3 id="optimizing-the-domain-specific-language">Optimizing the domain-specific language</h3>

<p>Just like the primary jQuery function (<code class="language-plaintext highlighter-rouge">$</code>), both <code class="language-plaintext highlighter-rouge">xml-query</code> and
<code class="language-plaintext highlighter-rouge">xml-query-all</code> are functions. The xml-query engine processes the
selector from scratch on each invocation. It examines the first
element, dispatches on its type/value to apply it to the input, and
then recurses on the rest of selector with the narrowed input,
stopping when it hits the end of the list. That’s the way it’s worked
from the start.</p>

<p>However, every selector argument in Elfeed is a static, quoted list.
<a href="/blog/2016/12/11/">Unlike user-supplied filters</a>, I know exactly what I want to
execute ahead of time. It would be much better if the engine didn’t
have to waste time reparsing the DSL for each query.</p>

<p>This is the classic split between interpreters and compilers. An
interpreter reads input and immediately executes it, doing what the
input tells it to do. A compiler reads input and, rather than execute
it, produces output, usually in a simpler language, that, when
evaluated, has the same effect as executing the input.</p>

<p>Rather than interpret the selector, it would be better to compile it
into Elisp code, compile that <a href="/blog/2014/01/04/">into byte-code</a>, and then have the
Emacs byte-code virtual machine (VM) execute the query each time it’s
needed. The extra work of parsing the DSL is performed ahead of time,
the dispatch is entirely static, and the selector ultimately executes
on a much faster engine (byte-code VM). This should be a lot faster!</p>

<p>So I wrote a function that accepts a selector expression and emits
Elisp source that implements that selector: a compiler for my DSL.
Having a readily-available syntax tree is one of the <a href="https://en.wikipedia.org/wiki/Homoiconicity">big advantages
of homoiconicity</a>, and this sort of function makes perfect sense
in a lisp. For the external interface, this compiler function is
called by a new pair of macros, <code class="language-plaintext highlighter-rouge">xml-query*</code> and <code class="language-plaintext highlighter-rouge">xml-query-all*</code>.
These macros consume a static selector and expand into the compiled
Elisp form of the selector.</p>

<p>To demonstrate, remember that link query from before? Here’s the macro
version of that selection, but only returning the first match. Notice
the selector is no longer quoted. This is because it’s consumed by the
macro, not evaluated.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">xml-query*</span> <span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">link</span> <span class="nv">[rel</span> <span class="s">"alternate"</span><span class="nv">]</span> <span class="ss">:href</span><span class="p">)</span> <span class="nv">xml</span><span class="p">)</span>
</code></pre></div></div>

<p>This will expand into the following code.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">catch</span> <span class="ss">'done</span>
  <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">v</span> <span class="nv">xml</span><span class="p">)</span>
    <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">consp</span> <span class="nv">v</span><span class="p">)</span> <span class="p">(</span><span class="nb">eq</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">v</span><span class="p">)</span> <span class="ss">'feed</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">v</span> <span class="p">(</span><span class="nb">cddr</span> <span class="nv">v</span><span class="p">))</span>
        <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">consp</span> <span class="nv">v</span><span class="p">)</span> <span class="p">(</span><span class="nb">eq</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">v</span><span class="p">)</span> <span class="ss">'entry</span><span class="p">))</span>
          <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">v</span> <span class="p">(</span><span class="nb">cddr</span> <span class="nv">v</span><span class="p">))</span>
            <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">consp</span> <span class="nv">v</span><span class="p">)</span> <span class="p">(</span><span class="nb">eq</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">v</span><span class="p">)</span> <span class="ss">'link</span><span class="p">))</span>
              <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">value</span> <span class="p">(</span><span class="nb">cdr</span> <span class="p">(</span><span class="nv">assq</span> <span class="ss">'rel</span> <span class="p">(</span><span class="nb">cadr</span> <span class="nv">v</span><span class="p">)))))</span>
                <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">equal</span> <span class="nv">value</span> <span class="s">"alternate"</span><span class="p">)</span>
                  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">v</span> <span class="p">(</span><span class="nb">cdr</span> <span class="p">(</span><span class="nv">assq</span> <span class="ss">'href</span> <span class="p">(</span><span class="nb">cadr</span> <span class="nv">v</span><span class="p">)))))</span>
                    <span class="p">(</span><span class="nb">when</span> <span class="nv">v</span>
                      <span class="p">(</span><span class="k">throw</span> <span class="ss">'done</span> <span class="nv">v</span><span class="p">))))))))))))</span>
</code></pre></div></div>

<p>As soon as it finds a match, it’s thrown to the top level and
returned. Without the DSL, the expansion is essentially what would
have to be written by hand. <strong>This is <em>exactly</em> the sort of leverage
you should be getting from a compiler.</strong> It compiles to around 130
byte-code instructions.</p>

<p>The <code class="language-plaintext highlighter-rouge">xml-query-all*</code> form is nearly the same, but instead of a
<code class="language-plaintext highlighter-rouge">throw</code>, it pushes the result into the return list. Only the prologue
(the outermost part) and the epilogue (the innermost part) are
different.</p>
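<p>As a rough sketch of that shape (my own approximation, not the exact expansion), the all-matches version accumulates into a list instead of throwing:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(let ((matches ()))                       ; prologue: an accumulator
  (dolist (v xml)
    (when (and (consp v) (eq (car v) 'feed))
      ;; ... the same nested dolist/when chain as above ...
      (let ((v (cdr (assq 'href (cadr v)))))
        (when v
          (push v matches)))))            ; epilogue: push, don't throw
  (nreverse matches))
</code></pre></div></div>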

<p>Parsing feeds is a hot spot for Elfeed, so I wanted the compiler’s
output to be as efficient as possible. I had three goals for this:</p>

<ul>
  <li>
    <p><strong>No extraneous code.</strong> It’s easy for the compiler to emit
unnecessary code. The byte-code compiler might be able to eliminate
some of it, but I don’t want to rely on that. Except for the
identifiers, it should basically look like a human wrote it.</p>
  </li>
  <li>
    <p><strong>Avoid function calls.</strong> I don’t want to pay function call
overhead, and, with some care, it’s easy to avoid. In the
<code class="language-plaintext highlighter-rouge">xml-query*</code> expansion, the only function call is <code class="language-plaintext highlighter-rouge">throw</code>, which is
unavoidable. The <code class="language-plaintext highlighter-rouge">xml-query-all*</code> version makes no function calls
whatsoever. Notice that I used <code class="language-plaintext highlighter-rouge">assq</code> rather than <code class="language-plaintext highlighter-rouge">assoc</code>. First, it
only needs to match symbols, so it should be faster. Second, <code class="language-plaintext highlighter-rouge">assq</code>
has its own byte-code instruction (158) and <code class="language-plaintext highlighter-rouge">assoc</code> does not.</p>
  </li>
  <li>
    <p><strong>No unnecessary memory allocations</strong>. The <code class="language-plaintext highlighter-rouge">xml-query*</code> expansion
makes <em>no</em> allocations. The <code class="language-plaintext highlighter-rouge">xml-query-all*</code> version only conses
once per output, which is the minimum possible.</p>
  </li>
</ul>

<p>The end result is at least as efficient as hand-written code, but
without the chance of human error (typos, fat-fingering), and it is
sourced from an easy-to-read DSL.</p>

<h3 id="performance">Performance</h3>

<p>In my tests, the <strong>xml-query macros are a full order of magnitude
faster than the functions</strong>. Yes, ten times faster! It’s an even
bigger gain than I expected.</p>

<p>In the full picture, xml-query is only one part of parsing a feed.
Measuring the time starting from raw XML text (as <a href="/blog/2016/06/16/">delivered by
cURL</a>) to a list of database entry objects, I’m seeing an
<strong>overall 25% speedup</strong> with the macros. The remaining time is
dominated by <code class="language-plaintext highlighter-rouge">xml-parse-region</code>, which is mostly out of my control.</p>

<p>With xml-query so computationally cheap, I don’t need to worry about
using it more often. Compared to parsing XML text, it’s virtually
free.</p>

<p>When it came time to validate my DSL compiler, I was <em>really</em> happy
that Elfeed had a test suite. I essentially rewrote a core component
from scratch, and passing all of the unit tests was a strong sign that
it was correct. Many times that test suite has provided confidence in
changes made both by me and by others.</p>

<p>I’ll end by describing another possible application: Apply this
technique to regular expressions, such that static strings containing
regular expressions are compiled into Elisp/byte-code via macro
expansion. I wonder if situationally this would be faster than Emacs’
own regular expression engine.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Relocatable Global Data on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/23/"/>
    <id>urn:uuid:56be19e0-ce9a-3f37-dc85-578f397ed3e1</id>
    <updated>2016-12-23T22:50:51Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Relocatable code — program code that executes correctly from any
properly-aligned address — is an essential feature for shared
libraries. Otherwise all of a system’s shared libraries would need to
coordinate their virtual load addresses. Loading programs and
libraries to random addresses is also a valuable security feature:
Address Space Layout Randomization (ASLR). But how does a compiler
generate code for a function that accesses a global variable if that
variable’s address isn’t known at compile time?</p>

<p>Consider this simple C code sample.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function needs the base address of <code class="language-plaintext highlighter-rouge">values</code> in order to
dereference it for <code class="language-plaintext highlighter-rouge">values[x]</code>. The easiest way to find out how this
works, especially without knowing where to start, is to compile the
code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian
Jessie).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -Os -fPIC get_value.c
</code></pre></div></div>

<p>I optimized for size (<code class="language-plaintext highlighter-rouge">-Os</code>) to make the disassembly easier to follow.
Next, disassemble this pre-linked code with <code class="language-plaintext highlighter-rouge">objdump</code>. Alternatively I
could have asked for the compiler’s assembly output with <code class="language-plaintext highlighter-rouge">-S</code>, but
this will be good reverse engineering practice.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -d -Mintel get_value.o
0000000000000000 &lt;get_value&gt;:
   0:   83 ff 03                cmp    edi,0x3
   3:   0f 57 c0                xorps  xmm0,xmm0
   6:   77 0e                   ja     16 &lt;get_value+0x16&gt;
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
   f:   89 ff                   mov    edi,edi
  11:   f3 0f 10 04 b8          movss  xmm0,DWORD PTR [rax+rdi*4]
  16:   c3                      ret
</code></pre></div></div>

<p>There are a couple of interesting things going on, but let’s start
from the beginning.</p>

<ol>
  <li>
    <p><a href="https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-secure.pdf">The ABI</a> specifies that the first integer/pointer argument
(the 32-bit integer <code class="language-plaintext highlighter-rouge">x</code>) is passed through the <code class="language-plaintext highlighter-rouge">edi</code> register. The
function compares <code class="language-plaintext highlighter-rouge">x</code> to 3, to satisfy <code class="language-plaintext highlighter-rouge">x &lt; 4</code>.</p>
  </li>
  <li>
    <p>The ABI specifies that floating point values are returned through
the <a href="/blog/2015/07/10/">SSE2 SIMD register</a> <code class="language-plaintext highlighter-rouge">xmm0</code>. It’s cleared by XORing the
register with itself — the conventional way to clear registers on
x86 — setting up for a return value of <code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>It then uses the result of the previous comparison to perform a
jump, <code class="language-plaintext highlighter-rouge">ja</code> (“jump if after”). That is, jump to the relative address
specified by the jump’s operand if the first operand to <code class="language-plaintext highlighter-rouge">cmp</code>
(<code class="language-plaintext highlighter-rouge">edi</code>) comes after the second operand (<code class="language-plaintext highlighter-rouge">0x3</code>) as <em>unsigned</em> values.
Its cousin, <code class="language-plaintext highlighter-rouge">jg</code> (“jump if greater”), is for signed values. If <code class="language-plaintext highlighter-rouge">x</code>
is outside the array bounds, it jumps straight to <code class="language-plaintext highlighter-rouge">ret</code>, returning
<code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>If <code class="language-plaintext highlighter-rouge">x</code> was in bounds, it uses a <code class="language-plaintext highlighter-rouge">lea</code> (“load effective address”) to
load <em>something</em> into the 64-bit <code class="language-plaintext highlighter-rouge">rax</code> register. This is the
complicated bit, and I’ll start by giving the answer: The value
loaded into <code class="language-plaintext highlighter-rouge">rax</code> is the address of the <code class="language-plaintext highlighter-rouge">values</code> array. More on
this in a moment.</p>
  </li>
  <li>
    <p>Finally it uses <code class="language-plaintext highlighter-rouge">x</code> as an index into address in <code class="language-plaintext highlighter-rouge">rax</code>. The <code class="language-plaintext highlighter-rouge">movss</code>
(“move scalar single-precision”) instruction loads a 32-bit float
into the first lane of <code class="language-plaintext highlighter-rouge">xmm0</code>, where the caller expects to find the
return value. This is all preceded by a <code class="language-plaintext highlighter-rouge">mov edi, edi</code> which
<a href="/blog/2016/03/31/"><em>looks</em> like a hotpatch nop</a>, but it isn’t. x86-64 always uses
64-bit registers for addressing, meaning it uses <code class="language-plaintext highlighter-rouge">rdi</code> not <code class="language-plaintext highlighter-rouge">edi</code>.
All 32-bit register assignments clear the upper 32 bits, and so
this <code class="language-plaintext highlighter-rouge">mov</code> zero-extends <code class="language-plaintext highlighter-rouge">edi</code> into <code class="language-plaintext highlighter-rouge">rdi</code>. This guards against
the unlikely case that the caller left garbage in those upper bits.</p>
  </li>
</ol>

<h3 id="clearing-xmm0">Clearing <code class="language-plaintext highlighter-rouge">xmm0</code></h3>

<p>The first interesting part: <code class="language-plaintext highlighter-rouge">xmm0</code> is cleared even when its first lane
is loaded with a value. There are two reasons to do this.</p>

<p>The obvious reason is that the alternative requires additional
instructions, and I told GCC to optimize for size. It would need
either an extra <code class="language-plaintext highlighter-rouge">ret</code> or a conditional <code class="language-plaintext highlighter-rouge">jmp</code> over the “else” branch.</p>

<p>The less obvious reason is that it breaks a <em>data dependency</em>. For
over 20 years now, x86 micro-architectures have employed an
optimization technique called <a href="https://en.wikipedia.org/wiki/Register_renaming">register renaming</a>. <em>Architectural
registers</em> (<code class="language-plaintext highlighter-rouge">rax</code>, <code class="language-plaintext highlighter-rouge">edi</code>, etc.) are just temporary names for
underlying <em>physical registers</em>. This disconnect allows for more
aggressive out-of-order execution. Two instructions sharing an
architectural register can be executed independently so long as there
are no data dependencies between these instructions.</p>

<p>For example, take this assembly sample. It assembles to 9 bytes of
machine code.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>This reads a 32-bit value from the address stored in <code class="language-plaintext highlighter-rouge">rcx</code>, then
assigns <code class="language-plaintext highlighter-rouge">ecx</code> and uses <code class="language-plaintext highlighter-rouge">cl</code> (the lowest byte of <code class="language-plaintext highlighter-rouge">rcx</code>) in a shift
operation. Without register renaming, the shift couldn’t be performed
until the load in the first instruction completed. However, the second
instruction is a 32-bit assignment, which, as I mentioned before, also
clears the upper 32 bits of <code class="language-plaintext highlighter-rouge">rcx</code>, wiping the unused parts of the
register.</p>

<p>So after the second instruction, it’s guaranteed that the value in
<code class="language-plaintext highlighter-rouge">rcx</code> has no dependencies on code that comes before it. Because of
this, it’s likely a different physical register will be used for the
second and third instructions, allowing these instructions to be
executed out of order, <em>before</em> the load. Ingenious!</p>

<p>Compare it to this example, where the second instruction assigns to
<code class="language-plaintext highlighter-rouge">cl</code> instead of <code class="language-plaintext highlighter-rouge">ecx</code>. This assembles to just 6 bytes.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">cl</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>The result is 3 bytes smaller, but since it’s not a 32-bit assignment,
the upper bits of <code class="language-plaintext highlighter-rouge">rcx</code> still hold the original register contents.
This creates a false dependency and may prevent out-of-order
execution, reducing performance.</p>

<p>By clearing <code class="language-plaintext highlighter-rouge">xmm0</code>, instructions in <code class="language-plaintext highlighter-rouge">get_value</code> involving <code class="language-plaintext highlighter-rouge">xmm0</code> have
the opportunity to be executed prior to instructions in the caller
that use <code class="language-plaintext highlighter-rouge">xmm0</code>.</p>

<h3 id="rip-relative-addressing">RIP-relative addressing</h3>

<p>Going back to the instruction that computes the address of <code class="language-plaintext highlighter-rouge">values</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
</code></pre></div></div>

<p>Normally load/store addresses are absolute, based off an address
either in a general purpose register, or at some hard-coded base
address. The latter is not an option in relocatable code. With
<em>RIP-relative addressing</em> that’s still the case, but the register with
the absolute address is <code class="language-plaintext highlighter-rouge">rip</code>, the instruction pointer. This
addressing mode was introduced in x86-64 to make relocatable code more
efficient.</p>

<p>That means this instruction copies the instruction pointer (pointing
to the next instruction) into <code class="language-plaintext highlighter-rouge">rax</code>, plus a 32-bit displacement,
currently zero. This isn’t the right way to encode a displacement of
zero (unless you <em>want</em> a larger instruction). That’s because the
displacement will be filled in later by the linker. The compiler adds
a <em>relocation entry</em> to the object file so that the linker knows how
to do this.</p>

<p>On platforms that <a href="/blog/2016/11/17/">use ELF</a> we can inspect these relocations with
<code class="language-plaintext highlighter-rouge">readelf</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rela.text' at offset 0x270 contains 1 entries:
  Offset          Info           Type       Sym. Value
00000000000b  000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4
</code></pre></div></div>

<p>The relocation type is <code class="language-plaintext highlighter-rouge">R_X86_64_PC32</code>. In the <a href="http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_amd64.pdf">AMD64 Architecture
Processor Supplement</a>, this is defined as “S + A - P”.</p>

<ul>
  <li>
    <p>S: Represents the value of the symbol whose index resides in the
relocation entry.</p>
  </li>
  <li>
    <p>A: Represents the addend used to compute the value of the
relocatable field.</p>
  </li>
  <li>
    <p>P: Represents the place of the storage unit being relocated.</p>
  </li>
</ul>

<p>The symbol, S, is <code class="language-plaintext highlighter-rouge">.rodata</code> — the final address for this object file’s
portion of <code class="language-plaintext highlighter-rouge">.rodata</code> (where <code class="language-plaintext highlighter-rouge">values</code> resides). The addend, A, is <code class="language-plaintext highlighter-rouge">-4</code>
since the instruction pointer points at the <em>next</em> instruction. That
is, this will be relative to four bytes after the relocation offset.
Finally, the address of the relocation, P, is the address of the last four
bytes of the <code class="language-plaintext highlighter-rouge">lea</code> instruction. These values are all known at
link-time, so no run-time support is necessary.</p>

<p>Being “S - P” (overall), this will be the displacement between these
two addresses: the 32-bit value is relative. It’s relocatable so long
as these two parts of the binary (code and data) maintain a fixed
distance from each other. The binary is relocated as a whole, so this
assumption holds.</p>

<h3 id="32-bit-relocation">32-bit relocation</h3>

<p>Since RIP-relative addressing wasn’t introduced until x86-64, how did
this all work on x86? Again, let’s just see what the compiler does.
Add the <code class="language-plaintext highlighter-rouge">-m32</code> flag for a 32-bit target, and <code class="language-plaintext highlighter-rouge">-fomit-frame-pointer</code> to
make it simpler for explanatory purposes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 &lt;get_value&gt;:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   d9 ee                   fldz
   6:   e8 fc ff ff ff          call   7 &lt;get_value+0x7&gt;
   b:   81 c1 02 00 00 00       add    ecx,0x2
  11:   83 f8 03                cmp    eax,0x3
  14:   77 09                   ja     1f &lt;get_value+0x1f&gt;
  16:   dd d8                   fstp   st(0)
  18:   d9 84 81 00 00 00 00    fld    DWORD PTR [ecx+eax*4+0x0]
  1f:   c3                      ret

Disassembly of section .text.__x86.get_pc_thunk.cx:

00000000 &lt;__x86.get_pc_thunk.cx&gt;:
   0:   8b 0c 24                mov    ecx,DWORD PTR [esp]
   3:   c3                      ret
</code></pre></div></div>

<p>Hmm, this one includes an extra function.</p>

<ol>
  <li>
    <p>In this calling convention, arguments are passed on the stack. The
first instruction loads the argument, <code class="language-plaintext highlighter-rouge">x</code>, into <code class="language-plaintext highlighter-rouge">eax</code>.</p>
  </li>
  <li>
<p>The <code class="language-plaintext highlighter-rouge">fldz</code> instruction clears the x87 floating point return
register, just like clearing <code class="language-plaintext highlighter-rouge">xmm0</code> in the x86-64 version.</p>
  </li>
  <li>
    <p>Next it calls <code class="language-plaintext highlighter-rouge">__x86.get_pc_thunk.cx</code>. The call pushes the
instruction pointer, <code class="language-plaintext highlighter-rouge">eip</code>, onto the stack. This function reads
that value off the stack into <code class="language-plaintext highlighter-rouge">ecx</code> and returns. In other words,
calling this function copies <code class="language-plaintext highlighter-rouge">eip</code> into <code class="language-plaintext highlighter-rouge">ecx</code>. It’s setting up to
load data at an address relative to the code. Notice the function
name starts with two underscores — a name reserved exactly for
these sorts of implementation purposes.</p>
  </li>
  <li>
    <p>Next a 32-bit displacement is added to <code class="language-plaintext highlighter-rouge">ecx</code>. In this case it’s
<code class="language-plaintext highlighter-rouge">2</code>, but, like before, this is actually going to be filled in later by
the linker.</p>
  </li>
  <li>
    <p>Then it’s just like before: a branch to optionally load a value.
The floating point load (<code class="language-plaintext highlighter-rouge">fld</code>) is another relocation.</p>
  </li>
</ol>

<p>Let’s look at the relocations. There are three this time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
 Offset     Info    Type        Sym.Value  Sym. Name
00000007  00000e02 R_386_PC32    00000000   __x86.get_pc_thunk.cx
0000000d  00000f0a R_386_GOTPC   00000000   _GLOBAL_OFFSET_TABLE_
0000001b  00000709 R_386_GOTOFF  00000000   .rodata
</code></pre></div></div>

<p>The first relocation is the call-site for the thunk. The thunk has
external linkage and may be merged with a matching thunk in another
object file, and so may be relocated. (Clang inlines its thunk.) Calls
are relative, so its type is <code class="language-plaintext highlighter-rouge">R_386_PC32</code>: a code-relative
displacement just like on x86-64.</p>

<p>The next is of type <code class="language-plaintext highlighter-rouge">R_386_GOTPC</code> and sets the second operand in that
<code class="language-plaintext highlighter-rouge">add ecx</code>. It’s defined as “GOT + A - P” where “GOT” is the address of
the Global Offset Table — a table of addresses of the binary’s
relocated objects. Since <code class="language-plaintext highlighter-rouge">values</code> is static, the GOT won’t actually
hold an address for it, but the relative address of the GOT itself
will be useful.</p>

<p>The final relocation is of type <code class="language-plaintext highlighter-rouge">R_386_GOTOFF</code>. This is defined as
“S + A - GOT”. Another displacement between two addresses. This is the
displacement in the load, <code class="language-plaintext highlighter-rouge">fld</code>. Ultimately the load adds these last
two relocations together, canceling the GOT:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  (GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P
</code></pre></div></div>

<p>So the GOT isn’t relevant in this case. It’s just a mechanism for
constructing a custom relocation type.</p>

<h3 id="branch-optimization">Branch optimization</h3>

<p>Notice in the x86 version the thunk is called before checking the
argument. What if <code class="language-plaintext highlighter-rouge">x</code> is most likely to be out of bounds of
the array, and the function usually returns zero? That means it’s
usually wasting its time calling the thunk. Without profile-guided
optimization the compiler probably won’t know this.</p>

<p>The typical way to provide such a compiler hint is with a pair of
macros, <code class="language-plaintext highlighter-rouge">likely()</code> and <code class="language-plaintext highlighter-rouge">unlikely()</code>. With GCC and Clang, these would
be defined to use <code class="language-plaintext highlighter-rouge">__builtin_expect</code>. Compilers without this sort of
feature would have macros that do nothing instead. So I gave it a
shot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define likely(x)    __builtin_expect((x),1)
#define unlikely(x)  __builtin_expect((x),0)
</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">unlikely</span><span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately this makes no difference even in the latest version of
GCC. In Clang it changes branch fall-through (for <a href="http://www.agner.org/optimize/microarchitecture.pdf">static branch
prediction</a>), but still always calls the thunk. It seems
compilers <a href="https://ewontfix.com/18/">have difficulty</a> with <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54232">optimizing relocatable
code</a> on x86.</p>

<h3 id="x86-64-isnt-just-about-more-memory">x86-64 isn’t just about more memory</h3>

<p>It’s commonly understood that the advantage of 64-bit versus 32-bit
systems is processes having access to more than 4GB of memory. But as
this shows, there’s more to it than that. Even programs that don’t
need that much memory can really benefit from newer features like
RIP-relative addressing.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Some Performance Advantages of Lexical Scope</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/22/"/>
    <id>urn:uuid:21bc4afa-caa8-37ed-a912-a35f35d0e432</id>
    <updated>2016-12-22T02:33:36Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="optimization"/><category term="compsci"/>
    <content type="html">
      <![CDATA[<p>I recently had a discussion with <a href="http://ergoemacs.org/">Xah Lee</a> about lexical scope in
Emacs Lisp. The topic was why <code class="language-plaintext highlighter-rouge">lexical-binding</code> exists at a file-level
when there was already <code class="language-plaintext highlighter-rouge">lexical-let</code> (from <code class="language-plaintext highlighter-rouge">cl-lib</code>), prompted by my
previous article on <a href="/blog/2016/12/11/">JIT byte-code compilation</a>. The specific
context is Emacs Lisp, but these concepts apply to language design in
general.</p>

<p>Until Emacs 24.1 (June 2012), Elisp only had dynamically scoped
variables — a feature, mostly by accident, common to old lisp
dialects. While dynamic scope has some selective uses, it’s widely
regarded as a mistake for local variables, and virtually no other
languages have adopted it.</p>

<p>Way back in 1993, Dave Gillespie’s deviously clever <code class="language-plaintext highlighter-rouge">lexical-let</code>
macro <a href="http://git.savannah.gnu.org/cgit/emacs.git/commit/?h=fcd73769&amp;id=fcd737693e8e320acd70f91ec8e0728563244805">was committed</a> to the <code class="language-plaintext highlighter-rouge">cl</code> package, providing a rudimentary
form of opt-in lexical scope. The macro walks its body replacing local
variable names with guaranteed-unique gensym names: the exact same
technique used in macros to create “hygienic” bindings that aren’t
visible to the macro body. It essentially “fakes” lexical scope within
Elisp’s dynamic scope by preventing variable name collisions.</p>

<p>For example, here’s one of the consequences of dynamic scope.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">inner</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">setq</span> <span class="nv">v</span> <span class="ss">:inner</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">outer</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">v</span> <span class="ss">:outer</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">inner</span><span class="p">)</span>
    <span class="nv">v</span><span class="p">))</span>

<span class="p">(</span><span class="nv">outer</span><span class="p">)</span>
<span class="c1">;; =&gt; :inner</span>
</code></pre></div></div>

<p>The “local” variable <code class="language-plaintext highlighter-rouge">v</code> in <code class="language-plaintext highlighter-rouge">outer</code> is visible to its callee, <code class="language-plaintext highlighter-rouge">inner</code>,
which can access and manipulate it. The meaning of the <em>free variable</em>
<code class="language-plaintext highlighter-rouge">v</code> in <code class="language-plaintext highlighter-rouge">inner</code> depends entirely on the run-time call stack. It might
be a global variable, or it might be a local variable for a caller,
direct or indirect.</p>

<p>Using <code class="language-plaintext highlighter-rouge">lexical-let</code> deconflicts these names, giving the effect of
lexical scope.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defvar</span> <span class="nv">v</span><span class="p">)</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">lexical-outer</span> <span class="p">()</span>
  <span class="p">(</span><span class="nv">lexical-let</span> <span class="p">((</span><span class="nv">v</span> <span class="ss">:outer</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">inner</span><span class="p">)</span>
    <span class="nv">v</span><span class="p">))</span>

<span class="p">(</span><span class="nv">lexical-outer</span><span class="p">)</span>
<span class="c1">;; =&gt; :outer</span>
</code></pre></div></div>

<p>But there’s more to lexical scope than this. Closures only make sense
in the context of lexical scope, and the most useful feature of
<code class="language-plaintext highlighter-rouge">lexical-let</code> is that lambda expressions evaluate to closures. The
macro implements this using a technique called <a href="https://en.wikipedia.org/wiki/Lambda_lifting"><em>closure
conversion</em></a>. Additional parameters are added to the original
lambda function, one for each lexical variable (and not just each
closed-over variable), and the whole thing is wrapped in <em>another</em>
lambda function that invokes the original lambda function with the
additional parameters filled with the closed-over variables — yes, the
variables (i.e. the symbols) themselves, <em>not</em> just their values
(i.e. pass-by-reference). The last point means different closures can
properly close over the same variables, and they can bind new values.</p>

<p>To roughly illustrate how this works, the first lambda expression
below, which closes over the lexical variables <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code>, would be
converted into the latter by <code class="language-plaintext highlighter-rouge">lexical-let</code>. The <code class="language-plaintext highlighter-rouge">#:</code> is Elisp’s syntax
for uninterned variables. So <code class="language-plaintext highlighter-rouge">#:x</code> is <em>a</em> symbol <code class="language-plaintext highlighter-rouge">x</code>, but not <em>the</em>
symbol <code class="language-plaintext highlighter-rouge">x</code> (see <code class="language-plaintext highlighter-rouge">print-gensym</code>).</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; Before conversion:</span>
<span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">+</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">))</span>

<span class="c1">;; After conversion:</span>
<span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="k">&amp;rest</span> <span class="nv">args</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">apply</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span> <span class="nv">y</span><span class="p">)</span>
           <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">symbol-value</span> <span class="nv">x</span><span class="p">)</span>
              <span class="p">(</span><span class="nb">symbol-value</span> <span class="nv">y</span><span class="p">)))</span>
         <span class="o">'</span><span class="ss">#:x</span> <span class="o">'</span><span class="ss">#:y</span> <span class="nv">args</span><span class="p">))</span>
</code></pre></div></div>

<p>I’ve said on multiple occasions that <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> has
significant advantages, both in performance and static analysis, and
so it should be used for all future Elisp code. The only reason it’s
not the default is because it breaks some old (badly written) code.
However, <strong><code class="language-plaintext highlighter-rouge">lexical-let</code> doesn’t realize any of these advantages</strong>! In
fact, it has worse performance than straightforward dynamic scope with
<code class="language-plaintext highlighter-rouge">let</code>.</p>

<ol>
  <li>
    <p>New symbol objects are allocated and initialized (<code class="language-plaintext highlighter-rouge">make-symbol</code>) on
each run-time evaluation, one per lexical variable.</p>
  </li>
  <li>
    <p>Since it’s just faking it, <code class="language-plaintext highlighter-rouge">lexical-let</code> still uses dynamic
bindings, which are more expensive than lexical bindings. It varies
depending on the C compiler that built Emacs, but dynamic variable
accesses (opcode <code class="language-plaintext highlighter-rouge">varref</code>) take around 30% longer than lexical
variable accesses (opcode <code class="language-plaintext highlighter-rouge">stack-ref</code>). Assignment is far worse,
where dynamic variable assignment (<code class="language-plaintext highlighter-rouge">varset</code>) takes 650% longer than
lexical variable assignment (<code class="language-plaintext highlighter-rouge">stack-set</code>). How I measured all this
is a topic for another article.</p>
  </li>
  <li>
    <p>The “lexical” variables are accessed using <code class="language-plaintext highlighter-rouge">symbol-value</code>, a full
function call, so they’re even slower than normal dynamic
variables.</p>
  </li>
  <li>
    <p>Because converted lambda expressions are constructed dynamically at
run-time within the body of <code class="language-plaintext highlighter-rouge">lexical-let</code>, the resulting closure is
only partially byte-compiled even if the code as a whole has been
byte-compiled. In contrast, <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> closures are fully
compiled. How this works is worth <a href="/blog/2017/12/14/">its own article</a>.</p>
  </li>
  <li>
    <p>Converted lambda expressions include the additional internal
function invocation, making them slower.</p>
  </li>
</ol>

<p>While <code class="language-plaintext highlighter-rouge">lexical-let</code> is clever, and was occasionally useful prior to
Emacs 24, it comes at a hefty performance cost when evaluated
frequently. There’s no reason to use it anymore.</p>
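
<p>To see the overhead firsthand, here’s a rough sketch of a comparison
(absolute numbers vary by machine and Emacs build, and
<code class="language-plaintext highlighter-rouge">lexical-let</code> requires the deprecated <code class="language-plaintext highlighter-rouge">cl</code> library, so this only
runs on an Emacs old enough to still ship it):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(require 'cl)  ; provides lexical-let

;; Same work in both functions: one plain dynamic `let',
;; one `lexical-let'.
(defun dynamic-sum ()
  (let ((acc 0))
    (dotimes (i 1000 acc)
      (setq acc (+ acc i)))))

(defun lexical-let-sum ()
  (lexical-let ((acc 0))
    (dotimes (i 1000 acc)
      (setq acc (+ acc i)))))

(byte-compile 'dynamic-sum)
(byte-compile 'lexical-let-sum)

;; Compare the elapsed times; the lexical-let version loses.
(benchmark-run 1000 (dynamic-sum))
(benchmark-run 1000 (lexical-let-sum))
</code></pre></div></div>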

<h3 id="constraints-on-code-generation">Constraints on code generation</h3>

<p>Another reason to be wary of dynamic scope is that it puts needless
constraints on the compiler, preventing a number of important
optimization opportunities. For example, consider the following
function, <code class="language-plaintext highlighter-rouge">bar</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">bar</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">x</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">y</span> <span class="mi">2</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">foo</span><span class="p">)</span>
    <span class="p">(</span><span class="nb">+</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">)))</span>
</code></pre></div></div>

<p>Byte-compile this function under dynamic scope (<code class="language-plaintext highlighter-rouge">lexical-binding:
nil</code>) and <a href="/blog/2014/01/04/">disassemble it</a> to see what it looks like.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">byte-compile</span> <span class="nf">#'</span><span class="nv">bar</span><span class="p">)</span>
<span class="p">(</span><span class="nb">disassemble</span> <span class="nf">#'</span><span class="nv">bar</span><span class="p">)</span>
</code></pre></div></div>

<p>That pops up a buffer with the disassembly listing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  1
1       constant  2
2       varbind   y
3       varbind   x
4       constant  foo
5       call      0
6       discard
7       varref    x
8       varref    y
9       plus
10      unbind    2
11      return
</code></pre></div></div>

<p>It’s 12 instructions, 5 of which deal with dynamic bindings. The
byte-compiler doesn’t always produce optimal byte-code, but this just
so happens to be <em>nearly</em> optimal byte-code. The <code class="language-plaintext highlighter-rouge">discard</code> (a very
fast instruction) isn’t necessary, but otherwise no more compiler
smarts can improve on this. Since the variables <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are
visible to <code class="language-plaintext highlighter-rouge">foo</code>, they must be bound before the call and <a href="/blog/2016/07/25/">loaded after
the call</a>. While generally this function will return 3, the
compiler cannot assume so since it ultimately depends on the behavior
of <code class="language-plaintext highlighter-rouge">foo</code>. Its hands are tied.</p>

<p>Compare this to the lexical scope version (<code class="language-plaintext highlighter-rouge">lexical-binding: t</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  1
1       constant  2
2       constant  foo
3       call      0
4       discard
5       stack-ref 1
6       stack-ref 1
7       plus
8       return
</code></pre></div></div>

<p>It’s only 9 instructions, none of which are expensive dynamic variable
instructions. And this isn’t even close to the optimal byte-code. In
fact, as of Emacs 25.1 the byte-compiler often doesn’t produce the
optimal byte-code for lexical scope code and still needs some work.
<strong>Despite not firing on all cylinders, lexical scope still manages to
beat dynamic scope in performance benchmarks.</strong></p>

<p>Here’s the optimal byte-code, should the byte-compiler become smarter
someday:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  foo
1       call      0
2       constant  3
3       return
</code></pre></div></div>

<p>It’s down to 4 instructions due to computing the math operation at
compile time. Emacs’ byte-compiler only has rudimentary constant
folding, so it doesn’t notice that <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are constants and
misses this optimization. I speculate this is due to its roots
compiling under dynamic scope. Since <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are no longer exposed
to <code class="language-plaintext highlighter-rouge">foo</code>, the compiler has the opportunity to optimize them out of
existence. I haven’t measured it, but I would expect this to be
significantly faster than the dynamic scope version of this function.</p>

<h3 id="optional-dynamic-scope">Optional dynamic scope</h3>

<p>You might be thinking, “What if I really <em>do</em> want <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> to be
dynamically bound for <code class="language-plaintext highlighter-rouge">foo</code>?” This is often useful. Many of Emacs’ own
functions are designed to have certain variables dynamically bound
around them. For example, the print family of functions use the global
variable <code class="language-plaintext highlighter-rouge">standard-output</code> to determine where to send output by
default.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">standard-output</span> <span class="p">(</span><span class="nv">current-buffer</span><span class="p">)))</span>
  <span class="p">(</span><span class="nb">princ</span> <span class="s">"value = "</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">prin1</span> <span class="nv">value</span><span class="p">))</span>
</code></pre></div></div>

<p>Have no fear: <strong>With <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> you can have your cake and
eat it too.</strong> Variables declared with <code class="language-plaintext highlighter-rouge">defvar</code>, <code class="language-plaintext highlighter-rouge">defconst</code>, or
<code class="language-plaintext highlighter-rouge">defvaralias</code> are marked as “special” with an internal bit flag
(<code class="language-plaintext highlighter-rouge">declared_special</code> in C). When the compiler detects one of these
variables (<code class="language-plaintext highlighter-rouge">special-variable-p</code>), it uses a classical dynamic binding.</p>

<p>Declaring both <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> as special restores the original semantics,
reverting <code class="language-plaintext highlighter-rouge">bar</code> back to its old byte-code definition (next time it’s
compiled, that is). But it would be poor form to mark <code class="language-plaintext highlighter-rouge">x</code> or <code class="language-plaintext highlighter-rouge">y</code> as
special: You’d de-optimize all code (compiled <em>after</em> the declaration)
anywhere in Emacs that uses these names. As a package author, only do
this with the namespace-prefixed variables that belong to you.</p>

<p>The only way to unmark a special variable is with the undocumented
function <code class="language-plaintext highlighter-rouge">internal-make-var-non-special</code>. I expected <code class="language-plaintext highlighter-rouge">makunbound</code> to
do this, but as of Emacs 25.1 it does not. This could possibly be
considered a bug.</p>
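
<p>Here’s the special flag in action. The variable name below is
illustrative, and <code class="language-plaintext highlighter-rouge">internal-make-var-non-special</code> is, again,
undocumented and may change between Emacs versions:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; defvar marks the symbol as special:
(defvar my-pkg-limit 10)
(special-variable-p 'my-pkg-limit)
;; =&gt; t

;; The undocumented escape hatch clears the flag:
(internal-make-var-non-special 'my-pkg-limit)
(special-variable-p 'my-pkg-limit)
;; =&gt; nil
</code></pre></div></div>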

<h3 id="accidental-closures">Accidental closures</h3>

<p>I’ve said there are absolutely no advantages to <code class="language-plaintext highlighter-rouge">lexical-binding: nil</code>.
It’s only the default for the sake of backwards-compatibility. However,
there <em>is</em> one case where <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> introduces a subtle issue
that would otherwise not exist. Take this code for example (and
never mind <code class="language-plaintext highlighter-rouge">prin1-to-string</code> for a moment):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; -*- lexical-binding: t; -*-</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">function-as-string</span> <span class="p">()</span>
  <span class="p">(</span><span class="nv">with-temp-buffer</span>
    <span class="p">(</span><span class="nb">prin1</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="ss">:example</span><span class="p">)</span> <span class="p">(</span><span class="nv">current-buffer</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)))</span>
</code></pre></div></div>

<p>This creates and serializes a closure, which is <a href="/blog/2013/12/30/">one of Elisp’s unique
features</a>. It doesn’t close over any variables, so it should be
pretty simple. However, this function will only work correctly under
<code class="language-plaintext highlighter-rouge">lexical-binding: t</code> when byte-compiled.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">function-as-string</span><span class="p">)</span>
<span class="c1">;; =&gt; "(closure ((temp-buffer . #&lt;buffer  *temp*&gt;) t) nil :example)"</span>
</code></pre></div></div>

<p>The interpreter doesn’t analyze the closure, so it just closes over
everything. This includes the hidden variable <code class="language-plaintext highlighter-rouge">temp-buffer</code> created by
the <code class="language-plaintext highlighter-rouge">with-temp-buffer</code> macro, resulting in an abstraction leak.
Buffers aren’t readable, so this will signal an error if an attempt is
made to read this function back into an s-expression. The
byte-compiler fixes this by noticing <code class="language-plaintext highlighter-rouge">temp-buffer</code> isn’t actually
closed over and so doesn’t include it in the closure, making it work
correctly.</p>

<p>Under <code class="language-plaintext highlighter-rouge">lexical-binding: nil</code> it works correctly either way:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">function-as-string</span><span class="p">)</span>
<span class="c1">;; -&gt; "(lambda nil :example)"</span>
</code></pre></div></div>

<p>This may seem contrived — it’s certainly unlikely — but <a href="https://github.com/jwiegley/emacs-async/issues/17">it has come
up in practice</a>. Still, it’s no reason to avoid <code class="language-plaintext highlighter-rouge">lexical-binding: t</code>.</p>

<h3 id="use-lexical-scope-in-all-new-code">Use lexical scope in all new code</h3>

<p>As I’ve said again and again, always use <code class="language-plaintext highlighter-rouge">lexical-binding: t</code>. Use
dynamic variables judiciously. And <code class="language-plaintext highlighter-rouge">lexical-let</code> is no replacement. It
has virtually none of the benefits, performs <em>worse</em>, and it only
applies to <code class="language-plaintext highlighter-rouge">let</code>, not any of the other places bindings are created:
function parameters, <code class="language-plaintext highlighter-rouge">dotimes</code>, <code class="language-plaintext highlighter-rouge">dolist</code>, and <code class="language-plaintext highlighter-rouge">condition-case</code>.</p>
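
<p>For instance, under <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> function parameters
themselves are lexical, so a plain <code class="language-plaintext highlighter-rouge">defun</code> can return a closure over
its argument, something <code class="language-plaintext highlighter-rouge">lexical-let</code> never covered:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; -*- lexical-binding: t; -*-

;; The lambda closes over the parameter n.
(defun make-adder (n)
  (lambda (x) (+ x n)))

(funcall (make-adder 2) 3)
;; =&gt; 5
</code></pre></div></div>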

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Faster Elfeed Search Through JIT Byte-code Compilation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/11/"/>
    <id>urn:uuid:47002cc3-816a-3cb8-b462-327364e3f943</id>
    <updated>2016-12-11T23:16:42Z</updated>
    <category term="emacs"/><category term="elfeed"/><category term="optimization"/><category term="elisp"/>
    <content type="html">
      <![CDATA[<p>Today I pushed an update for <a href="https://github.com/skeeto/elfeed">Elfeed</a> that doubles the speed
of the search filter in the worst case. This is the user-entered
expression that dynamically narrows the entry listing to a subset that
meets certain criteria: published after a particular date,
with/without particular tags, and matching/non-matching zero or more
regular expressions. The filter is live, applied to the database as
the expression is edited, so it’s important for usability that this
search completes under a threshold that the user might notice.</p>

<p><img src="/img/elfeed/filter.gif" alt="" /></p>

<p>The typical workaround for these kinds of interfaces is to make
filtering/searching asynchronous. It’s possible to do this well, but
it’s usually a terrible, broken design. If the user acts upon the
asynchronous results — say, by typing the query and hitting enter to
choose the current or expected top result — then the final behavior is
non-deterministic, a race between the user’s typing speed and the
asynchronous search. Elfeed will keep its synchronous live search.</p>

<p>For anyone not familiar with Elfeed, here’s a filter that finds all
entries from within the past year tagged “youtube” (<code class="language-plaintext highlighter-rouge">+youtube</code>) that
mention Linux or Linus (<code class="language-plaintext highlighter-rouge">linu[xs]</code>), but aren’t tagged “bsd” (<code class="language-plaintext highlighter-rouge">-bsd</code>),
limited to the most recent 15 entries (<code class="language-plaintext highlighter-rouge">#15</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@1-year-old +youtube linu[xs] -bsd #15
</code></pre></div></div>

<p>The database is primarily indexed over publication date, so filters on
publication dates are the most efficient filters. Entries are visited
in order starting with the most recently published, and the search can
bail out early once it crosses the filter threshold. Time-oriented
filters have been encouraged as the solution to keep the live search
feeling lively.</p>
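
<p>The bailout itself is simple. Roughly, in illustrative terms (these
accessor names are made up for the sketch, not Elfeed’s real API):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(require 'cl-lib)

;; Entries are visited newest-first, so the first entry older than
;; the cut-off ends the scan: nothing after it can match.
(defun my-filter-since (entries after-time)
  (cl-loop for entry in entries
           until (&lt; (my-entry-date entry) after-time)
           collect entry))
</code></pre></div></div>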

<h3 id="filtering-overview">Filtering Overview</h3>

<p>The first step in filtering is parsing the filter text entered by the
user. This string is broken into its components using the
<code class="language-plaintext highlighter-rouge">elfeed-search-parse-filter</code> function. Date filter components are
converted into a unix epoch interval, tags are interned into symbols,
regular expressions are gathered up as strings, and the entry limit is
parsed into a plain integer. Absence of a filter component is
indicated by nil.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">elfeed-search-parse-filter</span> <span class="s">"@1-year-old +youtube linu[xs] -bsd #15"</span><span class="p">)</span>
<span class="c1">;; =&gt; (31557600.0 (youtube) (bsd) ("linu[xs]") nil 15)</span>
</code></pre></div></div>

<p>Previously, the next step was to apply the <code class="language-plaintext highlighter-rouge">elfeed-search-filter</code>
function with this structured filter representation to the database.
Except for special early-bailout situations, it works left-to-right
across the filter, checking each condition against each entry. This is
analogous to an interpreter, with the filter being a program.</p>

<p>Thinking about it that way, what if the filter was instead compiled
into an Emacs byte-code function and executed directly by the Emacs
virtual machine? That’s what this latest update does.</p>

<h3 id="benchmarks">Benchmarks</h3>

<p>With six different filter components, the actual filtering routine is
a bit too complicated for an article, so I’ll set up a simpler, but
roughly equivalent, scenario. With a reasonable cut-off date, the
filter was already sufficiently fast, so for benchmarking I’ll focus
on the worst case: no early bailout opportunities. An entry will be
just a list of tags (symbols), and the filter will have to test every
entry.</p>

<p>My <a href="/blog/2016/08/12/">real-world Elfeed database</a> currently has 46,772 entries with
36 distinct tags. For my benchmark I’ll round this up to a nice
100,000 entries, and use 26 distinct tags (A–Z), which maps neatly
onto the alphabet and more closely reflects the number of tags I still
care about.</p>

<p>First, here’s <code class="language-plaintext highlighter-rouge">make-random-entry</code> to generate a random list of 1–5
tags (i.e. an entry). The <code class="language-plaintext highlighter-rouge">state</code> parameter is the random state,
allowing for deterministic benchmarks on a randomly-generated
database.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">cl-defun</span> <span class="nv">make-random-entry</span> <span class="p">(</span><span class="k">&amp;key</span> <span class="nv">state</span> <span class="p">(</span><span class="nb">min</span> <span class="mi">1</span><span class="p">)</span> <span class="p">(</span><span class="nb">max</span> <span class="mi">5</span><span class="p">))</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">repeat</span> <span class="p">(</span><span class="nb">+</span> <span class="nb">min</span> <span class="p">(</span><span class="nv">cl-random</span> <span class="p">(</span><span class="nb">1+</span> <span class="p">(</span><span class="nb">-</span> <span class="nb">max</span> <span class="nb">min</span><span class="p">))</span> <span class="nv">state</span><span class="p">))</span>
           <span class="nv">for</span> <span class="nv">letter</span> <span class="nb">=</span> <span class="p">(</span><span class="nb">+</span> <span class="nv">?A</span> <span class="p">(</span><span class="nv">cl-random</span> <span class="mi">26</span> <span class="nv">state</span><span class="p">))</span>
           <span class="nv">collect</span> <span class="p">(</span><span class="nb">intern</span> <span class="p">(</span><span class="nb">format</span> <span class="s">"%c"</span> <span class="nv">letter</span><span class="p">))))</span>
</code></pre></div></div>

<p>The database is just a big list of entries. In Elfeed this is actually
an AVL tree. Without dates, the order doesn’t matter.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">cl-defun</span> <span class="nv">make-random-database</span> <span class="p">(</span><span class="k">&amp;key</span> <span class="nv">state</span> <span class="p">(</span><span class="nb">count</span> <span class="mi">100000</span><span class="p">))</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">repeat</span> <span class="nb">count</span> <span class="nv">collect</span> <span class="p">(</span><span class="nv">make-random-entry</span> <span class="ss">:state</span> <span class="nv">state</span><span class="p">)))</span>
</code></pre></div></div>

<p>Here’s <a href="/blog/2009/05/28/">my old time macro</a>. An important change since the
original is to call <code class="language-plaintext highlighter-rouge">garbage-collect</code> before starting the clock,
eliminating bad samples from unlucky garbage collection events.
Depending on what you want to measure, it may even be worth disabling
garbage collection during the measurement by setting
<code class="language-plaintext highlighter-rouge">gc-cons-threshold</code> to a high value.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defmacro measure-time (&amp;rest body)
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    `(progn
       (garbage-collect)  ; run before each timing, not at expansion time
       (let ((,start (float-time)))
         ,@body
         (- (float-time) ,start)))))
</code></pre></div></div>
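
<p>When suppressing garbage collection entirely is the right call, the
macro above can be wrapped like so (a sketch; the <code class="language-plaintext highlighter-rouge">let</code> binding
restores the old threshold when it unwinds):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defmacro measure-time-no-gc (&amp;rest body)
  (declare (indent defun))
  ;; Raising the allocation threshold effectively disables GC
  ;; for the duration of the measurement.
  `(let ((gc-cons-threshold most-positive-fixnum))
     (measure-time ,@body)))
</code></pre></div></div>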

<p>Finally, the benchmark harness. It uses a hard-coded seed to generate
the same pseudo-random database. The test runs a filter
function, <code class="language-plaintext highlighter-rouge">f</code>, 100 times searching for the same 6 tags, and the timing
results are averaged.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">cl-defun</span> <span class="nv">benchmark</span> <span class="p">(</span><span class="nv">f</span> <span class="k">&amp;optional</span> <span class="p">(</span><span class="nv">n</span> <span class="mi">100</span><span class="p">)</span> <span class="p">(</span><span class="nv">tags</span> <span class="o">'</span><span class="p">(</span><span class="nv">A</span> <span class="nv">B</span> <span class="nv">C</span> <span class="nv">D</span> <span class="nv">E</span> <span class="nv">F</span><span class="p">)))</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">state</span> <span class="p">(</span><span class="nv">copy-sequence</span> <span class="nv">[cl-random-state-tag</span> <span class="mi">-1</span> <span class="mi">30</span> <span class="nv">267466518]</span><span class="p">))</span>
         <span class="p">(</span><span class="nv">db</span> <span class="p">(</span><span class="nv">make-random-database</span> <span class="ss">:state</span> <span class="nv">state</span><span class="p">)))</span>
    <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">repeat</span> <span class="nv">n</span>
             <span class="nv">sum</span> <span class="p">(</span><span class="nv">measure-time</span>
                   <span class="p">(</span><span class="nb">funcall</span> <span class="nv">f</span> <span class="nv">db</span> <span class="nv">tags</span><span class="p">))</span>
             <span class="nv">into</span> <span class="nv">total</span>
             <span class="nv">finally</span> <span class="nb">return</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">total</span> <span class="p">(</span><span class="nb">float</span> <span class="nv">n</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The baseline will be <code class="language-plaintext highlighter-rouge">memq</code> (test for membership using identity,
<code class="language-plaintext highlighter-rouge">eq</code>). There are two lists of tags to compare: the list that is the
entry, and the list from the filter. This requires a nested loop for
each entry, one explicit (<code class="language-plaintext highlighter-rouge">cl-loop</code>) and one implicit (<code class="language-plaintext highlighter-rouge">memq</code>), both
with early bailout.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">memq-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span> <span class="nb">count</span>
           <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                    <span class="nb">when</span> <span class="p">(</span><span class="nv">memq</span> <span class="nv">tag</span> <span class="nv">entry</span><span class="p">)</span>
                    <span class="nb">return</span> <span class="no">t</span><span class="p">)))</span>
</code></pre></div></div>

<p>After byte-compiling everything and running the benchmark on my
laptop, I get:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">memq-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.041 seconds</span>
</code></pre></div></div>

<p>That’s actually not too bad. One of the advantages of this definition
is that there are no function calls. The <code class="language-plaintext highlighter-rouge">memq</code> built-in function has
its own opcode (62), and the rest of the definition is special forms
and macros expanding to special forms (<code class="language-plaintext highlighter-rouge">cl-loop</code>). It’s exactly the
thing I need to exploit to make filters faster.</p>

<p>As a sanity check, what would happen if I used <code class="language-plaintext highlighter-rouge">member</code> instead of
<code class="language-plaintext highlighter-rouge">memq</code>? In theory it should be slower because it uses <code class="language-plaintext highlighter-rouge">equal</code> for
tests instead of <code class="language-plaintext highlighter-rouge">eq</code>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">member-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span> <span class="nb">count</span>
           <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                    <span class="nb">when</span> <span class="p">(</span><span class="nb">member</span> <span class="nv">tag</span> <span class="nv">entry</span><span class="p">)</span>
                    <span class="nb">return</span> <span class="no">t</span><span class="p">)))</span>
</code></pre></div></div>

<p>It’s only slightly slower because <code class="language-plaintext highlighter-rouge">member</code>, <a href="/blog/2013/01/22/">like many other
built-ins</a>, also has an opcode (157). It’s just a tiny bit
more overhead.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">member-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.047 seconds</span>
</code></pre></div></div>

<p>To test function call overhead while still using the built-in (e.g.
written in C) <code class="language-plaintext highlighter-rouge">memq</code>, I’ll alias it so that the byte-code compiler is
forced to emit a function call.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defalias</span> <span class="ss">'memq-alias</span> <span class="ss">'memq</span><span class="p">)</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">memq-alias-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span> <span class="nb">count</span>
           <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                    <span class="nb">when</span> <span class="p">(</span><span class="nv">memq-alias</span> <span class="nv">tag</span> <span class="nv">entry</span><span class="p">)</span>
                    <span class="nb">return</span> <span class="no">t</span><span class="p">)))</span>
</code></pre></div></div>

<p>To verify that this is doing what I expect, I <code class="language-plaintext highlighter-rouge">M-x disassemble</code> the
function and inspect the byte-code disassembly. Here’s a simple
example.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">disassemble</span>
 <span class="p">(</span><span class="nv">byte-compile</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nb">list</span><span class="p">)</span> <span class="p">(</span><span class="nv">memq</span> <span class="ss">:foo</span> <span class="nb">list</span><span class="p">))))</span>
</code></pre></div></div>

<p>When compiled under lexical scope (<code class="language-plaintext highlighter-rouge">lexical-binding</code> is true), here’s
the disassembly. To understand what this means, see <a href="/blog/2014/01/04/"><em>Emacs Byte-code
Internals</em></a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  :foo
1       stack-ref 1
2       memq
3       return
</code></pre></div></div>

<p>Notice the <code class="language-plaintext highlighter-rouge">memq</code> instruction. Try using <code class="language-plaintext highlighter-rouge">memq-alias</code> instead:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">disassemble</span>
 <span class="p">(</span><span class="nv">byte-compile</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nb">list</span><span class="p">)</span> <span class="p">(</span><span class="nv">memq-alias</span> <span class="ss">:foo</span> <span class="nb">list</span><span class="p">))))</span>
</code></pre></div></div>

<p>Resulting in a function call:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  memq-alias
1       constant  :foo
2       stack-ref 2
3       call      2
4       return
</code></pre></div></div>

<p>And the benchmark:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">memq-alias-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.052 seconds</span>
</code></pre></div></div>

<p>So the function call adds about 27% overhead. This means it would be a
good idea to <strong>avoid calling functions in the filter</strong> if I can help
it. I should rely on these special opcodes.</p>

<p>Suppose <code class="language-plaintext highlighter-rouge">memq</code> was written in Emacs Lisp rather than C. How much would
that hurt performance? My version of <code class="language-plaintext highlighter-rouge">my-memq</code> below isn’t quite the
same since it returns t rather than the sublist, but it’s good enough
for this purpose. (I’m using <code class="language-plaintext highlighter-rouge">cl-loop</code> because writing early bailout
in plain Elisp without recursion is, in my opinion, ugly.)</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">my-memq</span> <span class="p">(</span><span class="nv">needle</span> <span class="nv">haystack</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">element</span> <span class="nv">in</span> <span class="nv">haystack</span>
           <span class="nb">when</span> <span class="p">(</span><span class="nb">eq</span> <span class="nv">needle</span> <span class="nv">element</span><span class="p">)</span>
           <span class="nb">return</span> <span class="no">t</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">my-memq-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span> <span class="nb">count</span>
           <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                    <span class="nb">when</span> <span class="p">(</span><span class="nv">my-memq</span> <span class="nv">tag</span> <span class="nv">entry</span><span class="p">)</span>
                    <span class="nb">return</span> <span class="no">t</span><span class="p">)))</span>
</code></pre></div></div>

<p>And the benchmark:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">my-memq-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.137 seconds</span>
</code></pre></div></div>

<p>Oof! It’s more than 3 times slower than the opcode. This means <strong>I
should use built-ins as much as possible</strong> in the filter.</p>

<h3 id="dynamic-vs-lexical-scope">Dynamic vs. lexical scope</h3>

<p>There’s one last thing to watch out for. Everything so far has been
compiled with lexical scope. You should really turn this on by default
for all new code that you write. It has three important advantages:</p>

<ol>
  <li>It allows the compiler to catch more mistakes.</li>
  <li>It eliminates a class of bugs caused by dynamic scope, where local
variables are exposed to manipulation by callees.</li>
  <li><a href="/blog/2016/12/22/">Lexical scope has better performance</a>.</li>
</ol>

<p>Here are all the benchmarks with the default dynamic scope:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">memq-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.065 seconds</span>

<span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">member-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.070 seconds</span>

<span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">memq-alias-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.074 seconds</span>

<span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">my-memq-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.256 seconds</span>
</code></pre></div></div>

<p>It costs half again to twice as much time across these benchmarks, for
no benefit. Under
dynamic scope, local variables use the <code class="language-plaintext highlighter-rouge">varref</code> opcode — a global
variable lookup — instead of the <code class="language-plaintext highlighter-rouge">stack-ref</code> opcode — a simple array
index.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">norm</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">*</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)))</span>
</code></pre></div></div>

<p>Under dynamic scope, this compiles to:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       varref    a
1       varref    b
2       diff
3       varref    a
4       varref    b
5       diff
6       mult
7       return
</code></pre></div></div>

<p>And under lexical scope (notice the variable names disappear):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       stack-ref 1
1       stack-ref 1
2       diff
3       stack-ref 2
4       stack-ref 2
5       diff
6       mult
7       return
</code></pre></div></div>

<h3 id="jit-compiled-filters">JIT-compiled filters</h3>

<p>So far I’ve been moving in the wrong direction, making things slower
rather than faster. How can I make it faster than the straight <code class="language-plaintext highlighter-rouge">memq</code>
version? By compiling the filter into byte-code.</p>

<p>I won’t write the byte-code directly, but instead generate Elisp code
and use the byte-code compiler on it. This is safer, will work
correctly in future versions of Emacs, and leverages the optimizations
performed by the byte-compiler. This sort of thing recently <a href="http://emacshorrors.com/posts/when-data-becomes-code.html">got a bad
rap on Emacs Horrors</a>, but I was happy to see that this
technique is already established.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">jit-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">memq-list</span> <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                             <span class="nv">collect</span> <span class="o">`</span><span class="p">(</span><span class="nv">memq</span> <span class="ss">',tag</span> <span class="nv">entry</span><span class="p">)))</span>
         <span class="p">(</span><span class="k">function</span> <span class="o">`</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">db</span><span class="p">)</span>
                      <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span>
                               <span class="nb">count</span> <span class="p">(</span><span class="nb">or</span> <span class="o">,@</span><span class="nv">memq-list</span><span class="p">))))</span>
         <span class="p">(</span><span class="nv">compiled</span> <span class="p">(</span><span class="nv">byte-compile</span> <span class="k">function</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">funcall</span> <span class="nv">compiled</span> <span class="nv">db</span><span class="p">)))</span>
</code></pre></div></div>

<p>It dynamically builds the code as an s-expression, runs that through
the byte-code compiler, executes it, and throws it away. It’s
“just-in-time,” though compiling to byte-code and not <a href="/blog/2015/03/19/">native
code</a>. For the benchmark tags of <code class="language-plaintext highlighter-rouge">(A B C D E F)</code>, this builds
the following:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">db</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span>
           <span class="nb">count</span> <span class="p">(</span><span class="nb">or</span> <span class="p">(</span><span class="nv">memq</span> <span class="ss">'A</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'B</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'C</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'D</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'E</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'F</span> <span class="nv">entry</span><span class="p">))))</span>
</code></pre></div></div>

<p>Due to its short-circuiting behavior, <code class="language-plaintext highlighter-rouge">or</code> is a special form, so this
function is just special forms and <code class="language-plaintext highlighter-rouge">memq</code> in its opcode form. It’s as
fast as Elisp can get.</p>

<p>Having s-expressions is a real strength for lisp, since the
alternative (in, say, JavaScript) would be to assemble the function by
concatenating code strings. By contrast, this looks a lot like a
regular lisp macro. Invoking the byte-code compiler does add some
overhead compared to the interpreted filter, but it’s insignificant.</p>

<p>How much faster is this?</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">jit-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.017 seconds</span>
</code></pre></div></div>

<p><strong>It’s more than twice as fast!</strong> The big gain here is through <em>loop
unrolling</em>. The outer loop has been unrolled into the <code class="language-plaintext highlighter-rouge">or</code> expression.
That section of byte-code looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  A
1       stack-ref 1
2       memq
3       goto-if-not-nil-else-pop 1
6       constant  B
7       stack-ref 1
8       memq
9       goto-if-not-nil-else-pop 1
12      constant  C
13      stack-ref 1
14      memq
15      goto-if-not-nil-else-pop 1
18      constant  D
19      stack-ref 1
20      memq
21      goto-if-not-nil-else-pop 1
24      constant  E
25      stack-ref 1
26      memq
27      goto-if-not-nil-else-pop 1
30      constant  F
31      stack-ref 1
32      memq
33:1    return
</code></pre></div></div>

<p>In Elfeed, filter compilation not only unrolls these loops, it
completely eliminates the overhead of unused filter components. Compared
to this benchmark, I’m seeing roughly matching gains in Elfeed’s worst
case. In Elfeed, I also bind <code class="language-plaintext highlighter-rouge">lexical-binding</code> around the
<code class="language-plaintext highlighter-rouge">byte-compile</code> call to force lexical scope, since otherwise it just
uses the buffer-local value (usually nil).</p>
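<p>That binding looks roughly like this (a sketch, not Elfeed’s exact
code):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; Sketch: force lexical scope for the generated function,
;; regardless of the buffer-local value of `lexical-binding'.
(let ((lexical-binding t))
  (byte-compile function))
</code></pre></div></div>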

<p>Filter compilation can be toggled on and off by setting
<code class="language-plaintext highlighter-rouge">elfeed-search-compile-filter</code>. If you’re up to date, try out live
filters with it both enabled and disabled. See if you can notice the
difference.</p>
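<p>For example, to disable filter compilation:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(setq elfeed-search-compile-filter nil)
</code></pre></div></div>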

<h3 id="result-summary">Result summary</h3>

<p>Here are the results in a table, all run with Emacs 24.4 on x86-64.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(ms)      memq      member    memq-alias my-memq   jit
lexical   41        47        52         137       17
dynamic   65        70        74         256       21
</code></pre></div></div>

<p>And the same benchmarks on AArch64 (Emacs 24.5, ARM Cortex-A53), where
I also occasionally use Elfeed, and where I have been very interested
in improving performance.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(ms)      memq      member    memq-alias my-memq   jit
lexical   170       235       242        614       79
dynamic   274       340       345        1130      92
</code></pre></div></div>

<p>And here’s how you can run the benchmarks for yourself, perhaps with
different parameters:</p>

<ul>
  <li><a href="/download/jit-bench.el">jit-bench.el</a></li>
</ul>

<p>The header explains how to run the benchmark in batch mode:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ emacs -Q -batch -f batch-byte-compile jit-bench.el
$ emacs -Q -batch -l jit-bench.elc -f benchmark-batch
</code></pre></div></div>

]]>
    </content>
  </entry>
  <entry>
    <title>Baking Data with Serialization</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/15/"/>
    <id>urn:uuid:365d1301-72b9-39d1-8023-20fb83e046ab</id>
    <updated>2016-11-15T05:27:53Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Suppose you want to bake binary data directly into a program’s
executable. It could be image pixel data (PNG, BMP, JPEG), a text
file, or some sort of complex data structure. Perhaps the purpose is
to build a single executable with no extraneous data files — easier to
install and manage, though harder to modify. Or maybe you’re lazy and
don’t want to worry about handling the various complications and
errors that arise when reading external data: Where to find it, and
what to do if you can’t find it or can’t read it. This article is
about two different approaches I’ve used a number of times for C
programs.</p>

<h3 id="the-linker-approach">The linker approach</h3>

<p>The simpler, less portable option is to have the linker do it. Both
the GNU linker and the <a href="http://www.airs.com/blog/archives/38">gold linker</a> (ELF only) can create
object files from arbitrary files using the <code class="language-plaintext highlighter-rouge">--format</code> (<code class="language-plaintext highlighter-rouge">-b</code>) option
set to <code class="language-plaintext highlighter-rouge">binary</code> (raw data). It’s combined with <code class="language-plaintext highlighter-rouge">--relocatable</code> (<code class="language-plaintext highlighter-rouge">-r</code>)
to make it linkable with the rest of the program. MinGW supports all
of this, too, so it’s fairly portable so long as you stick to GNU
Binutils.</p>

<p>For example, to create an object file, <code class="language-plaintext highlighter-rouge">my_msg.o</code>, with the
contents of the text file <code class="language-plaintext highlighter-rouge">my_msg.txt</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ld -r -b binary -o my_file.o my_msg.txt
</code></pre></div></div>

<p>(<em>Update</em>: <a href="/blog/2019/11/15/">You probably also want to use <code class="language-plaintext highlighter-rouge">-z noexecstack</code></a>.)</p>

<p>The object file will have three symbols, each named after the input
file. Unfortunately there’s no control over the symbol names, section
(.data), alignment, or protections (e.g. read-only). You’re completely
at the whim of the linker, short of objcopy tricks.</p>
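<p>One such objcopy trick is renaming the section, which also lets you
mark the data read-only. Something like this should work with GNU
Binutils, though I haven’t verified every flag combination a given target
might need:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objcopy --rename-section .data=.rodata,alloc,load,readonly,data,contents my_msg.o
</code></pre></div></div>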

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nm my_msg.o
000000000000000e D _binary_my_msg_txt_end
000000000000000e A _binary_my_msg_txt_size
0000000000000000 D _binary_my_msg_txt_start
</code></pre></div></div>

<p>To access these in C, declare them as global variables like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_end</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>
</code></pre></div></div>

<p>The size symbol, <code class="language-plaintext highlighter-rouge">_binary_my_msg_txt_size</code>, is misleading. The “A”
from nm means it’s an absolute symbol, not relocated. It doesn’t refer
to an integer that holds the size of the raw data. The value of the
symbol itself is the size of the data. That is, take the address of it
and cast it to an integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">&amp;</span><span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>
</code></pre></div></div>

<p>Alternatively — and this is my own preference — just subtract the
other two symbols. It’s cleaner and easier to understand.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">_binary_my_msg_txt_end</span> <span class="o">-</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">;</span>
</code></pre></div></div>

<p>Here’s the “Hello, world” for this approach (<code class="language-plaintext highlighter-rouge">hello.c</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_end</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">_binary_my_msg_txt_end</span> <span class="o">-</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">;</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">_binary_my_msg_txt_start</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The program has to use <code class="language-plaintext highlighter-rouge">fwrite()</code> rather than <code class="language-plaintext highlighter-rouge">fputs()</code> because the
data won’t necessarily be null-terminated. That is, unless a null is
intentionally put at the end of the text file itself.</p>
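<p>If a null-terminated string is really needed, the safe option is to
make a terminated copy at runtime. A minimal sketch:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Sketch: build a null-terminated copy of the embedded text. */
size_t size = _binary_my_msg_txt_end - _binary_my_msg_txt_start;
char *copy = malloc(size + 1);
memcpy(copy, _binary_my_msg_txt_start, size);
copy[size] = 0;
</code></pre></div></div>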

<p>And for the build:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat my_msg.txt
Hello, world!
$ ld -r -b binary -o my_msg.o my_msg.txt
$ gcc -o hello hello.c my_msg.o
$ ./hello
Hello, world!
</code></pre></div></div>

<p>If this was binary data, such as an image file, the program would
instead read the array as if it were a memory mapped file. In fact,
that’s what it really is: the raw data memory mapped by the loader
before the program started.</p>

<h4 id="how-about-a-data-structure-dump">How about a data structure dump?</h4>

<p>This could be taken further to dump out some kinds of data structures.
For example, this program (<code class="language-plaintext highlighter-rouge">table_gen.c</code>) fills out a table of the
first 90 Fibonacci numbers and dumps it to standard output.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="cp">#define TABLE_SIZE 90
</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[</span><span class="n">TABLE_SIZE</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">};</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TABLE_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">2</span><span class="p">];</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">table</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Build and run this intermediate helper program as part of the overall
build.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen &gt; table.bin
$ ld -r -b binary -o table.o table.bin
</code></pre></div></div>

<p>And then the main program (<code class="language-plaintext highlighter-rouge">print_fib.c</code>) might look like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">_binary_table_bin_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">_binary_table_bin_end</span><span class="p">[];</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">start</span> <span class="o">=</span> <span class="n">_binary_table_bin_start</span><span class="p">;</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">end</span>   <span class="o">=</span> <span class="n">_binary_table_bin_end</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="n">start</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">end</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%lld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="o">*</span><span class="n">x</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, there are some good reasons not to use this feature in this
way:</p>

<ol>
  <li>
    <p>The format of <code class="language-plaintext highlighter-rouge">table.bin</code> is specific to the host architecture
(byte order, size, padding, etc.). If the host is the same as the
target then this isn’t a problem, but it will prohibit
cross-compilation.</p>
  </li>
  <li>
    <p>The linker has no information about the alignment requirements of
the data. To the linker it’s just a byte buffer. In the final
program the <code class="language-plaintext highlighter-rouge">long long</code> array will not necessarily be aligned properly
for its type, meaning the above program might crash. The Right Way
is to never dereference the data directly but rather <code class="language-plaintext highlighter-rouge">memcpy()</code> it
into a properly-aligned variable, just as if the data was an
unaligned buffer.</p>
  </li>
  <li>
    <p>The data structure cannot use any pointers. Pointer values are
meaningless to other processes and will be no different than
garbage.</p>
  </li>
</ol>
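<p>For the second item, the fix is to declare the symbols as plain byte
arrays and copy into aligned storage. A sketch:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Sketch: memcpy the possibly-unaligned bytes into properly-aligned
   storage before reading them as an array of long long. */
extern char _binary_table_bin_start[];
extern char _binary_table_bin_end[];

long long table[90];
memcpy(table, _binary_table_bin_start,
       _binary_table_bin_end - _binary_table_bin_start);
</code></pre></div></div>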

<h3 id="towards-a-more-portable-approach">Towards a more portable approach</h3>

<p>There’s an easy way to address all three of these problems <em>and</em>
eliminate the reliance on GNU linkers: serialize the data into C code.
<em>It’s metaprogramming, baby.</em></p>

<p>In the Fibonacci example, change the <code class="language-plaintext highlighter-rouge">fwrite()</code> in <code class="language-plaintext highlighter-rouge">table_gen.c</code> to
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">printf</span><span class="p">(</span><span class="s">"int table_size = %d;</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">TABLE_SIZE</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"long long table[] = {</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TABLE_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"    %lldLL,</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"};</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
</code></pre></div></div>

<p>The output of the program becomes text:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">table_size</span> <span class="o">=</span> <span class="mi">90</span><span class="p">;</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mi">1LL</span><span class="p">,</span>
    <span class="mi">1LL</span><span class="p">,</span>
    <span class="mi">2LL</span><span class="p">,</span>
    <span class="mi">3LL</span><span class="p">,</span>
    <span class="cm">/* ... */</span>
    <span class="mi">1779979416004714189LL</span><span class="p">,</span>
    <span class="mi">2880067194370816120LL</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>And <code class="language-plaintext highlighter-rouge">print_fib.c</code> is changed to:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">int</span> <span class="n">table_size</span><span class="p">;</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[];</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">table_size</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%lld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Putting it all together:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen &gt; table.c
$ gcc -std=c99 -o print_fib print_fib.c table.c
</code></pre></div></div>

<p>Any C compiler and linker can handle all of this, no problem, making it
more portable. The intermediate metaprogram isn’t a barrier to
cross-compilation: it would be compiled for the host (typically
identified through <code class="language-plaintext highlighter-rouge">HOST_CC</code>) while the rest is compiled for the target
(e.g. <code class="language-plaintext highlighter-rouge">CC</code>).</p>

<p>The output of <code class="language-plaintext highlighter-rouge">table_gen.c</code> isn’t dependent on any architecture,
making it cross-compiler friendly. There are also no alignment
problems because it’s all visible to the compiler. The type system isn’t
being undermined.</p>

<h3 id="dealing-with-pointers">Dealing with pointers</h3>

<p>The Fibonacci example doesn’t address the pointer problem — it has no
pointers to speak of. So let’s step it up using the <a href="/blog/2016/11/13/">trie
from the previous article</a>. As a reminder, here it is:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define TRIE_ALPHABET_SIZE  4
#define TRIE_TERMINAL_FLAG  (1U &lt;&lt; 0)
</span>
<span class="k">struct</span> <span class="n">trie</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">next</span><span class="p">[</span><span class="n">TRIE_ALPHABET_SIZE</span><span class="p">];</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Dumping these structures out raw would definitely be useless since
they’re almost entirely pointer data. So instead, fill out an array of
these structures, referencing the array itself to build up the
pointers (later filled in by either the linker or the loader). This
code uses the <a href="/blog/2016/11/13/">in-place breadth-first traversal technique</a> from
the previous article.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">trie_serialize</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"struct trie %s[] = {</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">head</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"    {​{"</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TRIE_ALPHABET_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">comma</span> <span class="o">=</span> <span class="n">i</span> <span class="o">?</span> <span class="s">", "</span> <span class="o">:</span> <span class="s">""</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
                <span class="cm">/* Add child to the queue. */</span>
                <span class="n">tail</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
                <span class="n">tail</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
                <span class="n">next</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
                <span class="cm">/* Print the pointer to the child. */</span>
                <span class="n">printf</span><span class="p">(</span><span class="s">"%s%s + %zu"</span><span class="p">,</span> <span class="n">comma</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="o">++</span><span class="n">count</span><span class="p">);</span>
            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                <span class="n">printf</span><span class="p">(</span><span class="s">"%s0"</span><span class="p">,</span> <span class="n">comma</span><span class="p">);</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"}, 0, 0, %u},</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">TRIE_TERMINAL_FLAG</span><span class="p">);</span>
        <span class="n">head</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">p</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"};</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Remember that list of strings from before?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AAAAA
ABCD
CAA
CAD
CDBD
</code></pre></div></div>

<p>It looks like this:</p>

<p><img src="/img/trie/trie.svg" alt="" /></p>

<p>That serializes to this C code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">trie</span> <span class="n">root</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">3</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">6</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">10</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">13</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">14</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This trie can be immediately used at program startup without
initialization, and it can even have new nodes inserted into it. It’s
not without its downsides, particularly because it’s a trie:</p>

<ol>
  <li>
    <p>It’s <em>really</em> going to blow up the size of the binary, especially
when it holds lots of strings. These nodes are anything but
compact.</p>
  </li>
  <li>
    <p>If the code is compiled to be position-independent (<code class="language-plaintext highlighter-rouge">-fPIC</code>), each
of those nodes is going to hold multiple dynamic relocations,
further exploding the size of the binary and <a href="/blog/2016/10/27/">preventing the trie
from being shared between processes</a>. It’s 24 bytes per
relocation on x86-64. This will also slow down program start up
time. With just a few thousand strings, the simple test program was
taking 5x longer to start (25ms instead of 5ms) than with an empty
trie.</p>
  </li>
  <li>
    <p>Even without being position-independent, the linker will have to
resolve all the compile-time relocations. I was able to overwhelm
the linker and run it out of memory with just some tens of thousands of
strings. This would make for a decent linker stress test.</p>
  </li>
</ol>

<p>This technique obviously doesn’t scale well with trie data. You’re
better off baking in the flat string list and building the trie at run
time — though you <em>could</em> compute the exact number of needed nodes at
compile time and statically allocate them (in .bss). I’ve personally
had much better luck with <a href="https://github.com/skeeto/yavalath">other sorts of lookup tables</a>.
It’s a useful tool for the C programmer’s toolbelt.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>An Array of Pointers vs. a Multidimensional Array</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/27/"/>
    <id>urn:uuid:d1302ff9-f958-3486-134d-01c8ab84aa51</id>
    <updated>2016-10-27T21:01:33Z</updated>
    <category term="c"/><category term="linux"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In a C program, suppose I have a table of color names of similar
length. There are two straightforward ways to construct this table.
The most common would be an array of <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The other is a two-dimensional <code class="language-plaintext highlighter-rouge">char</code> array.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The initializers are identical, and the syntax by which these tables
are used is the same, but the underlying data structures are very
different. For example, suppose I have a <code class="language-plaintext highlighter-rouge">lookup()</code> function that
searches the table for a particular color.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lookup</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">color</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">ncolors</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ncolors</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">color</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Thanks to array decay — array arguments are implicitly converted to
pointers (§6.9.1-10) — it doesn’t matter if the table is <code class="language-plaintext highlighter-rouge">char
colors[][7]</code> or <code class="language-plaintext highlighter-rouge">char *colors[]</code>. It’s a little bit misleading because
the compiler generates different code depending on the type.</p>

<h3 id="memory-layout">Memory Layout</h3>

<p>Here’s what <code class="language-plaintext highlighter-rouge">colors_ptr</code>, a <em>jagged array</em>, typically looks like in
memory.</p>

<p><img src="/img/colortab/pointertab.png" alt="" /></p>

<p>The array of six pointers will point into the program’s string table,
usually stored in a separate page. The strings aren’t in any
particular order and will be interspersed with the program’s other
string constants. The type of the expression <code class="language-plaintext highlighter-rouge">colors_ptr[n]</code> is <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<p>On x86-64, suppose the base of the table is in <code class="language-plaintext highlighter-rouge">rax</code>, the index of the
string I want to retrieve is <code class="language-plaintext highlighter-rouge">rcx</code>, and I want to put the string’s
address back into <code class="language-plaintext highlighter-rouge">rax</code>. It’s one load instruction.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Contrast this with <code class="language-plaintext highlighter-rouge">colors_2d</code>: six 7-byte elements in a row. No
pointers or addresses. Only strings.</p>

<p><img src="/img/colortab/arraytab.png" alt="" /></p>

<p>The strings are in their defined order, packed together. The type of
the expression <code class="language-plaintext highlighter-rouge">colors_2d[n]</code> is <code class="language-plaintext highlighter-rouge">char [7]</code>, an array rather than a
pointer. If this was a large table used by a hot function, it would
have friendlier cache characteristics — both in locality and
predictability.</p>

<p>In the same scenario before with x86-64, it takes two instructions to
put the string’s address in <code class="language-plaintext highlighter-rouge">rax</code>, but neither is a load.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">imul</span>  <span class="nb">rcx</span><span class="p">,</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">7</span>
<span class="nf">add</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rcx</span>
</code></pre></div></div>

<p>In this particular case, the generated code can be slightly improved
by increasing the string size to 8 (e.g. <code class="language-plaintext highlighter-rouge">char colors_2d[][8]</code>). The
multiply turns into a simple shift and the ALU no longer needs to be
involved, cutting it to one instruction. This looks like a load due to
the LEA (Load Effective Address), but it’s not.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">lea</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<h3 id="relocation">Relocation</h3>

<p>There’s another factor to consider: relocation. Nearly every process
running on a modern system takes advantage of a security feature
called Address Space Layout Randomization (ASLR). The virtual address
of code and data is randomized at process load time. For shared
libraries, it’s not just a security feature, it’s essential to their
basic operation. Libraries cannot possibly coordinate their preferred
load addresses with every other library on the system, and so must be
relocatable.</p>

<p>If the program is compiled with GCC or Clang configured for position
independent code — <code class="language-plaintext highlighter-rouge">-fPIC</code> (for libraries) or <code class="language-plaintext highlighter-rouge">-fpie</code> + <code class="language-plaintext highlighter-rouge">-pie</code> (for
programs) — extra work has to be done to support <code class="language-plaintext highlighter-rouge">colors_ptr</code>. Those
are all addresses in the pointer array, but the compiler doesn’t know
what those addresses will be. The compiler fills the elements with
temporary values and adds six relocation entries to the binary, one
for each element. The loader will fill out the array at load time.</p>

<p>However, <code class="language-plaintext highlighter-rouge">colors_2d</code> doesn’t have any addresses other than the address
of the table itself. The loader doesn’t need to be involved with each
of its elements. Score another point for the two-dimensional array.</p>

<p>On x86-64, in both cases the table itself typically doesn’t need a
relocation entry because it will be <em>RIP-relative</em> (in the <a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">small code
model</a>). That is, code that uses the table will be at a fixed
offset from the table no matter where the program is loaded. It won’t
need to be looked up using the Global Offset Table (GOT).</p>

<p>In case you’re ever reading compiler output, in Intel syntax the
assembly for putting the table’s RIP-relative address in <code class="language-plaintext highlighter-rouge">rax</code> looks
like so:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; NASM:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">address</span><span class="p">]</span>
<span class="c1">;; Some others:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rip</span> <span class="o">+</span> <span class="nv">address</span><span class="p">]</span>
</code></pre></div></div>

<p>Or in AT&amp;T syntax:</p>

<pre><code class="language-gas">lea    address(%rip), %rax
</code></pre>

<h3 id="virtual-memory">Virtual Memory</h3>

<p>Besides (trivially) more work for the loader, there’s another
consequence to relocations: Pages containing relocations are not
shared between processes (except after fork()). When loading a
program, the loader doesn’t copy programs and libraries to memory so
much as it memory maps their binaries with copy-on-write semantics. If
another process is running with the same binaries loaded (e.g.
libc.so), they’ll share the same physical memory so long as those
pages haven’t been modified by either process. Modifying the page
creates a unique copy for that process.</p>

<p>Relocations modify parts of the loaded binary, so these pages aren’t
shared. This means <code class="language-plaintext highlighter-rouge">colors_2d</code> has the possibility of being shared
between processes, but <code class="language-plaintext highlighter-rouge">colors_ptr</code> (and its entire page) definitely
does not. Shucks.</p>

<p>This is one of the reasons why the Procedure Linkage Table (PLT)
exists. The PLT is an array of function stubs for shared library
functions, such as those in the C standard library. Sure, the loader
<em>could</em> go through the program and fill out the address of every
library function call, but this would modify lots and lots of code
pages, creating a unique copy of large parts of the program. Instead,
the dynamic linker <a href="https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html">lazily supplies jump addresses</a> for PLT
function stubs, one per accessed library function.</p>

<p>However, as I’ve written it above, it’s unlikely that even <code class="language-plaintext highlighter-rouge">colors_2d</code>
will be shared. It’s still missing an important ingredient: const.</p>

<h3 id="const">Const</h3>

<p>They say <a href="/blog/2016/07/25/">const isn’t for optimization</a> but, darnit, this
situation keeps coming up. Since <code class="language-plaintext highlighter-rouge">colors_ptr</code> and <code class="language-plaintext highlighter-rouge">colors_2d</code> are both
global, writable arrays, the compiler puts them in the same writable
data section of the program, and, in my test program, they end up
right next to each other in the same page. The other relocations doom
<code class="language-plaintext highlighter-rouge">colors_2d</code> to being a local copy.</p>

<p>Fortunately it’s trivial to fix by adding a const:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>Writing to this memory is now undefined behavior, so the compiler is
free to put it in read-only memory (<code class="language-plaintext highlighter-rouge">.rodata</code>) and separate from the
dirty relocations. On my system, this is close enough to the code to
wind up in executable memory.</p>

<p>Note, the equivalent for <code class="language-plaintext highlighter-rouge">colors_ptr</code> requires two const qualifiers,
one for the array and another for the strings. (Obviously the const
doesn’t apply to the loader.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="k">const</span> <span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>String literals are already effectively const, though the C
specification (unlike C++) doesn’t actually define them to be this
way. But, like setting your relationship status on Facebook, declaring
it makes it official.</p>
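<p>A minimal illustration (the variable names are mine):</p>

```c
#include <string.h>

const char *example(void)
{
    char *implicit = "red";       /* legal C: a literal converts to char *,
                                     but writing implicit[0] is undefined
                                     behavior */
    const char *official = "red"; /* declaring it const makes the compiler
                                     reject writes outright */
    (void)implicit;
    return official;
}
```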

<h3 id="its-just-micro-optimization">It’s just micro-optimization</h3>

<p>These little details are all deep down the path of micro-optimization
and will rarely ever matter in practice, but perhaps you learned
something broader from all this. This stuff fascinates me.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Small-Size Optimization in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/07/"/>
    <id>urn:uuid:1e249621-40bb-39f9-7e47-17fbe37c9fa4</id>
    <updated>2016-10-07T01:43:12Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I’ve worked on many programs that frequently require small,
short-lived buffers for use as a temporary workspace, perhaps to
construct a string or array. In C this is often accomplished with
arrays of <a href="/blog/2016/10/02/">automatic storage duration</a> (i.e. allocated on the
stack). This is dirt cheap — much cheaper than a heap allocation —
and, unlike a typical general-purpose allocator, involves no thread
contention. However, the catch is that there may be no hard bound on the
buffer. For correctness, the scratch space must scale appropriately to
match its input. Whatever arbitrary buffer size I pick may be too small.</p>

<p>A widespread extension to C is the alloca() pseudo-function. It’s like
malloc(), but allocates memory on the stack, just like an automatic
variable. The allocation is automatically freed when the function (not
its scope!) exits, even with a longjmp() or other non-local exit.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">alloca</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>Besides its portability issues, the most dangerous property is the
<strong>complete lack of error detection</strong>. If <code class="language-plaintext highlighter-rouge">size</code> is too large, the
program simply crashes, <em>or worse</em>.</p>

<p>For example, suppose I have an intern() function that finds or creates
the canonical representation/storage for a particular string. My
program needs to intern a string composed of multiple values, and will
construct a temporary string to do so.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>

<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I expect the vast majority of these <code class="language-plaintext highlighter-rouge">prefix</code> strings to be very small,
perhaps on the order of 10 to 80 bytes, and this function will handle
them extremely efficiently. But should this function get passed a huge
<code class="language-plaintext highlighter-rouge">prefix</code>, perhaps by a malicious actor, the program will misbehave
without warning.</p>

<p>A portable alternative to alloca() is the variable-length array (VLA),
introduced in C99. Arrays with automatic storage duration need not
have a fixed, compile-time size. A VLA is just like alloca(), with
exactly <strong>the same dangers</strong>, but at least it’s properly scoped. VLAs
were rejected for inclusion in C++11 due to this danger.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buffer</span><span class="p">[</span><span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">];</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s a middle-ground to this, using neither VLAs nor alloca().
Suppose the function always allocates a small, fixed size buffer —
essentially a free operation — but only uses this buffer if it’s large
enough for the job. If it’s not, a normal heap allocation is made with
malloc().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">temp</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">temp</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">&gt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">temp</span><span class="p">))</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">)))</span>
            <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">buffer</span> <span class="o">!=</span> <span class="n">temp</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the function can now detect allocation errors, this version has
an error condition. Though, intern() itself would presumably return
NULL for its own allocation errors, so this is probably transparent to
the caller.</p>

<p>We’ve now entered the realm of <em>small-size optimization</em>. The vast
majority of cases are small and will therefore be very fast, but we
haven’t given up on the odd large case either. In fact, it’s been made
a little bit <em>worse</em> (via the unnecessary small allocation), selling
it out to make the common case fast. That’s sound engineering.</p>

<p>Visual Studio has a pair of functions that <em>nearly</em> automate this
solution: _malloca() and _freea(). It’s like alloca(), but
allocations beyond a certain threshold go on the heap. This allocation
is freed with _freea(), which does nothing in the case of a stack
allocation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">_malloca</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">_freea</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>I said “nearly” because Microsoft screwed it up: instead of returning
NULL on failure, it generates a stack overflow structured exception
(for a heap allocation failure).</p>

<p>I haven’t tried it yet, but I bet something similar to _malloca() /
_freea() could be implemented using a couple of macros.</p>
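<p>As a sketch of what those macros might look like — everything here is
my own guesswork: the names, the 1024-byte threshold, and the tag prefix
are not Microsoft’s — each allocation could carry a one-word header
recording its origin so the free routine can tell the cases apart:</p>

```c
#include <stdlib.h>
#if defined(_WIN32)
#  include <malloc.h>   /* MSVC declares alloca() here */
#else
#  include <alloca.h>   /* glibc and friends */
#endif

enum { MALLOCA_LIMIT = 1024 };

/* Stamp the origin tag into the header word and return the usable
 * region just past it. Returns NULL if the allocation itself failed. */
static void *malloca_tag(void *p, size_t heap)
{
    if (!p)
        return 0;                  /* heap allocation failed */
    *(size_t *)p = heap;           /* 0 = stack, 1 = heap */
    return (size_t *)p + 1;
}

/* Must be a macro, not a function: alloca() memory lives in the
 * caller's stack frame. Note: n is evaluated twice. */
#define MALLOCA(n)                                          \
    ((n) <= MALLOCA_LIMIT                                   \
         ? malloca_tag(alloca((n) + sizeof(size_t)), 0)     \
         : malloca_tag(malloc((n) + sizeof(size_t)), 1))

/* Only the heap case needs a real free(); pass the header word back. */
static void FREEA(void *p)
{
    if (p) {
        size_t *base = (size_t *)p - 1;
        if (*base)
            free(base);
    }
}
```

<p>Unlike _malloca(), this returns NULL on a failed heap allocation
rather than raising a structured exception. One caveat: the one-word
prefix only guarantees alignment to sizeof(size_t), which may be weaker
than what malloc() promises.</p>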

<h3 id="toward-structured-small-size-optimization">Toward Structured Small-Size Optimization</h3>

<p>CppCon 2016 was a couple weeks ago, and I’ve begun catching up on the
talks. I don’t like developing in C++, but I always learn new,
interesting concepts from this conference, many of which apply
directly to C. I look forward to Chandler Carruth’s talks the most,
having learned so much from his past talks. I recommend these all:</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=fHNmRkzxHWs">Efficiency with Algorithms, Performance with Data Structures</a> (2014)</li>
  <li><a href="https://www.youtube.com/watch?v=nXaxk27zwlk">Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!</a> (2015)</li>
  <li><a href="https://www.youtube.com/watch?v=vElZc6zSIXM">High Performance Code 201: Hybrid Data Structures</a> (2016)</li>
  <li><a href="https://www.youtube.com/watch?v=FnGCDLhaxKU">Understanding Compiler Optimization</a> (2015)</li>
  <li><a href="https://www.youtube.com/watch?v=eR34r7HOU14">Optimizing the Emergent Structures of C++</a> (2013)</li>
</ul>

<p>After writing this article, I saw Nicholas Ormrod’s talk, <a href="https://www.youtube.com/watch?v=kPR8h4-qZdk">The strange
details of std::string at Facebook</a>, which is also highly
relevant.</p>

<p>Chandler’s talk this year was the one on hybrid data structures. I’d
already been mulling over small-size optimization for months, and the
first 5–10 minutes of his talk showed me I was on the right track. In
his talk he describes LLVM’s SmallVector class (among others), which
is basically a small-size-optimized version of std::vector, which, due
to constraints on iterators under std::move() semantics, can’t itself
be small-size optimized.</p>

<p>I picked up a new trick from this talk, which I’ll explain in C’s
terms. Suppose I have a dynamically growing buffer “vector” of <code class="language-plaintext highlighter-rouge">long</code>
values. I can keep pushing values into the buffer, doubling the
storage in size each time it fills. I’ll call this one “simple.”</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_simple</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Initialization is obvious. Though for easy overflow checks, and for
another reason I’ll explain later, I’m going to require the starting
size, <code class="language-plaintext highlighter-rouge">hint</code>, to be a power of two. It returns 1 on success and 0 on
error.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_simple_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">hint</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">hint</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">hint</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">hint</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// power of 2</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">hint</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Pushing is straightforward, using realloc() when the buffer fills,
returning 0 for integer overflow or allocation failure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_simple_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// overflow</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// out of memory</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And finally, cleaning up. I hadn’t thought about this before, but if
the compiler manages to inline vec_simple_free(), that NULL pointer
assignment will probably get optimized out, possibly <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">even in the face
of a use-after-free bug</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_simple_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// trap use-after-free bugs</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Lastly, an example of its use (without checking for errors):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">example</span><span class="p">(</span><span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">vec_simple</span> <span class="n">v</span><span class="p">;</span>
    <span class="n">vec_simple_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">n</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">arg</span><span class="p">))</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">vec_simple_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
    <span class="c1">// ... process vector into result ...</span>
    <span class="n">vec_simple_free</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the common case is only a handful of <code class="language-plaintext highlighter-rouge">long</code> values, and this
function is called frequently, we’re doing a lot of heap allocation
that could be avoided. Wouldn’t it be nice to put all that on the
stack?</p>

<h3 id="applying-small-size-optimization">Applying Small-Size Optimization</h3>

<p>Modify the struct to add this <code class="language-plaintext highlighter-rouge">temp</code> field. It’s probably obvious what
I’m getting at here. This is essentially the technique in SmallVector.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_small</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">temp</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">values</code> field is initially pointed at the small buffer. Notice
that unlike the “simple” vector above, this initialization function
cannot fail. It’s one less thing for the caller to check. It also
doesn’t take a <code class="language-plaintext highlighter-rouge">hint</code> since the buffer size is fixed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_small_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Pushing gets a little more complicated. If it’s the first time the
buffer has grown, the realloc() has to be done “manually” with
malloc() and memcpy().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_small_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// overflow</span>

        <span class="kt">void</span>  <span class="o">*</span><span class="n">new_values</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* First time heap allocation. */</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">new_values</span><span class="p">)</span>
                <span class="n">memcpy</span><span class="p">(</span><span class="n">new_values</span><span class="p">,</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">));</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// out of memory</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, only call free() if the buffer was actually allocated on the
heap.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_small_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">!=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If 99% of these vectors never exceed 16 elements, then 99% of the time
the heap isn’t touched. That’s <em>much</em> better than before. The 1% case
is still covered, too, at what is probably an insignificant cost.</p>

<p>An important difference from SmallVector is that it parameterizes
the small buffer’s size through a template argument. In C we’re stuck
with fixed sizes or macro hacks. Or are we?</p>

<h3 id="using-a-caller-provided-buffer">Using a Caller-Provided Buffer</h3>

<p>This time remove the temporary buffer, making it look like the simple
vector from before.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_flex</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The user will provide the initial buffer, which will presumably be an
adjacent, stack-allocated array, but whose size is under the user’s
control.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_flex_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">init</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// we need that low bit!</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">nmemb</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// power of 2</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">nmemb</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">init</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The power-of-two size, greater than one, means the size will always
be an even number. Why is this important? There’s one piece of
information missing from the struct: Is the buffer currently heap
allocated or not? That’s just one bit of information, but adding even
one more bit to the struct will typically pad it out by another 31 or
63 bits. What a waste! Since the size is always even, its lowest bit
is unused, and I can smuggle the flag in there. Hence the
<code class="language-plaintext highlighter-rouge">nmemb | 1</code>, the 1 indicating that the buffer is not heap allocated.</p>

<p>When pushing, the <code class="language-plaintext highlighter-rouge">actual_size</code> is extracted by clearing the bottom
bit (<code class="language-plaintext highlighter-rouge">size &amp; ~1</code>) and the indicator bit is extracted with a 1-bit mask
(<code class="language-plaintext highlighter-rouge">size &amp; 1</code>). On reallocation, the flag is cleared simply by never
setting it again.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_flex_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">actual_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// clear bottom bit</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">actual_size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">actual_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* overflow */</span>

        <span class="kt">void</span> <span class="o">*</span><span class="n">new_values</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* First time heap allocation. */</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">new_values</span><span class="p">)</span>
                <span class="n">memcpy</span><span class="p">(</span><span class="n">new_values</span><span class="p">,</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">actual_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* out of memory */</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Only call free() when the buffer was actually heap allocated, as before.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_flex_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">))</span>
        <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And here’s what it looks like in action.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">example</span><span class="p">(</span><span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">vec_flex</span> <span class="n">v</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">buffer</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="n">vec_flex_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="kt">long</span> <span class="n">n</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">n</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">arg</span><span class="p">))</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">vec_flex_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
    <span class="kt">long</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// ... process vector, computing result ...</span>
    <span class="n">vec_flex_free</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you were to log all vector sizes as part of profiling, and the
assumption that these vectors typically hold only a few elements
proved correct, you could easily tune the array size in each case to
remove the vast majority of vector heap allocations.</p>

<p>Now that I’ve learned this optimization trick, I’ll be looking out for
good places to apply it. It’s also a good reason for me to stop
abusing VLAs.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Const and Optimization in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/07/25/"/>
    <id>urn:uuid:f785bc3b-dd3d-3952-2696-91eafe6b019d</id>
    <updated>2016-07-25T02:06:04Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Today there was a <a href="https://redd.it/4udqwj">question on /r/C_Programming</a> about the
effect of C’s <code class="language-plaintext highlighter-rouge">const</code> on optimization. Variations of this question
have been asked many times over the past two decades. Personally, I
blame naming of <code class="language-plaintext highlighter-rouge">const</code>.</p>

<p>Given this program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">int</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">foo</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">);</span>
        <span class="n">y</span> <span class="o">+=</span> <span class="n">x</span><span class="p">;</span>  <span class="c1">// this load not optimized out</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function <code class="language-plaintext highlighter-rouge">foo</code> takes a pointer to const, which is a promise from
the author of <code class="language-plaintext highlighter-rouge">foo</code> that it won’t modify the value of <code class="language-plaintext highlighter-rouge">x</code>. Given this
information, it would seem the compiler may assume <code class="language-plaintext highlighter-rouge">x</code> is always zero,
and therefore <code class="language-plaintext highlighter-rouge">y</code> is always zero.</p>

<p>However, inspecting the assembly output of several different compilers
shows that <code class="language-plaintext highlighter-rouge">x</code> is loaded each time around the loop. Here’s gcc 4.9.2
at -O3, with annotations, for x86-64,</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbp</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">xor</span>    <span class="nb">ebp</span><span class="p">,</span> <span class="nb">ebp</span>              <span class="c1">; y = 0</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>              <span class="c1">; loop variable i</span>
     <span class="nf">sub</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x18</span>             <span class="c1">; allocate x</span>
     <span class="nf">mov</span>    <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">],</span> <span class="mi">0</span>    <span class="c1">; x = 0</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>        <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">add</span>    <span class="nb">ebp</span><span class="p">,</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>  <span class="c1">; y += x  (not optimized out)</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">add</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x18</span>             <span class="c1">; deallocate x</span>
     <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebp</span>              <span class="c1">; return y</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">pop</span>    <span class="nb">rbp</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>The output of clang 3.5 (with -fno-unroll-loops) is the same, except
ebp and ebx are swapped, and the computation of <code class="language-plaintext highlighter-rouge">&amp;x</code> is hoisted out of
the loop, into <code class="language-plaintext highlighter-rouge">r14</code>.</p>

<p>Are both compilers failing to take advantage of this useful
information? Wouldn’t it be undefined behavior for <code class="language-plaintext highlighter-rouge">foo</code> to modify
<code class="language-plaintext highlighter-rouge">x</code>? Surprisingly, the answer is no. <em>In this situation</em>, this would
be a perfectly legal definition of <code class="language-plaintext highlighter-rouge">foo</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">foo</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="n">readonly_x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">readonly_x</span><span class="p">;</span>  <span class="c1">// cast away const</span>
    <span class="p">(</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key thing to remember is that <a href="http://yarchive.net/comp/const.html"><strong><code class="language-plaintext highlighter-rouge">const</code> doesn’t mean
constant</strong></a>. Chalk it up as a misnomer. It’s not an
optimization tool. It’s there to inform programmers — not the compiler
— as a tool to catch a certain class of mistakes at compile time. I
like it in APIs because it communicates how a function will use
certain arguments, or how the caller is expected to handle returned
pointers. It’s usually not strong enough for the compiler to change
its behavior.</p>

<p>Despite what I just said, occasionally the compiler <em>can</em> take
advantage of <code class="language-plaintext highlighter-rouge">const</code> for optimization. The C99 specification, in
§6.7.3¶5, has one sentence just for this:</p>

<blockquote>
  <p>If an attempt is made to modify an object defined with a
const-qualified type through use of an lvalue with
non-const-qualified type, the behavior is undefined.</p>
</blockquote>

<p>The original <code class="language-plaintext highlighter-rouge">x</code> wasn’t const-qualified, so this rule didn’t apply.
And there aren’t any rules against casting away <code class="language-plaintext highlighter-rouge">const</code> to modify an
object that isn’t itself <code class="language-plaintext highlighter-rouge">const</code>. This means the above (mis)behavior
of <code class="language-plaintext highlighter-rouge">foo</code> isn’t undefined behavior <em>for this call</em>. Notice how the
undefined-ness of <code class="language-plaintext highlighter-rouge">foo</code> depends on how it was called.</p>

<p>With one tiny tweak to <code class="language-plaintext highlighter-rouge">bar</code>, I can make this rule apply, allowing the
optimizer to do some work on it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">const</span> <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>The compiler may now assume that <strong><code class="language-plaintext highlighter-rouge">foo</code> modifying <code class="language-plaintext highlighter-rouge">x</code> is undefined
behavior, therefore <em>it never happens</em></strong>. For better or worse, this is
a major part of how a C optimizer reasons about your programs. The
compiler is free to assume <code class="language-plaintext highlighter-rouge">x</code> never changes, allowing it to optimize
out both the per-iteration load and <code class="language-plaintext highlighter-rouge">y</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>            <span class="c1">; loop variable i</span>
     <span class="nf">sub</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x10</span>           <span class="c1">; allocate x</span>
     <span class="nf">mov</span>    <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">],</span> <span class="mi">0</span>  <span class="c1">; x = 0</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>      <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">add</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x10</span>           <span class="c1">; deallocate x</span>
     <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>            <span class="c1">; return 0</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>The load disappears, <code class="language-plaintext highlighter-rouge">y</code> is gone, and the function always returns
zero.</p>

<p>Curiously, the specification <em>almost</em> allows the compiler to go
further. Consider what would happen if <code class="language-plaintext highlighter-rouge">x</code> were allocated somewhere
off the stack in read-only memory. That transformation would look like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">__x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">foo</span><span class="p">(</span><span class="o">&amp;</span><span class="n">__x</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We would see a few more instructions shaved off (<a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">-fPIC, small code
model</a>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">section</span> <span class="nv">.rodata</span>
<span class="nl">x:</span>   <span class="kd">dd</span>     <span class="mi">0</span>

<span class="nf">section</span> <span class="nv">.text</span>
<span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>        <span class="c1">; loop variable i</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">x</span><span class="p">]</span>    <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>        <span class="c1">; return 0</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>Because the address of <code class="language-plaintext highlighter-rouge">x</code> is taken and “leaked,” this last transform
is not permitted. If <code class="language-plaintext highlighter-rouge">bar</code> is called recursively such that a second
address is taken for <code class="language-plaintext highlighter-rouge">x</code>, that second pointer would compare equal
(<code class="language-plaintext highlighter-rouge">==</code>) to the first pointer despite the two being semantically distinct
objects, which is forbidden (§6.5.9¶6).</p>

<p>Even with this special <code class="language-plaintext highlighter-rouge">const</code> rule, stick to using <code class="language-plaintext highlighter-rouge">const</code> for
yourself and for your fellow human programmers. Let the optimizer
reason for itself about what is constant and what is not.</p>

<p>Travis Downs nicely summed up this article in the comments:</p>

<blockquote>
  <p>In general, <code class="language-plaintext highlighter-rouge">const</code> <em>declarations</em> can’t help the optimizer, but
<code class="language-plaintext highlighter-rouge">const</code> <em>definitions</em> can.</p>
</blockquote>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Hotpatching a C Function on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/03/31/"/>
    <id>urn:uuid:49f6ea3c-d44a-3bed-1aad-70ad47e325c6</id>
    <updated>2016-03-31T23:59:59Z</updated>
    <category term="x86"/><category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In this post I’m going to do a silly, but interesting, exercise that
should never be done in any program that actually matters. I’m going to
write a program that changes one of its function definitions while
it’s actively running and using that function. Unlike <a href="/blog/2014/12/23/">last
time</a>, this won’t involve shared libraries, but it will require
x86_64 and GCC. Most of the time it will work with Clang, too, but
it’s missing an important compiler option that makes it stable.</p>

<p>If you want to see it all up front, here’s the full source:
<a href="/download/hotpatch.c">hotpatch.c</a></p>

<p>Here’s the function that I’m going to change:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s dead simple, but that’s just for demonstration purposes. This
will work with any function of arbitrary complexity. The definition
will be changed to this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"goodbye %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">x</span><span class="o">++</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I was only going to change the string, but I figured I should make
it a little more interesting.</p>

<p>Here’s how it’s going to work: I’m going to overwrite the beginning of
the function with an unconditional jump that immediately moves control
to the new definition of the function. It’s vital that the function
prototype does not change, since that would be a <em>far</em> more complex
problem.</p>

<p><strong>But first there’s some preparation to be done.</strong> The target needs to
be augmented with some GCC function attributes to prepare it for its
redefinition. As is, there are three possible problems that need to be
dealt with:</p>

<ul>
  <li>I want to hotpatch this function <em>while it is being used</em> by another
thread <em>without</em> any synchronization. It may even be executing the
function at the same time I clobber its first instructions with my
jump. If it’s in between these instructions, disaster will strike.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">ms_hook_prologue</code> function attribute. This tells
GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP
that I can safely clobber. This idea originated in Microsoft’s Win32
API, hence the “ms” in the name.</p>

<ul>
  <li>The prologue NOP needs to be updated atomically. I can’t let the
other thread see a half-written instruction or, again, disaster. On
x86 this means I have an alignment requirement. Since I’m
overwriting an 8-byte instruction, I’m specifically going to need
8-byte alignment to get an atomic write.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">aligned</code> function attribute, ensuring the
hotpatch prologue is properly aligned.</p>

<ul>
  <li>The final problem is that there must be exactly one copy of this
function in the compiled program. It must never be inlined or
cloned, since these won’t be hotpatched.</li>
</ul>

<p>As you might have guessed, this is primarily fixed with the <code class="language-plaintext highlighter-rouge">noinline</code>
function attribute. GCC may also clone the function and call that
instead, so it also needs the <code class="language-plaintext highlighter-rouge">noclone</code> attribute.</p>

<p>Even further, if GCC determines there are no side effects, it may
cache the return value and only ever call the function once. To
convince GCC that there’s a side effect, I added an empty inline
assembly string (<code class="language-plaintext highlighter-rouge">__asm("")</code>). Since <code class="language-plaintext highlighter-rouge">puts()</code> has a side effect
(output), this isn’t truly necessary for this particular example, but
I’m being thorough.</p>

<p>What does the function look like now?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span> <span class="p">((</span><span class="n">ms_hook_prologue</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">8</span><span class="p">)))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noinline</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noclone</span><span class="p">))</span>
<span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span><span class="p">(</span><span class="s">""</span><span class="p">);</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And what does the assembly look like?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -Mintel -d hotpatch
0000000000400848 &lt;hello&gt;:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 &lt;puts@plt&gt;
</code></pre></div></div>

<p>It’s 8-byte aligned and it has the 8-byte NOP: that <code class="language-plaintext highlighter-rouge">lea</code> instruction
does nothing. It copies <code class="language-plaintext highlighter-rouge">rsp</code> into itself and changes no flags. Why
not 8 1-byte NOPs? I need to replace exactly one instruction with
exactly one other instruction. I can’t have another thread in between
those NOPs.</p>

<h3 id="hotpatching">Hotpatching</h3>

<p>Next, let’s take a look at the function that will perform the
hotpatch. I’ve written a generic patching function for this purpose.
This part is entirely specific to x86.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hotpatch</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">target</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">replacement</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// 8-byte aligned?</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">page</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="o">~</span><span class="mh">0xfff</span><span class="p">);</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_WRITE</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">replacement</span> <span class="o">-</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="kt">uint8_t</span> <span class="n">bytes</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
        <span class="kt">uint64_t</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">instruction</span> <span class="o">=</span> <span class="p">{</span> <span class="p">{</span><span class="mh">0xe9</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">}</span> <span class="p">};</span>
    <span class="o">*</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">=</span> <span class="n">instruction</span><span class="p">.</span><span class="n">value</span><span class="p">;</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It takes the address of the function to be patched and the address of
the function to replace it. As mentioned, the target <em>must</em> be 8-byte
aligned (enforced by the assert). It’s also important this function is
only called by one thread at a time, even on different targets. If
that was a concern, I’d wrap it in a mutex to create a critical
section.</p>

<p>There are a number of things going on here, so let’s go through them
one at a time:</p>

<h4 id="make-the-function-writeable">Make the function writeable</h4>

<p>The .text segment will not be writeable by default. This is for both
security and safety. Before I can hotpatch the function I need to make
the function writeable. To make the function writeable, I need to make
its page writeable. To make its page writeable I need to call
<code class="language-plaintext highlighter-rouge">mprotect()</code>. If another thread were monkeying with the page
attributes of this page at the same time (another thread calling
<code class="language-plaintext highlighter-rouge">hotpatch()</code>) I’d be in trouble.</p>

<p>It finds the page by rounding the target address down to the nearest
4096, the assumed page size (sorry hugepages). <em>Warning</em>: I’m being a
bad programmer and not checking the result of <code class="language-plaintext highlighter-rouge">mprotect()</code>. If it
fails, the program will crash and burn. It will always fail on systems
with W^X enforcement, which will likely become the standard <a href="https://marc.info/?t=145942649500004">in the
future</a>. Under W^X (“write XOR execute”), memory can either
be writeable or executable, but never both at the same time.</p>

<p>What if the function straddles pages? Well, I’m only patching the
first 8 bytes, which, thanks to alignment, will sit entirely inside
the page I just found. It’s not an issue.</p>

<p>At the end of the function, I <code class="language-plaintext highlighter-rouge">mprotect()</code> the page back to
non-writeable.</p>

<h4 id="create-the-instruction">Create the instruction</h4>

<p>I’m assuming the replacement function is within 2GB of the original in
virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s
no 64-bit relative jump, and I only have 8 bytes to work within
anyway. Looking that up in <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">the Intel manual</a>, I see this:</p>

<p><img src="/img/misc/jmp-e9.png" alt="" /></p>

<p>Fortunately it’s a really simple instruction. It’s opcode 0xE9 and
it’s followed immediately by the 32-bit displacement. The instruction
is 5 bytes wide.</p>

<p>To compute the relative jump, I take the difference between the
functions, minus 5. Why the 5? The jump address is computed from the
position <em>after</em> the jump instruction and, as I said, it’s 5 bytes
wide.</p>

<p>I put 0xE9 in a byte array, followed by the little endian
displacement. The astute may notice that the displacement is signed
(it can go “up” or “down”) and I used an unsigned integer. That’s
because it will overflow nicely to the right value and make those
shifts clean.</p>

<p>Finally, the instruction byte array I just computed is written over
the hotpatch NOP as a single, atomic, 64-bit store.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    *(uint64_t *)target = instruction.value;
</code></pre></div></div>

<p>Other threads will see either the NOP or the jump, nothing in between.
There’s no synchronization, so other threads may continue to execute
the NOP for a brief moment even though I’ve clobbered it, but that’s
fine.</p>

<h3 id="trying-it-out">Trying it out</h3>

<p>Here’s what my test program looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">worker</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">arg</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">hello</span><span class="p">();</span>
        <span class="n">usleep</span><span class="p">(</span><span class="mi">100000</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">pthread_t</span> <span class="kr">thread</span><span class="p">;</span>
    <span class="n">pthread_create</span><span class="p">(</span><span class="o">&amp;</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">worker</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="n">getchar</span><span class="p">();</span>
    <span class="n">hotpatch</span><span class="p">(</span><span class="n">hello</span><span class="p">,</span> <span class="n">new_hello</span><span class="p">);</span>
    <span class="n">pthread_join</span><span class="p">(</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I fire off the other thread to keep it pinging at <code class="language-plaintext highlighter-rouge">hello()</code>. The
main thread waits until I hit enter to give the program input,
after which it calls <code class="language-plaintext highlighter-rouge">hotpatch()</code> and changes the function called by
the “worker” thread. I’ve now changed the behavior of the worker
thread without its knowledge. In a more practical situation, this
could be used to update parts of a running program without restarting
or even synchronizing.</p>

<h3 id="further-reading">Further Reading</h3>

<p>These related articles have been shared with me since publishing this
article:</p>

<ul>
  <li><a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583">Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?</a></li>
  <li><a href="http://jbremer.org/x86-api-hooking-demystified/">x86 API Hooking Demystified</a></li>
  <li><a href="http://conf.researchr.org/event/pldi-2016/pldi-2016-papers-living-on-the-edge-rapid-toggling-probes-with-cross-modification-on-x86">Living on the edge: Rapid-toggling probes with cross modification on x86</a></li>
  <li><a href="https://lwn.net/Articles/620640/">arm64: alternatives runtime patching</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Small, Freestanding Windows Executables</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/01/31/"/>
    <id>urn:uuid:8eddc701-52d3-3b0c-a8a8-dd13da6ead2c</id>
    <updated>2016-01-31T22:53:03Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>Update</strong>: This is old and <a href="/blog/2023/02/15/">was <strong>updated in 2023</strong></a>!</p>

<p>Recently I’ve been experimenting with freestanding C programs on
Windows. <em>Freestanding</em> refers to programs that don’t link, either
statically or dynamically, against a standard library (i.e. libc).
This is typical for operating systems and <a href="/blog/2014/12/09/">similar, bare metal
situations</a>. Normally a C compiler can make assumptions about the
semantics of functions provided by the C standard library. For
example, the compiler will likely replace a call to a small,
fixed-size <code class="language-plaintext highlighter-rouge">memmove()</code> with move instructions. Since a freestanding
program would supply its own, it may have different semantics.</p>

<p>My usual go-to for C/C++ on Windows is <a href="http://mingw-w64.org/">Mingw-w64</a>, which has
greatly suited my needs over the past couple of years. It’s <a href="https://packages.debian.org/search?keywords=mingw-w64">packaged on
Debian</a>, and, when combined with Wine, allows me to fully develop
Windows applications on Linux. Being GCC, it’s also great for
cross-platform development since it’s essentially the same compiler as
the other platforms. The primary difference is the interface to the
operating system (POSIX vs. Win32).</p>

<p>However, it has one glaring flaw inherited from MinGW: it links
against msvcrt.dll, an ancient version of the Microsoft C runtime
library that currently ships with Windows. Besides being dated and
quirky, <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">it’s not an official part of Windows</a> and never has
been, despite its inclusion with every release since Windows 95.
Mingw-w64 doesn’t have a C library of its own, instead patching over
some of the flaws of msvcrt.dll and linking against it.</p>

<p>Since so much depends on msvcrt.dll despite its unofficial nature,
it’s unlikely Microsoft will ever drop it from future releases of
Windows. However, if strict correctness is a concern, we must ask
Mingw-w64 not to link against it. An alternative would be
<a href="http://plibc.sourceforge.net/">PlibC</a>, though the LGPL licensing is unfortunate. Another is
Cygwin, which is a very complete POSIX environment, but is heavy and
GPL-encumbered.</p>

<p>Sometimes I’d prefer to be more direct: <a href="https://hero.handmade.network/forums/code-discussion/t/94-guide_-_how_to_avoid_c_c++_runtime_on_windows">skip the C standard library
altogether</a> and talk directly to the operating system. On Windows
that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only
links against system DLLs.</p>

<h3 id="linux-vs-windows">Linux vs. Windows</h3>

<p>The most important benefit of a standard library like libc is a
portable, uniform interface to the host system. So long as the
standard library suits its needs, the same program can run anywhere.
Without it, the program needs an implementation of each
host-specific interface.</p>

<p>On Linux, operating system requests at the lowest level are made
directly via system calls. This requires a bit of assembly language
for each supported architecture (<code class="language-plaintext highlighter-rouge">int 0x80</code> on x86, <code class="language-plaintext highlighter-rouge">syscall</code> on
x86-64, <code class="language-plaintext highlighter-rouge">swi</code> on ARM, etc.). The POSIX functions of the various Linux
libc implementations are built on top of this mechanism.</p>

<p>For example, here’s a function for a 1-argument system call on x86-64.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">syscall1</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">result</span><span class="p">;</span>
    <span class="n">__asm__</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">arg</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">exit()</code> is implemented on top. Note: A <em>real</em> libc would do
cleanup before exiting, like calling registered <code class="language-plaintext highlighter-rouge">atexit()</code> functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;syscall.h&gt;</span><span class="c1">  // defines SYS_exit</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">code</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">code</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The situation is simpler on Windows. Its low-level system calls are
undocumented and unstable, changing across even minor updates. The
formal, stable interface is through the exported functions in
kernel32.dll. In fact, kernel32.dll is essentially a standard library
on its own (making the term “freestanding” in this case dubious). It
includes functions usually found only in user-space, like string
manipulation, formatted output, font handling, and heap management
(similar to <code class="language-plaintext highlighter-rouge">malloc()</code>). It’s not POSIX, but it has analogs to much of
the same functionality.</p>

<h3 id="program-entry">Program Entry</h3>

<p>The standard entry for a C program is <code class="language-plaintext highlighter-rouge">main()</code>. However, this is not
the application’s <em>true</em> entry. The entry is in the C library, which
does some initialization before calling your <code class="language-plaintext highlighter-rouge">main()</code>. When <code class="language-plaintext highlighter-rouge">main()</code>
returns, it performs cleanup and exits. Without a C library, programs
don’t start at <code class="language-plaintext highlighter-rouge">main()</code>.</p>

<p>On Linux the default entry is the symbol <code class="language-plaintext highlighter-rouge">_start</code>. Its prototype
would look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Returning from this function leads to a segmentation fault, so it’s up
to your application to perform the exit system call rather than
return.</p>

<p>On Windows, the entry depends on the type of application. The two
relevant subsystems today are the <em>console</em> and <em>windows</em> subsystems.
The former is for console applications (duh). These programs may still
create windows and such, but must always have a controlling console.
The latter is primarily for programs that don’t run in a console,
though they can still create an associated console if they like. In
Mingw-w64, give <code class="language-plaintext highlighter-rouge">-mconsole</code> (default) or <code class="language-plaintext highlighter-rouge">-mwindows</code> to the linker to
choose the subsystem.</p>

<p>The default <a href="https://msdn.microsoft.com/en-us/library/f9t8842e.aspx">entry for each is slightly different</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike Linux’s <code class="language-plaintext highlighter-rouge">_start</code>, Windows programs can safely return from these
functions, similar to <code class="language-plaintext highlighter-rouge">main()</code>, hence the <code class="language-plaintext highlighter-rouge">int</code> return. The <code class="language-plaintext highlighter-rouge">WINAPI</code>
macro means the function may have a special calling convention,
depending on the platform.</p>

<p>On any system, you can choose a different entry symbol or address
using the <code class="language-plaintext highlighter-rouge">--entry</code> option to the GNU linker.</p>

<h3 id="disabling-libgcc">Disabling libgcc</h3>

<p>One problem I’ve run into is Mingw-w64 generating code that calls
<code class="language-plaintext highlighter-rouge">__chkstk_ms()</code> from libgcc. I believe this is a long-standing bug,
since <code class="language-plaintext highlighter-rouge">-ffreestanding</code> should prevent these sorts of helper functions
from being used. The workaround I’ve found is to disable <a href="https://metricpanda.com/rival-fortress-update-45-dealing-with-__chkstk-__chkstk_ms-when-cross-compiling-for-windows/">the stack
probe</a> and pre-commit the whole stack.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000
</code></pre></div></div>

<p>Alternatively you could link against libgcc (statically) with <code class="language-plaintext highlighter-rouge">-lgcc</code>,
but, again, I’m going for a tiny executable.</p>

<h3 id="a-freestanding-example">A freestanding example</h3>

<p>Here’s an example of a Windows “Hello, World” that doesn’t use a C
library.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="n">WINAPI</span>
<span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">msg</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">msg</span><span class="p">),</span> <span class="p">(</span><span class="n">DWORD</span><span class="p">[]){</span><span class="mi">0</span><span class="p">},</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To build it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
    -o example.exe example.c \
    -lkernel32
</code></pre></div></div>

<p>Notice I manually linked against kernel32.dll. The stripped final
result is only 4kB, mostly PE padding. There are <a href="http://www.phreedom.org/research/tinype/">techniques to trim
this down even further</a>, but for a substantial program it
wouldn’t make a significant difference.</p>

<p>From here you could create a GUI by linking against <code class="language-plaintext highlighter-rouge">user32.dll</code> and
<code class="language-plaintext highlighter-rouge">gdi32.dll</code> (both also part of Win32) and calling the appropriate
functions. I already <a href="/blog/2015/06/06/">ported my OpenGL demo</a> to a freestanding
.exe, dropping GLFW and directly using Win32 and WGL. It’s much less
portable, but the final .exe is only 4kB, down from the original 104kB
(static linking against GLFW).</p>

<p>I may go this route for <a href="http://7drl.org/2016/01/13/7drl-2016-announced-for-5-13-march/">the upcoming 7DRL 2016</a> in March.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Basic Just-In-Time Compiler</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/03/19/"/>
    <id>urn:uuid:95e0437f-61f0-3932-55b7-f828e171d9ca</id>
    <updated>2015-03-19T04:57:55Z</updated>
    <category term="c"/><category term="tutorial"/><category term="netsec"/><category term="x86"/><category term="posix"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=17747759">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/akxq8q/a_basic_justintime_compiler/">on reddit</a>.</em></p>

<p><a href="http://redd.it/2z68di">Monday’s /r/dailyprogrammer challenge</a> was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term
(<code class="language-plaintext highlighter-rouge">u(0)</code>) and a sequence of operations, <code class="language-plaintext highlighter-rouge">f</code>, to apply to the previous
term (<code class="language-plaintext highlighter-rouge">u(n + 1) = f(u(n))</code>) to compute the next term. Since it’s an
easy challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.</p>

<!--more-->

<p>For example, the relation <code class="language-plaintext highlighter-rouge">u(n + 1) = (u(n) + 2) * 3 - 5</code> would be
input as <code class="language-plaintext highlighter-rouge">+2 *3 -5</code>. If <code class="language-plaintext highlighter-rouge">u(0) = 0</code> then,</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">u(1) = 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(2) = 4</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(3) = 13</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(4) = 40</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(5) = 121</code></li>
  <li>…</li>
</ul>

<p>Rather than write an interpreter to apply the sequence of operations,
for <a href="https://gist.github.com/skeeto/3a1aa3df31896c9956dc">my submission</a> (<a href="/download/jit.c">mirror</a>) I took the opportunity to
write a simple x86-64 Just-In-Time (JIT) compiler. So rather than
stepping through the operations one by one, my program converts the
operations into native machine code and lets the hardware do the work
directly. In this article I’ll go through how it works and how I did
it.</p>

<p><strong>Update</strong>: The <a href="http://redd.it/2zna5q">follow-up challenge</a> uses Reverse Polish
notation to allow for more complicated expressions. I wrote another
JIT compiler for <a href="https://gist.github.com/anonymous/f7e4a5086a2b0acc83aa">my submission</a> (<a href="/download/rpn-jit.c">mirror</a>).</p>

<h3 id="allocating-executable-memory">Allocating Executable Memory</h3>

<p>Modern operating systems have page-granularity protections for
different parts of <a href="http://marek.vavrusa.com/c/memory/2015/02/20/memory/">process memory</a>: read, write, and execute.
Code can only be executed from memory with the execute bit set on its
page, memory can only be changed when its write bit is set, and some
pages aren’t allowed to be read. In a running process, the pages
holding program code and loaded libraries will have their write bit
cleared and execute bit set. Most of the other pages will have their
execute bit cleared and their write bit set.</p>

<p>The reason for this is twofold. First, it significantly increases the
security of the system. If untrusted input was read into executable
memory, an attacker could input machine code (<em>shellcode</em>) into the
buffer, then exploit a flaw in the program to cause control flow to
jump to and execute that code. If the attacker is only able to write
code to non-executable memory, this attack becomes a lot harder. The
attacker has to rely on code already loaded into executable pages
(<a href="http://en.wikipedia.org/wiki/Return-oriented_programming"><em>return-oriented programming</em></a>).</p>

<p>Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, <code class="language-plaintext highlighter-rouge">NULL</code>
points to a special page with read, write, and execute disabled.</p>

<h4 id="an-instruction-buffer">An Instruction Buffer</h4>

<p>Memory returned by <code class="language-plaintext highlighter-rouge">malloc()</code> and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through <code class="language-plaintext highlighter-rouge">malloc()</code>, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an <code class="language-plaintext highlighter-rouge">asmbuf</code> struct.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PAGE_SIZE 4096
</span>
<span class="k">struct</span> <span class="n">asmbuf</span> <span class="p">{</span>
    <span class="kt">uint8_t</span> <span class="n">code</span><span class="p">[</span><span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint64_t</span><span class="p">)];</span>
    <span class="kt">uint64_t</span> <span class="n">count</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use <code class="language-plaintext highlighter-rouge">sysconf(_SC_PAGESIZE)</code> to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.</p>

<p>Instead of <code class="language-plaintext highlighter-rouge">malloc()</code>, the compiler allocates memory as an anonymous
memory map (<code class="language-plaintext highlighter-rouge">mmap()</code>). It’s anonymous because it’s not backed by a
file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Windows doesn’t have POSIX <code class="language-plaintext highlighter-rouge">mmap()</code>, so on that platform we use
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> instead. Here’s the equivalent in Win32.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">type</span> <span class="o">=</span> <span class="n">MEM_RESERVE</span> <span class="o">|</span> <span class="n">MEM_COMMIT</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">VirtualAlloc</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">PAGE_READWRITE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Anyone reading closely should notice that I haven’t actually requested
that the memory be executable, which is, like, the whole point of all
this! This was intentional. Some operating systems employ a security
feature called W^X: “write xor execute.” That is, memory is either
writable or executable, but never both at the same time. This makes
the shellcode attack I described before even harder. For <a href="http://www.tedunangst.com/flak/post/now-or-never-exec">well-behaved
JIT compilers</a> it means memory protections need to be adjusted
after code generation and before execution.</p>

<p>The POSIX <code class="language-plaintext highlighter-rouge">mprotect()</code> function is used to change memory protections.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or on Win32 (that last parameter is not allowed to be <code class="language-plaintext highlighter-rouge">NULL</code>),</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">old</span><span class="p">;</span>
    <span class="n">VirtualProtect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PAGE_EXECUTE_READ</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">old</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, instead of <code class="language-plaintext highlighter-rouge">free()</code> it gets unmapped.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And on Win32,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">VirtualFree</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MEM_RELEASE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I won’t list the definitions here, but there are two “methods” for
inserting instructions and immediate values into the buffer. This will
be raw machine code, so the caller will be acting a bit like an
assembler.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">asmbuf_ins</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">ins</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">asmbuf_immediate</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
</code></pre></div></div>

<h3 id="calling-conventions">Calling Conventions</h3>

<p>We’re only going to be concerned with three of x86-64’s many
registers: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rax</code>, and <code class="language-plaintext highlighter-rouge">rdx</code>. These are 64-bit (<code class="language-plaintext highlighter-rouge">r</code>) extensions
of <a href="/blog/2014/12/09/">the original 16-bit 8086 registers</a>. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here's what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">recurrence</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>
</code></pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions">The System V AMD64 ABI calling convention</a> says that the first
integer/pointer function argument is passed in the <code class="language-plaintext highlighter-rouge">rdi</code> register.
When our JIT compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in <code class="language-plaintext highlighter-rouge">rax</code> when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy <code class="language-plaintext highlighter-rouge">rdi</code> to <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rdi</span>
</code></pre></div></div>

<p>There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in <code class="language-plaintext highlighter-rouge">asmbuf</code>. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in <code class="language-plaintext highlighter-rouge">rcx</code> rather than <code class="language-plaintext highlighter-rouge">rdi</code>. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.</p>

<p>The very last thing it will do, assuming the result is in <code class="language-plaintext highlighter-rouge">rax</code>, is
return to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">ret</span>
</code></pre></div></div>

<p>So we know the assembly, but what do we pass to <code class="language-plaintext highlighter-rouge">asmbuf_ins()</code>? This
is where we get our hands dirty.</p>

<h4 id="finding-the-code">Finding the Code</h4>

<p>If you want to do this the Right Way, you go download the x86-64
documentation, look up the instructions we’re using, and manually work
out the bytes we need and how the operands fit into it. You know, like
they used to do <a href="/blog/2016/11/17/">out of necessity</a> back in the 60’s.</p>

<p>Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file <code class="language-plaintext highlighter-rouge">peek.s</code> and hand it to <code class="language-plaintext highlighter-rouge">nasm</code>. It will produce a raw binary
with the machine code, which we’ll disassemble with <code class="language-plaintext highlighter-rouge">ndisasm</code> (the
NASM disassembler).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret
</code></pre></div></div>

<p>That’s straightforward. The first instruction is 3 bytes and the
return is 1 byte.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4889f8</span><span class="p">);</span>  <span class="c1">// mov   rax, rdi</span>
<span class="c1">// ... generate code ...</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mh">0xc3</span><span class="p">);</span>      <span class="c1">// ret</span>
</code></pre></div></div>

<p>For each operation, we’ll set it up so the operand will already be
loaded into <code class="language-plaintext highlighter-rouge">rdi</code> regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32-bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use <code class="language-plaintext highlighter-rouge">0x0123456789abcdef</code> as the
operand.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rdi</span><span class="p">,</span> <span class="mh">0x0123456789abcdef</span>
</code></pre></div></div>

<p>Which disassembled with <code class="language-plaintext highlighter-rouge">ndisasm</code> is,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301
</code></pre></div></div>

<p>Notice the operand listed little endian immediately after the
instruction. That’s also easy!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">operand</span><span class="p">;</span>
<span class="n">scanf</span><span class="p">(</span><span class="s">"%ld"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mh">0x48bf</span><span class="p">);</span>         <span class="c1">// mov   rdi, operand</span>
<span class="n">asmbuf_immediate</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
</code></pre></div></div>

<p>Apply the same discovery process individually for each operator you
want to support, accumulating the result in <code class="language-plaintext highlighter-rouge">rax</code> for each.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">switch</span> <span class="p">(</span><span class="n">operator</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="sc">'+'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4801f8</span><span class="p">);</span>   <span class="c1">// add   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'-'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4829f8</span><span class="p">);</span>   <span class="c1">// sub   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'*'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mh">0x480fafc7</span><span class="p">);</span> <span class="c1">// imul  rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'/'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4831d2</span><span class="p">);</span>   <span class="c1">// xor   rdx, rdx</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x48f7ff</span><span class="p">);</span>   <span class="c1">// idiv  rdi</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As an exercise, try adding support for modulus operator (<code class="language-plaintext highlighter-rouge">%</code>), XOR
(<code class="language-plaintext highlighter-rouge">^</code>), and bit shifts (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the <a href="https://old.reddit.com/r/dailyprogrammer/comments/2z68di/_/cpgkcx7">closed form solution</a> to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.</p>

<h3 id="calling-the-generated-code">Calling the Generated Code</h3>

<p>Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a <code class="language-plaintext highlighter-rouge">void *</code> just to avoid repeating myself, since that will implicitly
cast to the correct function pointer prototype.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_finalize</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">recurrence</span><span class="p">)(</span><span class="kt">long</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="o">-&gt;</span><span class="n">code</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="n">x</span><span class="p">[</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">recurrence</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]);</span>
</code></pre></div></div>

<p>That’s pretty cool if you ask me! Now this was an extremely simplified
situation. There’s no branching, no intermediate values, no function
calls, and I didn’t even touch the stack (push, pop). The recurrence
relation definition in this challenge is practically an assembly
language itself, so after the initial setup it’s a 1:1 translation.</p>

<p>I’d like to build a JIT compiler more advanced than this in the
future. I just need to find a suitable problem that’s more complicated
than this one, warrants having a JIT compiler, but is still simple
enough that I could, on some level, justify not using LLVM.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>C11 Lock-free Stack</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/09/02/"/>
    <id>urn:uuid:743811a4-aaf7-32e3-8a0c-62f1e8dbaf66</id>
    <updated>2014-09-02T03:10:01Z</updated>
    <category term="c"/><category term="tutorial"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>C11, the <a href="http://en.wikipedia.org/wiki/C11_(C_standard_revision)">latest C standard revision</a>, hasn’t received anywhere
near the same amount of fanfare as C++11. I’m not sure why this is.
Some of the updates to each language are very similar, such as formal
support for threading and atomic object access. Three years have
passed and some parts of C11 still haven’t been implemented by any
compilers or standard libraries yet. Since there’s not yet a lot of
discussion online about C11, I’m basing much of this article on my own
understanding of the <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">C11 draft</a>. I <em>may</em> be under-using the
<code class="language-plaintext highlighter-rouge">_Atomic</code> type specifier and not paying enough attention to memory
ordering constraints.</p>

<p>Still, this is a good opportunity to break new ground with a
demonstration of C11. I’m going to use the new
<a href="http://en.cppreference.com/w/c/atomic"><code class="language-plaintext highlighter-rouge">stdatomic.h</code></a> portion of C11 to build a lock-free data
structure. To compile this code you’ll need a C compiler and C library
with support for both C11 and the optional <code class="language-plaintext highlighter-rouge">stdatomic.h</code> features. As
of this writing, as far as I know only <a href="https://gcc.gnu.org/gcc-4.9/changes.html">GCC 4.9</a>, released April
2014, supports this. It’s in Debian unstable but not in Wheezy.</p>

<p>If you want to take a look before going further, here’s the source.
The test code in the repository uses plain old pthreads because C11
threads haven’t been implemented by anyone yet.</p>

<ul>
  <li><a href="https://github.com/skeeto/lstack">https://github.com/skeeto/lstack</a></li>
</ul>

<p>I was originally going to write this article a couple weeks ago, but I
was having trouble getting it right. Lock-free data structures are
trickier and nastier than I expected, more so than traditional mutex
locks. Getting it right requires very specific help from the hardware,
too, so it won’t run just anywhere. I’ll discuss all this below. So
sorry for the long article. It’s just a lot more complex a topic than
I had anticipated!</p>

<h3 id="lock-free">Lock-free</h3>

<p>A lock-free data structure doesn’t require the use of mutex locks.
More generally, it’s a data structure that can be accessed from
multiple threads without blocking. This is accomplished through the
use of atomic operations — transformations that cannot be
interrupted. Lock-free data structures will generally provide better
throughput than mutex locks. And they’re usually safer, because there’s
no risk of getting stuck on a lock that will never be freed, such as a
deadlock situation. On the other hand there’s additional risk of
starvation (livelock), where a thread is unable to make progress.</p>

<p>As a demonstration, I’ll build up a lock-free stack, a sequence with
last-in, first-out (LIFO) behavior. Internally it’s going to be
implemented as a linked-list, so pushing and popping is O(1) time,
just a matter of consing a new element on the head of the list. It
also means there’s only one value to be updated when pushing and
popping: the pointer to the head of the list.</p>

<p>Here’s what the API will look like. I’ll define <code class="language-plaintext highlighter-rouge">lstack_t</code> shortly.
I’m making it an opaque type because its fields should never be
accessed directly. The goal is to completely hide the atomic
semantics from the users of the stack.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>     <span class="nf">lstack_init</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">max_size</span><span class="p">);</span>
<span class="kt">void</span>    <span class="nf">lstack_free</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
<span class="kt">size_t</span>  <span class="nf">lstack_size</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
<span class="kt">int</span>     <span class="nf">lstack_push</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
<span class="kt">void</span>   <span class="o">*</span><span class="nf">lstack_pop</span> <span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
</code></pre></div></div>

<p>Users can push void pointers onto the stack, check the size of the
stack, and pop void pointers back off the stack. Except for
initialization and destruction, these operations are all safe to use
from multiple threads. Two different threads will never receive the
same item when popping. No elements will ever be lost if two threads
attempt to push at the same time. Most importantly a thread will never
block on a lock when accessing the stack.</p>

<p>Notice there’s a maximum size declared at initialization time. While
<a href="http://www.research.ibm.com/people/m/michael/pldi-2004.pdf">lock-free allocation is possible</a> [PDF], C makes no
guarantees that <code class="language-plaintext highlighter-rouge">malloc()</code> is lock-free, so being truly lock-free
means not calling <code class="language-plaintext highlighter-rouge">malloc()</code>. An important secondary benefit to
pre-allocating the stack’s memory is that this implementation doesn’t
require the use of <a href="http://en.wikipedia.org/wiki/Hazard_pointer">hazard pointers</a>, which would be far more
complicated than the stack itself.</p>

<p>The declared maximum size should actually be the desired maximum size
plus the number of threads accessing the stack. This is because a
thread might remove a node from the stack and, before the node can be
freed for reuse, another thread attempts a push. This other thread
might not find any free nodes, causing it to give up without the stack
actually being “full.”</p>

<p>The <code class="language-plaintext highlighter-rouge">int</code> return value of <code class="language-plaintext highlighter-rouge">lstack_init()</code> and <code class="language-plaintext highlighter-rouge">lstack_push()</code> is for
error codes, returning 0 for success. The only way these can fail is
by running out of memory. This is an issue regardless of being
lock-free: systems can simply run out of memory. In the push case it
means the stack is full.</p>

<h3 id="structures">Structures</h3>

<p>Here’s the definition for a node in the stack. Neither field needs to
be accessed atomically, so they’re not special in any way. In fact,
the fields are <em>never</em> updated while on the stack and visible to
multiple threads, so it’s effectively immutable (outside of reuse).
Users never need to touch this structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">lstack_node</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Internally a <code class="language-plaintext highlighter-rouge">lstack_t</code> is composed of <em>two</em> stacks: the value stack
(<code class="language-plaintext highlighter-rouge">head</code>) and the free node stack (<code class="language-plaintext highlighter-rouge">free</code>). These will be handled
identically by the atomic functions, so it’s really a matter of
convention which stack is which. All nodes are initially placed on the
free stack and the value stack starts empty. Here’s what an internal
stack looks like.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">lstack_head</span> <span class="p">{</span>
    <span class="kt">uintptr_t</span> <span class="n">aba</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>There’s still no atomic declaration here because the struct is going
to be handled as an entire unit. The <code class="language-plaintext highlighter-rouge">aba</code> field is critically
important for correctness and I’ll go over it shortly. It’s declared
as a <code class="language-plaintext highlighter-rouge">uintptr_t</code> because it needs to be the same size as a pointer.
Now, this is not guaranteed by C11 — it’s only guaranteed to be large
enough to hold any valid <code class="language-plaintext highlighter-rouge">void *</code> pointer, so it could be even larger
— but this will be the case on any system that has the required
hardware support for this lock-free stack. This struct is therefore
the size of two pointers. If that’s not true for any reason, this code
will not link. Users will never directly access or handle this struct
either.</p>

<p>Finally, here’s the actual stack structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node_buffer</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">head</span><span class="p">,</span> <span class="n">free</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
<span class="p">}</span> <span class="n">lstack_t</span><span class="p">;</span>
</code></pre></div></div>

<p>Notice the use of the new <code class="language-plaintext highlighter-rouge">_Atomic</code> qualifier. Atomic values may have
different size, representation, and alignment requirements in order to
satisfy atomic access. These values should never be accessed directly,
even just for reading (use <code class="language-plaintext highlighter-rouge">atomic_load()</code>).</p>

<p>The <code class="language-plaintext highlighter-rouge">size</code> field is for convenience to check the number of elements on
the stack. It’s accessed separately from the stack nodes themselves,
so it’s not safe to read <code class="language-plaintext highlighter-rouge">size</code> and use the information to make
assumptions about future accesses (e.g. checking if the stack is empty
before popping off an element). Since there’s no way to lock the
lock-free stack, there’s otherwise no way to estimate the size of the
stack during concurrent access without completely disassembling it via
<code class="language-plaintext highlighter-rouge">lstack_pop()</code>.</p>

<p>There’s <a href="https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt">no reason to use <code class="language-plaintext highlighter-rouge">volatile</code> here</a>. That’s a
separate issue from atomic operations. The C11 <code class="language-plaintext highlighter-rouge">stdatomic.h</code> macros
and functions will ensure atomic values are accessed appropriately.</p>

<h3 id="stack-functions">Stack Functions</h3>

<p>As stated before, all nodes are initially placed on the internal free
stack. During initialization they’re allocated in one solid chunk,
chained together, and pinned on the <code class="language-plaintext highlighter-rouge">free</code> pointer. The initial
assignments to atomic values are done through <code class="language-plaintext highlighter-rouge">ATOMIC_VAR_INIT</code>, which
deals with memory access ordering concerns. The <code class="language-plaintext highlighter-rouge">aba</code> counters don’t
<em>actually</em> need to be initialized. Garbage (indeterminate) values are
just fine, but not initializing them would probably look like a
mistake.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lstack_init</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">max_size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">head_init</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">};</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span> <span class="o">=</span> <span class="n">ATOMIC_VAR_INIT</span><span class="p">(</span><span class="n">head_init</span><span class="p">);</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">ATOMIC_VAR_INIT</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>

    <span class="cm">/* Pre-allocate all nodes. */</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">max_size</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">lstack_node</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">max_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">next</span> <span class="o">=</span> <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">+</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">[</span><span class="n">max_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">].</span><span class="n">next</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">free_init</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">};</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span> <span class="o">=</span> <span class="n">ATOMIC_VAR_INIT</span><span class="p">(</span><span class="n">free_init</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The free nodes will not necessarily be used in the same order that
they’re placed on the free stack. Several threads may pop off nodes
from the free stack and, as a separate operation, push them onto the
value stack in a different order. Over time with multiple threads
pushing and popping, the nodes are likely to get shuffled around quite
a bit. This is why a linked list is still necessary even though
allocation is contiguous.</p>

<p>The reverse of <code class="language-plaintext highlighter-rouge">lstack_init()</code> is simple, and it’s assumed concurrent
access has terminated. The stack is no longer valid, at least not
until <code class="language-plaintext highlighter-rouge">lstack_init()</code> is used again. This one is declared <code class="language-plaintext highlighter-rouge">inline</code> and
put in the header.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">lstack_free</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To read an atomic value we need to use <code class="language-plaintext highlighter-rouge">atomic_load()</code>. Given a
pointer to an atomic value, it dereferences the pointer and returns
the value. This is used in another inline function for reading the
size of the stack.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">size_t</span>
<span class="nf">lstack_size</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_load</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="push-and-pop">Push and Pop</h4>

<p>For operating on the two stacks there will be two internal, static
functions, <code class="language-plaintext highlighter-rouge">push</code> and <code class="language-plaintext highlighter-rouge">pop</code>. These deal directly in nodes, accepting
and returning them, so they’re not suitable to expose in the API
(users aren’t meant to be aware of nodes). This is the most complex
part of lock-free stacks. Here’s <code class="language-plaintext highlighter-rouge">pop()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span>
<span class="nf">pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="o">*</span><span class="n">head</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">next</span><span class="p">,</span> <span class="n">orig</span> <span class="o">=</span> <span class="n">atomic_load</span><span class="p">(</span><span class="n">head</span><span class="p">);</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">orig</span><span class="p">.</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
            <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>  <span class="c1">// empty stack</span>
        <span class="n">next</span><span class="p">.</span><span class="n">aba</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">aba</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">node</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">atomic_compare_exchange_weak</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">orig</span><span class="p">,</span> <span class="n">next</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s centered around the new C11 <code class="language-plaintext highlighter-rouge">stdatomic.h</code> function
<code class="language-plaintext highlighter-rouge">atomic_compare_exchange_weak()</code>. This is an atomic operation more
generally called <a href="http://en.wikipedia.org/wiki/Compare-and-swap">compare-and-swap</a> (CAS). On x86 there’s an
instruction specifically for this, <code class="language-plaintext highlighter-rouge">cmpxchg</code>. Give it a pointer to the
atomic value to be updated (<code class="language-plaintext highlighter-rouge">head</code>), a pointer to the value it’s
expected to be (<code class="language-plaintext highlighter-rouge">orig</code>), and a desired new value (<code class="language-plaintext highlighter-rouge">next</code>). If the
expected and actual values match, it’s updated to the new value. If
not, it reports a failure and updates the expected value to the latest
value. In the event of a failure we start all over again, which
requires the <code class="language-plaintext highlighter-rouge">while</code> loop. This is an <em>optimistic</em> strategy.</p>

<p>The “weak” part means it will sometimes spuriously fail where the
“strong” version would otherwise succeed. In exchange for more
failures, calling the weak version is faster. Use the weak version
when the body of your <code class="language-plaintext highlighter-rouge">do ... while</code> loop is fast and the strong
version when it’s slow (when trying again is expensive), or if you
don’t need a loop at all. You usually want to use weak.</p>

<p>The alternative to CAS is <a href="http://en.wikipedia.org/wiki/Load-link/store-conditional">load-link/store-conditional</a>. It’s a
stronger primitive that doesn’t suffer from the ABA problem described
next, but it’s also not available on x86-64. On other platforms, one
or both of <code class="language-plaintext highlighter-rouge">atomic_compare_exchange_*()</code> will be implemented using
LL/SC, but we still have to code for the worst case (CAS).</p>

<h5 id="the-aba-problem">The ABA Problem</h5>

<p>The <code class="language-plaintext highlighter-rouge">aba</code> field is here to solve <a href="http://en.wikipedia.org/wiki/ABA_problem">the ABA problem</a> by counting
the number of changes that have been made to the stack. It will be
updated atomically alongside the pointer. Reasoning about the ABA
problem is where I got stuck last time writing this article.</p>

<p>Suppose <code class="language-plaintext highlighter-rouge">aba</code> didn’t exist and it was just a pointer being swapped.
Say we have two threads, A and B.</p>

<ul>
  <li>
    <p>Thread A copies the current <code class="language-plaintext highlighter-rouge">head</code> into <code class="language-plaintext highlighter-rouge">orig</code>, enters the loop body
to update <code class="language-plaintext highlighter-rouge">next.node</code> to <code class="language-plaintext highlighter-rouge">orig.node-&gt;next</code>, then gets preempted
before the CAS. The scheduler pauses the thread.</p>
  </li>
  <li>
<p>Thread B comes along and performs a <code class="language-plaintext highlighter-rouge">pop()</code>, changing the value pointed
to by <code class="language-plaintext highlighter-rouge">head</code>. At this point A’s CAS will fail, which is fine. It
would reconstruct a new updated value and try again. While A is
still asleep, B puts the popped node back on the free node stack.</p>
  </li>
  <li>
    <p>Some time passes with A still paused. The freed node gets re-used
and pushed back on top of the stack, which is likely given that
nodes are allocated LIFO. Now <code class="language-plaintext highlighter-rouge">head</code> has its original value again,
but the <code class="language-plaintext highlighter-rouge">head-&gt;node-&gt;next</code> pointer is pointing somewhere completely
new! <em>This is very bad</em> because A’s CAS will now succeed despite
<code class="language-plaintext highlighter-rouge">next.node</code> having the wrong value.</p>
  </li>
  <li>
<p>A wakes up and its CAS succeeds. At least one stack value has been
lost and at least one node struct was leaked (it will be on neither
stack, nor currently being held by a thread). This is the ABA
problem.</p>
  </li>
</ul>

<p>The core problem is that, unlike integral values, pointers have
meaning beyond their intrinsic numeric value. The meaning of a
particular pointer changes when the pointer is reused, making it
suspect when used in CAS. The unfortunate effect is that, <strong>by itself,
atomic pointer manipulation is nearly useless</strong>. It works with
append-only data structures, where pointers are never recycled, but
that’s it.</p>

<p>The <code class="language-plaintext highlighter-rouge">aba</code> field solves the problem because it’s incremented every time
the pointer is updated. Remember that this internal stack struct is
two pointers wide? That’s 16 bytes on a 64-bit system. The entire 16
bytes is compared by CAS and they all have to match for it to succeed.
Since B, or other threads, will increment <code class="language-plaintext highlighter-rouge">aba</code> at least twice (once
to remove the node, and once to put it back in place), A will never
mistake the recycled pointer for the old one. There’s a special
double-width CAS instruction specifically for this purpose,
<code class="language-plaintext highlighter-rouge">cmpxchg16b</code>. This is generally called DWCAS. It’s available on most
x86-64 processors. On Linux you can check <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code> for support.
It will be listed as <code class="language-plaintext highlighter-rouge">cx16</code>.</p>

<p>If it’s not available at compile-time this program won’t link. The
function that wraps <code class="language-plaintext highlighter-rouge">cmpxchg16b</code> won’t be there. You can tell GCC to
<em>assume</em> it’s there with the <code class="language-plaintext highlighter-rouge">-mcx16</code> flag. The same rule here applies
to C++11’s <code class="language-plaintext highlighter-rouge">std::atomic</code>.</p>

<p>There’s still a tiny possibility of the ABA problem
cropping up. On 32-bit systems A may get preempted for over 4 billion
(2^32) stack operations, such that the ABA counter wraps around to the
same value. There’s nothing we can do about this, but if you witness
this in the wild you need to immediately stop what you’re doing and go
buy a lottery ticket. Also avoid any lightning storms on the way to
the store.</p>

<h5 id="hazard-pointers-and-garbage-collection">Hazard Pointers and Garbage Collection</h5>

<p>Another problem in <code class="language-plaintext highlighter-rouge">pop()</code> is dereferencing <code class="language-plaintext highlighter-rouge">orig.node</code> to access its
<code class="language-plaintext highlighter-rouge">next</code> field. By the time we get to it, the node pointed to by
<code class="language-plaintext highlighter-rouge">orig.node</code> may have already been removed from the stack and freed. If
the stack was using <code class="language-plaintext highlighter-rouge">malloc()</code> and <code class="language-plaintext highlighter-rouge">free()</code> for allocations, it may
even have had <code class="language-plaintext highlighter-rouge">free()</code> called on it. If so, the dereference would be
undefined behavior — a segmentation fault, or worse.</p>

<p>There are three ways to deal with this.</p>

<ol>
  <li>
    <p>Garbage collection. If memory is automatically managed, the node
will never be freed as long as we can access it, so this won’t be a
problem. However, if we’re interacting with a garbage collector
we’re not really lock-free.</p>
  </li>
  <li>
    <p>Hazard pointers. Each thread keeps track of what nodes it’s
currently accessing and other threads aren’t allowed to free nodes
on this list. This is messy and complicated.</p>
  </li>
  <li>
    <p>Never free nodes. This implementation recycles nodes, but they’re
never truly freed until <code class="language-plaintext highlighter-rouge">lstack_free()</code>. It’s always safe to
dereference a node pointer because there’s always a node behind it.
It may point to a node that’s on the free list or one that was even
recycled since we got the pointer, but the <code class="language-plaintext highlighter-rouge">aba</code> field deals with
any of those issues.</p>
  </li>
</ol>

<p>Reference counting on the node won’t work here because we can’t get to
the counter fast enough (atomically). It too would require
dereferencing in order to increment. The reference counter could
potentially be packed alongside the pointer and accessed by a DWCAS,
but we’re already using those bytes for <code class="language-plaintext highlighter-rouge">aba</code>.</p>

<h5 id="push">Push</h5>

<p>Push is a lot like pop.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="o">*</span><span class="n">head</span><span class="p">,</span> <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">next</span><span class="p">,</span> <span class="n">orig</span> <span class="o">=</span> <span class="n">atomic_load</span><span class="p">(</span><span class="n">head</span><span class="p">);</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">aba</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">aba</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">node</span> <span class="o">=</span> <span class="n">node</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">atomic_compare_exchange_weak</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">orig</span><span class="p">,</span> <span class="n">next</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s counter-intuitive, but adding a <a href="http://blog.memsql.com/common-pitfalls-in-writing-lock-free-algorithms/">few microseconds of
sleep</a> after CAS failures would probably <em>increase</em>
throughput. Under high contention, threads wouldn’t take turns
clobbering each other as fast as possible. It would be a bit like
exponential backoff.</p>

<h4 id="api-push-and-pop">API Push and Pop</h4>

<p>The API push and pop functions are built on these internal atomic
functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lstack_push</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="n">node</span><span class="o">-&gt;</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span><span class="p">,</span> <span class="n">node</span><span class="p">);</span>
    <span class="n">atomic_fetch_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Push removes a node from the free stack. If the free stack is empty it
reports an out-of-memory error. It assigns the value and pushes it
onto the value stack where it will be visible to other threads.
Finally, the stack size is incremented atomically. This means there’s
an instant where the stack size is listed as one shorter than it
actually is. However, since there’s no way to access both the stack
size and the stack itself at the same instant, this is fine. The stack
size is really only an estimate.</p>

<p>Popping is the same thing in reverse.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">lstack_pop</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">atomic_fetch_sub</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">value</span> <span class="o">=</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
    <span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">,</span> <span class="n">node</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Remove the top node, decrement the size estimate atomically, put the
node on the free list, and return the value. It’s really simple with
the primitive push and pop.</p>

<h3 id="sha1-demo">SHA1 Demo</h3>

<p>The lstack repository linked at the top of the article includes a demo
that searches for patterns in SHA-1 hashes (sort of like Bitcoin
mining). It fires off one worker thread for each core and the results
are all collected into the same lock-free stack. It’s not <em>really</em>
exercising the library thoroughly because there are no contended pops,
but I couldn’t think of a better example at the time.</p>

<p>The next thing to try would be implementing a C11, bounded, lock-free
queue. It would also be more generally useful than a stack,
particularly for common consumer-producer scenarios.</p>

]]>
    </content>
  </entry>

</feed>
