<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged x86 at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/x86/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/x86/feed/"/>
  <updated>2026-04-07T03:24:16Z</updated>
  <id>urn:uuid:763a3ddc-a1df-4bad-b03e-86513dc3c50c</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  
    
  
    
  
    
  <entry>
    <title>Frankenwine: Multiple personas in a Wine process</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/01/19/"/>
    <id>urn:uuid:d2b53f8d-88a6-400b-a748-693a758741c5</id>
    <updated>2026-01-19T21:51:38Z</updated>
    <category term="c"/><category term="win32"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>I came across a recent article on <a href="https://gpfault.net/posts/drunk-exe.html">making Linux system calls from a Wine
process</a>. Windows programs running under Wine are still normal Linux
processes and may interact with the Linux kernel like any other process.
None of this was surprising, and the demonstration works just as I expect.
Still, it got the wheels spinning and I realized an <em>almost</em> practical
application: build <a href="/blog/2023/01/18/">my pkg-config implementation</a> such that on Windows
<code class="language-plaintext highlighter-rouge">pkg-config.exe</code> behaves as a native pkg-config, but when run under Wine
this same binary takes the persona of a Linux program and becomes a cross
toolchain pkg-config, bypassing Win32 and talking directly with the Linux
kernel. <a href="https://justine.lol/cosmopolitan/">Cosmopolitan Libc</a> cleverly does this out-of-the-box, but
in this article we’ll mash together a couple existing sources with a bit
of glue.</p>

<p>The results are in <a href="https://github.com/skeeto/u-config/commit/e0008d7e">the merge-demo branch</a> of u-config, and it took
hardly any work:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)
</code></pre></div></div>

<p>A platform layer, <code class="language-plaintext highlighter-rouge">main_wine.c</code>, is a merge of two existing platform
layers, one of which required unavoidable tweaks. We’ll get to those
details in a moment. First we’ll need to detect if we’re running under
Wine, and <a href="https://web.archive.org/web/20250923061634/https://stackoverflow.com/questions/7372388/determine-whether-a-program-is-running-under-wine-at-runtime/42333249#42333249">the best solution I found</a> was to locate
<code class="language-plaintext highlighter-rouge">ntdll!wine_get_version</code>. If this function exists, we’re in Wine. That
works out to a pretty one-liner because <code class="language-plaintext highlighter-rouge">ntdll.dll</code> is already loaded:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bool</span> <span class="nf">running_on_wine</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">GetProcAddress</span><span class="p">(</span><span class="n">GetModuleHandleA</span><span class="p">(</span><span class="s">"ntdll"</span><span class="p">),</span> <span class="s">"wine_get_version"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>An x86-64 Linux syscall wrapper with <a href="/blog/2024/12/20/">thorough inline assembly</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ptrdiff_t</span> <span class="nf">syscall3</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">b</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">r</span><span class="p">;</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">ptrdiff_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_write</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="p">(</span><span class="kt">ptrdiff_t</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’d normally use <code class="language-plaintext highlighter-rouge">long</code> for all these integers because Linux is <a href="https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models">LP64</a>
(<code class="language-plaintext highlighter-rouge">long</code> is pointer-sized), but Windows is LLP64 (only <code class="language-plaintext highlighter-rouge">long long</code> is 64
bits). It’s so bizarre to interface with Linux from LLP64, and this will
have consequences later. With these pieces we can see the basic shape of a
split personality program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">running_on_wine</span><span class="p">())</span> <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"hello, wine</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">12</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
        <span class="n">WriteFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"hello, windows</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>We can cram two programs into this binary and select which one to run
at run time depending on what we see. In typical programs, locating and calling
into glibc would be a challenge, particularly with the incompatible ABIs
involved. We’re avoiding it here by interfacing directly with the kernel.</p>

<h3 id="application-to-u-config">Application to u-config</h3>

<p>Luckily u-config has completely optional platform layers implemented with
Linux system calls. The POSIX platform layer works fine, and that’s what
distributions should generally use, but these bonus platforms are unhosted
and do not require libc. That means we can shove one into a Windows build
with relatively little trouble.</p>

<p>Before we do that, let’s think about what we’re doing. <a href="/blog/2021/08/21/">Debian has great
cross toolchain support</a>, including Mingw-w64. There are even a few
Windows libraries in the Debian package repository, <a href="https://packages.debian.org/trixie/x32/libz-mingw-w64">such as zlib</a>, and
we can build Windows programs against them. If you’re cross-building and
using pkg-config, you ought to use the cross toolchain pkg-config, which
in GNU ecosystems gets an architecture prefix like the other cross tools.
Debian cross toolchains each include a cross pkg-config, and it sometimes
<em>almost</em> works correctly! Here’s what I get on Debian 13:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz
</code></pre></div></div>

<p>Note the architecture in the <code class="language-plaintext highlighter-rouge">-I</code> and <code class="language-plaintext highlighter-rouge">-L</code> options. It really is querying
the <a href="https://peter0x44.github.io/posts/cross-compilers/">cross sysroot</a>. However, these paths are inside the cross sysroot
and so should not be listed by pkg-config at all. It’s suboptimal and
indicates this pkg-config is probably misconfigured. In other cases it’s
far from correct:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...
</code></pre></div></div>

<p>A tool prefixed <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-</code> should not produce paths containing
<code class="language-plaintext highlighter-rouge">x86_64-linux-gnu</code> (the host architecture in this case). Our version won’t
have these issues.</p>

<p>The u-config platform interface is five functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// read whole files</span>
<span class="n">s8node</span> <span class="o">*</span><span class="nf">os_listing</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// list directories</span>
<span class="kt">void</span>    <span class="nf">os_write</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>          <span class="c1">// standard out/err</span>
<span class="kt">void</span>    <span class="nf">os_fail</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">);</span>                       <span class="c1">// non-zero exit</span>

<span class="kt">void</span> <span class="nf">uconfig</span><span class="p">(</span><span class="n">config</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Platforms implement the first four functions, and call <code class="language-plaintext highlighter-rouge">uconfig()</code> with
the platform’s configuration, context pointer (<code class="language-plaintext highlighter-rouge">os *</code>), command line
arguments, environment, and some memory (all in the <code class="language-plaintext highlighter-rouge">config</code> object). My
strategy is to link two platforms into the binary, and the first challenge
is that they both define <code class="language-plaintext highlighter-rouge">os_write</code>, etc. I never planned nor intended for one
binary to contain more than one platform layer. Unity builds offer a fix
without changing a single line of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include</span> <span class="cpf">"main_windows.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span>
<span class="cp">#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include</span> <span class="cpf">"main_linux_amd64.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span></code></pre></div></div>

<p>This dirty but effective trick <a href="/blog/2025/02/05/">may look familiar</a>. It also doesn’t
interfere with the other builds. Next I define the real platform functions
as a dispatch based on our run-time situation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b32</span> <span class="n">wine_detected</span><span class="p">;</span>

<span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">linux_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">win32_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I were serious about keeping this experiment, I’d lift <code class="language-plaintext highlighter-rouge">os</code> as I did
the functions (as <code class="language-plaintext highlighter-rouge">win32_os</code>, <code class="language-plaintext highlighter-rouge">linux_os</code>) and include <code class="language-plaintext highlighter-rouge">wine_detected</code> in
the context, eliminating this global variable. That cannot be done with
simple hacks and macros.</p>

<p>The next challenge is that I wrote the Linux platform layer assuming LP64,
and so it uses <code class="language-plaintext highlighter-rouge">long</code> instead of an equivalent platform-agnostic type like
<code class="language-plaintext highlighter-rouge">ptrdiff_t</code>. I never thought this would be an issue because this source
literally contains <code class="language-plaintext highlighter-rouge">asm</code> blocks and no conditional compilation, yet here
we are. Lesson learned. I wanted to try an extremely janky <code class="language-plaintext highlighter-rouge">#define</code> on
<code class="language-plaintext highlighter-rouge">long</code> to fix it, but this source file has a couple of <code class="language-plaintext highlighter-rouge">long long</code> declarations that won’t
play along. These multi-token type names of C are antithetical to its
preprocessor! So I adjusted the source manually instead.</p>

<p>The Windows and Linux platform entry points are completely different, both
in name and form, and so co-exist naturally. The merged platform layer is
a new entry point that will pass control to the appropriate entry point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">entrypoint</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>  <span class="c1">// Linux</span>
<span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">mainCRTStartup</span><span class="p">();</span>    <span class="c1">// Windows</span>
</code></pre></div></div>

<p>On Linux <code class="language-plaintext highlighter-rouge">stack</code> is <a href="/blog/2025/03/06/">the initial value of the stack pointer</a>, which
<a href="https://articles.manugarg.com/aboutelfauxiliaryvectors">points to <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, <code class="language-plaintext highlighter-rouge">envp</code>, and <code class="language-plaintext highlighter-rouge">auxv</code></a>. We’ll need to construct
an artificial “stack” for the Linux platform layer to harvest. On Windows
this is <a href="/blog/2023/02/15/">the process entry point</a>, and it will find the rest on its
own as a normal Windows process. Ultimately this ended up simpler than I
expected:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">merge_entrypoint</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">wine_detected</span> <span class="o">=</span> <span class="n">running_on_wine</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">u8</span> <span class="o">*</span><span class="n">fakestack</span><span class="p">[</span><span class="n">CMDLINE_ARGV_MAX</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">c16</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="n">GetCommandLineW</span><span class="p">();</span>
        <span class="n">fakestack</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="p">)(</span><span class="n">iz</span><span class="p">)</span><span class="n">cmdline_to_argv8</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">fakestack</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="c1">// TODO: append envp to the fake stack</span>
        <span class="n">entrypoint</span><span class="p">((</span><span class="n">iz</span> <span class="o">*</span><span class="p">)</span><span class="n">fakestack</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">mainCRTStartup</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Where <a href="/blog/2022/02/18/"><code class="language-plaintext highlighter-rouge">cmdline_to_argv8</code> is my Windows argument parser</a>, already
used by u-config, and I reserve one element at the front to store <code class="language-plaintext highlighter-rouge">argc</code>.
Since this is just a proof-of-concept I didn’t bother fabricating and
pushing <code class="language-plaintext highlighter-rouge">envp</code> onto the fake stack. The Linux entry point doesn’t need
<code class="language-plaintext highlighter-rouge">auxv</code>, so that can be omitted, too. Once in the Linux entry point it’s
essentially a Linux process from then on, except that the x64 calling
convention is still in use internally.</p>

<p>Finally, I configure the Linux platform layer for Debian’s cross sysroot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "/usr/x86_64-w64-mingw32/lib"
</span></code></pre></div></div>

<p>And that’s it! We have our platform merge. Build (<a href="https://github.com/skeeto/w64devkit">w64devkit</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c
</code></pre></div></div>

<p>On Debian use <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-gcc</code> for <code class="language-plaintext highlighter-rouge">cc</code>. The <code class="language-plaintext highlighter-rouge">-e</code> linker option
selects the new, higher level entry point. After installing <a href="https://packages.debian.org/trixie/wine-binfmt">Wine
binfmt</a>, here’s how it looks on Debian:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs zlib
-lz
</code></pre></div></div>

<p>That’s the correct output, but is it using the cross sysroot? Ask it to
include the <code class="language-plaintext highlighter-rouge">-I</code> argument despite it being in the cross sysroot:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz
</code></pre></div></div>

<p>Looking good! It passes the <code class="language-plaintext highlighter-rouge">pc_path</code> test, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig
</code></pre></div></div>

<p>Running <em>this same binary</em> on Windows after installing zlib in w64devkit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz
</code></pre></div></div>

<p>Also:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig
</code></pre></div></div>

<p>My Frankenwine is a success!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Closures as Win32 window procedures</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/12/12/"/>
    <id>urn:uuid:7bf46ec6-a8b2-4ffa-857a-86c040357702</id>
    <updated>2025-12-12T19:52:10Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Back in 2017 I wrote <a href="/blog/2017/01/08/">about a technique for creating closures in C</a>
using a <a href="/blog/2015/03/19/">JIT-compiled</a> wrapper. It’s neat, though rarely necessary in
real programs, so I don’t think about it often. I applied it to <code class="language-plaintext highlighter-rouge">qsort</code>,
which <a href="/blog/2023/02/11/">sadly</a> accepts no context pointer. More practical would be
working around <a href="/blog/2023/12/17/">insufficient custom allocator interfaces</a>, to
create allocation functions at run-time bound to a particular allocation
region. I’ve learned a lot since I last wrote about this subject, and <a href="https://lowkpro.com/blog/creating-c-closures-from-lua-closures.html">a
recent article</a> had me thinking about it again, and how I could do
better than before. In this article I will enhance Win32 window procedure
callbacks with a fifth argument, allowing us to more directly pass extra
context. I’m using <a href="https://github.com/skeeto/w64devkit">w64devkit</a> on x64, but everything here should
work out-of-the-box with any x64 toolchain that speaks GNU assembly.</p>

<p>A <a href="https://learn.microsoft.com/en-us/windows/win32/api/winuser/nc-winuser-wndproc">window procedure</a> has this prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LRESULT</span> <span class="nf">Wndproc</span><span class="p">(</span>
  <span class="n">HWND</span> <span class="n">hWnd</span><span class="p">,</span>
  <span class="n">UINT</span> <span class="n">Msg</span><span class="p">,</span>
  <span class="n">WPARAM</span> <span class="n">wParam</span><span class="p">,</span>
  <span class="n">LPARAM</span> <span class="n">lParam</span>
<span class="p">);</span>
</code></pre></div></div>

<p>To create a window we must first register a class with <code class="language-plaintext highlighter-rouge">RegisterClass</code>,
which accepts a set of properties describing a window class, including a
pointer to one of these functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="p">...;</span>

    <span class="n">RegisterClassA</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">WNDCLASSA</span><span class="p">){</span>
        <span class="c1">// ...</span>
        <span class="p">.</span><span class="n">lpfnWndProc</span>   <span class="o">=</span> <span class="n">my_wndproc</span><span class="p">,</span>
        <span class="p">.</span><span class="n">lpszClassName</span> <span class="o">=</span> <span class="s">"my_class"</span><span class="p">,</span>
        <span class="c1">// ...</span>
    <span class="p">});</span>

    <span class="n">HWND</span> <span class="n">hwnd</span> <span class="o">=</span> <span class="n">CreateWindowExA</span><span class="p">(</span><span class="s">"my_class"</span><span class="p">,</span> <span class="p">...,</span> <span class="n">state</span><span class="p">);</span>
</code></pre></div></div>

<p>The thread drives a message pump with events from the operating system,
dispatching them to this procedure, which then manipulates the program
state in response:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="n">MSG</span> <span class="n">msg</span><span class="p">;</span> <span class="n">GetMessageW</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);)</span> <span class="p">{</span>
        <span class="n">TranslateMessage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">);</span>
        <span class="n">DispatchMessageW</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">);</span>  <span class="c1">// calls the window procedure</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>All four <code class="language-plaintext highlighter-rouge">WNDPROC</code> parameters are determined by Win32. There is no context
pointer argument. So how does this procedure access the program state? We
generally have two options:</p>

<ol>
  <li>Global variables. Yucky but easy. Frequently seen in tutorials.</li>
  <li>A <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code> pointer attached to the window.</li>
</ol>

<p>The second option takes some setup. Win32 passes the last <code class="language-plaintext highlighter-rouge">CreateWindowEx</code>
argument to the window procedure when the window is created, via <code class="language-plaintext highlighter-rouge">WM_CREATE</code>.
The procedure attaches the pointer to its window as <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code>. This
pointer is passed indirectly, through a <code class="language-plaintext highlighter-rouge">CREATESTRUCT</code>. So ultimately it
looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">case</span> <span class="n">WM_CREATE</span><span class="p">:</span>
        <span class="n">CREATESTRUCT</span> <span class="o">*</span><span class="n">cs</span> <span class="o">=</span> <span class="p">(</span><span class="n">CREATESTRUCT</span> <span class="o">*</span><span class="p">)</span><span class="n">lParam</span><span class="p">;</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">state</span> <span class="o">*</span><span class="p">)</span><span class="n">cs</span><span class="o">-&gt;</span><span class="n">lpCreateParams</span><span class="p">;</span>
        <span class="n">SetWindowLongPtr</span><span class="p">(</span><span class="n">hwnd</span><span class="p">,</span> <span class="n">GWLP_USERDATA</span><span class="p">,</span> <span class="p">(</span><span class="n">LONG_PTR</span><span class="p">)</span><span class="n">arg</span><span class="p">);</span>
        <span class="c1">// ...</span>
</code></pre></div></div>

<p>In future messages we can retrieve it with <code class="language-plaintext highlighter-rouge">GetWindowLongPtr</code>. Every time
I go through this I wish there were a better way. What if there were a fifth
window procedure parameter through which we could pass a context?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef LRESULT Wndproc5(HWND, UINT, WPARAM, LPARAM, void *);
</code></pre></div></div>

<p>We’ll build just this with a trampoline. The <a href="https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention">x64 calling convention</a>
passes the first four arguments in registers, and the rest are pushed on
the stack, including this new parameter. Our trampoline cannot just stuff
the extra parameter in a register, but will actually have to build a
stack frame. Slightly more complicated, but barely so.</p>

<h3 id="allocating-executable-memory">Allocating executable memory</h3>

<p>In previous articles, and in the programs where I’ve applied techniques
like this, I’ve allocated executable memory with <code class="language-plaintext highlighter-rouge">VirtualAlloc</code> (or <code class="language-plaintext highlighter-rouge">mmap</code>
elsewhere). This introduces a small challenge for solving the problem
generally: Allocations may be arbitrarily far from our code and data, out
of reach of relative addressing. If they’re further than 2G apart, we need
to encode absolute addresses, and in the simple case would just assume
they’re always too far apart.</p>
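
<p>To make the limit concrete: a 5-byte <code class="language-plaintext highlighter-rouge">call</code> encodes a signed 32-bit displacement measured from the end of the instruction. A quick reachability check — the helper name is my own, not from any real source — might look like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

// Can a 5-byte rel32 call at "site" reach "callee"? The displacement
// is relative to the next instruction (hypothetical helper).
static int reaches(uint64_t site, uint64_t callee)
{
    int64_t d = (int64_t)(callee - (site + 5));
    return d &gt;= INT32_MIN &amp;&amp; d &lt;= INT32_MAX;
}
</code></pre></div></div>

<p>When this fails we must fall back on a register-indirect call through an absolute address.</p>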

<p>These days I’ve more experience with executable formats, and allocation,
and I immediately see a better solution: Request a block of writable,
executable memory from the loader, then allocate our trampolines from it.
Other than being executable, this memory isn’t special, and <a href="/blog/2025/01/19/">allocation
works the usual way</a>, using functions unaware it’s executable. By
allocating through the loader, this memory will be part of our loaded
image, guaranteed to be close to our other code and data, allowing our JIT
compiler to assume <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models#small-code-model">a small code model</a>.</p>

<p>There are a number of ways to do this, and here’s one way to do it with
GNU-styled toolchains targeting COFF:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="nf">.section</span> <span class="nv">.exebuf</span><span class="p">,</span><span class="s">"bwx"</span>
        <span class="nf">.globl</span> <span class="nv">exebuf</span>
<span class="nl">exebuf:</span>	<span class="nf">.space</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">21</span>
</code></pre></div></div>

<p>This assembly program defines a new section named <code class="language-plaintext highlighter-rouge">.exebuf</code> containing 2M
of writable (<code class="language-plaintext highlighter-rouge">"w"</code>), executable (<code class="language-plaintext highlighter-rouge">"x"</code>) memory, allocated at run time just
like <code class="language-plaintext highlighter-rouge">.bss</code> (<code class="language-plaintext highlighter-rouge">"b"</code>). We’ll treat this like an arena out of which we can
allocate all trampolines we’ll probably ever need. With careful use of
<code class="language-plaintext highlighter-rouge">.pushsection</code> this could be basic inline assembly, but I’ve left it as a
separate source. On the C side I retrieve this like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="n">Arena</span> <span class="nf">get_exebuf</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">extern</span> <span class="kt">char</span> <span class="n">exebuf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">21</span><span class="p">];</span>
    <span class="n">Arena</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="n">exebuf</span><span class="p">,</span> <span class="n">exebuf</span><span class="o">+</span><span class="k">sizeof</span><span class="p">(</span><span class="n">exebuf</span><span class="p">)};</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately I have to repeat myself on the size. There are different
ways to deal with this, but this is simple enough for now. I would have
loved to define the array in C with the GCC <a href="https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variable-Attributes.html"><code class="language-plaintext highlighter-rouge">section</code> attribute</a>,
but as is usually the case with this attribute, it’s not up to the task,
lacking the ability to set section flags. Besides, by not relying on the
attribute, any C compiler could compile this source, and we only need a
GNU-style toolchain to create the tiny COFF object containing <code class="language-plaintext highlighter-rouge">exebuf</code>.</p>
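
<p>For the record, the <code class="language-plaintext highlighter-rouge">.pushsection</code> variant might look like this untested sketch, again valid only for GNU-style toolchains targeting COFF:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>asm (
    "        .pushsection .exebuf, \"bwx\"\n"
    "        .globl exebuf\n"
    "exebuf: .space 1&lt;&lt;21\n"
    "        .popsection"
);
</code></pre></div></div>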

<p>While we’re at it, a reminder of some other basic definitions we’ll need:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define S(s)            (Str){s, sizeof(s)-1}
#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>

<span class="n">Str</span> <span class="nf">clone</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="kt">char</span><span class="p">);</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which have been discussed at length in previous articles.</p>

<h3 id="trampoline-compiler">Trampoline compiler</h3>

<p>From here the plan is to create a function that accepts a <code class="language-plaintext highlighter-rouge">Wndproc5</code> and a
context pointer to bind, and returns a classic <code class="language-plaintext highlighter-rouge">WNDPROC</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WNDPROC</span> <span class="nf">make_wndproc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">Wndproc5</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>Our window procedure now gets a fifth argument with the program state:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LRESULT</span> <span class="nf">my_wndproc</span><span class="p">(</span><span class="n">HWND</span><span class="p">,</span> <span class="n">UINT</span><span class="p">,</span> <span class="n">WPARAM</span><span class="p">,</span> <span class="n">LPARAM</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When registering the class we wrap it in a trampoline compatible with
<code class="language-plaintext highlighter-rouge">RegisterClass</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">RegisterClassA</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">WNDCLASSA</span><span class="p">){</span>
        <span class="c1">// ...</span>
        <span class="p">.</span><span class="n">lpfnWndProc</span>   <span class="o">=</span> <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">),</span>
        <span class="p">.</span><span class="n">lpszClassName</span> <span class="o">=</span> <span class="s">"my_class"</span><span class="p">,</span>
        <span class="c1">// ...</span>
    <span class="p">});</span>
</code></pre></div></div>

<p>All windows using this class will readily have access to this state object
through their fifth parameter. It turns out setting up <code class="language-plaintext highlighter-rouge">exebuf</code> was the
more complicated part, and <code class="language-plaintext highlighter-rouge">make_wndproc</code> is quite simple!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WNDPROC</span> <span class="nf">make_wndproc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Wndproc5</span> <span class="n">proc</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">thunk</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span>
        <span class="s">"</span><span class="se">\x48\x83\xec\x28</span><span class="s">"</span>      <span class="c1">// sub   $40, %rsp</span>
        <span class="s">"</span><span class="se">\x48\xb8</span><span class="s">........"</span>      <span class="c1">// movq  $arg, %rax</span>
        <span class="s">"</span><span class="se">\x48\x89\x44\x24\x20</span><span class="s">"</span>  <span class="c1">// mov   %rax, 32(%rsp)</span>
        <span class="s">"</span><span class="se">\xe8</span><span class="s">...."</span>              <span class="c1">// call  proc</span>
        <span class="s">"</span><span class="se">\x48\x83\xc4\x28</span><span class="s">"</span>      <span class="c1">// add   $40, %rsp</span>
        <span class="s">"</span><span class="se">\xc3</span><span class="s">"</span>                  <span class="c1">// ret</span>
    <span class="p">);</span>
    <span class="n">Str</span> <span class="n">r</span>   <span class="o">=</span> <span class="n">clone</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">thunk</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">proc</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="mi">24</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span> <span class="mi">6</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">arg</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="mi">20</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rel</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">rel</span><span class="p">));</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">WNDPROC</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The assembly allocates a new stack frame, with callee shadow space, and
with room for the new argument, which also happens to re-align the stack.
It stores the new argument for the <code class="language-plaintext highlighter-rouge">Wndproc5</code> just above the shadow space.
It then calls into the <code class="language-plaintext highlighter-rouge">Wndproc5</code> without touching the other parameters. There
are two “patches” to fill out, which I’ve initially filled with dots: the
context pointer itself, and a 32-bit signed relative address for the call.
It’s going to be very near the callee. The only thing I don’t like about
this function is that I’ve manually worked out the patch offsets.</p>
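
<p>One remedy, sketched here with a hypothetical helper: derive the offsets from the template itself by scanning for the dot placeholders, which works because <code class="language-plaintext highlighter-rouge">0x2e</code> never appears among this thunk’s opcode bytes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

// Find the next run of n '.' placeholder bytes at or after "start".
// Returns -1 if absent. Resume past earlier patches so a short
// pattern cannot match inside a longer one.
static ptrdiff_t find_patch(const char *t, ptrdiff_t len,
                            ptrdiff_t start, ptrdiff_t n)
{
    for (ptrdiff_t i = start; i+n &lt;= len; i++) {
        ptrdiff_t run = 0;
        while (run &lt; n &amp;&amp; t[i+run] == '.') {
            run++;
        }
        if (run == n) {
            return i;
        }
    }
    return -1;
}
</code></pre></div></div>

<p>On the 29-byte thunk above it finds the 8-byte pointer patch at offset 6, then, resuming past it, the 4-byte displacement patch at offset 20, matching the hand-computed values.</p>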

<p>It’s probably not useful, but it’s easy to update the context pointer at
any time if we hold onto the trampoline pointer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">set_wndproc_arg</span><span class="p">(</span><span class="n">WNDPROC</span> <span class="n">p</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">memcpy</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="o">+</span><span class="mi">6</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">arg</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So, for instance:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">...;</span>  <span class="c1">// multiple states</span>
    <span class="n">WNDPROC</span> <span class="n">proc</span> <span class="o">=</span> <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="c1">// ...</span>
    <span class="n">set_wndproc_arg</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>  <span class="c1">// switch states</span>
</code></pre></div></div>

<p>Though I expect the most common case is just creating multiple procedures:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">WNDPROC</span> <span class="n">procs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
        <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span>
    <span class="p">};</span>
</code></pre></div></div>

<p>To my slight surprise these trampolines still work with an active <a href="https://learn.microsoft.com/en-us/windows/win32/secbp/control-flow-guard">Control
Flow Guard</a> system policy. Trampolines do not have stack unwind
entries, and I thought Windows might refuse to pass control to them.</p>

<p>Here’s a complete, runnable example if you’d like to try it yourself:
<a href="https://gist.github.com/skeeto/13363b78489b26bed7485ec0d6b2c7f8"><code class="language-plaintext highlighter-rouge">main.c</code> and <code class="language-plaintext highlighter-rouge">exebuf.s</code></a></p>

<h3 id="better-cases">Better cases</h3>

<p>This is more work than going through <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code>, and real programs
have a small, fixed number of window procedures — typically one — so this
isn’t the best example, but I wanted to illustrate with a real interface.
Again, perhaps the best real use is a library with a weak custom allocator
interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">malloc</span><span class="p">)(</span><span class="kt">size_t</span><span class="p">);</span>   <span class="c1">// no context pointer!</span>
    <span class="kt">void</span>  <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>     <span class="c1">// "</span>
<span class="p">}</span> <span class="n">Allocator</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">arena_malloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>

<span class="c1">// ...</span>

    <span class="n">Allocator</span> <span class="n">perm_allocator</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">make_trampoline</span><span class="p">(</span><span class="n">exearena</span><span class="p">,</span> <span class="n">arena_malloc</span><span class="p">,</span> <span class="n">perm</span><span class="p">),</span>
        <span class="p">.</span><span class="n">free</span>   <span class="o">=</span> <span class="n">noop_free</span><span class="p">,</span>
    <span class="p">};</span>
    <span class="n">Allocator</span> <span class="n">scratch_allocator</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">make_trampoline</span><span class="p">(</span><span class="n">exearena</span><span class="p">,</span> <span class="n">arena_malloc</span><span class="p">,</span> <span class="n">scratch</span><span class="p">),</span>
        <span class="p">.</span><span class="n">free</span>   <span class="o">=</span> <span class="n">noop_free</span><span class="p">,</span>
    <span class="p">};</span>
</code></pre></div></div>

<p>Something to keep in my back pocket for the future.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Assemblers in w64devkit, and other updates</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/08/10/"/>
    <id>urn:uuid:3e24c35f-9fac-470a-8225-2e0a0bc8f7ac</id>
    <updated>2025-08-10T15:42:50Z</updated>
    <category term="x86"/>
    <content type="html">
      <![CDATA[<p>Today I’m releasing <a href="https://github.com/skeeto/w64devkit/">w64devkit</a> 2.4.0, mostly for <a href="https://gcc.gnu.org/pipermail/gcc/2025-August/246491.html">GCC 15.2</a>. As
usual, it includes the continuous background improvements, and ideally
each release is the best so far. The <a href="/blog/2020/05/15/">first release</a> included the
Netwide Assembler, <a href="https://www.nasm.us/">NASM</a>, but it’s now been a year since I removed NASM
from the distribution (2.0.0). I’m asked on occasion why, or how to get it
back. Because I value thorough source control logs, my justifications for
this, and all changes, are captured in these logs, so <code class="language-plaintext highlighter-rouge">git log</code> is a kind
of miniature, project blog. I understand this is neither discoverable nor
obvious, especially because the GitHub UI (ugh) lacks anything like <code class="language-plaintext highlighter-rouge">git
log</code> in the terminal. So let’s talk about it here, along with other recent
changes.</p>

<p>NASM is a nice assembler, and I still generally prefer its x86 syntax to
the GNU Assembler, GAS. It’s tidy, self-contained, dependency-free other
than a C toolchain, reliable, and easy to build and cross compile, which is
why I included it in the first place. However, <em>it’s just not a good fit
for w64dk</em>. It’s redundant with Binutils <code class="language-plaintext highlighter-rouge">as</code>, which is already mandatory
for supporting GCC. As a rule, w64dk is a curation that avoids redundancy,
and a second assembler requires special justification. Originally it was
that the syntax was nicer, but last year I decided that wasn’t enough to
outweigh NASM’s weaknesses.</p>

<p>First, it doesn’t integrate well with the GNU toolchain, at least not to
the extent that <code class="language-plaintext highlighter-rouge">as</code> does. For example, many people don’t realize they can
assemble GAS assembly through the general <code class="language-plaintext highlighter-rouge">gcc</code> driver.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c myprogram.s
</code></pre></div></div>

<p>There’s rarely a reason to invoke <code class="language-plaintext highlighter-rouge">as</code> directly. Need a specific assembler
flag? Use <code class="language-plaintext highlighter-rouge">-Wa</code> or <code class="language-plaintext highlighter-rouge">-Xassembler</code>. Going through the compiler driver for
both assembly and linking (i.e. instead of <code class="language-plaintext highlighter-rouge">ld</code>) has the advantage of
operating at a “higher level.” The compiler driver knows about the whole
toolchain, and better understands <a href="https://peter0x44.github.io/posts/cross-compilers/">the sysroot</a>. Use a capital
file extension, <code class="language-plaintext highlighter-rouge">.S</code>, and <code class="language-plaintext highlighter-rouge">gcc</code> will automatically run it through the C
preprocessor.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c myprogram.S
</code></pre></div></div>

<p>This is quite nifty, especially for cross builds. NASM isn’t so integrated
and so requires invoking the <code class="language-plaintext highlighter-rouge">nasm</code> command directly. It’s not a big deal,
but it’s friction that adds up.</p>
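
<p>To illustrate the first point: need an assembler listing? Ask <code class="language-plaintext highlighter-rouge">as</code> for one through the driver rather than invoking it directly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -Wa,-al myprogram.s
</code></pre></div></div>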

<p>But even more, the most useful form of assembly in the context of w64dk is
<a href="/blog/2024/12/20/">inline assembly</a>, which of course is completely out of NASM’s lane.
So most of the time you’re going to be writing GAS anyway. Again, NASM is
just not integrated into the toolchain like <code class="language-plaintext highlighter-rouge">as</code>.</p>

<p>Second, NASM is a <em>dreadfully slow</em> assembler — in the vicinity of <em>two
orders of magnitude slower</em> than GAS! If your assembly program is a couple
hundred lines, no big deal. If you’re writing a compiler targeting NASM,
it’s impractical beyond toy programs. Friends don’t let friends use NASM
as a back-end. If you’re so allergic to GAS, note that <a href="https://github.com/yasm/yasm">YASM</a> has
matching syntax and better performance.</p>

<p>Third, and the last nail in the coffin, NASM doesn’t support DWARF debug
information in Windows targets (<code class="language-plaintext highlighter-rouge">win32</code>/<code class="language-plaintext highlighter-rouge">win64</code>). That means you cannot
debug NASM programs with GDB on Windows, at least not with source-level
debugging. Were you even aware GAS has source-level debugging with GDB?
Sure, you can show the assembly pane (<code class="language-plaintext highlighter-rouge">layout asm</code>) and step through the
<em>disassembly</em> with <code class="language-plaintext highlighter-rouge">ni</code>, but stepping through the original source (<code class="language-plaintext highlighter-rouge">layout
src</code>) is a whole different experience. Even better, bring up the register
pane (<code class="language-plaintext highlighter-rouge">layout regs</code>) and you have something akin to a Visual Studio watch
window. It’s such a pleasant experience, and yet no tutorial I’ve seen has
ever mentioned it. It should be the first thing people are taught. <em>Get
your act together, assembly tutorials!</em></p>
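
<p>If you’ve never seen it, a session goes something like this, with <code class="language-plaintext highlighter-rouge">demo.S</code> standing in for your assembly source (assuming it defines <code class="language-plaintext highlighter-rouge">main</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -g -o demo.exe demo.S
$ gdb demo.exe
(gdb) break main
(gdb) run
(gdb) layout src
(gdb) layout regs
(gdb) ni
</code></pre></div></div>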

<p>In theory, YASM ought to solve this with its <code class="language-plaintext highlighter-rouge">-g dwarf2</code>, but alas this
feature appears to be broken. So that really just leaves GAS as the most
practical game in town for assembly on Windows with GNU-style toolchains.
In case it helps: <a href="https://archive.org/details/h42_Assembly_Language_Programming_for_PDP-11_and_LSL-11_Computers_ISBN_0-697-08164-8/page/n1/mode/2up">Learning a little PDP-11 assembly</a> gave me a
deeper understanding — and appreciation — of why GAS x86 is the way it is.
Makes it sting a little less than before.</p>

<h3 id="compiling-nasm">Compiling NASM</h3>

<p>At the same time w64dk lost NASM, it gained the ability to run Autotools
<code class="language-plaintext highlighter-rouge">configure</code> scripts. It only took a few environment variables as hints for
the script, and <a href="https://github.com/skeeto/w64devkit/commit/7785eb9c">a small hack in the shell</a> to <em>undo</em> an unnecessary
Autotools hack. The typical native <code class="language-plaintext highlighter-rouge">cmd.exe</code> invocation with a command
uses <code class="language-plaintext highlighter-rouge">/c</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd /c echo hello world
</code></pre></div></div>

<p>It’s <a href="/blog/2022/02/18/"><em>roughly</em></a> equivalent to <code class="language-plaintext highlighter-rouge">-c</code> in a unix shell. This is what
you’d use if you’re in another shell and you need to invoke <code class="language-plaintext highlighter-rouge">cmd</code> for a
specific purpose. However, if you’re in an MSYS2 shell with its virtual
file system, <code class="language-plaintext highlighter-rouge">/c</code> looks like a path to the C drive. So MSYS2 “helpfully”
translates the switch to a native path, something like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd C:\ echo hello world
</code></pre></div></div>

<p>Not helpful at all. So Autotools, assuming Cygwin-like environments are
the only ones that exist, uses a special escape form when invoking <code class="language-plaintext highlighter-rouge">cmd</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd //c echo hello world
</code></pre></div></div>

<p>Which Cygwin-like environments translate into the desired <code class="language-plaintext highlighter-rouge">/c</code>. If you’re
not in such an environment, then <code class="language-plaintext highlighter-rouge">cmd.exe</code> sees the <code class="language-plaintext highlighter-rouge">//c</code>, which doesn’t
work. So the <a href="https://frippery.org/busybox/">busybox-w32</a> shell now pattern matches for precisely:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmd //c echo [...]
</code></pre></div></div>

<p>For a similar translation. That’s right, it matches <code class="language-plaintext highlighter-rouge">echo</code> in particular
because that’s the only <code class="language-plaintext highlighter-rouge">cmd</code> feature Autotools uses. So it’s completely
unnecessary, just poor code generation.</p>

<p>With that work in place, you can download a NASM source release, untar it,
run <code class="language-plaintext highlighter-rouge">./configure</code>, <code class="language-plaintext highlighter-rouge">make -j$(nproc)</code>, and copy the resulting <code class="language-plaintext highlighter-rouge">nasm.exe</code>
into w64dk or wherever else on your <code class="language-plaintext highlighter-rouge">$PATH</code> is convenient. The same is
true for quite a bit of software! You can build Binutils, including <code class="language-plaintext highlighter-rouge">as</code>
itself, exactly the same way. Being so easy for users to build their own
tools means I’m less concerned with including extraneous, more specialized
tools, such as NASM.</p>
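
<p>Concretely, with the version and destination as placeholders, the whole NASM procedure is roughly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xf nasm-X.YY.tar.gz
$ cd nasm-X.YY
$ ./configure
$ make -j$(nproc)
$ cp nasm.exe /path/to/w64devkit/bin/
</code></pre></div></div>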

<h3 id="path-style">Path Style</h3>

<p>Borrowing <a href="https://www.msys2.org/wiki/MSYS2-introduction/#path">a concept from MSYS2</a>, <code class="language-plaintext highlighter-rouge">w64devkit.ini</code> now has a <code class="language-plaintext highlighter-rouge">path
style</code> option for controlling the initial <code class="language-plaintext highlighter-rouge">PATH</code>, using <a href="https://github.com/msys2/MSYS2-packages/blob/ae252e94/filesystem/profile#L28-L45">the same names
and configuration</a>. I’ve already found it useful in testing w64dk
itself in a relatively pristine, hermetic environment:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[w64devkit]</span>
<span class="py">home</span> <span class="p">=</span> <span class="s">.</span>
<span class="err">path</span> <span class="py">style</span> <span class="p">=</span> <span class="s">strict</span>
</code></pre></div></div>

<p>This uses the <code class="language-plaintext highlighter-rouge">w64devkit/</code> directory itself as <code class="language-plaintext highlighter-rouge">$HOME</code>, and <code class="language-plaintext highlighter-rouge">$PATH</code> is
initially just the w64dk <code class="language-plaintext highlighter-rouge">bin/</code> directory. See the <code class="language-plaintext highlighter-rouge">w64devkit.ini</code> header
for full documentation.</p>

<p>Otherwise, most of the major features have been discussed already:
<a href="/blog/2024/06/30/">peports and vc++filt</a>, <a href="/blog/2023/01/18/">pkg-config</a>, and <a href="/blog/2025/02/17/">xxd</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A more robust raw OpenBSD syscall demo</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/03/06/"/>
    <id>urn:uuid:f7101ee1-a2e6-4895-b763-bd7b2a842280</id>
    <updated>2025-03-06T02:43:20Z</updated>
    <category term="c"/><category term="bsd"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Ted Unangst published <a href="https://flak.tedunangst.com/post/dude-where-are-your-syscalls"><em>dude, where are your syscalls?</em></a> on flak
yesterday, with a neat demonstration of OpenBSD’s <a href="https://undeadly.org/cgi?action=article;sid=20230222064027">pinsyscall</a>
security feature, whereby only pre-registered addresses are allowed to
make system calls. Whether it strengthens or weakens security is <a href="https://isopenbsdsecu.re/mitigations/pinsyscall/">up for
debate</a>, but regardless it’s an interesting, low-level programming
challenge. The original demo is fragile for multiple reasons, and requires
manually locating and entering addresses for each build. In this article I
show how to fix it. To prove that it’s robust, I ported an entire, real
application to use raw system calls on OpenBSD.</p>

<p>The original program uses ARM64 assembly. I’m a lot more comfortable with
x86-64 assembly, plus that’s the hardware I have readily on hand. So the
assembly language will be different, but all the concepts apply to both
these architectures. Almost none of these OpenBSD system interfaces are
formally documented (or stable for that matter), and I had to dig around
the OpenBSD source tree to figure it out (along with a <a href="https://news.ycombinator.com/item?id=26290723">helpful jart
nudge</a>). So don’t be afraid to get your hands dirty.</p>

<p>There are lots of subtle problems in the original demo, so let’s go
through the program piece by piece, starting with the entry point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">start</span><span class="p">()</span>
<span class="p">{</span>
        <span class="n">w</span><span class="p">(</span><span class="s">"hello</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
        <span class="n">x</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function is registered as the entry point in the ELF image, so it has
no caller. <del>That means no return address on the stack, so the stack is
not aligned for a function.</del>(<strong>Correction</strong>: The stack alignment issue is
true for x86, but not ARM, so the original demo is fine.) In toy programs
that goes unnoticed, but compilers generate code assuming the stack is
aligned. In a real application this is likely to crash deep on the first
SIMD register spill.</p>

<p>We could fix this with a <a href="https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-force_005falign_005farg_005fpointer-function-attribute_002c-x86"><code class="language-plaintext highlighter-rouge">force_align_arg_pointer</code></a> attribute, at
least for architectures that support it, but I prefer to write the entry
point in assembly. Especially so we can access the command line arguments
and environment variables, which is necessary in a real application. That
happens to work the same as it does on Linux, so here’s my old, familiar
entry point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="p">(</span>
    <span class="s">"        .globl _start</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"_start: mov   %rsp, %rdi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"        call  start</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Per the ABI, the first argument passes through <code class="language-plaintext highlighter-rouge">rdi</code>, so I pass a copy of
the stack pointer, <code class="language-plaintext highlighter-rouge">rsp</code>, as it appeared on entry. Entry point arguments
<code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, and <code class="language-plaintext highlighter-rouge">envp</code> are all pushed on the stack at <code class="language-plaintext highlighter-rouge">rsp</code>, so the
first real function can retrieve it all from just the stack pointer. The
original demo won’t use it, though. Using <code class="language-plaintext highlighter-rouge">call</code> to pass control pushes a
return address, which will never be used, and aligns the stack for the
first real function. I name it <code class="language-plaintext highlighter-rouge">_start</code> because that’s what the linker
expects, which makes things go a little smoother, and it’s rather convenient
that the original didn’t use this name.</p>

<p>Next up, the “write” function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">w</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">what</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="kr">__asm</span><span class="p">(</span>
<span class="s">"       mov x2, x1;"</span>
<span class="s">"       mov x1, x0;"</span>
<span class="s">"       mov w0, #1;"</span>
<span class="s">"       mov x8, #4;"</span>
<span class="s">"       svc #0;"</span>
        <span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There are two <a href="/blog/2024/12/20/">serious problems with this assembly block</a>. First, the
function arguments are not necessarily in those registers by the time
control reaches the basic assembly block. The function prologue could move
them around. Even more so if this function was inlined. This is exactly
the problem <em>extended</em> inline assembly is intended to solve. Second, it
clobbers a number of registers. Compilers assume this does not happen when
generating their own code. This sort of assembly falls apart the moment it
comes into contact with a non-zero optimization level.</p>

<p>Solving this is just a matter of using inline assembly properly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">w</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">what</span><span class="p">,</span> <span class="kt">long</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">err</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">rax</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>  <span class="c1">// SYS_write</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"+a"</span><span class="p">(</span><span class="n">rax</span><span class="p">),</span> <span class="s">"+d"</span><span class="p">(</span><span class="n">len</span><span class="p">),</span> <span class="s">"=@ccc"</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"D"</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">what</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">err</span> <span class="o">?</span> <span class="o">-</span><span class="n">rax</span> <span class="o">:</span> <span class="n">rax</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’ve enhanced it a bit, returning a <a href="/blog/2016/09/23/">Linux-style negative errno</a> on
error. In the BSD ecosystem, syscall errors are indicated using the carry
flag, which here is output into <code class="language-plaintext highlighter-rouge">err</code> via <code class="language-plaintext highlighter-rouge">=@ccc</code>. When set, the return
value is an errno. Further, the OpenBSD kernel uses both <code class="language-plaintext highlighter-rouge">rax</code> and <code class="language-plaintext highlighter-rouge">rdx</code>
for return values, so I’ve also listed <code class="language-plaintext highlighter-rouge">rdx</code> as an input+output despite
not consuming the result. Even with all these changes, this function is not
yet complete! We’ll get back to it later.</p>

<p>The “exit” function, <code class="language-plaintext highlighter-rouge">x</code>, is just fine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">x</span><span class="p">()</span> <span class="p">{</span>
        <span class="kr">__asm</span><span class="p">(</span>
<span class="s">"       mov x8, #1;"</span>
<span class="s">"       svc #0;"</span>
        <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It doesn’t set an exit status, so it passes garbage instead, but otherwise
this works. No inputs, plus clobbers and outputs don’t matter when control
never returns. In a real application I might write it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">x</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"syscall"</span> <span class="o">::</span> <span class="s">"a"</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">status</span><span class="p">));</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function will need a little additional work later, too.</p>

<p>The <code class="language-plaintext highlighter-rouge">ident</code> section is basically fine as-is:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__asm</span><span class="p">(</span><span class="s">" .section </span><span class="se">\"</span><span class="s">.note.openbsd.ident</span><span class="se">\"</span><span class="s">, </span><span class="se">\"</span><span class="s">a</span><span class="se">\"\n</span><span class="s">"</span>
<span class="s">"       .p2align 2</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .long   8</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .long   4</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .long   1</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .ascii </span><span class="se">\"</span><span class="s">OpenBSD</span><span class="se">\\</span><span class="s">0</span><span class="se">\"\n</span><span class="s">"</span>
<span class="s">"       .long   0</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .previous</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
</code></pre></div></div>

<p>The compiler assumes the current section remains the same at the end of
the assembly block, which here is accomplished with <code class="language-plaintext highlighter-rouge">.previous</code>. Though it
clobbers the assembler’s remembered “other” section and so may interfere
with surrounding code using <code class="language-plaintext highlighter-rouge">.previous</code>. Better to use <code class="language-plaintext highlighter-rouge">.pushsection</code> and
<code class="language-plaintext highlighter-rouge">.popsection</code> for good stack discipline. There are many such examples in
the OpenBSD source tree.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="p">(</span>
    <span class="s">".pushsection .note.openbsd.ident, </span><span class="se">\"</span><span class="s">a</span><span class="se">\"\n</span><span class="s">"</span>
    <span class="s">".long  8, 4, 1, 0x6e65704f, 0x00445342, 0</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">".popsection</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Now the trickiest part, the pinsyscall table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">whats</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">offset</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">sysno</span><span class="p">;</span>
<span class="p">}</span> <span class="n">happening</span><span class="p">[]</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">section</span><span class="p">(</span><span class="s">".openbsd.syscalls"</span><span class="p">)))</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">{</span> <span class="mh">0x104f4</span><span class="p">,</span> <span class="mi">4</span> <span class="p">},</span>
        <span class="p">{</span> <span class="mh">0x10530</span><span class="p">,</span> <span class="mi">1</span> <span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Those offsets — offsets from the beginning of the ELF image — were entered
manually, and it kind of ruins the whole demo. We don’t have a good way to
get at those offsets from C, or any high level language. However, we can
solve that by tweaking the inline assembly with some labels:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noinline</span><span class="p">))</span>
<span class="kt">long</span> <span class="nf">w</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">what</span><span class="p">,</span> <span class="kt">long</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"_w: syscall"</span>
        <span class="c1">// ...</span>
    <span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">noinline</span><span class="p">,</span><span class="n">noreturn</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">x</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"_x: syscall"</span>
        <span class="c1">// ...</span>
    <span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Very importantly I’ve added <code class="language-plaintext highlighter-rouge">noinline</code> to prevent these functions from
being inlined into additional copies of the <code class="language-plaintext highlighter-rouge">syscall</code> instruction, which
of course won’t be registered. This also prevents duplicate labels causing
assembler errors. Once we have the labels, we can use them in an assembly
block listing the allowed syscall instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="p">(</span>
    <span class="s">".pushsection .openbsd.syscalls</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">".long  _x, 1</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">".long  _w, 4</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">".popsection</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>
</code></pre></div></div>

<p>That lets the linker solve the offsets problem, which is its main job
after all. With these changes the demo works reliably, even under high
optimization levels. I suggest these flags:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -static -nostdlib -no-pie -o where where.c
</code></pre></div></div>

<p>Disabling PIE with <code class="language-plaintext highlighter-rouge">-no-pie</code> is necessary in real applications or else
strings won’t work. You can apply more flags to strip it down further, but
these are the flags generally necessary to compile these sorts of programs
on at least OpenBSD 7.6.</p>

<p>So, how do I know this stuff works in general? Because I ported <a href="/blog/2023/01/18/">my ultra
portable pkg-config clone, u-config</a>, to use raw OpenBSD syscalls:
<strong><a href="https://github.com/skeeto/u-config/blob/openbsd/openbsd_main.c"><code class="language-plaintext highlighter-rouge">openbsd_main.c</code></a></strong>. Everything still works at high optimization
levels.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c libmemory.a
$ ./pkg-config --cflags --libs libcurl
-I/usr/local/include -L/usr/local/lib -lcurl
</code></pre></div></div>

<p>Because the new syscall wrappers behave just like Linux system calls, it
leverages the <code class="language-plaintext highlighter-rouge">linux_noarch.c</code> platform, and the whole port is ~70 lines
of code. A few more flags (<code class="language-plaintext highlighter-rouge">-fno-stack-protector</code>, <code class="language-plaintext highlighter-rouge">-Oz</code>, <code class="language-plaintext highlighter-rouge">-s</code>, etc.), and
it squeezes into a slim 21.6K static binary.</p>

<p>Despite making no libc calls, it’s not possible to stop compilers from
fabricating (<a href="/blog/2024/11/10/">hallucinating?</a>) string function calls, so the build
above depends on external definitions. In the command above, <code class="language-plaintext highlighter-rouge">libmemory.a</code>
comes from <a href="https://github.com/skeeto/w64devkit/blob/master/src/libmemory.c"><code class="language-plaintext highlighter-rouge">libmemory.c</code></a> found <a href="/blog/2024/02/05/">in w64devkit</a>. Alternatively,
<a href="https://flak.tedunangst.com/post/you-dont-link-all-of-libc">and on topic</a>, you could link the OpenBSD libc string functions by
omitting <code class="language-plaintext highlighter-rouge">libmemory.a</code> from the build.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c
</code></pre></div></div>

<p>Though it pulls in a lot of bloat (~8x size increase), and teasing out the
necessary objects isn’t trivial.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Practical libc-free threading on Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/03/23/"/>
    <id>urn:uuid:631a8107-2eef-420b-9594-752e6f013048</id>
    <updated>2023-03-23T05:32:41Z</updated>
    <category term="c"/><category term="optimization"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re <a href="/blog/2023/02/15/">not using a C runtime</a> on Linux, and instead you’re
programming against its system call API. It’s long-term and stable after
all. <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">Memory management</a> and <a href="/blog/2023/02/13/">buffered I/O</a> are easily
solved, but a lot of software benefits from concurrency. It would be nice
to also have thread spawning capability. This article will demonstrate a
simple, practical, and robust approach to spawning and managing threads
using only raw system calls. It only takes about a dozen lines of C,
including a few inline assembly instructions.</p>

<p>The catch is that there’s no way to avoid using a bit of assembly. Neither
the <code class="language-plaintext highlighter-rouge">clone</code> nor <code class="language-plaintext highlighter-rouge">clone3</code> system calls have threading semantics compatible
with C, so you’ll need to paper over it with a bit of inline assembly per
architecture. This article will focus on x86-64, but the basic concept
should work on all architectures supported by Linux. The <a href="https://man7.org/linux/man-pages/man2/clone.2.html">glibc <code class="language-plaintext highlighter-rouge">clone(2)</code>
wrapper</a> fits a C-compatible interface on top of the raw system call,
but we won’t be using it here.</p>

<p>Before diving in, the complete, working demo: <a href="https://github.com/skeeto/scratch/blob/master/misc/stack_head.c"><strong><code class="language-plaintext highlighter-rouge">stack_head.c</code></strong></a></p>

<h3 id="the-clone-system-call">The clone system call</h3>

<p>On Linux, threads are spawned using the <code class="language-plaintext highlighter-rouge">clone</code> system call with semantics
like the classic unix <code class="language-plaintext highlighter-rouge">fork(2)</code>. One process goes in, two processes come
out in nearly the same state. For threads, those processes share almost
everything and differ only by two registers: the return value — zero in
the new thread — and stack pointer. Unlike typical thread spawning APIs,
the application does not supply an entry point. It only provides a stack
for the new thread. The simple form of the raw clone API looks something
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">clone</span><span class="p">(</span><span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>
</code></pre></div></div>

<p>Sounds kind of elegant, but it has an annoying problem: The new thread
begins life in the <em>middle</em> of a function without any established stack
frame. Its stack is a blank slate. It’s not ready to do anything except
jump to a function prologue that will set up a stack frame. So besides the
assembly for the system call itself, it also needs more assembly to get
the thread into a C-compatible state. In other words, <strong>a generic system
call wrapper cannot reliably spawn threads</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">brokenclone</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">threadentry</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">long</span> <span class="n">r</span> <span class="o">=</span> <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_clone</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">stack</span><span class="p">);</span>
    <span class="c1">// DANGER: new thread may access non-existent stack frame here</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">threadentry</span><span class="p">(</span><span class="n">arg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For odd historical reasons, each architecture’s <code class="language-plaintext highlighter-rouge">clone</code> has a slightly
different interface. The newer <code class="language-plaintext highlighter-rouge">clone3</code> unifies these differences, but it
suffers from the same thread spawning issue above, so it’s not helpful
here.</p>

<h3 id="the-stack-header">The stack “header”</h3>

<p>I <a href="/blog/2015/05/15/">figured out a neat trick eight years ago</a> which I continue to use
today. The parent and child threads are in nearly identical states when
the new thread starts, but the immediate goal is to diverge. As noted, one
difference is their stack pointers. To diverge their execution, we could
make their execution depend on the stack. An obvious choice is to push
different return pointers on their stacks, then let the <code class="language-plaintext highlighter-rouge">ret</code> instruction
do the work.</p>

<p>Carefully preparing the new stack ahead of time is the key to everything,
and there’s a straightforward technique that I like to call the <code class="language-plaintext highlighter-rouge">stack_head</code>,
a structure placed at the high end of the new stack. Its first element
must be the entry point pointer, and this entry point will receive a
pointer to its own <code class="language-plaintext highlighter-rouge">stack_head</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>The structure must have 16-byte alignment on all architectures. I used an
attribute to help keep this straight, and it can help when using <code class="language-plaintext highlighter-rouge">sizeof</code>
to place the structure, as I’ll demonstrate later.</p>

<p>Now for the cool part: The <code class="language-plaintext highlighter-rouge">...</code> can be anything you want! Use that area
to seed the new stack with whatever thread-local data is necessary. It’s a
neat feature you don’t get from standard thread spawning interfaces. If I
plan to “join” a thread later — wait until it’s done with its work — I’ll
put a join futex in this space:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">join_futex</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>More details on that futex shortly.</p>

<h3 id="the-clone-wrapper">The clone wrapper</h3>

<p>I call the <code class="language-plaintext highlighter-rouge">clone</code> wrapper <code class="language-plaintext highlighter-rouge">newthread</code>. It has the inline assembly for the
system call, and since it includes a <code class="language-plaintext highlighter-rouge">ret</code> to diverge the threads, it’s a
“naked” function <a href="/blog/2023/02/12/">just like with <code class="language-plaintext highlighter-rouge">setjmp</code></a>. The compiler will
generate no prologue or epilogue, and the function body is limited to
inline assembly without input/output operands. It cannot even reliably
reference its parameters by name. Like <code class="language-plaintext highlighter-rouge">clone</code>, it doesn’t accept a thread
entry point. Instead it accepts a <code class="language-plaintext highlighter-rouge">stack_head</code> seeded with the entry
point. The whole wrapper is just six instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="kr">naked</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">long</span> <span class="nf">newthread</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"mov  %%rdi, %%rsi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// arg2 = stack</span>
        <span class="s">"mov  $0x50f00, %%edi</span><span class="se">\n</span><span class="s">"</span>  <span class="c1">// arg1 = clone flags</span>
        <span class="s">"mov  $56, %%eax</span><span class="se">\n</span><span class="s">"</span>       <span class="c1">// SYS_clone</span>
        <span class="s">"syscall</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov  %%rsp, %%rdi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// entry point argument</span>
        <span class="s">"ret</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="o">:</span> <span class="o">:</span> <span class="s">"rax"</span><span class="p">,</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"rsi"</span><span class="p">,</span> <span class="s">"rdi"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On x86-64, both function calls and system calls use <code class="language-plaintext highlighter-rouge">rdi</code> and <code class="language-plaintext highlighter-rouge">rsi</code> for
their first two parameters. Per the reference <code class="language-plaintext highlighter-rouge">clone(2)</code> prototype above:
the first system call argument is <code class="language-plaintext highlighter-rouge">flags</code> and the second argument is the
new <code class="language-plaintext highlighter-rouge">stack</code>, which will point directly at the <code class="language-plaintext highlighter-rouge">stack_head</code>. However,
<code class="language-plaintext highlighter-rouge">stack</code> arrives as the first <em>function</em> argument, in <code class="language-plaintext highlighter-rouge">rdi</code>. So I copy <code class="language-plaintext highlighter-rouge">stack</code> into the second argument
register, <code class="language-plaintext highlighter-rouge">rsi</code>, then load the flags (<code class="language-plaintext highlighter-rouge">0x50f00</code>) into the first argument
register, <code class="language-plaintext highlighter-rouge">rdi</code>. The system call number goes in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<p>Where does that <code class="language-plaintext highlighter-rouge">0x50f00</code> come from? That’s the bare minimum thread spawn
flag set in hexadecimal. If any flag is missing then the new thread will
not behave correctly. It’s computed normally like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">long</span> <span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FILES</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FS</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SIGHAND</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SYSVSEM</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_THREAD</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_VM</span><span class="p">;</span>
</code></pre></div></div>
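<p>Plugging in the flag values from the kernel headers confirms the sum. A
quick sanity check (not part of the program; <code class="language-plaintext highlighter-rouge">linux/sched.h</code> provides the
<code class="language-plaintext highlighter-rouge">CLONE_*</code> constants):</p>

```c
#include <linux/sched.h>  // CLONE_* flag definitions

// Recompute the minimum thread-spawn flag set; it comes out to 0x50f00.
static long thread_flags(void)
{
    long flags = 0;
    flags |= CLONE_FILES;    // share the file descriptor table
    flags |= CLONE_FS;       // share cwd, umask, and root
    flags |= CLONE_SIGHAND;  // share signal handlers
    flags |= CLONE_SYSVSEM;  // share System V semaphore undo lists
    flags |= CLONE_THREAD;   // place the child in the caller's thread group
    flags |= CLONE_VM;       // share the address space
    return flags;  // == 0x50f00
}
```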

<p>When the system call returns, the next instruction copies the stack
pointer into <code class="language-plaintext highlighter-rouge">rdi</code>, the first argument for the entry point. In the new
thread the stack pointer will be the same value as <code class="language-plaintext highlighter-rouge">stack</code>, of course. In
the old thread this clobber is harmless because <code class="language-plaintext highlighter-rouge">rdi</code> is a volatile
(caller-saved) register in this ABI. Finally,
<code class="language-plaintext highlighter-rouge">ret</code> pops the address at the top of the stack and jumps. In the old
thread this returns to the caller with the system call result, either an
error (<a href="/blog/2016/09/23/">negative errno</a>) or the new thread ID. In the new thread
<strong>it pops the first element of <code class="language-plaintext highlighter-rouge">stack_head</code></strong> which, of course, is the
entry point. That’s why it must be first!</p>
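<p>Concretely, any <code class="language-plaintext highlighter-rouge">stack_head</code> satisfying that requirement works. A minimal
sketch (illustrative only; the essential part is that the entry pointer
comes first, and the <code class="language-plaintext highlighter-rouge">aligned</code> attribute discussed later keeps each slot
aligned):</p>

```c
#include <stddef.h>

// Illustrative sketch of a stack_head: the entry pointer must be the
// first member so the new thread's "ret" pops it as its jump target.
struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);  // popped by ret in the new thread
    int join_futex;                      // 0 = running, 1 = done
    // ... other thread data ...
};
```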

<p>The thread has nowhere to return from the entry point, so when it’s done
it must either block indefinitely or use the <code class="language-plaintext highlighter-rouge">exit</code> (<em>not</em> <code class="language-plaintext highlighter-rouge">exit_group</code>)
system call to terminate itself.</p>

<h3 id="caller-point-of-view">Caller point of view</h3>

<p>The caller side looks something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">threadentry</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do work ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
    <span class="n">futex_wake</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">);</span>
    <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span> <span class="o">=</span> <span class="n">newstack</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">16</span><span class="p">);</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">entry</span> <span class="o">=</span> <span class="n">threadentry</span><span class="p">;</span>
    <span class="c1">// ... assign other thread data ...</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">newthread</span><span class="p">(</span><span class="n">stack</span><span class="p">);</span>

    <span class="c1">// ... do work ...</span>

    <span class="n">futex_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">exit_group</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Despite the minimalist, 6-instruction clone wrapper, this is taking the
shape of a conventional threading API. It would only take a bit more to
hide the futex, too. Speaking of which, what’s going on there? The <a href="/blog/2022/10/05/">same
principle as a WaitGroup</a>. The futex, an integer, is zero-initialized,
indicating the thread is running (“not done”). The joiner tells the kernel
to wait until the integer is non-zero, which it may already be since I
don’t bother to check first. When the child thread is done, it atomically
sets the futex to non-zero and wakes all waiters, which might be nobody.</p>

<p>Caveat: It’s not safe to free/reuse the stack after a successful join. It
only indicates the thread is done with its work, not that it exited. You’d
need to wait for its <code class="language-plaintext highlighter-rouge">SIGCHLD</code> (or use <code class="language-plaintext highlighter-rouge">CLONE_CHILD_CLEARTID</code>). If this
sounds like a problem, consider <a href="https://vimeo.com/644068002">your context</a> more carefully: Why do
you feel the need to free the stack? It will be freed when the process
exits. Worried about leaking stacks? Why are you starting and exiting an
unbounded number of threads? In the worst case park the thread in a thread
pool until you need it again. Only worry about this sort of thing if
you’re building a general purpose threading API like pthreads. I know it’s
tempting, but avoid doing that unless you absolutely must.</p>

<p>What’s with the <code class="language-plaintext highlighter-rouge">force_align_arg_pointer</code>? Linux doesn’t enter a process
the way a System V ABI function call would, so processes begin life with a
stack alignment compiled functions don’t expect. This attribute tells GCC to fix up the
stack alignment in the entry point prologue, <a href="/blog/2023/02/15/#stack-alignment-on-32-bit-x86">just like on Windows</a>.
If you want to access <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, and <code class="language-plaintext highlighter-rouge">envp</code> you’ll need <a href="/blog/2022/02/18/">more
assembly</a>. (I wish doing <em>really basic things</em> without libc on Linux
didn’t require so much assembly.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__asm</span> <span class="p">(</span>
    <span class="s">".global _start</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"_start:</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  (%rsp), %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsp), %rsi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsi,%rdi,8), %rdx</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   call  main</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  %eax, %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  $60, %eax</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   syscall</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">envp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Getting back to the example usage, it has some regular-looking system call
wrappers. Where do those come from? Start with this 6-argument generic
system call wrapper.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">syscall6</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="n">b</span><span class="p">,</span> <span class="kt">long</span> <span class="n">c</span><span class="p">,</span> <span class="kt">long</span> <span class="n">d</span><span class="p">,</span> <span class="kt">long</span> <span class="n">e</span><span class="p">,</span> <span class="kt">long</span> <span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r10</span> <span class="n">asm</span><span class="p">(</span><span class="s">"r10"</span><span class="p">)</span> <span class="o">=</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r8</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r8"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">e</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r9</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r9"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">ret</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r10</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r8</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r9</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I could define <code class="language-plaintext highlighter-rouge">syscall5</code>, <code class="language-plaintext highlighter-rouge">syscall4</code>, etc. but instead I’ll just wrap it
in macros. The former would be more efficient since the latter wastes
instructions zeroing registers for no reason, but for now I’m focused on
compacting the implementation source.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))
</span></code></pre></div></div>

<p>Now we can have some exits:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit_group</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit_group</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Simplified futex wrappers:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">expect</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL4</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">expect</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL3</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="mh">0x7fffffff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And so on.</p>

<p>Finally I can talk about that <code class="language-plaintext highlighter-rouge">newstack</code> function. It’s just a wrapper
around an anonymous memory map allocating pages from the kernel. I’ve
hardcoded the constants for the standard mmap allocation since they’re
nothing special or unusual. The return value check is a little tricky
since a large portion of the negative range is valid, so I only want to
check for a small range of negative errnos. (Allocating an arena looks
basically the same.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="nf">newstack</span><span class="p">(</span><span class="kt">long</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">p</span> <span class="o">=</span> <span class="n">SYSCALL6</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x22</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="kt">long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">size</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span><span class="p">);</span>
    <span class="k">return</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span> <span class="o">+</span> <span class="n">count</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">aligned</code> attribute comes into play here: I treat the result like an
array of <code class="language-plaintext highlighter-rouge">stack_head</code> and return the last element. The attribute ensures
each individual element is aligned.</p>

<p>That’s it! There’s not much to it other than a few thoughtful assembly
instructions. It took doing this a few times in a few different programs
before I noticed how simple it can be.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Luhn algorithm using SWAR and SIMD</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/04/30/"/>
    <id>urn:uuid:2bb8fbd6-4197-4799-8258-861d316a7086</id>
    <updated>2022-04-30T17:53:05Z</updated>
    <category term="c"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Ever been so successful that credit card processing was your bottleneck?
Perhaps you’ve wondered, “If only I could compute check digits three times
faster using the same hardware!” Me neither. But if that ever happens
someday, then this article is for you. I will show how to compute the
<a href="https://en.wikipedia.org/wiki/Luhn_algorithm">Luhn algorithm</a> in parallel using <em>SIMD within a register</em>, or
SWAR.</p>

<p>If you want to skip ahead, here’s the full source, tests, and benchmark:
<a href="https://github.com/skeeto/scratch/blob/master/misc/luhn.c"><code class="language-plaintext highlighter-rouge">luhn.c</code></a></p>

<p>The Luhn algorithm isn’t just for credit card numbers, but they do make a
nice target for a SWAR approach. The major payment processors use <a href="https://www.paypalobjects.com/en_GB/vhelp/paypalmanager_help/credit_card_numbers.htm">16
digit numbers</a> — i.e. 16 ASCII bytes — and typical machines today have
8-byte registers, so the input fits into two machine registers. In this
context, the algorithm works like so:</p>

<ol>
  <li>
    <p>Consider the number as an array of digits, and double every other digit
starting with the first. For example, 6543 becomes 12, 5, 8, 3.</p>
  </li>
  <li>
    <p>Sum individual digits in each element. The example becomes 3 (i.e.
1+2), 5, 8, 3.</p>
  </li>
  <li>
    <p>Sum the array mod 10. Valid inputs sum to zero. The example sums to 9.</p>
  </li>
</ol>
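<p>As a baseline, the three steps translate directly into a scalar loop.
This reference version (not part of the SWAR implementation, but handy for
checking it) computes the same result:</p>

```c
// Scalar reference for the three steps above: double every other digit
// starting with the first, sum the digits of each element, then mod 10.
static int luhn_scalar(const char *s)
{
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        int d = s[i] - '0';
        if (i % 2 == 0) {     // every other digit, starting with the first
            d *= 2;
            d = d/10 + d%10;  // fold the tens place into the ones place
        }
        sum += d;
    }
    return sum % 10;  // zero for valid inputs
}
```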

<p>I will implement this algorithm in C with this prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>It assumes the input is 16 bytes and only contains digits, and it will
return the Luhn sum. Callers either validate a number by comparing the
result to zero, or use it to compute a check digit when generating a
number. (Read: You could use SWAR to rapidly generate valid numbers.)</p>

<p>The plan is to process the 16-digit number in two halves, and so first
load the halves into 64-bit registers, which I’m calling <code class="language-plaintext highlighter-rouge">hi</code> and <code class="language-plaintext highlighter-rouge">lo</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">hi</span> <span class="o">=</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">0</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">1</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">2</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">3</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">4</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">5</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">6</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">7</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">lo</span> <span class="o">=</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">8</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">9</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">10</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">11</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">13</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">14</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">15</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">;</span>
</code></pre></div></div>

<p>This looks complicated and possibly expensive, but it’s really just an
idiom for loading a little endian 64-bit integer from a buffer. Breaking
it down:</p>

<ul>
  <li>
    <p>The input, <code class="language-plaintext highlighter-rouge">*s</code>, is <code class="language-plaintext highlighter-rouge">char</code>, which may be signed on some architectures. I
chose this type since it’s the natural type for strings. However, I do
not want sign extension, so I mask the low byte of the possibly-signed
result by ANDing with 255. It’s as though <code class="language-plaintext highlighter-rouge">*s</code> was <code class="language-plaintext highlighter-rouge">unsigned char</code>.</p>
  </li>
  <li>
    <p>The shifts assemble the 64-bit result in little endian byte order
<a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">regardless of the host machine byte order</a>. In other words, this
will produce correct results even on big endian hosts.</p>
  </li>
  <li>
    <p>I chose little endian since it’s the natural byte order for all the
architectures I care about. Big endian hosts may pay a cost on this load
(byte swap instruction, etc.). The rest of the function could just as
easily be computed over a big endian load if I was primarily targeting a
big endian machine instead.</p>
  </li>
  <li>
    <p>I could have used <code class="language-plaintext highlighter-rouge">unsigned long long</code> (i.e. <em>at least</em> 64 bits) since
no part of this function requires <em>exactly</em> 64 bits. I chose <code class="language-plaintext highlighter-rouge">uint64_t</code>
since it’s succinct, and in practice, every implementation supporting
<code class="language-plaintext highlighter-rouge">long long</code> also defines <code class="language-plaintext highlighter-rouge">uint64_t</code>.</p>
  </li>
</ul>
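<p>The same idiom, factored into a loop for one 8-byte half (equivalent to
each unrolled expression above):</p>

```c
#include <stdint.h>

// Little-endian 64-bit load, equivalent to the unrolled expressions
// above. GCC and Clang generally recognize this pattern, too, and emit
// a single load on little-endian targets.
static uint64_t load64le(const char *s)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r |= (uint64_t)(s[i] & 255) << (8 * i);  // mask defeats sign extension
    }
    return r;
}
```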

<p>Both GCC and Clang figure this all out and produce perfect code. On
x86-64, just one instruction for each statement:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">0</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">rdx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Or, more impressively, loading both using a <em>single instruction</em> on ARM64:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">ldp</span>  <span class="nv">x0</span><span class="p">,</span> <span class="nv">x1</span><span class="p">,</span> <span class="p">[</span><span class="nv">x0</span><span class="p">]</span>
</code></pre></div></div>

<p>The next step is to decode ASCII into numeric values. This is <a href="https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/">trivial and
common</a> in SWAR, and only requires subtracting <code class="language-plaintext highlighter-rouge">'0'</code> (<code class="language-plaintext highlighter-rouge">0x30</code>). So long
as there is no overflow, this can be done lane-wise.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">-=</span> <span class="mh">0x3030303030303030</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">-=</span> <span class="mh">0x3030303030303030</span><span class="p">;</span>
</code></pre></div></div>

<p>Each byte of the register now contains a value in 0–9. Next, double every
other digit. Multiplication in SWAR is not easy, but doubling just means
adding the lanes to themselves, and a mask selects only the lanes to be
doubled. Regarding the mask, recall that the least significant byte is the
first digit (little endian).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&amp;</span> <span class="mh">0x00ff00ff00ff00ff</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">+=</span> <span class="n">lo</span> <span class="o">&amp;</span> <span class="mh">0x00ff00ff00ff00ff</span><span class="p">;</span>
</code></pre></div></div>

<p>Each byte of the register now contains values in 0–18. Now for the tricky
problem of folding the tens place into the ones place. Unlike 8 or 16, 10
is not a particularly convenient base for computers, especially since SWAR
lacks lane-wide division or modulo. Perhaps a lane-wise <a href="https://en.wikipedia.org/wiki/Binary-coded_decimal">binary-coded
decimal</a> could solve this. However, I have a better trick up my
sleeve.</p>

<p>Consider that the tens place is either 0 or 1. In other words, we really
only care if the value in the lane is greater than 9. If I add 6 to each
lane, the 5th bit (value 16) will definitely be set in any lanes that were
previously at least 10. I can use that bit as the tens place.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="p">(</span><span class="n">hi</span> <span class="o">+</span> <span class="mh">0x0006000600060006</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="mi">4</span> <span class="o">&amp;</span> <span class="mh">0x0001000100010001</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">+=</span> <span class="p">(</span><span class="n">lo</span> <span class="o">+</span> <span class="mh">0x0006000600060006</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="mi">4</span> <span class="o">&amp;</span> <span class="mh">0x0001000100010001</span><span class="p">;</span>
</code></pre></div></div>

<p>This code adds 6 to the doubled lanes, shifts the 5th bit to the least
significant position in the lane, masks for just that bit, and adds it
lane-wise to the total. Only applying this to doubled lanes is a style
decision, and I could have applied it to all lanes for free.</p>

<p>The astute might notice I’ve strayed from the stated algorithm. A lane
that was holding, say, 12 now holds 13 rather than 3. Since the final
result of the algorithm is modulo 10, leaving the tens place alone is
harmless, so this is fine.</p>

<p>At this point each lane contains values in 0–19. Now that the tens
processing is done, I can combine the halves into one register with a
lane-wise sum:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">lo</span><span class="p">;</span>
</code></pre></div></div>

<p>Each lane contains values in 0–38. I would have preferred to do this
sooner, but that would have complicated tens place handling. Even if I had
rotated the doubled lanes in one register to even out the sums, some lanes
may still have had a 2 in the tens place.</p>

<p>The final step is a horizontal sum reduction using the typical SWAR
approach. Add the top half of the register to the bottom half, then the
top half of what’s left to the bottom half, etc.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
</code></pre></div></div>

<p>Before the sum I said each lane was 0–38, so couldn’t this sum be as high
as 304 (8x38)? It would overflow the lane, giving an incorrect result.
Fortunately the actual range is 0–18 for normal lanes and 0–38 for doubled
lanes. That’s a maximum of 224, which fits in the result lane without
overflow. Whew! I’ve been tracking the range all along to guard against
overflow like this.</p>

<p>Finally mask the result lane and return it modulo 10:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="p">(</span><span class="n">hi</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<p>On my machine, SWAR is around 3x faster than a straightforward
digit-by-digit implementation.</p>

<h3 id="usage-examples">Usage examples</h3>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">is_valid</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">luhn</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">random_credit_card</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"%015llu0"</span><span class="p">,</span> <span class="n">rand64</span><span class="p">()</span><span class="o">%</span><span class="mi">1000000000000000</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">15</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">+</span> <span class="p">(</span><span class="mi">10</span> <span class="o">-</span> <span class="n">luhn</span><span class="p">(</span><span class="n">s</span><span class="p">))</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="simd">SIMD</h3>

<p>Conveniently, all the SWAR operations translate directly into SSE2
instructions. If you understand the SWAR version, then this is easy to
follow:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__m128i</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_loadu_si128</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">);</span>

    <span class="c1">// decode ASCII</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sub_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mh">0x30</span><span class="p">));</span>

    <span class="c1">// double every other digit</span>
    <span class="n">__m128i</span> <span class="n">m</span> <span class="o">=</span> <span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="mh">0x00ff</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">m</span><span class="p">));</span>

    <span class="c1">// extract and add tens digit</span>
    <span class="n">__m128i</span> <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="mh">0x0006</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_srai_epi32</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>

    <span class="c1">// horizontal sum</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sad_epu8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_set1_epi32</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi32</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_shuffle_epi32</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">_mm_cvtsi128_si32</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On my machine, the SIMD version is around another 3x increase over SWAR,
and so nearly an order of magnitude faster than a digit-by-digit
implementation.</p>

<p><em>Update</em>: Const-me on Hacker News <a href="https://news.ycombinator.com/item?id=31320853">suggests a better option</a> for
handling the tens digit in the function above, shaving off 7% of the
function’s run time on my machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// if (digit &gt; 9) digit -= 9</span>
    <span class="n">__m128i</span> <span class="n">nine</span> <span class="o">=</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mi">9</span><span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">gt</span> <span class="o">=</span> <span class="n">_mm_cmpgt_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">nine</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sub_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">gt</span><span class="p">,</span> <span class="n">nine</span><span class="p">));</span>
</code></pre></div></div>

<p><em>Update</em>: u/aqrit on reddit has come up with a more optimized SSE2
solution, 12% faster than mine on my machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__m128i</span> <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_loadu_si128</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">m</span> <span class="o">=</span> <span class="n">_mm_cmpgt_epi8</span><span class="p">(</span><span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="sc">'5'</span><span class="p">),</span> <span class="n">v</span><span class="p">);</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_slli_epi16</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">8</span><span class="p">));</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span>  <span class="c1">// subtract 1 if less than 5</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_sad_epu8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_setzero_si128</span><span class="p">());</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi32</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_shuffle_epi32</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">_mm_cvtsi128_si32</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">-</span> <span class="mi">4</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
    <span class="c1">// (('0' * 24) - 8) % 10 == 4</span>
<span class="p">}</span>
</code></pre></div></div>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A flexible, lightweight, spin-lock barrier</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/03/13/"/>
    <id>urn:uuid:5a72d27a-60f4-4b52-a4c2-f1c3b72e6c85</id>
    <updated>2022-03-13T23:55:08Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=30671979">on Hacker News</a>.</em></p>

<p>The other day I wanted to try the famous <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">memory reordering experiment</a>
for myself. It’s the double-slit experiment of concurrency, where a
program can observe an <a href="https://research.swtch.com/hwmm">“impossible” result</a> on common hardware, as
though a thread had time-traveled. While getting thread timing as tight as
possible, I designed a possibly-novel thread barrier. It’s purely
spin-locked, the entire footprint is a zero-initialized integer, it
automatically resets, it can be used across processes, and the entire
implementation is just three to four lines of code.</p>

<!--more-->

<p>Here’s the entire barrier implementation for two threads in C11.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Spin-lock barrier for two threads. Initialize *barrier to zero.</span>
<span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">v</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="mi">2</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or in Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">BarrierWait</span><span class="p">(</span><span class="n">barrier</span> <span class="o">*</span><span class="kt">uint32</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">v</span> <span class="o">:=</span> <span class="n">atomic</span><span class="o">.</span><span class="n">AddUint32</span><span class="p">(</span><span class="n">barrier</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">v</span><span class="o">&amp;</span><span class="m">1</span> <span class="o">==</span> <span class="m">1</span> <span class="p">{</span>
        <span class="n">v</span> <span class="o">&amp;=</span> <span class="m">2</span>
        <span class="k">for</span> <span class="n">atomic</span><span class="o">.</span><span class="n">LoadUint32</span><span class="p">(</span><span class="n">barrier</span><span class="p">)</span><span class="o">&amp;</span><span class="m">2</span> <span class="o">==</span> <span class="n">v</span> <span class="p">{</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Even more, these two implementations are compatible with each other. C
threads and Go goroutines can synchronize on a common barrier using these
functions. Also note how it only uses two bits.</p>

<p>When I was done with my experiment, I did a quick search online for other
spin-lock barriers to see if anyone came up with the same idea. I found a
couple of <a href="https://web.archive.org/web/20151109230817/https://stackoverflow.com/questions/33598686/spinning-thread-barrier-using-atomic-builtins">subtly-incorrect</a> spin-lock barriers, and some
straightforward barrier constructions using a mutex spin-lock.</p>

<p>Before diving into how this works, and how to generalize it, let’s discuss
the circumstance that led to its design.</p>

<h3 id="experiment">Experiment</h3>

<p>Here’s the setup for the memory reordering experiment, where <code class="language-plaintext highlighter-rouge">w0</code> and <code class="language-plaintext highlighter-rouge">w1</code>
are initialized to zero.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>thread#1    thread#2
w0 = 1      w1 = 1
r1 = w1     r0 = w0
</code></pre></div></div>

<p>Considering all the possible orderings, it would seem that at least one of
<code class="language-plaintext highlighter-rouge">r0</code> or <code class="language-plaintext highlighter-rouge">r1</code> is 1. There seems to be no ordering where <code class="language-plaintext highlighter-rouge">r0</code> and <code class="language-plaintext highlighter-rouge">r1</code> could
both be 0. However, if raced precisely, this is a frequent or possibly
even majority occurrence on common hardware, including x86 and ARM.</p>

<p>How to go about running this experiment? These are concurrent loads and
stores, so it’s tempting to use <code class="language-plaintext highlighter-rouge">volatile</code> for <code class="language-plaintext highlighter-rouge">w0</code> and <code class="language-plaintext highlighter-rouge">w1</code>. However,
this would constitute a data race — undefined behavior in at least C and
C++ — and so we couldn’t really reason much about the results, at least
not without first verifying the compiler’s assembly. These are variables
in a high-level language, not architecture-level stores/loads, even with
<code class="language-plaintext highlighter-rouge">volatile</code>.</p>

<p>So my first idea was to use a bit of inline assembly for all accesses that
would otherwise be data races. x86-64:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r1</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"movl  $1, %1</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"movl  %2, %0</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="s">"=m"</span><span class="p">(</span><span class="o">*</span><span class="n">w0</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"m"</span><span class="p">(</span><span class="o">*</span><span class="n">w1</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>ARM64 (to try on my Raspberry Pi):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"str  %w0, %1</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"ldr  %w0, %2</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="s">"+r"</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="s">"=m"</span><span class="p">(</span><span class="o">*</span><span class="n">w0</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"m"</span><span class="p">(</span><span class="o">*</span><span class="n">w1</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is from the point-of-view of thread#1, but I can swap the arguments
for thread#2. I’m expecting this to be inlined, and encouraging it with
<code class="language-plaintext highlighter-rouge">static</code>.</p>

<p>Alternatively, I could use C11 atomics with a relaxed memory order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="k">_Atomic</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">atomic_store_explicit</span><span class="p">(</span><span class="n">w0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">atomic_load_explicit</span><span class="p">(</span><span class="n">w1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since this is a <em>race</em> and I want both threads to run their two experiment
instructions as simultaneously as possible, it would be wise to use some
sort of <em>starting barrier</em>… exactly the purpose of a thread barrier! It
will hold the threads back until they’re both ready.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">w0</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">r0</span><span class="p">,</span> <span class="n">r1</span><span class="p">;</span>

<span class="c1">// thread#1                   // thread#2</span>
<span class="n">w0</span> <span class="o">=</span> <span class="n">w1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">BARRIER</span><span class="p">;</span>                      <span class="n">BARRIER</span><span class="p">;</span>
<span class="n">r1</span> <span class="o">=</span> <span class="n">experiment</span><span class="p">(</span><span class="o">&amp;</span><span class="n">w0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">w1</span><span class="p">);</span>    <span class="n">r0</span> <span class="o">=</span> <span class="n">experiment</span><span class="p">(</span><span class="o">&amp;</span><span class="n">w1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">w0</span><span class="p">);</span>
<span class="n">BARRIER</span><span class="p">;</span>                      <span class="n">BARRIER</span><span class="p">;</span>

<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r0</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">r1</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"impossible!"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The second thread goes straight into the barrier, but the first thread
does a little more work to initialize the experiment and a little more at
the end to check the result. The second barrier ensures they’re both done
before checking.</p>

<p>Running this only once isn’t so useful, so each thread loops a few million
times, hence the re-initialization in thread#1. The barriers keep them
lockstep.</p>

<h3 id="barrier-selection">Barrier selection</h3>

<p>On my first attempt, I made the obvious decision for the barrier: I used
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_barrier_wait.html"><code class="language-plaintext highlighter-rouge">pthread_barrier_t</code></a>. I was already using pthreads for spawning the
extra thread, including <a href="/blog/2020/05/15/">on Windows</a>, so this was convenient.</p>

<p>However, my initial results were disappointing. I only observed an
“impossible” result around one in a million trials. With some debugging I
determined that the pthreads barrier was just too damn slow, throwing off
the timing. This was especially true with winpthreads, bundled with
Mingw-w64, which in addition to the per-barrier mutex, grabs a <em>global</em>
lock <em>twice</em> per wait to manage the barrier’s reference counter.</p>

<p>All pthreads implementations I used were quick to yield to the system
scheduler. The first thread to arrive at the barrier would go to sleep,
the second thread would wake it up, and it was rare they’d actually race
on the experiment. This is perfectly reasonable for a pthreads barrier
designed for the general case, but I really needed a <em>spin-lock barrier</em>.
That is, the first thread to arrive spins in a loop until the second
thread arrives, and it never interacts with the scheduler. This happens so
frequently and quickly that it should only spin for a few iterations.</p>

<h3 id="barrier-design">Barrier design</h3>

<p>Spin locking means atomics. By default, atomics have sequentially
consistent ordering and will provide the necessary synchronization for the
non-atomic experiment variables. Stores (e.g. to <code class="language-plaintext highlighter-rouge">w0</code>, <code class="language-plaintext highlighter-rouge">w1</code>) made before
the barrier will be visible to all other threads upon passing through the
barrier. In other words, the initialization will propagate before either
thread exits the first barrier, and results propagate before either thread
exits the second barrier.</p>

<p>I know statically that there are only two threads, simplifying the
implementation. The plan: When threads arrive, they atomically increment a
shared variable to indicate such. The first to arrive will see an odd
number, telling it to atomically read the variable in a loop until the
other thread changes it to an even number.</p>

<p>At first with just two threads this might seem like a single bit would
suffice. If the bit is set, the other thread hasn’t arrived. If clear,
both threads have arrived.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">broken_wait1</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Or to avoid an extra load, use the result directly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">broken_wait2</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">++*</span><span class="n">barrier</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Neither of these work correctly, and the other mutex-free barriers I found
all have the same defect. Consider the broader picture: Between atomic
loads in the first thread spin-lock loop, suppose the second thread
arrives, passes through the barrier, does its work, hits the next barrier,
and increments the counter. Both threads see an odd counter simultaneously
and deadlock. No good.</p>

<p>To fix this, the wait function must also track the <em>phase</em>. The first
barrier is the first phase, the second barrier is the second phase, etc.
Conveniently <strong>the rest of the integer acts like a phase counter</strong>!
Writing this out more explicitly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">observed</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="n">thread_count</span> <span class="o">=</span> <span class="n">observed</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">thread_count</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// not last arrival, watch for phase change</span>
        <span class="kt">unsigned</span> <span class="n">init_phase</span> <span class="o">=</span> <span class="n">observed</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
            <span class="kt">unsigned</span> <span class="n">current_phase</span> <span class="o">=</span> <span class="o">*</span><span class="n">barrier</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">current_phase</span> <span class="o">!=</span> <span class="n">init_phase</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key: When the last thread arrives, it overflows the thread counter to
zero and increments the phase counter in one operation.</p>

<p>By the way, I’m using <code class="language-plaintext highlighter-rouge">unsigned</code> since it may eventually overflow, and
even <code class="language-plaintext highlighter-rouge">_Atomic int</code> overflow is undefined for the <code class="language-plaintext highlighter-rouge">++</code> operator. However,
if you use <code class="language-plaintext highlighter-rouge">atomic_fetch_add</code> or C++ <code class="language-plaintext highlighter-rouge">std::atomic</code> then overflow is
defined and you can use <code class="language-plaintext highlighter-rouge">int</code>.</p>

<p>Threads can never be more than one phase apart by definition, so only one
bit is needed for the phase counter, making this effectively a two-phase,
two-bit barrier. In my final implementation, rather than shift (<code class="language-plaintext highlighter-rouge">&gt;&gt;</code>), I
mask (<code class="language-plaintext highlighter-rouge">&amp;</code>) the phase bit with 2.</p>
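<p>Spelled out, the masked form of the two-thread barrier looks like this (a
sketch equivalent to the shifted version above; the name is my own):</p>

```c
#include <stdatomic.h>

// Two-thread spin-lock barrier: bit 0 counts arrivals, bit 1 is the
// single-bit phase counter. Initialize *barrier to zero.
void barrier_wait2(_Atomic unsigned *barrier)
{
    unsigned v = ++*barrier;
    if (v & 1) {
        // Not the last arrival: spin until the phase bit flips
        for (v &= 2; (*barrier & 2) == v;);
    }
}
```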

<p>With this spin-lock barrier, the experiment observes <code class="language-plaintext highlighter-rouge">r0 = r1 = 0</code> in ~10%
of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.</p>

<h3 id="generalizing-to-more-threads">Generalizing to more threads</h3>

<p>Two threads required two bits. This generalizes to <code class="language-plaintext highlighter-rouge">log2(n)+1</code> bits for
<code class="language-plaintext highlighter-rouge">n</code> threads, where <code class="language-plaintext highlighter-rouge">n</code> is a power of two. You may have already figured out
how to support more threads: spend more bits on the thread counter.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Spin-lock barrier for n threads, where n is a power of two.</span>
<span class="c1">// Initialize *barrier to zero.</span>
<span class="kt">void</span> <span class="nf">barrier_waitn</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">v</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="n">n</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="n">n</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note: <strong>It never makes sense for <code class="language-plaintext highlighter-rouge">n</code> to exceed the logical core count!</strong>
If it does, then at least one thread must not be actively running. The
spin-lock ensures it does not get scheduled promptly, and the barrier will
waste lots of resources doing nothing in the meantime.</p>

<p>If the barrier is used little enough that you won’t overflow the overall
barrier integer — maybe just use a <code class="language-plaintext highlighter-rouge">uint64_t</code> — an implementation could
support arbitrary thread counts with the same principle using modular
division instead of the <code class="language-plaintext highlighter-rouge">&amp;</code> operator. The denominator is ideally a
compile-time constant in order to avoid paying for division in the
spin-lock loop.</p>
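<p>A sketch of that idea, with a hypothetical fixed thread count baked in as
a compile-time constant (the name and the count of three are my own, not
from the experiment):</p>

```c
#include <stdatomic.h>
#include <stdint.h>

#define NTHREADS 3  // compile-time constant: cheap division in the loop

// Spin-lock barrier for NTHREADS threads, assuming the counter never
// overflows. Initialize *barrier to zero.
void barrier_waitm(_Atomic uint64_t *barrier)
{
    uint64_t v = ++*barrier;
    if (v % NTHREADS) {
        // Not the last arrival: spin until the phase (total arrivals
        // divided by the thread count) advances
        for (v /= NTHREADS; *barrier / NTHREADS == v;);
    }
}
```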

<p>While C11 <code class="language-plaintext highlighter-rouge">_Atomic</code> seems like it would be useful, unsurprisingly it is
not supported by one major, <a href="/blog/2021/12/30/">stubborn</a> implementation. If you’re
using C++11 or later, then go ahead and use <code class="language-plaintext highlighter-rouge">std::atomic&lt;int&gt;</code> since it’s
well-supported. In real, practical C programs, I will continue using dual
implementations: interlocked functions on MSVC, and GCC built-ins (also
supported by Clang) everywhere else.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if __GNUC__
#  define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
#  define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
#  define BARRIER_INC(x) _InterlockedIncrement(x)
#  define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif
</span>
<span class="c1">// Spin-lock barrier for n threads, where n is a power of two.</span>
<span class="c1">// Initialize *barrier to zero.</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">barrier</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="n">BARRIER_INC</span><span class="p">(</span><span class="n">barrier</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="n">n</span><span class="p">;</span> <span class="p">(</span><span class="n">BARRIER_GET</span><span class="p">(</span><span class="n">barrier</span><span class="p">)</span><span class="o">&amp;</span><span class="n">n</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This has the nice bonus that the interface does not have the <code class="language-plaintext highlighter-rouge">_Atomic</code>
qualifier, nor <code class="language-plaintext highlighter-rouge">std::atomic</code> template. It’s just a plain old <code class="language-plaintext highlighter-rouge">int</code>, making
the interface simpler and easier to use. It’s something I’ve grown to
appreciate from Go.</p>

<p>If you’d like to try the experiment yourself: <a href="https://gist.github.com/skeeto/c63b9ddf2c599eeca86356325b93f3a7"><code class="language-plaintext highlighter-rouge">reorder.c</code></a>. If
you’d like to see a test of Go and C sharing a thread barrier:
<a href="https://gist.github.com/skeeto/bdb5a0d2aa36b68b6f66ca39989e1444"><code class="language-plaintext highlighter-rouge">coop.go</code></a>.</p>

<p>I’m intentionally not providing the spin-lock barrier as a library. First,
it’s too trivial and small for that, and second, I believe <a href="https://vimeo.com/644068002">context is
everything</a>. Now that you understand the principle, you can whip up
your own, custom-tailored implementation when the situation calls for it,
just as the one in my experiment is hard-coded for exactly two threads.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>The wild west of Windows command line parsing</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/02/18/"/>
    <id>urn:uuid:04c886e0-3434-4292-b7de-e8213461838c</id>
    <updated>2022-02-18T03:52:12Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>I’ve been experimenting again lately with <a href="/blog/2016/01/31/">writing software without a
runtime</a> aside from the operating system itself, both on Linux and
Windows. Another way to look at it: I write and embed a bespoke, minimal
runtime within the application. One of the runtime’s core jobs is
retrieving command line arguments from the operating system. On Windows
this is a deeper rabbit hole than I expected, and far more complex than I
realized. There is no standard, and every runtime does it a little
differently. Five different applications may see five different sets of
arguments — even different argument counts — from the same input, and this
is <em>before</em> any sort of option parsing. It’s truly a modern day Tower of
Babel: “Confound their command line parsing, that they may not understand
one another’s arguments.”</p>

<p>Unix-like systems pass the <code class="language-plaintext highlighter-rouge">argv</code> array directly from parent to child. On
Linux it’s literally copied onto the child’s stack just above the stack
pointer on entry. The runtime just bumps the stack pointer address a few
bytes and calls it <code class="language-plaintext highlighter-rouge">argv</code>. Here’s a minimalist x86-64 Linux runtime in
just 6 instructions (22 bytes):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">_start:</span> <span class="nf">mov</span>   <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="p">]</span>     <span class="c1">; argc</span>
        <span class="nf">lea</span>   <span class="nb">rsi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>   <span class="c1">; argv</span>
        <span class="nf">call</span>  <span class="nv">main</span>
        <span class="nf">mov</span>   <span class="nb">edi</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="mi">60</span>        <span class="c1">; SYS_exit</span>
        <span class="nf">syscall</span>
</code></pre></div></div>

<p>It’s 5 instructions (20 bytes) on ARM64:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">_start:</span> <span class="nf">ldr</span>  <span class="nv">w0</span><span class="p">,</span> <span class="p">[</span><span class="nb">sp</span><span class="p">]</span>        <span class="c1">; argc</span>
        <span class="nf">add</span>  <span class="nv">x1</span><span class="p">,</span> <span class="nb">sp</span><span class="p">,</span> <span class="mi">8</span>       <span class="c1">; argv</span>
        <span class="nf">bl</span>   <span class="nv">main</span>
        <span class="nf">mov</span>  <span class="nv">w8</span><span class="p">,</span> <span class="mi">93</span>          <span class="c1">; SYS_exit</span>
        <span class="nf">svc</span>  <span class="mi">0</span>
</code></pre></div></div>

<p>On Windows, <code class="language-plaintext highlighter-rouge">argv</code> is passed in serialized form as a string. That’s how
MS-DOS did it (via the <a href="https://en.wikipedia.org/wiki/Program_Segment_Prefix">Program Segment Prefix</a>), because <a href="http://www.gaby.de/cpm/manuals/archive/cpm22htm/ch5.htm">that’s how
CP/M did it</a>. It made more sense when processes were mostly launched
directly by humans: The string was literally typed by a human operator,
and <em>somebody</em> has to parse it after all. Today, processes are nearly
always launched by other programs, but despite this, must still serialize
the argument array into a string as though a human had typed it out.</p>

<p>Windows itself provides an operating system routine for parsing command
line strings: <a href="https://docs.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-commandlinetoargvw">CommandLineToArgvW</a>. Fetch the command line string
with <a href="https://docs.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getcommandlinew">GetCommandLineW</a>, pass it to this function, and you have your
<code class="language-plaintext highlighter-rouge">argc</code> and <code class="language-plaintext highlighter-rouge">argv</code>. Plus maybe LocalFree to clean up. It’s only available
in “wide” form, so <a href="/blog/2021/12/30/">if you want to work in UTF-8</a> you’ll also need
<code class="language-plaintext highlighter-rouge">WideCharToMultiByte</code>. It’s around 20 lines of C rather than 6 lines of
assembly, but it’s not too bad.</p>
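<p>Those lines come out something like this sketch (no error checking, and
<code class="language-plaintext highlighter-rouge">get_argv_utf8</code> is my own name, not a Win32 function):</p>

```c
#include <stdlib.h>
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW: link against shell32

// Fetch the command line, split it with the system parser, then
// convert each argument to UTF-8. Error handling omitted for brevity.
static char **get_argv_utf8(int *argc)
{
    wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), argc);
    char **argv = calloc(*argc + 1, sizeof(*argv));
    for (int i = 0; i < *argc; i++) {
        int len = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, 0, 0, 0, 0);
        argv[i] = malloc(len);
        WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, argv[i], len, 0, 0);
    }
    LocalFree(wargv);
    return argv;
}
```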

<h3 id="my-getcommandlinew">My GetCommandLineW</h3>

<p>GetCommandLineW returns a pointer into static storage, which is why it
doesn’t need to be freed. More specifically, it comes from the <a href="https://docs.microsoft.com/en-us/windows/win32/api/winternl/ns-winternl-peb">Process
Environment Block</a>. This got me thinking: Could I locate this address
myself without the API call? First I needed to find the PEB. After some
research I found a PEB pointer in the <a href="https://en.wikipedia.org/wiki/Win32_Thread_Information_Block">Thread Information Block</a>,
itself found via the <code class="language-plaintext highlighter-rouge">gs</code> register (x64, <code class="language-plaintext highlighter-rouge">fs</code> on x86), an <a href="https://en.wikipedia.org/wiki/X86_memory_segmentation">old 386 segment
register</a>. Buried in the PEB is a <a href="https://docs.microsoft.com/en-us/windows/win32/api/subauth/ns-subauth-unicode_string"><code class="language-plaintext highlighter-rouge">UNICODE_STRING</code></a>, with the
command line string address. I worked out all the offsets for both x86 and
x64, and the whole thing is just three instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">wchar_t</span> <span class="o">*</span><span class="nf">cmdline_fetch</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="cp">#if __amd64
</span>    <span class="kr">__asm</span> <span class="p">(</span><span class="s">"mov %%gs:(0x60), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x20(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x78(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">cmd</span><span class="p">));</span>
    <span class="cp">#elif __i386
</span>    <span class="kr">__asm</span> <span class="p">(</span><span class="s">"mov %%fs:(0x30), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x10(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x44(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">cmd</span><span class="p">));</span>
    <span class="cp">#endif
</span>    <span class="k">return</span> <span class="n">cmd</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>From Windows XP through Windows 11, this returns exactly the same address
as GetCommandLineW. There’s little reason to do it this way other than to
annoy Raymond Chen, but it’s still neat and maybe has some super niche
use. Technically some of these offsets are undocumented and/or subject to
change, except Microsoft’s own static link CRT also hardcodes all these
offsets. It’s easy to find: disassemble any statically linked program,
look for the <code class="language-plaintext highlighter-rouge">gs</code> register, and you’ll find it using these offsets, too.</p>

<p>If you look carefully at the <code class="language-plaintext highlighter-rouge">UNICODE_STRING</code> you’ll see the length is
given by a <code class="language-plaintext highlighter-rouge">USHORT</code> in units of bytes, despite being a 16-bit <code class="language-plaintext highlighter-rouge">wchar_t</code>
string. This is <a href="https://devblogs.microsoft.com/oldnewthing/20031210-00/?p=41553">the source</a> of Windows’ maximum command line length
of <a href="https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessw">32,767 characters</a> (including terminator).</p>

<p>GetCommandLineW is from <code class="language-plaintext highlighter-rouge">kernel32.dll</code>, but CommandLineToArgvW is a bit
more off the beaten path in <code class="language-plaintext highlighter-rouge">shell32.dll</code>. If you wanted to avoid linking
to <code class="language-plaintext highlighter-rouge">shell32.dll</code> for <a href="https://randomascii.wordpress.com/2018/12/03/a-not-called-function-can-cause-a-5x-slowdown/">important reasons</a>, you’d need to do the
command line parsing yourself. Many runtimes, including Microsoft’s own
CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s
messier than I expected, and when I started digging into it I wasn’t
expecting it to involve a few days of research.</p>

<p>The CommandLineToArgvW documentation has a rough explanation: split arguments on whitespace
(not defined), quoting is involved, and there’s something about counting
backslashes, but only if they stop on a quote. It’s not quite enough to
implement your own, and if you test against it, it’s quickly apparent that
this documentation is at best incomplete. It links to a deprecated page
about <a href="https://docs.microsoft.com/en-us/previous-versions/17w5ykft(v=vs.85)">parsing C++ command line arguments</a> with a few more details.
Unfortunately the algorithm described on this page is not the algorithm
used by GetCommandLineW, nor is it used by any runtime I could find. It
even varies between Microsoft’s own CRTs. There is no canonical command
line parsing result, not even a <em>de facto</em> standard.</p>

<p>I eventually came across David Deley’s <a href="https://daviddeley.com/autohotkey/parameters/parameters.htm">How Command Line Parameters Are
Parsed</a>, which is the closest there is to an authoritative document on
the matter (<a href="https://web.archive.org/web/20210615061518/http://www.windowsinspired.com/how-a-windows-programs-splits-its-command-line-into-individual-arguments/">also</a>). Unfortunately it focuses on runtimes rather
than CommandLineToArgvW, and so some of those details aren’t captured. In
particular, the first argument (i.e. <code class="language-plaintext highlighter-rouge">argv[0]</code>) follows entirely different
rules, which really confused me for a while. The <a href="https://source.winehq.org/git/wine.git/blob/5a66eab72:/dlls/shcore/main.c#l264">Wine documentation</a>
was helpful particularly for CommandLineToArgvW. As far as I can tell,
they’ve re-implemented it perfectly, matching it bug-for-bug as they do.</p>

<h3 id="my-commandlinetoargvw">My CommandLineToArgvW</h3>

<p>Before finding any of this, I started building my own implementation,
which I now believe matches CommandLineToArgvW. These other documents
helped me figure out what I was missing. In my usual fashion, it’s <a href="/blog/2020/12/31/">a
little state machine</a>: <strong><a href="https://github.com/skeeto/scratch/blob/master/parsers/cmdline.c#L27"><code class="language-plaintext highlighter-rouge">cmdline.c</code></a></strong>. The interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">cmdline_to_argv8</span><span class="p">(</span><span class="k">const</span> <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">cmdline</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike the others, mine encodes straight into <a href="https://simonsapin.github.io/wtf-8/">WTF-8</a>, a superset of
UTF-8 that can round-trip ill-formed UTF-16. The WTF-8 part is negative
lines of code: invisible since it involves <em>not</em> reacting to ill-formed
input. If you use the new-ish UTF-8 manifest Win32 feature then your
program cannot handle command line strings with ill-formed UTF-16, a
problem solved by WTF-8.</p>

<p>As documented, that <code class="language-plaintext highlighter-rouge">argv</code> must be a particular size — a pointer-aligned,
224kB (x64) or 160kB (x86) buffer — which covers the absolute worst case.
That’s not too bad when the command line is limited to 32,766 UTF-16
characters. The worst case argument is a single long sequence of 3-byte
UTF-8. 4-byte UTF-8 requires 2 UTF-16 code points, so there would only be
half as many. The worst case <code class="language-plaintext highlighter-rouge">argc</code> is 16,383 (plus one more <code class="language-plaintext highlighter-rouge">argv</code> slot
for the null pointer terminator), which is one argument for each pair of
command line characters. The second half (roughly) of the <code class="language-plaintext highlighter-rouge">argv</code> is
actually used as a <code class="language-plaintext highlighter-rouge">char</code> buffer for the arguments, so it’s all a single,
fixed allocation. There is no error case since it cannot fail.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[</span><span class="n">CMDLINE_ARGV_MAX</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">argc</span> <span class="o">=</span> <span class="n">cmdline_to_argv8</span><span class="p">(</span><span class="n">cmdline_fetch</span><span class="p">(),</span> <span class="n">argv</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">main</span><span class="p">(</span><span class="n">argc</span><span class="p">,</span> <span class="n">argv</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Also: Note the <code class="language-plaintext highlighter-rouge">FUZZ</code> option in my source. It has been pretty thoroughly
<a href="/blog/2019/01/25/">fuzz tested</a>. It didn’t find anything, but it does make me more
confident in the result.</p>

<p>I also peeked at some language runtimes to see how others handle it. Just
as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft
CRT. Also expected, CPython implicitly does whatever the underlying C
runtime does, so its exact command line behavior depends on which version
of Visual Studio was used to build the Python binary. OpenJDK
<a href="https://github.com/openjdk/jdk/blob/jdk-17+35/src/jdk.jpackage/windows/native/common/WinSysInfo.cpp#L141">pragmatically calls CommandLineToArgvW</a>. Go (gc) <a href="https://go.googlesource.com/go/+/refs/tags/go1.17.7/src/os/exec_windows.go#115">does its own
parsing</a>, with behavior mixed between CommandLineToArgvW and some of
Microsoft’s CRTs, but not quite matching either.</p>

<h3 id="building-a-command-line-string">Building a command line string</h3>

<p>I’ve always been boggled as to why there’s no complementary inverse to
CommandLineToArgvW. When spawning processes with arbitrary arguments,
everyone is left to implement the inverse of this under-specified and
non-trivial command line format to serialize an <code class="language-plaintext highlighter-rouge">argv</code>. Hopefully the
receiver parses it compatibly! There’s no falling back on a system routine
to help out. This has led to a lot of repeated effort: it’s not limited
to high level runtimes, but almost any extensible application (itself a
kind of runtime). Fortunately serializing is not quite as complex as
parsing since many of the edge cases simply don’t come up if done in a
straightforward way.</p>
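<p>The core of such a serializer is quoting a single argument. Here’s a
sketch of the conventional rules, assuming a CommandLineToArgvW-compatible
receiver (<code class="language-plaintext highlighter-rouge">quote_arg</code> is my own name, with no bounds checking, and
operating on narrow strings for brevity):</p>

```c
#include <stddef.h>

// Write one argument into dst, quoted so that a CommandLineToArgvW-style
// parser recovers it exactly: backslash runs are doubled when they
// precede an embedded quote or the closing quote, and embedded quotes
// get one extra escaping backslash. Returns the length written.
static size_t quote_arg(char *dst, const char *arg)
{
    size_t n = 0;
    dst[n++] = '"';
    for (size_t i = 0; arg[i];) {
        size_t slashes = 0;
        while (arg[i] == '\\') { slashes++; i++; }
        if (arg[i] == '"') {
            for (size_t k = 0; k < 2*slashes + 1; k++) dst[n++] = '\\';
            dst[n++] = arg[i++];
        } else if (!arg[i]) {
            for (size_t k = 0; k < 2*slashes; k++) dst[n++] = '\\';
        } else {
            for (size_t k = 0; k < slashes; k++) dst[n++] = '\\';
            dst[n++] = arg[i++];
        }
    }
    dst[n++] = '"';
    dst[n] = 0;
    return n;
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">dir\</code> becomes <code class="language-plaintext highlighter-rouge">"dir\\"</code>, and joining each quoted
argument with spaces yields the command line string.</p>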

<p>Naturally, I also wrote my own implementation (same source):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">cmdline_from_argv8</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">cmdline</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">);</span>
</code></pre></div></div>

<p>Like before, it accepts a WTF-8 <code class="language-plaintext highlighter-rouge">argv</code>, meaning it can correctly pass
through ill-formed UTF-16 arguments. It returns the actual command line
length. Since this one <em>can</em> fail when <code class="language-plaintext highlighter-rouge">argv</code> is too large, it returns
zero for an error.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="s">"python.exe"</span><span class="p">,</span> <span class="s">"-c"</span><span class="p">,</span> <span class="n">code</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>
<span class="kt">wchar_t</span> <span class="n">cmd</span><span class="p">[</span><span class="n">CMDLINE_CMD_MAX</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cmdline_from_argv8</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">CMDLINE_CMD_MAX</span><span class="p">,</span> <span class="n">argv</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="s">"argv too large"</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">CreateProcessW</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="cm">/*...*/</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="s">"CreateProcessW failed"</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>How do others handle this?</p>

<ul>
  <li>
    <p>The <a href="https://git.savannah.gnu.org/cgit/emacs.git/tree/src/w32proc.c?h=emacs-27.2#n2009">aged Emacs implementation</a> is written in C rather than Lisp,
steeped in history with vestigial wrong turns. Emacs still only calls
the “narrow” CreateProcessA despite having every affordance to do
otherwise, and <a href="https://github.com/skeeto/emacsql/issues/77#issuecomment-887125675">uses the wrong encoding at that</a>. A personal
source of headaches.</p>
  </li>
  <li>
    <p>CPython uses Python rather than C via <a href="https://github.com/python/cpython/blob/3.10/Lib/subprocess.py#L529"><code class="language-plaintext highlighter-rouge">subprocess.list2cmdline</code></a>.
While <a href="https://bugs.python.org/issue10838">undocumented</a>, it’s accessible on any platform and easy to
test against various inputs. Try it out!</p>
  </li>
  <li>
    <p>Go (gc) is <a href="https://go.googlesource.com/go/+/refs/tags/go1.17.7/src/syscall/exec_windows.go#101">just as delightfully boring as I’d expect</a>.</p>
  </li>
  <li>
    <p>OpenJDK <a href="https://github.com/openjdk/jdk/blob/jdk-17%2B35/src/java.base/windows/classes/java/lang/ProcessImpl.java#L229">optimistically optimizes</a> for command line strings under
80 bytes, and like Emacs, displays the weathering of long use.</p>
  </li>
</ul>

<p>I don’t plan to write a language implementation anytime soon, where this
might be needed, but it’s nice to know I’ve already solved this problem
for myself!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>More DLL fun with w64devkit: Go, assembly, and Python</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/06/29/"/>
    <id>urn:uuid:b2c53451-b12a-4f1a-a475-6c81096c9b5a</id>
    <updated>2021-06-29T21:50:30Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>My previous article explained <a href="/blog/2021/05/31/">how to work with dynamic-link libraries
(DLLs) using w64devkit</a>. These techniques also apply to other
circumstances, including with languages and ecosystems outside of C and
C++. In particular, <a href="/blog/2020/05/15/">w64devkit</a> is a great complement to Go and reliably
fulfills all the needs of <a href="https://golang.org/cmd/cgo/">cgo</a> — Go’s C interop — and can even
bootstrap Go itself. As before, this article is in large part an exercise
in capturing practical information I’ve picked up over time.</p>

<h3 id="go-bootstrap-and-cgo">Go: bootstrap and cgo</h3>

<p>The primary Go implementation, confusingly <a href="https://golang.org/doc/faq#What_compiler_technology_is_used_to_build_the_compilers">named “gc”</a>, is an
<a href="/blog/2020/01/21/">incredible piece of software engineering</a>. This is apparent when
building the Go toolchain itself, a process that is fast, reliable, easy,
and simple. It was originally written in C, but was re-written in Go
starting with Go 1.5. The C compiler in w64devkit can build the original C
implementation which then can be used to bootstrap any more recent
version. It’s so easy that I personally never use official binary releases
and always bootstrap from source.</p>

<p>You will need the Go 1.4 source, <a href="https://dl.google.com/go/go1.4-bootstrap-20171003.tar.gz">go1.4-bootstrap-20171003.tar.gz</a>.
This “bootstrap” tarball is the last Go 1.4 release plus a few additional
bugfixes. You will also need the source of the actual version of Go you
want to use, such as Go 1.16.5 (the latest version as of this writing).</p>

<p>Start by building Go 1.4 using w64devkit. On Windows, Go is built using a
batch script and no special build system is needed. Since it shouldn’t be
invoked with the BusyBox ash shell, I use <a href="/blog/2021/02/08/"><code class="language-plaintext highlighter-rouge">cmd.exe</code></a> explicitly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xf go1.4-bootstrap-20171003.tar.gz
$ mv go/ bootstrap
$ (cd bootstrap/src/ &amp;&amp; cmd /c make)
</code></pre></div></div>

<p>In about 30 seconds you’ll have a fully working Go 1.4 toolchain. Next use
it to build the desired toolchain. You can move this new toolchain after
it’s built if necessary.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export GOROOT_BOOTSTRAP="$PWD/bootstrap"
$ tar xf go1.16.5.src.tar.gz
$ (cd go/src/ &amp;&amp; cmd /c make)
</code></pre></div></div>

<p>At this point you can delete the bootstrap toolchain. You probably also
want to put Go on your PATH.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ rm -rf bootstrap/
$ printf 'PATH="$PATH;%s/go/bin"\n' "$PWD" &gt;&gt;~/.profile
$ source ~/.profile
</code></pre></div></div>

<p>Not only is Go now available, so is the full power of cgo. (Including <a href="https://dave.cheney.net/2016/01/18/cgo-is-not-go">its
costs</a> if used.)</p>

<h3 id="vim-suggestions">Vim suggestions</h3>

<p>Since w64devkit is oriented so much around Vim, here’s my personal Vim
configuration for Go. I don’t need or want fancy plugins, just access to
<code class="language-plaintext highlighter-rouge">goimports</code> and a couple of corrections to Vim’s built-in Go support (<code class="language-plaintext highlighter-rouge">[[</code>
and <code class="language-plaintext highlighter-rouge">]]</code> navigation). The included <code class="language-plaintext highlighter-rouge">ctags</code> understands Go, so tags
navigation works the same as it does with C. <code class="language-plaintext highlighter-rouge">\i</code> saves the current
buffer, runs <code class="language-plaintext highlighter-rouge">goimports</code>, and populates the quickfix list with any errors.
Similarly <code class="language-plaintext highlighter-rouge">:make</code> invokes <code class="language-plaintext highlighter-rouge">go build</code> and, as expected, populates the
quickfix list.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code>autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="k">setlocal</span> <span class="nb">makeprg</span><span class="p">=</span><span class="k">go</span>\ build
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">silent</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">&lt;</span>leader<span class="p">&gt;</span><span class="k">i</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">update</span> \<span class="p">|</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">cexpr</span> <span class="nb">system</span><span class="p">(</span><span class="s2">"goimports -w "</span> <span class="p">.</span> <span class="nb">expand</span><span class="p">(</span><span class="s2">"%"</span><span class="p">))</span> \<span class="p">|</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">silent</span> <span class="k">edit</span><span class="p">&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">[[</span>
<span class="se">    \</span> ?^\<span class="p">(</span>func\\<span class="p">|</span>var\\<span class="p">|</span><span class="nb">type</span>\\<span class="p">|</span><span class="k">import</span>\\<span class="p">|</span>package\<span class="p">)</span>\<span class="p">&gt;&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">]]</span>
<span class="se">    \</span> /^\<span class="p">(</span>func\\<span class="p">|</span>var\\<span class="p">|</span><span class="nb">type</span>\\<span class="p">|</span><span class="k">import</span>\\<span class="p">|</span>package\<span class="p">)</span>\<span class="p">&gt;&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
</code></pre></div></div>

<p>Go only comes with <code class="language-plaintext highlighter-rouge">gofmt</code>, but <code class="language-plaintext highlighter-rouge">goimports</code> is just one command away, so
there’s little excuse not to have it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ go install golang.org/x/tools/cmd/goimports@latest
</code></pre></div></div>

<p>Thanks to GOPROXY, all Go dependencies are accessible without (or before)
installing Git, so this tool installation works with nothing more than
w64devkit and a bootstrapped Go toolchain.</p>

<h3 id="cgo-dlls">cgo DLLs</h3>

<p>The intricacies of cgo are beyond the scope of this article, but the gist
is that a Go source file contains C source in a comment followed by
<code class="language-plaintext highlighter-rouge">import "C"</code>. The imported <code class="language-plaintext highlighter-rouge">C</code> object provides access to C types and
functions. Go functions marked with an <code class="language-plaintext highlighter-rouge">//export</code> comment, as well as the
commented C code, are accessible to C. The latter means we can use Go to
implement a C interface in a DLL, and the caller will have no idea they’re
actually talking to Go.</p>

<p>To illustrate, here’s a little C interface. To keep it simple, I’ve
specifically sidestepped some more complicated issues, particularly
involving memory management.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Which DLL am I running?</span>
<span class="kt">int</span> <span class="nf">version</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>

<span class="c1">// Generate 64 bits from a CSPRNG.</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="nf">rand64</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>

<span class="c1">// Compute the Euclidean norm.</span>
<span class="kt">float</span> <span class="nf">dist</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s a C implementation which I’m calling “version 1”.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;math.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;ntsecapi.h&gt;</span><span class="cp">
</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">int</span>
<span class="nf">version</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>

<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span>
<span class="nf">rand64</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">RtlGenRandom</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">x</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">float</span>
<span class="nf">dist</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As discussed in the previous article, each function is exported using
<code class="language-plaintext highlighter-rouge">__declspec</code> so that they’re available for import. As before:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -s -o hello1.dll hello1.c
</code></pre></div></div>

<p>Side note: This could be trivially converted into a C++ implementation
just by adding <code class="language-plaintext highlighter-rouge">extern "C"</code> to each declaration. It disables C++ features
like name mangling, and follows the C ABI so that the C++ functions appear
as C functions. Compiling the C++ DLL is exactly the same.</p>
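
<p>The usual way to make one interface serve both languages is an
<code class="language-plaintext highlighter-rouge">#ifdef __cplusplus</code> guard in the shared header. A minimal sketch of the
idea, with a stub <code class="language-plaintext highlighter-rouge">version</code> definition added so it stands alone (the real
definitions live in the DLL):</p>

```c
/* Shared header contents: under C++ the extern "C" guard disables name
 * mangling so these functions keep their plain C ABI names; under C the
 * guard compiles away entirely. */
#ifdef __cplusplus
extern "C" {
#endif

int version(void);
unsigned long long rand64(void);
float dist(float x, float y);

#ifdef __cplusplus
}
#endif

/* Stub definition so this sketch runs stand-alone. */
int version(void) { return 1; }
```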

<p>Suppose we wanted to implement this in Go instead of C. We already have
all the tools needed to do so. Here’s a Go implementation, “version 2”:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>

<span class="k">import</span> <span class="s">"C"</span>
<span class="k">import</span> <span class="p">(</span>
	<span class="s">"crypto/rand"</span>
	<span class="s">"encoding/binary"</span>
	<span class="s">"math"</span>
<span class="p">)</span>

<span class="c">//export version</span>
<span class="k">func</span> <span class="n">version</span><span class="p">()</span> <span class="n">C</span><span class="o">.</span><span class="kt">int</span> <span class="p">{</span>
	<span class="k">return</span> <span class="m">2</span>
<span class="p">}</span>

<span class="c">//export rand64</span>
<span class="k">func</span> <span class="n">rand64</span><span class="p">()</span> <span class="n">C</span><span class="o">.</span><span class="n">ulonglong</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">buf</span> <span class="p">[</span><span class="m">8</span><span class="p">]</span><span class="kt">byte</span>
	<span class="n">rand</span><span class="o">.</span><span class="n">Read</span><span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="o">:</span><span class="p">])</span>
	<span class="n">r</span> <span class="o">:=</span> <span class="n">binary</span><span class="o">.</span><span class="n">LittleEndian</span><span class="o">.</span><span class="n">Uint64</span><span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="o">:</span><span class="p">])</span>
	<span class="k">return</span> <span class="n">C</span><span class="o">.</span><span class="n">ulonglong</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="p">}</span>

<span class="c">//export dist</span>
<span class="k">func</span> <span class="n">dist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span><span class="p">)</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span> <span class="p">{</span>
	<span class="k">return</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="kt">float64</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">)))</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note the use of C types for all arguments and return values. The <code class="language-plaintext highlighter-rouge">main</code>
function is required since this is the main package, but it will never be
called. The DLL is built like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ go build -buildmode=c-shared -o hello2.dll hello2.go
</code></pre></div></div>

<p>Without the <code class="language-plaintext highlighter-rouge">-o</code> option, the DLL will lack a <code class="language-plaintext highlighter-rouge">.dll</code> extension. It still
works, since file extensions are mostly a convention on Windows, but the bare
name may be confusing.</p>

<p>What if we need an import library? This will be required when linking with
the MSVC toolchain. In the previous article we asked Binutils to generate
one using <code class="language-plaintext highlighter-rouge">--out-implib</code>. For Go we have to handle this ourselves via
<code class="language-plaintext highlighter-rouge">gendef</code> and <code class="language-plaintext highlighter-rouge">dlltool</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gendef hello2.dll
$ dlltool -l hello2.lib -d hello2.def
</code></pre></div></div>

<p>The only way anyone upgrading would know version 2 was implemented in Go
is that the DLL is a lot bigger (a few MB vs. a few kB) since it now
contains an entire Go runtime.</p>

<h3 id="nasm-assembly-dll">NASM assembly DLL</h3>

<p>We could also go the other direction and implement the DLL using plain
assembly. It won’t even require linking against a C runtime.</p>

<p>w64devkit includes two assemblers: GAS (Binutils) which is used by GCC,
and NASM which has <a href="https://elronnd.net/writ/2021-02-13_att-asm.html">friendlier syntax</a>. I prefer the latter whenever
possible — exactly why I included NASM in the distribution. So here’s how
I implemented “version 3” in NASM assembly.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">bits</span> <span class="mi">64</span>

<span class="nf">section</span> <span class="nv">.text</span>

<span class="nf">global</span> <span class="nb">Dl</span><span class="nv">lMainCRTStartup</span>
<span class="nf">export</span> <span class="nb">Dl</span><span class="nv">lMainCRTStartup</span>
<span class="nl">DllMainCRTStartup:</span>
	<span class="nf">mov</span> <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
	<span class="nf">ret</span>

<span class="nf">global</span> <span class="nv">version</span>
<span class="nf">export</span> <span class="nv">version</span>
<span class="nl">version:</span>
	<span class="nf">mov</span> <span class="nb">eax</span><span class="p">,</span> <span class="mi">3</span>
	<span class="nf">ret</span>

<span class="nf">global</span> <span class="nv">rand64</span>
<span class="nf">export</span> <span class="nv">rand64</span>
<span class="nl">rand64:</span>
	<span class="nf">rdrand</span> <span class="nb">rax</span>
	<span class="nf">ret</span>

<span class="nf">global</span> <span class="nb">di</span><span class="nv">st</span>
<span class="nf">export</span> <span class="nb">di</span><span class="nv">st</span>
<span class="nl">dist:</span>
	<span class="nf">mulss</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
	<span class="nf">mulss</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
	<span class="nf">addss</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
	<span class="nf">sqrtss</span> <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
	<span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">global</code> directive is common in NASM assembly and causes the named
symbol to have the external linkage needed when linking the DLL. The
<code class="language-plaintext highlighter-rouge">export</code> directive is Windows-specific and is equivalent to <code class="language-plaintext highlighter-rouge">dllexport</code> in
C.</p>

<p>Every DLL must have an entrypoint, usually named <code class="language-plaintext highlighter-rouge">DllMainCRTStartup</code>. The
return value indicates whether the DLL loaded successfully. So far this has
been handled automatically by the C implementation, but at this low level
we must define it explicitly.</p>
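
<p>For comparison, here is roughly what that entrypoint looks like in C, with
primitive types substituted for the <code class="language-plaintext highlighter-rouge">windows.h</code> ones
(<code class="language-plaintext highlighter-rouge">BOOL WINAPI DllMainCRTStartup(HINSTANCE, DWORD, LPVOID)</code>) so the sketch
stands alone:</p>

```c
/* On x86-64 Windows there is a single calling convention, so no
 * __stdcall decoration is needed, matching the bare NASM version. */
int DllMainCRTStartup(void *instance, unsigned reason, void *reserved)
{
    (void)instance;  /* module handle of this DLL            */
    (void)reason;    /* e.g. DLL_PROCESS_ATTACH              */
    (void)reserved;
    return 1;        /* nonzero: the DLL loaded successfully */
}
```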

<p>Here’s how to assemble and link the DLL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nasm -fwin64 -o hello3.o hello3.s
$ ld -shared -s -o hello3.dll hello3.o
</code></pre></div></div>

<h3 id="call-the-dlls-from-python">Call the DLLs from Python</h3>

<p>Python has a nice, built-in C interop, <code class="language-plaintext highlighter-rouge">ctypes</code>, that allows Python to
call arbitrary C functions in shared libraries, including DLLs, without
writing C to glue it together. To tie this all together, here’s a Python
program that loads all of the DLLs above and invokes each of the
functions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ctypes</span>

<span class="k">def</span> <span class="nf">load</span><span class="p">(</span><span class="n">version</span><span class="p">):</span>
    <span class="n">hello</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">CDLL</span><span class="p">(</span><span class="sa">f</span><span class="s">"./hello</span><span class="si">{</span><span class="n">version</span><span class="si">}</span><span class="s">.dll"</span><span class="p">)</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_int</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">()</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">(</span><span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span><span class="p">,</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span><span class="p">)</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_ulonglong</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">()</span>
    <span class="k">return</span> <span class="n">hello</span>

<span class="k">for</span> <span class="n">hello</span> <span class="ow">in</span> <span class="n">load</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">load</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">load</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"version"</span><span class="p">,</span> <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">())</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"rand   "</span><span class="p">,</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">()</span><span class="si">:</span><span class="mi">016</span><span class="n">x</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"dist   "</span><span class="p">,</span> <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
</code></pre></div></div>

<p>After loading the DLL with <code class="language-plaintext highlighter-rouge">CDLL</code> the program defines each function
prototype so that Python knows how to call it. Unfortunately it’s not
possible to build Python with w64devkit, so you’ll also need to install
the standard CPython distribution in order to run it. Here’s the output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python finale.py
version 1
rand    b011ea9bdbde4bdf
dist    5.0
version 2
rand    f7c86ff06ae3d1a2
dist    5.0
version 3
rand    2a35a05b0482c898
dist    5.0
</code></pre></div></div>

<p>That output is the result of four different languages interfacing in one
process: C, Go, x86-64 assembly, and Python. Pretty neat if you ask me!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Chunking Optimizations: Let the Knife Do the Work</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/09/"/>
    <id>urn:uuid:961086fa-46af-42d4-bd69-6f4a326a1505</id>
    <updated>2019-12-09T22:37:55Z</updated>
    <category term="c"/><category term="cpp"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>There’s an old saying, <a href="https://www.youtube.com/watch?v=bTee6dKpDB0"><em>let the knife do the work</em></a>. Whether
preparing food in the kitchen or whittling a piece of wood, don’t push
your weight into the knife. Not only is it tiring, you’re much more
likely to hurt yourself. Use the tool properly and little force will be
required.</p>

<p>The same advice also often applies to compilers.</p>

<p>Suppose you need to XOR two non-overlapping 64-byte (512-bit) blocks of
data. The simplest approach would be to do it a byte at a time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* XOR src into dst */</span>
<span class="kt">void</span>
<span class="nf">xor512a</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">64</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Maybe you benchmark it or you look at the assembly output, and the
results are disappointing. Your compiler did <em>exactly</em> what you asked
of it and produced code that performs 64 single-byte XOR operations
(GCC 9.2.0, x86-64, <code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512a:</span>
        <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nl">.L0:</span>    <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="nb">rax</span><span class="p">]</span>
        <span class="nf">xor</span>    <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="nb">rax</span><span class="p">],</span> <span class="nb">cl</span>
        <span class="nf">inc</span>    <span class="nb">rax</span>
        <span class="nf">cmp</span>    <span class="nb">rax</span><span class="p">,</span> <span class="mi">64</span>
        <span class="nf">jne</span>    <span class="nv">.L0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The target architecture has wide registers so it could be doing <em>at
least</em> 8 bytes at a time. Since your compiler isn’t doing it, you
decide to chunk the work into 8-byte blocks yourself in an attempt to
manually implement a <em>chunking operation</em>. Here’s some <a href="https://old.reddit.com/r/C_Programming/comments/e83jzk/strange_gcc_compiler_bug_when_using_o2_or_higher/">real-world
code</a> that does so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* WARNING: Broken, do not use! */</span>
<span class="kt">void</span>
<span class="nf">xor512b</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You check the assembly output of this function, and it looks much
better. It’s now processing 8 bytes at a time, so it should be about 8
times faster than before.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512b:</span>
        <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nl">.L0:</span>    <span class="nf">mov</span>    <span class="nb">rcx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="nb">rax</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
        <span class="nf">xor</span>    <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="nb">rax</span><span class="o">*</span><span class="mi">8</span><span class="p">],</span> <span class="nb">rcx</span>
        <span class="nf">inc</span>    <span class="nb">rax</span>
        <span class="nf">cmp</span>    <span class="nb">rax</span><span class="p">,</span> <span class="mi">8</span>
        <span class="nf">jne</span>    <span class="nv">.L0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Still, this machine has 16-byte wide registers (SSE2 <code class="language-plaintext highlighter-rouge">xmm</code>), so there
could be another doubling in speed. Oh well, this is good enough, so you
plug it into your program. But something strange happens: <strong>The output
is now wrong!</strong></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">dst</span><span class="p">[</span><span class="mi">32</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span>
        <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">16</span>
    <span class="p">};</span>
    <span class="kt">uint32_t</span> <span class="n">src</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">25</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">49</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span>
        <span class="mi">81</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="mi">144</span><span class="p">,</span> <span class="mi">169</span><span class="p">,</span> <span class="mi">196</span><span class="p">,</span> <span class="mi">225</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span>
    <span class="p">};</span>
    <span class="n">xor512b</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="n">src</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">dst</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Your program prints 1..16 as if <code class="language-plaintext highlighter-rouge">xor512b()</code> had never been called. You check
over everything a dozen times, and you can’t find anything wrong. Even
crazier, if you disable optimizations then the bug goes away. It must be
some kind of compiler bug!</p>

<p>Investigating a bit more, you learn that the <code class="language-plaintext highlighter-rouge">-fno-strict-aliasing</code>
option also fixes the bug. That’s because this program violates C strict
aliasing rules. An array of <code class="language-plaintext highlighter-rouge">uint32_t</code> was accessed as a <code class="language-plaintext highlighter-rouge">uint64_t</code>. As
an <a href="/blog/2018/07/20/#strict-aliasing">important optimization</a>, compilers are allowed to assume such
variables do not alias and generate code accordingly. Otherwise every
memory store could potentially modify any variable, which limits the
compiler’s ability to produce decent code.</p>
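<p>To see the assumption at work, consider a load/store/load sequence
through differently-typed pointers (a sketch of my own, not compiler
output):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Because uint32_t and uint64_t may not alias, the compiler may fold
 * the second load of *p into the first, ignoring the store to *q. */
uint32_t
load_twice(uint32_t *p, uint64_t *q)
{
    uint32_t sum = *p;
    *q = 0;
    sum += *p;
    return sum;
}

int
main(void)
{
    uint32_t a = 21;
    uint64_t b = 7;
    printf("%u\n", load_twice(&amp;a, &amp;b));
}
</code></pre></div></div>

<p>Here the result is 42 either way, but had <code class="language-plaintext highlighter-rouge">q</code> actually pointed at
<code class="language-plaintext highlighter-rouge">a</code>, the folded code and a literal reading of the source would
disagree — the same disagreement that broke <code class="language-plaintext highlighter-rouge">xor512b()</code>.</p>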

<p>The original version is fine because <code class="language-plaintext highlighter-rouge">char *</code>, including both <code class="language-plaintext highlighter-rouge">signed</code>
and <code class="language-plaintext highlighter-rouge">unsigned</code>, has a special exemption and may alias with anything. For
the same reason, using <code class="language-plaintext highlighter-rouge">char *</code> unnecessarily can also make your
programs slower.</p>
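<p>The cost runs the other way, too. Since a <code class="language-plaintext highlighter-rouge">char</code> store may alias
anything, the compiler must assume it can clobber any object in memory.
A contrived sketch of my own (the names are hypothetical):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;

/* The store out[i] = 0 may alias *n, so without more information the
 * compiler must reload *n on every iteration of the loop. */
void
fill(char *out, int *n)
{
    for (int i = 0; i &lt; *n; i++) {
        out[i] = 0;
    }
}

int
main(void)
{
    int n = 4;
    char buf[] = {1, 2, 3, 4};
    fill(buf, &amp;n);
    printf("%d %d\n", buf[0], n);
}
</code></pre></div></div>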

<p>What could you do to keep the chunking operation while not running afoul
of strict aliasing? Counter-intuitively, you could use <code class="language-plaintext highlighter-rouge">memcpy()</code>. Copy
the chunks into legitimate, local <code class="language-plaintext highlighter-rouge">uint64_t</code> variables, do the work, and
copy the result back out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xor512c</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">uint64_t</span> <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">dst</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">src</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">memcpy</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">dst</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">memcpy()</code> is a built-in function, your compiler knows its
semantics and can ultimately elide all that copying. The assembly
listing for <code class="language-plaintext highlighter-rouge">xor512c</code> is identical to <code class="language-plaintext highlighter-rouge">xor512b</code>, but it won’t go haywire
when integrated into a real program.</p>

<p>It works and it’s correct, but you can still do much better than this!</p>

<h3 id="letting-your-compiler-do-the-work">Letting your compiler do the work</h3>

<p>The problem is you’re forcing the knife and not letting it do the work.
There’s a constraint on your compiler that hasn’t been considered: It
must work correctly for overlapping inputs.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">74</span><span class="p">]</span> <span class="o">=</span> <span class="p">{...};</span>
<span class="n">xor512a</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">buf</span> <span class="o">+</span> <span class="mi">10</span><span class="p">);</span>
</code></pre></div></div>

<p>In this situation, the byte-by-byte and chunked versions of the function
will have different results. That’s exactly why your compiler can’t do
the chunking operation itself. However, <em>you don’t care about this
situation</em> because the inputs never overlap.</p>
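<p>The divergence is easy to reproduce by hand: xor a buffer into itself
shifted down one byte, once byte-at-a-time and once from a snapshot of
the source (my own illustration):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int
main(void)
{
    /* byte-by-byte: later iterations observe earlier writes */
    unsigned char a[] = {1, 2, 4, 8};
    for (int i = 0; i &lt; 3; i++) {
        a[i+1] ^= a[i];
    }

    /* "chunked": all of src is read before any of dst is written */
    unsigned char b[] = {1, 2, 4, 8};
    unsigned char src[3];
    memcpy(src, b, 3);
    for (int i = 0; i &lt; 3; i++) {
        b[i+1] ^= src[i];
    }

    printf("%d %d\n", a[3], b[3]);  /* diverge: 15 vs 12 */
}
</code></pre></div></div>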

<p>Let’s revisit the first, simple implementation, but this time being
smarter about it. The <code class="language-plaintext highlighter-rouge">restrict</code> keyword indicates that the inputs
will not overlap, freeing your compiler of this unwanted concern.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xor512d</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">64</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(Side note: Adding <code class="language-plaintext highlighter-rouge">restrict</code> to the manually chunked function,
<code class="language-plaintext highlighter-rouge">xor512b()</code>, will not fix it. Using <code class="language-plaintext highlighter-rouge">restrict</code> can never make an
incorrect program correct.)</p>

<p>Compiled with GCC 9.2.0 and <code class="language-plaintext highlighter-rouge">-O3</code>, the resulting unrolled code
processes 16-byte chunks at a time (<code class="language-plaintext highlighter-rouge">pxor</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512d:</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm2</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm3</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm4</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">]</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm2</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm3</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm4</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Compiled with Clang 9.0.0 with AVX-512 enabled in the target
(<code class="language-plaintext highlighter-rouge">-mavx512bw</code>), <em>it does the entire operation in a single, big chunk!</em></p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512d:</span>
        <span class="nf">vmovdqu64</span>   <span class="nv">zmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">vpxorq</span>      <span class="nv">zmm0</span><span class="p">,</span> <span class="nv">zmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
        <span class="nf">vmovdqu64</span>   <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">zmm0</span>
        <span class="nf">vzeroupper</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>“Letting the knife do the work” means writing a correct program and
lifting unnecessary constraints so that the compiler can use whatever
chunk size is appropriate for the target.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Infectious Executable Stacks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/11/15/"/>
    <id>urn:uuid:7266b2ea-f39e-4b9a-87c8-e4480374af41</id>
    <updated>2019-11-15T03:29:37Z</updated>
    <category term="c"/><category term="netsec"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=21553882">on Hacker News</a></em>.</p>

<p>In software development there are many concepts that at first glance
seem useful and sound, but, after considering the consequences of their
implementation and use, are actually horrifying. Examples include
<a href="https://lwn.net/Articles/683118/">thread cancellation</a>, <a href="/blog/2019/10/27/">variable length arrays</a>, and <a href="/blog/2018/07/20/#strict-aliasing">memory
aliasing</a>. GCC’s closure extension to C is another, and this
little feature compromises the entire GNU toolchain.</p>

<!--more-->

<h3 id="gnu-c-nested-functions">GNU C nested functions</h3>

<p>GCC has its own dialect of C called GNU C. One feature unique to GNU C
is <em>nested functions</em>, which allow C programs to define functions inside
other functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">intsort1</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">return</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span> <span class="o">-</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">base</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">base</span><span class="p">),</span> <span class="n">cmp</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The nested function above is straightforward and harmless. It’s nothing
groundbreaking, and it is trivial for the compiler to implement. The
<code class="language-plaintext highlighter-rouge">cmp</code> function is really just a static function whose scope is limited
to the containing function, no different than a local static variable.</p>
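<p>Desugared, it’s equivalent to this portable C (my rendering; GCC’s
actual internal form may differ):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

/* The nested cmp hoisted out as a file-scope static function */
static int
intsort1_cmp(const void *a, const void *b)
{
    return *(int *)a - *(int *)b;
}

void
intsort1(int *base, size_t nmemb)
{
    qsort(base, nmemb, sizeof(*base), intsort1_cmp);
}

int
main(void)
{
    int v[] = {3, 1, 2};
    intsort1(v, 3);
    printf("%d %d %d\n", v[0], v[1], v[2]);
}
</code></pre></div></div>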

<p>With one slight variation the nested function turns into a closure. This
is where things get interesting:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">intsort2</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">_Bool</span> <span class="n">invert</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span> <span class="o">-</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
        <span class="k">return</span> <span class="n">invert</span> <span class="o">?</span> <span class="o">-</span><span class="n">r</span> <span class="o">:</span> <span class="n">r</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">base</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">base</span><span class="p">),</span> <span class="n">cmp</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">invert</code> variable from the outer scope is accessed from the inner
scope. This has <a href="/blog/2019/09/25/">clean, proper closure semantics</a> and works
correctly just as you’d expect. It fits quite well with traditional C
semantics. The closure itself is re-entrant and thread-safe. It’s
automatically (read: stack) allocated, and so it’s automatically freed
when the function returns, including when the stack is unwound via
<code class="language-plaintext highlighter-rouge">longjmp()</code>. It’s a natural progression to support closures like this
via nested functions. The eventual caller, <code class="language-plaintext highlighter-rouge">qsort</code>, doesn’t even know
it’s calling a closure!</p>

<p>While this seems so useful and easy, its implementation has serious
consequences that, in general, outweigh its benefits. In fact, in order
to make this work, the whole GNU toolchain has been specially rigged!</p>

<p>How does it work? The function pointer, <code class="language-plaintext highlighter-rouge">cmp</code>, passed to <code class="language-plaintext highlighter-rouge">qsort</code> must
somehow be associated with its lexical environment, specifically the
<code class="language-plaintext highlighter-rouge">invert</code> variable. A static address won’t do. When I <a href="/blog/2017/01/08/">implemented
closures as a toy library</a>, I talked about the function address for
each closure instance somehow needing to be unique.</p>

<p>GCC accomplishes this by constructing a trampoline on the stack. That
trampoline has access to the local variables stored adjacent to it, also
on the stack. GCC also generates a normal <code class="language-plaintext highlighter-rouge">cmp</code> function, like the
simple nested function before, that accepts <code class="language-plaintext highlighter-rouge">invert</code> as an additional
argument. The trampoline calls this function, passing the local variable
as this additional argument.</p>

<h3 id="trampoline-illustration">Trampoline illustration</h3>

<p>To illustrate this, I’ve manually implemented <code class="language-plaintext highlighter-rouge">intsort2()</code> below for
x86-64 (<a href="https://wiki.osdev.org/System_V_ABI">System V ABI</a>) without using GCC’s nested function
extension:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">_Bool</span> <span class="n">invert</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span> <span class="o">-</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">invert</span> <span class="o">?</span> <span class="o">-</span><span class="n">r</span> <span class="o">:</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">intsort3</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">_Bool</span> <span class="n">invert</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">fp</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="n">cmp</span><span class="p">;</span>
    <span class="k">volatile</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="c1">// mov  edx, invert</span>
        <span class="mh">0xba</span><span class="p">,</span> <span class="n">invert</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="c1">// mov  rax, cmp</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0xb8</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">,</span>
                    <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">40</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">48</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">56</span><span class="p">,</span>
        <span class="c1">// jmp  rax</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xe0</span>
    <span class="p">};</span>
    <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">trampoline</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">base</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">base</span><span class="p">),</span> <span class="n">trampoline</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here’s a complete example you can try yourself on nearly any x86-64
unix-like system: <a href="/download/trampoline.c"><strong>trampoline.c</strong></a>. It even works with Clang. The
two notable systems where stack trampolines won’t work are
<a href="https://marc.info/?l=openbsd-cvs&amp;m=149606868308439&amp;w=2">OpenBSD</a> and <a href="https://github.com/microsoft/WSL/issues/286">WSL</a>.</p>

<p>(Note: The <code class="language-plaintext highlighter-rouge">volatile</code> is necessary because C compilers rightfully do
not see the contents of <code class="language-plaintext highlighter-rouge">buf</code> as being consumed. Execution of the
contents isn’t considered.)</p>

<p>In case you hadn’t already caught it, there’s a catch. The linker needs
to link a binary that asks the loader for an executable stack (<code class="language-plaintext highlighter-rouge">-z
execstack</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -std=c99 -Os -Wl,-z,execstack trampoline.c
</code></pre></div></div>

<p>That’s because <code class="language-plaintext highlighter-rouge">buf</code> contains x86 code implementing the trampoline:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>  <span class="nb">edx</span><span class="p">,</span> <span class="nv">invert</span>    <span class="c1">; assign third argument</span>
<span class="nf">mov</span>  <span class="nb">rax</span><span class="p">,</span> <span class="nv">cmp</span>       <span class="c1">; store cmp address in RAX register</span>
<span class="nf">jmp</span>  <span class="nb">rax</span>            <span class="c1">; jump to cmp</span>
</code></pre></div></div>

<p>(Note: The absolute jump through a 64-bit register is necessary because
the trampoline on the stack and the jump target will be very far apart.
Further, these days the program will likely be compiled as a Position
Independent Executable (PIE), so <code class="language-plaintext highlighter-rouge">cmp</code> <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">might itself have a high
address</a> rather than load into the lowest 32 bits of the address
space.)</p>

<p>However, executable stacks were phased out ~15 years ago because they
make buffer overflows so much more dangerous! Attackers can inject
and execute whatever code they like, typically <em>shellcode</em>. That’s why
we need this unusual linker option.</p>

<p>You can see that the stack will be executable using our old friend,
<code class="language-plaintext highlighter-rouge">readelf</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -l a.out
...
  GNU_STACK  0x00000000 0x00000000 0x00000000
             0x00000000 0x00000000 RWE   0x10
...
</code></pre></div></div>

<p>Note the “RWE” at the bottom right, meaning read-write-execute. This is
a really bad sign in a real binary. Do any binaries installed on your
system right now have an executable stack? <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944817">I found one on mine</a>.
(Update: <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944971">A major one was found in the comments by Walter Misar</a>.)</p>

<p>When compiling the original version using a nested function there’s no
need for that special linker option. That’s because GCC saw that it
would need an executable stack and used this option automatically.</p>

<p>Or, more specifically, GCC <em>stopped</em> requesting a non-executable stack
in the object file it produced. For the GNU Binutils linker, <strong>the
default is an executable stack.</strong></p>

<h3 id="fail-open-design">Fail open design</h3>

<p>Since this is the default, the only way to get a non-executable stack is
if <em>every</em> object file input to the linker explicitly declares that it
does not need an executable stack. To request a non-executable stack, an
object file <a href="https://www.airs.com/blog/archives/518">must contain the (empty) section <code class="language-plaintext highlighter-rouge">.note.GNU-stack</code></a>.
If even a single object file fails to do this, then the final program
gets an executable stack.</p>

<p>Not only does one contaminated object file infect the binary, everything
dynamically linked with it <em>also</em> gets an executable stack. Entire
processes are infected! This occurs even via <code class="language-plaintext highlighter-rouge">dlopen()</code>, where the stack
is dynamically made executable to accommodate the new shared object.</p>
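<p>On Linux you can check whether a running process has been infected by
reading its memory map: the stack mapping’s flags should be <code class="language-plaintext highlighter-rouge">rw-p</code>, not
<code class="language-plaintext highlighter-rouge">rwxp</code>. A quick diagnostic of my own (Linux-specific, not part of the
toolchain):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Print the permission flags of this process's [stack] mapping:
 * "rw-p" normally, "rwxp" if the stack is executable. */
int
main(void)
{
    char line[256];
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) {
        return 1;
    }
    while (fgets(line, sizeof(line), maps)) {
        if (strstr(line, "[stack]")) {
            char perms[8];
            sscanf(line, "%*s %7s", perms);
            printf("%s\n", perms);
        }
    }
    fclose(maps);
}
</code></pre></div></div>

<p>The same information appears in the <code class="language-plaintext highlighter-rouge">VmFlags</code> of
<code class="language-plaintext highlighter-rouge">/proc/self/smaps</code>, but the compact maps format is enough here.</p>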

<p>I’ve been bit myself. In <a href="/blog/2016/11/15/"><em>Baking Data with Serialization</em></a> I did
it completely by accident, and I didn’t notice my mistake until three
years later. The GNU linker outputs object files without the special
note by default even though the object file only contains data.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo hello world &gt;hello.txt
$ ld -r -b binary -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
$
</code></pre></div></div>

<p>This is fixed with <code class="language-plaintext highlighter-rouge">-z noexecstack</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ld -r -b binary -z noexecstack -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
  [ 2] .note.GNU-stack  PROGBITS  00000000  0000004c
$
</code></pre></div></div>

<p>This may happen any time you link object files not produced by GCC, such
as output <a href="/blog/2015/04/19/">from the NASM assembler</a> or <a href="/blog/2016/11/17/">hand-crafted object
files</a>.</p>

<p>Nested C closures are super slick, but they’re just not worth the risk
of an executable stack, and they’re certainly not worth an entire
toolchain being fail open about it.</p>

<p>Update: A <a href="http://verisimilitudes.net/2019-11-21">rebuttal</a>. My short response is that the issue
discussed in my article isn’t really about C the language but rather
about an egregious issue with one particular toolchain. The problem
doesn’t even arise if you use only C, but instead when linking in object
files specifically <em>not</em> derived from C code.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>The Value of Undefined Behavior</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/07/20/"/>
    <id>urn:uuid:9758e9ea-46b6-3904-5166-52c7e6922892</id>
    <updated>2018-07-20T21:31:18Z</updated>
    <category term="c"/><category term="cpp"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In several places, the C and C++ language specifications use a
curious, and fairly controversial, phrase: <em>undefined behavior</em>. For
certain program constructs, the specification prescribes no specific
behavior, instead allowing <a href="http://www.catb.org/jargon/html/N/nasal-demons.html">anything to happen</a>. Such constructs
are considered erroneous, and so the result depends on the particulars
of the platform and implementation. The original purpose of undefined
behavior was for implementation flexibility. In other words, it’s
slack that allows a compiler to produce appropriate and efficient code
for its target platform.</p>

<p>Specifying a particular behavior would have put unnecessary burden on
implementations — especially in the earlier days of computing — making
for inefficient programs on some platforms. For example, if the result
of dereferencing a null pointer was defined to trap — to cause the
program to halt with an error — then platforms that do not have
hardware trapping, such as those without virtual memory, would be
required to instrument, in software, each pointer dereference.</p>

<p>In the 21st century, undefined behavior has taken on a somewhat
different meaning. Optimizers use it — or <em>ab</em>use it depending on your
point of view — to lift <a href="/blog/2016/12/22/">constraints</a> that would otherwise
inhibit more aggressive optimizations. It’s not so much a
fundamentally different application of undefined behavior, but it does
take the concept to an extreme.</p>

<p>The reasoning works like this: A program that evaluates a construct
whose behavior is undefined cannot, by definition, have any meaningful
behavior, and so that program would be useless. As a result,
<a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">compilers assume programs never invoke undefined behavior</a> and
use those assumptions to justify their optimizations.</p>

<p>Under this newer interpretation, mistakes involving undefined behavior
are more <a href="https://kristerw.blogspot.com/2017/09/why-undefined-behavior-may-call-never.html">punishing</a> and <a href="/blog/2018/05/01/">surprising</a> than before. Programs
that <em>seem</em> to make some sense when run on a particular architecture may
actually compile into a binary with a security vulnerability due to
conclusions reached from an analysis of its undefined behavior.</p>

<p>This can be frustrating if your programs are intended to run on a very
specific platform. In this situation, all behavior really <em>could</em> be
locked down and specified in a reasonable, predictable way. Such a
language would be like an extended, less portable version of C or C++.
But your toolchain still insists on running your program on the
<em>abstract machine</em> rather than the hardware you actually care about.
However, <strong>even in this situation undefined behavior can still be
desirable</strong>. I will provide a couple of examples in this article.</p>

<h3 id="signed-integer-overflow">Signed integer overflow</h3>

<p>To start things off, let’s look at one of my all time favorite examples
of useful undefined behavior, a situation involving signed integer
overflow. The result of a signed integer overflow isn’t just
unspecified, it’s undefined behavior. Full stop.</p>

<p>This goes beyond a simple matter of whether or not the underlying
machine uses a two’s complement representation. From the perspective of
the abstract machine, just the act of a signed integer overflowing is
enough to throw everything out the window, even if the overflowed result
is never actually used in the program.</p>

<p>On the other hand, unsigned integer overflow is defined — or, more
accurately, defined to wrap, <em>not</em> overflow. Both the undefined signed
overflow and defined unsigned overflow are useful in different
situations.</p>
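<p>To make the distinction concrete, here is a small sketch of my own
(hypothetical function names) where the two overflow rules pull the
optimizer in opposite directions:</p>

```c
#include <limits.h>

/* Signed: overflow is undefined behavior, so a compiler is free to
 * fold this comparison to a constant 1. */
int
always_bigger(int x)
{
    return x + 1 > x;
}

/* Unsigned: wraparound is defined, so the compiler must return 0
 * when x == UINT_MAX. */
int
sometimes_bigger(unsigned x)
{
    return x + 1 > x;
}
```

<p>Both functions look identical in source, but only the unsigned one
obligates the compiler to account for wraparound.</p>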

<p>For example, here’s a fairly common situation, much like what <a href="https://www.youtube.com/watch?v=yG1OZ69H_-o&amp;t=38m18s">actually
happened in bzip2</a>. Consider this function that does substring
comparison:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">cmp_signed</span><span class="p">(</span><span class="kt">int</span> <span class="n">i1</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i2</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i1</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">c2</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i2</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">c1</span> <span class="o">-</span> <span class="n">c2</span><span class="p">;</span>
        <span class="n">i1</span><span class="o">++</span><span class="p">;</span>
        <span class="n">i2</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">cmp_unsigned</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">i1</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">i2</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i1</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">c2</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i2</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">c1</span> <span class="o">-</span> <span class="n">c2</span><span class="p">;</span>
        <span class="n">i1</span><span class="o">++</span><span class="p">;</span>
        <span class="n">i2</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In this function, the indices <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> will always be some small,
non-negative value. Since it’s non-negative, it should be <code class="language-plaintext highlighter-rouge">unsigned</code>,
right? Not necessarily. That puts an extra constraint on code generation
and, at least on x86-64, makes for a less efficient function. Most of
the time you actually <em>don’t</em> want overflow to be defined, and instead
allow the compiler to assume it just doesn’t happen.</p>

<p>The constraint is that <strong>the behavior of <code class="language-plaintext highlighter-rouge">i1</code> or <code class="language-plaintext highlighter-rouge">i2</code> overflowing as an
unsigned integer is defined, and the compiler is obligated to implement
that behavior.</strong> On x86-64, where <code class="language-plaintext highlighter-rouge">int</code> is 32 bits, the result of the
operation must be truncated to 32 bits one way or another, requiring
extra instructions inside the loop.</p>

<p>In the signed case, incrementing the integers cannot overflow since that
would be undefined behavior. This permits the compiler to perform the
increment only in 64-bit precision without truncation if it would be
more efficient, which, in this case, it is.</p>

<p>Here’s the output of Clang 6.0.0 with <code class="language-plaintext highlighter-rouge">-Os</code> on x86-64. Pay close
attention to the main loop, which I named <code class="language-plaintext highlighter-rouge">.loop</code>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">cmp_signed:</span>
        <span class="nf">movsxd</span> <span class="nb">rdi</span><span class="p">,</span> <span class="nb">edi</span>             <span class="c1">; use i1 as a 64-bit integer</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">movsxd</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">esi</span>             <span class="c1">; use i2 as a 64-bit integer</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rsi</span><span class="p">]</span>
        <span class="nf">jmp</span>    <span class="nv">.check</span>

<span class="nl">.loop:</span>  <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rdi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rsi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="nf">inc</span>    <span class="nb">rdx</span>                  <span class="c1">; increment only the base pointer</span>
<span class="nl">.check:</span> <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">je</span>     <span class="nv">.loop</span>

        <span class="nf">movzx</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">al</span>
        <span class="nf">movzx</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">sub</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ecx</span>             <span class="c1">; return c1 - c2</span>
        <span class="nf">ret</span>

<span class="nl">cmp_unsigned:</span>
        <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">edi</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rax</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">ecx</span><span class="p">,</span> <span class="nb">esi</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rcx</span><span class="p">]</span>
        <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">jne</span>    <span class="nv">.ret</span>
        <span class="nf">inc</span>    <span class="nb">edi</span>
        <span class="nf">inc</span>    <span class="nb">esi</span>

<span class="nl">.loop:</span>  <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">edi</span>             <span class="c1">; truncated i1 overflow</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rax</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">ecx</span><span class="p">,</span> <span class="nb">esi</span>             <span class="c1">; truncated i2 overflow</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rcx</span><span class="p">]</span>
        <span class="nf">inc</span>    <span class="nb">edi</span>                  <span class="c1">; increment i1</span>
        <span class="nf">inc</span>    <span class="nb">esi</span>                  <span class="c1">; increment i2</span>
        <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">je</span>     <span class="nv">.loop</span>

<span class="nl">.ret:</span>   <span class="nf">movzx</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">al</span>
        <span class="nf">movzx</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">sub</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ecx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>As unsigned values, <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> can overflow independently, so they
have to be handled as independent 32-bit unsigned integers. As signed
values they can’t overflow, so they’re treated as if they were 64-bit
integers and, instead, the pointer, <code class="language-plaintext highlighter-rouge">buf</code>, is incremented without
concern for overflow. The signed loop is much more efficient (5
instructions versus 8).</p>

<p>The signed integer helps to communicate the <em>narrow contract</em> of the
function — the limited range of <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> — to the compiler. In a
variant of C where signed integer overflow is defined (i.e. <code class="language-plaintext highlighter-rouge">-fwrapv</code>),
this capability is lost. In fact, using <code class="language-plaintext highlighter-rouge">-fwrapv</code> deoptimizes the signed
version of this function.</p>

<p>Side note: Using <code class="language-plaintext highlighter-rouge">size_t</code> (an unsigned integer) is even better on x86-64
for this example since it’s already 64 bits and the function doesn’t
need the initial sign/zero extension. However, this might simply move
the sign extension out to the caller.</p>
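<p>For completeness, a sketch of that <code>size_t</code> variant (same loop
body, only the index type changes):</p>

```c
#include <stddef.h>

/* Sketch: size_t indices are already 64 bits wide on x86-64, so the
 * function needs no sign or zero extension at entry. */
int
cmp_size(size_t i1, size_t i2, unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}
```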

<h3 id="strict-aliasing">Strict aliasing</h3>

<p>Another controversial undefined behavior is <a href="https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8"><em>strict aliasing</em></a>.
This particular term doesn’t actually appear anywhere in the C
specification, but it’s the popular name for C’s aliasing rules. In
short, variables with types that aren’t compatible are not allowed to
alias through pointers.</p>

<p>Here’s the classic example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>    <span class="c1">// store</span>
    <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>    <span class="c1">// store</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">b</span><span class="p">;</span> <span class="c1">// load</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Naively one might assume the <code class="language-plaintext highlighter-rouge">return *b</code> could be optimized to a simple
<code class="language-plaintext highlighter-rouge">return 0</code>. However, since <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> have the same type, the compiler
must consider the possibility that they alias — that they point to the
same place in memory — and must generate code that works correctly under
these conditions.</p>

<p>If <code class="language-plaintext highlighter-rouge">foo</code> has a narrow contract that forbids <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> to alias, we
have a couple of options for helping our compiler.</p>

<p>First, we could manually resolve the aliasing issue by returning 0
explicitly. In more complicated functions this might mean making local
copies of values, working only with those local copies, then storing the
results back before returning. Then aliasing would no longer matter.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Second, C99 introduced a <code class="language-plaintext highlighter-rouge">restrict</code> qualifier to communicate to the
compiler that pointers passed to functions cannot alias. For example,
the pointers to <code class="language-plaintext highlighter-rouge">memcpy()</code> are qualified with <code class="language-plaintext highlighter-rouge">restrict</code> as of C99.
Passing aliasing pointers through <code class="language-plaintext highlighter-rouge">restrict</code> parameters is undefined
behavior; as far as the compiler is concerned, it simply never
happens.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>The third option is to design an interface that uses incompatible
types, exploiting strict aliasing. This happens all the time, usually
by accident. For example, <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">long</code> are never compatible even
when they have the same representation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>If you use an extended or modified version of C without strict
aliasing (<code class="language-plaintext highlighter-rouge">-fno-strict-aliasing</code>), then the compiler must assume
everything aliases all the time, generating a lot more precautionary
loads than necessary.</p>

<p>What <a href="https://lkml.org/lkml/2003/2/26/158">irritates</a> a lot of people is that compilers will still
apply the strict aliasing rule even when it’s trivial for the compiler
to prove that aliasing is occurring:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* note: forbidden */</span>
<span class="kt">long</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">a</span><span class="p">;</span>
</code></pre></div></div>

<p>It’s not just a simple matter of making exceptions for these cases.
The language specification would need to define all the rules about
when and where incompatible types are permitted to alias, and
developers would have to understand all these rules if they wanted to
take advantage of the exceptions. It can’t just come down to trusting
that the compiler is smart enough to see the aliasing when it’s
sufficiently simple. It would need to be carefully defined.</p>

<p>Besides, there are probably <a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">conforming, portable solutions</a>
that, with contemporary compilers, will safely compile to the efficient
code you actually want anyway.</p>
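<p>The best-known conforming idiom is <code>memcpy()</code>. A sketch of my own
illustrating it:</p>

```c
#include <stdint.h>
#include <string.h>

/* Sketch: memcpy() legally reinterprets an object's bytes where
 * *(uint32_t *)&f would violate the aliasing rules. Optimizing
 * compilers typically reduce this to a single register move. */
uint32_t
float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}
```

<p>The copy exists only in the abstract machine; the generated code is as
tight as the non-conforming pointer cast.</p>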

<p>There <em>is</em> one special exception for strict aliasing: <code class="language-plaintext highlighter-rouge">char *</code> is
allowed to alias with anything. This is important to keep in mind both
when you intentionally want aliasing, but also when you want to avoid
it. Writing through a <code class="language-plaintext highlighter-rouge">char *</code> pointer could force the compiler to
generate additional, unnecessary loads.</p>

<p>In fact, there’s a whole dimension to strict aliasing that, even today,
no compiler yet exploits: <code class="language-plaintext highlighter-rouge">uint8_t</code> is not necessarily <code class="language-plaintext highlighter-rouge">unsigned char</code>.
That’s just one possible <code class="language-plaintext highlighter-rouge">typedef</code> definition for it. It could instead
<code class="language-plaintext highlighter-rouge">typedef</code> to, say, some internal <code class="language-plaintext highlighter-rouge">__byte</code> type.</p>

<p>In other words, technically speaking, <code class="language-plaintext highlighter-rouge">uint8_t</code> does not have the strict
aliasing exemption. If you wanted to write bytes to a buffer without
worrying the compiler about aliasing issues with other pointers, this
would be the tool to accomplish it. Unfortunately, so much existing
code violates this part of strict aliasing that no toolchain is
<a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110">willing to exploit it</a> for optimization purposes.</p>

<h3 id="other-undefined-behaviors">Other undefined behaviors</h3>

<p>Some kinds of undefined behavior don’t have performance or portability
benefits. They’re only there to make the compiler’s job a little
simpler. Today, most of these are caught trivially at compile time as
syntax or semantic issues (i.e. a pointer cast to a float).</p>

<p>Some others are obvious about their performance benefits and don’t
require much explanation. For example, it’s undefined behavior to
index out of bounds (with some special exceptions for one past the
end), meaning compilers are not obligated to generate those checks,
instead relying on the programmer to arrange, by whatever means, that
it doesn’t happen.</p>

<p>Undefined behavior is like nitro, a dangerous, volatile substance that
makes things go really, really fast. You could argue that it’s <em>too</em>
dangerous to use in practice, but the aggressive use of undefined
behavior is <a href="http://thoughtmesh.net/publish/367.php">not without merit</a>.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Intercepting and Emulating Linux System Calls with Ptrace</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/06/23/"/>
    <id>urn:uuid:a39b7709-d0a6-3b12-159f-7445d9524594</id>
    <updated>2018-06-23T20:41:08Z</updated>
    <category term="linux"/><category term="x86"/><category term="c"/><category term="bsd"/>
    <content type="html">
      <![CDATA[<p>The <code class="language-plaintext highlighter-rouge">ptrace(2)</code> (“process trace”) system call is usually associated with
debugging. It’s the primary mechanism through which native debuggers
monitor debuggees on unix-like systems. It’s also the usual approach for
implementing <a href="https://blog.plover.com/Unix/strace-groff.html">strace</a> — system call trace. With Ptrace, tracers
can pause tracees, <a href="/blog/2016/09/03/">inspect and set registers and memory</a>, monitor
system calls, or even <em>intercept</em> system calls.</p>

<p>By intercept, I mean that the tracer can mutate system call arguments,
mutate the system call return value, or even block certain system calls.
Reading between the lines, this means a tracer can fully service system
calls itself. This is particularly interesting because it also means <strong>a
tracer can emulate an entire foreign operating system</strong>. This is done
without any special help from the kernel beyond Ptrace.</p>

<p>The catch is that a process can only have one tracer attached at a time,
so it’s not possible to emulate a foreign operating system while also
debugging that process with, say, GDB. The other issue is that emulated
system calls will have higher overhead.</p>

<p>For this article I’m going to focus on <a href="http://man7.org/linux/man-pages/man2/ptrace.2.html">Linux’s Ptrace</a> on
x86-64, and I’ll be taking advantage of a few Linux-specific extensions.
For the article I’ll also be omitting error checks, but the full source
code listings will have them.</p>

<p>You can find runnable code for the examples in this article here:</p>

<p><strong><a href="https://github.com/skeeto/ptrace-examples">https://github.com/skeeto/ptrace-examples</a></strong></p>

<h3 id="strace">strace</h3>

<p>Before getting into the really interesting stuff, let’s start by
reviewing a bare bones implementation of strace. It’s <a href="/blog/2018/01/17/">no
DTrace</a>, but strace is still incredibly useful.</p>

<p>Ptrace has never been standardized. Its interface is similar across
different operating systems, especially in its core functionality, but
it’s still subtly different from system to system. The <code class="language-plaintext highlighter-rouge">ptrace(2)</code>
prototype generally looks something like this, though the specific
types may be different.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">ptrace</span><span class="p">(</span><span class="kt">int</span> <span class="n">request</span><span class="p">,</span> <span class="n">pid_t</span> <span class="n">pid</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">pid</code> is the tracee’s process ID. While a tracee can have only one
tracer attached at a time, a tracer can be attached to many tracees.</p>

<p>The <code class="language-plaintext highlighter-rouge">request</code> field selects a specific Ptrace function, just like the
<code class="language-plaintext highlighter-rouge">ioctl(2)</code> interface. For strace, only three are needed:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_TRACEME</code>: This process is to be traced by its parent.</li>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code>: Continue, but stop at the next system call
entrance or exit.</li>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_GETREGS</code>: Get a copy of the tracee’s registers.</li>
</ul>

<p>The other two fields, <code class="language-plaintext highlighter-rouge">addr</code> and <code class="language-plaintext highlighter-rouge">data</code>, serve as generic arguments for
the selected Ptrace function. One or both are often ignored, in which
case I pass zero.</p>
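<p>To illustrate the <code>data</code> argument, here is a standalone sketch of
my own (error checks omitted, as in the rest of the article) in which a
parent reads a stopped child’s registers, passing the output struct
through <code>data</code> while <code>addr</code> goes unused:</p>

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that stops itself, then fetch its registers with
 * PTRACE_GETREGS (Linux, x86-64). Returns 1 if the child's
 * instruction pointer was nonzero, which it always should be. */
int
inspect_child(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        raise(SIGSTOP);               /* pause so the parent can look */
        _exit(0);
    }
    waitpid(pid, 0, 0);               /* child is now stopped */

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);  /* addr unused, data = &regs */

    ptrace(PTRACE_DETACH, pid, 0, 0); /* resume; the child exits */
    waitpid(pid, 0, 0);
    return regs.rip != 0;
}
```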

<p>The strace interface is essentially a prefix to another command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace [strace options] program [arguments]
</code></pre></div></div>

<p>My minimal strace doesn’t have any options, so the first thing to do —
assuming it has at least one argument — is <code class="language-plaintext highlighter-rouge">fork(2)</code> and <code class="language-plaintext highlighter-rouge">exec(2)</code> the
tracee process on the tail of <code class="language-plaintext highlighter-rouge">argv</code>. But before loading the target
program, the new process will inform the kernel that it’s going to be
traced by its parent. The tracee will be paused by this Ptrace system
call.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pid_t</span> <span class="n">pid</span> <span class="o">=</span> <span class="n">fork</span><span class="p">();</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">pid</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="cm">/* error */</span>
        <span class="n">FATAL</span><span class="p">(</span><span class="s">"%s"</span><span class="p">,</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">));</span>
    <span class="k">case</span> <span class="mi">0</span><span class="p">:</span>  <span class="cm">/* child */</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_TRACEME</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">execvp</span><span class="p">(</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">argv</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">FATAL</span><span class="p">(</span><span class="s">"%s"</span><span class="p">,</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The parent waits for the child’s <code class="language-plaintext highlighter-rouge">PTRACE_TRACEME</code> using <code class="language-plaintext highlighter-rouge">wait(2)</code>. When
<code class="language-plaintext highlighter-rouge">wait(2)</code> returns, the child will be paused.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Before allowing the child to continue, we tell the operating system that
the tracee should be terminated along with its parent. A real strace
implementation may want to set other options, such as
<code class="language-plaintext highlighter-rouge">PTRACE_O_TRACEFORK</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETOPTIONS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">PTRACE_O_EXITKILL</span><span class="p">);</span>
</code></pre></div></div>

<p>All that’s left is a simple, endless loop that catches on system calls
one at a time. The body of the loop has four steps:</p>

<ol>
  <li>Wait for the process to enter the next system call.</li>
  <li>Print a representation of the system call.</li>
  <li>Allow the system call to execute and wait for the return.</li>
  <li>Print the system call return value.</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code> request is used in both waiting for the next system
call to begin, and waiting for that system call to exit. As before, a
<code class="language-plaintext highlighter-rouge">wait(2)</code> is needed to wait for the tracee to enter the desired state.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">wait(2)</code> returns, the registers for the thread that made the
system call are filled with the system call number and its arguments.
However, <em>the operating system has not yet serviced this system call</em>.
This detail will be important later.</p>

<p>The next step is to gather the system call information. This is where
it gets architecture specific. On x86-64, <a href="/blog/2015/05/15/">the system call number is
passed in <code class="language-plaintext highlighter-rouge">rax</code></a>, and the arguments (up to 6) are passed in
<code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">r10</code>, <code class="language-plaintext highlighter-rouge">r8</code>, and <code class="language-plaintext highlighter-rouge">r9</code>. Reading the registers is
another Ptrace call, though there’s no need to <code class="language-plaintext highlighter-rouge">wait(2)</code> since the
tracee isn’t changing state.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
<span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
<span class="kt">long</span> <span class="n">syscall</span> <span class="o">=</span> <span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">;</span>

<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"%ld(%ld, %ld, %ld, %ld, %ld, %ld)"</span><span class="p">,</span>
        <span class="n">syscall</span><span class="p">,</span>
        <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rdi</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rsi</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rdx</span><span class="p">,</span>
        <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r10</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r8</span><span class="p">,</span>  <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r9</span><span class="p">);</span>
</code></pre></div></div>

<p>There’s one caveat. For <a href="https://web.archive.org/web/20190323050358/https://stackoverflow.com/a/6469069">internal kernel purposes</a>, the system
call number is stored in <code class="language-plaintext highlighter-rouge">orig_rax</code> rather than <code class="language-plaintext highlighter-rouge">rax</code>. All the other
system call arguments are straightforward.</p>

<p>Next it’s another <code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code> and <code class="language-plaintext highlighter-rouge">wait(2)</code>, then another
<code class="language-plaintext highlighter-rouge">PTRACE_GETREGS</code> to fetch the result. The result is stored in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">" = %ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rax</span><span class="p">);</span>
</code></pre></div></div>

<p>The output from this simple program is <em>very</em> crude. There is no
symbolic name for the system call and every argument is printed
numerically, even if it’s a pointer to a buffer. A more complete strace
would know which arguments are pointers and use <code class="language-plaintext highlighter-rouge">process_vm_readv(2)</code> to
read those buffers from the tracee in order to print them appropriately.</p>
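<p>As a taste of what that would involve, here’s a small sketch (my own
illustration, not part of the original program) of copying a buffer out
of the tracee with <code class="language-plaintext highlighter-rouge">process_vm_readv(2)</code>. The remote address would
come from a register such as <code class="language-plaintext highlighter-rouge">rdi</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define _GNU_SOURCE
#include &lt;sys/uio.h&gt;    /* process_vm_readv(2), struct iovec */
#include &lt;sys/types.h&gt;

/* Copy up to size bytes from the tracee's memory at remote into buf.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t
read_tracee(pid_t pid, void *remote, void *buf, size_t size)
{
    struct iovec local_iov  = {buf, size};
    struct iovec remote_iov = {remote, size};
    return process_vm_readv(pid, &amp;local_iov, 1, &amp;remote_iov, 1, 0);
}
</code></pre></div></div>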

<p>However, this does lay the groundwork for system call interception.</p>

<h3 id="system-call-interception">System call interception</h3>

<p>Suppose we want to use Ptrace to implement something like OpenBSD’s
<a href="https://man.openbsd.org/pledge.2"><code class="language-plaintext highlighter-rouge">pledge(2)</code></a>, in which <a href="http://www.openbsd.org/papers/hackfest2015-pledge/mgp00001.html">a process <em>pledges</em> to use only a
restricted set of system calls</a>. The idea is that many
programs typically have an initialization phase where they need lots
of system access (opening files, binding sockets, etc.). After
initialization they enter a main loop in which they process input and
need only a small set of system calls.</p>

<p>Before entering this main loop, a process can limit itself to the few
operations that it needs. If <a href="/blog/2017/07/19/">the program has a flaw</a> allowing it
to be exploited by bad input, the pledge significantly limits what the
exploit can accomplish.</p>

<p>Using the same strace model, rather than print out all system calls,
we could either block certain system calls or simply terminate the
tracee when it misbehaves. Termination is easy: just call <code class="language-plaintext highlighter-rouge">exit(2)</code> in
the tracer, since it’s configured to also terminate the tracee.
Blocking the system call and allowing the child to continue is a
little trickier.</p>

<p>The tricky part is that <strong>there’s no way to abort a system call once
it’s started</strong>. When the tracer returns from <code class="language-plaintext highlighter-rouge">wait(2)</code> on the entrance to
the system call, the only way to stop a system call from happening is
to terminate the tracee.</p>

<p>However, not only can we mess with the system call arguments, we can
change the system call number itself, converting it to a system call
that doesn’t exist. On return we can report a “friendly” <code class="language-plaintext highlighter-rouge">EPERM</code> error
in <code class="language-plaintext highlighter-rouge">errno</code> <a href="/blog/2016/09/23/">via the normal in-band signaling</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="cm">/* Enter next system call */</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>

    <span class="cm">/* Is this system call permitted? */</span>
    <span class="kt">int</span> <span class="n">blocked</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">is_syscall_blocked</span><span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">blocked</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// set to invalid syscall</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="cm">/* Run system call and stop on exit */</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">blocked</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* errno = EPERM */</span>
        <span class="n">regs</span><span class="p">.</span><span class="n">rax</span> <span class="o">=</span> <span class="o">-</span><span class="n">EPERM</span><span class="p">;</span> <span class="c1">// Operation not permitted</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This simple example only checks against a whitelist or blacklist of
system calls. And there’s no nuance, such as allowing files to be
opened (<code class="language-plaintext highlighter-rouge">open(2)</code>) read-only but not as writable, allowing anonymous
memory maps but not non-anonymous mappings, etc. There’s also no way
for the tracee to dynamically drop privileges.</p>

<p>How <em>could</em> the tracee communicate to the tracer? Use an artificial
system call!</p>

<h3 id="creating-an-artificial-system-call">Creating an artificial system call</h3>

<p>For my new pledge-like system call — which I call <code class="language-plaintext highlighter-rouge">xpledge()</code> to
distinguish it from the real thing — I picked system call number 10000,
a nice high number that’s unlikely to ever be used for a real system
call.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYS_xpledge 10000
</span></code></pre></div></div>

<p>Just for demonstration purposes, I put together a minuscule interface
that’s not good for much in practice. It has little in common with
OpenBSD’s <code class="language-plaintext highlighter-rouge">pledge(2)</code>, which uses a <a href="https://www.tedunangst.com/flak/post/string-interfaces">string interface</a>.
<em>Actually</em> designing robust and secure sets of privileges is really
complicated, as the <code class="language-plaintext highlighter-rouge">pledge(2)</code> manpage shows. Here’s the entire
interface <em>and</em> implementation of the system call for the tracee:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="cp">#define XPLEDGE_RDWR  (1 &lt;&lt; 0)
#define XPLEDGE_OPEN  (1 &lt;&lt; 1)
</span>
<span class="cp">#define xpledge(arg) syscall(SYS_xpledge, arg)
</span></code></pre></div></div>

<p>If it passes zero for the argument, only a few basic system calls are
allowed, including those used to allocate memory (e.g. <code class="language-plaintext highlighter-rouge">brk(2)</code>). The
<code class="language-plaintext highlighter-rouge">XPLEDGE_RDWR</code> bit allows <a href="/blog/2017/03/01/">various</a> read and write system calls
(<code class="language-plaintext highlighter-rouge">read(2)</code>, <code class="language-plaintext highlighter-rouge">readv(2)</code>, <code class="language-plaintext highlighter-rouge">pread(2)</code>, <code class="language-plaintext highlighter-rouge">preadv(2)</code>, etc.). The
<code class="language-plaintext highlighter-rouge">XPLEDGE_OPEN</code> bit allows <code class="language-plaintext highlighter-rouge">open(2)</code>.</p>
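<p>The check itself is simple table lookup. This <code class="language-plaintext highlighter-rouge">is_syscall_blocked()</code>
is my own sketch of what the tracer might do with those flags; the
flag-to-syscall mapping here is illustrative, not a complete policy:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;sys/syscall.h&gt;   /* SYS_read, SYS_openat, ... (Linux) */

#define XPLEDGE_RDWR  (1 &lt;&lt; 0)
#define XPLEDGE_OPEN  (1 &lt;&lt; 1)

/* Set by register_pledge() when the tracee calls xpledge().
 * -1 means no pledge has been made yet, so everything is allowed. */
static long pledge_flags = -1;

static int
is_syscall_blocked(long syscall)
{
    if (pledge_flags == -1)
        return 0;
    switch (syscall) {
        case SYS_brk:           /* basic allocation is always allowed */
        case SYS_exit_group:
            return 0;
        case SYS_read:
        case SYS_write:
            return !(pledge_flags &amp; XPLEDGE_RDWR);
#ifdef SYS_open                 /* not present on all architectures */
        case SYS_open:
#endif
        case SYS_openat:
            return !(pledge_flags &amp; XPLEDGE_OPEN);
    }
    return 1;                   /* deny everything else */
}
</code></pre></div></div>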

<p>To prevent privileges from being escalated back, <code class="language-plaintext highlighter-rouge">xpledge()</code> blocks
itself — though this also prevents dropping more privileges later down
the line.</p>

<p>In the xpledge tracer, I just need to check for this system call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Handle entrance */</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">SYS_xpledge</span><span class="p">:</span>
        <span class="n">register_pledge</span><span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">rdi</span><span class="p">);</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The operating system will return <code class="language-plaintext highlighter-rouge">ENOSYS</code> (Function not implemented)
since this isn’t a <em>real</em> system call. So on the way out I overwrite
this with a success (0).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Handle exit */</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">SYS_xpledge</span><span class="p">:</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_POKEUSER</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="n">RAX</span> <span class="o">*</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I wrote a little test program that opens <code class="language-plaintext highlighter-rouge">/dev/urandom</code>, makes a read,
tries to pledge, then tries to open <code class="language-plaintext highlighter-rouge">/dev/urandom</code> a second time, then
confirms it can read from the original <code class="language-plaintext highlighter-rouge">/dev/urandom</code> file descriptor.
Running without a pledge tracer, the output looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604
</code></pre></div></div>

<p>Making an invalid system call doesn’t crash an application. It just
fails, which is a rather convenient fallback. When run under the
tracer, it looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4
</code></pre></div></div>

<p>The pledge succeeds but the second <code class="language-plaintext highlighter-rouge">fopen(3)</code> does not since the tracer
blocked it with <code class="language-plaintext highlighter-rouge">EPERM</code>.</p>

<p>This concept could be taken much further, to, say, change file paths or
return fake results. A tracer could effectively chroot its tracee,
prepending some chroot path to the root of any path passed through a
system call. It could even lie to the process about what user it is,
claiming that it’s running as root. In fact, this is exactly how the
<a href="https://fakeroot-ng.lingnu.com/index.php/Home_Page">Fakeroot NG</a> program works.</p>
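<p>Modifying the tracee’s memory is the same idea in reverse. Here’s a
rough sketch of my own, with a big caveat: it overwrites a string
argument in place using <code class="language-plaintext highlighter-rouge">PTRACE_PEEKDATA</code> and <code class="language-plaintext highlighter-rouge">PTRACE_POKEDATA</code>, which
is only safe when the replacement fits in the buffer the tracee already
owns. A real tracer would stage longer paths in scratch memory instead.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;string.h&gt;
#include &lt;sys/ptrace.h&gt;
#include &lt;sys/types.h&gt;

/* Overwrite the string at remote in the tracee with str, one word at a
 * time, preserving the tracee's bytes beyond the terminating NUL. */
static void
poke_string(pid_t pid, char *remote, const char *str)
{
    size_t len = strlen(str) + 1;          /* include the NUL */
    for (size_t i = 0; i &lt; len; i += sizeof(long)) {
        size_t n = len - i &lt; sizeof(long) ? len - i : sizeof(long);
        long word = ptrace(PTRACE_PEEKDATA, pid, remote + i, 0);
        memcpy(&amp;word, str + i, n);         /* splice in our bytes */
        ptrace(PTRACE_POKEDATA, pid, remote + i, (void *)word);
    }
}
</code></pre></div></div>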

<h3 id="foreign-system-emulation">Foreign system emulation</h3>

<p>Suppose you don’t just want to intercept <em>some</em> system calls, but
<em>all</em> system calls. You’ve got <a href="/blog/2017/11/30/">a binary intended to run on another
operating system</a>, so none of the system calls it makes will ever
work.</p>

<p>You could manage all this using only what I’ve described so far. The
tracer would always replace the system call number with a dummy, allow
it to fail, then service the system call itself. But that’s really
inefficient. That’s essentially three context switches for each system
call: one to stop on the entrance, one to make the always-failing
system call, and one to stop on the exit.</p>

<p>The Linux version of Ptrace has had a more efficient operation for
this technique since 2005: <code class="language-plaintext highlighter-rouge">PTRACE_SYSEMU</code>. Ptrace stops only <em>once</em>
per system call, and it’s up to the tracer to service that system
call before allowing the tracee to continue.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSEMU</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>

    <span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="n">OS_read</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_write</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_open</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_exit</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="cm">/* ... and so on ... */</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To run binaries for the same architecture from any system with a
stable (enough) system call ABI, you just need this <code class="language-plaintext highlighter-rouge">PTRACE_SYSEMU</code>
tracer, a loader (to take the place of <code class="language-plaintext highlighter-rouge">exec(2)</code>), and whatever system
libraries the binary needs (or only run static binaries).</p>

<p>In fact, this sounds like a fun weekend project.</p>

<h3 id="see-also">See also</h3>

<ul>
  <li><a href="https://www.youtube.com/watch?v=uXgxMDglxVM">Implementing a clone of OpenBSD pledge into the Linux kernel</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>When FFI Function Calls Beat Native C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/27/"/>
    <id>urn:uuid:cb339e3b-382e-3762-4e5c-10cf049f7627</id>
    <updated>2018-05-27T20:03:15Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update: There’s a good discussion on <a href="https://news.ycombinator.com/item?id=17171252">Hacker News</a>.</em></p>

<p>Over on GitHub, David Yu has an interesting performance benchmark for
function calls of various Foreign Function Interfaces (<a href="https://en.wikipedia.org/wiki/Foreign_function_interface">FFI</a>):</p>

<p><a href="https://github.com/dyu/ffi-overhead">https://github.com/dyu/ffi-overhead</a></p>

<p>He created a shared object (<code class="language-plaintext highlighter-rouge">.so</code>) file containing a single, simple C
function. Then for each FFI he wrote a bit of code to call this function
many times, measuring how long it took.</p>

<p>For the C “FFI” he used standard dynamic linking, not <code class="language-plaintext highlighter-rouge">dlopen()</code>. This
distinction is important, since it really makes a difference in the
benchmark. There’s a potential argument about whether or not this is a
fair comparison to an actual FFI, but, regardless, it’s still
interesting to measure.</p>

<p>The most surprising result of the benchmark is that
<strong><a href="http://luajit.org/">LuaJIT’s</a> FFI is substantially faster than C</strong>. It’s about
25% faster than a native C function call to a shared object function.
How could a weakly and dynamically typed scripting language come out
ahead on a benchmark? Is this accurate?</p>

<p>It’s actually quite reasonable. The benchmark was run on Linux, so the
performance penalty we’re seeing comes from the <em>Procedure Linkage Table</em>
(PLT). I’ve put together a really simple experiment to demonstrate the
same effect in plain old C:</p>

<p><a href="https://github.com/skeeto/dynamic-function-benchmark">https://github.com/skeeto/dynamic-function-benchmark</a></p>

<p>Here are the results on an Intel i7-6700 (Skylake):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call
</code></pre></div></div>

<p>These are three different types of function calls:</p>

<ol>
  <li>Through the PLT</li>
  <li>An indirect function call (via <code class="language-plaintext highlighter-rouge">dlsym(3)</code>)</li>
  <li>A direct function call (via a JIT-compiled function)</li>
</ol>

<p>As shown, the last one is the fastest. It’s typically not an option
for C programs, but it’s natural in the presence of a JIT compiler,
including, apparently, LuaJIT.</p>
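<p>To make the third case concrete, here’s a tiny sketch of JIT-compiled
code (my own illustration, not the benchmark’s actual generator, and
x86-64 specific): allocate a page, write machine code into it, flip it
to executable, and call it like any other function. The benchmark emits
a direct <code class="language-plaintext highlighter-rouge">call</code> to <code class="language-plaintext highlighter-rouge">empty()</code> the same way.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;

typedef int (*jit_fn)(void);

/* JIT a function equivalent to: int f(void) { return 42; }  (x86-64,
 * error checks omitted for brevity) */
static jit_fn
jit_compile(void)
{
    static const uint8_t code[] = {
        0xb8, 0x2a, 0x00, 0x00, 0x00,   /* mov eax, 42 */
        0xc3,                           /* ret         */
    };
    void *page = mmap(0, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(page, code, sizeof(code));
    mprotect(page, 4096, PROT_READ | PROT_EXEC);  /* W^X: now executable */
    return (jit_fn)page;
}
</code></pre></div></div>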

<p>In my benchmark, the function being called is named <code class="language-plaintext highlighter-rouge">empty()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">empty</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="p">}</span>
</code></pre></div></div>

<p>And to compile it into a shared object:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -fPIC -Os -o empty.so empty.c
</code></pre></div></div>

<p>Just as in my <a href="/blog/2017/09/21/">PRNG shootout</a>, the benchmark calls this function
repeatedly as many times as possible before an alarm goes off.</p>
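<p>The timing harness itself is simple. This is a condensed sketch of
the alarm mechanism; the real benchmark calls <code class="language-plaintext highlighter-rouge">empty()</code> inside the loop
and reports nanoseconds per call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;signal.h&gt;
#include &lt;unistd.h&gt;

/* Cleared by the SIGALRM handler to end the measurement window. */
static volatile sig_atomic_t running;

static void
alarm_handler(int signum)
{
    (void)signum;
    running = 0;
}

/* Spin for the given number of seconds, counting iterations. The real
 * benchmark calls empty() inside this loop. */
static long
measure(unsigned seconds)
{
    signal(SIGALRM, alarm_handler);
    running = 1;
    alarm(seconds);
    long count = 0;
    while (running)
        count++;
    return count;
}
</code></pre></div></div>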

<h3 id="procedure-linkage-tables">Procedure Linkage Tables</h3>

<p>When a program or library calls a function in another shared object,
the compiler cannot know where that function will be located in
memory. That information isn’t known until run time, after the program
and its dependencies are loaded into memory. These are usually at
randomized locations — e.g. <em>Address Space Layout Randomization</em>
(ASLR).</p>

<p>How is this resolved? Well, there are a couple of options.</p>

<p>One option is to make a note about each such call in the binary’s
metadata. The run-time dynamic linker can then <em>patch</em> in the correct
address at each call site. How exactly this would work depends on the
particular <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">code model</a> used when compiling the binary.</p>

<p>The downside to this approach is slower loading, larger binaries, and
less <a href="/blog/2016/04/10/">sharing of code pages</a> between different processes. It’s
slower loading because every dynamic call site needs to be patched
before the program can begin execution. The binary is larger because
each of these call sites needs an entry in the relocation table. And the
lack of sharing is due to the code pages being modified.</p>

<p>On the other hand, the overhead for dynamic function calls would be
eliminated, giving JIT-like performance as seen in the benchmark.</p>

<p>The second option is to route all dynamic calls through a table. The
original call site calls into a stub in this table, which jumps to the
actual dynamic function. With this approach the code does not need to
be patched, meaning it’s <a href="/blog/2016/12/23/">trivially shared</a> between processes.
Only one place needs to be patched per dynamic function: the entries
in the table. Even more, these patches can be performed <em>lazily</em>, on
the first function call, making the load time even faster.</p>

<p>On systems using ELF binaries, this table is called the Procedure
Linkage Table (PLT). The PLT itself doesn’t actually get patched —
it’s mapped read-only along with the rest of the code. Instead the
<em>Global Offset Table</em> (GOT) gets patched. The PLT stub fetches the
dynamic function address from the GOT and <em>indirectly</em> jumps to that
address. To lazily load function addresses, these GOT entries are
initialized with an address of a function that locates the target
symbol, updates the GOT with that address, and then jumps to that
function. Subsequent calls use the lazily discovered address.</p>
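<p>Concretely, each lazily bound PLT stub looks roughly like this on
x86-64 (simplified; the exact layout varies by linker):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>empty@plt:
    jmp   *empty@GOT(%rip)   ; normally jumps straight to empty()
    push  $reloc_index       ; first call only: the GOT entry still
    jmp   plt0               ; points here, so invoke the lazy resolver,
                             ; which patches the GOT and jumps to empty()
</code></pre></div></div>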

<p><img src="/img/diagram/plt.svg" alt="" /></p>
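<p>The lazy-binding dance is easy to model in plain C. This is only a toy
sketch of the mechanism, and the <code class="language-plaintext highlighter-rouge">got</code>, <code class="language-plaintext highlighter-rouge">plt_stub</code>, and <code class="language-plaintext highlighter-rouge">resolver</code> names are
illustrative, not real linker machinery:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Toy model of lazy binding: the GOT slot starts out pointing at a
 * resolver, which patches in the real address on the first call. */
static void target(void) { /* stands in for a dynamic function */ }

static void resolver(void);

/* "GOT": one slot per dynamic function */
static void (*got[1])(void) = { resolver };

/* "PLT stub": always an indirect call through the GOT slot */
static void plt_stub(void) { got[0](); }

static void
resolver(void)
{
    got[0] = target;  /* patch the GOT with the resolved address */
    target();         /* then complete the original call */
}
</code></pre></div></div>

<p>The first call to <code class="language-plaintext highlighter-rouge">plt_stub()</code> lands in <code class="language-plaintext highlighter-rouge">resolver()</code>; every call after
that goes straight to <code class="language-plaintext highlighter-rouge">target()</code> through the patched slot.</p>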

<p>The downside of a PLT is extra overhead per dynamic function call,
which is what shows up in the benchmark. Since the benchmark <em>only</em>
measures function calls, this appears to be pretty significant, but in
practice it’s usually drowned out in noise.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Cleared by an alarm signal. */</span>
<span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">plt_benchmark</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">empty</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">empty()</code> is in the shared object, that call goes through the PLT.</p>

<h3 id="indirect-dynamic-calls">Indirect dynamic calls</h3>

<p>Another way to dynamically call functions is to bypass the PLT and
fetch the target function address within the program, e.g. via
<code class="language-plaintext highlighter-rouge">dlsym(3)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">h</span> <span class="o">=</span> <span class="n">dlopen</span><span class="p">(</span><span class="s">"path/to/lib.so"</span><span class="p">,</span> <span class="n">RTLD_NOW</span><span class="p">);</span>
<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">)</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"f"</span><span class="p">);</span>
<span class="n">f</span><span class="p">();</span>
</code></pre></div></div>

<p>Once the function address is obtained, the overhead is smaller than
function calls routed through the PLT. There’s no intermediate stub
function and no GOT access. (Caveat: If the program has a PLT entry for
the given function then <code class="language-plaintext highlighter-rouge">dlsym(3)</code> may actually return the address of
the PLT stub.)</p>
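<p>For completeness, here’s what the whole dance looks like with error
handling. This is a sketch that assumes a glibc-style Linux system where
the math library is installed as <code class="language-plaintext highlighter-rouge">libm.so.6</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;dlfcn.h&gt;

/* Resolve and call cos(3) entirely at run time. */
static double
call_cos(double x)
{
    void *h = dlopen("libm.so.6", RTLD_NOW);
    if (!h)
        return -2.0;  /* sentinel: cos(3) never goes below -1 */
    double (*f)(double) = (double (*)(double))dlsym(h, "cos");
    double r = f ? f(x) : -2.0;
    dlclose(h);
    return r;
}
</code></pre></div></div>

<p>Older glibc needs <code class="language-plaintext highlighter-rouge">-ldl</code> at link time; since glibc 2.34 these functions
live in libc itself.</p>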

<p>However, this is still an <em>indirect</em> function call. On conventional
architectures, <em>direct</em> function calls have an immediate relative
address. That is, the target of the call is some hard-coded offset from
the call site. The CPU can see well ahead of time where the call is
going.</p>

<p>An indirect function call has more overhead. First, the address has to
be stored somewhere. Even if that somewhere is just a register, it
increases register pressure by using up a register. Second, it
provokes the CPU’s branch predictor since the call target isn’t
static, making for extra bookkeeping in the CPU. In the worst case the
function call may even cause a pipeline stall.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">indirect_benchmark</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">f</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function passed to this benchmark is fetched with <code class="language-plaintext highlighter-rouge">dlsym(3)</code> so the
compiler can’t <a href="/blog/2018/05/01/">do something tricky</a> like convert that indirect
call back into a direct call.</p>

<p>If the body of the loop was complicated enough that there was register
pressure, thereby requiring the address to be spilled onto the stack,
this benchmark might not fare as well against the PLT benchmark.</p>

<h3 id="direct-function-calls">Direct function calls</h3>

<p>The first two types of dynamic function calls are simple and easy to
use. <em>Direct</em> calls to dynamic functions is trickier business since it
requires modifying code at run time. In my benchmark I put together a
<a href="/blog/2015/03/19/">little JIT compiler</a> to generate the direct call.</p>

<p>There’s a gotcha to this: on x86-64, direct jumps are limited to a
±2GB range because the displacement is a signed 32-bit immediate. This
means the JIT code has to be placed virtually near the target function,
<code class="language-plaintext highlighter-rouge">empty()</code>. If the JIT code needed to call two different dynamic
functions separated by more than 2GB, then it’s not possible for both
calls to be direct.</p>
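<p>A quick range check makes the constraint concrete. This helper is
hypothetical, not part of the benchmark, and assumes the usual 5-byte
<code class="language-plaintext highlighter-rouge">call rel32</code> encoding:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Can a 5-byte "call rel32" at src reach dst?  The displacement is
 * signed 32 bits, relative to the end of the instruction. */
static int
in_rel32_range(uint64_t src, uint64_t dst)
{
    int64_t rel = (int64_t)(dst - (src + 5));
    return rel == (int32_t)rel;  /* fits in the immediate? */
}
</code></pre></div></div>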

<p>To keep things simple, my benchmark isn’t precise or very careful
about picking the JIT code address. After being given the target
function address, it blindly subtracts 4MB, rounds down to the nearest
page, allocates some memory, and writes code into it. To do this
correctly would mean inspecting the program’s own memory mappings to
find space, and there’s no clean, portable way to do this. On Linux
this <a href="/blog/2016/09/03/">requires parsing virtual files under <code class="language-plaintext highlighter-rouge">/proc</code></a>.</p>
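<p>Here’s roughly what that parsing looks like. It’s a Linux-only
sketch that just answers whether an address is already mapped; a real
implementation would search for a suitably sized gap instead:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

/* Walk /proc/self/maps and report whether addr falls inside an
 * existing mapping.  Returns -1 if the file cannot be opened. */
static int
address_is_mapped(uintptr_t addr)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;
    int found = 0;
    unsigned long long lo, hi;
    while (fscanf(f, "%llx-%llx%*[^\n]", &amp;lo, &amp;hi) == 2) {
        if (addr &gt;= lo &amp;&amp; addr &lt; hi) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
</code></pre></div></div>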

<p>Here’s what my JIT’s memory allocation looks like. It assumes
<a href="/blog/2016/05/30/">reasonable behavior for <code class="language-plaintext highlighter-rouge">uintptr_t</code> casts</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">jit_compile</span><span class="p">(</span><span class="k">struct</span> <span class="n">jit_func</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">empty</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">uintptr_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">desired</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="n">addr</span> <span class="o">-</span> <span class="n">SAFETY_MARGIN</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">PAGEMASK</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">desired</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It allocates two pages, one writable and the other containing
non-writable code. Similar to <a href="/blog/2017/01/08/">my closure library</a>, the lower
page is writable and holds the <code class="language-plaintext highlighter-rouge">running</code> variable that gets cleared by
the alarm. It needs to be near the JIT code so that it can be accessed
with an efficient RIP-relative load, just like in the other two
benchmark functions. The upper page contains this assembly:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">jit_benchmark:</span>
        <span class="nf">push</span>  <span class="nb">rbx</span>
        <span class="nf">xor</span>   <span class="nb">ebx</span><span class="p">,</span> <span class="nb">ebx</span>
<span class="nl">.loop:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">running</span><span class="p">]</span>
        <span class="nf">test</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">je</span>    <span class="nv">.done</span>
        <span class="nf">call</span>  <span class="nv">empty</span>
        <span class="nf">inc</span>   <span class="nb">ebx</span>
        <span class="nf">jmp</span>   <span class="nv">.loop</span>
<span class="nl">.done:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebx</span>
        <span class="nf">pop</span>   <span class="nb">rbx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">call empty</code> is the only instruction that is dynamically generated
— necessary to fill out the relative address appropriately (the minus
5 is because it’s relative to the <em>end</em> of the instruction):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// call empty</span>
    <span class="kt">uintptr_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">p</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="mh">0xe8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">empty()</code> wasn’t in a shared object and instead located in the same
binary, this is essentially the direct call that the compiler would have
generated for <code class="language-plaintext highlighter-rouge">plt_benchmark()</code>, assuming somehow it didn’t inline
<code class="language-plaintext highlighter-rouge">empty()</code>.</p>

<p>Ironically, calling the JIT-compiled code requires an indirect call
(e.g. via a function pointer), and there’s no way around this. What
are you going to do, JIT compile another function that makes the
direct call? Fortunately this doesn’t matter since the part being
measured in the loop is only a direct call.</p>

<h3 id="its-no-mystery">It’s no mystery</h3>

<p>Given these results, it’s really no mystery that LuaJIT can generate
more efficient dynamic function calls than a PLT, <em>even if they still
end up being indirect calls</em>. In my benchmark, the non-PLT indirect
calls were 28% faster than the PLT, and the direct calls 43% faster
than the PLT. That’s a small edge that JIT-enabled programs have over
plain old native programs, though it comes at the cost of absolutely
no code sharing between processes.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>When the Compiler Bites</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/01/"/>
    <id>urn:uuid:02b974e1-e25b-397d-a16f-c754338e9c1e</id>
    <updated>2018-05-01T23:28:06Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="ai"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p><em>Update: There are discussions <a href="https://old.reddit.com/r/cpp/comments/8gfhq3/when_the_compiler_bites/">on Reddit</a> and <a href="https://news.ycombinator.com/item?id=16974770">on Hacker
News</a>.</em></p>

<p>So far this year I’ve been bitten three times by compiler edge cases
in GCC and Clang, each time catching me totally by surprise. Two were
caused by historical artifacts, where an ambiguous specification led
to diverging implementations. The third was a compiler optimization
being far more clever than I expected, behaving almost like an
artificial intelligence.</p>

<p>In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.</p>

<h3 id="x86-64-abi-ambiguity">x86-64 ABI ambiguity</h3>

<p>The first time I was bit — or, well, narrowly avoided being bit — was
when I examined a missed floating point optimization in both Clang and
GCC. Consider this function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function multiplies its argument by zero and returns the result. Any
number multiplied by zero is zero, so this should always return zero,
right? Unfortunately, no. IEEE 754 floating point arithmetic supports
NaN, infinities, and signed zeros. This function can return NaN,
positive zero, or negative zero. (In some cases, the operation could
also potentially produce a hardware exception.)</p>
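<p>A few concrete inputs make the special cases visible. These all
follow directly from IEEE 754:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>double zero_multiply(double x) { return x * 0.0; }

/* zero_multiply(NAN)      is NaN:  NaN propagates through the multiply
 * zero_multiply(INFINITY) is NaN:  infinity times zero is invalid
 * zero_multiply(-1.0)     is -0.0: the sign bits are XORed */
</code></pre></div></div>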

<p>As a result, both GCC and Clang perform the multiply:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorpd</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">mulsd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-ffast-math</code> option relaxes the C standard floating point rules,
permitting an optimization at the cost of conformance and
<a href="https://possiblywrong.wordpress.com/2017/09/12/floating-point-agreement-between-matlab-and-c/">consistency</a>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Side note: <code class="language-plaintext highlighter-rouge">-ffast-math</code> doesn’t necessarily mean “less precise.”
Sometimes it will actually <a href="https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add">improve precision</a>.</p>

<p>Here’s a modified version of the function that’s a little more
interesting. I’ve changed the argument to a <code class="language-plaintext highlighter-rouge">short</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply_short</span><span class="p">(</span><span class="kt">short</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s no longer possible for the argument to be one of those special
values. The <code class="language-plaintext highlighter-rouge">short</code> will be promoted to one of 65,536 possible <code class="language-plaintext highlighter-rouge">double</code>
values, each of which results in 0.0 when multiplied by 0.0. GCC misses
this optimization (<code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">movsx</span>     <span class="nb">edi</span><span class="p">,</span> <span class="nb">di</span>       <span class="c1">; sign-extend 16-bit argument</span>
    <span class="nf">xorps</span>     <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>    <span class="c1">; xmm1 = 0.0</span>
    <span class="nf">cvtsi2sd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nb">edi</span>     <span class="c1">; convert int to double</span>
    <span class="nf">mulsd</span>     <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Clang also misses this optimization:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">cvtsi2sd</span> <span class="nv">xmm1</span><span class="p">,</span> <span class="nb">edi</span>
    <span class="nf">xorpd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">mulsd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>But hang on a minute. This is shorter by one instruction. What
happened to the sign-extension (<code class="language-plaintext highlighter-rouge">movsx</code>)? Clang is treating that
<code class="language-plaintext highlighter-rouge">short</code> argument as if it were a 32-bit value. Why do GCC and Clang
differ? Is GCC doing something unnecessary?</p>

<p>It turns out that the <a href="https://www.uclibc.org/docs/psABI-x86_64.pdf">x86-64 ABI</a> didn’t specify what happens with
the upper bits in argument registers. Are they garbage? Are they zeroed?
GCC takes the conservative position of assuming the upper bits are
arbitrary garbage. Clang takes the boldest position of assuming
arguments smaller than 32 bits have been promoted to 32 bits by the
caller. This is what the ABI specification <em>should</em> have said, but
currently it does not.</p>

<p>Fortunately GCC is also conservative when passing arguments. It promotes
arguments to 32 bits as necessary, so there are no conflicts when
linking against Clang-compiled code. However, this is not true for
Intel’s ICC compiler: <a href="https://web.archive.org/web/20180908113552/https://stackoverflow.com/a/36760539"><strong>Clang and ICC are not ABI-compatible on
x86-64</strong></a>.</p>

<p>I don’t use ICC, so that particular issue wouldn’t bite me, <em>but</em> if I
was ever writing assembly routines that called Clang-compiled code, I’d
eventually get bit by this.</p>

<h3 id="floating-point-precision">Floating point precision</h3>

<p>Without looking it up or trying it, what does this function return?
Think carefully.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">float_compare</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Confident in your answer? This is a trick question, because it can
return either 0 or 1 depending on the compiler. Boy was I confused when
this comparison returned 0 in my real world code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1
</code></pre></div></div>

<p>So what’s going on here? The original ANSI C specification wasn’t
clear about how intermediate floating point values get rounded, and
implementations <a href="https://news.ycombinator.com/item?id=16974770">all did it differently</a>. The C99 specification
cleaned this all up and introduced <a href="https://en.wikipedia.org/wiki/C99#IEEE_754_floating_point_support"><code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD</code></a>.
Implementations can still differ, but at least you can now determine
at compile-time what the compiler would do by inspecting that macro.</p>

<p>Back in the late 1980’s or early 1990’s when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in <code class="language-plaintext highlighter-rouge">long double</code>
precision and truncated afterward (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 2</code>).</p>

<p>In <code class="language-plaintext highlighter-rouge">float_compare()</code> the left-hand side is truncated to a <code class="language-plaintext highlighter-rouge">float</code> by the
assignment, but the right-hand side, <em>despite being a <code class="language-plaintext highlighter-rouge">float</code> literal</em>,
is actually “1.3” at 80 bits of precision as far as GCC is concerned.
That’s pretty unintuitive!</p>
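<p>Knowing this, one workaround is to force both operands through
actual <code class="language-plaintext highlighter-rouge">float</code> variables. Under a conforming C99 implementation,
assignment (or a cast) discards the excess precision, so this sketch
returns 1 under either evaluation method. (Caveat: GCC may still need
<code class="language-plaintext highlighter-rouge">-fexcess-precision=standard</code> or <code class="language-plaintext highlighter-rouge">-ffloat-store</code> to honor this on
the x86.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int
float_compare_fixed(void)
{
    float x = 1.3f;
    float y = 1.3f;  /* assignment rounds the constant to float */
    return x == y;   /* float against float: returns 1 */
}
</code></pre></div></div>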

<p>The remnants of this high precision trend are still in JavaScript, where
all arithmetic is double precision (even if <a href="http://thibaultlaurens.github.io/javascript/2013/04/29/how-the-v8-engine-works/#more-example-on-how-v8-optimized-javascript-code">simulated using
integers</a>), and great pains have been made <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">to work around</a>
the performance consequences of this. <a href="http://tirania.org/blog/archive/2018/Apr-11.html">Until recently</a>, Mono had
similar issues.</p>

<p>The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this
became the default (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 0</code>). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323">backwards compatible</a> GCC on the old x86.</p>

<p>I’m a little ashamed that I’m only finding out about this now. However,
by the time I was competent enough to notice and understand this issue,
I was already doing nearly all my programming on the x86-64.</p>

<h3 id="built-in-function-elimination">Built-in Function Elimination</h3>

<p>I’ve saved this one for last since it’s my favorite. Suppose we have
this little function, <code class="language-plaintext highlighter-rouge">new_image()</code>, that allocates a greyscale image
for, say, <a href="/blog/2017/11/03/">some multimedia library</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a static function because this would be part of some <a href="https://github.com/nothings/stb">slick
header library</a> (and, secretly, because it’s necessary for
illustrating the issue). Being a responsible citizen, the function
even <a href="/blog/2017/07/19/">checks for integer overflow</a> before allocating anything.</p>

<p>I write a unit test to make sure it detects overflow. This function
should return 0.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_overflow</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far my test passes. Good.</p>

<p>I’d also like to make sure it correctly returns NULL — or, more
specifically, that it doesn’t crash — if the allocation fails. But how
can I make <code class="language-plaintext highlighter-rouge">malloc()</code> fail? As a hack I can pass image dimensions that
I know cannot ever practically be allocated. Essentially I want to
force a <code class="language-plaintext highlighter-rouge">malloc(SIZE_MAX)</code>, e.g. allocate every available byte in my
virtual address space. For a conventional 64-bit machine, that’s 16
exbibytes of memory, and it leaves space for nothing else, including
the program itself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_oom</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I compile with GCC, test passes. I compile with Clang and the test
fails. That is, <strong>the test somehow managed to allocate 16 exbibytes of
memory, <em>and</em> initialize it</strong>. Wat?</p>

<p>Disassembling the test reveals what’s going on:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">test_new_image_overflow:</span>
    <span class="nf">xor</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
    <span class="nf">ret</span>

<span class="nl">test_new_image_oom:</span>
    <span class="nf">mov</span>  <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The first test is actually being evaluated at compile time by the
compiler. The function being tested was inlined into the unit test
itself. This permits the compiler to collapse the whole thing down to
a single instruction. The path with <code class="language-plaintext highlighter-rouge">malloc()</code> became dead code and
was trivially eliminated.</p>

<p>In the second test, Clang correctly determined that the image buffer is
not actually being used, despite the <code class="language-plaintext highlighter-rouge">memset()</code>, so it eliminated the
allocation altogether and then <em>simulated</em> a successful allocation
despite it being absurdly large. Allocating memory is not an observable
side effect as far as the language specification is concerned, so it’s
allowed to do this. My thinking was wrong, and the compiler outsmarted
me.</p>
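<p>If I really want the allocation to survive in a test like this, the
pointer has to look like it escapes. One way, assuming GCC or Clang
since it leans on the extended-asm extension, is an empty asm barrier:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Launder a pointer through an empty asm so the compiler must assume
 * it escapes; the allocation can no longer be proven dead. */
static void *
launder(void *p)
{
    __asm__ volatile ("" : "+r"(p) : : "memory");
    return p;
}
</code></pre></div></div>

<p>Wrapping the result, as in <code class="language-plaintext highlighter-rouge">launder(new_image(1, SIZE_MAX, 0xff))</code>,
should force the allocation to actually happen, and fail.</p>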

<p>I soon realized I can take this further and trick Clang into
performing an invalid optimization, <a href="https://bugs.llvm.org/show_bug.cgi?id=37304">revealing a bug</a>. Consider
this slightly-optimized version that uses <code class="language-plaintext highlighter-rouge">calloc()</code> when the shade is
zero (black). The <code class="language-plaintext highlighter-rouge">calloc()</code> function does its own overflow check, so
<code class="language-plaintext highlighter-rouge">new_image()</code> doesn’t need to do it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">shade</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// shortcut</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With this change, my overflow unit test is now also failing. The
situation is even worse than before. The <code class="language-plaintext highlighter-rouge">calloc()</code> is being
eliminated <em>despite the overflow</em>, and replaced with a simulated
success. This time it’s actually a bug in Clang. While failing a unit
test is mostly harmless, <strong>this could introduce a vulnerability in a
real program</strong>. The OpenBSD folks are so worried about this sort of
thing that <a href="https://marc.info/?l=openbsd-cvs&amp;m=150125592126437&amp;w=2">they’ve disabled this optimization</a>.</p>

<p>Here’s a slightly-contrived example of this. Imagine a program that
maintains a table of unsigned integers, and we want to keep track of
how many times the program has accessed each table entry. The “access
counter” table is initialized to zero, but the table of values need
not be initialized, since they’ll be written before first access (or
something like that).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">table</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">counter</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">table_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">table</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Overflow already tested above */</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">free</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">);</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// success</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function relies on the overflow test in <code class="language-plaintext highlighter-rouge">calloc()</code> for the second
<code class="language-plaintext highlighter-rouge">malloc()</code> allocation. However, this is a static function that’s
likely to get inlined, as we saw before. If the program doesn’t
actually make use of the <code class="language-plaintext highlighter-rouge">counter</code> table, and Clang is able to
statically determine this fact, it may eliminate the <code class="language-plaintext highlighter-rouge">calloc()</code>. This
would also <strong>eliminate the overflow test, introducing a
vulnerability</strong>. If an attacker can control <code class="language-plaintext highlighter-rouge">n</code>, then they can
overwrite arbitrary memory through that <code class="language-plaintext highlighter-rouge">values</code> pointer.</p>
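<p>One way to harden this (a sketch of my own, not a fix from the real
program) is to stop leaning on <code class="language-plaintext highlighter-rouge">calloc()</code> for the overflow test and
perform it explicitly, where allocation elision cannot take it away:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct table {
    unsigned *counter;
    unsigned *values;
};

static int
table_init(struct table *t, size_t n)
{
    /* Explicit overflow test: even if the compiler later eliminates
     * one of the allocations, this branch remains. */
    if (n > SIZE_MAX / sizeof(unsigned)) {
        return 0; // fail
    }
    t->counter = calloc(n, sizeof(*t->counter));
    if (!t->counter) {
        return 0; // fail
    }
    t->values = malloc(n * sizeof(*t->values));
    if (!t->values) {
        free(t->counter);
        return 0; // fail
    }
    return 1; // success
}
```

<p>The check on <code class="language-plaintext highlighter-rouge">n</code> no longer depends on either allocation surviving
optimization.</p>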

<h3 id="the-takeaway">The takeaway</h3>

<p>Besides this surprising little bug, the main lesson for me is that I
should probably isolate unit tests from the code being tested. The
easiest solution is to put them in separate translation units and don’t
use link-time optimization (LTO). Allowing tested functions to be
inlined into the unit tests is probably a bad idea.</p>
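<p>When separate translation units aren’t convenient, a compiler-specific
attribute can serve a similar purpose. This is only a sketch of an
alternative using a GCC/Clang extension, not the approach described
above:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// noinline (a GCC/Clang extension) keeps the function from being
// folded into its caller, so the allocation below cannot be
// evaluated away at compile time.
__attribute__((noinline))
static void *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p) {
            memset(p, shade, w * h);
        }
    }
    return p;
}
```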

<p>The unit test issues in my <em>real</em> program, which was <a href="https://github.com/skeeto/growable-buf">a bit more
sophisticated</a> than what was presented here, gave me artificial
intelligence vibes. It’s that situation where a computer algorithm did
something really clever and I felt it outsmarted me. It’s creepy to
consider <a href="https://wiki.lesswrong.com/wiki/Paperclip_maximizer">how far that can go</a>. I’ve gotten that even from
observing <a href="/blog/2017/04/27/">AI I’ve written myself</a>, and I know for sure no human
taught it some particularly clever trick.</p>

<p>My favorite AI story along these lines is about <a href="https://www.youtube.com/watch?v=xOCurBYI_gY">an AI that learned
how to play games on the Nintendo Entertainment System</a>. It
didn’t understand the games it was playing. Its optimization task was
simply to choose controller inputs that maximized memory values,
because that’s generally associated with doing well — higher scores,
more progress, etc. The most unexpected part came when playing Tetris.
Eventually the screen would fill up with blocks, and the AI would face
the inevitable situation of losing the game, with all that memory
being reinitialized to low values. So what did it do?</p>

<p>Just before the end it would pause the game and wait… forever.</p>

]]>
    </content>
  </entry>

  <entry>
    <title>A Branchless UTF-8 Decoder</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/10/06/"/>
    <id>urn:uuid:d62a6a1f-0e34-325e-9196-d66a354bc9b1</id>
    <updated>2017-10-06T23:29:02Z</updated>
    <category term="c"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>This week I took a crack at writing a branchless <a href="https://tools.ietf.org/html/rfc3629">UTF-8</a> decoder:
a function that decodes a single UTF-8 code point from a byte stream
without any <code class="language-plaintext highlighter-rouge">if</code> statements, loops, short-circuit operators, or other
sorts of conditional jumps. You can find the source code here along
with a test suite and benchmark:</p>

<ul>
  <li><a href="https://github.com/skeeto/branchless-utf8">https://github.com/skeeto/branchless-utf8</a></li>
</ul>

<p>In addition to decoding the next code point, it detects any errors and
returns a pointer to the next code point. It’s the complete package.</p>

<p>Why branchless? Because high performance CPUs are pipelined. That is,
a single instruction is executed over a series of stages, and many
instructions are executed in overlapping time intervals, each at a
different stage.</p>

<p>The usual analogy is laundry. You can have more than one load of
laundry in process at a time because laundry is typically a pipelined
process. There’s a washing machine stage, dryer stage, and folding
stage. One load can be in the washer, a second in the drier, and a
third being folded, all at once. This greatly increases throughput
because, under ideal circumstances with a full pipeline, an
instruction is completed each clock cycle despite any individual
instruction taking many clock cycles to complete.</p>

<p>Branches are the enemy of pipelines. The CPU can’t begin work on the
next instruction if it doesn’t know which instruction will be executed
next. It must finish computing the branch condition before it can
know. To deal with this, pipelined CPUs are also equipped with <em>branch
predictors</em>. The predictor makes a guess at which branch will be taken
and begins executing instructions along that branch. The prediction is
initially
made using static heuristics, and later those predictions are improved
<a href="http://www.agner.org/optimize/microarchitecture.pdf">by learning from previous behavior</a>. This even includes
predicting the number of iterations of a loop so that the final
iteration isn’t mispredicted.</p>

<p>A mispredicted branch has two dire consequences. First, all the
progress on the incorrect branch will need to be discarded. Second,
the pipeline will be flushed, and the CPU will be inefficient until
the pipeline fills back up with instructions on the correct branch.
With a sufficiently deep pipeline, it can easily be <strong>more efficient
to compute and discard an unneeded result than to avoid computing it
in the first place</strong>. Eliminating branches means eliminating the
hazards of misprediction.</p>
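<p>This is why branchless idioms compute everything and select a result
afterward. A classic instance of the pattern (my own illustration,
unrelated to the decoder internals):</p>

```c
#include <assert.h>
#include <stdint.h>

// Branchless select: the mask is all ones when cond is non-zero and
// all zeros otherwise, so exactly one operand survives the AND/OR.
static uint32_t
select_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = -(uint32_t)!!cond;
    return (a & mask) | (b & ~mask);
}
```

<p>Both operands are always computed; the mask merely decides which one
survives, so there is nothing to mispredict.</p>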

<p>Another hazard for pipelines is <em>dependencies</em>. If an instruction
depends on the result of a previous instruction, it may have to wait for
the previous instruction to make sufficient progress before it can
complete one of its stages. This is known as a <em>pipeline stall</em>, and it
is an important consideration in instruction set architecture (ISA)
design.</p>

<p>For example, on the x86-64 architecture, storing a 32-bit result in a
64-bit register will automatically clear the upper 32 bits of that
register. Any further use of that destination register cannot depend on
prior instructions since all bits have been set. This particular
optimization was missed in the design of the i386: Writing a 16-bit
result to a 32-bit register leaves the upper 16 bits intact, creating
false dependencies.</p>

<p>Dependency hazards are mitigated using <em>out-of-order execution</em>.
Rather than execute two dependent instructions back to back, which
would result in a stall, the CPU may instead execute an independent
instruction from elsewhere in the stream in between. A good compiler will
also try to
spread out dependent instructions in its own instruction scheduling.</p>

<p>The effects of out-of-order execution are typically not visible to a
single thread, where everything will appear to have executed in order.
However, when multiple processes or threads can access the same memory
<a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act/">out-of-order execution can be observed</a>. It’s one of the
many <a href="/blog/2014/09/02/">challenges of writing multi-threaded software</a>.</p>

<p>The focus of my UTF-8 decoder was to be branchless, but there was one
interesting dependency hazard that neither GCC nor Clang were able to
resolve themselves. More on that later.</p>

<h3 id="what-is-utf-8">What is UTF-8?</h3>

<p>Without getting into the history of it, you can generally think of
<a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a> as a method for encoding a series of 21-bit integers
(<em>code points</em>) into a stream of bytes.</p>

<ul>
  <li>
    <p>Smaller integers encode to fewer bytes than larger integers. The
shortest available encoding must be chosen, meaning there is one
canonical encoding for a given sequence of code points.</p>
  </li>
  <li>
    <p>Certain code points are off limits: <em>surrogate halves</em>. These are
code points <code class="language-plaintext highlighter-rouge">U+D800</code> through <code class="language-plaintext highlighter-rouge">U+DFFF</code>. Surrogates are used in UTF-16
to represent code points above U+FFFF and serve no purpose in UTF-8.
This has <a href="https://simonsapin.github.io/wtf-8/">interesting consequences</a> for pseudo-Unicode
strings, such as “wide” strings in the Win32 API, where surrogates may
appear unpaired. Such sequences cannot legally be represented in
UTF-8.</p>
  </li>
</ul>

<p>Keeping in mind these two rules, the entire format is summarized by
this table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>length byte[0]  byte[1]  byte[2]  byte[3]
1      0xxxxxxx
2      110xxxxx 10xxxxxx
3      1110xxxx 10xxxxxx 10xxxxxx
4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">x</code> placeholders are the bits of the encoded code point.</p>
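<p>To make the table concrete, here’s a tiny encoder sketch (my own, not
part of the decoder) that packs a code point into the <code class="language-plaintext highlighter-rouge">x</code> bits of
the appropriate form:</p>

```c
#include <assert.h>
#include <stdint.h>

// Pack a code point into the UTF-8 form from the table: the leading
// byte carries the length prefix, and each continuation byte carries
// six payload bits. Returns bytes written (no validation; a sketch).
static int
utf8_encode(uint32_t c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = c;
        return 1;
    } else if (c < 0x800) {
        out[0] = 0xc0 | (c >>  6);
        out[1] = 0x80 | (c & 0x3f);
        return 2;
    } else if (c < 0x10000) {
        out[0] = 0xe0 | (c >> 12);
        out[1] = 0x80 | ((c >> 6) & 0x3f);
        out[2] = 0x80 | (c & 0x3f);
        return 3;
    }
    out[0] = 0xf0 | (c >> 18);
    out[1] = 0x80 | ((c >> 12) & 0x3f);
    out[2] = 0x80 | ((c >>  6) & 0x3f);
    out[3] = 0x80 | (c & 0x3f);
    return 4;
}
```

<p>For example, U+20AC follows the 3-byte row and comes out as the bytes
0xE2 0x82 0xAC.</p>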

<p>UTF-8 has some really useful properties:</p>

<ul>
  <li>
    <p>It’s backwards compatible with ASCII, which never used the highest
bit.</p>
  </li>
  <li>
    <p>Sort order is preserved. Sorting a set of code point sequences has the
same result as sorting their UTF-8 encodings.</p>
  </li>
  <li>
    <p>No additional zero bytes are introduced. In C we can continue using
null terminated <code class="language-plaintext highlighter-rouge">char</code> buffers, often without even realizing they
hold UTF-8 data.</p>
  </li>
  <li>
    <p>It’s self-synchronizing. A leading byte will never be mistaken for a
continuation byte. This allows for byte-wise substring searches,
meaning UTF-8 unaware functions like <code class="language-plaintext highlighter-rouge">strstr(3)</code> continue to work
without modification (except for normalization issues). It also
allows for unambiguous recovery of a damaged stream.</p>
  </li>
</ul>
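<p>That last property is easy to check: a byte-wise search can only ever
match on a code point boundary. A quick demonstration, using a small
helper of my own:</p>

```c
#include <assert.h>
#include <string.h>

// strstr() knows nothing about UTF-8, yet it finds this multi-byte
// needle ("café", where é is the two bytes 0xc3 0xa9) at the correct
// position: no continuation byte can be mistaken for a leading byte.
static const char *
find_cafe(const char *haystack)
{
    return strstr(haystack, "caf\xc3\xa9");
}
```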

<p>A straightforward approach to decoding might look something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">utf8_simple</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mh">0x80</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xe0</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x1f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xf0</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xe0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x0f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">3</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xf8</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xf0</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mh">0xf4</span><span class="p">))</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">18</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">4</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// invalid</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// skip this byte</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span> <span class="o">&gt;=</span> <span class="mh">0xd800</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">c</span> <span class="o">&lt;=</span> <span class="mh">0xdfff</span><span class="p">)</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// surrogate half</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It branches off on the highest bits of the leading byte, extracts all of
those <code class="language-plaintext highlighter-rouge">x</code> bits from each byte, concatenates those bits, checks if it’s a
surrogate half, and returns a pointer to the next character. (This
implementation does <em>not</em> check that the highest two bits of each
continuation byte are correct.)</p>
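<p>A few spot checks of the cases it handles, repeating the function
verbatim so the snippet stands alone:</p>

```c
#include <assert.h>

// utf8_simple() copied as-is from the article so this compiles alone.
unsigned char *
utf8_simple(unsigned char *s, long *c)
{
    unsigned char *next;
    if (s[0] < 0x80) {
        *c = s[0];
        next = s + 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        *c = ((long)(s[0] & 0x1f) <<  6) |
             ((long)(s[1] & 0x3f) <<  0);
        next = s + 2;
    } else if ((s[0] & 0xf0) == 0xe0) {
        *c = ((long)(s[0] & 0x0f) << 12) |
             ((long)(s[1] & 0x3f) <<  6) |
             ((long)(s[2] & 0x3f) <<  0);
        next = s + 3;
    } else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4)) {
        *c = ((long)(s[0] & 0x07) << 18) |
             ((long)(s[1] & 0x3f) << 12) |
             ((long)(s[2] & 0x3f) <<  6) |
             ((long)(s[3] & 0x3f) <<  0);
        next = s + 4;
    } else {
        *c = -1; // invalid
        next = s + 1; // skip this byte
    }
    if (*c >= 0xd800 && *c <= 0xdfff)
        *c = -1; // surrogate half
    return next;
}
```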

<p>The CPU must correctly predict the length of the code point or else it
will suffer a hazard. An incorrect guess will stall the pipeline and
slow down decoding.</p>

<p>In real world text this is probably not a serious issue. For the
English language, the encoded length is nearly always a single byte.
However, even for non-English languages, text is <a href="http://utf8everywhere.org/">usually accompanied
by markup from the ASCII range of characters</a>, and, overall,
the encoded lengths will still have consistency. As I said, the CPU
predicts branches based on the program’s previous behavior, so this
means it will temporarily learn some of the statistical properties of
the language being actively decoded. Pretty cool, eh?</p>

<p>Eliminating branches from the decoder side-steps any issues with
mispredicting encoded lengths. Only errors in the stream will cause
stalls. Since that’s probably the unusual case, the branch predictor
will be very successful by continually predicting success. That’s one
optimistic CPU.</p>

<h3 id="the-branchless-decoder">The branchless decoder</h3>

<p>Here’s the interface to my branchless decoder:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">utf8_decode</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">e</span><span class="p">);</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">void *</code> for the buffer so that it doesn’t care what type was
actually chosen to represent the buffer. It could be a <code class="language-plaintext highlighter-rouge">uint8_t</code>,
<code class="language-plaintext highlighter-rouge">char</code>, <code class="language-plaintext highlighter-rouge">unsigned char</code>, etc. Doesn’t matter. The decoder accesses it
only as bytes.</p>

<p>On the other hand, with this interface you’re forced to use <code class="language-plaintext highlighter-rouge">uint32_t</code>
to represent code points. You could always change the function to suit
your own needs, though.</p>

<p>Errors are returned in <code class="language-plaintext highlighter-rouge">e</code>. It’s zero for success and non-zero when an
error was detected, without any particular meaning for different values.
Error conditions are mixed into this integer, so a zero simply means the
absence of error.</p>

<p>This is where you could accuse me of “cheating” a little bit. The
caller probably wants to check for errors, and so <em>they</em> will have to
branch on <code class="language-plaintext highlighter-rouge">e</code>. It seems I’ve just smuggled the branches outside of the
decoder.</p>

<p>However, as I pointed out, unless you’re expecting lots of errors, the
real cost is branching on encoded lengths. Furthermore, the caller
could instead accumulate the errors: count them, or make the error
“sticky” by ORing all <code class="language-plaintext highlighter-rouge">e</code> values together. Neither of these require a
branch. The caller could decode a huge stream and only check for
errors at the very end. The only branch would be the main loop (“are
we done yet?”), which is trivial to predict with high accuracy.</p>
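<p>The sticky-error loop looks like this. The decoder here is a
deliberately dumb one-byte stand-in of my own, <em>not</em> the real
<code class="language-plaintext highlighter-rouge">utf8_decode()</code>, just to keep the sketch self-contained:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

// Toy stand-in decoder: consumes one byte and flags anything
// non-ASCII as an error. It exists only to drive the loop below.
static void *
toy_decode(void *buf, uint32_t *c, int *e)
{
    unsigned char *s = buf;
    *c = s[0];
    *e = s[0] >> 7; // non-zero for any non-ASCII byte
    return s + 1;
}

// Decode an entire buffer, ORing the error flags together, and
// check for errors exactly once at the end.
static int
decode_all(unsigned char *buf, size_t len)
{
    uint32_t c;
    int e, err = 0;
    unsigned char *p = buf, *end = buf + len;
    while (p < end) {
        p = toy_decode(p, &c, &e);
        err |= e; // sticky: once set, it stays set; no branch needed
    }
    return !err; // 1 on success
}
```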

<p>The first thing the function does is extract the encoded length of the
next code point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="n">lengths</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
        <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span>
    <span class="p">};</span>

    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">lengths</span><span class="p">[</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">3</span><span class="p">];</span>
</code></pre></div></div>

<p>Looking back to the UTF-8 table above, only the highest 5 bits determine
the length. That’s 32 possible values. The zeros are for invalid
prefixes. This will later cause a bit to be set in <code class="language-plaintext highlighter-rouge">e</code>.</p>
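<p>A few spot checks of the table against sample leading bytes:</p>

```c
#include <assert.h>

// The length table from above, indexed by the top five bits of the
// leading byte.
static const char lengths[] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
};
```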

<p>With the length in hand, it can compute the position of the next code
point in the buffer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="n">len</span> <span class="o">+</span> <span class="o">!</span><span class="n">len</span><span class="p">;</span>
</code></pre></div></div>

<p>Originally this expression was the return value, computed at the very
end of the function. However, after inspecting the compiler’s assembly
output, I decided to move it up, and the result was a solid performance
boost. That’s because it spreads out dependent instructions. With the
address of the next code point known so early, <a href="https://www.youtube.com/watch?v=2EWejmkKlxs">the instructions that
decode the next code point can get started early</a>.</p>

<p>The reason for the <code class="language-plaintext highlighter-rouge">!len</code> is so that the pointer is advanced one byte
even in the face of an error (length of zero). Adding that <code class="language-plaintext highlighter-rouge">!len</code> is
actually somewhat costly, though I couldn’t figure out why.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">shiftc</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="o">*</span><span class="n">c</span>  <span class="o">=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">masks</span><span class="p">[</span><span class="n">len</span><span class="p">])</span> <span class="o">&lt;&lt;</span> <span class="mi">18</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">&gt;&gt;=</span> <span class="n">shiftc</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
</code></pre></div></div>

<p>This reads four bytes regardless of the actual length. Avoiding the
extra reads would itself require a branch, so it can’t be helped. The
unneeded bits are shifted out based on the length. That’s all it takes
to decode UTF-8 without branching.</p>

<p>One important consequence of always reading four bytes is that <strong>the
caller <em>must</em> zero-pad the buffer to at least four bytes</strong>. In practice,
this means padding the entire buffer with three bytes in case the last
character is a single byte.</p>

<p>The padding must be zero in order to detect errors. Otherwise the
padding might look like legal continuation bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">mins</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">4194304</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">2048</span><span class="p">,</span> <span class="mi">65536</span><span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">shifte</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="o">*</span><span class="n">e</span>  <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span> <span class="o">&lt;</span> <span class="n">mins</span><span class="p">[</span><span class="n">len</span><span class="p">])</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">((</span><span class="o">*</span><span class="n">c</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x1b</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">7</span><span class="p">;</span>  <span class="c1">// surrogate half?</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>       <span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">^=</span> <span class="mh">0x2a</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">&gt;&gt;=</span> <span class="n">shifte</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
</code></pre></div></div>

<p>The first line checks if the shortest encoding was used, setting a bit
in <code class="language-plaintext highlighter-rouge">e</code> if it wasn’t. For a length of 0, this always fails.</p>

<p>The second line checks for a surrogate half by checking for a certain
prefix.</p>

<p>The next three lines accumulate the highest two bits of each
continuation byte into <code class="language-plaintext highlighter-rouge">e</code>. Each should be the bits <code class="language-plaintext highlighter-rouge">10</code>. These bits are
“compared” to <code class="language-plaintext highlighter-rouge">101010</code> (<code class="language-plaintext highlighter-rouge">0x2a</code>) using XOR. The XOR clears these bits as
long as they exactly match.</p>

<p><img src="/img/diagram/utf8-bits.svg" alt="" /></p>

<p>Finally the continuation prefix bits that don’t matter are shifted out.</p>

<h3 id="the-goal">The goal</h3>

<p>My primary — and totally arbitrary — goal was to beat the performance of
<a href="http://bjoern.hoehrmann.de/utf-8/decoder/dfa/">Björn Höhrmann’s DFA-based decoder</a>. Under favorable (and
artificial) benchmark conditions I had moderate success. You can try it
out on your own system by cloning the repository and running <code class="language-plaintext highlighter-rouge">make
bench</code>.</p>

<p>With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the
DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.</p>

<p><em>Update</em>: <a href="https://github.com/skeeto/branchless-utf8/issues/1">Björn pointed out</a> that his site includes a faster
variant of his DFA decoder. It is only 10% slower than the branchless
decoder with GCC, and it’s 20% faster than the branchless decoder with
Clang. So, in a sense, it’s still faster on average, even on a
benchmark that favors a branchless decoder.</p>

<p>The benchmark operates very similarly to <a href="/blog/2017/09/21/">my PRNG shootout</a> (e.g.
<code class="language-plaintext highlighter-rouge">alarm(2)</code>). First a buffer is filled with random UTF-8 data, then the
decoder decodes it again and again until the alarm fires. The
measurement is the number of bytes decoded.</p>

<p>The number of errors is printed at the end (always 0) in order to force
errors to actually get checked for each code point. Otherwise the sneaky
compiler omits the error checking from the branchless decoder, making it
appear much faster than it really is — a serious letdown once I noticed
my error. Since the other decoder is a DFA and error checking is built
into its graph, the compiler can’t really omit its error checking.</p>

<p>I called this “favorable” because the buffer being decoded isn’t
anything natural. Each time a code point is generated, first a length is
chosen uniformly: 1, 2, 3, or 4. Then a code point that encodes to that
length is generated. The <strong>even distribution of lengths greatly favors a
branchless decoder</strong>. The random distribution inhibits branch
prediction. Real text has a far more favorable distribution.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">randchar</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">rand32</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">r</span> <span class="o">&amp;</span> <span class="mh">0x3</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">&gt;&gt;=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">r</span> <span class="o">%</span> <span class="mi">128</span><span class="p">;</span>
        <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">128</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">2048</span> <span class="o">-</span> <span class="mi">128</span><span class="p">);</span>
        <span class="k">case</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">2048</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">65536</span> <span class="o">-</span> <span class="mi">2048</span><span class="p">);</span>
        <span class="k">case</span> <span class="mi">4</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">65536</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">131072</span> <span class="o">-</span> <span class="mi">65536</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">abort</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Given the odd input zero-padding requirement and the artificial
parameters of the benchmark, despite the supposed 20% speed boost
under GCC, my branchless decoder is not really any better than the DFA
decoder in practice. It’s just a different approach. In practice I’d
prefer Björn’s DFA decoder.</p>

<p><em>Update</em>: Bryan Donlan has followed up with <a href="https://github.com/bdonlan/branchless-utf8/commit/3802d3b0e10ea16810dd40f8116243971ff7603d">a SIMD UTF-8 decoder</a>.</p>

<p><em>Update 2024</em>: NRK has followed up with <a href="https://nrk.neocities.org/articles/utf8-pext.html">parallel extract decoder</a>.</p>

<p><em>Update 2025</em>: Charles Eckman followed up <a href="https://cceckman.com/writing/branchless-utf8-encoding/">sharing a branchless
encoder</a>, which inspired me to <a href="https://github.com/skeeto/scratch/blob/master/misc/utf8_branchless.c">give it a shot</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Finding the Best 64-bit Simulation PRNG</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/09/21/"/>
    <id>urn:uuid:637af55f-6e33-31e5-25fa-edb590a16d44</id>
    <updated>2017-09-21T21:25:00Z</updated>
    <category term="c"/><category term="compsci"/><category term="x86"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>August 2018 Update</strong>: <em>xoroshiro128+ fails <a href="http://pracrand.sourceforge.net/">PractRand</a> very
badly. Since this article was published, its authors have supplanted it
with <strong>xoshiro256**</strong>. It has essentially the same performance, but
better statistical properties. xoshiro256** is now my preferred PRNG.</em></p>

<p>I use pseudo-random number generators (PRNGs) a whole lot. They’re an
essential component in lots of algorithms and processes.</p>

<ul>
  <li>
    <p><strong>Monte Carlo simulations</strong>, where PRNGs are used to <a href="https://possiblywrong.wordpress.com/2015/09/15/kanoodle-iq-fit-and-dancing-links/">compute
numeric estimates</a> for problems that are difficult or impossible
to solve analytically.</p>
  </li>
  <li>
    <p><a href="/blog/2017/04/27/"><strong>Monte Carlo tree search AI</strong></a>, where massive numbers of games
are played out randomly in search of an optimal move. This is a
specific application of the last item.</p>
  </li>
  <li>
    <p><a href="https://github.com/skeeto/carpet-fractal-genetics"><strong>Genetic algorithms</strong></a>, where a PRNG creates the initial
population, and then later guides in mutation and breeding of selected
solutions.</p>
  </li>
  <li>
    <p><a href="https://blog.cr.yp.to/20140205-entropy.html"><strong>Cryptography</strong></a>, where cryptographically-secure PRNGs
(CSPRNGs) produce output that is predictable for recipients who know
a particular secret, but not for anyone else. This article is only
concerned with plain PRNGs.</p>
  </li>
</ul>

<p>For the first three “simulation” uses, there are two primary factors
that drive the selection of a PRNG. These factors can be at odds with
each other:</p>

<ol>
  <li>
    <p>The PRNG should be <em>very</em> fast. The application should spend its
time running the actual algorithms, not generating random numbers.</p>
  </li>
  <li>
    <p>PRNG output should have robust statistical qualities. Bits should
appear to be independent and the output should closely follow the
desired distribution. Poor quality output will negatively affect
the algorithms using it. Also just as important is <a href="http://mumble.net/~campbell/2014/04/28/uniform-random-float">how you use
it</a>, but this article will focus only on generating bits.</p>
  </li>
</ol>

<p>In other situations, such as in cryptography or online gambling,
another important property is that an observer can’t learn anything
meaningful about the PRNG’s internal state from its output. For the
three simulation cases I care about, this is not a concern. Only speed
and quality properties matter.</p>

<p>Depending on the programming language, the PRNGs found in various
standard libraries may be of dubious quality. They’re slower than they
need to be, or have poorer quality than required. In some cases, such
as <code class="language-plaintext highlighter-rouge">rand()</code> in C, the algorithm isn’t specified, and you can’t rely on
it for anything outside of trivial examples. In other cases the
algorithm and behavior <em>is</em> specified, but you could easily do better
yourself.</p>

<p>My preference is to BYOPRNG: <em>Bring Your Own Pseudo-random Number
Generator</em>. You get reliable, identical output everywhere. Also, in
the case of C and C++ — and if you do it right — by embedding the PRNG
in your project, it will get inlined and unrolled, making it far more
efficient than a <a href="/blog/2016/10/27/">slow call into a dynamic library</a>.</p>

<p>A fast PRNG is going to be small, making it a great candidate for
embedding as, say, a header library. That leaves just one important
question, “Can the PRNG be small <em>and</em> have high quality output?” In
the 21st century, the answer to this question is an emphatic “yes!”</p>

<p>For the past few years my main go-to for a drop-in PRNG has been
<a href="https://en.wikipedia.org/wiki/Xorshift">xorshift*</a>. The body of the function is 6 lines of C, and its
entire state is a 64-bit integer, directly seeded. However, there are a
number of choices here, including other variants of Xorshift. How do I
know which one is best? The only way to know is to test it, hence my
64-bit PRNG shootout:</p>

<ul>
  <li><a href="https://github.com/skeeto/prng64-shootout"><strong>64-bit PRNG Shootout</strong></a></li>
</ul>

<p>Sure, there <a href="http://xoroshiro.di.unimi.it/">are other such shootouts</a>, but they’re all missing
something I want to measure. I also want to test in an environment very
close to how I’d use these PRNGs myself.</p>

<h3 id="shootout-results">Shootout results</h3>

<p>Before getting into the details of the benchmark and each generator,
here are the results. These tests were run on an i7-6700 (Skylake)
running Linux 4.9.0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                               Speed (MB/s)
PRNG           FAIL  WEAK  gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline          X     X      15000       13100
blowfishcbc16     0     1        169         157
blowfishcbc4      0     5        725         676
blowfishctr16     1     3        187         184
blowfishctr4      1     5        890        1000
mt64              1     7       1700        1970
pcg64             0     4       4150        3290
rc4               0     5        366         185
spcg64            0     8       5140        4960
xoroshiro128+     0     6       8100        7720
xorshift128+      0     2       7660        6530
xorshift64*       0     3       4990        5060
</code></pre></div></div>

<p><strong>The clear winner is <a href="http://xoroshiro.di.unimi.it/">xoroshiro128+</a></strong>, with a function body of
just 7 lines of C. It’s clearly the fastest, and the output had no
observed statistical failures. However, that’s not the whole story. A
couple of the other PRNGs have advantages that situationally make
them better suited than xoroshiro128+. I’ll go over these in the
discussion below.</p>

<p>These two versions of GCC and Clang were chosen because these are the
latest available in Debian 9 “Stretch.” It’s easy to build and run the
benchmark yourself if you want to try a different version.</p>

<h3 id="speed-benchmark">Speed benchmark</h3>

<p>In the speed benchmark, the PRNG is initialized, a 1-second <code class="language-plaintext highlighter-rouge">alarm(1)</code>
is set, then the PRNG fills a large <code class="language-plaintext highlighter-rouge">volatile</code> buffer of 64-bit unsigned
integers again and again as quickly as possible until the alarm fires.
The amount of memory written is measured as the PRNG’s speed.</p>

<p>The baseline “PRNG” writes zeros into the buffer. This represents the
absolute speed limit that no PRNG can exceed.</p>

<p>The purpose for making the buffer <code class="language-plaintext highlighter-rouge">volatile</code> is to force the entire
output to actually be “consumed” as far as the compiler is concerned.
Otherwise the compiler plays nasty tricks to make the program do as
little work as possible. Another way to deal with this would be to
<code class="language-plaintext highlighter-rouge">write(2)</code> the buffer, but of course I didn’t want to introduce
unnecessary I/O into a benchmark.</p>

<p>On Linux, SIGALRM was impressively consistent between runs, meaning it
was perfectly suitable for this benchmark. To account for any process
scheduling wonkiness, the benchmark was run 8 times and only the
fastest time was kept.</p>

<p>The SIGALRM handler sets a <code class="language-plaintext highlighter-rouge">volatile</code> global variable that tells the
generator to stop. The PRNG call was unrolled 8 times to keep the
alarm check from significantly impacting the benchmark. You can see
the effect for yourself by changing <code class="language-plaintext highlighter-rouge">UNROLL</code> to 1 (i.e. “don’t
unroll”) in the code. Unrolling beyond 8 times had no measurable
effect in my tests.</p>

<p>Due to the PRNGs being inlined, this unrolling makes the benchmark
less realistic, and it shows in the results. Using <code class="language-plaintext highlighter-rouge">volatile</code> for the
buffer helped to counter this effect and reground the results. This is
a fuzzy problem, and there’s not really any way to avoid it, but I
will also discuss this below.</p>

<h3 id="statistical-benchmark">Statistical benchmark</h3>

<p>To measure the statistical quality of each PRNG — mostly as a sanity
check — the raw binary output was run through <a href="http://webhome.phy.duke.edu/~rgb/General/dieharder.php">dieharder</a> 3.31.1:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prng | dieharder -g200 -a -m4
</code></pre></div></div>

<p>This statistical analysis has no timing characteristics and the
results should be the same everywhere. You would only need to re-run
it to test with a different version of dieharder, or a different
analysis tool.</p>

<p>There’s not much information to glean from this part of the shootout.
It mostly confirms that all of these PRNGs would work fine for
simulation purposes. The WEAK results are not very significant and are
only useful for breaking ties. Even a true RNG will get some WEAK
results. For example, the <a href="https://en.wikipedia.org/wiki/RdRand">x86 RDRAND</a> instruction (not
included in actual shootout) got 7 WEAK results in my tests.</p>

<p>The FAIL results are more significant, but a single failure doesn’t
mean much. A non-failing PRNG should be preferred to an otherwise
equal PRNG with a failure.</p>

<h3 id="individual-prngs">Individual PRNGs</h3>

<p>Admittedly the definition for “64-bit PRNG” is rather vague. My high
performance targets are all 64-bit platforms, so the highest PRNG
throughput will be built on 64-bit operations (<a href="/blog/2015/07/10/">if not wider</a>).
The original plan was to focus on PRNGs built from 64-bit operations.</p>

<p>Curiosity got the best of me, so I included some PRNGs that don’t use
<em>any</em> 64-bit operations. I just wanted to see how they stacked up.</p>

<h4 id="blowfish">Blowfish</h4>

<p>One of the reasons I <a href="/blog/2017/09/15/">wrote a Blowfish implementation</a> was to
evaluate its performance and statistical qualities, so naturally I
included it in the benchmark. It only uses 32-bit addition and 32-bit
XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit
integer. There are two different properties that combine to make four
variants in the benchmark: number of rounds and block mode.</p>

<p>Blowfish normally uses 16 rounds. This makes it a lot slower than a
non-cryptographic PRNG but gives it a <em>security margin</em>. I don’t care
about the security margin, so I included a 4-round variant. As
expected, it’s about four times faster.</p>

<p>The other feature I tested is the block mode: <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#CBC">Cipher Block
Chaining</a> (CBC) versus <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29">Counter</a> (CTR) mode. In CBC mode it
encrypts zeros as plaintext. This just means it’s encrypting its last
output. The ciphertext is the PRNG’s output.</p>

<p>In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster
than CBC in the 16-round variant and 23% faster in the 4-round variant.
The reason is simple, and it’s in part an artifact of unrolling the
generation loop in the benchmark.</p>

<p>In CBC mode, each output depends on the previous, but in CTR mode all
blocks are independent. Work can begin on the next output before the
previous output is complete. The x86 architecture uses out-of-order
execution to achieve many of its performance gains: Instructions may
be executed in a different order than they appear in the program,
though their observable effects must <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act/">generally be ordered
correctly</a>. Breaking dependencies between instructions allows
out-of-order execution to be fully exercised. It also gives the
compiler more freedom in instruction scheduling, though the <code class="language-plaintext highlighter-rouge">volatile</code>
accesses cannot be reordered with respect to each other (hence it
helping to reground the benchmark).</p>

<p>Statistically, the 4-round cipher was not significantly worse than the
16-round cipher. For simulation purposes the 4-round cipher would be
perfectly sufficient, though xoroshiro128+ is still more than 9 times
faster without sacrificing quality.</p>

<p>On the other hand, CTR mode had a single failure in both the 4-round
(dab_filltree2) and 16-round (dab_filltree) variants. At least for
Blowfish, is there something that makes CTR mode less suitable than CBC
mode as a PRNG?</p>

<p>In the end Blowfish is too slow and too complicated to serve as a
simulation PRNG. This was entirely expected, but it’s interesting to see
how it stacks up.</p>

<h4 id="mersenne-twister-mt19937-64">Mersenne Twister (MT19937-64)</h4>

<p>Nobody ever got fired for choosing <a href="https://en.wikipedia.org/wiki/Mersenne_Twister">Mersenne Twister</a>. It’s the
classical choice for simulations, and is still usually recommended to
this day. However, Mersenne Twister’s best days are behind it. I
tested the 64-bit variant, MT19937-64, and there are four problems:</p>

<ul>
  <li>
    <p>It’s between 1/4 and 1/5 the speed of xoroshiro128+.</p>
  </li>
  <li>
    <p>It’s got a large state: 2,500 bytes. Versus xoroshiro128+’s 16 bytes.</p>
  </li>
  <li>
    <p>Its implementation is three times bigger than xoroshiro128+, and much
more complicated.</p>
  </li>
  <li>
    <p>It had one statistical failure (dab_filltree2).</p>
  </li>
</ul>

<p>Curiously my implementation is 16% faster with Clang than GCC. Since
Mersenne Twister isn’t seriously in the running, I didn’t take time to
dig into why.</p>

<p>Ultimately I would never choose Mersenne Twister for anything anymore.
This was also not surprising.</p>

<h4 id="permuted-congruential-generator-pcg">Permuted Congruential Generator (PCG)</h4>

<p>The <a href="http://www.pcg-random.org/">Permuted Congruential Generator</a> (PCG) has some really
interesting history behind it, particularly with its somewhat <a href="http://www.pcg-random.org/paper.html">unusual
paper</a>, controversial for both its excessive length (58 pages)
and informal style. It’s in close competition with Xorshift and
xoroshiro128+. I was really interested in seeing how it stacked up.</p>

<p>PCG is really just a Linear Congruential Generator (LCG) that doesn’t
output the lowest bits (too poor quality), and has an extra
permutation step to make up for the LCG’s other weaknesses. I included
two variants in my benchmark: the official PCG and a “simplified” PCG
(sPCG) with a simple permutation step. sPCG is just the first PCG
presented in the paper (34 pages in!).</p>

<p>Here’s essentially what the simplified version looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">spcg32</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="n">shift</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The third line with the modular multiplication and addition is the
LCG. The bit shift is the permutation. This PCG uses the most
significant three bits of the result to determine which 32 bits to
output. That’s <em>the</em> novel component of PCG.</p>

<p>The two constants are entirely of my own devising: two 64-bit primes
generated using Emacs’ <code class="language-plaintext highlighter-rouge">M-x calc</code>: <code class="language-plaintext highlighter-rouge">2 64 ^ k r k n k p k p k p</code>.</p>

<p>Heck, that’s so simple that I could easily memorize this and code it
from scratch on demand. Key takeaway: This is <strong>one way that PCG is
situationally better than xoroshiro128+</strong>. In a pinch I could use Emacs
to generate a couple of primes and code the rest from memory. If you
participate in coding competitions, take note.</p>

<p>However, you probably also noticed PCG only generates 32-bit integers
despite using 64-bit operations. To properly generate a 64-bit value
we’d need 128-bit operations, which would need to be implemented in
software.</p>

<p>Instead, I doubled up on everything to run two PRNGs in parallel.
Despite the doubling in state size, the period doesn’t get any larger
since the PRNGs don’t interact with each other. We get something in
return, though. Remember what I said about out-of-order execution?
Except for the last step combining their results, since the two PRNGs
are independent, doubling up shouldn’t <em>quite</em> halve the performance,
particularly with the benchmark loop unrolling business.</p>

<p>Here’s my doubled-up version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">spcg64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span>  <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a0</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a1</span> <span class="o">=</span> <span class="mh">0x8b260b70b8e98891</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">p0</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">p1</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">p0</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a0</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p1</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">r0</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="n">p0</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">r1</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="n">p1</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="kt">uint64_t</span> <span class="n">high</span> <span class="o">=</span> <span class="n">p0</span> <span class="o">&gt;&gt;</span> <span class="n">r0</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">low</span>  <span class="o">=</span> <span class="n">p1</span> <span class="o">&gt;&gt;</span> <span class="n">r1</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">high</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span><span class="p">)</span> <span class="o">|</span> <span class="n">low</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The “full” PCG has some extra shifts that make it 25% (GCC) to 50%
(Clang) slower than the “simplified” PCG, but it does halve the WEAK
results.</p>

<p>In this 64-bit form, both are significantly slower than xoroshiro128+.
However, if you find yourself only needing 32 bits at a time (always
throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is
faster than using xoroshiro128+ and throwing away half its output.</p>

<h4 id="rc4">RC4</h4>

<p>This is another CSPRNG where I was curious how it would stack up. It
only uses 8-bit operations, and it generates a 64-bit integer one byte
at a time. It’s the slowest after 16-round Blowfish and generally not
useful as a simulation PRNG.</p>

<h4 id="xoroshiro128">xoroshiro128+</h4>

<p>xoroshiro128+ is the obvious winner in this benchmark and it seems to be
the best 64-bit simulation PRNG available. If you need a fast, quality
PRNG, just drop these 11 lines into your C or C++ program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xoroshiro128plus</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">s0</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">s1</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">result</span> <span class="o">=</span> <span class="n">s0</span> <span class="o">+</span> <span class="n">s1</span><span class="p">;</span>
    <span class="n">s1</span> <span class="o">^=</span> <span class="n">s0</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">s0</span> <span class="o">&lt;&lt;</span> <span class="mi">55</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">s0</span> <span class="o">&gt;&gt;</span> <span class="mi">9</span><span class="p">))</span> <span class="o">^</span> <span class="n">s1</span> <span class="o">^</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&lt;&lt;</span> <span class="mi">14</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&lt;&lt;</span> <span class="mi">36</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&gt;&gt;</span> <span class="mi">28</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s one important caveat: <strong>That 16-byte state must be
well-seeded.</strong> Having lots of zero bytes will lead to <em>terrible</em>
initial output until the generator mixes it all up. Having all zero
bytes will completely break the generator. If you’re going to seed from,
say, the unix epoch, then XOR it with 16 static random bytes.</p>
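
<p>A minimal sketch of that last suggestion — the function name and both
constants here are arbitrary choices of mine, not part of the generator:</p>

```c
#include <stdint.h>
#include <time.h>

/* Seed the 16-byte xoroshiro128+ state from the unix epoch, XORed with
 * static random bytes so the state never starts out mostly zero. Any
 * fixed random constants will do. */
void xoroshiro128plus_seed(uint64_t s[2])
{
    uint64_t t = (uint64_t)time(0);
    s[0] = t ^ UINT64_C(0x9e3779b97f4a7c15);
    s[1] = t ^ UINT64_C(0xd1b54a32d192ed03);
}
```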

<h4 id="xorshift128-and-xorshift64">xorshift128+ and xorshift64*</h4>

<p>These generators are closely related and, like I said, xorshift64* was
what I used for years. Looks like it’s time to retire it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xorshift64star</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">25</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x2545f4914f6cdd1d</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will
tolerate weak seeding so long as it’s not literally zero. Zero will also
break this generator.</p>

<p>If it weren’t for xoroshiro128+, then xorshift128+ would have been the
winner of the benchmark and my new favorite choice.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xorshift128plus</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">y</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">23</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">^</span> <span class="n">y</span> <span class="o">^</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">y</span> <span class="o">&gt;&gt;</span> <span class="mi">26</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a lot like xoroshiro128+, including the need to be well-seeded,
but it’s just slow enough to lose out. There’s no reason to use
xorshift128+ instead of xoroshiro128+.</p>

<h3 id="conclusion">Conclusion</h3>

<p>My own takeaway (until I re-evaluate some years in the future):</p>

<ul>
  <li>The best 64-bit simulation PRNG is xoroshiro128+.</li>
  <li>“Simplified” PCG can be useful in a pinch.</li>
  <li>When only 32-bit integers are necessary, use PCG.</li>
</ul>

<p>Things can change significantly between platforms, though. Here’s the
shootout on an ARM Cortex-A53:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                    Speed (MB/s)
PRNG         gcc-5.4.0   clang-3.8.0
------------------------------------
baseline          2560        2400
blowfishcbc16       36.5        45.4
blowfishcbc4       135         173
blowfishctr16       36.4        45.2
blowfishctr4       133         168
mt64               207         254
pcg64              980         712
rc4                 96.6        44.0
spcg64            1021         948
xoroshiro128+     2560        1570
xorshift128+      2560        1520
xorshift64*       1360        1080
</code></pre></div></div>

<p>LLVM is not as mature on this platform, but, with GCC, both
xoroshiro128+ and xorshift128+ matched the baseline! It seems memory
is the bottleneck.</p>

<p>So don’t necessarily take my word for it. You can run this shootout in
your own environment — perhaps even tossing in more PRNGs — to find
what’s appropriate for your own situation.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>C Closures as a Library</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/01/08/"/>
    <id>urn:uuid:a5f897bc-0510-3164-a949-fcb848d9279b</id>
    <updated>2017-01-08T22:45:38Z</updated>
    <category term="c"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>A common idiom in C is the callback function pointer, either to
deliver information (i.e. a <em>visitor</em> or <em>handler</em>) or to customize
the function’s behavior (e.g. a comparator). Examples of the latter in
the C standard library are <code class="language-plaintext highlighter-rouge">qsort()</code> and <code class="language-plaintext highlighter-rouge">bsearch()</code>, each requiring a
comparator function in order to operate on arbitrary types.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">qsort</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
           <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">));</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">bsearch</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span>
              <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
              <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">));</span>
</code></pre></div></div>

<p>A problem with these functions is that there’s no way to pass context
to the callback. The callback may need information beyond the two
element pointers when making its decision, or to <a href="/blog/2016/09/05/">update a
result</a>. For example, suppose I have a structure representing a
two-dimensional coordinate, and a coordinate distance function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">coord</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">y</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">float</span>
<span class="nf">distance</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dx</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">x</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">x</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">dy</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">y</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">y</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">dx</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">+</span> <span class="n">dy</span> <span class="o">*</span> <span class="n">dy</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I have an array of coordinates and I want to sort them based on
their distance from some target, the comparator needs to know the
target. However, the <code class="language-plaintext highlighter-rouge">qsort()</code> interface has no way to directly pass
this information. Instead it has to be passed by another means, such
as a global variable.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">coord_cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dist_a</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">dist_b</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&lt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&gt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And its usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">size_t</span> <span class="n">ncoords</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">coords</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">coord</span> <span class="n">current_target</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="c1">// ...</span>
    <span class="n">target</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">;</span>
    <span class="nf">qsort</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">coord_cmp</span><span class="p">);</span>
</code></pre></div></div>

<p>Potential problems are that it’s neither thread-safe nor re-entrant.
Two different threads cannot use this comparator <a href="/blog/2014/10/12/">at the same
time</a>. Also, on some platforms and configurations, repeatedly
accessing a global variable in a comparator <a href="/blog/2016/12/23/">may have a significant
cost</a>. A common workaround for thread safety is to make the
global variable thread-local by allocating it in thread-local storage
(TLS):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">_Thread_local</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>       <span class="c1">// C11</span>
<span class="kr">__thread</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>            <span class="c1">// GCC and Clang</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="kr">thread</span><span class="p">)</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>  <span class="c1">// Visual Studio</span>
</code></pre></div></div>

<p>This makes the comparator thread-safe. However, it’s still not
re-entrant (usually unimportant) and accessing thread-local variables
on some platforms is even more expensive — which is the situation for
Pthreads TLS, though not a problem for native x86-64 TLS.</p>

<p>Modern libraries usually provide some sort of “user data” pointer — a
generic pointer that is passed to the callback function as an
additional argument. For example, the GNU C Library has long had
<code class="language-plaintext highlighter-rouge">qsort_r()</code>: <em>re-entrant</em> qsort.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">qsort_r</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
           <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">),</span>
           <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>The new comparator looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">coord_cmp_r</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dist_a</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">dist_b</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&lt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&gt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And its usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">;</span>
    <span class="n">qsort_r</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">coord_cmp_r</span><span class="p">,</span> <span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>User data arguments are thread-safe, re-entrant, performant, and
perfectly portable. They completely and cleanly solve the entire
problem with virtually no drawbacks. If every library did this, there
would be nothing left to discuss and this article would be boring.</p>

<h3 id="the-closure-solution">The closure solution</h3>

<p>In order to make things more interesting, suppose you’re stuck calling
a function in some old library that takes a callback but doesn’t
support a user data argument. A global variable is insufficient, and
the thread-local storage solution isn’t viable for one reason or
another. What do you do?</p>

<p>The core problem is that a function pointer is just an address, and
it’s the same address no matter the context for any particular
callback. On any particular call, the callback has three ways to
distinguish this call from other calls. These align with the three
solutions above:</p>

<ol>
  <li>Inspect some global state: the <strong>global variable solution</strong>. The
caller will change this state for some other calls.</li>
  <li>Query its unique thread ID: the <strong>thread-local storage solution</strong>.
Calls on different threads will have different thread IDs.</li>
  <li>Examine a context argument: the <strong>user pointer solution</strong>.</li>
</ol>

<p>A wholly different approach is to <strong>use a unique function pointer for
each callback</strong>. The callback could then inspect its own address to
differentiate itself from other callbacks. Imagine defining multiple
instances of <code class="language-plaintext highlighter-rouge">coord_cmp</code> each getting their context from a different
global variable. Using a unique copy of <code class="language-plaintext highlighter-rouge">coord_cmp</code> on each thread for
each usage would be both re-entrant and thread-safe, and wouldn’t
require TLS.</p>

<p>Taking this idea further, I’d like to <strong>generate these new functions
on demand at run time</strong> akin to a JIT compiler. This can be done as a
library, mostly agnostic to the implementation of the callback. Here’s
an example of what its usage will be like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">closure_create</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">nargs</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">closure_destroy</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The callback to be converted into a closure is <code class="language-plaintext highlighter-rouge">f</code> and the number of
arguments it takes is <code class="language-plaintext highlighter-rouge">nargs</code>. A new closure is allocated and returned
as a function pointer. This closure takes <code class="language-plaintext highlighter-rouge">nargs - 1</code> arguments, and
it will call the original callback with the additional argument
<code class="language-plaintext highlighter-rouge">userdata</code>.</p>

<p>So, for example, this code uses a closure to convert <code class="language-plaintext highlighter-rouge">coord_cmp_r</code>
into a function suitable for <code class="language-plaintext highlighter-rouge">qsort()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">closure</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="n">closure</span> <span class="o">=</span> <span class="n">closure_create</span><span class="p">(</span><span class="n">coord_cmp_r</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">);</span>

<span class="n">qsort</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">closure</span><span class="p">);</span>

<span class="n">closure_destroy</span><span class="p">(</span><span class="n">closure</span><span class="p">);</span>
</code></pre></div></div>

<p><strong>Caveat</strong>: This API is <em>utterly insufficient</em> for any sort of
portability. The number of arguments isn’t nearly enough information
for the library to generate a closure. For practically every
architecture and ABI, it’s going to depend on the types of each of
those arguments. On x86-64 with the System V ABI — where I’ll be
implementing this — this argument will only count integer/pointer
arguments. To find out what it takes to do this properly, see the
<a href="https://www.gnu.org/software/libjit/">libjit</a> documentation.</p>

<h3 id="memory-design">Memory design</h3>

<p>This implementation will be for x86-64 Linux, though the high level
details will be the same for any program running in virtual memory. My
closures will span exactly two consecutive pages (typically 8kB),
though it’s possible to use exactly one page depending on the desired
trade-offs. The reason I need two pages is that each page will have
different protections.</p>

<p><img src="/img/diagram/closure-pages.svg" alt="" /></p>

<p>Native code — the <em>thunk</em> — lives in the upper page. The user data
pointer and callback function pointer live at the high end of the
lower page. The two pointers could really be anywhere in the lower
page, and they’re only at the end for aesthetic reasons. The thunk
code will be identical for all closures of the same number of
arguments.</p>

<p>The upper page will be executable and the lower page will be writable.
This allows new pointers to be set without writing to executable thunk
memory. In the future I expect operating systems to enforce W^X
(“write xor execute”), and this code will already be compliant.
Alternatively, the pointers could be “baked in” with the thunk page
and immutable, but since creating a closure requires two system calls,
I figure it’s better that the pointers be mutable and the closure
object reusable.</p>

<p>The address of the closure itself will be the upper page, since
that’s what other functions will call. The thunk will load the user data pointer
from the lower page as an additional argument, then jump to the
actual callback function also given by the lower page.</p>

<h3 id="thunk-assembly">Thunk assembly</h3>

<p>The x86-64 thunk assembly for a 2-argument closure calling a
3-argument callback looks like this:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">user:</span>  <span class="kd">dq</span> <span class="mi">0</span>
<span class="nl">func:</span>  <span class="kd">dq</span> <span class="mi">0</span>
<span class="c1">;; --- page boundary here ---</span>
<span class="nl">thunk2:</span>
        <span class="nf">mov</span>  <span class="nb">rdx</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">user</span><span class="p">]</span>
        <span class="nf">jmp</span>  <span class="p">[</span><span class="nv">rel</span> <span class="nv">func</span><span class="p">]</span>
</code></pre></div></div>

<p>As a reminder, the integer/pointer argument register order for the
System V ABI calling convention is: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">r8</code>,
<code class="language-plaintext highlighter-rouge">r9</code>. The third argument is passed through <code class="language-plaintext highlighter-rouge">rdx</code>, so the user pointer
is loaded into this register. Then it jumps to the callback address
with the original arguments still in place, plus the new argument. The
<code class="language-plaintext highlighter-rouge">user</code> and <code class="language-plaintext highlighter-rouge">func</code> values are loaded <em>RIP-relative</em> (<code class="language-plaintext highlighter-rouge">rel</code>) to the
address of the code. The thunk uses its own address (the closure
address) to locate its context.</p>

<p>The assembled machine code for the thunk is just 13 bytes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">thunk2</span><span class="p">[</span><span class="mi">13</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1">// mov  rdx, [rel user]</span>
    <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x15</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
    <span class="c1">// jmp  [rel func]</span>
    <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
<span class="p">};</span>
</code></pre></div></div>

<p>All <code class="language-plaintext highlighter-rouge">closure_create()</code> has to do is allocate two pages, copy this
buffer into the upper page, adjust the protections, and return the
address of the thunk. Since <code class="language-plaintext highlighter-rouge">closure_create()</code> will work for <code class="language-plaintext highlighter-rouge">nargs</code>
number of arguments, there will actually be 6 slightly different
thunks, one for each of the possible register arguments (<code class="language-plaintext highlighter-rouge">rdi</code> through
<code class="language-plaintext highlighter-rouge">r9</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">thunk</span><span class="p">[</span><span class="mi">6</span><span class="p">][</span><span class="mi">13</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x3d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x15</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x0d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x4C</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x05</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x4C</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x0d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Given a closure pointer returned from <code class="language-plaintext highlighter-rouge">closure_create()</code>, here are the
setters for the closure’s two pointers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">closure_set_data</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">**</span><span class="n">p</span> <span class="o">=</span> <span class="n">closure</span><span class="p">;</span>
    <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">closure_set_function</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">**</span><span class="n">p</span> <span class="o">=</span> <span class="n">closure</span><span class="p">;</span>
    <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In <code class="language-plaintext highlighter-rouge">closure_create()</code>, allocation is done with an anonymous <code class="language-plaintext highlighter-rouge">mmap()</code>,
just like in <a href="/blog/2015/03/19/">my JIT compiler</a>. It’s initially mapped writable in
order to copy the thunk, then the thunk page is set to executable.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">closure_create</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">nargs</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">page_size</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGESIZE</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">page_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">==</span> <span class="n">MAP_FAILED</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="kt">void</span> <span class="o">*</span><span class="n">closure</span> <span class="o">=</span> <span class="n">p</span> <span class="o">+</span> <span class="n">page_size</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">thunk</span><span class="p">[</span><span class="n">nargs</span> <span class="o">-</span> <span class="mi">1</span><span class="p">],</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">thunk</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">page_size</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>

    <span class="n">closure_set_function</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
    <span class="n">closure_set_data</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">userdata</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">closure</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Destroying a closure is done by computing the lower page address and
calling <code class="language-plaintext highlighter-rouge">munmap()</code> on it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">closure_destroy</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">page_size</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGESIZE</span><span class="p">);</span>
    <span class="n">munmap</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">closure</span> <span class="o">-</span> <span class="n">page_size</span><span class="p">,</span> <span class="n">page_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And that’s it! You can see the entire demo here:</p>

<ul>
  <li><a href="/download/closure-demo.c" class="download">closure-demo.c</a></li>
</ul>
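
<p>To make the API concrete, here is a minimal, self-contained usage
sketch for x86-64 Linux. It specializes creation to the two-argument
case (<code class="language-plaintext highlighter-rouge">closure2_create</code> and <code class="language-plaintext highlighter-rouge">add_bias</code> are my names for this
illustration, not part of the demo):</p>

```c
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// 2-argument thunk: mov rdx, [rel user]; jmp [rel func]
static const unsigned char thunk2[13] = {
    0x48, 0x8b, 0x15, 0xe9, 0xff, 0xff, 0xff,
    0xff, 0x25, 0xeb, 0xff, 0xff, 0xff,
};

// Allocate two pages; thunk in the executable upper page, pointers
// at the high end of the writable lower page.
static void *closure2_create(void *f, void *userdata)
{
    long n = sysconf(_SC_PAGESIZE);
    char *p = mmap(0, n * 2, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return 0;
    void *closure = p + n;
    memcpy(closure, thunk2, sizeof(thunk2));
    mprotect(closure, n, PROT_READ | PROT_EXEC);
    ((void **)closure)[-2] = userdata;  // user slot
    ((void **)closure)[-1] = f;         // func slot
    return closure;
}

// Callback: receives the closure's user pointer as a hidden third argument.
static int add_bias(int a, int b, void *user)
{
    return a + b + *(int *)user;
}

int main(void)
{
    int bias = 100;
    int (*add)(int, int) = (int (*)(int, int))closure2_create((void *)add_bias, &bias);
    assert(add(3, 4) == 107);
    bias = 0;                           // user data is read at call time
    assert(add(3, 4) == 7);
    long n = sysconf(_SC_PAGESIZE);
    munmap((char *)add - n, n * 2);
    return 0;
}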

<p>It’s a lot simpler for x86-64 than it is for x86, where there’s no
RIP-relative addressing and arguments are passed on the stack. The
arguments must all be copied back onto the stack, above the new
argument, and it cannot be a tail call since the stack has to be fixed
before returning. Here’s what the thunk looks like for a 2-argument
closure:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">data:</span>	<span class="kd">dd</span> <span class="mi">0</span>
<span class="nl">func:</span>	<span class="kd">dd</span> <span class="mi">0</span>
<span class="c1">;; --- page boundary here ---</span>
<span class="nl">thunk2:</span>
        <span class="nf">call</span> <span class="nv">.rip2eax</span>
<span class="nl">.rip2eax:</span>
        <span class="nf">pop</span> <span class="nb">eax</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">eax</span> <span class="o">-</span> <span class="mi">13</span><span class="p">]</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">esp</span> <span class="o">+</span> <span class="mi">12</span><span class="p">]</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">esp</span> <span class="o">+</span> <span class="mi">12</span><span class="p">]</span>
        <span class="nf">call</span> <span class="p">[</span><span class="nb">eax</span> <span class="o">-</span> <span class="mi">9</span><span class="p">]</span>
        <span class="nf">add</span> <span class="nb">esp</span><span class="p">,</span> <span class="mi">12</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Exercise for the reader: Port the closure demo to a different
architecture or to the Windows x64 ABI.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Relocatable Global Data on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/23/"/>
    <id>urn:uuid:56be19e0-ce9a-3f37-dc85-578f397ed3e1</id>
    <updated>2016-12-23T22:50:51Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Relocatable code — program code that executes correctly from any
properly-aligned address — is an essential feature for shared
libraries. Otherwise all of a system’s shared libraries would need to
coordinate their virtual load addresses. Loading programs and
libraries to random addresses is also a valuable security feature:
Address Space Layout Randomization (ASLR). But how does a compiler
generate code for a function that accesses a global variable if that
variable’s address isn’t known at compile time?</p>

<p>Consider this simple C code sample.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function needs the base address of <code class="language-plaintext highlighter-rouge">values</code> in order to
dereference it for <code class="language-plaintext highlighter-rouge">values[x]</code>. The easiest way to find out how this
works, especially without knowing where to start, is to compile the
code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian
Jessie).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -Os -fPIC get_value.c
</code></pre></div></div>

<p>I optimized for size (<code class="language-plaintext highlighter-rouge">-Os</code>) to make the disassembly easier to follow.
Next, disassemble this pre-linked code with <code class="language-plaintext highlighter-rouge">objdump</code>. Alternatively I
could have asked for the compiler’s assembly output with <code class="language-plaintext highlighter-rouge">-S</code>, but
this will be good reverse engineering practice.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -d -Mintel get_value.o
0000000000000000 &lt;get_value&gt;:
   0:   83 ff 03                cmp    edi,0x3
   3:   0f 57 c0                xorps  xmm0,xmm0
   6:   77 0e                   ja     16 &lt;get_value+0x16&gt;
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
   f:   89 ff                   mov    edi,edi
  11:   f3 0f 10 04 b8          movss  xmm0,DWORD PTR [rax+rdi*4]
  16:   c3                      ret
</code></pre></div></div>

<p>There are a couple of interesting things going on, but let’s start
from the beginning.</p>

<ol>
  <li>
    <p><a href="https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-secure.pdf">The ABI</a> specifies that the first integer/pointer argument
(the 32-bit integer <code class="language-plaintext highlighter-rouge">x</code>) is passed through the <code class="language-plaintext highlighter-rouge">edi</code> register. The
function compares <code class="language-plaintext highlighter-rouge">x</code> to 3, to satisfy <code class="language-plaintext highlighter-rouge">x &lt; 4</code>.</p>
  </li>
  <li>
    <p>The ABI specifies that floating point values are returned through
the <a href="/blog/2015/07/10/">SSE2 SIMD register</a> <code class="language-plaintext highlighter-rouge">xmm0</code>. It’s cleared by XORing the
register with itself — the conventional way to clear registers on
x86 — setting up for a return value of <code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>It then uses the result of the previous comparison to perform a
jump, <code class="language-plaintext highlighter-rouge">ja</code> (“jump if after”). That is, jump to the relative address
specified by the jump’s operand if the first operand to <code class="language-plaintext highlighter-rouge">cmp</code>
(<code class="language-plaintext highlighter-rouge">edi</code>) comes after the second operand (<code class="language-plaintext highlighter-rouge">0x3</code>) as <em>unsigned</em> values.
Its cousin, <code class="language-plaintext highlighter-rouge">jg</code> (“jump if greater”), is for signed values. If <code class="language-plaintext highlighter-rouge">x</code>
is outside the array bounds, it jumps straight to <code class="language-plaintext highlighter-rouge">ret</code>, returning
<code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>If <code class="language-plaintext highlighter-rouge">x</code> was in bounds, it uses a <code class="language-plaintext highlighter-rouge">lea</code> (“load effective address”) to
load <em>something</em> into the 64-bit <code class="language-plaintext highlighter-rouge">rax</code> register. This is the
complicated bit, and I’ll start by giving the answer: The value
loaded into <code class="language-plaintext highlighter-rouge">rax</code> is the address of the <code class="language-plaintext highlighter-rouge">values</code> array. More on
this in a moment.</p>
  </li>
  <li>
    <p>Finally it uses <code class="language-plaintext highlighter-rouge">x</code> as an index into address in <code class="language-plaintext highlighter-rouge">rax</code>. The <code class="language-plaintext highlighter-rouge">movss</code>
(“move scalar single-precision”) instruction loads a 32-bit float
into the first lane of <code class="language-plaintext highlighter-rouge">xmm0</code>, where the caller expects to find the
return value. This is all preceded by a <code class="language-plaintext highlighter-rouge">mov edi, edi</code> which
<a href="/blog/2016/03/31/"><em>looks</em> like a hotpatch nop</a>, but it isn’t. x86-64 always uses
64-bit registers for addressing, meaning it uses <code class="language-plaintext highlighter-rouge">rdi</code> not <code class="language-plaintext highlighter-rouge">edi</code>.
All 32-bit register assignments clear the upper 32 bits, and so
this <code class="language-plaintext highlighter-rouge">mov</code> zero-extends <code class="language-plaintext highlighter-rouge">edi</code> into <code class="language-plaintext highlighter-rouge">rdi</code>. This is in case of the
unlikely event that the caller left garbage in those upper bits.</p>
  </li>
</ol>

<h3 id="clearing-xmm0">Clearing <code class="language-plaintext highlighter-rouge">xmm0</code></h3>

<p>The first interesting part: <code class="language-plaintext highlighter-rouge">xmm0</code> is cleared even when its first lane
is loaded with a value. There are two reasons to do this.</p>

<p>The obvious reason is that the alternative requires additional
instructions, and I told GCC to optimize for size. It would need
either an extra <code class="language-plaintext highlighter-rouge">ret</code> or a conditional <code class="language-plaintext highlighter-rouge">jmp</code> over the “else” branch.</p>

<p>The less obvious reason is that it breaks a <em>data dependency</em>. For
over 20 years now, x86 micro-architectures have employed an
optimization technique called <a href="https://en.wikipedia.org/wiki/Register_renaming">register renaming</a>. <em>Architectural
registers</em> (<code class="language-plaintext highlighter-rouge">rax</code>, <code class="language-plaintext highlighter-rouge">edi</code>, etc.) are just temporary names for
underlying <em>physical registers</em>. This disconnect allows for more
aggressive out-of-order execution. Two instructions sharing an
architectural register can be executed independently so long as there
are no data dependencies between these instructions.</p>

<p>For example, take this assembly sample. It assembles to 9 bytes of
machine code.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>This reads a 32-bit value from the address stored in <code class="language-plaintext highlighter-rouge">rcx</code>, then
assigns <code class="language-plaintext highlighter-rouge">ecx</code> and uses <code class="language-plaintext highlighter-rouge">cl</code> (the lowest byte of <code class="language-plaintext highlighter-rouge">rcx</code>) in a shift
operation. Without register renaming, the shift couldn’t be performed
until the load in the first instruction completed. However, the second
instruction is a 32-bit assignment, which, as I mentioned before, also
clears the upper 32 bits of <code class="language-plaintext highlighter-rouge">rcx</code>, wiping the unused parts of the
register.</p>

<p>So after the second instruction, it’s guaranteed that the value in
<code class="language-plaintext highlighter-rouge">rcx</code> has no dependencies on code that comes before it. Because of
this, it’s likely a different physical register will be used for the
second and third instructions, allowing these instructions to be
executed out of order, <em>before</em> the load. Ingenious!</p>

<p>Compare it to this example, where the second instruction assigns to
<code class="language-plaintext highlighter-rouge">cl</code> instead of <code class="language-plaintext highlighter-rouge">ecx</code>. This assembles to just 6 bytes.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">cl</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>The result is 3 bytes smaller, but since it’s not a 32-bit assignment,
the upper bits of <code class="language-plaintext highlighter-rouge">rcx</code> still hold the original register contents.
This creates a false dependency and may prevent out-of-order
execution, reducing performance.</p>

<p>By clearing <code class="language-plaintext highlighter-rouge">xmm0</code>, instructions in <code class="language-plaintext highlighter-rouge">get_value</code> involving <code class="language-plaintext highlighter-rouge">xmm0</code> have
the opportunity to be executed prior to instructions in the caller
that use <code class="language-plaintext highlighter-rouge">xmm0</code>.</p>

<h3 id="rip-relative-addressing">RIP-relative addressing</h3>

<p>Going back to the instruction that computes the address of <code class="language-plaintext highlighter-rouge">values</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
</code></pre></div></div>

<p>Normally load/store addresses are absolute, based off an address
either in a general purpose register, or at some hard-coded base
address. The latter is not an option in relocatable code. With
<em>RIP-relative addressing</em> that’s still the case, but the register with
the absolute address is <code class="language-plaintext highlighter-rouge">rip</code>, the instruction pointer. This
addressing mode was introduced in x86-64 to make relocatable code more
efficient.</p>

<p>That means this instruction copies the instruction pointer (pointing
to the next instruction) into <code class="language-plaintext highlighter-rouge">rax</code>, plus a 32-bit displacement,
currently zero. This isn’t the right way to encode a displacement of
zero (unless you <em>want</em> a larger instruction). That’s because the
displacement will be filled in later by the linker. The compiler adds
a <em>relocation entry</em> to the object file so that the linker knows how
to do this.</p>

<p>On platforms that <a href="/blog/2016/11/17/">use ELF</a> we can inspect these relocations with
<code class="language-plaintext highlighter-rouge">readelf</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rela.text' at offset 0x270 contains 1 entries:
  Offset          Info           Type       Sym. Value
00000000000b  000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4
</code></pre></div></div>

<p>The relocation type is <code class="language-plaintext highlighter-rouge">R_X86_64_PC32</code>. In the <a href="http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_amd64.pdf">AMD64 Architecture
Processor Supplement</a>, this is defined as “S + A - P”.</p>

<ul>
  <li>
    <p>S: Represents the value of the symbol whose index resides in the
relocation entry.</p>
  </li>
  <li>
    <p>A: Represents the addend used to compute the value of the
relocatable field.</p>
  </li>
  <li>
    <p>P: Represents the place of the storage unit being relocated.</p>
  </li>
</ul>

<p>The symbol, S, is <code class="language-plaintext highlighter-rouge">.rodata</code> — the final address for this object file’s
portion of <code class="language-plaintext highlighter-rouge">.rodata</code> (where <code class="language-plaintext highlighter-rouge">values</code> resides). The addend, A, is <code class="language-plaintext highlighter-rouge">-4</code>
since the instruction pointer points at the <em>next</em> instruction. That
is, this will be relative to four bytes after the relocation offset.
Finally, the address of the relocation, P, is the address of the last four
bytes of the <code class="language-plaintext highlighter-rouge">lea</code> instruction. These values are all known at
link-time, so no run-time support is necessary.</p>

<p>Being “S - P” (overall), this will be the displacement between these
two addresses: the 32-bit value is relative. It’s relocatable so long
as these two parts of the binary (code and data) maintain a fixed
distance from each other. The binary is relocated as a whole, so this
assumption holds.</p>

<h3 id="32-bit-relocation">32-bit relocation</h3>

<p>Since RIP-relative addressing wasn’t introduced until x86-64, how did
this all work on x86? Again, let’s just see what the compiler does.
Add the <code class="language-plaintext highlighter-rouge">-m32</code> flag for a 32-bit target, and <code class="language-plaintext highlighter-rouge">-fomit-frame-pointer</code> to
make it simpler for explanatory purposes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 &lt;get_value&gt;:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   d9 ee                   fldz
   6:   e8 fc ff ff ff          call   7 &lt;get_value+0x7&gt;
   b:   81 c1 02 00 00 00       add    ecx,0x2
  11:   83 f8 03                cmp    eax,0x3
  14:   77 09                   ja     1f &lt;get_value+0x1f&gt;
  16:   dd d8                   fstp   st(0)
  18:   d9 84 81 00 00 00 00    fld    DWORD PTR [ecx+eax*4+0x0]
  1f:   c3                      ret

Disassembly of section .text.__x86.get_pc_thunk.cx:

00000000 &lt;__x86.get_pc_thunk.cx&gt;:
   0:   8b 0c 24                mov    ecx,DWORD PTR [esp]
   3:   c3                      ret
</code></pre></div></div>

<p>Hmm, this one includes an extra function.</p>

<ol>
  <li>
    <p>In this calling convention, arguments are passed on the stack. The
first instruction loads the argument, <code class="language-plaintext highlighter-rouge">x</code>, into <code class="language-plaintext highlighter-rouge">eax</code>.</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">fldz</code> instruction clears the x87 floating point return
register, just like clearing <code class="language-plaintext highlighter-rouge">xmm0</code> in the x86-64 version.</p>
  </li>
  <li>
    <p>Next it calls <code class="language-plaintext highlighter-rouge">__x86.get_pc_thunk.cx</code>. The call pushes the
instruction pointer, <code class="language-plaintext highlighter-rouge">eip</code>, onto the stack. This function reads
that value off the stack into <code class="language-plaintext highlighter-rouge">ecx</code> and returns. In other words,
calling this function copies <code class="language-plaintext highlighter-rouge">eip</code> into <code class="language-plaintext highlighter-rouge">ecx</code>. It’s setting up to
load data at an address relative to the code. Notice the function
name starts with two underscores — a name reserved exactly for
these sorts of implementation purposes.</p>
  </li>
  <li>
    <p>Next a 32-bit displacement is added to <code class="language-plaintext highlighter-rouge">ecx</code>. In this case it’s
<code class="language-plaintext highlighter-rouge">2</code>, but, like before, this is actually going to be filled in later by
the linker.</p>
  </li>
  <li>
    <p>Then it’s just like before: a branch to optionally load a value.
The floating point load (<code class="language-plaintext highlighter-rouge">fld</code>) is another relocation.</p>
  </li>
</ol>

<p>Let’s look at the relocations. There are three this time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
 Offset     Info    Type        Sym.Value  Sym. Name
00000007  00000e02 R_386_PC32    00000000   __x86.get_pc_thunk.cx
0000000d  00000f0a R_386_GOTPC   00000000   _GLOBAL_OFFSET_TABLE_
0000001b  00000709 R_386_GOTOFF  00000000   .rodata
</code></pre></div></div>

<p>The first relocation is the call-site for the thunk. The thunk has
external linkage and may be merged with a matching thunk in another
object file, and so may be relocated. (Clang inlines its thunk.) Calls
are relative, so its type is <code class="language-plaintext highlighter-rouge">R_386_PC32</code>: a code-relative
displacement just like on x86-64.</p>

<p>The next is of type <code class="language-plaintext highlighter-rouge">R_386_GOTPC</code> and sets the second operand in that
<code class="language-plaintext highlighter-rouge">add ecx</code>. It’s defined as “GOT + A - P” where “GOT” is the address of
the Global Offset Table — a table of addresses of the binary’s
relocated objects. Since <code class="language-plaintext highlighter-rouge">values</code> is static, the GOT won’t actually
hold an address for it, but the relative address of the GOT itself
will be useful.</p>

<p>The final relocation is of type <code class="language-plaintext highlighter-rouge">R_386_GOTOFF</code>. This is defined as
“S + A - GOT”. Another displacement between two addresses. This is the
displacement in the load, <code class="language-plaintext highlighter-rouge">fld</code>. Ultimately the load adds these last
two relocations together, canceling the GOT:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  (GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P
</code></pre></div></div>

<p>So the GOT isn’t relevant in this case. It’s just a mechanism for
constructing a custom relocation type.</p>

<h3 id="branch-optimization">Branch optimization</h3>

<p>Notice in the x86 version the thunk is called before checking the
argument. What if it’s most likely that <code class="language-plaintext highlighter-rouge">x</code> will be out of bounds of
the array, and the function usually returns zero? That means it’s
usually wasting its time calling the thunk. Without profile-guided
optimization the compiler probably won’t know this.</p>

<p>The typical way to provide such a compiler hint is with a pair of
macros, <code class="language-plaintext highlighter-rouge">likely()</code> and <code class="language-plaintext highlighter-rouge">unlikely()</code>. With GCC and Clang, these would
be defined to use <code class="language-plaintext highlighter-rouge">__builtin_expect</code>. Compilers without this sort of
feature would have macros that do nothing instead. So I gave it a
shot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define likely(x)    __builtin_expect((x),1)
#define unlikely(x)  __builtin_expect((x),0)
</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">unlikely</span><span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately this makes no difference even in the latest version of
GCC. In Clang it changes branch fall-through (for <a href="http://www.agner.org/optimize/microarchitecture.pdf">static branch
prediction</a>), but still always calls the thunk. It seems
compilers <a href="https://ewontfix.com/18/">have difficulty</a> with <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54232">optimizing relocatable
code</a> on x86.</p>

<h3 id="x86-64-isnt-just-about-more-memory">x86-64 isn’t just about more memory</h3>

<p>It’s commonly understood that the advantage of 64-bit versus 32-bit
systems is processes having access to more than 4GB of memory. But as
this shows, there’s more to it than that. Even programs that don’t
need that much memory can really benefit from newer features like
RIP-relative addressing.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Portable Structure Access with Member Offset Constants</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/22/"/>
    <id>urn:uuid:81ff4064-17f1-3a9b-a5ec-61acb03385b9</id>
    <updated>2016-11-22T12:55:29Z</updated>
    <category term="c"/><category term="posix"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you need to write a C program to access a long sequence of
structures from a binary file in a specified format. These structures
have different lengths and contents, but also a common header
identifying their type and size. Here’s the definition of that header
(no padding):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">event</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">time</span><span class="p">;</span>   <span class="c1">// unix epoch (microseconds)</span>
    <span class="kt">uint32_t</span> <span class="n">size</span><span class="p">;</span>   <span class="c1">// including this header (bytes)</span>
    <span class="kt">uint16_t</span> <span class="n">source</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">type</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">size</code> member is used to find the offset of the next structure in
the file without knowing anything else about the current structure.
Just add <code class="language-plaintext highlighter-rouge">size</code> to the offset of the current structure.</p>

<p>The <code class="language-plaintext highlighter-rouge">type</code> member indicates what kind of data follows this structure.
The program is likely to <code class="language-plaintext highlighter-rouge">switch</code> on this value.</p>

<p>The actual structures might look something like this (in the spirit of
<a href="http://openxcom.org/">X-COM</a>). Note how each structure begins with <code class="language-plaintext highlighter-rouge">struct event</code> as
header. All angles are expressed using <a href="https://en.wikipedia.org/wiki/Binary_scaling">binary scaling</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EVENT_TYPE_OBSERVER            10
#define EVENT_TYPE_UFO_SIGHTING        20
#define EVENT_TYPE_SUSPICIOUS_SIGNAL   30
</span>
<span class="k">struct</span> <span class="n">observer</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">latitude</span><span class="p">;</span>   <span class="c1">// binary scaled angle</span>
    <span class="kt">uint32_t</span> <span class="n">longitude</span><span class="p">;</span>  <span class="c1">//</span>
    <span class="kt">uint16_t</span> <span class="n">source_id</span><span class="p">;</span>  <span class="c1">// later used for event source</span>
    <span class="kt">uint16_t</span> <span class="n">name_size</span><span class="p">;</span>  <span class="c1">// not including null terminator</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[];</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">ufo_sighting</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">azimuth</span><span class="p">;</span>    <span class="c1">// binary scaled angle</span>
    <span class="kt">uint32_t</span> <span class="n">elevation</span><span class="p">;</span>  <span class="c1">//</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">suspicious_signal</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">num_channels</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">sample_rate</span><span class="p">;</span>  <span class="c1">// Hz</span>
    <span class="kt">uint32_t</span> <span class="n">num_samples</span><span class="p">;</span>  <span class="c1">// per channel</span>
    <span class="kt">int16_t</span> <span class="n">samples</span><span class="p">[];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>If all integers are stored in little endian byte order (least
significant byte first), there’s a strong temptation to lay the
structures directly over the data. After all, this will work correctly
on most computers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">event</span> <span class="n">header</span><span class="p">;</span>
<span class="n">fread</span><span class="p">(</span><span class="o">&amp;</span><span class="n">header</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">header</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">header</span><span class="p">.</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This code will not work correctly when:</p>

<ol>
  <li>
    <p>The host machine doesn’t use little endian byte order, though this
is now uncommon. Sometimes developers will attempt to detect the
byte order at compile time and use the preprocessor to byte-swap if
needed. This is a mistake.</p>
  </li>
  <li>
    <p>The host machine has different alignment requirements and so
introduces additional padding to the structure. Sometimes this can
be resolved with a non-standard <a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.4/gcc/Structure_002dPacking-Pragmas.html"><code class="language-plaintext highlighter-rouge">#pragma pack</code></a>.</p>
  </li>
</ol>

<h3 id="integer-extraction-functions">Integer extraction functions</h3>

<p>Fortunately it’s easy to write fast, correct, portable code for this
situation. First, define some functions to extract little endian
integers from an octet buffer (<code class="language-plaintext highlighter-rouge">uint8_t</code>). These will work correctly
regardless of the host’s alignment and byte order.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint16_t</span>
<span class="nf">extract_u16le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint16_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint16_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint64_t</span>
<span class="nf">extract_u64le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">7</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The big endian version is identical, but with shifts in reverse order.</p>

<p>A common concern is that these functions are a lot less efficient than
they could be. On x86 where alignment is very relaxed, each could be
implemented as a single load instruction. However, on GCC 4.x and
earlier, <code class="language-plaintext highlighter-rouge">extract_u32le</code> compiles to something like this:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32le:</span>
        <span class="nf">movzx</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">3</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">eax</span><span class="p">,</span> <span class="mi">24</span>
        <span class="nf">mov</span>     <span class="nb">edx</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">movzx</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">2</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">eax</span><span class="p">,</span> <span class="mi">16</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">movzx</span>   <span class="nb">edx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">movzx</span>   <span class="nb">edx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">edx</span><span class="p">,</span> <span class="mi">8</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>It’s tempting to fix the problem with the following definition:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Note: Don't do this.</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="o">*</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s unportable, it’s undefined behavior, and worst of all, it <a href="http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html">might
not work correctly even on x86</a>. Fortunately I have some great
news. On GCC 5.x and above, the correct definition compiles to the
desired, fast version. It’s the best of both worlds.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32le:</span>
        <span class="nf">mov</span>     <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>It’s even smart about the big endian version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32be</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Is compiled to exactly what you’d want:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32be:</span>
        <span class="nf">mov</span>     <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">bswap</span>   <span class="nb">eax</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Or, even better, if your system supports <code class="language-plaintext highlighter-rouge">movbe</code> (<code class="language-plaintext highlighter-rouge">gcc -mmovbe</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32be:</span>
        <span class="nf">movbe</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Unfortunately, Clang/LLVM is <em>not</em> this smart as of 3.9, but I’m
betting it will eventually learn how to do this, too.</p>

<h3 id="member-offset-constants">Member offset constants</h3>

<p>For this next technique, that <code class="language-plaintext highlighter-rouge">struct event</code> from above need not
actually be in the source. It’s purely documentation. Instead, let’s
define the structure in terms of <em>member offset constants</em> — a term I
just made up for this article. I’ve included the integer types as part
of the name to aid in their correct use.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14
</span></code></pre></div></div>

<p>Given a buffer, the integer extraction functions, and these offsets,
structure members can be plucked out on demand.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="kt">uint64_t</span> <span class="n">time</span>   <span class="o">=</span> <span class="n">extract_u64le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U64LE_TIME</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="n">size</span>   <span class="o">=</span> <span class="n">extract_u32le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U32LE_SIZE</span><span class="p">);</span>
<span class="kt">uint16_t</span> <span class="n">source</span> <span class="o">=</span> <span class="n">extract_u16le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U16LE_SOURCE</span><span class="p">);</span>
<span class="kt">uint16_t</span> <span class="n">type</span>   <span class="o">=</span> <span class="n">extract_u16le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U16LE_TYPE</span><span class="p">);</span>
</code></pre></div></div>

<p>On x86 with GCC 5.x, each member access will be inlined and compiled
to a one-instruction extraction. As far as performance is concerned,
it’s identical to using a structure overlay, but this time the C code
is clean and portable. A slight downside is the lack of type checking
on member access: it’s easy to mismatch the types and accidentally
read garbage.</p>

<h3 id="memory-mapping-and-iteration">Memory mapping and iteration</h3>

<p>There’s a real advantage to memory mapping the input file and using
its contents directly. On a system with a huge virtual address space,
such as x86-64 or AArch64, this memory is almost “free.” Already being
backed by a file, paging out this memory costs nothing (i.e. it’s
discarded). The input file can comfortably be much larger than
physical memory without straining the system.</p>

<p>Unportable structure overlay can take advantage of memory mapping this
way, but has the previously-described issues. An approach with member
offset constants will take advantage of it just as well, all while
remaining clean and portable.</p>

<p>I like to wrap the memory mapping code into a simple interface, which
makes porting to non-POSIX platforms, such as Windows, easier. Caveat:
This won’t work with files whose size exceeds the available contiguous
virtual memory of the system — a real problem for 32-bit systems.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/stat.h&gt;</span><span class="cp">
</span>
<span class="kt">uint8_t</span> <span class="o">*</span>
<span class="nf">map_file</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="kt">size_t</span> <span class="o">*</span><span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">O_RDONLY</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="k">struct</span> <span class="n">stat</span> <span class="n">stat</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fstat</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">stat</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="o">*</span><span class="n">length</span> <span class="o">=</span> <span class="n">stat</span><span class="p">.</span><span class="n">st_size</span><span class="p">;</span>  <span class="c1">// TODO: possible overflow</span>
    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">*</span><span class="n">length</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">unmap_file</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Next, here’s an example that iterates over all the structures in
<code class="language-plaintext highlighter-rouge">input_file</code>, in this case counting each. The <code class="language-plaintext highlighter-rouge">size</code> member is
extracted in order to stride to the next structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">length</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="n">map_file</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">length</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">data</span><span class="p">)</span>
    <span class="n">FATAL</span><span class="p">();</span>

<span class="kt">size_t</span> <span class="n">event_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">&lt;</span> <span class="n">data</span> <span class="o">+</span> <span class="n">length</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">event_count</span><span class="o">++</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">extract_u32le</span><span class="p">(</span><span class="n">p</span> <span class="o">+</span> <span class="n">EVENT_U32LE_SIZE</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">&gt;</span> <span class="n">length</span> <span class="o">-</span> <span class="p">(</span><span class="n">p</span> <span class="o">-</span> <span class="n">data</span><span class="p">))</span>
        <span class="n">FATAL</span><span class="p">();</span>  <span class="c1">// invalid size</span>
    <span class="n">p</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"I see %zu events.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">event_count</span><span class="p">);</span>

<span class="n">unmap_file</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span>
</code></pre></div></div>

<p>This is the basic structure for navigating this kind of data. A deeper
dive would involve a <code class="language-plaintext highlighter-rouge">switch</code> inside the loop, extracting the relevant
members for whatever use is needed.</p>

<p>Fast, correct, simple. Pick three.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A Magnetized Needle and a Steady Hand</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/17/"/>
    <id>urn:uuid:1abbb17d-9836-3efc-8493-52dd93a90736</id>
    <updated>2016-11-17T23:35:26Z</updated>
    <category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Now they’ve gone and done it. An unidentified agency has spread a
potent computer virus across all the world’s computers and deleted the
binaries for every copy of every software development tool. Even the
offline copies — it’s <em>that</em> potent.</p>

<p>Most of the source code still exists, even for the compilers, and most
computer systems will continue operating without disruption, but no
new software can be developed unless it’s written byte by byte in raw
machine code. Only <em>real programmers</em> can get anything done.</p>

<p><a href="http://xkcd.com/378/"><img src="http://imgs.xkcd.com/comics/real_programmers.png" alt="" /></a></p>

<p>The world’s top software developers have been put to work
bootstrapping a C compiler (and others) completely from scratch so
that we can get back to normal. Without even an assembler, it’s a
slow, tedious process.</p>

<p>In the meantime, rather than wait around for the bootstrap work to
complete, the rest of us have been assigned individual programs hit by
the virus. For example, many basic unix utilities have been wiped out,
and the bootstrap would benefit from having them. Having different
groups tackle each missing program will allow the bootstrap effort to
move forward somewhat in parallel. <em>At least that’s what the compiler
nerds told us.</em> The real reason is that they’re tired of being asked
if they’re done yet, and these tasks will keep the rest of us quietly
busy.</p>

<p>Fortunately you and I have been assigned the easiest task of all:
<strong>We’re to write the <code class="language-plaintext highlighter-rouge">true</code> command from scratch.</strong> We’ll have to
figure it out byte by byte. The target is x86-64 Linux, which means
we’ll need the following documentation:</p>

<ol>
  <li>
    <p><a href="http://refspecs.linuxbase.org/elf/elf.pdf">Executable and Linking Format (ELF) Specification</a>. This is
the binary format used by modern Unix-like systems, including
Linux. A more convenient way to access this document is <a href="http://man7.org/linux/man-pages/man5/elf.5.html"><code class="language-plaintext highlighter-rouge">man 5
elf</code></a>.</p>
  </li>
  <li>
    <p><a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">Intel 64 and IA-32 Architectures Software Developer’s
Manual</a> (Volume 2). This fully documents the instruction set
and its encoding. It’s all the information needed to write x86
machine code by hand. The AMD manuals would work too.</p>
  </li>
  <li>
    <p><a href="https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-secure.pdf">System V Application Binary Interface: AMD64 Architecture
Processor Supplement</a>. Only a few pieces of information are
needed from this document, but more would be needed for a more
substantial program.</p>
  </li>
  <li>
    <p>Some magic numbers from header files.</p>
  </li>
</ol>

<h3 id="manual-assembly">Manual Assembly</h3>

<p>The program we’re writing is <code class="language-plaintext highlighter-rouge">true</code>, whose behavior is documented as
“do nothing, successfully.” All command line arguments are ignored and
no input is read. The program only needs to perform the exit system
call, immediately terminating the process.</p>

<p>According to the ABI document (3) Appendix A, the registers for system
call arguments are: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">r10</code>, <code class="language-plaintext highlighter-rouge">r8</code>, <code class="language-plaintext highlighter-rouge">r9</code>. The system
call number goes in <code class="language-plaintext highlighter-rouge">rax</code>. The exit system call takes only one
argument, and that argument will be 0 (success), so <code class="language-plaintext highlighter-rouge">rdi</code> should be
set to zero. It’s likely that it’s already zero when the program
starts, but the ABI document says its contents are undefined (§3.4),
so we’ll set it explicitly.</p>

<p>For Linux on x86-64, the system call number for exit is 60
(<code class="language-plaintext highlighter-rouge">/usr/include/asm/unistd_64.h</code>), so <code class="language-plaintext highlighter-rouge">rax</code> will be set to 60, followed
by <code class="language-plaintext highlighter-rouge">syscall</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">xor</span>  <span class="nb">edi</span><span class="p">,</span> <span class="nb">edi</span>
    <span class="nf">mov</span>  <span class="nb">eax</span><span class="p">,</span> <span class="mi">60</span>
    <span class="nf">syscall</span>
</code></pre></div></div>

<p>There’s no assembler available to turn this into machine code, so it
has to be assembled by hand. For that we need the Intel manual (2).</p>

<p>The first instruction is <code class="language-plaintext highlighter-rouge">xor</code>, so look up that mnemonic in the
manual. Like most x86 mnemonics, there are many different opcodes and
multiple ways to encode the same operation. For <code class="language-plaintext highlighter-rouge">xor</code>, we have 22
opcodes to examine.</p>

<p><img src="/img/steady-hand/xor.png" alt="" /></p>

<p>The operands are two 32-bit registers, so there are two options:
opcodes 0x31 and 0x33.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>31 /r      XOR r/m32, r32
33 /r      XOR r32, r/m32
</code></pre></div></div>

<p>The “r/m32” means the operand can be either a register or the address
of a 32-bit region of memory. With two register operands, both
encodings are equally valid, both have the same length (2 bytes), and
neither is canonical, so the decision is entirely arbitrary. Let’s
pick the first one, opcode 0x31, since it’s listed first.</p>

<p>The “/r” after the opcode means the register-only operand (“r32” in
both cases) will be specified in the ModR/M byte. This is the byte
that immediately follows the opcode and specifies one or both of the
operands.</p>

<p>The ModR/M byte is broken into three parts: mod (2 bits), reg (3
bits), r/m (3 bits). This gets a little complicated, but if you stare
at Table 2-1 in the Intel manual for long enough it eventually makes
sense. In short, two high bits (11) for mod indicates we’re working
with a register rather than a load. Here’s where we’re at for ModR/M:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11 ??? ???
</code></pre></div></div>

<p>The order of the x86 registers is unintuitive: <code class="language-plaintext highlighter-rouge">ax</code>, <code class="language-plaintext highlighter-rouge">cx</code>, <code class="language-plaintext highlighter-rouge">dx</code>, <code class="language-plaintext highlighter-rouge">bx</code>,
<code class="language-plaintext highlighter-rouge">sp</code>, <code class="language-plaintext highlighter-rouge">bp</code>, <code class="language-plaintext highlighter-rouge">si</code>, <code class="language-plaintext highlighter-rouge">di</code>. With 0-indexing, that gives <code class="language-plaintext highlighter-rouge">di</code> a value of 7
(111 in binary). With <code class="language-plaintext highlighter-rouge">edi</code> as both operands, this makes ModR/M:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11 111 111
</code></pre></div></div>

<p>Or, in hexadecimal, FF. And that’s it for this instruction. With the
opcode (0x31) and the ModR/M byte (0xFF):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>31 FF
</code></pre></div></div>

<p>The encoding for <code class="language-plaintext highlighter-rouge">mov</code> is a bit different. Look it up and match the
operands. Like before, there are two possible options:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>B8+rd id   MOV r32, imm32
C7 /0 id   MOV r/m32, imm32
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">B8+rd</code> notation means the 32-bit register operand (<em>rd</em> for
“register double word”) is added to the opcode instead of having a
ModR/M byte. It’s followed by a 32-bit immediate value (<em>id</em> for
“integer double word”). That’s a total of 5 bytes.</p>

<p>The “/0” in the second means 0 goes in the “reg” field of ModR/M, and
the whole instruction is followed by the 32-bit immediate (id). That’s
a total of 6 bytes. Since that encoding is longer, we’ll use the first.</p>

<p>So, that’s opcode <code class="language-plaintext highlighter-rouge">0xB8 + 0</code>, since <code class="language-plaintext highlighter-rouge">eax</code> is register number 0,
followed by 60 (0x3C) as a little-endian, 4-byte value. Here’s the
encoding for the second instruction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>B8 3C 00 00 00
</code></pre></div></div>

<p>The final instruction is a cakewalk. There are no operands, and it
comes in only one form: two opcode bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0F 05   SYSCALL
</code></pre></div></div>

<p>So the encoding for this instruction is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0F 05
</code></pre></div></div>

<p>Putting it all together the program is 9 bytes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>31 FF B8 3C 00 00 00 0F 05
</code></pre></div></div>

<p>Aren’t you glad you don’t normally have to assemble entire programs by
hand?</p>

<h3 id="constructing-the-elf">Constructing the ELF</h3>

<p>Back in the old days you may have been able to simply drop these bytes
into a file and execute it. That’s how <a href="/blog/2014/12/09/">DOS COM programs worked</a>.
But that definitely won’t work on Linux. Binaries must
be in the Executable and Linking Format (ELF). This format tells the
loader how to initialize the program in memory and how to start it.</p>

<p>Fortunately for this program we’ll only need to fill out two
structures: the ELF header and one program header. The binary will be
the ELF header, followed immediately by the program header, followed
immediately by the program.</p>

<p><img src="/img/steady-hand/elf.svg" alt="" /></p>

<p>To fill this binary out, we’d use whatever method the virus left
behind for writing raw bytes to a file. For now I’ll assume the <code class="language-plaintext highlighter-rouge">echo</code>
command is still available, and we’ll use hexadecimal <code class="language-plaintext highlighter-rouge">\xNN</code> escapes
to write raw bytes. If this isn’t available, you might need to use the
magnetic needle and steady hand method, or the butterflies.</p>

<p>The very first structure in an ELF file must be the ELF header, from
the ELF specification (1):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">e_ident</span><span class="p">[</span><span class="n">EI_NIDENT</span><span class="p">];</span>
        <span class="kt">uint16_t</span>      <span class="n">e_type</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_machine</span><span class="p">;</span>
        <span class="kt">uint32_t</span>      <span class="n">e_version</span><span class="p">;</span>
        <span class="n">ElfN_Addr</span>     <span class="n">e_entry</span><span class="p">;</span>
        <span class="n">ElfN_Off</span>      <span class="n">e_phoff</span><span class="p">;</span>
        <span class="n">ElfN_Off</span>      <span class="n">e_shoff</span><span class="p">;</span>
        <span class="kt">uint32_t</span>      <span class="n">e_flags</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_ehsize</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_phentsize</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_phnum</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_shentsize</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_shnum</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_shstrndx</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">ElfN_Ehdr</span><span class="p">;</span>
</code></pre></div></div>

<p>No other data is at a fixed location because this header specifies
where it can be found. If you’re writing a C program in the future,
once compilers have been bootstrapped back into existence, you can
access this structure in <code class="language-plaintext highlighter-rouge">elf.h</code>.</p>

<h4 id="the-elf-header">The ELF header</h4>

<p>The <code class="language-plaintext highlighter-rouge">EI_NIDENT</code> macro is 16, so <code class="language-plaintext highlighter-rouge">e_ident</code> is 16 bytes. The first 4
bytes are fixed: 0x7F, E, L, F.</p>

<p>The 5th byte is called <code class="language-plaintext highlighter-rouge">EI_CLASS</code>: a 32-bit program (<code class="language-plaintext highlighter-rouge">ELFCLASS32</code> =
1) or a 64-bit program (<code class="language-plaintext highlighter-rouge">ELFCLASS64</code> = 2). This will be a 64-bit
program (2).</p>

<p>The 6th byte indicates the integer format (<code class="language-plaintext highlighter-rouge">EI_DATA</code>). The one we want
for x86-64 is <code class="language-plaintext highlighter-rouge">ELFDATA2LSB</code> (1), two’s complement, little-endian.</p>

<p>The 7th byte is the ELF version (<code class="language-plaintext highlighter-rouge">EI_VERSION</code>), always 1 as of this
writing.</p>

<p>The 8th byte is the ABI (<code class="language-plaintext highlighter-rouge">EI_OSABI</code>), which in this case is
<code class="language-plaintext highlighter-rouge">ELFOSABI_SYSV</code> (0).</p>

<p>The 9th byte is the ABI version (<code class="language-plaintext highlighter-rouge">EI_ABIVERSION</code>), which is just 0 again.</p>

<p>The rest is zero padding.</p>

<p>So writing the ELF header:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x7FELF\x02\x01\x01\x00' &gt; true
echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The next field is the <code class="language-plaintext highlighter-rouge">e_type</code>. This is an executable program, so it’s
<code class="language-plaintext highlighter-rouge">ET_EXEC</code> (2). Other options are object files (<code class="language-plaintext highlighter-rouge">ET_REL</code> = 1), shared
libraries (<code class="language-plaintext highlighter-rouge">ET_DYN</code> = 3), and core files (<code class="language-plaintext highlighter-rouge">ET_CORE</code> = 4).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x02\x00' &gt;&gt; true
</code></pre></div></div>

<p>The value for <code class="language-plaintext highlighter-rouge">e_machine</code> is <code class="language-plaintext highlighter-rouge">EM_X86_64</code> (0x3E). This value isn’t in
the ELF specification but rather the ABI document (§4.1.1). On BSD
this is instead named <code class="language-plaintext highlighter-rouge">EM_AMD64</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x3E\x00' &gt;&gt; true
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">e_version</code> it’s always 1, like in the header.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x01\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_entry</code> field will be 8 bytes because this is a 64-bit ELF. This
is the virtual address of the program’s entry point. It’s where the
loader will pass control and so it’s where we’ll load the program. A
typical load base is a nice round number such as 0x40000000. For a
reason I’ll explain shortly, our entry point will be 120 bytes (0x78)
after that nice round number, at 0x40000078.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_phoff</code> field holds the offset of the program header table. The
ELF header is 64 bytes (0x40) and this structure will immediately
follow. It’s also 8 bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x40\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_shoff</code> field holds the offset of the section header table. In an
executable program we don’t need sections, so this is zero.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_flags</code> field has processor-specific flags, which in our case is
just 0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_ehsize</code> holds the size of the ELF header, which, as I said, is
64 bytes (0x40).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x40\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_phentsize</code> is the size of one program header, which is 56 bytes
(0x38).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x38\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_phnum</code> field indicates how many program headers there are. We
only need the one: the segment with the 9 program bytes, to be loaded
into memory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x01\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_shentsize</code> is the size of a section header. We’re not using
this, but we’ll do our due diligence. These are 64 bytes (0x40).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x40\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_shnum</code> field is the number of sections (0).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_shstrndx</code> is the index of the section with the string table. It
doesn’t exist, so it’s 0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00' &gt;&gt; true
</code></pre></div></div>

<h4 id="the-program-header">The program header</h4>

<p>Next is our program header.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
        <span class="kt">uint32_t</span>   <span class="n">p_type</span><span class="p">;</span>
        <span class="kt">uint32_t</span>   <span class="n">p_flags</span><span class="p">;</span>
        <span class="n">Elf64_Off</span>  <span class="n">p_offset</span><span class="p">;</span>
        <span class="n">Elf64_Addr</span> <span class="n">p_vaddr</span><span class="p">;</span>
        <span class="n">Elf64_Addr</span> <span class="n">p_paddr</span><span class="p">;</span>
        <span class="kt">uint64_t</span>   <span class="n">p_filesz</span><span class="p">;</span>
        <span class="kt">uint64_t</span>   <span class="n">p_memsz</span><span class="p">;</span>
        <span class="kt">uint64_t</span>   <span class="n">p_align</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">Elf64_Phdr</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_type</code> field indicates the segment type. This segment will hold
the program and will be loaded into memory, so we want <code class="language-plaintext highlighter-rouge">PT_LOAD</code> (1).
Other kinds of segments set up dynamic loading and such.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x01\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_flags</code> field gives the memory protections. We want executable
(<code class="language-plaintext highlighter-rouge">PF_X</code> = 1) and readable (<code class="language-plaintext highlighter-rouge">PF_R</code> = 4). These are ORed together to
make 5.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x05\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_offset</code> is the file offset for the content of this segment:
the program we assembled. It will immediately follow this header. The
ELF header was 64 bytes, plus a 56-byte program header, for an offset
of 120 (0x78).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x78\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_vaddr</code> is the virtual address where this segment will be
loaded. This is the entry point from before. A restriction is that
this value must be congruent with <code class="language-plaintext highlighter-rouge">p_offset</code> modulo the page size.
That’s why the entry point was offset by 120 bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_paddr</code> is unused for this platform.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_filesz</code> is the size of the segment in the file: 9 bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_memsz</code> is the size of the segment in memory, also 9 bytes. It
might sound redundant, but <code class="language-plaintext highlighter-rouge">p_memsz</code> may exceed <code class="language-plaintext highlighter-rouge">p_filesz</code>, in which
case the remainder is padded with zeroes when loaded.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_align</code> indicates the segment’s alignment. We don’t care about
alignment.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<h4 id="append-the-program">Append the program</h4>

<p>Finally, append the program we assembled at the beginning.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x31\xFF\xB8\x3C\x00\x00\x00\x0F\x05' &gt;&gt; true
</code></pre></div></div>

<p>Set it executable (hopefully <code class="language-plaintext highlighter-rouge">chmod</code> survived!):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod +x true
</code></pre></div></div>

<p>And test it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./true &amp;&amp; echo 'Success'
</code></pre></div></div>
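<p>Decoded, those nine bytes are <code class="language-plaintext highlighter-rouge">xor edi, edi</code>, <code class="language-plaintext highlighter-rouge">mov eax, 60</code>, and <code class="language-plaintext highlighter-rouge">syscall</code>, where 60 is the x86-64 exit system call. As a sanity check on that reading, here’s a sketch of the same behavior written in C against the raw system call interface:</p>

```c
#include <sys/syscall.h>
#include <sys/wait.h>   /* for checking the wait status from a parent */
#include <unistd.h>

/* Same effect as the appended machine code:
 *   31 FF            xor  edi, edi    ; status = 0
 *   B8 3C 00 00 00   mov  eax, 60     ; 60 = exit
 *   0F 05            syscall
 */
static void exit0(void)
{
    syscall(SYS_exit, 0);  /* does not return */
}
```

<p>Forking, calling it in the child, and checking the wait status plays the part of <code class="language-plaintext highlighter-rouge">./true &amp;&amp; echo 'Success'</code>.</p>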

<p>Here’s the whole thing as a shell script:</p>

<ul>
  <li><a href="/download/make-true.sh">make-true.sh</a></li>
</ul>

<p>Is the C compiler done bootstrapping yet?</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>An Array of Pointers vs. a Multidimensional Array</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/27/"/>
    <id>urn:uuid:d1302ff9-f958-3486-134d-01c8ab84aa51</id>
    <updated>2016-10-27T21:01:33Z</updated>
    <category term="c"/><category term="linux"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In a C program, suppose I have a table of color names of similar
length. There are two straightforward ways to construct this table.
The most common would be an array of <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The other is a two-dimensional <code class="language-plaintext highlighter-rouge">char</code> array.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The initializers are identical, and the syntax by which these tables
are used is the same, but the underlying data structures are very
different. For example, suppose I had a lookup() function that
searches the table for a particular color.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lookup</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">color</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">ncolors</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ncolors</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">color</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Thanks to array decay — array arguments are implicitly converted to
pointers (§6.9.1-10) — it doesn’t matter if the table is <code class="language-plaintext highlighter-rouge">char
colors[][7]</code> or <code class="language-plaintext highlighter-rouge">char *colors[]</code>. It’s a little bit misleading because
the compiler generates different code depending on the type.</p>
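<p>The size difference is visible to <code class="language-plaintext highlighter-rouge">sizeof</code>. A quick sketch, where the 48 assumes 8-byte pointers as on x86-64:</p>

```c
#include <stddef.h>

char *colors_ptr[] = {"red", "orange", "yellow", "green", "blue", "violet"};
char colors_2d[][7] = {"red", "orange", "yellow", "green", "blue", "violet"};

/* six pointers vs. six packed 7-byte rows */
size_t ptr_table_size(void) { return sizeof(colors_ptr); }  /* 6 * 8 = 48 */
size_t arr_table_size(void)  { return sizeof(colors_2d); }  /* 6 * 7 = 42 */
```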

<h3 id="memory-layout">Memory Layout</h3>

<p>Here’s what <code class="language-plaintext highlighter-rouge">colors_ptr</code>, a <em>jagged array</em>, typically looks like in
memory.</p>

<p><img src="/img/colortab/pointertab.png" alt="" /></p>

<p>The array of six pointers will point into the program’s string table,
usually stored in a separate page. The strings aren’t in any
particular order and will be interspersed with the program’s other
string constants. The type of the expression <code class="language-plaintext highlighter-rouge">colors_ptr[n]</code> is <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<p>On x86-64, suppose the base of the table is in <code class="language-plaintext highlighter-rouge">rax</code>, the index of the
string I want to retrieve is <code class="language-plaintext highlighter-rouge">rcx</code>, and I want to put the string’s
address back into <code class="language-plaintext highlighter-rouge">rax</code>. It’s one load instruction.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Contrast this with <code class="language-plaintext highlighter-rouge">colors_2d</code>: six 7-byte elements in a row. No
pointers or addresses. Only strings.</p>

<p><img src="/img/colortab/arraytab.png" alt="" /></p>

<p>The strings are in their defined order, packed together. The type of
the expression <code class="language-plaintext highlighter-rouge">colors_2d[n]</code> is <code class="language-plaintext highlighter-rouge">char [7]</code>, an array rather than a
pointer. If this was a large table used by a hot function, it would
have friendlier cache characteristics — both in locality and
predictability.</p>

<p>In the same x86-64 scenario as before, it takes two instructions to
put the string’s address in <code class="language-plaintext highlighter-rouge">rax</code>, but neither is a load.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">imul</span>  <span class="nb">rcx</span><span class="p">,</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">7</span>
<span class="nf">add</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rcx</span>
</code></pre></div></div>

<p>In this particular case, the generated code can be slightly improved
by increasing the string size to 8 (e.g. <code class="language-plaintext highlighter-rouge">char colors_2d[][8]</code>). The
multiply turns into a simple shift and the ALU no longer needs to be
involved, cutting it to one instruction. This looks like a load due to
the LEA (Load Effective Address), but it’s not.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">lea</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>
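<p>The same arithmetic can be mimicked from C, a handy way to convince yourself no pointer is ever loaded. A sketch (the cast and the <code class="language-plaintext highlighter-rouge">row()</code> helper are mine):</p>

```c
#include <string.h>

static const char colors[][7] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

/* colors[n] is pure address arithmetic: base + n*7, no pointer load */
static const char *row(int n)
{
    return (const char *)colors + n * 7;
}
```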

<h3 id="relocation">Relocation</h3>

<p>There’s another factor to consider: relocation. Nearly every process
running on a modern system takes advantage of a security feature
called Address Space Layout Randomization (ASLR). The virtual address
of code and data is randomized at process load time. For shared
libraries, it’s not just a security feature, it’s essential to their
basic operation. Libraries cannot possibly coordinate their preferred
load addresses with every other library on the system, and so must be
relocatable.</p>

<p>If the program is compiled with GCC or Clang configured for position
independent code — <code class="language-plaintext highlighter-rouge">-fPIC</code> (for libraries) or <code class="language-plaintext highlighter-rouge">-fpie</code> + <code class="language-plaintext highlighter-rouge">-pie</code> (for
programs) — extra work has to be done to support <code class="language-plaintext highlighter-rouge">colors_ptr</code>. Those
are all addresses in the pointer array, but the compiler doesn’t know
what those addresses will be. The compiler fills the elements with
temporary values and adds six relocation entries to the binary, one
for each element. The loader will fill out the array at load time.</p>

<p>However, <code class="language-plaintext highlighter-rouge">colors_2d</code> doesn’t have any addresses other than the address
of the table itself. The loader doesn’t need to be involved with each
of its elements. Score another point for the two-dimensional array.</p>

<p>On x86-64, in both cases the table itself typically doesn’t need a
relocation entry because it will be <em>RIP-relative</em> (in the <a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">small code
model</a>). That is, code that uses the table will be at a fixed
offset from the table no matter where the program is loaded. It won’t
need to be looked up using the Global Offset Table (GOT).</p>

<p>In case you’re ever reading compiler output, in Intel syntax the
assembly for putting the table’s RIP-relative address in <code class="language-plaintext highlighter-rouge">rax</code> looks
like so:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; NASM:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">address</span><span class="p">]</span>
<span class="c1">;; Some others:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rip</span> <span class="o">+</span> <span class="nv">address</span><span class="p">]</span>
</code></pre></div></div>

<p>Or in AT&amp;T syntax:</p>

<pre><code class="language-gas">lea    address(%rip), %rax
</code></pre>

<h3 id="virtual-memory">Virtual Memory</h3>

<p>Besides (trivially) more work for the loader, there’s another
consequence to relocations: Pages containing relocations are not
shared between processes (except after fork()). When loading a
program, the loader doesn’t copy programs and libraries to memory so
much as it memory maps their binaries with copy-on-write semantics. If
another process is running with the same binaries loaded (e.g.
libc.so), they’ll share the same physical memory so long as those
pages haven’t been modified by either process. Modifying the page
creates a unique copy for that process.</p>

<p>Relocations modify parts of the loaded binary, so these pages aren’t
shared. This means <code class="language-plaintext highlighter-rouge">colors_2d</code> has the possibility of being shared
between processes, but <code class="language-plaintext highlighter-rouge">colors_ptr</code> (and its entire page) definitely
does not. Shucks.</p>

<p>This is one of the reasons why the Procedure Linkage Table (PLT)
exists. The PLT is an array of function stubs for shared library
functions, such as those in the C standard library. Sure, the loader
<em>could</em> go through the program and fill out the address of every
library function call, but this would modify lots and lots of code
pages, creating a unique copy of large parts of the program. Instead,
the dynamic linker <a href="https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html">lazily supplies jump addresses</a> for PLT
function stubs, one per accessed library function.</p>

<p>However, as I’ve written it above, it’s unlikely that even <code class="language-plaintext highlighter-rouge">colors_2d</code>
will be shared. It’s still missing an important ingredient: const.</p>

<h3 id="const">Const</h3>

<p>They say <a href="/blog/2016/07/25/">const isn’t for optimization</a> but, darnit, this
situation keeps coming up. Since <code class="language-plaintext highlighter-rouge">colors_ptr</code> and <code class="language-plaintext highlighter-rouge">colors_2d</code> are both
global, writable arrays, the compiler puts them in the same writable
data section of the program, and, in my test program, they end up
right next to each other in the same page. The other relocations doom
<code class="language-plaintext highlighter-rouge">colors_2d</code> to being a local copy.</p>

<p>Fortunately it’s trivial to fix by adding a const:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>Writing to this memory is now undefined behavior, so the compiler is
free to put it in read-only memory (<code class="language-plaintext highlighter-rouge">.rodata</code>) and separate from the
dirty relocations. On my system, this is close enough to the code to
wind up in executable memory.</p>

<p>Note, the equivalent for <code class="language-plaintext highlighter-rouge">colors_ptr</code> requires two const qualifiers,
one for the array and another for the strings. (Obviously the const
doesn’t apply to the loader.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="k">const</span> <span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>String literals are already effectively const, though the C
specification (unlike C++) doesn’t actually define them to be this
way. But, like setting your relationship status on Facebook, declaring
it makes it official.</p>

<h3 id="its-just-micro-optimization">It’s just micro-optimization</h3>

<p>These little details are all deep down the path of micro-optimization
and will rarely ever matter in practice, but perhaps you learned
something broader from all this. This stuff fascinates me.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Linux System Calls, Error Numbers, and In-Band Signaling</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/23/"/>
    <id>urn:uuid:ee8b3af5-ce09-3f9a-ef9c-0d95807bf95e</id>
    <updated>2016-09-23T01:07:40Z</updated>
    <category term="linux"/><category term="x86"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Today I got an e-mail asking about a previous article on <a href="/blog/2015/05/15/">creating
threads on Linux using raw system calls</a> (specifically x86-64).
The questioner was looking to use threads in a program without any
libc dependency. However, he was concerned about checking for mmap(2)
errors when allocating the thread’s stack. The <a href="http://man7.org/linux/man-pages/man2/mmap.2.html">mmap(2) man
page</a> says it returns -1 (a.k.a. <code class="language-plaintext highlighter-rouge">MAP_FAILED</code>) on error and sets
errno. But how do you check errno without libc?</p>

<p>As a reminder here’s what the (unoptimized) assembly looks like.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">stack_create:</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">0</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">STACK_SIZE</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nv">PROT_WRITE</span> <span class="o">|</span> <span class="nv">PROT_READ</span>
    <span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span> <span class="nv">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="nv">MAP_PRIVATE</span> <span class="o">|</span> <span class="nv">MAP_GROWSDOWN</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_mmap</span>
    <span class="nf">syscall</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>As usual, the system call return value is in <code class="language-plaintext highlighter-rouge">rax</code>, which becomes the
return value for <code class="language-plaintext highlighter-rouge">stack_create()</code>. Again, its C prototype would look
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>If you were to, say, intentionally botch the arguments to force an
error, you might notice that the system call isn’t returning -1, but
other negative values. What gives?</p>

<p>The trick is that <strong>errno is a C concept</strong>. That’s why it’s documented
as <a href="http://man7.org/linux/man-pages/man3/errno.3.html">errno(3)</a> — the 3 means it belongs to C. Just think about
how messy this thing is: it’s a thread-local value living in the
application’s address space. The kernel rightfully has nothing to do
with it. Instead, the mmap(2) wrapper in libc assigns errno (if
needed) after the system call returns. This is how <a href="http://man7.org/linux/man-pages/man2/intro.2.html"><em>all</em> system calls
through libc work</a>, even with the <a href="http://man7.org/linux/man-pages/man2/syscall.2.html">syscall(2)
wrapper</a>.</p>

<p>So how does the kernel report the error? It’s an old-fashioned return
value. If you have any doubts, take it straight from the horse’s
mouth: <a href="http://lxr.free-electrons.com/source/mm/mmap.c?v=4.6#L1143">mm/mmap.c:do_mmap()</a>. Here’s a sample of return
statements.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">len</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

<span class="cm">/* Careful about overflows.. */</span>
<span class="n">len</span> <span class="o">=</span> <span class="n">PAGE_ALIGN</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">len</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>

<span class="cm">/* offset overflow? */</span>
<span class="k">if</span> <span class="p">((</span><span class="n">pgoff</span> <span class="o">+</span> <span class="p">(</span><span class="n">len</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">))</span> <span class="o">&lt;</span> <span class="n">pgoff</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EOVERFLOW</span><span class="p">;</span>

<span class="cm">/* Too many mappings? */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mm</span><span class="o">-&gt;</span><span class="n">map_count</span> <span class="o">&gt;</span> <span class="n">sysctl_max_map_count</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
</code></pre></div></div>

<p>It’s returning the negated error number. Simple enough.</p>

<p>If you think about it a moment, you might notice a complication: This
is a form of in-band signaling. On success, mmap(2) returns a memory
address. All those negative error numbers are potentially addresses
that a caller might want to map. How can we tell the difference?</p>

<p>1) None of the possible error numbers align on a page boundary, so
   they’re not actually valid return values. NULL <em>does</em> lie on a page
   boundary, which is one reason why it’s not used as an error return
   value for mmap(2). The other is that you might actually want to map
   NULL, for better <a href="https://blogs.oracle.com/ksplice/entry/much_ado_about_null_exploiting1">or worse</a>.</p>

<p>2) Those low negative values lie in a region of virtual memory
   reserved exclusively for the kernel (sometimes called “<a href="https://linux-mm.org/HighMemory">low
   memory</a>”). On x86-64, any address with the most significant
   bit set (i.e. the sign bit of a signed integer) is one of these
   addresses. Processes aren’t allowed to map these addresses, and so
   mmap(2) will never return such a value on success.</p>

<p>So what’s a clean, safe way to go about checking for error values?
It’s a lot easier to read <a href="https://www.musl-libc.org/">musl</a> than glibc, so let’s take a
peek at how musl does it in its own mmap: <a href="https://git.musl-libc.org/cgit/musl/tree/src/mman/mmap.c?h=v1.1.15">src/mman/mmap.c</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">off</span> <span class="o">&amp;</span> <span class="n">OFF_MASK</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">errno</span> <span class="o">=</span> <span class="n">EINVAL</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">MAP_FAILED</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">&gt;=</span> <span class="n">PTRDIFF_MAX</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">errno</span> <span class="o">=</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">MAP_FAILED</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">MAP_FIXED</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">__vm_wait</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">syscall</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="n">start</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="n">off</span><span class="p">);</span>
</code></pre></div></div>

<p>Hmm, it looks like it’s returning the result directly. What happened
to setting errno? Well, syscall() is actually a macro that runs the
result through __syscall_ret().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define syscall(...) __syscall_ret(__syscall(__VA_ARGS__))
</span></code></pre></div></div>

<p>Looking a little deeper: <a href="https://git.musl-libc.org/cgit/musl/tree/src/internal/syscall_ret.c?h=v1.1.15">src/internal/syscall_ret.c</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">__syscall_ret</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">r</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">errno</span> <span class="o">=</span> <span class="o">-</span><span class="n">r</span><span class="p">;</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Bingo. As documented, if the value falls within that “high” (unsigned)
range of negative values for <em>any</em> system call, it’s an error number.</p>

<p>Getting back to the original question, we could employ this same check
in the assembly code. However, since this is an anonymous memory
with a kernel-selected address, <strong>there’s only one possible error:
ENOMEM</strong> (12). This error happens if the maximum number of memory maps
has been reached, or if there’s no contiguous region available for the
4MB stack. The check will only need to test the result against -12.</p>
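<p>Translated out of assembly, musl’s check makes a tidy helper. A sketch with a hypothetical name: any raw return value in the top 4095 values of the unsigned range is a negated error number.</p>

```c
/* Classify a raw x86-64 system call return the way musl's
 * __syscall_ret() does: the top 4095 values of the unsigned range
 * are negated error numbers, everything else is a real result. */
static long syscall_error(unsigned long r)
{
    if (r > -4096UL)
        return -(long)r;  /* e.g. 12 for ENOMEM */
    return 0;             /* success: r is a valid return value */
}
```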

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Const and Optimization in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/07/25/"/>
    <id>urn:uuid:f785bc3b-dd3d-3952-2696-91eafe6b019d</id>
    <updated>2016-07-25T02:06:04Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Today there was a <a href="https://redd.it/4udqwj">question on /r/C_Programming</a> about the
effect of C’s <code class="language-plaintext highlighter-rouge">const</code> on optimization. Variations of this question
have been asked many times over the past two decades. Personally, I
blame naming of <code class="language-plaintext highlighter-rouge">const</code>.</p>

<p>Given this program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">int</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">foo</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">);</span>
        <span class="n">y</span> <span class="o">+=</span> <span class="n">x</span><span class="p">;</span>  <span class="c1">// this load not optimized out</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function <code class="language-plaintext highlighter-rouge">foo</code> takes a pointer to const, which is a promise from
the author of <code class="language-plaintext highlighter-rouge">foo</code> that it won’t modify the value of <code class="language-plaintext highlighter-rouge">x</code>. Given this
information, it would seem the compiler may assume <code class="language-plaintext highlighter-rouge">x</code> is always zero,
and therefore <code class="language-plaintext highlighter-rouge">y</code> is always zero.</p>

<p>However, inspecting the assembly output of several different compilers
shows that <code class="language-plaintext highlighter-rouge">x</code> is loaded each time around the loop. Here’s gcc 4.9.2
at -O3, with annotations, for x86-64,</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbp</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">xor</span>    <span class="nb">ebp</span><span class="p">,</span> <span class="nb">ebp</span>              <span class="c1">; y = 0</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>              <span class="c1">; loop variable i</span>
     <span class="nf">sub</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x18</span>             <span class="c1">; allocate x</span>
     <span class="nf">mov</span>    <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">],</span> <span class="mi">0</span>    <span class="c1">; x = 0</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>        <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">add</span>    <span class="nb">ebp</span><span class="p">,</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>  <span class="c1">; y += x  (not optmized?)</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">add</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x18</span>             <span class="c1">; deallocate x</span>
     <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebp</span>              <span class="c1">; return y</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">pop</span>    <span class="nb">rbp</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>The output of clang 3.5 (with -fno-unroll-loops) is the same, except
ebp and ebx are swapped, and the computation of <code class="language-plaintext highlighter-rouge">&amp;x</code> is hoisted out of
the loop, into <code class="language-plaintext highlighter-rouge">r14</code>.</p>

<p>Are both compilers failing to take advantage of this useful
information? Wouldn’t it be undefined behavior for <code class="language-plaintext highlighter-rouge">foo</code> to modify
<code class="language-plaintext highlighter-rouge">x</code>? Surprisingly, the answer is no. <em>In this situation</em>, this would
be a perfectly legal definition of <code class="language-plaintext highlighter-rouge">foo</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">foo</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="n">readonly_x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">readonly_x</span><span class="p">;</span>  <span class="c1">// cast away const</span>
    <span class="p">(</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key thing to remember is that <a href="http://yarchive.net/comp/const.html"><strong><code class="language-plaintext highlighter-rouge">const</code> doesn’t mean
constant</strong></a>. Chalk it up as a misnomer. It’s not an
optimization tool. It’s there to inform programmers — not the compiler
— as a tool to catch a certain class of mistakes at compile time. I
like it in APIs because it communicates how a function will use
certain arguments, or how the caller is expected to handle returned
pointers. It’s usually not strong enough for the compiler to change
its behavior.</p>

<p>Despite what I just said, occasionally the compiler <em>can</em> take
advantage of <code class="language-plaintext highlighter-rouge">const</code> for optimization. The C99 specification, in
§6.7.3¶5, has one sentence just for this:</p>

<blockquote>
  <p>If an attempt is made to modify an object defined with a
const-qualified type through use of an lvalue with
non-const-qualified type, the behavior is undefined.</p>
</blockquote>

<p>The original <code class="language-plaintext highlighter-rouge">x</code> wasn’t const-qualified, so this rule didn’t apply.
And there aren’t any rules against casting away <code class="language-plaintext highlighter-rouge">const</code> to modify an
object that isn’t itself <code class="language-plaintext highlighter-rouge">const</code>. This means the above (mis)behavior
of <code class="language-plaintext highlighter-rouge">foo</code> isn’t undefined behavior <em>for this call</em>. Notice how the
undefined-ness of <code class="language-plaintext highlighter-rouge">foo</code> depends on how it was called.</p>
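<p>To make that call-dependence concrete, here is a minimal, complete
sketch (the <code class="language-plaintext highlighter-rouge">main</code> harness
is mine, for illustration). Because this <code class="language-plaintext highlighter-rouge">x</code>
is not const-qualified at its definition, the write through the cast
pointer is well-defined:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;

static void
foo(const int *readonly_x)
{
    int *x = (int *)readonly_x;  // cast away const
    (*x)++;
}

int
main(void)
{
    int x = 0;       // not const-qualified, so foo() may legally modify it
    foo(&amp;x);
    printf("%d\n", x);
    return 0;
}
</code></pre></div></div>

<p>This program prints <code class="language-plaintext highlighter-rouge">1</code>
on any conforming implementation.</p>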

<p>With one tiny tweak to <code class="language-plaintext highlighter-rouge">bar</code>, I can make this rule apply, allowing the
optimizer to do some work on it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">const</span> <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>The compiler may now assume that <strong><code class="language-plaintext highlighter-rouge">foo</code> modifying <code class="language-plaintext highlighter-rouge">x</code> is undefined
behavior, therefore <em>it never happens</em></strong>. For better or worse, this is
a major part of how a C optimizer reasons about your programs. The
compiler is free to assume <code class="language-plaintext highlighter-rouge">x</code> never changes, allowing it to optimize
out both the per-iteration load and <code class="language-plaintext highlighter-rouge">y</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>            <span class="c1">; loop variable i</span>
     <span class="nf">sub</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x10</span>           <span class="c1">; allocate x</span>
     <span class="nf">mov</span>    <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">],</span> <span class="mi">0</span>  <span class="c1">; x = 0</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>      <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">add</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x10</span>           <span class="c1">; deallocate x</span>
     <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>            <span class="c1">; return 0</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>The load disappears, <code class="language-plaintext highlighter-rouge">y</code> is gone, and the function always returns
zero.</p>

<p>Curiously, the specification <em>almost</em> allows the compiler to go
further. Consider what would happen if <code class="language-plaintext highlighter-rouge">x</code> were allocated somewhere
off the stack in read-only memory. That transformation would look like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">__x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">foo</span><span class="p">(</span><span class="o">&amp;</span><span class="n">__x</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We would see a few more instructions shaved off (<a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">-fPIC, small code
model</a>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">section</span> <span class="nv">.rodata</span>
<span class="nl">x:</span>   <span class="kd">dd</span>     <span class="mi">0</span>

<span class="nf">section</span> <span class="nv">.text</span>
<span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>        <span class="c1">; loop variable i</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">x</span><span class="p">]</span>    <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>        <span class="c1">; return 0</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>Because the address of <code class="language-plaintext highlighter-rouge">x</code> is taken and “leaked,” this last transform
is not permitted. If <code class="language-plaintext highlighter-rouge">bar</code> is called recursively such that a second
address is taken for <code class="language-plaintext highlighter-rouge">x</code>, that second pointer would compare equally
(<code class="language-plaintext highlighter-rouge">==</code>) with the first pointer despite being semantically distinct
objects, which is forbidden (§6.5.9¶6).</p>
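<p>A hypothetical sketch of that recursion (the names are mine, not from
the original program): each live activation gets its own
<code class="language-plaintext highlighter-rouge">x</code>, and since both
objects are alive at once, their addresses must compare unequal.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;

static void
f(int depth, const int *outer)
{
    const int x = 0;
    if (outer)
        puts(&amp;x == outer ? "same" : "distinct");  // must print "distinct"
    if (depth)
        f(depth - 1, &amp;x);
}

int
main(void)
{
    f(1, 0);
    return 0;
}
</code></pre></div></div>

<p>Folding both activations onto one read-only object would make the two
pointers compare equal, which is exactly what §6.5.9¶6 forbids.</p>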

<p>Even with this special <code class="language-plaintext highlighter-rouge">const</code> rule, stick to using <code class="language-plaintext highlighter-rouge">const</code> for
yourself and for your fellow human programmers. Let the optimizer
reason for itself about what is constant and what is not.</p>

<p>Travis Downs nicely summed up this article in the comments:</p>

<blockquote>
  <p>In general, <code class="language-plaintext highlighter-rouge">const</code> <em>declarations</em> can’t help the optimizer, but
<code class="language-plaintext highlighter-rouge">const</code> <em>definitions</em> can.</p>
</blockquote>
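<p>A tiny illustration of the distinction (the function names are mine):
the pointer parameter below is only a const <em>declaration</em>, so the
load through it cannot be elided, while the const <em>definition</em>
<code class="language-plaintext highlighter-rouge">b</code> lets the
compiler fold in the value.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;

const int b = 1;          // const definition: modifying b would be undefined

static int
use(const int *a)         // const declaration: promises nothing about *a
{
    return *a + b;        // *a must be loaded; b may fold to the constant 1
}

int
main(void)
{
    int a = 41;
    printf("%d\n", use(&amp;a));
    return 0;
}
</code></pre></div></div>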

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Hotpatching a C Function on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/03/31/"/>
    <id>urn:uuid:49f6ea3c-d44a-3bed-1aad-70ad47e325c6</id>
    <updated>2016-03-31T23:59:59Z</updated>
    <category term="x86"/><category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In this post I’m going to do a silly, but interesting, exercise that
should never be done in any program that actually matters. I’m going to
write a program that changes one of its function definitions while
it’s actively running and using that function. Unlike <a href="/blog/2014/12/23/">last
time</a>, this won’t involve shared libraries, but it will require
x86_64 and GCC. Most of the time it will work with Clang, too, but
it’s missing an important compiler option that makes it stable.</p>

<p>If you want to see it all up front, here’s the full source:
<a href="/download/hotpatch.c">hotpatch.c</a></p>

<p>Here’s the function that I’m going to change:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s dead simple, but that’s just for demonstration purposes. This
will work with any function of arbitrary complexity. The definition
will be changed to this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"goodbye %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">x</span><span class="o">++</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I was only going to change the string, but I figured I should make it a
little more interesting.</p>

<p>Here’s how it’s going to work: I’m going to overwrite the beginning of
the function with an unconditional jump that immediately moves control
to the new definition of the function. It’s vital that the function
prototype does not change, since that would be a <em>far</em> more complex
problem.</p>

<p><strong>But first there’s some preparation to be done.</strong> The target needs to
be augmented with some GCC function attributes to prepare it for its
redefinition. As is, there are three possible problems that need to be
dealt with:</p>

<ul>
  <li>I want to hotpatch this function <em>while it is being used</em> by another
thread <em>without</em> any synchronization. It may even be executing the
function at the same time I clobber its first instructions with my
jump. If it’s in between these instructions, disaster will strike.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">ms_hook_prologue</code> function attribute. This tells
GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP
that I can safely clobber. This idea originated in Microsoft’s Win32
API, hence the “ms” in the name.</p>

<ul>
  <li>The prologue NOP needs to be updated atomically. I can’t let the
other thread see a half-written instruction or, again, disaster. On
x86 this means I have an alignment requirement. Since I’m
overwriting an 8-byte instruction, I’m specifically going to need
8-byte alignment to get an atomic write.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">aligned</code> function attribute, ensuring the
hotpatch prologue is properly aligned.</p>

<ul>
  <li>The final problem is that there must be exactly one copy of this
function in the compiled program. It must never be inlined or
cloned, since these won’t be hotpatched.</li>
</ul>

<p>As you might have guessed, this is primarily fixed with the <code class="language-plaintext highlighter-rouge">noinline</code>
function attribute. GCC may also clone the function and call that
instead, so it also needs the <code class="language-plaintext highlighter-rouge">noclone</code> attribute.</p>

<p>Even further, if GCC determines there are no side effects, it may
cache the return value and only ever call the function once. To
convince GCC that there’s a side effect, I added an empty inline
assembly string (<code class="language-plaintext highlighter-rouge">__asm("")</code>). Since <code class="language-plaintext highlighter-rouge">puts()</code> has a side effect
(output), this isn’t truly necessary for this particular example, but
I’m being thorough.</p>

<p>What does the function look like now?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span> <span class="p">((</span><span class="n">ms_hook_prologue</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">8</span><span class="p">)))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noinline</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noclone</span><span class="p">))</span>
<span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span><span class="p">(</span><span class="s">""</span><span class="p">);</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And what does the assembly look like?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -Mintel -d hotpatch
0000000000400848 &lt;hello&gt;:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 &lt;puts@plt&gt;
</code></pre></div></div>

<p>It’s 8-byte aligned and it has the 8-byte NOP: that <code class="language-plaintext highlighter-rouge">lea</code> instruction
does nothing. It copies <code class="language-plaintext highlighter-rouge">rsp</code> into itself and changes no flags. Why
not 8 1-byte NOPs? I need to replace exactly one instruction with
exactly one other instruction. I can’t have another thread in between
those NOPs.</p>

<h3 id="hotpatching">Hotpatching</h3>

<p>Next, let’s take a look at the function that will perform the
hotpatch. I’ve written a generic patching function for this purpose.
This part is entirely specific to x86.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hotpatch</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">target</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">replacement</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// 8-byte aligned?</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">page</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="o">~</span><span class="mh">0xfff</span><span class="p">);</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_WRITE</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">replacement</span> <span class="o">-</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="kt">uint8_t</span> <span class="n">bytes</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
        <span class="kt">uint64_t</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">instruction</span> <span class="o">=</span> <span class="p">{</span> <span class="p">{</span><span class="mh">0xe9</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">}</span> <span class="p">};</span>
    <span class="o">*</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">=</span> <span class="n">instruction</span><span class="p">.</span><span class="n">value</span><span class="p">;</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It takes the address of the function to be patched and the address of
the function to replace it. As mentioned, the target <em>must</em> be 8-byte
aligned (enforced by the assert). It’s also important this function is
only called by one thread at a time, even on different targets. If
that was a concern, I’d wrap it in a mutex to create a critical
section.</p>

<p>There are a number of things going on here, so let’s go through them
one at a time:</p>

<h4 id="make-the-function-writeable">Make the function writeable</h4>

<p>The .text segment will not be writeable by default. This is for both
security and safety. Before I can hotpatch the function I need to make
the function writeable. To make the function writeable, I need to make
its page writeable. To make its page writeable, I need to call
<code class="language-plaintext highlighter-rouge">mprotect()</code>. If there was another thread monkeying with the page
attributes of this page at the same time (another thread calling
<code class="language-plaintext highlighter-rouge">hotpatch()</code>) I’d be in trouble.</p>

<p>It finds the page by rounding the target address down to the nearest
4096, the assumed page size (sorry hugepages). <em>Warning</em>: I’m being a
bad programmer and not checking the result of <code class="language-plaintext highlighter-rouge">mprotect()</code>. If it
fails, the program will crash and burn. It will always fail on systems
with W^X enforcement, which will likely become the standard <a href="https://marc.info/?t=145942649500004">in the
future</a>. Under W^X (“write XOR execute”), memory can either
be writeable or executable, but never both at the same time.</p>

<p>What if the function straddles pages? Well, I’m only patching the
first 8 bytes, which, thanks to alignment, will sit entirely inside
the page I just found. It’s not an issue.</p>

<p>At the end of the function, I <code class="language-plaintext highlighter-rouge">mprotect()</code> the page back to
non-writeable.</p>

<h4 id="create-the-instruction">Create the instruction</h4>

<p>I’m assuming the replacement function is within 2GB of the original in
virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s
no 64-bit relative jump, and I only have 8 bytes to work with
anyway. Looking that up in <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">the Intel manual</a>, I see this:</p>

<p><img src="/img/misc/jmp-e9.png" alt="" /></p>

<p>Fortunately it’s a really simple instruction. It’s opcode 0xE9 and
it’s followed immediately by the 32-bit displacement. The instruction
is 5 bytes wide.</p>

<p>To compute the relative jump, I take the difference between the
functions, minus 5. Why the 5? The jump address is computed from the
position <em>after</em> the jump instruction and, as I said, it’s 5 bytes
wide.</p>

<p>I put 0xE9 in a byte array, followed by the little endian
displacement. The astute may notice that the displacement is signed
(it can go “up” or “down”) and I used an unsigned integer. That’s
because it will overflow nicely to the right value and make those
shifts clean.</p>
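<p>As a sanity check on that arithmetic, here is a small standalone
sketch (addresses borrowed from the disassembly above) that encodes a
jmp rel32 and then decodes it back the way the CPU would:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

static void
encode(uint8_t buf[5], uintptr_t from, uintptr_t to)
{
    uint32_t rel = (uint32_t)(to - from - 5);  // wraps nicely if negative
    buf[0] = 0xe9;
    buf[1] = rel &gt;&gt;  0;
    buf[2] = rel &gt;&gt;  8;
    buf[3] = rel &gt;&gt; 16;
    buf[4] = rel &gt;&gt; 24;
}

static uintptr_t
decode(const uint8_t buf[5], uintptr_t from)
{
    uint32_t rel = (uint32_t)buf[1]       | (uint32_t)buf[2] &lt;&lt;  8 |
                   (uint32_t)buf[3] &lt;&lt; 16 | (uint32_t)buf[4] &lt;&lt; 24;
    return from + 5 + (int32_t)rel;  // target = end of instruction + disp
}

int
main(void)
{
    uint8_t buf[5];
    encode(buf, 0x400848, 0x400660);  // a backwards jump, like the example
    assert(decode(buf, 0x400848) == 0x400660);
    puts("ok");
    return 0;
}
</code></pre></div></div>

<p>The backwards jump produces a displacement with the high bit set, and
the unsigned wraparound round-trips exactly.</p>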

<p>Finally, the instruction byte array I just computed is written over
the hotpatch NOP as a single, atomic, 64-bit store.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    *(uint64_t *)target = instruction.value;
</code></pre></div></div>

<p>Other threads will see either the NOP or the jump, nothing in between.
There’s no synchronization, so other threads may continue to execute
the NOP for a brief moment even though I’ve clobbered it, but that’s
fine.</p>

<h3 id="trying-it-out">Trying it out</h3>

<p>Here’s what my test program looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">worker</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">arg</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">hello</span><span class="p">();</span>
        <span class="n">usleep</span><span class="p">(</span><span class="mi">100000</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">pthread_t</span> <span class="kr">thread</span><span class="p">;</span>
    <span class="n">pthread_create</span><span class="p">(</span><span class="o">&amp;</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">worker</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="n">getchar</span><span class="p">();</span>
    <span class="n">hotpatch</span><span class="p">(</span><span class="n">hello</span><span class="p">,</span> <span class="n">new_hello</span><span class="p">);</span>
    <span class="n">pthread_join</span><span class="p">(</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I fire off the other thread to keep it pinging at <code class="language-plaintext highlighter-rouge">hello()</code>. In the
main thread, it waits until I hit enter to give the program input,
after which it calls <code class="language-plaintext highlighter-rouge">hotpatch()</code> and changes the function called by
the “worker” thread. I’ve now changed the behavior of the worker
thread without its knowledge. In a more practical situation, this
could be used to update parts of a running program without restarting
or even synchronizing.</p>

<h3 id="further-reading">Further Reading</h3>

<p>These related articles have been shared with me since publishing this
article:</p>

<ul>
  <li><a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583">Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?</a></li>
  <li><a href="http://jbremer.org/x86-api-hooking-demystified/">x86 API Hooking Demystified</a></li>
  <li><a href="http://conf.researchr.org/event/pldi-2016/pldi-2016-papers-living-on-the-edge-rapid-toggling-probes-with-cross-modification-on-x86">Living on the edge: Rapid-toggling probes with cross modification on x86</a></li>
  <li><a href="https://lwn.net/Articles/620640/">arm64: alternatives runtime patching</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Calling the Native API While Freestanding</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/02/28/"/>
    <id>urn:uuid:3649a761-d3dc-391b-7f24-a28398100102</id>
    <updated>2016-02-28T23:47:22Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>When developing <a href="/blog/2016/01/31/">minimal, freestanding Windows programs</a>, it’s
obviously beneficial to take full advantage of dynamic libraries that
are already linked rather than duplicate that functionality in the
application itself. Every Windows process automatically, and
involuntarily, has kernel32.dll and ntdll.dll loaded into its process
space before it starts. As discussed previously, kernel32.dll provides
the Windows API (Win32). The other, ntdll.dll, provides the <em>Native
API</em> for user space applications, and is the focus of this article.</p>

<p>The Native API is a low-level API, a foundation for the implementation
of the Windows API and various components that don’t use the Windows
API (drivers, etc.). It includes a runtime library (RTL) suitable for
replacing important parts of the C standard library, unavailable to
freestanding programs. Very useful for a minimal program.</p>

<p>Unfortunately, <em>using</em> the Native API is a bit of a minefield. Not all
of the documented Native API functions are actually exported by
ntdll.dll, making them inaccessible both for linking and
GetProcAddress(). Some are exported, but not documented as such.
Others are documented as exported but are not documented <em>when</em> (which
release of Windows). If a particular function wasn’t exported until
Windows 8, I don’t want to use it when supporting Windows 7.</p>

<p>This is further complicated by the Microsoft Windows SDK, where many
of these functions are just macros that alias C runtime functions.
Naturally, MinGW closely follows suit. For example, in both cases,
here is how the Native API function <code class="language-plaintext highlighter-rouge">RtlCopyMemory</code> is “declared.”</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define RtlCopyMemory(dest,src,n) memcpy((dest),(src),(n))
</span></code></pre></div></div>

<p>This is certainly not useful for freestanding programs, though it has
a significant benefit for <em>hosted</em> programs: The C compiler knows the
semantics of <code class="language-plaintext highlighter-rouge">memcpy()</code> and can properly optimize around it. Any C
compiler worth its salt will replace a small or aligned, fixed-sized
<code class="language-plaintext highlighter-rouge">memcpy()</code> or <code class="language-plaintext highlighter-rouge">memmove()</code> with the equivalent inlined code. For
example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="n">buffer0</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="kt">char</span> <span class="n">buffer1</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="c1">// ...</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">buffer0</span><span class="p">,</span> <span class="n">buffer1</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="c1">// ...</span>
</code></pre></div></div>

<p>On x86_64 (GCC 4.9.3, -Os), this <code class="language-plaintext highlighter-rouge">memcpy()</code> call is replaced with
two instructions. This isn’t possible when calling an opaque function
in a non-standard dynamic library. The side effects could be anything.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">movaps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span> <span class="o">+</span> <span class="mi">48</span><span class="p">]</span>
    <span class="nf">movaps</span>  <span class="p">[</span><span class="nb">rsp</span> <span class="o">+</span> <span class="mi">32</span><span class="p">],</span> <span class="nv">xmm0</span>
</code></pre></div></div>

<p>These Native API macro aliases are what have allowed certain Wine
issues <a href="https://bugs.winehq.org/show_bug.cgi?id=38783">to slip by unnoticed for years</a>. Very few user space
applications actually call Native API functions, even when addressed
directly by name in the source. The development suite is pulling a
bait and switch.</p>

<p>Like <a href="/blog/2014/12/09/">last time I danced at the edge of the compiler</a>, this has
caused headaches in my recent experimentation with freestanding
executables. The MinGW headers assume that the programs including them
will link against a C runtime. Dirty hack warning: To work around it,
I have to undo the definition in the MinGW headers and make my own.
For example, to use the real <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="cp">#undef RtlMoveMemory
</span><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">void</span> <span class="nf">RtlMoveMemory</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>Anywhere where I might have previously used <code class="language-plaintext highlighter-rouge">memmove()</code> I can instead
use <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code>. Or I could trivially supply my own wrapper:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">memmove</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlMoveMemory</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">d</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As of this writing, the same approach is not reliable with
<code class="language-plaintext highlighter-rouge">RtlCopyMemory()</code>, the cousin to <code class="language-plaintext highlighter-rouge">memcpy()</code>. As far as I can tell, it
was only exported starting in Windows 7 SP1 and Wine 1.7.46 (June
2015). Use <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code> instead. The overlap-handling overhead is
negligible compared to the function call overhead anyway.</p>

<p>As a side note: one reason besides minimalism for not implementing
your own <code class="language-plaintext highlighter-rouge">memmove()</code> is that it can’t be implemented efficiently in a
conforming C program. According to the language specification, your
implementation of <code class="language-plaintext highlighter-rouge">memmove()</code> would not be permitted to compare its
pointer arguments with <code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, <code class="language-plaintext highlighter-rouge">&lt;=</code>, or <code class="language-plaintext highlighter-rouge">&gt;=</code>. That would lead to
undefined behavior when pointing to unrelated objects (ISO/IEC
9899:2011 §6.5.8¶5). The simplest legal approach is to allocate a
temporary buffer, copy the source buffer into it, then copy it into
the destination buffer. However, buffer allocation may fail — i.e.
NULL return from <code class="language-plaintext highlighter-rouge">malloc()</code> — introducing a failure case to
<code class="language-plaintext highlighter-rouge">memmove()</code>, which isn’t supposed to fail.</p>

<p>Update July 2016: Alex Elsayed pointed out a solution to the
<code class="language-plaintext highlighter-rouge">memmove()</code> problem in the comments. In short: iterate over the
buffers bytewise (<code class="language-plaintext highlighter-rouge">char *</code>) using equality (<code class="language-plaintext highlighter-rouge">==</code>) tests to check for
an overlap. In theory, a compiler could optimize away the loop and
make it efficient.</p>
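<p>As a sketch of that idea, here’s a hypothetical <code class="language-plaintext highlighter-rouge">portable_memmove()</code> (illustrative only, not code from any real libc): pointers into possibly-unrelated objects are compared only with <code class="language-plaintext highlighter-rouge">==</code>, which is always defined, unlike the relational operators.</p>

```c
#include <stddef.h>

/* Hypothetical strictly conforming memmove() sketch. The overlap
 * check uses only pointer equality, which is defined even for
 * pointers into unrelated objects. */
void *
portable_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Does dst point into [src, src+n)? Scan with == only. */
    size_t offset = n;
    for (size_t i = 0; i < n; i++) {
        if (s + i == d) {
            offset = i;
            break;
        }
    }

    if (offset > 0 && offset < n) {
        /* dst overlaps the tail of src: copy backward */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        /* no overlap, or dst == src: copy forward */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

<p>A smart enough compiler could, in principle, recognize the scan and reduce it to a pair of pointer comparisons.</p>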

<p>I keep mentioning Wine because I’ve been careful to ensure my
applications run correctly with it. So far it’s worked <em>perfectly</em>
with both Windows API and Native API functions. Thanks to the hard
work behind the Wine project, these tiny programs, though written
directly against the Windows API, remain relatively portable (x86
and ARM). It’s a good fit for graphical applications (games), but I
would <em>never</em> write a command line application like this. The command
line has always been a second class citizen on Windows.</p>

<p>Now that I’ve got these Native API issues sorted out, I’ve
significantly expanded the capabilities of my tiny, freestanding
programs without adding anything to their size. Functions like
<code class="language-plaintext highlighter-rouge">RtlUnicodeToUTF8N()</code> and <code class="language-plaintext highlighter-rouge">RtlUTF8ToUnicodeN()</code> will surely be handy.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Small, Freestanding Windows Executables</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/01/31/"/>
    <id>urn:uuid:8eddc701-52d3-3b0c-a8a8-dd13da6ead2c</id>
    <updated>2016-01-31T22:53:03Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>Update</strong>: This is old and <a href="/blog/2023/02/15/">was <strong>updated in 2023</strong></a>!</p>

<p>Recently I’ve been experimenting with freestanding C programs on
Windows. <em>Freestanding</em> refers to programs that don’t link, either
statically or dynamically, against a standard library (i.e. libc).
This is typical for operating systems and <a href="/blog/2014/12/09/">similar, bare metal
situations</a>. Normally a C compiler can make assumptions about the
semantics of functions provided by the C standard library. For
example, the compiler will likely replace a call to a small,
fixed-size <code class="language-plaintext highlighter-rouge">memmove()</code> with move instructions. Since a freestanding
program would supply its own, it may have different semantics.</p>

<p>My usual go to for C/C++ on Windows is <a href="http://mingw-w64.org/">Mingw-w64</a>, which has
greatly suited my needs the past couple of years. It’s <a href="https://packages.debian.org/search?keywords=mingw-w64">packaged on
Debian</a>, and, when combined with Wine, allows me to fully develop
Windows applications on Linux. Being GCC, it’s also great for
cross-platform development since it’s essentially the same compiler as
the other platforms. The primary difference is the interface to the
operating system (POSIX vs. Win32).</p>

<p>However, it has one glaring flaw inherited from MinGW: it links
against msvcrt.dll, an ancient version of the Microsoft C runtime
library that currently ships with Windows. Besides being dated and
quirky, <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">it’s not an official part of Windows</a> and never has
been, despite its inclusion with every release since Windows 95.
Mingw-w64 doesn’t have a C library of its own, instead patching over
some of the flaws of msvcrt.dll and linking against it.</p>

<p>Since so much depends on msvcrt.dll despite its unofficial nature,
it’s unlikely Microsoft will ever drop it from future releases of
Windows. However, if strict correctness is a concern, we must ask
Mingw-w64 not to link against it. An alternative would be
<a href="http://plibc.sourceforge.net/">PlibC</a>, though the LGPL licensing is unfortunate. Another is
Cygwin, which is a very complete POSIX environment, but is heavy and
GPL-encumbered.</p>

<p>Sometimes I’d prefer to be more direct: <a href="https://hero.handmade.network/forums/code-discussion/t/94-guide_-_how_to_avoid_c_c++_runtime_on_windows">skip the C standard library
altogether</a> and talk directly to the operating system. On Windows
that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only
links against system DLLs.</p>

<h3 id="linux-vs-windows">Linux vs. Windows</h3>

<p>The most important benefit of a standard library like libc is a
portable, uniform interface to the host system. So long as the
standard library suits its needs, the same program can run anywhere.
Without it, the program needs an implementation of each
host-specific interface.</p>

<p>On Linux, operating system requests at the lowest level are made
directly via system calls. This requires a bit of assembly language
for each supported architecture (<code class="language-plaintext highlighter-rouge">int 0x80</code> on x86, <code class="language-plaintext highlighter-rouge">syscall</code> on
x86-64, <code class="language-plaintext highlighter-rouge">swi</code> on ARM, etc.). The POSIX functions of the various Linux
libc implementations are built on top of this mechanism.</p>

<p>For example, here’s a function for a 1-argument system call on x86-64.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">syscall1</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">result</span><span class="p">;</span>
    <span class="n">__asm__</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">arg</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">exit()</code> is implemented on top. Note: A <em>real</em> libc would do
cleanup before exiting, like calling registered <code class="language-plaintext highlighter-rouge">atexit()</code> functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;syscall.h&gt;</span><span class="c1">  // defines SYS_exit</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">code</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">code</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
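<p>The same pattern extends to more arguments. Here’s a hypothetical 3-argument wrapper, enough to implement <code class="language-plaintext highlighter-rouge">write()</code> (x86-64 Linux only). Note the clobber list: the <code class="language-plaintext highlighter-rouge">syscall</code> instruction overwrites rcx and r11, so the constraints must say so.</p>

```c
#include <sys/syscall.h>  // defines SYS_write

/* Hypothetical 3-argument syscall wrapper, x86-64 Linux only.
 * The syscall instruction clobbers rcx and r11. */
long
syscall3(long n, long a, long b, long c)
{
    long result;
    __asm__ volatile (
        "syscall"
        : "=a"(result)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return result;
}

/* write(2) to standard output, built on the raw wrapper */
long
write_stdout(const void *buf, long len)
{
    return syscall3(SYS_write, 1, (long)buf, len);
}
```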

<p>The situation is simpler on Windows. Its low level system calls are
undocumented and unstable, changing across even minor updates. The
formal, stable interface is through the exported functions in
kernel32.dll. In fact, kernel32.dll is essentially a standard library
on its own (making the term “freestanding” in this case dubious). It
includes functions usually found only in user-space, like string
manipulation, formatted output, font handling, and heap management
(similar to <code class="language-plaintext highlighter-rouge">malloc()</code>). It’s not POSIX, but it has analogs to much of
the same functionality.</p>

<h3 id="program-entry">Program Entry</h3>

<p>The standard entry for a C program is <code class="language-plaintext highlighter-rouge">main()</code>. However, this is not
the application’s <em>true</em> entry. The entry is in the C library, which
does some initialization before calling your <code class="language-plaintext highlighter-rouge">main()</code>. When <code class="language-plaintext highlighter-rouge">main()</code>
returns, it performs cleanup and exits. Without a C library, programs
don’t start at <code class="language-plaintext highlighter-rouge">main()</code>.</p>

<p>On Linux the default entry is the symbol <code class="language-plaintext highlighter-rouge">_start</code>. Its prototype
would look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Returning from this function leads to a segmentation fault, so it’s up
to your application to perform the exit system call rather than
return.</p>

<p>On Windows, the entry depends on the type of application. The two
relevant subsystems today are the <em>console</em> and <em>windows</em> subsystems.
The former is for console applications (duh). These programs may still
create windows and such, but must always have a controlling console.
The latter is primarily for programs that don’t run in a console,
though they can still create an associated console if they like. In
Mingw-w64, give <code class="language-plaintext highlighter-rouge">-mconsole</code> (default) or <code class="language-plaintext highlighter-rouge">-mwindows</code> to the linker to
choose the subsystem.</p>

<p>The default <a href="https://msdn.microsoft.com/en-us/library/f9t8842e.aspx">entry for each is slightly different</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike Linux’s <code class="language-plaintext highlighter-rouge">_start</code>, Windows programs can safely return from these
functions, similar to <code class="language-plaintext highlighter-rouge">main()</code>, hence the <code class="language-plaintext highlighter-rouge">int</code> return. The <code class="language-plaintext highlighter-rouge">WINAPI</code>
macro means the function may have a special calling convention,
depending on the platform.</p>

<p>On any system, you can choose a different entry symbol or address
using the <code class="language-plaintext highlighter-rouge">--entry</code> option to the GNU linker.</p>

<h3 id="disabling-libgcc">Disabling libgcc</h3>

<p>One problem I’ve run into is Mingw-w64 generating code that calls
<code class="language-plaintext highlighter-rouge">__chkstk_ms()</code> from libgcc. I believe this is a long-standing bug,
since <code class="language-plaintext highlighter-rouge">-ffreestanding</code> should prevent these sorts of helper functions
from being used. The workaround I’ve found is to disable <a href="https://metricpanda.com/rival-fortress-update-45-dealing-with-__chkstk-__chkstk_ms-when-cross-compiling-for-windows/">the stack
probe</a> and pre-commit the whole stack.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000
</code></pre></div></div>

<p>Alternatively you could link against libgcc (statically) with <code class="language-plaintext highlighter-rouge">-lgcc</code>,
but, again, I’m going for a tiny executable.</p>

<h3 id="a-freestanding-example">A freestanding example</h3>

<p>Here’s an example of a Windows “Hello, World” that doesn’t use a C
library.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="n">WINAPI</span>
<span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">msg</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">msg</span><span class="p">),</span> <span class="p">(</span><span class="n">DWORD</span><span class="p">[]){</span><span class="mi">0</span><span class="p">},</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To build it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
    -o example.exe example.c \
    -lkernel32
</code></pre></div></div>

<p>Notice I manually linked against kernel32.dll. The stripped final
result is only 4kB, mostly PE padding. There are <a href="http://www.phreedom.org/research/tinype/">techniques to trim
this down even further</a>, but for a substantial program it
wouldn’t make a significant difference.</p>

<p>From here you could create a GUI by linking against <code class="language-plaintext highlighter-rouge">user32.dll</code> and
<code class="language-plaintext highlighter-rouge">gdi32.dll</code> (both also part of Win32) and calling the appropriate
functions. I already <a href="/blog/2015/06/06/">ported my OpenGL demo</a> to a freestanding
.exe, dropping GLFW and directly using Win32 and WGL. It’s much less
portable, but the final .exe is only 4kB, down from the original 104kB
(static linking against GLFW).</p>

<p>I may go this route for <a href="http://7drl.org/2016/01/13/7drl-2016-announced-for-5-13-march/">the upcoming 7DRL 2016</a> in March.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Quickly Access x86 Documentation in Emacs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/11/21/"/>
    <id>urn:uuid:982279c7-22a7-3b69-016b-749883870385</id>
    <updated>2015-11-21T05:42:17Z</updated>
    <category term="x86"/><category term="emacs"/>
    <content type="html">
      <![CDATA[<p>I recently released an Emacs package called <a href="https://github.com/skeeto/x86-lookup"><strong>x86-lookup</strong></a>.
Given a mnemonic, Emacs will open up a local copy of an <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">Intel’s
software developer manual</a> PDF at the page documenting the
instruction. It complements <a href="/blog/2015/04/19/">nasm-mode</a>, released earlier this
year.</p>

<ul>
  <li><a href="https://github.com/skeeto/x86-lookup">https://github.com/skeeto/x86-lookup</a></li>
</ul>

<p>x86-lookup is also available from <a href="https://melpa.org/">MELPA</a>.</p>

<p>To use it, you’ll need <a href="http://poppler.freedesktop.org/">Poppler’s</a> pdftotext command line
program — used to build an index of the PDF — and a copy of the
complete Volume 2 of Intel’s instruction set manual. There’s only one
command to worry about: <code class="language-plaintext highlighter-rouge">M-x x86-lookup</code>.</p>

<h3 id="minimize-documentation-friction">Minimize documentation friction</h3>

<p>This package should be familiar to anyone who’s used
<a href="https://github.com/skeeto/javadoc-lookup">javadoc-lookup</a>, one of my older packages. It has a common
underlying itch: the context switch to read API documentation while
coding should have as little friction as possible, otherwise I’m
discouraged from doing it. In an ideal world I wouldn’t ever need to
check documentation because it’s already in my head. By visiting
documentation frequently with ease, it’s going to become familiar that
much faster and I’ll be reaching for it less and less, approaching the
ideal.</p>

<p>I picked up x86 assembly about a year ago, and for the first few
months I struggled to find a good online reference for the instruction
set. There are little scraps here and there, but not much of
substance. The big exception is <a href="http://www.felixcloutier.com/x86/">Félix Cloutier’s reference</a>,
which is an amazingly well-done HTML conversion of Intel’s PDF
manuals. Unfortunately I could never get it working locally to
generate my own. There’s also the <a href="http://ref.x86asm.net/">X86 Opcode and Instruction
Reference</a>, but it’s more for machines than humans.</p>

<p>Besides, I often work without an Internet connection, so offline
documentation is absolutely essential. (You hear that Microsoft? Not
only do I avoid coding against Win32 because it’s badly designed, but
even more so because you don’t offer offline documentation anymore!
The friction of referencing your API documentation is enormous.)</p>

<p>I avoided the official x86 documentation for awhile, thinking it would
be too opaque, at least until I became more accustomed to the
instruction set. But really, it’s not bad! With a handle on the
basics, I would encourage anyone to dive into either Intel’s or <a href="http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/">AMD’s
manuals</a>. The reason there’s not much online in HTML form is
because these manuals are nearly everything you need.</p>

<p>I chose Intel’s manuals for x86-lookup because I’m more familiar with
it, it’s more popular, it’s (slightly) easier to parse, it’s offered
as a single PDF, and it’s more complete. The regular expression for
finding instructions is tuned for Intel’s manual and it won’t work
with AMD’s manuals.</p>

<p>For a couple months prior to writing x86-lookup, I had a couple of
scratch functions to very roughly accomplish the same thing. The
tipping point for formalizing it was that last month I wrote my own
x86 assembler. A single mnemonic often has a dozen or more different
opcodes depending on the instruction’s operands, and there are often
several ways to encode the same operation. I was frequently looking up
opcodes, and navigating the PDF quickly became a real chore. I only
needed about 80 different opcodes, so I was just adding them to the
assembler’s internal table manually as needed.</p>

<h3 id="how-does-it-work">How does it work?</h3>

<p>Say you want to look up the instruction RDRAND.</p>

<p><img src="/img/screenshot/rdrand-pdf.png" alt="" /></p>

<p>Initially Emacs has no idea what page this is on, so the first step is
to build an index mapping mnemonics to pages. x86-lookup runs the
pdftotext command line program on the PDF and loads the result into a
temporary buffer.</p>

<p>The killer feature of pdftotext is that it emits FORM FEED (U+000C)
characters between pages. Think of these as page breaks. By counting
form feed characters, x86-lookup can track the page for any part of
the document. In fact, Emacs is already set up to do this with its
<code class="language-plaintext highlighter-rouge">forward-page</code> and <code class="language-plaintext highlighter-rouge">backward-page</code> commands. So to build the index,
x86-lookup steps forward page-by-page looking for mnemonics, keeping
note of the page. Since this process typically takes about 10 seconds,
the index is cached in a file (see <code class="language-plaintext highlighter-rouge">x86-lookup-cache-directory</code>) for
future use. It only needs to happen once for a particular manual on a
particular computer.</p>
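<p>The page-counting idea is simple enough to sketch in a few lines of C (a hypothetical helper for illustration, not x86-lookup’s actual Elisp): the page of any match is one plus the number of form feeds preceding it.</p>

```c
#include <string.h>

/* Hypothetical sketch of the indexing idea: pdftotext separates
 * pages with '\f', so count the form feeds before a match to find
 * its 1-based page number. Returns -1 if not found. */
int
page_of(const char *text, const char *needle)
{
    const char *hit = strstr(text, needle);
    if (!hit)
        return -1;
    int page = 1;
    for (const char *p = text; p < hit; p++)
        if (*p == '\f')
            page++;
    return page;
}
```

<p>Emacs’s <code class="language-plaintext highlighter-rouge">forward-page</code> does exactly this walk, which is why building the index is so natural in a buffer.</p>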

<p>The mnemonic listing is slightly incomplete, so x86-lookup expands
certain mnemonics into the familiar set. For example, all the
conditional jumps are listed under “Jcc,” but this is probably not
what you’d expect to look up. I compared x86-lookup’s mnemonic listing
against NASM/nasm-mode’s mnemonics to ensure everything was accounted
for. Both packages benefited from this process.</p>

<p>Once the index is built, pdftotext is no longer needed. If you’re
desperate and don’t have this program available, you can borrow the
index file from another computer. But you’re on your own for figuring
that out!</p>

<p>So to look up RDRAND, x86-lookup checks the index for the page number
and invokes a PDF reader on that page. This is where not all PDF
readers are created equal. There’s no convention for opening a PDF to
a particular page and each PDF reader differs. Some don’t even support
it. To deal with this, x86-lookup has a function specialized for
different PDF readers. Similar to <code class="language-plaintext highlighter-rouge">browse-url-browser-function</code>,
x86-lookup has <code class="language-plaintext highlighter-rouge">x86-lookup-browse-pdf-function</code>.</p>

<p>By default it tries to open the PDF for viewing within Emacs (did you
know Emacs is a PDF viewer?), falling back to other options if the
feature is unavailable. I welcome pull requests for any PDF readers
not yet supported by x86-lookup. Perhaps this functionality deserves
its own package.</p>

<p>That’s it! It’s a simple feature that has already saved me a lot of
time. If you’re ever programming in x86 assembly, give x86-lookup a
spin.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Raw Linux Threads via System Calls</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/05/15/"/>
    <id>urn:uuid:9d5de15b-9308-3715-2bd7-565d6649ab2f</id>
    <updated>2015-05-15T17:33:40Z</updated>
    <category term="x86"/><category term="linux"/><category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><em>This article has <a href="/blog/2016/09/23/">a followup</a>.</em></p>

<p>Linux has an elegant and beautiful design when it comes to threads:
threads are nothing more than processes that share a virtual address
space and file descriptor table. Threads spawned by a process are
additional child processes of the main “thread’s” parent process.
They’re manipulated through the same process management system calls,
eliminating the need for a separate set of thread-related system
calls. It’s elegant in the same way file descriptors are elegant.</p>

<p>Normally on Unix-like systems, processes are created with fork(). The
new process gets its own address space and file descriptor table that
starts as a copy of the original. (Linux uses copy-on-write to do this
part efficiently.) However, this is too high level for creating
threads, so Linux has a separate <a href="http://man7.org/linux/man-pages/man2/clone.2.html">clone()</a> system call. It
works just like fork() except that it accepts a number of flags to
adjust its behavior, primarily to share parts of the parent’s
execution context with the child.</p>
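<p>Before dropping to assembly, the idea can be sketched in C using glibc’s <code class="language-plaintext highlighter-rouge">clone()</code> wrapper (this is illustrative glue, not the article’s demo): allocate a stack, spawn a child that shares our address space, and observe that its writes are visible to the parent.</p>

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared;  /* visible to the child thanks to CLONE_VM */

static int
child(void *arg)
{
    (void)arg;
    shared = 42;
    return 0;
}

/* Hypothetical demo: spawn a clone()d child on our own stack,
 * wait for it, and return the value it wrote into shared memory. */
int
spawn_and_join(void)
{
    size_t size = 64 * 1024;
    char *stack = malloc(size);
    /* The stack grows downward, so pass its highest address. */
    pid_t pid = clone(child, stack + size, CLONE_VM | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);
    free(stack);
    return shared;
}
```

<p>With SIGCHLD in the flags the child is reaped like an ordinary fork()ed process, which is exactly the elegance described above: threads are just processes with shared state.</p>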

<p>It’s <em>so</em> simple that <strong>it takes less than 15 instructions to spawn a
thread with its own stack</strong>, no libraries needed, and no need to call
Pthreads! In this article I’ll demonstrate how to do this on x86-64.
All of the code will be written in <a href="http://www.nasm.us/">NASM</a> syntax since, IMHO,
it’s by far the best (see: <a href="/blog/2015/04/19/">nasm-mode</a>).</p>

<p>I’ve put the complete demo here if you want to see it all at once:</p>

<ul>
  <li><a href="https://github.com/skeeto/pure-linux-threads-demo">Pure assembly, library-free Linux threading demo</a></li>
</ul>

<h3 id="an-x86-64-primer">An x86-64 Primer</h3>

<p>I want you to be able to follow along even if you aren’t familiar with
x86_64 assembly, so here’s a short primer of the relevant pieces. If
you already know x86-64 assembly, feel free to skip to the next
section.</p>

<p>x86-64 has 16 64-bit <em>general purpose registers</em>, primarily used to
manipulate integers, including memory addresses. There are <em>many</em> more
registers than this with more specific purposes, but we won’t need
them for threading.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rsp</code> : stack pointer</li>
  <li><code class="language-plaintext highlighter-rouge">rbp</code> : “base” pointer (still used in debugging and profiling)</li>
  <li><code class="language-plaintext highlighter-rouge">rax</code> <code class="language-plaintext highlighter-rouge">rbx</code> <code class="language-plaintext highlighter-rouge">rcx</code> <code class="language-plaintext highlighter-rouge">rdx</code> : general purpose (notice: a, b, c, d)</li>
  <li><code class="language-plaintext highlighter-rouge">rdi</code> <code class="language-plaintext highlighter-rouge">rsi</code> : “destination” and “source”, now meaningless names</li>
  <li><code class="language-plaintext highlighter-rouge">r8</code> <code class="language-plaintext highlighter-rouge">r9</code> <code class="language-plaintext highlighter-rouge">r10</code> <code class="language-plaintext highlighter-rouge">r11</code> <code class="language-plaintext highlighter-rouge">r12</code> <code class="language-plaintext highlighter-rouge">r13</code> <code class="language-plaintext highlighter-rouge">r14</code> <code class="language-plaintext highlighter-rouge">r15</code> : added for x86-64</li>
</ul>

<p><img src="/img/x86/register.png" alt="" /></p>

<p>The “r” prefix indicates that they’re 64-bit registers. It won’t be
relevant in this article, but the same name prefixed with “e”
indicates the lower 32-bits of these same registers, and no prefix
indicates the lowest 16 bits. This is because x86 was <a href="/blog/2014/12/09/">originally a
16-bit architecture</a>, extended to 32-bits, then to 64-bits.
Historically each of these registers had a specific, unique
purpose, but on x86-64 they’re almost completely interchangeable.</p>

<p>There’s also a “rip” instruction pointer register that conceptually
walks along the machine instructions as they’re being executed, but,
unlike the other registers, it can only be manipulated indirectly.
Remember that data and code <a href="http://en.wikipedia.org/wiki/Von_Neumann_architecture">live in the same address space</a>, so
rip is not much different than any other data pointer.</p>

<h4 id="the-stack">The Stack</h4>

<p>The rsp register points to the “top” of the call stack. The stack
keeps track of who called the current function, in addition to local
variables and other function state (a <em>stack frame</em>). I put “top” in
quotes because the stack actually grows <em>downward</em> on x86 towards
lower addresses, so the stack pointer points to the lowest address on
the stack. This piece of information is critical when talking about
threads, since we’ll be allocating our own stacks.</p>

<p>The stack is also sometimes used to pass arguments to another
function. This happens much less frequently on x86-64, especially with
the <a href="http://wiki.osdev.org/System_V_ABI">System V ABI</a> used by Linux, where the first 6 arguments are
passed via registers. The return value is passed back via rax. When
calling another function, integer/pointer arguments are
passed in these registers in this order:</p>

<ul>
  <li>rdi, rsi, rdx, rcx, r8, r9</li>
</ul>

<p>So, for example, to perform a function call like <code class="language-plaintext highlighter-rouge">foo(1, 2, 3)</code>, store
1, 2 and 3 in rdi, rsi, and rdx, then <code class="language-plaintext highlighter-rouge">call</code> the function. The <code class="language-plaintext highlighter-rouge">mov</code>
instruction stores the source (second) operand in its destination
(first) operand. The <code class="language-plaintext highlighter-rouge">call</code> instruction pushes the current value of
rip onto the stack, then sets rip (<em>jumps</em>) to the address of the
target function. When the callee is ready to return, it uses the <code class="language-plaintext highlighter-rouge">ret</code>
instruction to <em>pop</em> the original rip value off the stack and back
into rip, returning control to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="mi">2</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">3</span>
    <span class="nf">call</span> <span class="nv">foo</span>
</code></pre></div></div>

<p>Called functions <em>must</em> preserve the contents of these registers (the
same value must be stored when the function returns):</p>

<ul>
  <li>rbx, rsp, rbp, r12, r13, r14, r15</li>
</ul>

<h4 id="system-calls">System Calls</h4>

<p>When making a <em>system call</em>, the argument registers are <a href="http://man7.org/linux/man-pages/man2/syscall.2.html">slightly
different</a>. Notice rcx has been changed to r10.</p>

<ul>
  <li>rdi, rsi, rdx, r10, r8, r9</li>
</ul>

<p>Each system call has an integer identifying it. This number is
different on each platform, but, in Linux’s case, <a href="https://www.youtube.com/watch?v=1Mg5_gxNXTo#t=8m28">it will <em>never</em>
change</a>. Instead of <code class="language-plaintext highlighter-rouge">call</code>, rax is set to the number of the
desired system call and the <code class="language-plaintext highlighter-rouge">syscall</code> instruction makes the request to
the OS kernel. Prior to x86-64, this was done with an old-fashioned
interrupt. Because interrupts are slow, a special,
statically-positioned “vsyscall” page (now deprecated as a <a href="http://en.wikipedia.org/wiki/Return-oriented_programming">security
hazard</a>), later <a href="https://lwn.net/Articles/446528/">vDSO</a>, is provided to allow certain system
calls to be made as function calls. We’ll only need the <code class="language-plaintext highlighter-rouge">syscall</code>
instruction in this article.</p>

<p>So, for example, the write() system call has this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">);</span>
</code></pre></div></div>

<p>On x86-64, the write() system call is at the top of <a href="https://filippo.io/linux-syscall-table/">the system call
table</a> as call 1 (read() is 0). Standard output is file
descriptor 1 by default (standard input is 0). The following bit of
code will write 10 bytes of data from the memory address <code class="language-plaintext highlighter-rouge">buffer</code> (a
symbol defined elsewhere in the assembly program) to standard output.
The number of bytes written, or -1 for error, will be returned in rax.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">1</span>        <span class="c1">; fd</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">buffer</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">10</span>       <span class="c1">; 10 bytes</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">1</span>        <span class="c1">; SYS_write</span>
    <span class="nf">syscall</span>
</code></pre></div></div>
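<p>The same request can be made from C through the generic <code>syscall()</code> wrapper, bypassing libc’s <code>write()</code>. A minimal sketch (the <code>write_raw</code> name is mine, not from the article):</p>

```c
#include <sys/syscall.h>
#include <unistd.h>

/* raw write(2) via the generic syscall() wrapper: the system call
   number goes first, then the arguments in order */
static long write_raw(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

<p>As in the assembly version, the return value is the number of bytes written, or -1 on error.</p>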

<h4 id="effective-addresses">Effective Addresses</h4>

<p>There’s one last thing you need to know: registers often hold a memory
address (i.e. a pointer), and you need a way to read the data behind
that address. In NASM syntax, wrap the register in brackets (e.g.
<code class="language-plaintext highlighter-rouge">[rax]</code>), which, if you’re familiar with C, would be the same as
<em>dereferencing</em> the pointer.</p>

<p>These bracket expressions, called an <em>effective address</em>, may be
limited mathematical expressions to offset that <em>base</em> address
entirely within a single instruction. This expression can include
another register (<em>index</em>), a power-of-two <em>scalar</em> (bit shift), and
an immediate signed <em>offset</em>. For example, <code class="language-plaintext highlighter-rouge">[rax + rdx*8 + 12]</code>. If
rax is a pointer to a struct, and rdx is an array index to an element
in an array on that struct, only a single instruction is needed to read
that element. NASM is smart enough to allow the assembly programmer to
break this mold a little bit with more complex expressions, so long as
it can reduce it to the <code class="language-plaintext highlighter-rouge">[base + index*2^exp + offset]</code> form.</p>
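<p>In C terms, this is the address arithmetic a compiler folds into a single instruction. A sketch using a hypothetical struct (the names are illustrative, not from the article):</p>

```c
#include <stddef.h>

/* hypothetical layout: an 8-byte field followed by an array */
struct vec { long len; double vals[8]; };

/* s->vals[i] compiles down to one load whose effective address has
   the form [base + index*8 + offset] */
static double vec_get(const struct vec *s, size_t i)
{
    return s->vals[i];
}
```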

<p>The details of addressing aren’t important this for this article, so
don’t worry too much about it if that didn’t make sense.</p>

<h3 id="allocating-a-stack">Allocating a Stack</h3>

<p>Threads share everything except for registers, a stack, and
thread-local storage (TLS). The OS and underlying hardware will
automatically ensure that registers are per-thread. Since it’s not
essential, I won’t cover thread-local storage in this article. In
practice, the stack is often used for thread-local data anyway. That
leaves the stack: before we can spawn a new thread, we need to
allocate one, which is nothing more than a memory buffer.</p>

<p>The trivial way to do this would be to reserve some fixed .bss
(zero-initialized) storage for threads in the executable itself, but I
want to do it the Right Way and allocate the stack dynamically, just
as Pthreads, or any other threading library, would. Otherwise the
application would be limited to a compile-time fixed number of
threads.</p>

<p>You <a href="http://marek.vavrusa.com/c/memory/2015/02/20/memory/">can’t just read from and write to arbitrary addresses</a> in
virtual memory, you first <a href="/blog/2015/03/19/">have to ask the kernel to allocate
pages</a>. There are two system calls on Linux to do this:</p>

<ul>
  <li>
    <p>brk(): Extends (or shrinks) the heap of a running process, typically
located somewhere shortly after the .bss segment. Many allocators
will do this for small or initial allocations. This is a less
optimal choice for thread stacks because the stacks will be very
near other important data, near other stacks, and lack a guard page
(by default). It would be somewhat easier for an attacker to exploit
a buffer overflow. A guard page is a locked-down page just past the
absolute end of the stack that will trigger a segmentation fault on
a stack overflow, rather than allow a stack overflow to trash other
memory undetected. A guard page could still be created manually with
mprotect(). Also, there’s no room for these stacks to grow.</p>
  </li>
  <li>
    <p>mmap(): Use an anonymous mapping to allocate a contiguous set of
pages at some randomized memory location. As we’ll see, you can even
tell the kernel specifically that you’re going to use this memory as
a stack. Also, this is simpler than using brk() anyway.</p>
  </li>
</ul>

<p>On x86-64, mmap() is system call 9. I’ll define a function to allocate
a stack with this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>The mmap() system call takes 6 arguments, but when creating an
anonymous memory map the last two arguments are ignored. For our
purposes, it looks like this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">mmap</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">length</span><span class="p">,</span> <span class="kt">int</span> <span class="n">prot</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">);</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">flags</code>, we’ll choose a private, anonymous mapping that, being a
stack, grows downward. Even with that last flag, the system call will
still return the bottom address of the mapping, which will be
important to remember later. It’s just a simple matter of setting the
arguments in the registers and making the system call.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">%define SYS_mmap	9
%define STACK_SIZE	(4096 * 1024)	</span><span class="c1">; 4 MB
</span>
<span class="nl">stack_create:</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">0</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">STACK_SIZE</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nv">PROT_WRITE</span> <span class="o">|</span> <span class="nv">PROT_READ</span>
    <span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span> <span class="nv">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="nv">MAP_PRIVATE</span> <span class="o">|</span> <span class="nv">MAP_GROWSDOWN</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_mmap</span>
    <span class="nf">syscall</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Now we can allocate new stacks (or stack-sized buffers) as needed.</p>
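<p>For comparison, here is roughly what stack_create() looks like in C using the <code>mmap()</code> wrapper. This is a sketch of the same idea, not the article’s code:</p>

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS and MAP_GROWSDOWN */
#include <sys/mman.h>

#define STACK_SIZE (4096 * 1024) /* 4 MB, as above */

/* allocate a thread stack; returns its *low* address, or 0 on failure */
static void *stack_create(void)
{
    void *p = mmap(0, STACK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN, -1, 0);
    return p == MAP_FAILED ? 0 : p;
}
```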

<h3 id="spawning-a-thread">Spawning a Thread</h3>

<p>Spawning a thread is so simple that it doesn’t even require a branch
instruction! It’s a call to clone() with two arguments: clone flags
and a pointer to the new thread’s stack. It’s important to note that,
as in many cases, the glibc wrapper function has the arguments in a
different order than the system call. With the set of flags we’re
using, it takes two arguments.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">sys_clone</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">child_stack</span><span class="p">);</span>
</code></pre></div></div>
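<p>For a concrete feel of the wrapper’s different shape, here is a hedged C sketch that spawns and joins a thread with glibc’s <code>clone()</code>. It swaps the article’s flag set for <code>CLONE_VM | SIGCHLD</code> so the parent can reap the child with <code>waitpid()</code>; it’s a point of comparison, not the article’s method:</p>

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define CSTACK_SIZE (4096 * 1024)

/* demo thread body: write to shared memory to show CLONE_VM works */
static int worker(void *arg)
{
    *(int *)arg = 42;
    return 0;
}

/* spawn fn(arg) on a fresh stack, then wait for it to finish */
static int run_thread(int (*fn)(void *), void *arg)
{
    char *stack = mmap(0, CSTACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN, -1, 0);
    if (stack == MAP_FAILED)
        return -1;
    /* unlike the raw system call, glibc clone() takes the function
       first and wants the *high* end of the stack */
    int pid = clone(fn, stack + CSTACK_SIZE, CLONE_VM | SIGCHLD, arg);
    if (pid == -1)
        return -1;
    waitpid(pid, 0, 0);
    munmap(stack, CSTACK_SIZE);
    return 0;
}
```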

<p>Our thread spawning function will have this C prototype. It takes a
function as its argument and starts the thread running that function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">thread_create</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="p">)(</span><span class="kt">void</span><span class="p">));</span>
</code></pre></div></div>

<p>The function pointer argument is passed via rdi, per the ABI. Store
this for safekeeping on the stack (<code class="language-plaintext highlighter-rouge">push</code>) in preparation for calling
stack_create(). When it returns, the address of the low end of stack
will be in rax.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">thread_create:</span>
    <span class="nf">push</span> <span class="nb">rdi</span>
    <span class="nf">call</span> <span class="nv">stack_create</span>
    <span class="nf">lea</span> <span class="nb">rsi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nv">STACK_SIZE</span> <span class="o">-</span> <span class="mi">8</span><span class="p">]</span>
    <span class="nf">pop</span> <span class="kt">qword</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="nb">CL</span><span class="nv">ONE_VM</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_FS</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_FILES</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_SIGHAND</span> <span class="o">|</span> <span class="err">\</span>
             <span class="nf">CLONE_PARENT</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_THREAD</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_IO</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_clone</span>
    <span class="nf">syscall</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The second argument to clone() is a pointer to the <em>high address</em> of
the stack (specifically, just above the stack). So we need to add
<code class="language-plaintext highlighter-rouge">STACK_SIZE</code> to rax to get the high end. This is done with the <code class="language-plaintext highlighter-rouge">lea</code>
instruction: <strong>l</strong>oad <strong>e</strong>ffective <strong>a</strong>ddress. Despite the brackets,
it doesn’t actually read memory at that address, but instead stores
the address in the destination register (rsi). I’ve moved it back by 8
bytes because I’m going to place the thread function pointer at the
“top” of the new stack in the next instruction. You’ll see why in a
moment.</p>

<p><img src="/img/x86/clone.png" alt="" /></p>

<p>Remember that the function pointer was pushed onto the stack for
safekeeping. This is popped off the current stack and written to that
reserved space on the new stack.</p>

<p>As you can see, it takes a lot of flags to create a thread with
clone(). Most things aren’t shared with the callee by default, so lots
of options need to be enabled. See the clone(2) man page for full
details on these flags.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CLONE_THREAD</code>: Put the new process in the same thread group.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_VM</code>: Runs in the same virtual memory space.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_PARENT</code>: Share a parent with the callee.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_SIGHAND</code>: Share signal handlers.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_FS</code>, <code class="language-plaintext highlighter-rouge">CLONE_FILES</code>, <code class="language-plaintext highlighter-rouge">CLONE_IO</code>: Share filesystem information, open file descriptors, and I/O context.</li>
</ul>

<p>A new thread will be created and the syscall will return in each of
the two threads at the same instruction, <em>exactly</em> like fork(). All
registers will be identical between the threads, except for rax, which
will be 0 in the new thread, and rsp which has the same value as rsi
in the new thread (the pointer to the new stack).</p>

<p><strong>Now here’s the really cool part</strong>, and the reason branching isn’t
needed. There’s no reason to check rax to determine if we are the
original thread (in which case we return to the caller) or if we’re
the new thread (in which case we jump to the thread function).
Remember how we seeded the new stack with the thread function? When
the new thread returns (<code class="language-plaintext highlighter-rouge">ret</code>), it will jump to the thread function
with a completely empty stack. The original thread, using the original
stack, will return to the caller.</p>

<p>The value returned by thread_create() is the process ID of the new
thread, which is essentially the thread object (e.g. Pthread’s
<code class="language-plaintext highlighter-rouge">pthread_t</code>).</p>

<h3 id="cleaning-up">Cleaning Up</h3>

<p>The thread function has to be careful not to return (<code class="language-plaintext highlighter-rouge">ret</code>) since
there’s nowhere to return. It will fall off the stack and terminate
the program with a segmentation fault. Remember that threads are just
processes? It must use the exit() syscall to terminate. This won’t
terminate the other threads.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">%define SYS_exit	60
</span>
<span class="nl">exit:</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_exit</span>
    <span class="nf">syscall</span>
</code></pre></div></div>

<p>Before exiting, it should free its stack with the munmap() system
call, so that no resources are leaked by the terminated thread. The
equivalent of pthread_join() by the main parent would be to use the
wait4() system call on the thread process.</p>

<h3 id="more-exploration">More Exploration</h3>

<p>If you found this interesting, be sure to check out the full demo link
at the top of this article. Now with the ability to spawn threads,
it’s a great opportunity to explore and experiment with x86’s
synchronization primitives, such as the <code class="language-plaintext highlighter-rouge">lock</code> instruction prefix,
<code class="language-plaintext highlighter-rouge">xadd</code>, and <a href="/blog/2014/09/02/">compare-and-exchange</a> (<code class="language-plaintext highlighter-rouge">cmpxchg</code>). I’ll discuss
these in a future article.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>NASM x86 Assembly Major Mode for Emacs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/04/19/"/>
    <id>urn:uuid:6966e5d3-9e81-3fc0-5d47-eeb29677f7da</id>
    <updated>2015-04-19T02:38:23Z</updated>
    <category term="emacs"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Last weekend I created a new Emacs mode, <a href="https://github.com/skeeto/nasm-mode"><strong>nasm-mode</strong></a>,
for editing <a href="http://www.nasm.us/">Netwide Assembler</a> (NASM) x86 assembly programs.
Over the past week I tweaked it until it felt comfortable enough to
share on <a href="http://melpa.org/">MELPA</a>. It’s got what you’d expect from a standard
Emacs programming language mode: syntax highlighting, automatic
indentation, and imenu support. It’s not a full parser, but it knows
all of NASM’s instructions and directives.</p>

<p>Until recently I didn’t really have preferences about x86 assemblers
(<a href="https://www.gnu.org/software/binutils/">GAS</a>, NASM, <a href="http://yasm.tortall.net/">YASM</a>, <a href="http://flatassembler.net/">FASM</a>, MASM, etc.) or syntax
(Intel, AT&amp;T). I stuck to the GNU Assembler (GAS) since it’s already
there with all the other GNU development tools I know and love, and
it’s required for inline assembly in GCC. However, nasm-mode now marks
my commitment to NASM as my primary x86 assembler.</p>

<h3 id="why-nasm">Why NASM?</h3>

<p>I need an assembler that can assemble 16-bit code (8086, 8088, 80186,
80286), because <a href="/blog/2014/12/09/">real mode is fun</a>. Despite its <code class="language-plaintext highlighter-rouge">.code16gcc</code>
directive, GAS is not suitable for this purpose. It’s <em>just</em> enough to
get the CPU into protected mode — as needed when writing an operating
system with GCC — and that’s it. A different assembler is required
for serious 16-bit programming.</p>

<p><a href="http://x86asm.net/articles/what-i-dislike-about-gas/">GAS syntax has problems</a>. I’m not talking about the argument
order (source first or destination first), since there’s no right
answer to that one. The linked article covers a number of problems,
with these being the big ones for me:</p>

<ul>
  <li>
    <p>The use of <code class="language-plaintext highlighter-rouge">%</code> sigils on all registers is tedious. I’m sure it’s
handy when generating code, where it becomes a register namespace,
but it’s annoying to write.</p>
  </li>
  <li>
    <p>Integer constants are an easy source of bugs. Forget the <code class="language-plaintext highlighter-rouge">$</code> and
suddenly you’re doing absolute memory access, which is a poor
default. NASM simplifies this by using brackets <code class="language-plaintext highlighter-rouge">[]</code> for all such
“dereferences.”</p>
  </li>
  <li>
    <p>GAS cannot produce pure binaries — raw machine code without any
headers or container (ELF, COFF, PE). Pure binaries are useful for
developing <a href="http://www.vividmachines.com/shellcode/shellcode.html">shellcode</a>, bootloaders, 16-bit COM programs,
and <a href="/blog/2015/03/19/">just-in-time compilers</a>.</p>
  </li>
</ul>

<p>Being a portable assembler, GAS is the jack of all instruction sets,
master of none. If I’m going to write a lot of x86 assembly, I want a
tool specialized for the job.</p>

<h4 id="yasm">YASM</h4>

<p>I also looked at YASM, a rewrite of NASM. It supports 16-bit assembly
and mostly uses NASM syntax. In my research I found that NASM used to
lag behind in features due to slower development, which is what
spawned YASM. In recent years this seems to have flipped around, with
YASM lagging behind. If you’re using YASM, nasm-mode should work
pretty well for you, since it’s still very similar.</p>

<p>YASM optionally supports GAS syntax, but this reintroduces almost all
of GAS’s problems. Even YASM’s improvements (i.e. its <code class="language-plaintext highlighter-rouge">ORG</code> directive)
become broken when switching to GAS syntax.</p>

<h4 id="fasm">FASM</h4>

<p>FASM is the “flat assembler,” an assembler written in assembly
language. This means it’s only available on x86 platforms. While I
don’t really plan on developing x86 assembly on a Raspberry Pi, I’d
rather not limit my options! I already regard 16-bit DOS programming
as a form of embedded programming, and this may very well extend to
the rest of x86 someday.</p>

<p>Also, it hasn’t made its way into the various Linux distribution
package repositories, including Debian, so it’s already at a
disadvantage for me.</p>

<h4 id="masm">MASM</h4>

<p>This is Microsoft’s assembler that comes with Visual Studio. Windows
only and not open source, this is in no way a serious consideration.
But since NASM’s syntax was originally derived from MASM, it’s worth
mentioning. NASM takes the good parts of MASM and <a href="https://courses.engr.illinois.edu/ece390/archive/mp/f99/mp5/masm_nasm.html">fixes the
mistakes</a> (such as the <code class="language-plaintext highlighter-rouge">offset</code> operator). It’s different enough
that nasm-mode would not work well with MASM.</p>

<h4 id="nasm">NASM</h4>

<p>It’s not perfect, but it’s got an <a href="http://www.nasm.us/doc/">excellent manual</a>, it’s a
solid program that does exactly what it says it will do, has a
powerful macro system, great 16-bit support, highly portable, easy to
build, and its semantics and syntax has been carefully considered. It
also comes with a simple, pure binary disassembler (<code class="language-plaintext highlighter-rouge">ndisasm</code>). In
retrospect it seems like an obvious choice!</p>

<p>My one complaint would be that it’s <em>too</em> flexible about
labels. The colon on labels is optional, which can lead to subtle
bugs. NASM will warn about this under some conditions (orphan-labels).
Combined with the preprocessor, the difference between a macro and a
label is ambiguous, short of re-implementing the entire preprocessor
in Emacs Lisp.</p>

<h3 id="why-nasm-mode">Why nasm-mode?</h3>

<p>Emacs comes with an <code class="language-plaintext highlighter-rouge">asm-mode</code> for editing assembly code for various
architectures. Unfortunately it’s another jack-of-all-trades that’s
not very good. More so, it doesn’t follow Emacs’ normal editing
conventions, having unusual automatic indentation and self-insertion
behaviors. It’s what prompted me to make nasm-mode.</p>

<p>To be fair, I don’t think it’s possible to write a major mode that
covers many different instruction set architectures. Each architecture
has its own quirks and oddities that essentially give it a
unique language. This is especially true with x86, which, over its
37-year tenure under so many different vendors, comes in a number of
incompatible flavors. Each assembler/architecture pair needs its own
major mode. I hope I just wrote NASM’s.</p>

<p>One area where I’m still stuck is that I can’t find an x86 style
guide. It’s easy to find half a dozen style guides of varying
authority for any programming language that’s more than 10 years old
… except x86. There’s no obvious answer when it comes to automatic
indentation. How are comments formatted and indented? How are
instructions aligned? Should labels be on the same line as the
instruction? Should labels require a colon? (I’ve decided this is
“yes.”) What about long label names? How are function
prototypes/signatures documented? (The mode could take advantage of
such a standard, a la ElDoc.) It seems everyone uses their own style.
This is another conundrum for a generic asm-mode.</p>

<p>There are a couple of <a href="http://matthieuhauglustaine.blogspot.com/2011/08/nasm-mode-for-emacs.html">other nasm-modes</a> floating around with
different levels of completeness. Mine should supersede these, and
will be much easier to maintain into the future as NASM evolves.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A Basic Just-In-Time Compiler</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/03/19/"/>
    <id>urn:uuid:95e0437f-61f0-3932-55b7-f828e171d9ca</id>
    <updated>2015-03-19T04:57:55Z</updated>
    <category term="c"/><category term="tutorial"/><category term="netsec"/><category term="x86"/><category term="posix"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=17747759">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/akxq8q/a_basic_justintime_compiler/">on reddit</a>.</em></p>

<p><a href="http://redd.it/2z68di">Monday’s /r/dailyprogrammer challenge</a> was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term
(<code class="language-plaintext highlighter-rouge">u(0)</code>) and a sequence of operations, <code class="language-plaintext highlighter-rouge">f</code>, to apply to the previous
term (<code class="language-plaintext highlighter-rouge">u(n + 1) = f(u(n))</code>) to compute the next term. Since it’s an
easy challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.</p>

<!--more-->

<p>For example, the relation <code class="language-plaintext highlighter-rouge">u(n + 1) = (u(n) + 2) * 3 - 5</code> would be
input as <code class="language-plaintext highlighter-rouge">+2 *3 -5</code>. If <code class="language-plaintext highlighter-rouge">u(0) = 0</code> then,</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">u(1) = 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(2) = 4</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(3) = 13</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(4) = 40</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(5) = 121</code></li>
  <li>…</li>
</ul>

<p>Rather than write an interpreter to apply the sequence of operations,
for <a href="https://gist.github.com/skeeto/3a1aa3df31896c9956dc">my submission</a> (<a href="/download/jit.c">mirror</a>) I took the opportunity to
write a simple x86-64 Just-In-Time (JIT) compiler. So rather than
stepping through the operations one by one, my program converts the
operations into native machine code and lets the hardware do the work
directly. In this article I’ll go through how it works and how I did
it.</p>

<p><strong>Update</strong>: The <a href="http://redd.it/2zna5q">follow-up challenge</a> uses Reverse Polish
notation to allow for more complicated expressions. I wrote another
JIT compiler for <a href="https://gist.github.com/anonymous/f7e4a5086a2b0acc83aa">my submission</a> (<a href="/download/rpn-jit.c">mirror</a>).</p>

<h3 id="allocating-executable-memory">Allocating Executable Memory</h3>

<p>Modern operating systems have page-granularity protections for
different parts of <a href="http://marek.vavrusa.com/c/memory/2015/02/20/memory/">process memory</a>: read, write, and execute.
Code can only be executed from memory with the execute bit set on its
page, memory can only be changed when its write bit is set, and some
pages aren’t allowed to be read. In a running process, the pages
holding program code and loaded libraries will have their write bit
cleared and execute bit set. Most of the other pages will have their
execute bit cleared and their write bit set.</p>

<p>The reason for this is twofold. First, it significantly increases the
security of the system. If untrusted input was read into executable
memory, an attacker could input machine code (<em>shellcode</em>) into the
buffer, then exploit a flaw in the program to cause control flow to
jump to and execute that code. If the attacker is only able to write
code to non-executable memory, this attack becomes a lot harder. The
attacker has to rely on code already loaded into executable pages
(<a href="http://en.wikipedia.org/wiki/Return-oriented_programming"><em>return-oriented programming</em></a>).</p>

<p>Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, <code class="language-plaintext highlighter-rouge">NULL</code>
points to a special page with read, write, and execute disabled.</p>

<h4 id="an-instruction-buffer">An Instruction Buffer</h4>

<p>Memory returned by <code class="language-plaintext highlighter-rouge">malloc()</code> and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through <code class="language-plaintext highlighter-rouge">malloc()</code>, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an <code class="language-plaintext highlighter-rouge">asmbuf</code> struct.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PAGE_SIZE 4096
</span>
<span class="k">struct</span> <span class="n">asmbuf</span> <span class="p">{</span>
    <span class="kt">uint8_t</span> <span class="n">code</span><span class="p">[</span><span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint64_t</span><span class="p">)];</span>
    <span class="kt">uint64_t</span> <span class="n">count</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use <code class="language-plaintext highlighter-rouge">sysconf(_SC_PAGESIZE)</code> to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.</p>
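<p>For reference, a tiny sketch of that runtime query (POSIX only; the function name here is mine):</p>

```c
#include <unistd.h>

// Query the page size at run time instead of hard-coding 4096.
// On typical x86-64 Linux this reports 4096 (small pages), even
// when huge pages are in use elsewhere in the system.
long
get_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```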

<p>Instead of <code class="language-plaintext highlighter-rouge">malloc()</code>, the compiler allocates memory as an anonymous
memory map (<code class="language-plaintext highlighter-rouge">mmap()</code>). It’s anonymous because it’s not backed by a
file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Windows doesn’t have POSIX <code class="language-plaintext highlighter-rouge">mmap()</code>, so on that platform we use
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> instead. Here’s the equivalent in Win32.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">type</span> <span class="o">=</span> <span class="n">MEM_RESERVE</span> <span class="o">|</span> <span class="n">MEM_COMMIT</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">VirtualAlloc</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">PAGE_READWRITE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Anyone reading closely should notice that I haven’t actually requested
that the memory be executable, which is, like, the whole point of all
this! This was intentional. Some operating systems employ a security
feature called W^X: “write xor execute.” That is, memory is either
writable or executable, but never both at the same time. This makes
the shellcode attack I described before even harder. For <a href="http://www.tedunangst.com/flak/post/now-or-never-exec">well-behaved
JIT compilers</a> it means memory protections need to be adjusted
after code generation and before execution.</p>

<p>The POSIX <code class="language-plaintext highlighter-rouge">mprotect()</code> function is used to change memory protections.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or on Win32 (that last parameter is not allowed to be <code class="language-plaintext highlighter-rouge">NULL</code>),</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">old</span><span class="p">;</span>
    <span class="n">VirtualProtect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PAGE_EXECUTE_READ</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">old</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, instead of <code class="language-plaintext highlighter-rouge">free()</code> it gets unmapped.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And on Win32,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">VirtualFree</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MEM_RELEASE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I won’t list the definitions here, but there are two “methods” for
inserting instructions and immediate values into the buffer. This will
be raw machine code, so the caller will be acting a bit like an
assembler.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_ins</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">ins</span><span class="p">);</span>
<span class="n">asmbuf_immediate</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
</code></pre></div></div>
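<p>The definitions are short enough to sketch, though what follows is my guess at them, not the author’s code: instructions are appended most-significant byte first, matching the order <code>ndisasm</code> prints them, while immediates are copied verbatim because a little-endian host already stores them the way x86-64 expects. The <code>void</code> return types are an assumption.</p>

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

// Append a size-byte instruction, most-significant byte first, so
// callers can pass opcodes exactly as a disassembly lists them.
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

// Append an immediate verbatim: the little-endian host layout is
// already the encoding x86-64 wants.
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}
```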

<h3 id="calling-conventions">Calling Conventions</h3>

<p>We’re only going to be concerned with three of x86-64’s many
registers: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rax</code>, and <code class="language-plaintext highlighter-rouge">rdx</code>. These are 64-bit (<code class="language-plaintext highlighter-rouge">r</code>) extensions
of <a href="/blog/2014/12/09/">the original 16-bit 8086 registers</a>. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here’s what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">recurrence</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>
</code></pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions">The System V AMD64 ABI calling convention</a> says that the first
integer/pointer function argument is passed in the <code class="language-plaintext highlighter-rouge">rdi</code> register.
When our JIT compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in <code class="language-plaintext highlighter-rouge">rax</code> when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy <code class="language-plaintext highlighter-rouge">rdi</code> to <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rdi</span>
</code></pre></div></div>

<p>There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in <code class="language-plaintext highlighter-rouge">asmbuf</code>. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in <code class="language-plaintext highlighter-rouge">rcx</code> rather than <code class="language-plaintext highlighter-rouge">rdi</code>. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.</p>
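<p>Concretely, only that opening copy changes. A hedged sketch (the <code>emit()</code> helper below is a stand-in for <code>asmbuf_ins()</code>, and the <code>0x4889c8</code> encoding for <code>mov rax, rcx</code> is easy to double-check with <code>ndisasm</code>):</p>

```c
#include <stdint.h>

static uint8_t code[64];
static int count;

// Stand-in for asmbuf_ins(): append an instruction,
// most-significant byte first.
static void
emit(int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        code[count++] = (ins >> (i * 8)) & 0xff;
}

// The one ABI-dependent instruction: copy the first argument into rax.
static void
emit_prologue(void)
{
#ifdef _WIN32
    emit(3, 0x4889c8);  // mov   rax, rcx  (Microsoft x64 ABI)
#else
    emit(3, 0x4889f8);  // mov   rax, rdi  (System V AMD64 ABI)
#endif
}
```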

<p>The very last thing it will do, assuming the result is in <code class="language-plaintext highlighter-rouge">rax</code>, is
return to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">ret</span>
</code></pre></div></div>

<p>So we know the assembly, but what do we pass to <code class="language-plaintext highlighter-rouge">asmbuf_ins()</code>? This
is where we get our hands dirty.</p>

<h4 id="finding-the-code">Finding the Code</h4>

<p>If you want to do this the Right Way, you go download the x86-64
documentation, look up the instructions we’re using, and manually work
out the bytes we need and how the operands fit into them. You know, like
they used to do <a href="/blog/2016/11/17/">out of necessity</a> back in the 60’s.</p>

<p>Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file <code class="language-plaintext highlighter-rouge">peek.s</code> and hand it to <code class="language-plaintext highlighter-rouge">nasm</code>. It will produce a raw binary
with the machine code, which we’ll disassemble with <code class="language-plaintext highlighter-rouge">ndisasm</code> (the
NASM disassembler).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret
</code></pre></div></div>

<p>That’s straightforward. The first instruction is 3 bytes and the
return is 1 byte.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4889f8</span><span class="p">);</span>  <span class="c1">// mov   rax, rdi</span>
<span class="c1">// ... generate code ...</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mh">0xc3</span><span class="p">);</span>      <span class="c1">// ret</span>
</code></pre></div></div>

<p>For each operation, we’ll set it up so the operand will already be
loaded into <code class="language-plaintext highlighter-rouge">rdi</code> regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32 bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use <code class="language-plaintext highlighter-rouge">0x0123456789abcdef</code> as the
operand.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rdi</span><span class="p">,</span> <span class="mh">0x0123456789abcdef</span>
</code></pre></div></div>

<p>Which disassembled with <code class="language-plaintext highlighter-rouge">ndisasm</code> is,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301
</code></pre></div></div>

<p>Notice the operand listed little endian immediately after the
instruction. That’s also easy!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">operand</span><span class="p">;</span>
<span class="n">scanf</span><span class="p">(</span><span class="s">"%ld"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mh">0x48bf</span><span class="p">);</span>         <span class="c1">// mov   rdi, operand</span>
<span class="n">asmbuf_immediate</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
</code></pre></div></div>

<p>Apply the same discovery process individually for each operator you
want to support, accumulating the result in <code class="language-plaintext highlighter-rouge">rax</code> for each.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">switch</span> <span class="p">(</span><span class="n">operator</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="sc">'+'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4801f8</span><span class="p">);</span>   <span class="c1">// add   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'-'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4829f8</span><span class="p">);</span>   <span class="c1">// sub   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'*'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mh">0x480fafc7</span><span class="p">);</span> <span class="c1">// imul  rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'/'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4831d2</span><span class="p">);</span>   <span class="c1">// xor   rdx, rdx</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x48f7ff</span><span class="p">);</span>   <span class="c1">// idiv  rdi</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As an exercise, try adding support for the modulus operator (<code class="language-plaintext highlighter-rouge">%</code>), XOR
(<code class="language-plaintext highlighter-rouge">^</code>), and bit shifts (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the <a href="https://old.reddit.com/r/dailyprogrammer/comments/2z68di/_/cpgkcx7">closed form solution</a> to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.</p>
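<p>In the spirit of that exercise, here’s my sketch of what those cases might encode, derived with the same nasm/ndisasm discovery process (verify the bytes yourself before trusting them). For <code>%</code>, <code>idiv</code> already leaves the remainder in <code>rdx</code>, so one extra <code>mov</code> suffices; the shifts need their count in <code>cl</code> first:</p>

```c
#include <stdint.h>

static uint8_t code[64];
static int count;

// Stand-in for asmbuf_ins(): append an instruction,
// most-significant byte first.
static void
emit(int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        code[count++] = (ins >> (i * 8)) & 0xff;
}

static void
emit_operator(char operator)
{
    switch (operator) {
        case '%':
            emit(3, 0x4831d2);  // xor   rdx, rdx
            emit(3, 0x48f7ff);  // idiv  rdi
            emit(3, 0x4889d0);  // mov   rax, rdx  (remainder)
            break;
        case '^':
            emit(3, 0x4831f8);  // xor   rax, rdi
            break;
        case '<':
            emit(3, 0x4889f9);  // mov   rcx, rdi  (count in cl)
            emit(3, 0x48d3e0);  // shl   rax, cl
            break;
        case '>':
            emit(3, 0x4889f9);  // mov   rcx, rdi
            emit(3, 0x48d3f8);  // sar   rax, cl
            break;
    }
}
```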

<h3 id="calling-the-generated-code">Calling the Generated Code</h3>

<p>Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a <code class="language-plaintext highlighter-rouge">void *</code> just to avoid repeating myself, since that will implicitly
cast to the correct function pointer prototype.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_finalize</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">recurrence</span><span class="p">)(</span><span class="kt">long</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="o">-&gt;</span><span class="n">code</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="n">x</span><span class="p">[</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">recurrence</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]);</span>
</code></pre></div></div>

<p>That’s pretty cool if you ask me! Now this was an extremely simplified
situation. There’s no branching, no intermediate values, no function
calls, and I didn’t even touch the stack (push, pop). The recurrence
relation definition in this challenge is practically an assembly
language itself, so after the initial setup it’s a 1:1 translation.</p>

<p>I’d like to build a JIT compiler more advanced than this in the
future. I just need to find a suitable problem that’s more complicated
than this one, warrants having a JIT compiler, but is still simple
enough that I could, on some level, justify not using LLVM.</p>

]]>
    </content>
  </entry>

</feed>
