<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged linux at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/linux/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/linux/feed/"/>
  <updated>2026-04-09T13:25:45Z</updated>
  <id>urn:uuid:9d7de7b3-b357-464b-a963-da196bdd5954</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>Frankenwine: Multiple personas in a Wine process</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/01/19/"/>
    <id>urn:uuid:d2b53f8d-88a6-400b-a748-693a758741c5</id>
    <updated>2026-01-19T21:51:38Z</updated>
    <category term="c"/><category term="win32"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>I came across a recent article on <a href="https://gpfault.net/posts/drunk-exe.html">making Linux system calls from a Wine
process</a>. Windows programs running under Wine are still normal Linux
processes and may interact with the Linux kernel like any other process.
None of this was surprising, and the demonstration works just as I expect.
Still, it got the wheels spinning and I realized an <em>almost</em> practical
application: build <a href="/blog/2023/01/18/">my pkg-config implementation</a> such that on Windows
<code class="language-plaintext highlighter-rouge">pkg-config.exe</code> behaves as a native pkg-config, but when run under Wine
this same binary takes the persona of a Linux program and becomes a cross
toolchain pkg-config, bypassing Win32 and talking directly with the Linux
kernel. <a href="https://justine.lol/cosmopolitan/">Cosmopolitan Libc</a> cleverly does this out of the box, but
in this article we’ll mash together a couple of existing sources with a bit
of glue.</p>

<p>The results are in <a href="https://github.com/skeeto/u-config/commit/e0008d7e">the merge-demo branch</a> of u-config, and took
hardly any work:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)
</code></pre></div></div>

<p>A platform layer, <code class="language-plaintext highlighter-rouge">main_wine.c</code>, is a merge of two existing platform
layers, one of which required unavoidable tweaks. We’ll get to those
details in a moment. First we’ll need to detect if we’re running under
Wine, and <a href="https://web.archive.org/web/20250923061634/https://stackoverflow.com/questions/7372388/determine-whether-a-program-is-running-under-wine-at-runtime/42333249#42333249">the best solution I found</a> was to locate
<code class="language-plaintext highlighter-rouge">ntdll!wine_get_version</code>. If this function exists, we’re in Wine. That
works out to a pretty one-liner because <code class="language-plaintext highlighter-rouge">ntdll.dll</code> is already loaded:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bool</span> <span class="nf">running_on_wine</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">GetProcAddress</span><span class="p">(</span><span class="n">GetModuleHandleA</span><span class="p">(</span><span class="s">"ntdll"</span><span class="p">),</span> <span class="s">"wine_get_version"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>An x86-64 Linux syscall wrapper with <a href="/blog/2024/12/20/">thorough inline assembly</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ptrdiff_t</span> <span class="nf">syscall3</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">b</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">r</span><span class="p">;</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">ptrdiff_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_write</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="p">(</span><span class="kt">ptrdiff_t</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’d normally use <code class="language-plaintext highlighter-rouge">long</code> for all these integers because Linux is <a href="https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models">LP64</a>
(<code class="language-plaintext highlighter-rouge">long</code> is pointer-sized), but Windows is LLP64 (only <code class="language-plaintext highlighter-rouge">long long</code> is 64
bits). It’s so bizarre to interface with Linux from LLP64, and this will
have consequences later. With these pieces we can see the basic shape of a
split personality program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">running_on_wine</span><span class="p">())</span> <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"hello, wine</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">12</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
        <span class="n">WriteFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"hello, windows</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>We can cram two programs into this binary and select which program at run
time depending on what we see. In typical programs locating and calling
into glibc would be a challenge, particularly with the incompatible ABIs
involved. We’re avoiding it here by interfacing directly with the kernel.</p>
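
<p>One detail worth knowing when talking to the kernel directly: raw system
calls report failure as a negated <code class="language-plaintext highlighter-rouge">errno</code> in the return value, in the
range -4095 to -1, and there is no <code class="language-plaintext highlighter-rouge">errno</code> variable. Here is a minimal
sketch, assuming x86-64 Linux and GCC or Clang. The names <code class="language-plaintext highlighter-rouge">raw_syscall3</code>,
<code class="language-plaintext highlighter-rouge">raw_write</code>, <code class="language-plaintext highlighter-rouge">raw_failed</code>, and <code class="language-plaintext highlighter-rouge">X64_SYS_write</code> are made up for
illustration and are not part of u-config:</p>

```c
#include <assert.h>
#include <stddef.h>

// Assumption: x86-64 Linux, where write is system call number 1.
enum { X64_SYS_write = 1 };

// Same shape as the wrapper above, but with pointer-sized types throughout.
static ptrdiff_t raw_syscall3(ptrdiff_t n, ptrdiff_t a, ptrdiff_t b, ptrdiff_t c)
{
    ptrdiff_t r;
    __asm__ volatile (
        "syscall"
        : "=a"(r)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return r;
}

// Returns the byte count on success, or a negated errno value on failure.
static ptrdiff_t raw_write(int fd, const void *buf, ptrdiff_t len)
{
    assert(len >= 0);
    return raw_syscall3(X64_SYS_write, fd, (ptrdiff_t)buf, len);
}

// Hypothetical helper: the kernel reserves [-4095, -1] for error codes.
static int raw_failed(ptrdiff_t r)
{
    return r < 0 && r >= -4095;
}
```

<p>With this, <code class="language-plaintext highlighter-rouge">raw_write(-1, "x", 1)</code> returns -9, which is <code class="language-plaintext highlighter-rouge">-EBADF</code> on
Linux, and <code class="language-plaintext highlighter-rouge">raw_failed</code> recognizes it as an error.</p>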

<h3 id="application-to-u-config">Application to u-config</h3>

<p>Luckily u-config has completely-optional platform layers implemented with
Linux system calls. The POSIX platform layer works fine, and that’s what
distributions should generally use, but these bonus platforms are unhosted
and do not require libc. That means we can shove it into a Windows build
with relatively little trouble.</p>

<p>Before we do that, let’s think about what we’re doing. <a href="/blog/2021/08/21/">Debian has great
cross toolchain support</a>, including Mingw-w64. There are even a few
Windows libraries in the Debian package repository, <a href="https://packages.debian.org/trixie/x32/libz-mingw-w64">such as zlib</a>, and
we can build Windows programs against them. If you’re cross-building and
using pkg-config, you ought to use the cross toolchain pkg-config, which
in GNU ecosystems gets an architecture prefix like the other cross tools.
Debian cross toolchains each include a cross pkg-config, and it sometimes
<em>almost</em> works correctly! Here’s what I get on Debian 13:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz
</code></pre></div></div>

<p>Note the architecture in the <code class="language-plaintext highlighter-rouge">-I</code> and <code class="language-plaintext highlighter-rouge">-L</code> options. It really is querying
the <a href="https://peter0x44.github.io/posts/cross-compilers/">cross sysroot</a>. However, because these paths are inside the cross
sysroot, the cross compiler already searches them, and so pkg-config
should not list them. It’s suboptimal and indicates this pkg-config is
probably misconfigured. In other cases it’s far from correct:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...
</code></pre></div></div>

<p>A tool prefixed <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-</code> should not produce paths containing
<code class="language-plaintext highlighter-rouge">x86_64-linux-gnu</code> (the host architecture in this case). Our version won’t
have these issues.</p>
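
<p>The filtering that should have dropped those <code class="language-plaintext highlighter-rouge">-I</code> and <code class="language-plaintext highlighter-rouge">-L</code> options
amounts to comparing each flag against a list of system directories. The
following is only a sketch of that idea, assuming a colon-separated list as
on Linux. The helper <code class="language-plaintext highlighter-rouge">is_system_include</code> is hypothetical and is not how
u-config or any other pkg-config actually implements it:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper, not u-config's actual code: report whether an
// "-I<dir>" flag names a directory on a colon-separated system path list.
static int is_system_include(const char *flag, const char *syspath)
{
    assert(flag && syspath);
    if (strncmp(flag, "-I", 2) != 0) {
        return 0;  // not an include flag, so never filtered here
    }
    const char *dir = flag + 2;
    size_t dirlen = strlen(dir);
    for (const char *p = syspath; *p;) {
        size_t n = strcspn(p, ":");  // length of the current list element
        if (n == dirlen && memcmp(p, dir, n) == 0) {
            return 1;  // exact match: the compiler already searches it
        }
        p += n;
        if (*p == ':') {
            p++;  // skip the separator
        }
    }
    return 0;
}
```

<p>A flag for which this returns non-zero would be omitted from the output
unless the user asks to keep system flags.</p>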

<p>The u-config platform interface is five functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// read whole files</span>
<span class="n">s8node</span> <span class="o">*</span><span class="nf">os_listing</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// list directories</span>
<span class="kt">void</span>    <span class="nf">os_write</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>          <span class="c1">// standard out/err</span>
<span class="kt">void</span>    <span class="nf">os_fail</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">);</span>                       <span class="c1">// non-zero exit</span>

<span class="kt">void</span> <span class="nf">uconfig</span><span class="p">(</span><span class="n">config</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Platforms implement the first four functions, and call <code class="language-plaintext highlighter-rouge">uconfig()</code> with
the platform’s configuration, context pointer (<code class="language-plaintext highlighter-rouge">os *</code>), command line
arguments, environment, and some memory (all in the <code class="language-plaintext highlighter-rouge">config</code> object). My
strategy is to link two platforms into the binary, and the first challenge
is they both define <code class="language-plaintext highlighter-rouge">os_write</code>, etc. I did not plan nor intend for one
binary to contain more than one platform layer. Unity builds offer a fix
without changing a single line of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include</span> <span class="cpf">"main_windows.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span>
<span class="cp">#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include</span> <span class="cpf">"main_linux_amd64.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span></code></pre></div></div>

<p>This dirty but effective trick <a href="/blog/2025/02/05/">may look familiar</a>. It also doesn’t
interfere with the other builds. Next I define the real platform functions
as a dispatch based on our run-time situation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b32</span> <span class="n">wine_detected</span><span class="p">;</span>

<span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">linux_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">win32_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I were serious about keeping this experiment, I’d lift <code class="language-plaintext highlighter-rouge">os</code> as I did
the functions (as <code class="language-plaintext highlighter-rouge">win32_os</code>, <code class="language-plaintext highlighter-rouge">linux_os</code>) and include <code class="language-plaintext highlighter-rouge">wine_detected</code> in
the context, eliminating this global variable. That cannot be done with
simple hacks and macros.</p>
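
<p>To see the rename-then-dispatch trick in isolation, here is a
single-file sketch. Two inline function definitions stand in for the two
<code class="language-plaintext highlighter-rouge">#include</code>d platform sources, and the function <code class="language-plaintext highlighter-rouge">os_name</code> is invented
for this illustration. It is not part of the u-config platform interface:</p>

```c
// Stand-in for "main_windows.c": it only knows the generic name os_name.
#define os_name win32_name
static const char *os_name(void) { return "win32"; }
#undef os_name

// Stand-in for "main_linux_amd64.c": same generic name, different body.
#define os_name linux_name
static const char *os_name(void) { return "linux"; }
#undef os_name

// Set once at startup in the real program.
static int wine_detected;

// With the macros gone, the generic name is free again. The real os_name
// dispatches to whichever renamed copy matches the run-time situation.
static const char *os_name(void)
{
    return wine_detected ? linux_name() : win32_name();
}
```

<p>Neither stand-in definition had to change. The preprocessor renames each
one as it passes through, which is why the trick leaves other builds
untouched.</p>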

<p>The next challenge is that I wrote the Linux platform layer assuming LP64,
and so it uses <code class="language-plaintext highlighter-rouge">long</code> instead of an equivalent platform-agnostic type like
<code class="language-plaintext highlighter-rouge">ptrdiff_t</code>. I never thought this would be an issue because this source
literally contains <code class="language-plaintext highlighter-rouge">asm</code> blocks and no conditional compilation, yet here
we are. Lesson learned. I wanted to try an extremely janky <code class="language-plaintext highlighter-rouge">#define</code> on
<code class="language-plaintext highlighter-rouge">long</code> to fix it, but this source file has a couple <code class="language-plaintext highlighter-rouge">long long</code> that won’t
play along. These multi-token type names of C are antithetical to its
preprocessor! So I adjusted the source manually instead.</p>
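
<p>For illustration, a macro along the lines of <code class="language-plaintext highlighter-rouge">#define long ptrdiff_t</code>
(my guess at the form such a hack would take) turns <code class="language-plaintext highlighter-rouge">long long</code> into
<code class="language-plaintext highlighter-rouge">ptrdiff_t ptrdiff_t</code>, which does not compile. Single-token type names
avoid the problem. The typedefs below are assumed definitions for this
sketch, not necessarily those in u-config:</p>

```c
#include <stddef.h>
#include <stdint.h>

// Assumed definitions for illustration: one token per type name.
typedef ptrdiff_t iz;   // signed and pointer-sized on LP64 and LLP64 alike
typedef int64_t   i64;  // stands in for "long long" where 64 bits are required

_Static_assert(sizeof(iz) == sizeof(void *), "iz must be pointer-sized");
_Static_assert(sizeof(i64) == 8, "i64 must be 64 bits");

// A pointer survives a round trip through iz under both data models.
// The same code written with long would truncate the pointer under LLP64.
static void *roundtrip(void *p)
{
    iz word = (iz)p;
    return (void *)word;
}
```

<p>The static assertions fail the build on any target where the assumption
does not hold, rather than letting it fail at run time.</p>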

<p>The Windows and Linux platform entry points are completely different, both
in name and form, and so co-exist naturally. The merged platform layer is
a new entry point that will pass control to the appropriate entry point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">entrypoint</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>  <span class="c1">// Linux</span>
<span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">mainCRTStartup</span><span class="p">();</span>    <span class="c1">// Windows</span>
</code></pre></div></div>

<p>On Linux <code class="language-plaintext highlighter-rouge">stack</code> is <a href="/blog/2025/03/06/">the initial value of the stack pointer</a>, which
<a href="https://articles.manugarg.com/aboutelfauxiliaryvectors">points to <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, <code class="language-plaintext highlighter-rouge">envp</code>, and <code class="language-plaintext highlighter-rouge">auxv</code></a>. We’ll need to construct
an artificial “stack” for the Linux platform layer to harvest. On Windows
this is <a href="/blog/2023/02/15/">the process entry point</a>, and it will find the rest on its
own as a normal Windows process. Ultimately this ended up simpler than I
expected:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">merge_entrypoint</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">wine_detected</span> <span class="o">=</span> <span class="n">running_on_wine</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">u8</span> <span class="o">*</span><span class="n">fakestack</span><span class="p">[</span><span class="n">CMDLINE_ARGV_MAX</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">c16</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="n">GetCommandLineW</span><span class="p">();</span>
        <span class="n">fakestack</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="p">)(</span><span class="n">iz</span><span class="p">)</span><span class="n">cmdline_to_argv8</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">fakestack</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="c1">// TODO: append envp to the fake stack</span>
        <span class="n">entrypoint</span><span class="p">((</span><span class="n">iz</span> <span class="o">*</span><span class="p">)</span><span class="n">fakestack</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">mainCRTStartup</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Where <a href="/blog/2022/02/18/"><code class="language-plaintext highlighter-rouge">cmdline_to_argv8</code> is my Windows argument parser</a>, already
used by u-config, and I reserve one element at the front to store <code class="language-plaintext highlighter-rouge">argc</code>.
Since this is just a proof-of-concept I didn’t bother fabricating and
pushing <code class="language-plaintext highlighter-rouge">envp</code> onto the fake stack. The Linux entry point doesn’t need
<code class="language-plaintext highlighter-rouge">auxv</code>, so it can be omitted. Once in the Linux entry point it’s essentially
a Linux process from then on, except that the x64 calling convention is
still in use internally.</p>
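
<p>The layout being imitated is simple: <code class="language-plaintext highlighter-rouge">argc</code> in the first slot, then the
<code class="language-plaintext highlighter-rouge">argv</code> pointers and a null terminator, then the <code class="language-plaintext highlighter-rouge">envp</code> pointers and
another null terminator. Here is a portable sketch of building and reading
such an array. The names <code class="language-plaintext highlighter-rouge">build_fakestack</code>, <code class="language-plaintext highlighter-rouge">stack_argc</code>, <code class="language-plaintext highlighter-rouge">stack_argv</code>,
<code class="language-plaintext highlighter-rouge">stack_envp</code>, and <code class="language-plaintext highlighter-rouge">ARGV_MAX</code> are hypothetical. Unlike the demo above, it
writes an empty <code class="language-plaintext highlighter-rouge">envp</code>:</p>

```c
#include <assert.h>
#include <stddef.h>

enum { ARGV_MAX = 8 };  // hypothetical cap, standing in for CMDLINE_ARGV_MAX

// Fill stack[] the way the kernel lays out a new process:
//   stack[0] = argc, stack[1..argc] = argv, NULL, envp..., NULL
// The caller provides at least argc+3 slots. Returns the slots used.
static int build_fakestack(char **stack, int argc, char **argv)
{
    assert(argc >= 0 && argc <= ARGV_MAX);
    int n = 0;
    stack[n++] = (char *)(ptrdiff_t)argc;  // an integer stored in a pointer slot
    for (int i = 0; i < argc; i++) {
        stack[n++] = argv[i];
    }
    stack[n++] = NULL;  // argv terminator
    stack[n++] = NULL;  // envp terminator (empty environment)
    return n;
}

// What a Linux entry point does with the pointer it receives.
static int stack_argc(char **stack)
{
    return (int)(ptrdiff_t)stack[0];
}

static char **stack_argv(char **stack)
{
    return stack + 1;
}

static char **stack_envp(char **stack)
{
    return stack + 1 + stack_argc(stack) + 1;  // skip argc, argv, terminator
}
```

<p>Because <code class="language-plaintext highlighter-rouge">envp</code> is found by skipping past <code class="language-plaintext highlighter-rouge">argv</code>, a consumer that walks the
environment needs both terminators to be present.</p>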

<p>Finally, I configure the Linux platform layer for Debian’s cross sysroot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "/usr/x86_64-w64-mingw32/lib"
</span></code></pre></div></div>

<p>And that’s it! We have our platform merge. Build (<a href="https://github.com/skeeto/w64devkit">w64devkit</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c
</code></pre></div></div>

<p>On Debian use <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-gcc</code> for <code class="language-plaintext highlighter-rouge">cc</code>. The <code class="language-plaintext highlighter-rouge">-e</code> linker option
selects the new, higher level entry point. After installing <a href="https://packages.debian.org/trixie/wine-binfmt">Wine
binfmt</a>, here’s how it looks on Debian:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs zlib
-lz
</code></pre></div></div>

<p>That’s the correct output, but is it using the cross sysroot? Ask it to
include the <code class="language-plaintext highlighter-rouge">-I</code> argument despite it being in the cross sysroot:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz
</code></pre></div></div>

<p>Looking good! It passes the <code class="language-plaintext highlighter-rouge">pc_path</code> test, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig
</code></pre></div></div>

<p>Running <em>this same binary</em> on Windows after installing zlib in w64devkit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz
</code></pre></div></div>

<p>Also:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig
</code></pre></div></div>

<p>My Frankenwine is a success!</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Practical libc-free threading on Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/03/23/"/>
    <id>urn:uuid:631a8107-2eef-420b-9594-752e6f013048</id>
    <updated>2023-03-23T05:32:41Z</updated>
    <category term="c"/><category term="optimization"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re <a href="/blog/2023/02/15/">not using a C runtime</a> on Linux, and instead you’re
programming against its system call API. It’s long-term and stable after
all. <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">Memory management</a> and <a href="/blog/2023/02/13/">buffered I/O</a> are easily
solved, but a lot of software benefits from concurrency. It would be nice
to also have thread spawning capability. This article will demonstrate a
simple, practical, and robust approach to spawning and managing threads
using only raw system calls. It only takes about a dozen lines of C,
including a few inline assembly instructions.</p>

<p>The catch is that there’s no way to avoid using a bit of assembly. Neither
the <code class="language-plaintext highlighter-rouge">clone</code> nor the <code class="language-plaintext highlighter-rouge">clone3</code> system call has threading semantics compatible
with C, so you’ll need to paper over it with a bit of inline assembly per
architecture. This article will focus on x86-64, but the basic concept
should work on all architectures supported by Linux. The <a href="https://man7.org/linux/man-pages/man2/clone.2.html">glibc <code class="language-plaintext highlighter-rouge">clone(2)</code>
wrapper</a> fits a C-compatible interface on top of the raw system call,
but we won’t be using it here.</p>

<p>Before diving in, the complete, working demo: <a href="https://github.com/skeeto/scratch/blob/master/misc/stack_head.c"><strong><code class="language-plaintext highlighter-rouge">stack_head.c</code></strong></a></p>

<h3 id="the-clone-system-call">The clone system call</h3>

<p>On Linux, threads are spawned using the <code class="language-plaintext highlighter-rouge">clone</code> system call with semantics
like the classic Unix <code class="language-plaintext highlighter-rouge">fork(2)</code>. One process goes in, two processes come
out in nearly the same state. For threads, those processes share almost
everything and differ only by two registers: the return value — zero in
the new thread — and stack pointer. Unlike typical thread spawning APIs,
the application does not supply an entry point. It only provides a stack
for the new thread. The simple form of the raw clone API looks something
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">clone</span><span class="p">(</span><span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>
</code></pre></div></div>

<p>Sounds kind of elegant, but it has an annoying problem: The new thread
begins life in the <em>middle</em> of a function without any established stack
frame. Its stack is a blank slate. It’s not ready to do anything except
jump to a function prologue that will set up a stack frame. So besides the
assembly for the system call itself, it also needs more assembly to get
the thread into a C-compatible state. In other words, <strong>a generic system
call wrapper cannot reliably spawn threads</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">brokenclone</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">threadentry</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">long</span> <span class="n">r</span> <span class="o">=</span> <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_clone</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">stack</span><span class="p">);</span>
    <span class="c1">// DANGER: new thread may access nonexistent stack frame here</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">threadentry</span><span class="p">(</span><span class="n">arg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For odd historical reasons, each architecture’s <code class="language-plaintext highlighter-rouge">clone</code> has a slightly
different interface. The newer <code class="language-plaintext highlighter-rouge">clone3</code> unifies these differences, but it
suffers from the same thread spawning issue above, so it’s not helpful
here.</p>

<h3 id="the-stack-header">The stack “header”</h3>

<p>I <a href="/blog/2015/05/15/">figured out a neat trick eight years ago</a> which I continue to use
today. The parent and child threads are in nearly identical states when
the new thread starts, but the immediate goal is to diverge. As noted, one
difference is their stack pointers. To diverge their execution, we could
make their execution depend on the stack. An obvious choice is to push
different return pointers on their stacks, then let the <code class="language-plaintext highlighter-rouge">ret</code> instruction
do the work.</p>

<p>Carefully preparing the new stack ahead of time is the key to everything,
and there’s a straightforward technique that I like to call the <code class="language-plaintext highlighter-rouge">stack_head</code>,
a structure placed at the high end of the new stack. Its first element
must be the entry point pointer, and this entry point will receive a
pointer to its own <code class="language-plaintext highlighter-rouge">stack_head</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>The structure must have 16-byte alignment on all architectures. I used an
attribute to help keep this straight, and it can help when using <code class="language-plaintext highlighter-rouge">sizeof</code>
to place the structure, as I’ll demonstrate later.</p>

<p>Now for the cool part: The <code class="language-plaintext highlighter-rouge">...</code> can be anything you want! Use that area
to seed the new stack with whatever thread-local data is necessary. It’s a
neat feature you don’t get from standard thread spawning interfaces. If I
plan to “join” a thread later — wait until it’s done with its work — I’ll
put a join futex in this space:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">join_futex</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>More details on that futex shortly.</p>

<h3 id="the-clone-wrapper">The clone wrapper</h3>

<p>I call the <code class="language-plaintext highlighter-rouge">clone</code> wrapper <code class="language-plaintext highlighter-rouge">newthread</code>. It has the inline assembly for the
system call, and since it includes a <code class="language-plaintext highlighter-rouge">ret</code> to diverge the threads, it’s a
“naked” function <a href="/blog/2023/02/12/">just like with <code class="language-plaintext highlighter-rouge">setjmp</code></a>. The compiler will
generate no prologue or epilogue, and the function body is limited to
inline assembly without input/output operands. It cannot even reliably
reference its parameters by name. Like <code class="language-plaintext highlighter-rouge">clone</code>, it doesn’t accept a thread
entry point. Instead it accepts a <code class="language-plaintext highlighter-rouge">stack_head</code> seeded with the entry
point. The whole wrapper is just six instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="kr">naked</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">long</span> <span class="nf">newthread</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"mov  %%rdi, %%rsi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// arg2 = stack</span>
        <span class="s">"mov  $0x50f00, %%edi</span><span class="se">\n</span><span class="s">"</span>  <span class="c1">// arg1 = clone flags</span>
        <span class="s">"mov  $56, %%eax</span><span class="se">\n</span><span class="s">"</span>       <span class="c1">// SYS_clone</span>
        <span class="s">"syscall</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov  %%rsp, %%rdi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// entry point argument</span>
        <span class="s">"ret</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="o">:</span> <span class="o">:</span> <span class="s">"rax"</span><span class="p">,</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"rsi"</span><span class="p">,</span> <span class="s">"rdi"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On x86-64, both function calls and system calls use <code class="language-plaintext highlighter-rouge">rdi</code> and <code class="language-plaintext highlighter-rouge">rsi</code> for
their first two parameters. Per the reference <code class="language-plaintext highlighter-rouge">clone(2)</code> prototype above:
the first system call argument is <code class="language-plaintext highlighter-rouge">flags</code> and the second argument is the
new <code class="language-plaintext highlighter-rouge">stack</code>, which will point directly at the <code class="language-plaintext highlighter-rouge">stack_head</code>. However, the
stack pointer arrives in <code class="language-plaintext highlighter-rouge">rdi</code>. So I copy <code class="language-plaintext highlighter-rouge">stack</code> into the second argument
register, <code class="language-plaintext highlighter-rouge">rsi</code>, then load the flags (<code class="language-plaintext highlighter-rouge">0x50f00</code>) into the first argument
register, <code class="language-plaintext highlighter-rouge">rdi</code>. The system call number goes in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<p>Where does that <code class="language-plaintext highlighter-rouge">0x50f00</code> come from? That’s the bare minimum thread spawn
flag set in hexadecimal. If any flag is missing then threads will not
spawn reliably — as discovered the hard way by trial and error across
different system configurations, not from documentation. It’s computed
normally like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">long</span> <span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FILES</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FS</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SIGHAND</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SYSVSEM</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_THREAD</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_VM</span><span class="p">;</span>
</code></pre></div></div>
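
<p>To double check the arithmetic, here is a small sketch that builds the same flag set from the kernel’s own header. It assumes a Linux system with the kernel headers installed, and <code class="language-plaintext highlighter-rouge">thread_flags</code> is a hypothetical helper name used only for this illustration.</p>

```c
#include <linux/sched.h>  // CLONE_* constants straight from the kernel headers

// Hypothetical helper: combine the six flags listed above.
static long thread_flags(void)
{
    long flags = 0;
    flags |= CLONE_FILES;    // 0x00000400: share the file descriptor table
    flags |= CLONE_FS;       // 0x00000200: share cwd, root, umask
    flags |= CLONE_SIGHAND;  // 0x00000800: share signal handlers
    flags |= CLONE_SYSVSEM;  // 0x00040000: share System V semaphore undo state
    flags |= CLONE_THREAD;   // 0x00010000: join the caller's thread group
    flags |= CLONE_VM;       // 0x00000100: share the address space
    return flags;
}
```

<p>The result should match the constant hardcoded in the wrapper, <code class="language-plaintext highlighter-rouge">0x50f00</code>.</p>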

<p>When the system call returns, it copies the stack pointer into <code class="language-plaintext highlighter-rouge">rdi</code>, the
first argument for the entry point. In the new thread the stack pointer
will be the same value as <code class="language-plaintext highlighter-rouge">stack</code>, of course. In the old thread this is a
harmless no-op because <code class="language-plaintext highlighter-rouge">rdi</code> is a volatile register in this ABI. Finally,
<code class="language-plaintext highlighter-rouge">ret</code> pops the address at the top of the stack and jumps. In the old
thread this returns to the caller with the system call result, either an
error (<a href="/blog/2016/09/23/">negative errno</a>) or the new thread ID. In the new thread
<strong>it pops the first element of <code class="language-plaintext highlighter-rouge">stack_head</code></strong> which, of course, is the
entry point. That’s why it must be first!</p>

<p>The thread has nowhere to return from the entry point, so when it’s done
it must either block indefinitely or use the <code class="language-plaintext highlighter-rouge">exit</code> (<em>not</em> <code class="language-plaintext highlighter-rouge">exit_group</code>)
system call to terminate itself.</p>

<h3 id="caller-point-of-view">Caller point of view</h3>

<p>The caller side looks something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">threadentry</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do work ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
    <span class="n">futex_wake</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">);</span>
    <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span> <span class="o">=</span> <span class="n">newstack</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">16</span><span class="p">);</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">entry</span> <span class="o">=</span> <span class="n">threadentry</span><span class="p">;</span>
    <span class="c1">// ... assign other thread data ...</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">newthread</span><span class="p">(</span><span class="n">stack</span><span class="p">);</span>

    <span class="c1">// ... do work ...</span>

    <span class="n">futex_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">exit_group</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Despite the minimalist, 6-instruction clone wrapper, this is taking the
shape of a conventional threading API. It would only take a bit more to
hide the futex, too. Speaking of which, what’s going on there? The <a href="/blog/2022/10/05/">same
principle as a WaitGroup</a>. The futex, an integer, is zero-initialized,
indicating the thread is running (“not done”). The joiner tells the kernel
to wait until the integer is non-zero, which it may already be since I
don’t bother to check first. When the child thread is done, it atomically
sets the futex to non-zero and wakes all waiters, which might be nobody.</p>
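
<p>A slightly more defensive variation on that join, as a sketch rather than the code used above: it re-checks the integer in a loop so that an early or spurious wakeup cannot end the wait prematurely. The names <code class="language-plaintext highlighter-rouge">syscall4</code>, <code class="language-plaintext highlighter-rouge">join_wait</code>, and <code class="language-plaintext highlighter-rouge">join_done</code> are hypothetical, and the inline assembly assumes x86-64 Linux.</p>

```c
#include <linux/futex.h>   // FUTEX_WAIT, FUTEX_WAKE
#include <sys/syscall.h>   // SYS_futex

// Hypothetical 4-argument raw system call wrapper (x86-64 Linux only).
static long syscall4(long n, long a, long b, long c, long d)
{
    long ret;
    register long r10 __asm("r10") = d;
    __asm volatile (
        "syscall"
        : "=a"(ret)
        : "a"(n), "D"(a), "S"(b), "d"(c), "r"(r10)
        : "rcx", "r11", "memory"
    );
    return ret;
}

// Joiner side: sleep while the futex is still zero. The kernel compares
// the integer against 0 before sleeping, and the loop re-checks after
// every wakeup, spurious or not.
static void join_wait(int *futex)
{
    while (!__atomic_load_n(futex, __ATOMIC_SEQ_CST)) {
        syscall4(SYS_futex, (long)futex, FUTEX_WAIT, 0, 0);
    }
}

// Thread side: publish "done" and wake all waiters, which may be nobody.
static void join_done(int *futex)
{
    __atomic_store_n(futex, 1, __ATOMIC_SEQ_CST);
    syscall4(SYS_futex, (long)futex, FUTEX_WAKE, 0x7fffffff, 0);
}
```

<p>If the thread finishes before the joiner arrives, <code class="language-plaintext highlighter-rouge">join_wait</code> sees a non-zero value and returns without making a system call at all.</p>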

<p>Caveat: It’s not safe to free/reuse the stack after a successful join. It
only indicates the thread is done with its work, not that it exited. You’d
need to wait for its <code class="language-plaintext highlighter-rouge">SIGCHLD</code> (or use <code class="language-plaintext highlighter-rouge">CLONE_CHILD_CLEARTID</code>). If this
sounds like a problem, consider <a href="https://vimeo.com/644068002">your context</a> more carefully: Why do
you feel the need to free the stack? It will be freed when the process
exits. Worried about leaking stacks? Why are you starting and exiting an
unbounded number of threads? In the worst case park the thread in a thread
pool until you need it again. Only worry about this sort of thing if
you’re building a general purpose threading API like pthreads. I know it’s
tempting, but avoid doing that unless you absolutely must.</p>

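<p>What’s with the <code class="language-plaintext highlighter-rouge">force_align_arg_pointer</code>? Linux doesn’t align the stack
for the process entry point like a System V ABI function call. At process
entry the stack pointer is 16-byte aligned, but a function expects it
offset by the 8-byte return address that <code class="language-plaintext highlighter-rouge">call</code> would have pushed, so from
the entry point’s perspective the stack is misaligned. This attribute tells
GCC to fix up the stack alignment in the entry point prologue, <a href="/blog/2023/02/15/#stack-alignment-on-32-bit-x86">just like on Windows</a>.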
If you want to access <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, and <code class="language-plaintext highlighter-rouge">envp</code> you’ll need <a href="/blog/2022/02/18/">more
assembly</a>. (I wish doing <em>really basic things</em> without libc on Linux
didn’t require so much assembly.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__asm</span> <span class="p">(</span>
    <span class="s">".global _start</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"_start:</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  (%rsp), %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsp), %rsi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsi,%rdi,8), %rdx</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   call  main</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  %eax, %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  $60, %eax</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   syscall</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">envp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
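
<p>The second <code class="language-plaintext highlighter-rouge">lea</code> is the least obvious line in that assembly. Written as C pointer arithmetic, it is just “skip past the arguments and their null terminator.” A sketch, where <code class="language-plaintext highlighter-rouge">envp_from</code> is a hypothetical name used only for illustration:</p>

```c
// On entry the stack holds argc, then argc argument pointers, then a
// null pointer, then the environment pointers. So envp starts one slot
// past argv's terminator, which is what 8(%rsi,%rdi,8) computes:
// argv + argc*8 + 8 in bytes.
static char **envp_from(int argc, char **argv)
{
    return argv + argc + 1;
}
```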

<p>Getting back to the example usage, it has some regular-looking system call
wrappers. Where do those come from? Start with this 6-argument generic
system call wrapper.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">syscall6</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="n">b</span><span class="p">,</span> <span class="kt">long</span> <span class="n">c</span><span class="p">,</span> <span class="kt">long</span> <span class="n">d</span><span class="p">,</span> <span class="kt">long</span> <span class="n">e</span><span class="p">,</span> <span class="kt">long</span> <span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r10</span> <span class="n">asm</span><span class="p">(</span><span class="s">"r10"</span><span class="p">)</span> <span class="o">=</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r8</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r8"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">e</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r9</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r9"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">ret</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r10</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r8</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r9</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I could define <code class="language-plaintext highlighter-rouge">syscall5</code>, <code class="language-plaintext highlighter-rouge">syscall4</code>, etc. but instead I’ll just wrap it
in macros. The former would be more efficient since the latter wastes
instructions zeroing registers for no reason, but for now I’m focused on
compacting the implementation source.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))
</span></code></pre></div></div>
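
<p>For comparison, a dedicated one-argument wrapper, the more efficient alternative mentioned above, might look like the following sketch. The name <code class="language-plaintext highlighter-rouge">syscall1</code> is an assumption, and like the rest of the assembly here it is x86-64 Linux only. With no unused arguments, the compiler has no extra registers to zero.</p>

```c
#include <sys/syscall.h>  // SYS_* numbers

// Hypothetical 1-argument raw system call wrapper. Only rax and rdi are
// loaded; the kernel clobbers rcx and r11 as part of the syscall
// instruction itself.
static long syscall1(long n, long a)
{
    long ret;
    __asm volatile (
        "syscall"
        : "=a"(ret)
        : "a"(n), "D"(a)
        : "rcx", "r11", "memory"
    );
    return ret;
}
```

<p>Errors come back as negative errnos in the return value, so closing an invalid file descriptor, <code class="language-plaintext highlighter-rouge">syscall1(SYS_close, -1)</code>, returns <code class="language-plaintext highlighter-rouge">-EBADF</code>.</p>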

<p>Now we can have some exits:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit_group</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit_group</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Simplified futex wrappers:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">expect</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL4</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">expect</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL3</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="mh">0x7fffffff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And so on.</p>

<p>Finally I can talk about that <code class="language-plaintext highlighter-rouge">newstack</code> function. It’s just a wrapper
around an anonymous memory map allocating pages from the kernel. I’ve
hardcoded the constants for the standard mmap allocation since they’re
nothing special or unusual. The return value check is a little tricky
since a large portion of the negative range is valid, so I only want to
check for a small range of negative errnos. (Allocating an arena looks
basically the same.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="nf">newstack</span><span class="p">(</span><span class="kt">long</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">p</span> <span class="o">=</span> <span class="n">SYSCALL6</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x22</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="kt">long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">size</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span><span class="p">);</span>
    <span class="k">return</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span> <span class="o">+</span> <span class="n">count</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">aligned</code> attribute comes into play here: I treat the result like an
array of <code class="language-plaintext highlighter-rouge">stack_head</code> and return the last element. The attribute ensures
each individual element is aligned.</p>
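
<p>For anyone wondering about the magic numbers in <code class="language-plaintext highlighter-rouge">newstack</code>, this sketch spells them out using the kernel headers, and isolates the error check into a hypothetical helper, <code class="language-plaintext highlighter-rouge">is_syscall_error</code>. It assumes Linux with the kernel headers installed.</p>

```c
#include <linux/mman.h>  // PROT_*, MAP_* from the kernel headers

// The hardcoded mmap arguments: readable and writable, private and
// anonymous (not backed by a file, hence fd = -1 and offset = 0).
_Static_assert((PROT_READ | PROT_WRITE) == 3, "prot argument");
_Static_assert((MAP_PRIVATE | MAP_ANONYMOUS) == 0x22, "flags argument");

// Hypothetical helper: raw system calls report failure as a negative
// errno in the range [-4095, -1]. Viewed as unsigned, that is the top
// 4095 values, so anything above -4096UL is an error and everything
// else, including "negative" high addresses, is a valid result.
static int is_syscall_error(unsigned long r)
{
    return r > -4096UL;
}
```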

<p>That’s it! There’s not much to it other than a few thoughtful assembly
instructions. It took doing this a few times in a few different programs
before I noticed how simple it can be.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>How to build a WaitGroup from a 32-bit integer</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/05/"/>
    <id>urn:uuid:cc83b101-2d77-42b8-b409-d4ed36831479</id>
    <updated>2022-10-05T03:19:07Z</updated>
    <category term="c"/><category term="linux"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Go has a nifty synchronization utility called a <a href="https://godocs.io/sync#WaitGroup">WaitGroup</a>, on which
one or more goroutines can wait for concurrent task completion. In other
languages, the usual task completion convention is <em>joining</em> threads doing
the work. In Go, goroutines aren’t values and lack handles, so a WaitGroup
replaces joins. Building a WaitGroup using typical, portable primitives is
a messy affair involving constructors and destructors, managing lifetimes.
However, on at least Linux and Windows, we can build a WaitGroup out of a
zero-initialized integer, much like my <a href="/blog/2022/05/14/">32-bit queue</a> and <a href="/blog/2022/03/13/">32-bit
barrier</a>.</p>

<p>In case you’re not familiar with it, a typical WaitGroup use case in Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">var</span> <span class="n">wg</span> <span class="n">sync</span><span class="o">.</span><span class="n">WaitGroup</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">task</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">tasks</span> <span class="p">{</span>
    <span class="n">wg</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
    <span class="k">go</span> <span class="k">func</span><span class="p">(</span><span class="n">t</span> <span class="n">Task</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// ... do task ...</span>
        <span class="n">wg</span><span class="o">.</span><span class="n">Done</span><span class="p">()</span>
    <span class="p">}(</span><span class="n">task</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">wg</span><span class="o">.</span><span class="n">Wait</span><span class="p">()</span>
</code></pre></div></div>

<p>I zero-initialize the WaitGroup, the main goroutine increments the counter
before starting each task goroutine, each goroutine decrements the counter
when done, and the main goroutine waits until the counter reaches zero. My
goal is to build the same mechanism in C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">workfunc</span><span class="p">(</span><span class="n">task</span> <span class="n">t</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do task ...</span>
    <span class="n">waitgroup_done</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">int</span> <span class="n">wg</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ntasks</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">waitgroup_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">go</span><span class="p">(</span><span class="n">workfunc</span><span class="p">,</span> <span class="n">tasks</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">waitgroup_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When it’s done, the WaitGroup is back to zero, and no cleanup is required.</p>

<p>I’m going to take it a little further than that: Since its meaning and
contents are explicit, you may initialize a WaitGroup to any non-negative
task count! In other words, <code class="language-plaintext highlighter-rouge">waitgroup_add</code> is optional if the total
number of tasks is known up front.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">wg</span> <span class="o">=</span> <span class="n">ntasks</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ntasks</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">go</span><span class="p">(</span><span class="n">workfunc</span><span class="p">,</span> <span class="n">tasks</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">waitgroup_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
</code></pre></div></div>

<p>A sneak peek at the full source: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/waitgroup.c"><code class="language-plaintext highlighter-rouge">waitgroup.c</code></a></strong></p>

<h3 id="the-four-elements-of-synchronization">The four elements (of synchronization)</h3>

<p>To build this WaitGroup, we’re going to need four primitives from the host
platform, each operating on an <code class="language-plaintext highlighter-rouge">int</code>. The first two are atomic operations,
and the second two interact with the system scheduler. To port the
WaitGroup to a platform you need only implement these four functions,
typically as one-liners.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span>  <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>           <span class="c1">// atomic load</span>
<span class="k">static</span> <span class="kt">int</span>  <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>  <span class="c1">// atomic add-then-fetch</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>      <span class="c1">// wait on change at address</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>           <span class="c1">// wake all waiters by address</span>
</code></pre></div></div>

<p>The first two should be self-explanatory. The <code class="language-plaintext highlighter-rouge">wait</code> function waits for
the pointed-at integer to change its value, and the second argument is its
expected current value. The scheduler will double-check the integer before
putting the thread to sleep in case it changes at the last moment — in
other words, an atomic check-then-maybe-sleep. The <code class="language-plaintext highlighter-rouge">wake</code> function is the
other half. After changing the integer, a thread uses it to wake all
threads waiting for the pointed-at integer to change. Together, this
mechanism is known as a <em>futex</em>.</p>

<p>I’m going to simplify the WaitGroup semantics a bit in order to make my
implementation even simpler. Go’s WaitGroup allows adding negatives, and
the <code class="language-plaintext highlighter-rouge">Add</code> method essentially does double-duty. My version forbids adding
negatives. That means the “add” operation is just an atomic increment:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_add</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">,</span> <span class="kt">int</span> <span class="n">delta</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">addfetch</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="n">delta</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since it cannot bring the counter to zero, there’s nothing else to do. The
“done” operation <em>can</em> decrement to zero:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_done</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">addfetch</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">wake</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the atomic decrement brought the count to zero, we finished the last
task, so we need to wake the waiters. We don’t know if anyone is actually
waiting, but that’s fine. Some futex use cases will avoid making the
relatively expensive system call if nobody’s waiting — i.e. don’t waste
time on a system call for each unlock of an uncontended mutex — but in the
typical WaitGroup case we <em>expect</em> a waiter when the count finally goes to
zero. That’s the common case.</p>

<p>The most complicated of the three is waiting:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="n">load</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">wait</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="n">c</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First check if the count is already zero and return if it is. Otherwise
use the futex to <em>wait for it to change</em>. Unfortunately that’s not exactly
the semantics we want, which would be to wait for a certain target. This
doesn’t break the wait, but it’s a potential source of inefficiency. If a
thread finishes a task between our load and wait, we don’t go to sleep,
and instead try again. However, in practice, I ran thousands of threads
through this thing concurrently and I couldn’t observe such a “miss.” As
far as I can tell, it’s so rare it doesn’t matter.</p>

<p>If this were a concern, the WaitGroup could instead be a pair of integers:
the counter and a “latch” that is either 0 or 1. Waiters wait on the
latch, and the latch is modified (atomically) when the counter transitions
to or from zero. That gives waiters a stable value on which to wait,
proxying the counter. However, since this doesn’t seem to matter in
practice, I prefer the elegance and simplicity of the single-integer
WaitGroup.</p>
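<p>For the curious, here’s a rough sketch of that two-integer variant on
Linux. It is not part of <code class="language-plaintext highlighter-rouge">waitgroup.c</code>: the <code class="language-plaintext highlighter-rouge">latchgroup</code> names are made up
for illustration, and it assumes that an add which moves the counter off
zero never races with the done that brings it back to zero.</p>

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for the syscall() declaration
#endif
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical two-integer WaitGroup: waiters sleep on the latch, not on
// the counter, so intermediate decrements never disturb them.
struct latchgroup {
    int count;  // outstanding tasks
    int latch;  // 1 while count is nonzero, otherwise 0
};

static void latchgroup_add(struct latchgroup *wg, int delta)
{
    assert(delta > 0);  // as in the single-integer version, no negatives
    if (__atomic_add_fetch(&wg->count, delta, __ATOMIC_SEQ_CST) == delta) {
        // Transition away from zero: raise the latch.
        __atomic_store_n(&wg->latch, 1, __ATOMIC_SEQ_CST);
    }
}

static void latchgroup_done(struct latchgroup *wg)
{
    if (!__atomic_add_fetch(&wg->count, -1, __ATOMIC_SEQ_CST)) {
        // Transition to zero: lower the latch, then wake every waiter.
        __atomic_store_n(&wg->latch, 0, __ATOMIC_SEQ_CST);
        syscall(SYS_futex, &wg->latch, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
    }
}

static void latchgroup_wait(struct latchgroup *wg)
{
    // The latch only ever holds 0 or 1, so the expected value is stable.
    while (__atomic_load_n(&wg->latch, __ATOMIC_SEQ_CST)) {
        syscall(SYS_futex, &wg->latch, FUTEX_WAIT, 1, NULL, NULL, 0);
    }
}
```

<p>Starting from a zeroed <code class="language-plaintext highlighter-rouge">struct latchgroup</code>, the latch goes up on the
first add and comes down only on the last done. The cost is a second
integer and an extra atomic store at each end.</p>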

<h3 id="four-elements-linux">Four elements: Linux</h3>

<p>With the WaitGroup done at a high level, we now need the per-platform
parts. Both GCC and Clang support <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/_005f_005fatomic-Builtins.html">GNU-style atomics</a>, so I’ll just
assume these are available on Linux without worrying about the compiler.
The first two functions wrap these built-ins:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">__atomic_load_n</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">addend</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">__atomic_add_fetch</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">addend</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">wait</code> and <code class="language-plaintext highlighter-rouge">wake</code> we need the <a href="https://man7.org/linux/man-pages/man2/futex.2.html"><code class="language-plaintext highlighter-rouge">futex(2)</code> system call</a>. In an
attempt to discourage its direct use, glibc doesn’t wrap this system call
in a function, so we must make the system call ourselves.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">current</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">current</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="n">INT_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">INT_MAX</code> means “wake as many as possible.” The other common value is
1 for waking a single waiter. Also, these system calls can’t meaningfully
fail, so there’s no need to check the return value. If <code class="language-plaintext highlighter-rouge">wait</code> wakes up
early (e.g. <code class="language-plaintext highlighter-rouge">EINTR</code>), it’s going to check the counter again anyway. In
fact, if your kernel is more than 20 years old, predating futexes, and
returns <code class="language-plaintext highlighter-rouge">ENOSYS</code> (“Function not implemented”), it will <em>still</em> work
correctly, though it will be incredibly inefficient.</p>
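<p>To see why those “failures” are harmless, here’s a small single-threaded
probe. The function names are hypothetical, and it assumes a Linux kernel
with futex support. The first function waits with a deliberately stale
expected value, and the second wakes an address nobody is waiting on.</p>

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for the syscall() declaration
#endif
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Wait with a stale expected value. The kernel compares before sleeping,
// sees the mismatch, and returns at once. Returns the resulting errno.
static int stale_wait_errno(void)
{
    int value = 1;
    long r = syscall(SYS_futex, &value, FUTEX_WAIT, 0, NULL, NULL, 0);
    assert(r == -1);  // nobody could have woken us, so this is an error
    return errno;
}

// Wake an address with no waiters. Returns the number of threads woken.
static long wake_without_waiters(void)
{
    int value = 0;
    return syscall(SYS_futex, &value, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
```

<p>Per the man page, the stale wait reports <code class="language-plaintext highlighter-rouge">EAGAIN</code> and the lonely wake
reports zero threads woken. Neither outcome requires any handling in
the WaitGroup, since the wait loop re-checks the counter regardless.</p>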

<h3 id="four-elements-windows">Four elements: Windows</h3>

<p>Windows didn’t support futexes until Windows 8 in 2012, and Microsoft was
still supporting versions of Windows without them into 2020, so they’re
still relatively “new” for this platform. Nonetheless, they’re now mature
enough that we can count on them being available.</p>

<p>I’d like to support both GCC-ish (<a href="https://github.com/skeeto/w64devkit">via Mingw-w64</a>) and MSVC-ish
compilers. Mingw-w64 provides a compatible <code class="language-plaintext highlighter-rouge">intrin.h</code>, so I can stick to
MSVC-style atomics and cover both at once. On the other hand, MSVC doesn’t
define atomics for <code class="language-plaintext highlighter-rouge">int</code> (or even <code class="language-plaintext highlighter-rouge">int32_t</code>), strictly <code class="language-plaintext highlighter-rouge">long</code>, so I have
to sneak in a little cast. (Recall: <code class="language-plaintext highlighter-rouge">sizeof(long) == sizeof(int)</code> on every
version of Windows supporting futexes.) The other option is to <code class="language-plaintext highlighter-rouge">typedef</code>
the WaitGroup so that it’s <code class="language-plaintext highlighter-rouge">int</code> on Linux (for the futex) and <code class="language-plaintext highlighter-rouge">long</code> on
Windows (for atomics).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">_InterlockedOr</span><span class="p">((</span><span class="kt">long</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">addend</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">addend</span> <span class="o">+</span> <span class="n">_InterlockedExchangeAdd</span><span class="p">((</span><span class="kt">long</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="p">,</span> <span class="n">addend</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
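<p>For comparison, the <code class="language-plaintext highlighter-rouge">typedef</code> alternative might look something like the
following sketch. The <code class="language-plaintext highlighter-rouge">waitgroup</code> type and <code class="language-plaintext highlighter-rouge">waitgroup_load</code> names are
hypothetical, and I’ve only sketched the load. The Windows branch assumes
the same <code class="language-plaintext highlighter-rouge">intrin.h</code> intrinsic used above.</p>

```c
#include <assert.h>
#ifdef _WIN32
#  include <intrin.h>
typedef long waitgroup;  // matches the _Interlocked* intrinsics
#else
typedef int waitgroup;   // matches the 32-bit futex word
#endif

// Either way the counter must be exactly 32 bits wide.
static_assert(sizeof(waitgroup) == 4, "WaitGroup must be 32 bits");

static int waitgroup_load(waitgroup *wg)
{
#ifdef _WIN32
    return _InterlockedOr(wg, 0);  // already a long pointer, so no cast
#else
    return __atomic_load_n(wg, __ATOMIC_SEQ_CST);
#endif
}
```

<p>The cast disappears, but every caller now has to spell the type as
<code class="language-plaintext highlighter-rouge">waitgroup</code> rather than plain <code class="language-plaintext highlighter-rouge">int</code>.</p>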

<p>The official, sanctioned futex functions are <a href="https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitonaddress">WaitOnAddress</a> and
<a href="https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-wakebyaddressall">WakeByAddressAll</a>. They <a href="https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/CALK-3m%2B6tX_ubMVGV7NarAm6VH0AoOp5THyXfEUA%3DTjyu5L%3Dxw%40mail.gmail.com/">used to be in <code class="language-plaintext highlighter-rouge">kernel32.dll</code></a>, but as of
this writing they live in <code class="language-plaintext highlighter-rouge">API-MS-Win-Core-Synch-l1-2-0.dll</code>, linked via
<code class="language-plaintext highlighter-rouge">-lsynchronization</code>. Gross. Since I can’t stomach this, I instead call the
low-level RTL functions where it’s actually implemented: RtlWaitOnAddress
and RtlWakeAddressAll. These live in the nice neighborhood of <code class="language-plaintext highlighter-rouge">ntdll.dll</code>.
They’re undocumented as far as I can tell, but thankfully <a href="https://github.com/wine-mirror/wine/blob/master/dlls/ntdll/sync.c">Wine comes to
the rescue</a>, providing both documentation and several different
implementations. Reading through it is educational, and hints at ways to
construct futexes on systems lacking them.</p>

<p>These functions aren’t declared in any headers, so I have to do it myself.
On the plus side, so far I haven’t paid the substantial compile-time costs
of <a href="https://web.archive.org/web/20090912002357/http://www.tilander.org/aurora/2008/01/include-windowsh.html">including <code class="language-plaintext highlighter-rouge">windows.h</code></a>, and so I can continue avoiding it. These
functions <em>are</em> listed in the <code class="language-plaintext highlighter-rouge">ntdll.dll</code> import library, so I don’t need
to <a href="/blog/2021/05/31/">invent the import library entries</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="kr">__stdcall</span> <span class="nf">RtlWaitOnAddress</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="kr">__stdcall</span> <span class="nf">RtlWakeAddressAll</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Rather conveniently, the semantics perfectly line up with Linux futexes!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">current</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlWaitOnAddress</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">current</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">),</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlWakeAddressAll</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like with Linux, there’s no meaningful failure, so the return values don’t
matter.</p>

<p>That’s the whole implementation. Considering just a single platform, a
flexible, lightweight, and easy-to-use synchronization facility in ~50
lines of relatively simple code is a pretty good deal if you ask me!</p>
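<p>To show it in use, here’s a condensed Linux-only sketch that pushes a few
pthreads through the WaitGroup. Everything beyond the three WaitGroup
functions is an assumption for the sake of the example: the <code class="language-plaintext highlighter-rouge">futex_wait</code>
and <code class="language-plaintext highlighter-rouge">futex_wake</code> names (renamed to steer clear of <code class="language-plaintext highlighter-rouge">wait(2)</code>), the
<code class="language-plaintext highlighter-rouge">worker</code> and <code class="language-plaintext highlighter-rouge">run_workers</code> helpers, and the choice of detached threads.</p>

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for the syscall() declaration
#endif
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

// Linux futex wrappers under hypothetical names, so this standalone
// file cannot collide with wait(2).
static void futex_wait(int *p, int current)
{
    syscall(SYS_futex, p, FUTEX_WAIT, current, NULL, NULL, 0);
}

static void futex_wake(int *p)
{
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

static void waitgroup_add(int *wg, int delta)
{
    __atomic_add_fetch(wg, delta, __ATOMIC_SEQ_CST);
}

static void waitgroup_done(int *wg)
{
    if (!__atomic_add_fetch(wg, -1, __ATOMIC_SEQ_CST)) {
        futex_wake(wg);
    }
}

static void waitgroup_wait(int *wg)
{
    for (;;) {
        int c = __atomic_load_n(wg, __ATOMIC_SEQ_CST);
        if (!c) {
            break;
        }
        futex_wait(wg, c);
    }
}

// Static storage, so a late futex_wake never refers to a dead stack slot.
static int group;  // the WaitGroup itself, initially zero
static int tally;  // bumped once by each worker

static void *worker(void *arg)
{
    (void)arg;
    __atomic_add_fetch(&tally, 1, __ATOMIC_SEQ_CST);
    waitgroup_done(&group);
    return NULL;
}

// Start n detached workers, block until all have finished, and return
// how many of them reported in.
static int run_workers(int n)
{
    tally = 0;
    for (int i = 0; i < n; i++) {
        pthread_t thread;
        waitgroup_add(&group, 1);  // count the task before it can finish
        int err = pthread_create(&thread, NULL, worker, NULL);
        assert(!err);
        pthread_detach(thread);
    }
    waitgroup_wait(&group);
    return tally;
}
```

<p>Note the add comes before <code class="language-plaintext highlighter-rouge">pthread_create</code>. Otherwise a fast worker could
call done first and push the count negative. Calling <code class="language-plaintext highlighter-rouge">run_workers(8)</code>
should return only after all eight workers have bumped the tally.</p>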

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Illuminating synchronization edges for ThreadSanitizer</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/03/"/>
    <id>urn:uuid:a008900c-cf6a-46e2-8657-21bded194350</id>
    <updated>2022-10-03T03:09:38Z</updated>
    <category term="c"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p><a href="https://github.com/google/sanitizers/wiki">Sanitizers</a> are powerful development tools which complement
<a href="/blog/2022/06/26/">debuggers</a> and <a href="/blog/2019/01/25/">fuzzing</a>. I typically have at least one sanitizer
active during development. They’re particularly useful during code review,
where they can identify issues before I’ve even begun examining the code
carefully — sometimes in mere minutes under fuzzing. Accordingly, it’s a
good idea to have your own code in good agreement with sanitizers before
review. For ThreadSanitizer (TSan), that means dealing with false
positives in programs relying on synchronization invisible to TSan.</p>

<p>This article’s motivation is multi-threaded <a href="https://man7.org/linux/man-pages/man7/epoll.7.html">epoll</a>. I mitigate TSan
false positives each time it comes up, enough to have gotten the hang of
it, so I ought to document it. <a href="https://github.com/skeeto/w64devkit">On Windows</a> I would also run into the
same issue with the Win32 message queue, crossing the synchronization edge
between <a href="https://learn.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-postmessagea">PostMessage</a> (release) and <a href="https://learn.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-getmessage">GetMessage</a> (acquire), <em>except</em>
for the general lack of TSan support in Windows tooling. The same
technique would work there as well.</p>

<p>My typical epoll scenario looks like so:</p>

<ol>
  <li>Create an epoll file descriptor (<code class="language-plaintext highlighter-rouge">epoll_create1</code>).</li>
  <li>Create worker threads, passing the epoll file descriptor.</li>
  <li>Worker threads loop on <code class="language-plaintext highlighter-rouge">epoll_wait</code>.</li>
  <li>Main thread loops on <code class="language-plaintext highlighter-rouge">accept</code>, adding sockets to epoll (<code class="language-plaintext highlighter-rouge">epoll_ctl</code>).</li>
</ol>

<p>Between <code class="language-plaintext highlighter-rouge">accept</code> and <code class="language-plaintext highlighter-rouge">EPOLL_CTL_ADD</code>, the main thread allocates and
initializes the client session state, then attaches it to the epoll event.
The client socket is added with <a href="https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/">the <code class="language-plaintext highlighter-rouge">EPOLLONESHOT</code> flag</a>, and the
session state is not touched after the call to <code class="language-plaintext highlighter-rouge">epoll_ctl</code> (note: sans
error checks):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">accept</span><span class="p">(...);</span>
    <span class="k">struct</span> <span class="n">session</span> <span class="o">*</span><span class="n">session</span> <span class="o">=</span> <span class="p">...;</span>
    <span class="n">session</span><span class="o">-&gt;</span><span class="n">fd</span> <span class="o">=</span> <span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="k">struct</span> <span class="n">epoll_event</span> <span class="n">event</span><span class="p">;</span>
    <span class="n">event</span><span class="p">.</span><span class="n">events</span> <span class="o">=</span> <span class="n">EPOLLET</span> <span class="o">|</span> <span class="n">EPOLLONESHOT</span> <span class="o">|</span> <span class="p">...;</span>
    <span class="n">event</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ptr</span> <span class="o">=</span> <span class="n">session</span><span class="p">;</span>
    <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_ADD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In this example, <code class="language-plaintext highlighter-rouge">struct session</code> is defined by the application to contain
all the state for handling a session (file descriptor, buffers, <a href="/blog/2020/12/31/">state
machine</a>, parser state, <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">allocation arena</a>, etc.). Everything
else is part of the epoll interface.</p>

<p>When a socket is ready, one of the worker threads receives it. Due to
<code class="language-plaintext highlighter-rouge">EPOLLONESHOT</code>, it’s immediately disabled and no other thread can receive
it. The thread does as much work as possible (i.e. read/write until
<code class="language-plaintext highlighter-rouge">EAGAIN</code>), then reactivates it with <code class="language-plaintext highlighter-rouge">epoll_ctl</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">epoll_event</span> <span class="n">event</span><span class="p">;</span>
    <span class="n">epoll_wait</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">session</span> <span class="o">*</span><span class="n">session</span> <span class="o">=</span> <span class="n">event</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ptr</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">session</span><span class="o">-&gt;</span><span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="n">event</span><span class="p">.</span><span class="n">events</span> <span class="o">=</span> <span class="n">EPOLLET</span> <span class="o">|</span> <span class="n">EPOLLONESHOT</span> <span class="o">|</span> <span class="p">...;</span>
    <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_MOD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
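<p>The disable-then-reactivate behavior is easy to observe in isolation.
This is only a single-threaded sketch: it swaps the socket for an
<code class="language-plaintext highlighter-rouge">eventfd</code>, drops <code class="language-plaintext highlighter-rouge">EPOLLET</code> to keep things level-triggered, and the
<code class="language-plaintext highlighter-rouge">oneshot_demo</code> name is my own invention.</p>

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Records the return values of three non-blocking epoll_wait calls:
// armed, disarmed after delivery, and rearmed by EPOLL_CTL_MOD.
static void oneshot_demo(int results[3])
{
    int epfd = epoll_create1(0);
    int fd = eventfd(1, 0);  // counter starts at 1, so already readable
    assert(epfd >= 0 && fd >= 0);

    struct epoll_event event = {0};
    event.events = EPOLLIN | EPOLLONESHOT;
    event.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);

    struct epoll_event out;
    results[0] = epoll_wait(epfd, &out, 1, 0);  // delivered, then disabled
    results[1] = epoll_wait(epfd, &out, 1, 0);  // disabled: nothing to report

    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);  // reactivate
    results[2] = epoll_wait(epfd, &out, 1, 0);  // still readable: delivered

    close(fd);
    close(epfd);
}
```

<p>Between the first wait and the <code class="language-plaintext highlighter-rouge">EPOLL_CTL_MOD</code>, no other waiter can
receive the descriptor, which is exactly the window in which a worker
owns the session.</p>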

<p>The shared variables in <code class="language-plaintext highlighter-rouge">session</code> are passed between threads through
<code class="language-plaintext highlighter-rouge">epoll</code> using the event’s <code class="language-plaintext highlighter-rouge">.data.ptr</code>. These variables are potentially
read and mutated by every thread, but it’s all perfectly safe without any
further synchronization — i.e. no need for mutexes, etc. All the necessary
synchronization is implicit in epoll.</p>
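<p>As a minimal illustration of that pointer hand-off, here’s a sketch that
registers a session and immediately pulls it back out the way a worker
would. The two-field <code class="language-plaintext highlighter-rouge">struct session</code> and the <code class="language-plaintext highlighter-rouge">roundtrip</code> function are
stand-ins of my own, not the real thing, and it runs on one thread.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Hypothetical, stripped-down session: just enough to show the hand-off.
struct session {
    int fd;
    int requests;
};

// Register a session, then retrieve it with a non-blocking epoll_wait.
// Returns the pointer epoll handed back, or NULL if nothing was ready.
static struct session *roundtrip(int epfd, struct session *session)
{
    assert(session && session->fd >= 0);

    struct epoll_event event = {0};
    event.events = EPOLLIN | EPOLLONESHOT;
    event.data.ptr = session;  // epoll stores this and returns it verbatim
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, session->fd, &event)) {
        return NULL;
    }

    struct epoll_event ready;
    if (epoll_wait(epfd, &ready, 1, 0) != 1) {
        return NULL;
    }
    return ready.data.ptr;
}
```

<p>Given an epoll descriptor and a session whose <code class="language-plaintext highlighter-rouge">fd</code> is already readable
(say, an <code class="language-plaintext highlighter-rouge">eventfd</code> created with a nonzero count), the returned pointer is
the very same session that went in. The kernel never looks inside it.</p>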

<p>In the initial hand-off, that <code class="language-plaintext highlighter-rouge">EPOLL_CTL_ADD</code> must <em>happen before</em> the
corresponding <code class="language-plaintext highlighter-rouge">epoll_wait</code> in a worker thread. This establishes that the
main thread and worker thread do not touch session variables concurrently.
After all, how could the worker see an event on the file descriptor before
it’s been added to epoll? The synchronization in epoll itself will also
ensure all the architecture-level stores are visible to other threads
before the hand-off. We can call the “add” a <em>release</em> and the “wait” an
<em>acquire</em>, forming a synchronization edge.</p>

<p>Similarly, in the hand-off between worker threads, the <code class="language-plaintext highlighter-rouge">EPOLL_CTL_MOD</code>
that reactivates the file descriptor must <em>happen before</em> the wait that
observes the next event because, until reactivation, it’s disabled. The
<code class="language-plaintext highlighter-rouge">EPOLL_CTL_MOD</code> is another <em>release</em> in relation to the <em>acquire</em> wait.</p>

<p>Unfortunately TSan won’t see things this way. It can’t see into the
kernel, and it doesn’t know these subtle epoll semantics, so it can’t see
these synchronization edges. As far <a href="https://www.youtube.com/watch?v=5erqWdlhQLA">as it can tell</a>, threads might be
accessing a session concurrently, and TSan will reliably produce warnings
about it. You could shrug your shoulders and give up on using TSan in this
case, but there’s an easy solution: introduce redundant, semantically
identical synchronization edges, but only when TSan is looking.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WARNING: ThreadSanitizer: data race
</code></pre></div></div>

<h3 id="redundant-synchronization">Redundant synchronization</h3>

<p>I prefer to solve this by introducing the weakest possible synchronization
so that I’m not synchronizing beyond epoll’s semantics. This will help
TSan catch real mistakes that stronger synchronization might hide.</p>

<p>The weakest option is memory fences. These wouldn’t introduce extra loads
or stores. At most it would be a fence instruction. I would use <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/_005f_005fatomic-Builtins.html">GCC’s
built-in <code class="language-plaintext highlighter-rouge">__atomic_thread_fence</code></a> for the job. However, TSan does not
currently understand thread fences, so that defeats the purpose. Instead,
I introduce a new field to <code class="language-plaintext highlighter-rouge">struct session</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">session</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="kt">int</span> <span class="n">_sync</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Then just before <code class="language-plaintext highlighter-rouge">epoll_ctl</code> I’ll do a <em>release</em> store on this field,
“releasing” the session. All session stores are ordered before the
release.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// main thread</span>
    <span class="c1">// ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">session</span><span class="o">-&gt;</span><span class="n">_sync</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">__ATOMIC_RELEASE</span><span class="p">);</span>
    <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_ADD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>

    <span class="c1">// worker thread</span>
    <span class="c1">// ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">session</span><span class="o">-&gt;</span><span class="n">_sync</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">__ATOMIC_RELEASE</span><span class="p">);</span>
    <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_MOD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
</code></pre></div></div>

<p>After <code class="language-plaintext highlighter-rouge">epoll_wait</code> I add an <em>acquire</em> load, “acquiring” the session. All
session loads are ordered after the acquire.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">epoll_wait</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">session</span> <span class="o">*</span><span class="n">session</span> <span class="o">=</span> <span class="n">event</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ptr</span><span class="p">;</span>
    <span class="n">__atomic_load_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">session</span><span class="o">-&gt;</span><span class="n">_sync</span><span class="p">,</span> <span class="n">__ATOMIC_ACQUIRE</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">session</span><span class="o">-&gt;</span><span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
</code></pre></div></div>

<p>For this to work, the thread must not touch session variables in any way
before the acquire or after the release. For example, note how I obtained
the client file descriptor before the release, i.e. no <code class="language-plaintext highlighter-rouge">session-&gt;fd</code>
argument in the <code class="language-plaintext highlighter-rouge">epoll_ctl</code> call.</p>

<p>That’s it! This redundantly establishes the <em>happens before</em> relationship
already implicit in epoll, but now it’s visible to TSan. However, I don’t
want to pay for this unless I’m actually running under TSan, so some
macros are in order. GCC automatically defines <code class="language-plaintext highlighter-rouge">__SANITIZE_THREAD__</code> when
compiling with <code class="language-plaintext highlighter-rouge">-fsanitize=thread</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if __SANITIZE_THREAD__
# define TSAN_SYNCED     int _sync
# define TSAN_ACQUIRE(s) __atomic_load_n(&amp;(s)-&gt;_sync, __ATOMIC_ACQUIRE)
# define TSAN_RELEASE(s) __atomic_store_n(&amp;(s)-&gt;_sync, 0, __ATOMIC_RELEASE)
#else
# define TSAN_SYNCED
# define TSAN_ACQUIRE(s)
# define TSAN_RELEASE(s)
#endif
</span></code></pre></div></div>

<p>This also makes it more readable, and intentions clearer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">session</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="n">TSAN_SYNCED</span><span class="p">;</span>
<span class="p">};</span>

    <span class="c1">// main thread</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
        <span class="n">TSAN_RELEASE</span><span class="p">(</span><span class="n">session</span><span class="p">);</span>
        <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_ADD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="c1">// worker thread</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">epoll_wait</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>
        <span class="k">struct</span> <span class="n">session</span> <span class="o">*</span><span class="n">session</span> <span class="o">=</span> <span class="n">event</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ptr</span><span class="p">;</span>
        <span class="n">TSAN_ACQUIRE</span><span class="p">(</span><span class="n">session</span><span class="p">);</span>
        <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">session</span><span class="o">-&gt;</span><span class="n">fd</span><span class="p">;</span>
        <span class="c1">// ...</span>
        <span class="n">TSAN_RELEASE</span><span class="p">(</span><span class="n">session</span><span class="p">);</span>
        <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_MOD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Now I can use TSan again, and it didn’t cost anything in normal builds.</p>
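
<p>To see the handoff end to end, here’s a minimal sketch that passes one
session from a main thread to a worker through epoll. It is an illustration
under assumptions, not the server from this article: the <code class="language-plaintext highlighter-rouge">value</code> field,
<code class="language-plaintext highlighter-rouge">struct demo</code>, <code class="language-plaintext highlighter-rouge">worker</code>, and <code class="language-plaintext highlighter-rouge">handoff_demo</code> are names invented for the
example, and a pipe stands in for the client socket.</p>

```c
#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

#if __SANITIZE_THREAD__
# define TSAN_SYNCED     int _sync
# define TSAN_ACQUIRE(s) __atomic_load_n(&(s)->_sync, __ATOMIC_ACQUIRE)
# define TSAN_RELEASE(s) __atomic_store_n(&(s)->_sync, 0, __ATOMIC_RELEASE)
#else
# define TSAN_SYNCED
# define TSAN_ACQUIRE(s)
# define TSAN_RELEASE(s)
#endif

struct session {
    int fd;
    int value;  // hypothetical payload, written before the handoff
    TSAN_SYNCED;
};

struct demo {
    int epfd;
    int seen;   // what the worker read from the session, or -1
};

static void *worker(void *arg)
{
    struct demo *demo = arg;
    struct epoll_event event;
    // One-second timeout so a failed setup cannot hang the demo
    if (epoll_wait(demo->epfd, &event, 1, 1000) == 1) {
        struct session *session = event.data.ptr;
        TSAN_ACQUIRE(session);
        demo->seen = session->value;
    }
    return 0;
}

// Hand a session to a worker through epoll and return the value the
// worker observed, or -1 on failure.
int handoff_demo(int value)
{
    int pipefd[2];
    if (pipe(pipefd)) return -1;

    struct demo demo = {epoll_create1(0), -1};
    pthread_t thread;
    int result = -1;
    if (demo.epfd >= 0 && write(pipefd[1], "x", 1) == 1 &&
        !pthread_create(&thread, 0, worker, &demo)) {
        // The session is created after the worker has started, so epoll
        // is the only thing ordering these writes before the worker's reads.
        struct session session = {pipefd[0], value};
        struct epoll_event event = {0};
        event.events = EPOLLIN | EPOLLONESHOT;
        event.data.ptr = &session;
        TSAN_RELEASE(&session);
        // Note: pipefd[0], not session.fd, after the release
        epoll_ctl(demo.epfd, EPOLL_CTL_ADD, pipefd[0], &event);
        pthread_join(thread, 0);
        result = demo.seen;
    }

    if (demo.epfd >= 0) close(demo.epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">handoff_demo(42)</code> should return 42. Building with
<code class="language-plaintext highlighter-rouge">-fsanitize=thread -pthread</code> switches the macros on; in any other build
they expand to nothing.</p>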

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>My new debugbreak command</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/07/31/"/>
    <id>urn:uuid:c333d1ab-86b5-4389-b2b7-325d0eb90987</id>
    <updated>2022-07-31T12:59:59Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>I <a href="/blog/2022/06/26/">previously mentioned</a> the Windows feature where <a href="https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-registerhotkey">pressing
F12</a> in a debuggee window causes it to break in the debugger. It
works with any debugger — GDB, RemedyBG, Visual Studio, etc. — since the
hotkey simply raises a breakpoint <a href="https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp">structured exception</a>. It’s been
surprisingly useful, and I’ve wanted it available in more contexts, such
as console programs or even on Linux. The result is a new <a href="https://github.com/skeeto/w64devkit/blob/4282797/src/debugbreak.c"><code class="language-plaintext highlighter-rouge">debugbreak</code>
command</a>, now included in <a href="/blog/2020/05/15/">w64devkit</a>. Though, of course, you
already have <a href="/blog/2020/09/25/">everything you need</a> to build it and try it out right
now. I’ve also worked out a Linux implementation.</p>

<p>It’s named after an <a href="https://docs.microsoft.com/en-us/visualstudio/debugger/debugbreak-and-debugbreak">MSVC intrinsic and Win32 function</a>. It takes no
arguments, and its operation is indiscriminate: It raises a breakpoint
exception in <em>all</em> debuggee processes system-wide. Reckless? Perhaps, but
certainly convenient. You don’t need to tell it which process you want to
pause. It just works, and a good debugging experience is one of ease and
convenience.</p>

<p>The linchpin is <a href="https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-debugbreakprocess">DebugBreakProcess</a>. The command walks the process
list and fires this function at each process. Nothing happens for programs
without a debugger attached, so it doesn’t even bother checking if it’s a
debuggee. It couldn’t be simpler. I’ve used it on everything from Windows
XP to Windows 11, and it’s worked flawlessly.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span> <span class="n">s</span> <span class="o">=</span> <span class="n">CreateToolhelp32Snapshot</span><span class="p">(</span><span class="n">TH32CS_SNAPPROCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">PROCESSENTRY32W</span> <span class="n">p</span> <span class="o">=</span> <span class="p">{</span><span class="k">sizeof</span><span class="p">(</span><span class="n">p</span><span class="p">)};</span>
<span class="k">for</span> <span class="p">(</span><span class="n">BOOL</span> <span class="n">r</span> <span class="o">=</span> <span class="n">Process32FirstW</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">);</span> <span class="n">r</span><span class="p">;</span> <span class="n">r</span> <span class="o">=</span> <span class="n">Process32NextW</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">))</span> <span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">OpenProcess</span><span class="p">(</span><span class="n">PROCESS_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">th32ProcessID</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">DebugBreakProcess</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
        <span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I use it almost exclusively from Vim, where I’ve given it a <a href="https://learnvimscriptthehardway.stevelosh.com/chapters/06.html">leader
mapping</a>. With the editor focused, I can type backslash then
<kbd>d</kbd> to pause the debuggee.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">map</span> <span class="p">&lt;</span>leader<span class="p">&gt;</span><span class="k">d</span> <span class="p">:</span><span class="k">call</span> <span class="nb">system</span><span class="p">(</span><span class="s2">"debugbreak"</span><span class="p">)&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
</code></pre></div></div>

<p>With the debuggee paused, I’m free to add new breakpoints or watchpoints,
or print the call stack to see what the heck it’s busy doing. The
mechanism behind DebugBreakProcess is to create a new thread in the
target, with that thread raising the breakpoint exception. The debugger
will be stopped in this new thread. In GDB you can use the <code class="language-plaintext highlighter-rouge">thread</code>
command to switch over to the thread that actually matters, usually <code class="language-plaintext highlighter-rouge">thr
1</code>.</p>

<h3 id="debugbreak-on-linux">debugbreak on Linux</h3>

<p>On unix-like systems the equivalent of a breakpoint exception is a
<code class="language-plaintext highlighter-rouge">SIGTRAP</code>. There’s already a standard command for sending signals,
<a href="https://man7.org/linux/man-pages/man1/kill.1.html"><code class="language-plaintext highlighter-rouge">kill</code></a>, so a <code class="language-plaintext highlighter-rouge">debugbreak</code> command can be built using nothing more
than a few lines of shell script. However, unlike DebugBreakProcess,
signaling every process with <code class="language-plaintext highlighter-rouge">SIGTRAP</code> will only end in tears. The script
will need a way to determine which processes are debuggees.</p>

<p>Linux exposes processes in the file system as virtual files under <code class="language-plaintext highlighter-rouge">/proc</code>,
where each process appears as a directory. Its <code class="language-plaintext highlighter-rouge">status</code> file includes a
<code class="language-plaintext highlighter-rouge">TracerPid</code> field, which will be non-zero for debuggees. The script
inspects this field, and if non-zero sends a <code class="language-plaintext highlighter-rouge">SIGTRAP</code>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="k">for </span>pid <span class="k">in</span> <span class="si">$(</span>find /proc <span class="nt">-maxdepth</span> 1 <span class="nt">-printf</span> <span class="s1">'%f\n'</span> | <span class="nb">grep</span> <span class="s1">'^[0-9]\+$'</span><span class="si">)</span><span class="p">;</span> <span class="k">do
    </span><span class="nb">grep</span> <span class="nt">-q</span> <span class="s1">'^TracerPid:\s[^0]'</span> /proc/<span class="nv">$pid</span>/status 2&gt;/dev/null <span class="o">&amp;&amp;</span>
        <span class="nb">kill</span> <span class="nt">-TRAP</span> <span class="nv">$pid</span>
<span class="k">done</span>
</code></pre></div></div>

<p>This script, now part of <a href="/blog/2012/06/23/">my dotfiles</a>, has worked very well so
far, and effectively smooths over some debugging differences between
Windows and Linux, reducing my context-switching mental load. There’s
probably a better way to express this script, but that’s the best I could
do so far. On the BSDs you’d need to parse the output of <code class="language-plaintext highlighter-rouge">ps</code>, though each
system seems to do its own thing for distinguishing debuggees.</p>
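
<p>For comparison, the same logic also fits in a short C program. This is
only a sketch, assuming the Linux <code class="language-plaintext highlighter-rouge">/proc</code> layout described above. The
function names <code class="language-plaintext highlighter-rouge">tracer_pid</code> and <code class="language-plaintext highlighter-rouge">debugbreak</code> are chosen for illustration
and are not taken from the w64devkit source.</p>

```c
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

// Extract the TracerPid value from the text of a /proc/PID/status file.
// Returns 0 (not traced) if the field is missing.
long tracer_pid(const char *status)
{
    const char *field = "TracerPid:";
    for (const char *line = status; line;) {
        if (!strncmp(line, field, strlen(field))) {
            // strtol skips the whitespace that follows the colon
            return strtol(line + strlen(field), 0, 10);
        }
        line = strchr(line, '\n');
        if (line) line++;  // step past the newline to the next line
    }
    return 0;
}

// Send SIGTRAP to every process with a tracer attached. Returns how many
// processes were signaled, or -1 if /proc could not be opened.
int debugbreak(void)
{
    DIR *proc = opendir("/proc");
    if (!proc) return -1;

    int count = 0;
    for (struct dirent *e; (e = readdir(proc));) {
        char *end;
        long pid = strtol(e->d_name, &end, 10);
        if (pid <= 0 || *end) continue;  // not a process directory

        char path[64], status[4096];
        snprintf(path, sizeof(path), "/proc/%ld/status", pid);
        FILE *f = fopen(path, "r");
        if (!f) continue;  // process exited, or not readable
        size_t len = fread(status, 1, sizeof(status) - 1, f);
        fclose(f);
        status[len] = 0;

        if (tracer_pid(status) && !kill((pid_t)pid, SIGTRAP)) {
            count++;
        }
    }
    closedir(proc);
    return count;
}
```

<p>A <code class="language-plaintext highlighter-rouge">main</code> that just calls <code class="language-plaintext highlighter-rouge">debugbreak()</code> would complete the command.
Keeping the parsing in its own function makes it easy to check against
sample <code class="language-plaintext highlighter-rouge">status</code> text without signaling anything.</p>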

<h3 id="a-missing-feature">A missing feature</h3>

<p>I had originally planned for one flag, <code class="language-plaintext highlighter-rouge">-k</code>. Rather than breakpoint
debuggees, it would terminate all debuggee processes. This is especially
important on Windows where debuggee processes block builds due to file
locking shenanigans. I’d just run <code class="language-plaintext highlighter-rouge">debugbreak -k</code> as part of the build.
However, it’s not possible to terminate debuggees paused in the debugger —
the common situation. I’ve given up on this for now.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>How to build and use DLLs on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/05/31/"/>
    <id>urn:uuid:6b64024a-6945-4bff-8226-33b9357babda</id>
    <updated>2021-05-31T02:13:40Z</updated>
    <category term="win32"/><category term="c"/><category term="cpp"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>I’ve recently been involved with a couple of discussions about Windows’
dynamic linking. One was with <a href="https://begriffs.com/">Joe Nelson</a>, who was considering how to make
<a href="https://github.com/begriffs/libderp">libderp</a> accessible on Windows, and the other was about <a href="/blog/2020/05/15/">w64devkit</a>,
my Mingw-w64 distribution. I use these techniques so infrequently that I
need to figure it all out again each time I need it. Unfortunately there’s
a whole lot of outdated and incorrect information online which gets in the
way every time this happens. While it’s all fresh in my head, I will now
document what I know works.</p>

<p>In this article, all commands and examples are being run in the context of
w64devkit (1.8.0).</p>

<h3 id="mingw-w64">Mingw-w64</h3>

<p>If all you care about is the GNU toolchain then DLLs are straightforward,
working mostly like shared objects on other platforms. To illustrate,
let’s build a “square” library with one “exported” function, <code class="language-plaintext highlighter-rouge">square</code>,
that returns the square of its input (<code class="language-plaintext highlighter-rouge">square.c</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The header file (<code class="language-plaintext highlighter-rouge">square.h</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifndef SQUARE_H
#define SQUARE_H
</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>

<span class="cp">#endif
</span></code></pre></div></div>

<p>To build a stripped, size-optimized DLL, <code class="language-plaintext highlighter-rouge">square.dll</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -s -o square.dll square.c
</code></pre></div></div>

<p>Now a test program to link against it (<code class="language-plaintext highlighter-rouge">main.c</code>), which “imports” <code class="language-plaintext highlighter-rouge">square</code>
from <code class="language-plaintext highlighter-rouge">square.dll</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">"square.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">square</span><span class="p">(</span><span class="mi">2</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Linking and testing it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -Os -s main.c square.dll
$ ./a
4
</code></pre></div></div>

<p>It’s that simple. Or more traditionally, using the <code class="language-plaintext highlighter-rouge">-l</code> flag:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -Os -s -L. main.c -lsquare
</code></pre></div></div>

<p>Given <code class="language-plaintext highlighter-rouge">-lxyz</code> GCC will look for <code class="language-plaintext highlighter-rouge">xyz.dll</code> in the library path.</p>

<h4 id="viewing-exported-symbols">Viewing exported symbols</h4>

<p>Given a DLL, printing a list of its exported functions is not so
straightforward. For ELF shared objects there’s <code class="language-plaintext highlighter-rouge">nm -D</code>, but despite what
the internet will tell you, this tool does not support DLLs. <code class="language-plaintext highlighter-rouge">objdump</code>
will print the exports as part of the “private” headers (<code class="language-plaintext highlighter-rouge">-p</code>). A bit of
<code class="language-plaintext highlighter-rouge">awk</code> can cut this down to just a list of exports. Since we’ll need this a
few times, here’s a script, <code class="language-plaintext highlighter-rouge">exports.sh</code>, that composes <code class="language-plaintext highlighter-rouge">objdump</code> and
<code class="language-plaintext highlighter-rouge">awk</code> into the tool I want:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="nb">printf</span> <span class="s1">'LIBRARY %s\nEXPORTS\n'</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>
objdump <span class="nt">-p</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> | <span class="nb">awk</span> <span class="s1">'/^$/{t=0} {if(t)print$NF} /^\[O/{t=1}'</span>
</code></pre></div></div>

<p>Running this on <code class="language-plaintext highlighter-rouge">square.dll</code> above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
square
</code></pre></div></div>

<p>This can be helpful when debugging. It also works outside of Windows, such
as on Linux. By the way, the output format is no accident: This is the
<a href="https://sourceware.org/binutils/docs/binutils/def-file-format.html"><code class="language-plaintext highlighter-rouge">.def</code> file format</a> (<a href="https://www.willus.com/mingw/yongweiwu_stdcall.html">also</a>), which will be particularly
useful in a moment.</p>

<p>Mingw-w64 has a <code class="language-plaintext highlighter-rouge">gendef</code> tool to produce the above output, and this tool
is now included in w64devkit. To print the exports to standard output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gendef - square.dll
LIBRARY "square.dll"
EXPORTS
square
</code></pre></div></div>

<p>Alternatively Visual Studio provides <code class="language-plaintext highlighter-rouge">dumpbin</code>. It’s not as concise as
<code class="language-plaintext highlighter-rouge">exports.sh</code> but it’s a lot less verbose than <code class="language-plaintext highlighter-rouge">objdump -p</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dumpbin /nologo /exports square.dll
...
          1    0 000012B0 square
...
</code></pre></div></div>

<h4 id="mingw-w64-improved">Mingw-w64 (improved)</h4>

<p>You can get by without knowing anything more, which is usually enough for
those looking to support Windows as a secondary platform, even just as a
cross-compilation target. However, with a bit more work we can do better.
Imagine doing the above with a non-trivial program. GCC doesn’t know which
functions are part of the API and which are not. Obviously static
functions should not be exported, but what about non-static functions
visible between translation units (i.e. object files)?</p>

<p>For instance, suppose <code class="language-plaintext highlighter-rouge">square.c</code> also has this function which is not part
of its API but may be called by another translation unit.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">internal_func</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{}</span>
</code></pre></div></div>

<p>Now when I rebuild and list the exports:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
internal_func
square
</code></pre></div></div>

<p>On the other side, when I build <code class="language-plaintext highlighter-rouge">main.c</code> how does it know which functions
are imported from a DLL and which will be found in another translation
unit? GCC makes it work regardless, but it can generate more efficient
code if it knows at compile time (vs. link time).</p>

<p>On Windows both are solved by adding <code class="language-plaintext highlighter-rouge">__declspec</code> notation on both sides.
In <code class="language-plaintext highlighter-rouge">square.c</code> the exports are marked as <code class="language-plaintext highlighter-rouge">dllexport</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">internal_func</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{}</span>
</code></pre></div></div>

<p>In the header, it’s marked as an import:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>
</code></pre></div></div>

<p>The mere presence of <code class="language-plaintext highlighter-rouge">dllexport</code> tells the linker to only export those
functions marked as exports, and so <code class="language-plaintext highlighter-rouge">internal_func</code> disappears from the
exports list. Convenient!</p>

<p>On the import side, during compilation of the original program, GCC
assumed <code class="language-plaintext highlighter-rouge">square</code> wasn’t an import and generated a local function call.
When the linker later resolved the symbol to the DLL, it generated a
trampoline to fill in as that local function (like a <a href="https://www.airs.com/blog/archives/41">PLT</a>). With
<code class="language-plaintext highlighter-rouge">dllimport</code>, GCC knows it’s an imported function and so doesn’t go through
a trampoline.</p>
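
<p>As a rough mental model, the two call paths can be written out in
portable C. This is only an illustration with made-up names: the real
trampoline comes from the linker and the real import table slot is filled
in by the loader, so nobody writes them by hand. The slot for <code class="language-plaintext highlighter-rouge">square</code>
is conventionally named <code class="language-plaintext highlighter-rouge">__imp_square</code>.</p>

```c
// The function as it exists inside the DLL.
long dll_square(long x) { return x * x; }

// Stand-in for the import table slot that the loader fills in at load
// time with the address of the function in the DLL.
long (*imp_square)(long) = dll_square;

// Without dllimport the compiler emits an ordinary direct call to a local
// "square", so the linker supplies this trampoline, which jumps through
// the slot.
long square_trampoline(long x) { return imp_square(x); }

// Caller compiled without dllimport: direct call, then an indirect jump.
long call_without_dllimport(long x) { return square_trampoline(x); }

// Caller compiled with dllimport: a single indirect call through the slot.
long call_with_dllimport(long x) { return imp_square(x); }
```

<p>Both callers get the same answer. The difference is the extra hop
through the trampoline in the first case.</p>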

<p>While generally unnecessary for the GNU toolchain, it’s good hygiene to
use <code class="language-plaintext highlighter-rouge">__declspec</code>. It’s also mandatory when using <a href="https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B">MSVC</a>, in case you
care about that as well.</p>

<h3 id="msvc">MSVC</h3>

<p>Mingw-w64-compiled DLLs will work with <code class="language-plaintext highlighter-rouge">LoadLibrary</code> out of the box, which
is sufficient in many cases, such as for dynamically-loaded plugins. For
example (<code class="language-plaintext highlighter-rouge">loadlib.c</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">LoadLibrary</span><span class="p">(</span><span class="s">"square.dll"</span><span class="p">);</span>
    <span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">square</span><span class="p">)(</span><span class="kt">long</span><span class="p">)</span> <span class="o">=</span> <span class="n">GetProcAddress</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"square"</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">square</span><span class="p">(</span><span class="mi">2</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled with MSVC <code class="language-plaintext highlighter-rouge">cl</code> (via <a href="/blog/2016/06/13/#visual-c"><code class="language-plaintext highlighter-rouge">vcvars.bat</code></a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo loadlib.c
$ ./loadlib
4
</code></pre></div></div>

<p>However, the MSVC linker, unlike Binutils <code class="language-plaintext highlighter-rouge">ld</code>, cannot link directly with
DLLs. It requires an <em>import library</em>. Conventionally this matches the DLL
name but has a <code class="language-plaintext highlighter-rouge">.lib</code> extension — <code class="language-plaintext highlighter-rouge">square.lib</code> in this case. The Mingw-w64
ecosystem conventionally uses <code class="language-plaintext highlighter-rouge">.dll.a</code>, as in <code class="language-plaintext highlighter-rouge">square.dll.a</code>, in order to
distinguish it from a static library, but it’s the same format. The most
convenient way to get an import library is to ask GCC to generate one at
link-time via <code class="language-plaintext highlighter-rouge">--out-implib</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Wl,--out-implib,square.lib -o square.dll square.c
</code></pre></div></div>

<p>Back to <code class="language-plaintext highlighter-rouge">cl</code>, just add <code class="language-plaintext highlighter-rouge">square.lib</code> as another input. You don’t actually
need <code class="language-plaintext highlighter-rouge">square.dll</code> present at link time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /Os main.c square.lib
$ ./main
4
</code></pre></div></div>

<p>What if you already have the DLL and you just need an import library? GNU
Binutils’ <code class="language-plaintext highlighter-rouge">dlltool</code> can do this, though not without help. It cannot
generate an import library from a DLL alone since it requires a <code class="language-plaintext highlighter-rouge">.def</code>
file enumerating the exports. (Why?) What luck that we have a tool for
this!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll &gt;square.def
$ dlltool --input-def square.def --output-lib square.lib
</code></pre></div></div>

<h3 id="reversing-directions">Reversing directions</h3>

<p>Going the other way, building a DLL with MSVC and linking it with
Mingw-w64, is nearly as easy as the pure Mingw-w64 case, though it
requires that all exports are tagged with <code class="language-plaintext highlighter-rouge">dllexport</code>. The <code class="language-plaintext highlighter-rouge">/LD</code> option
(case sensitive) is just like GCC’s <code class="language-plaintext highlighter-rouge">-shared</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /LD /Os square.c
$ cc -Os -s main.c square.dll
$ ./a
4
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">cl</code> outputs three files: <code class="language-plaintext highlighter-rouge">square.dll</code>, <code class="language-plaintext highlighter-rouge">square.lib</code>, and <code class="language-plaintext highlighter-rouge">square.exp</code>.
The last can be discarded, and the second will be needed if linking with
MSVC, but as before, Mingw-w64 requires only the first.</p>

<p>This all demonstrates that Mingw-w64 and MSVC are quite interoperable — at
least for C interfaces that <a href="/blog/2023/08/27/">don’t share CRT objects</a>.</p>

<h3 id="tying-it-all-together">Tying it all together</h3>

<p>If your program is designed to be portable, those <code class="language-plaintext highlighter-rouge">__declspec</code> will get in
the way. That can be tidied up with some macros, but even better, those
macros can be used to control ELF symbol visibility so that the library
has good hygiene on, say, Linux as well.</p>

<p>The strategy will be to mark all API functions with <code class="language-plaintext highlighter-rouge">SQUARE_API</code> and
expand that to whatever is necessary at the time. When building a library,
it will expand to <code class="language-plaintext highlighter-rouge">dllexport</code>, or default visibility on unix-likes. When
consuming a library it will expand to <code class="language-plaintext highlighter-rouge">dllimport</code>, or nothing outside of
Windows. The new <code class="language-plaintext highlighter-rouge">square.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifndef SQUARE_H
#define SQUARE_H
</span>
<span class="cp">#if defined(SQUARE_BUILD)
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllexport)
#  elif defined(__ELF__)
#    define SQUARE_API __attribute__ ((visibility ("default")))
#  else
#    define SQUARE_API
#  endif
#else
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllimport)
#  else
#    define SQUARE_API
#  endif
#endif
</span>
<span class="n">SQUARE_API</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>

<span class="cp">#endif
</span></code></pre></div></div>

<p>The new <code class="language-plaintext highlighter-rouge">square.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SQUARE_BUILD
#include</span> <span class="cpf">"square.h"</span><span class="cp">
</span>
<span class="n">SQUARE_API</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">main.c</code> remains the same. When compiling on unix-like systems, add
<code class="language-plaintext highlighter-rouge">-fvisibility=hidden</code> to hide all symbols by default so that this macro
can reveal them.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -fvisibility=hidden -s -o libsquare.so square.c
$ cc -Os -s main.c ./libsquare.so
$ ./a.out
4
</code></pre></div></div>

<h3 id="makefile-ideas">Makefile ideas</h3>

<p>While Mingw-w64 hides a lot of the differences between Windows and
unix-like systems, when it comes to dynamic libraries it can only do so
much, especially if you care about import libraries. If I were maintaining
a dynamic library — unlikely since I strongly prefer embedding or static
linking — I’d probably just use different <a href="/blog/2017/08/20/">Makefiles</a> per toolchain
and target. Aside from the <code class="language-plaintext highlighter-rouge">SQUARE_API</code> type of macros, the source code
can fortunately remain fairly agnostic about it.</p>

<p>Here’s what I might use as <code class="language-plaintext highlighter-rouge">NMakefile</code> for MSVC <code class="language-plaintext highlighter-rouge">nmake</code>:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>     <span class="o">=</span> cl /nologo
<span class="nv">CFLAGS</span> <span class="o">=</span> /Os

<span class="nl">all</span><span class="o">:</span> <span class="nf">main.exe square.dll square.lib</span>

<span class="nl">main.exe</span><span class="o">:</span> <span class="nf">main.c square.h square.lib</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> main.c square.lib

<span class="nl">square.dll</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> /LD <span class="nv">$(CFLAGS)</span> square.c

<span class="nl">square.lib</span><span class="o">:</span> <span class="nf">square.dll</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="p">-</span>del /f main.exe square.dll square.lib square.exp
</code></pre></div></div>

<p>Usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nmake /nologo /f NMakefile
</code></pre></div></div>

<p>For w64devkit and cross-compiling, <code class="language-plaintext highlighter-rouge">Makefile.w64</code>, which includes
import library generation for the sake of MSVC consumers:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>      <span class="o">=</span> cc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Os</span>
<span class="nv">LDFLAGS</span> <span class="o">=</span> <span class="nt">-s</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">main.exe square.dll square.lib</span>

<span class="nl">main.exe</span><span class="o">:</span> <span class="nf">main.c square.dll square.h</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> main.c square.dll <span class="nv">$(LDLIBS)</span>

<span class="nl">square.dll</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> <span class="nt">-shared</span> <span class="nt">-Wl</span>,--out-implib,<span class="err">$</span><span class="o">(</span>@:dll<span class="o">=</span>lib<span class="o">)</span> <span class="se">\</span>
	    <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> square.c <span class="nv">$(LDLIBS)</span>

<span class="nl">square.lib</span><span class="o">:</span> <span class="nf">square.dll</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="nb">rm</span> <span class="nt">-f</span> main.exe square.dll square.lib
</code></pre></div></div>

<p>Usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make -f Makefile.w64
</code></pre></div></div>

<p>And a <code class="language-plaintext highlighter-rouge">Makefile</code> for everyone else:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>      <span class="o">=</span> cc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Os</span> <span class="nt">-fvisibility</span><span class="o">=</span>hidden
<span class="nv">LDFLAGS</span> <span class="o">=</span> <span class="nt">-s</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">main libsquare.so</span>

<span class="nl">main</span><span class="o">:</span> <span class="nf">main.c libsquare.so square.h</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> main.c ./libsquare.so <span class="nv">$(LDLIBS)</span>

<span class="nl">libsquare.so</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> <span class="nt">-shared</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> square.c <span class="nv">$(LDLIBS)</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="nb">rm</span> <span class="nt">-f</span> main libsquare.so
</code></pre></div></div>

<p>Now that I have this article, I’m glad I won’t have to figure this all out
again next time I need it!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Asynchronously Opening and Closing Files in Asyncio</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/09/04/"/>
    <id>urn:uuid:ae94da45-f65d-4c72-a10e-9e421ea843ec</id>
    <updated>2020-09-04T01:36:20Z</updated>
    <category term="c"/><category term="linux"/><category term="python"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p>Python <a href="https://docs.python.org/3/library/asyncio.html">asyncio</a> has support for asynchronous networking,
subprocesses, and interprocess communication. However, it has nothing
for asynchronous file operations — opening, reading, writing, or
closing. This is likely in part because operating systems themselves
also lack these facilities. If a file operation takes a long time,
perhaps because the file is on a network mount, then the entire Python
process will hang. It’s possible to work around this, so let’s build a
utility that can asynchronously open and close files.</p>

<p>The usual way to work around the lack of operating system support for a
particular asynchronous operation is to <a href="http://docs.libuv.org/en/v1.x/design.html#file-i-o">dedicate threads to waiting on
those operations</a>. By using a thread pool, we can even avoid the
overhead of spawning threads when we need them. Plus asyncio is designed
to play nicely with thread pools anyway.</p>

<h3 id="test-setup">Test setup</h3>

<p>Before we get started, we’ll need some way to test that it’s working. We
need a slow file system. One thought is to <a href="/blog/2018/06/23/">use ptrace to intercept the
relevant system calls</a>, though this isn’t quite so simple. The
other threads need to continue running while the thread waiting on
<code class="language-plaintext highlighter-rouge">open(2)</code> is paused, but ptrace pauses the whole process. Fortunately
there’s a simpler solution anyway: <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code>.</p>

<p>Setting the <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code> environment variable to the name of a shared
object will cause the loader to load this shared object ahead of
everything else, allowing that shared object to override other
libraries. I’m on x86-64 Linux (Debian), and so I’m looking to override
<code class="language-plaintext highlighter-rouge">open64(2)</code> in glibc. Here’s my <code class="language-plaintext highlighter-rouge">open64.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include</span> <span class="cpf">&lt;dlfcn.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span>
<span class="nf">open64</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">int</span> <span class="n">mode</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">"/tmp/"</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">)</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">RTLD_NEXT</span><span class="p">,</span> <span class="s">"open64"</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">f</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">mode</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now Python must go through my C function when it opens files. If the
file resides anywhere under <code class="language-plaintext highlighter-rouge">/tmp/</code>, opening the file will be delayed by 3
seconds. Since I still want to actually open a file, I use <code class="language-plaintext highlighter-rouge">dlsym()</code> to
access the <em>real</em> <code class="language-plaintext highlighter-rouge">open64()</code> in glibc. I build it like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -fPIC -o open64.so open64.c -ldl
</code></pre></div></div>

<p>And to test that it works with Python, let’s time how long it takes to
open <code class="language-plaintext highlighter-rouge">/tmp/x</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ touch /tmp/x
$ time LD_PRELOAD=./open64.so python3 -c 'open("/tmp/x")'

real    0m3.021s
user    0m0.014s
sys     0m0.005s
</code></pre></div></div>

<p>Perfect! (Note: It’s a little strange putting <code class="language-plaintext highlighter-rouge">time</code> <em>before</em> setting the
environment variable, but I’m using Bash, where <code class="language-plaintext highlighter-rouge">time</code> is special: it is
the shell’s own version of the command.)</p>

<h3 id="thread-pools">Thread pools</h3>

<p>Python’s standard <code class="language-plaintext highlighter-rouge">open()</code> is most commonly used as a <em>context manager</em>
so that the file is automatically closed no matter what happens.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'output.txt'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'hello world'</span><span class="p">,</span> <span class="nb">file</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
</code></pre></div></div>

<p>I’d like my asynchronous open to follow this pattern using <a href="https://www.python.org/dev/peps/pep-0492/"><code class="language-plaintext highlighter-rouge">async
with</code></a>. It’s like <code class="language-plaintext highlighter-rouge">with</code>, but the context manager is acquired and
released asynchronously. I’ll call my version <code class="language-plaintext highlighter-rouge">aopen()</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">with</span> <span class="n">aopen</span><span class="p">(</span><span class="s">'output.txt'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
    <span class="p">...</span>
</code></pre></div></div>

<p>So <code class="language-plaintext highlighter-rouge">aopen()</code> will need to return an <em>asynchronous context manager</em>, an
object with methods <code class="language-plaintext highlighter-rouge">__aenter__</code> and <code class="language-plaintext highlighter-rouge">__aexit__</code> that both return
<a href="https://docs.python.org/3/glossary.html#term-awaitable"><em>awaitables</em></a>. Usually this is by virtue of these methods being
<a href="https://docs.python.org/3/glossary.html#term-coroutine-function"><em>coroutine functions</em></a>, but a normal function that directly returns
an awaitable also works, which is what I’ll be doing for <code class="language-plaintext highlighter-rouge">__aenter__</code>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">_AsyncOpen</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">):</span>
        <span class="p">...</span>

    <span class="k">def</span> <span class="nf">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="p">...</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exc_type</span><span class="p">,</span> <span class="n">exc</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="p">...</span>
</code></pre></div></div>

<p>Ultimately we have to call <code class="language-plaintext highlighter-rouge">open()</code>. The arguments for <code class="language-plaintext highlighter-rouge">open()</code> will be
given to the constructor to be used later. This will make more sense
when you see the definition for <code class="language-plaintext highlighter-rouge">aopen()</code>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_args</span> <span class="o">=</span> <span class="n">args</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_kwargs</span> <span class="o">=</span> <span class="n">kwargs</span>
</code></pre></div></div>

<p>When it’s time to actually open the file, Python will call <code class="language-plaintext highlighter-rouge">__aenter__</code>.
We can’t call <code class="language-plaintext highlighter-rouge">open()</code> directly since that will block, so we’ll use a
thread pool to wait on it. Rather than create a thread pool, we’ll use
the one that comes with the current event loop. The <code class="language-plaintext highlighter-rouge">run_in_executor()</code>
method runs a function in a thread pool — where <code class="language-plaintext highlighter-rouge">None</code> means use the
default pool — returning an asyncio future representing the future
result, in this case the opened file object.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">def</span> <span class="nf">thread_open</span><span class="p">():</span>
            <span class="k">return</span> <span class="nb">open</span><span class="p">(</span><span class="o">*</span><span class="bp">self</span><span class="p">.</span><span class="n">_args</span><span class="p">,</span> <span class="o">**</span><span class="bp">self</span><span class="p">.</span><span class="n">_kwargs</span><span class="p">)</span>
        <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_future</span> <span class="o">=</span> <span class="n">loop</span><span class="p">.</span><span class="n">run_in_executor</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">thread_open</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_future</span>
</code></pre></div></div>

<p>Since this <code class="language-plaintext highlighter-rouge">__aenter__</code> is not a coroutine function, it returns the
future directly as its awaitable result. The caller will await it.</p>

<p>The default thread pool is sized from the number of cores (as of Python
3.8, <code class="language-plaintext highlighter-rouge">min(32, os.cpu_count() + 4)</code> threads), which I suppose is an
obvious choice, though not ideal here. That’s fine for CPU-bound operations
but not for I/O-bound operations. In a real program we may want to use a
larger thread pool.</p>
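
<p>As a rough sketch of that last point, the loop’s default executor can be
swapped for a larger one with <code class="language-plaintext highlighter-rouge">set_default_executor()</code>. The pool size of 32
is an arbitrary assumption, and <code class="language-plaintext highlighter-rouge">slow_io()</code> and <code class="language-plaintext highlighter-rouge">run_blocking()</code> are
hypothetical names standing in for real blocking file operations:</p>

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io():
    # Stand-in for a blocking call such as open() on a slow mount.
    time.sleep(0.2)
    return True


async def run_blocking(count, workers):
    loop = asyncio.get_running_loop()
    # Replace the loop's default executor with a larger pool, so that
    # run_in_executor(None, ...) can keep `workers` calls in flight.
    pool = ThreadPoolExecutor(max_workers=workers)
    loop.set_default_executor(pool)
    start = time.monotonic()
    jobs = [loop.run_in_executor(None, slow_io) for _ in range(count)]
    results = await asyncio.gather(*jobs)
    return results, time.monotonic() - start


results, elapsed = asyncio.run(run_blocking(count=32, workers=32))
print(len(results), f"{elapsed:.2f}s")
```

<p>With enough workers, all 32 calls should wait at the same time rather than
queue up behind a handful of threads.</p>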

<p>Closing a file may block, so we’ll do that in a thread pool as well.
First pull the file object <a href="/blog/2020/07/30/">from the future</a>, then close it in the
thread pool, waiting until the file has actually closed:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">async</span> <span class="k">def</span> <span class="nf">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exc_type</span><span class="p">,</span> <span class="n">exc</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="nb">file</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_future</span>
        <span class="k">def</span> <span class="nf">thread_close</span><span class="p">():</span>
            <span class="nb">file</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
        <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
        <span class="k">await</span> <span class="n">loop</span><span class="p">.</span><span class="n">run_in_executor</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">thread_close</span><span class="p">)</span>
</code></pre></div></div>

<p>The open and close are paired in this context manager, but it may be
concurrent with an arbitrary number of other <code class="language-plaintext highlighter-rouge">_AsyncOpen</code> context
managers. There will be some upper limit to the number of open files, so
<strong>we need to be careful not to use too many of these things
concurrently</strong>, something <a href="/blog/2020/05/24/">which easily happens when using unbounded
queues</a>. Lacking back pressure, all it takes is for tasks to be
opening files slightly faster than they close them.</p>
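
<p>One way to get that back pressure, sketched here on the assumption that a
fixed cap is acceptable, is to guard the context manager with an
<code class="language-plaintext highlighter-rouge">asyncio.Semaphore</code>. The names <code class="language-plaintext highlighter-rouge">MAX_OPEN</code>, <code class="language-plaintext highlighter-rouge">bounded()</code>, and
<code class="language-plaintext highlighter-rouge">run_bounded()</code> are made up for this illustration, and <code class="language-plaintext highlighter-rouge">asyncio.sleep()</code>
stands in for the body that would hold a file open:</p>

```python
import asyncio

MAX_OPEN = 2  # hypothetical cap; tune against the process's file limit


async def bounded(limit, stats, i):
    # Tasks beyond the cap wait here, before any file is opened.
    async with limit:
        stats["open"] += 1
        stats["peak"] = max(stats["peak"], stats["open"])
        await asyncio.sleep(0.01)  # stand-in for: async with aopen(...)
        stats["open"] -= 1
    return i


async def run_bounded(n):
    # Create the semaphore inside the running loop.
    limit = asyncio.Semaphore(MAX_OPEN)
    stats = {"open": 0, "peak": 0}
    done = await asyncio.gather(*(bounded(limit, stats, i) for i in range(n)))
    return done, stats["peak"]


done, peak = asyncio.run(run_bounded(10))
print(len(done), peak)
```

<p>The <code class="language-plaintext highlighter-rouge">peak</code> counter is only there to show that no more than <code class="language-plaintext highlighter-rouge">MAX_OPEN</code>
tasks are ever inside the guarded region at once.</p>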

<p>With all the hard work done, the definition for <code class="language-plaintext highlighter-rouge">aopen()</code> is trivial:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">aopen</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">_AsyncOpen</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>

<p>That’s it! Let’s try it out with the <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code> test.</p>

<h3 id="a-test-drive">A test drive</h3>

<p>First define a “heartbeat” task that will tell us the asyncio loop is
still chugging away while we wait on opening the file.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">heartbeat</span><span class="p">():</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">'HEARTBEAT'</span><span class="p">)</span>
</code></pre></div></div>

<p>Here’s a test function for <code class="language-plaintext highlighter-rouge">aopen()</code> that asynchronously opens a file
under <code class="language-plaintext highlighter-rouge">/tmp/</code> named by an integer, (synchronously) writes that integer
to the file, then asynchronously closes it.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">write</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">aopen</span><span class="p">(</span><span class="sa">f</span><span class="s">'/tmp/</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="nb">file</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">main()</code> function creates the heartbeat task and opens 4 files
concurrently through the intercepted file opening routine:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">beat</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">heartbeat</span><span class="p">())</span>
    <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">write</span><span class="p">(</span><span class="n">i</span><span class="p">))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)]</span>
    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
    <span class="n">beat</span><span class="p">.</span><span class="n">cancel</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>The result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ LD_PRELOAD=./open64.so python3 aopen.py
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
$ cat /tmp/{1,2,3,4}
1
2
3
4
</code></pre></div></div>

<p>As expected, 6 heartbeats corresponding to 3 seconds that all 4 tasks
spent concurrently waiting on the intercepted <code class="language-plaintext highlighter-rouge">open()</code>. Here’s the full
source if you want to try it out for yourself:</p>

<p><a href="https://gist.github.com/skeeto/89af673a0a0d24de32ad19ee505c8dbd">https://gist.github.com/skeeto/89af673a0a0d24de32ad19ee505c8dbd</a></p>

<h3 id="caveat-no-asynchronous-reads-and-writes">Caveat: no asynchronous reads and writes</h3>

<p><em>Only</em> opening and closing the file is asynchronous. Reads and writes are
unchanged, still fully synchronous and blocking, so this is only a half
solution. A full solution is not nearly as simple because asyncio is built
on async/await. Asynchronous reads and writes would require all new APIs
<a href="https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/">with different coloring</a>. You’d need an <code class="language-plaintext highlighter-rouge">aprint()</code> to complement
<code class="language-plaintext highlighter-rouge">print()</code>, and so on, each returning an <code class="language-plaintext highlighter-rouge">awaitable</code> to be awaited.</p>
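
<p>To make that concrete, here is a minimal sketch of what such an
<code class="language-plaintext highlighter-rouge">aprint()</code> might look like if it took the same thread pool approach as
<code class="language-plaintext highlighter-rouge">aopen()</code>. It’s a hypothetical helper, not part of asyncio, and the
<code class="language-plaintext highlighter-rouge">StringIO</code> in <code class="language-plaintext highlighter-rouge">demo()</code> merely stands in for a slow file:</p>

```python
import asyncio
import functools
import io


async def aprint(*args, **kwargs):
    # Hypothetical helper: run the blocking print() in the event loop's
    # default thread pool and suspend until it has finished.
    loop = asyncio.get_running_loop()
    call = functools.partial(print, *args, **kwargs)
    await loop.run_in_executor(None, call)


async def demo():
    out = io.StringIO()  # stand-in for a slow file object
    await aprint("hello", "world", file=out)
    return out.getvalue()


text = asyncio.run(demo())
```

<p>Every caller must now <code class="language-plaintext highlighter-rouge">await</code> the new function, which is exactly the
coloring problem: the synchronous <code class="language-plaintext highlighter-rouge">print()</code> can’t simply be swapped out.</p>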

<p>This is one of the unfortunate downsides of async/await. I strongly
prefer conventional, preemptive concurrency, <em>but</em> we don’t always have
that luxury.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Purgeable Memory Allocations for Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/29/"/>
    <id>urn:uuid:50300bbe-0939-4bcf-96ff-8fb96a9b12d5</id>
    <updated>2019-12-29T00:25:49Z</updated>
    <category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I saw (part of) a video, <a href="https://www.youtube.com/watch?v=9l0nWEUpg7s">OS hacking: Purgeable memory</a>, by
Andreas Kling who’s writing an operating system called <a href="https://github.com/SerenityOS/serenity">Serenity</a>
and recording videos of his progress. In the video he implements
<em>purgeable memory</em> as <a href="https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/CachingandPurgeableMemory.html">found on some Apple platforms</a> by adding
special support in the kernel. A process tells the kernel that a
particular range of memory isn’t important, and so the kernel can
reclaim it if the system is under memory pressure — the memory is
purgeable.</p>

<p>Linux has a mechanism like this, <a href="http://man7.org/linux/man-pages/man2/madvise.2.html"><code class="language-plaintext highlighter-rouge">madvise(2)</code></a>, that allows
processes to provide hints to the kernel on how memory is expected to be
used. The flag of interest is <code class="language-plaintext highlighter-rouge">MADV_FREE</code>:</p>

<blockquote>
  <p>The application no longer requires the pages in the range specified by
<code class="language-plaintext highlighter-rouge">addr</code> and <code class="language-plaintext highlighter-rouge">len</code>. The kernel can thus free these pages, but the
freeing could be delayed until memory pressure occurs. For each of the
pages that has been marked to be freed but has not yet been freed, the
free operation will be canceled if the caller writes into the page.</p>
</blockquote>

<p>So, given this, I built a proof of concept / toy on top of <code class="language-plaintext highlighter-rouge">MADV_FREE</code>
that provides this functionality for Linux:</p>

<p><strong><a href="https://github.com/skeeto/purgeable">https://github.com/skeeto/purgeable</a></strong></p>

<p>It <a href="/blog/2018/11/15/">allocates anonymous pages</a> using <code class="language-plaintext highlighter-rouge">mmap(2)</code>. When the allocation
is “unlocked” — i.e. the process isn’t actively using it — its pages are
marked with <code class="language-plaintext highlighter-rouge">MADV_FREE</code> so that the kernel can reclaim them at any time.
To lock the allocation so that the process can safely make use of them,
the <code class="language-plaintext highlighter-rouge">MADV_FREE</code> is canceled. This is all a little trickier than it sounds,
and that’s the subject of this article.</p>

<p>Note: There’s also <code class="language-plaintext highlighter-rouge">MADV_DONTNEED</code> which seems like it would fit the
bill, but <a href="https://www.youtube.com/watch?v=bg6-LVCHmGM#t=58m23s">it’s implemented incorrectly in Linux</a>. It <em>immediately</em>
frees the pages, and so it’s useless for implementing purgeable memory.</p>

<h3 id="purgeable-api">Purgeable API</h3>

<p>Before diving into the implementation, here’s the API. It’s <a href="/blog/2018/06/10/">just four
functions</a> with no structure definitions. The pointer used by the
API is the memory allocation itself. All the bookkeeping <a href="/blog/2017/01/08/">associated
with that pointer</a> is hidden away, out of sight from the API’s
consumer. The full documentation is in <a href="https://github.com/skeeto/purgeable/blob/master/purgeable.h"><code class="language-plaintext highlighter-rouge">purgeable.h</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">purgeable_alloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">purgeable_unlock</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">purgeable_lock</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">purgeable_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The semantics are much like a C++ <code class="language-plaintext highlighter-rouge">weak_ptr</code> in that locking both
validates that the allocation is still available and creates a “strong”
reference to it that prevents it from being purged. Though unlike a weak
reference, the allocation is stickier. It will remain until the system is
actually under pressure, not just when the garbage collector happens to
run or the last strong reference is gone.</p>

<p>Here’s how it might be used to, say, store decoded PNG data that can
be decompressed again if needed:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">texture</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">png</span> <span class="o">*</span><span class="n">png</span> <span class="o">=</span> <span class="n">png_load</span><span class="p">(</span><span class="s">"texture.png"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">png</span><span class="p">)</span> <span class="n">die</span><span class="p">();</span>

<span class="cm">/* ... */</span>

<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">texture</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">texture</span> <span class="o">=</span> <span class="n">purgeable_alloc</span><span class="p">(</span><span class="n">png</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">*</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">*</span> <span class="mi">4</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">texture</span><span class="p">)</span> <span class="n">die</span><span class="p">();</span>
        <span class="n">png_decode_rgba</span><span class="p">(</span><span class="n">png</span><span class="p">,</span> <span class="n">texture</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">purgeable_lock</span><span class="p">(</span><span class="n">texture</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">purgeable_free</span><span class="p">(</span><span class="n">texture</span><span class="p">);</span>
        <span class="n">texture</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">continue</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">glTexImage2D</span><span class="p">(</span>
        <span class="n">GL_TEXTURE_2D</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">GL_RGBA</span><span class="p">,</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">width</span><span class="p">,</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">height</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">GL_RGBA</span><span class="p">,</span> <span class="n">GL_UNSIGNED_BYTE</span><span class="p">,</span> <span class="n">texture</span>
    <span class="p">);</span>
    <span class="n">purgeable_unlock</span><span class="p">(</span><span class="n">texture</span><span class="p">);</span>
    <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Memory is allocated in a locked state since it’s very likely to be
immediately filled with data. The application should unlock it before
moving on with other tasks. The purgeable memory must always be freed
using <code class="language-plaintext highlighter-rouge">purgeable_free()</code>, even if <code class="language-plaintext highlighter-rouge">purgeable_lock()</code> failed. This not only
frees the bookkeeping, but also releases the now-zero pages and the
mapping itself. Originally I had <code class="language-plaintext highlighter-rouge">purgeable_lock()</code> free the purgeable
memory on failure, but I felt this was clearer. There’s no technical
reason it couldn’t, though.</p>

<h3 id="purgeable-implementation">Purgeable Implementation</h3>

<p>The main challenge is that the kernel doesn’t necessarily treat the
<code class="language-plaintext highlighter-rouge">MADV_FREE</code> range contiguously. It might reclaim just some pages, and do
so in an arbitrary order. In order to lock the region, each page must be
handled individually. Per the man page quoted above, reversing
<code class="language-plaintext highlighter-rouge">MADV_FREE</code> requires a write to each page — to either trigger a page
fault or set <a href="https://en.wikipedia.org/wiki/Dirty_bit">a dirty bit</a>.</p>

<p>The only way to tell if a page has been purged is to check if it’s been
filled with zeros. That’s easy if we’re sure a particular byte in the
page should be zero, but, since this is a library, the caller might just
store <em>anything</em> on these pages.</p>

<p>So here’s my solution: To unlock a page, look at the first byte on the
page. Remember whether or not it’s zero. If it’s zero, write a 1 into
that byte. Once this has been done for all pages, use <code class="language-plaintext highlighter-rouge">madvise(2)</code> to
mark them all <code class="language-plaintext highlighter-rouge">MADV_FREE</code>.</p>
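
<p>The marking pass can be sketched in a few lines. This is an illustration, not the library’s actual code: the function name <code class="language-plaintext highlighter-rouge">mark_for_unlock()</code>, its parameters, and the bit order in the table are assumptions, and the <code class="language-plaintext highlighter-rouge">madvise(2)</code> call is left as a comment so the sketch runs on any buffer.</p>

```c
#include <stddef.h>

/* Sketch only: mark_for_unlock() and its parameters are hypothetical
 * names, not the library's real internals. "bits" holds one bit per
 * page; a set bit means the page's first byte was originally zero. */
void mark_for_unlock(unsigned char *buf, size_t numpages,
                     size_t pagesize, unsigned char *bits)
{
    for (size_t i = 0; i < numpages; i++) {
        unsigned char *first = buf + i * pagesize;
        unsigned char mask = (unsigned char)(1u << (i % 8));
        if (*first == 0) {
            *first = 1;          /* distinguishable from a purged page */
            bits[i / 8] |= mask; /* remember the original byte was zero */
        } else {
            bits[i / 8] &= (unsigned char)~mask; /* originally non-zero */
        }
    }
    /* A real implementation would now call
     * madvise(buf, numpages * pagesize, MADV_FREE). */
}
```

<p>After this pass every page starts with a non-zero byte, so a page that later reads back as zero in that position must have been purged.</p>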

<p>With this approach, the library only needs to track one bit of information
per page regardless of the page’s contents. Assuming 4kB pages, each 32kB
of allocation has 1 byte of overhead (amortized) — or ~0.003% overhead.
Not too bad!</p>

<p>Locking purgeable memory is a little trickier. Again, each page must be
visited in turn, and if any page was purged, then the whole allocation is
considered lost. If the first byte was non-zero when unlocking, the
library checks that it’s still non-zero. If the first byte was zero when
unlocking, then it prepares to write a zero back into that byte, which
must currently be non-zero.</p>

<p>In either case, the <code class="language-plaintext highlighter-rouge">MADV_FREE</code> needs to be canceled using a write, so
the library <a href="/blog/2014/09/02/">does an atomic compare-and-swap</a> (CAS) to write the
correct byte into the page, <em>even if it’s the same value</em> in the
non-zero case. The atomic CAS is essential because <strong>it ensures the page
wasn’t purged between the check and the write, as both are done
together, atomically</strong>. If every page has the expected first byte, and
every CAS succeeded, then the purgeable memory has been successfully
locked.</p>
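
<p>Here is a rough sketch of that locking pass. It is not the library’s code: <code class="language-plaintext highlighter-rouge">try_lock_pages()</code> and its parameters are hypothetical, the bit table is assumed to hold a set bit for each page whose first byte was originally zero, and the CAS uses the GCC/Clang <code class="language-plaintext highlighter-rouge">__atomic</code> builtins.</p>

```c
#include <stddef.h>

/* Sketch only: returns 1 if every page survived, 0 if any was purged.
 * A set bit in "bits" means the page's first byte was zero before
 * unlocking and was temporarily replaced with 1. */
int try_lock_pages(unsigned char *buf, size_t numpages,
                   size_t pagesize, const unsigned char *bits)
{
    for (size_t i = 0; i < numpages; i++) {
        unsigned char *first = buf + i * pagesize;
        int was_zero = (bits[i / 8] >> (i % 8)) & 1;
        unsigned char expected, desired;

        if (was_zero) {
            expected = 1;   /* the marker written when unlocking */
            desired = 0;    /* restore the caller's original zero */
        } else {
            expected = __atomic_load_n(first, __ATOMIC_RELAXED);
            if (expected == 0)
                return 0;   /* non-zero byte turned to zero: purged */
            desired = expected; /* same value, but still a write */
        }

        /* Check and write in one atomic step. If the page is purged
         * after the load above, the byte reads zero and the CAS fails. */
        if (!__atomic_compare_exchange_n(first, &expected, desired, 0,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return 0;
    }
    return 1;
}
```

<p>On failure the earlier pages have already been written to, which is harmless since the whole allocation is treated as lost at that point.</p>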

<p>As an optimization, the library could consider more than just the first
byte, and look at, say, the first <code class="language-plaintext highlighter-rouge">long int</code> on each page. The library
does less work when the page contains a non-zero value, and the chance of
an arbitrary 8-byte value being zero is much lower. However, I wanted to
avoid <a href="/blog/2018/07/20/#strict-aliasing">potential aliasing issues</a>, especially if this library were
to be embedded, so I passed on the idea.</p>

<h4 id="bookkeeping">Bookkeeping</h4>

<p>The bookkeeping data is stored just before the buffer returned as the
purgeable memory, and it’s never marked with <code class="language-plaintext highlighter-rouge">MADV_FREE</code>. Assuming 4kB
pages, for each 128MB of purgeable memory the library allocates one extra
anonymous page to track it. The number of pages in the allocation is
stored just before the purgeable memory as a <code class="language-plaintext highlighter-rouge">size_t</code>, and the rest is the
per-page bit table described above.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">purgeable_alloc</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">);</span>
<span class="kt">size_t</span> <span class="n">numpages</span> <span class="o">=</span> <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
</code></pre></div></div>

<p>So the library can immediately find it starting from the purgeable memory
address. Here’s an illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      ,--- p
      |
      v
----------------------------------------------
|...Z|    |    |    |    |    |    |    |    |
----------------------------------------------
 ^  ^
 |  |
 |  `--- size_t numpages
 |
 `--- bit table
</code></pre></div></div>
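
<p>To make the space cost concrete, here is a small sketch that computes how many bookkeeping pages sit in front of an allocation. The helper <code class="language-plaintext highlighter-rouge">header_pages()</code> is hypothetical, and it assumes the layout drawn above, where the page count shares space with the bit table.</p>

```c
#include <stddef.h>

/* Hypothetical helper: number of bookkeeping pages needed in front of
 * an allocation spanning numpages pages, assuming a bit table followed
 * by a single size_t page count. */
size_t header_pages(size_t numpages, size_t pagesize)
{
    size_t table = (numpages + 7) / 8;        /* one bit per page, rounded up */
    size_t bytes = table + sizeof(size_t);    /* plus the numpages field */
    return (bytes + pagesize - 1) / pagesize; /* round up to whole pages */
}
```

<p>Under that assumption a four-page allocation needs one bookkeeping page, and so does anything up to just under 128MB with 4kB pages, since the <code class="language-plaintext highlighter-rouge">size_t</code> takes a few bytes away from the table.</p>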

<p>The downside is that buffer underflows in the application would easily
trample the <code class="language-plaintext highlighter-rouge">numpages</code> value because it’s located immediately adjacent. It
would be safer to move it to the <em>beginning</em> of the first page before the
purgeable memory, but this would have made bit table access more
complicated. While the region is locked, the contents of the bit table
don’t matter, so it won’t be damaged by an underflow. Another idea: put a
checksum alongside <code class="language-plaintext highlighter-rouge">numpages</code>. It could just be a simple <a href="/blog/2018/07/31/">integer
hash</a>.</p>
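
<p>The checksum idea might look something like the following. This is only a sketch of the suggestion, not something the library does: the <code class="language-plaintext highlighter-rouge">struct header</code> layout and both function names are made up for illustration, and the hash is a 32-bit integer hash using the <code class="language-plaintext highlighter-rouge">triple32()</code> constants.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit integer hash (triple32 constants). Each step is invertible,
 * so distinct 32-bit inputs always produce distinct outputs. */
static uint32_t hash32(uint32_t x)
{
    x ^= x >> 17; x *= 0xed5ad4bbU;
    x ^= x >> 11; x *= 0xac4c1b51U;
    x ^= x >> 15; x *= 0x31848babU;
    x ^= x >> 14;
    return x;
}

/* Hypothetical header layout: the page count plus a check value. */
struct header {
    size_t   numpages;
    uint32_t check;
};

void header_set(struct header *h, size_t numpages)
{
    h->numpages = numpages;
    h->check = hash32((uint32_t)numpages);
}

/* Returns non-zero if numpages still matches its check value. */
int header_ok(const struct header *h)
{
    return h->check == hash32((uint32_t)h->numpages);
}
```

<p>A stray write that changes <code class="language-plaintext highlighter-rouge">numpages</code> without also producing the matching hash would then be caught before the value is trusted.</p>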

<p>This makes for a really slick API since the consumer doesn’t need to track
anything more than a single pointer, the address of the purgeable memory
allocation itself.</p>

<h3 id="worth-using">Worth using?</h3>

<p>I’m not quite sure how often I’d actually use purgeable memory in real
programs, especially in software intended to be portable. Each operating
system needs its own implementation, and this library is not portable
since it relies on interfaces and behaviors specific to Linux.</p>

<p>It also has a not-so-unlikely pathological case: Imagine a program that
makes two purgeable memory allocations, and they’re large enough that one
always evicts the other. The program would thrash back and forth
fighting itself as it used each allocation. Detecting this situation
might be difficult, especially as the number of purgeable memory
allocations increases.</p>

<p>Regardless, it’s another tool for my software toolbelt.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Survey of $RANDOM</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/12/25/"/>
    <id>urn:uuid:071e3ec5-fe1d-309a-3e66-3b590a96ac2c</id>
    <updated>2018-12-25T00:05:38Z</updated>
    <category term="linux"/><category term="bsd"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Most Bourne shell clones support a special <code class="language-plaintext highlighter-rouge">RANDOM</code> environment
variable that evaluates to a random value between 0 and 32,767 (i.e.
15 bits). Assignment to the variable seeds the generator. This variable
is an extension and <a href="http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/utilities/V3_chap02.html">did not appear</a> in the original Unix Bourne
shell. Despite this, the different Bourne-like shells that implement
it have converged to the same interface, but <em>only</em> the interface.
Each implementation differs in interesting ways. In this article we’ll
explore how <code class="language-plaintext highlighter-rouge">$RANDOM</code> is implemented in various Bourne-like shells.</p>

<p><del>Unfortunately I was unable to determine the origin of <code class="language-plaintext highlighter-rouge">$RANDOM</code>.</del>
Nobody was doing a good job tracking source code changes before the
mid-1990s, so that history appears to be lost. Bash was first released
in 1989, but the earliest version I could find was 1.14.7, released in 1996.
KornShell was first released in 1983, but the earliest source I could
find <a href="https://web.archive.org/web/20120613182836/http://www.research.att.com/sw/download/man/man1/ksh.html">was from 1993</a>. In both cases <code class="language-plaintext highlighter-rouge">$RANDOM</code> already existed. My
guess is that it first appeared in one of these two shells, probably
KornShell.</p>

<p><strong>Update</strong>: Quentin Barnes has informed me that his 1986 copy of
KornShell (a.k.a. ksh86) implements <code class="language-plaintext highlighter-rouge">$RANDOM</code>. This predates Bash and
makes it likely that this feature originated in KornShell.</p>

<h3 id="bash">Bash</h3>

<p>Of all the shells I’m going to discuss, Bash has the most interesting
history. It never made use of <code class="language-plaintext highlighter-rouge">srand(3)</code> / <code class="language-plaintext highlighter-rouge">rand(3)</code> and instead
uses its own generator — which is generally <a href="/blog/2017/09/21/">what I prefer</a>. Prior
to Bash 4.0, it used the crummy linear congruential generator (LCG)
<a href="http://port70.net/~nsz/c/c89/c89-draft.html#4.10.2.2">found in the C89 standard</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">rseed</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">brand</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="n">rseed</span> <span class="o">=</span> <span class="n">rseed</span> <span class="o">*</span> <span class="mi">1103515245</span> <span class="o">+</span> <span class="mi">12345</span><span class="p">;</span>
  <span class="k">return</span> <span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">)((</span><span class="n">rseed</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">32767</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For some reason it was naïvely decided that <code class="language-plaintext highlighter-rouge">$RANDOM</code> should never
produce the same value twice in a row. The caller of <code class="language-plaintext highlighter-rouge">brand()</code> filters
the output and discards repeats before returning to the shell script.
This actually <em>reduces</em> the quality of the generator further since it
increases correlation between separate outputs.</p>
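
<p>The filtering amounts to a retry loop around the generator. The sketch below is a simplified stand-in rather than Bash’s source: <code class="language-plaintext highlighter-rouge">next_random()</code> and <code class="language-plaintext highlighter-rouge">last_value</code> are illustrative names.</p>

```c
static unsigned long rseed = 1;
static int last_value = -1;  /* hypothetical name; -1 means "no output yet" */

/* The C89 sample LCG, as in the listing above. */
static int brand(void)
{
    rseed = rseed * 1103515245 + 12345;
    return (int)((rseed >> 16) & 32767);
}

/* Keep drawing until the result differs from the previous output. */
int next_random(void)
{
    int r;
    do {
        r = brand();
    } while (r == last_value);
    last_value = r;
    return r;
}
```

<p>Since a value can never follow itself, each output rules out one of the 32,768 possibilities for the next, which is exactly the kind of dependence between outputs a generator should avoid.</p>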

<p>When the shell starts up, <code class="language-plaintext highlighter-rouge">rseed</code> is seeded from the PID and the current
time in seconds. These values are literally summed and used as the seed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Note: not the literal code, but equivalent. */</span>
<span class="n">rseed</span> <span class="o">=</span> <span class="n">getpid</span><span class="p">()</span> <span class="o">+</span> <span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Subshells, which fork and initially share an <code class="language-plaintext highlighter-rouge">rseed</code>, are given similar
treatment:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rseed</span> <span class="o">=</span> <span class="n">rseed</span> <span class="o">+</span> <span class="n">getpid</span><span class="p">()</span> <span class="o">+</span> <span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Notice there’s no <a href="/blog/2018/07/31/">hashing</a> or <a href="http://www.pcg-random.org/posts/developing-a-seed_seq-alternative.html">mixing</a> of these values, so
there’s no avalanche effect. That would have prevented shells that start
around the same time from having related initial random sequences.</p>

<p>With Bash 4.0, released in 2009, the algorithm was changed to a
<a href="http://www.firstpr.com.au/dsp/rand31/p1192-park.pdf">Park–Miller multiplicative LCG</a> from 1988:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span>
<span class="nf">brand</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="kt">long</span> <span class="n">h</span><span class="p">,</span> <span class="n">l</span><span class="p">;</span>

  <span class="cm">/* can't seed with 0. */</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">rseed</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">rseed</span> <span class="o">=</span> <span class="mi">123459876</span><span class="p">;</span>
  <span class="n">h</span> <span class="o">=</span> <span class="n">rseed</span> <span class="o">/</span> <span class="mi">127773</span><span class="p">;</span>
  <span class="n">l</span> <span class="o">=</span> <span class="n">rseed</span> <span class="o">%</span> <span class="mi">127773</span><span class="p">;</span>
  <span class="n">rseed</span> <span class="o">=</span> <span class="mi">16807</span> <span class="o">*</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">2836</span> <span class="o">*</span> <span class="n">h</span><span class="p">;</span>
  <span class="k">return</span> <span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">)(</span><span class="n">rseed</span> <span class="o">&amp;</span> <span class="mi">32767</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s actually a subtle mistake in this implementation compared to the
generator described in the paper. This function will generate different
numbers than the paper, and it will generate different numbers on
different hosts! More on that later.</p>

<p>This algorithm is a <a href="http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html">much better choice</a> than the previous LCG.
There were many more options available in 2009 compared to 1989, but,
honestly, this generator is pretty reasonable for this application.
Bash is <em>so slow</em> that you’re never practically going to generate
enough numbers for the small state to matter. Since the Park–Miller
algorithm is older than Bash, they could have used this in the first
place.</p>

<p>I considered submitting a patch to switch to something more modern.
However, given Bash’s constraints, it’s easier said than done.
Portability to weird systems is still a concern, and I expect they’d
reject a patch that started making use of <code class="language-plaintext highlighter-rouge">long long</code> in the PRNG.
They still support pre-ANSI C compilers that don’t have 64-bit
arithmetic.</p>

<p>However, what still really <em>could</em> be improved is seeding. In Bash 4.x
here’s what it looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">seedrand</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv</span><span class="p">;</span>

  <span class="n">gettimeofday</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">tv</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
  <span class="n">sbrand</span> <span class="p">(</span><span class="n">tv</span><span class="p">.</span><span class="n">tv_sec</span> <span class="o">^</span> <span class="n">tv</span><span class="p">.</span><span class="n">tv_usec</span> <span class="o">^</span> <span class="n">getpid</span> <span class="p">());</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Seeding is both better and worse. It’s better that it’s seeded from a
higher resolution clock (microseconds), so two shells started close in
time have more variation. However, it’s “mixed” with XOR, which, in
this case, is worse than addition.</p>

<p>For example, imagine two Bash shells started one microsecond apart. Both
<code class="language-plaintext highlighter-rouge">tv_usec</code> and <code class="language-plaintext highlighter-rouge">getpid()</code> are incremented by one. Those increments are
likely to cancel each other out by an XOR, and they end up with the same
seed.</p>
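
<p>A tiny experiment makes the cancellation visible. The two helpers below are hypothetical stand-ins for the mixing step, not Bash code, and the sample time and PID values are arbitrary.</p>

```c
/* Hypothetical helpers mirroring the two mixing strategies. */
unsigned long seed_xor(unsigned long sec, unsigned long usec,
                       unsigned long pid)
{
    return sec ^ usec ^ pid;
}

unsigned long seed_add(unsigned long sec, unsigned long usec,
                       unsigned long pid)
{
    return sec + usec + pid;
}
```

<p>With an even <code class="language-plaintext highlighter-rouge">tv_usec</code> of 500000 and an even PID of 4242, incrementing both only sets the lowest bit of each, and those two bits cancel under XOR: <code class="language-plaintext highlighter-rouge">seed_xor(t, 500000, 4242)</code> equals <code class="language-plaintext highlighter-rouge">seed_xor(t, 500001, 4243)</code> for any <code class="language-plaintext highlighter-rouge">t</code>. The additive versions differ by two.</p>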

<p>Instead, each of those quantities should be hashed before mixing. Here’s
a rough example using my <a href="https://github.com/skeeto/hash-prospector#three-round-functions"><code class="language-plaintext highlighter-rouge">triple32()</code> hash</a> (adapted to glorious
GNU-style pre-ANSI C):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">long</span>
<span class="n">hash32</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span>
     <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">x</span><span class="p">;</span>
<span class="p">{</span>
  <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xed5ad4bbUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xac4c1b51UL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x31848babUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">14</span><span class="p">;</span>
  <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">seedrand</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv</span><span class="p">;</span>

  <span class="n">gettimeofday</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">tv</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
  <span class="n">sbrand</span> <span class="p">(</span><span class="n">hash32</span> <span class="p">(</span><span class="n">tv</span><span class="p">.</span><span class="n">tv_sec</span><span class="p">)</span> <span class="o">^</span>
          <span class="n">hash32</span> <span class="p">(</span><span class="n">hash32</span> <span class="p">(</span><span class="n">tv</span><span class="p">.</span><span class="n">tv_usec</span><span class="p">)</span> <span class="o">^</span> <span class="n">getpid</span> <span class="p">()));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I had said there’s a mistake in the Bash implementation of
Park–Miller. Take a closer look at the types and the assignment to
<code class="language-plaintext highlighter-rouge">rseed</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="cm">/* The variables */</span>
  <span class="kt">long</span> <span class="n">h</span><span class="p">,</span> <span class="n">l</span><span class="p">;</span>
  <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">rseed</span><span class="p">;</span>

  <span class="cm">/* The assignment */</span>
  <span class="n">rseed</span> <span class="o">=</span> <span class="mi">16807</span> <span class="o">*</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">2836</span> <span class="o">*</span> <span class="n">h</span><span class="p">;</span>
</code></pre></div></div>

<p>The result of the subtraction can be negative, and that negative
value is converted to <code class="language-plaintext highlighter-rouge">unsigned long</code>. The C standard says
<code class="language-plaintext highlighter-rouge">ULONG_MAX + 1</code> is added to make the value positive. <code class="language-plaintext highlighter-rouge">ULONG_MAX</code>
varies by platform — typically <code class="language-plaintext highlighter-rouge">long</code> is either 32 bits or 64 bits —
so the results also vary. Here’s how the paper defined it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">long</span> <span class="n">test</span><span class="p">;</span>

  <span class="n">test</span> <span class="o">=</span> <span class="mi">16807</span> <span class="o">*</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">2836</span> <span class="o">*</span> <span class="n">h</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">test</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">rseed</span> <span class="o">=</span> <span class="n">test</span><span class="p">;</span>
  <span class="k">else</span>
    <span class="n">rseed</span> <span class="o">=</span> <span class="n">test</span> <span class="o">+</span> <span class="mi">2147483647</span><span class="p">;</span>
</code></pre></div></div>

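<p>As far as I can tell, this mistake doesn’t hurt the quality of the
generator. It only makes the output depend on the width of <code class="language-plaintext highlighter-rouge">long</code>,
as a 32-bit and a 64-bit build of Bash show:</p>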

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ 32/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 13634

$ 64/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 29115
</code></pre></div></div>
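
<p>The divergence can be reproduced on a single machine by modeling the two widths of <code class="language-plaintext highlighter-rouge">long</code> with fixed-width types. This is a sketch under that assumption: <code class="language-plaintext highlighter-rouge">brand32()</code> and <code class="language-plaintext highlighter-rouge">brand64()</code> are illustrative names, and the seed is passed by pointer instead of living in a global.</p>

```c
#include <stdint.h>

/* Bash 4.x brand() with long and unsigned long modeled as 32 bits. */
int brand32(uint32_t *rseed)
{
    int32_t h, l;

    if (*rseed == 0)
        *rseed = 123459876;
    h = (int32_t)(*rseed / 127773);
    l = (int32_t)(*rseed % 127773);
    /* A negative difference wraps modulo 2^32 here. */
    *rseed = (uint32_t)(16807 * l - 2836 * h);
    return (int)(*rseed & 32767);
}

/* The same function with long and unsigned long modeled as 64 bits. */
int brand64(uint64_t *rseed)
{
    int64_t h, l;

    if (*rseed == 0)
        *rseed = 123459876;
    h = (int64_t)(*rseed / 127773);
    l = (int64_t)(*rseed % 127773);
    /* A negative difference wraps modulo 2^64 here. */
    *rseed = (uint64_t)(16807 * l - 2836 * h);
    return (int)(*rseed & 32767);
}
```

<p>Seeding both with 127773 makes the very first step negative (<code class="language-plaintext highlighter-rouge">h</code> is 1 and <code class="language-plaintext highlighter-rouge">l</code> is 0). The low 15 bits of the wrapped value agree, so both return 29932, but the full states differ and the second outputs are 13634 and 29115 respectively, matching the two Bash builds.</p>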

<h3 id="zsh">Zsh</h3>

<p>In contrast to Bash, Zsh is the most straightforward: defer to
<code class="language-plaintext highlighter-rouge">rand(3)</code>. Its <code class="language-plaintext highlighter-rouge">$RANDOM</code> can return the same value twice in a row,
assuming that <code class="language-plaintext highlighter-rouge">rand(3)</code> does.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">zlong</span>
<span class="nf">randomgetfn</span><span class="p">(</span><span class="n">UNUSED</span><span class="p">(</span><span class="n">Param</span> <span class="n">pm</span><span class="p">))</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">rand</span><span class="p">()</span> <span class="o">&amp;</span> <span class="mh">0x7fff</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">randomsetfn</span><span class="p">(</span><span class="n">UNUSED</span><span class="p">(</span><span class="n">Param</span> <span class="n">pm</span><span class="p">),</span> <span class="n">zlong</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">srand</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">)</span><span class="n">v</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A cool consequence is that you could override it, if you wanted, with <a href="https://xkcd.com/221/">a
custom generator</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">rand</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// chosen by fair dice roll.</span>
              <span class="c1">// guaranteed to be random.</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -fPIC -o rand.so rand.c
$ LD_PRELOAD=./rand.so zsh -c 'echo $RANDOM $RANDOM $RANDOM'
4 4 4
</code></pre></div></div>

<p>This trick also applies to the rest of the shells below.</p>

<h3 id="kornshell-ksh">KornShell (ksh)</h3>

<p>KornShell originated in 1983, but it was finally released under an open
source license in 2005. There’s a clone of KornShell called Public
Domain Korn Shell (pdksh) that’s been forked a dozen different ways, but
I’ll get to that next.</p>

<p>KornShell defers to <code class="language-plaintext highlighter-rouge">rand(3)</code>, but it does some additional naïve
filtering on the output. When the shell starts up, it generates 10
values from <code class="language-plaintext highlighter-rouge">rand()</code>. If any of them are larger than 32,767 then it
shifts all generated numbers right by three.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define RANDMASK 0x7fff
</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// Don't use lower bits when rand() generates large numbers.</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">&gt;</span> <span class="n">RANDMASK</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">rand_shift</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Why not just look at <code class="language-plaintext highlighter-rouge">RAND_MAX</code>? I guess they didn’t think of it.</p>

<p><strong>Update</strong>: Quentin Barnes pointed out that <code class="language-plaintext highlighter-rouge">RAND_MAX</code> didn’t exist
until POSIX standardization in 1988. The constant <a href="https://github.com/dspinellis/unix-history-repo/commit/1cc1b02a4361">first appeared in
Unix in 1990</a>. This KornShell code either predates the standard
or needed to work on systems that predate the standard.</p>
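
<p>On a system that does define <code class="language-plaintext highlighter-rouge">RAND_MAX</code>, the probe could be replaced with a direct check. The following is a hypothetical alternative, not KornShell code, and both function names are made up.</p>

```c
#include <stdlib.h>

#define RANDMASK 0x7fff

/* Hypothetical replacement for the 10-sample probe: decide the shift
 * from the advertised range instead of sampling rand(). */
int rand_shift_from_max(long rand_max)
{
    return rand_max > RANDMASK ? 3 : 0;
}

/* Produce a 15-bit value the same way the shell does afterwards. */
int ksh_like_random(void)
{
    return (rand() >> rand_shift_from_max(RAND_MAX)) & RANDMASK;
}
```

<p>This removes the small chance that all ten samples happen to fall below 32,768 on a platform whose <code class="language-plaintext highlighter-rouge">rand()</code> has a wider range.</p>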

<p>Like Bash, repeated values are not allowed. I suspect one shell got this
idea from the other.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">do</span> <span class="p">{</span>
        <span class="n">cur</span> <span class="o">=</span> <span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">&gt;&gt;</span> <span class="n">rand_shift</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">RANDMASK</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">cur</span> <span class="o">==</span> <span class="n">last</span><span class="p">);</span>
</code></pre></div></div>

<p>Who came up with this strange idea first?</p>

<h3 id="openbsds-public-domain-korn-shell-pdksh">OpenBSD’s Public Domain Korn Shell (pdksh)</h3>

<p>I picked the OpenBSD variant of pdksh since it’s the only pdksh fork I
ever touch in practice, and its <code class="language-plaintext highlighter-rouge">$RANDOM</code> is the most interesting of the
pdksh forks — at least since 2014.</p>

<p>Like Zsh, pdksh simply defers to <code class="language-plaintext highlighter-rouge">rand(3)</code>. However, OpenBSD’s <code class="language-plaintext highlighter-rouge">rand(3)</code>
is <a href="https://marc.info/?l=openbsd-tech&amp;m=141807224826859&amp;w=2">infamously and proudly non-standard</a>. By default it returns
<em>non-deterministic</em>, cryptographic-quality results seeded from system
entropy (via the misnamed <a href="https://man.openbsd.org/arc4random.3"><code class="language-plaintext highlighter-rouge">arc4random(3)</code></a>), à la <code class="language-plaintext highlighter-rouge">/dev/urandom</code>.
Its <code class="language-plaintext highlighter-rouge">$RANDOM</code> inherits this behavior.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">setint</span><span class="p">(</span><span class="n">vp</span><span class="p">,</span> <span class="p">(</span><span class="kt">int64_t</span><span class="p">)</span> <span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">&amp;</span> <span class="mh">0x7fff</span><span class="p">));</span>
</code></pre></div></div>

<p>However, if a value is assigned to <code class="language-plaintext highlighter-rouge">$RANDOM</code> in order to seed it, it
reverts to its old pre-2014 deterministic generation via
<a href="https://man.openbsd.org/rand"><code class="language-plaintext highlighter-rouge">srand_deterministic(3)</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">srand_deterministic</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">)</span><span class="n">intval</span><span class="p">(</span><span class="n">vp</span><span class="p">));</span>
</code></pre></div></div>

<p>OpenBSD’s deterministic <code class="language-plaintext highlighter-rouge">rand(3)</code> is the crummy LCG from the C89
standard, just like Bash 3.x. So if you assign to <code class="language-plaintext highlighter-rouge">$RANDOM</code>, you’ll get
nearly the same results as Bash 3.x and earlier — the only difference
being that it can repeat numbers.</p>

<p>That’s a slick upgrade to the old interface without breaking anything,
making it my favorite version of <code class="language-plaintext highlighter-rouge">$RANDOM</code> for any shell.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>A JIT Compiler Skirmish with SELinux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/11/15/"/>
    <id>urn:uuid:d4fa35ad-05c3-3b86-1083-d533dfacfb15</id>
    <updated>2018-11-15T18:57:47Z</updated>
    <category term="c"/><category term="linux"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>This is a debugging war story.</p>

<p>Once upon a time I wrote a fancy data conversion utility. The input
was a complex binary format defined by a data dictionary supplied at
run time by the user alongside the input data. Since the converter was
typically used to process massive quantities of input, and the nature
of that input wasn’t known until run time, I wrote <a href="/blog/2015/03/19/">an x86-64 JIT
compiler</a> to speed it up. The converter generated a fast, native
binary parser in memory according to the data dictionary
specification. Processing data now took much less time and everyone
rejoiced.</p>

<p>Then along came SELinux, Sheriff of Pedantry. Not liking all the
shenanigans with page protections, SELinux huffed and puffed and made
<code class="language-plaintext highlighter-rouge">mprotect(2)</code> return <code class="language-plaintext highlighter-rouge">EACCES</code> (“Permission denied”). Believing I was
following all the rules and so this would never happen, I foolishly
did not check the result and the converter was now crashing for its
users. What made SELinux so unhappy, and could this somehow be
resolved?</p>

<h3 id="allocating-memory">Allocating memory</h3>

<p>Before going further, let’s back up and review how this works. Suppose I
want to generate code at run time and execute it. In the old days this
was as simple as writing some machine code into a buffer and jumping to
that buffer — e.g. by converting the buffer to a function pointer and
calling it.</p>

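<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span>
<span class="k">typedef</span> <span class="nf">int</span> <span class="p">(</span><span class="o">*</span><span class="n">jit_func</span><span class="p">)(</span><span class="kt">void</span><span class="p">);</span>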

<span class="cm">/* NOTE: This doesn't work anymore! */</span>
<span class="n">jit_func</span>
<span class="nf">jit_compile</span><span class="p">(</span><span class="kt">int</span> <span class="n">retval</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">6</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* mov eax, retval */</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xb8</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
        <span class="cm">/* ret */</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xc3</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">jit_func</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">jit_func</span> <span class="n">f</span> <span class="o">=</span> <span class="n">jit_compile</span><span class="p">(</span><span class="mi">1001</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"f() = %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">f</span><span class="p">());</span>
    <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This situation was far too easy for malicious actors to abuse. An
attacker could supply instructions of their own choosing — i.e. <em>shell
code</em> — as input and exploit a buffer overflow vulnerability to execute
the input buffer. These exploits were trivial to craft.</p>

<p>Modern systems have hardware checks to prevent this from happening.
Memory containing instructions must have its execute protection bit
set before those instructions can be executed. This is useful both for
making attackers work harder and for catching bugs in programs — no more
executing data by accident.</p>

<p>This is further complicated by the fact that memory protections have
page granularity. You can’t adjust the protections for a 6-byte
buffer. You do it for the entire surrounding page — typically 4kB, but
sometimes as large as 2MB. This requires replacing that <code class="language-plaintext highlighter-rouge">malloc(3)</code>
with a more careful allocation strategy. There are a few ways to go
about this.</p>

<h4 id="anonymous-memory-mapping">Anonymous memory mapping</h4>

<p>The most common and most sensible is to create an anonymous memory
mapping: a file memory map that’s not actually backed by a file. The
<code class="language-plaintext highlighter-rouge">mmap(2)</code> function has a flag specifically for this purpose:
<code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">anon_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately, <code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> is not part of POSIX. If you’re being super
strict with your includes — <a href="/blog/2017/03/30/">as I tend to be</a> — this flag won’t be
defined, even on systems where it’s supported.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span><span class="c1">// MAP_ANONYMOUS undefined!</span>
</code></pre></div></div>

<p>To get the flag, you must use the <code class="language-plaintext highlighter-rouge">_BSD_SOURCE</code>, or, more recently,
the <code class="language-plaintext highlighter-rouge">_DEFAULT_SOURCE</code> feature test macro to explicitly enable that
feature.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#define _DEFAULT_SOURCE </span><span class="cm">/* for MAP_ANONYMOUS */</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>The POSIX way to do this is to instead map <code class="language-plaintext highlighter-rouge">/dev/zero</code>. <strong>So, wanting to
be Mr. Portable, this is what I did in my tool.</strong> Take careful note of
this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="s">"/dev/zero"</span><span class="p">,</span> <span class="n">O_RDWR</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="aligned-allocation">Aligned allocation</h4>

<p>Another, less common (and less portable) strategy is to lean on the
existing C memory allocator, being careful to allocate on page
boundaries so that the page protections don’t affect other allocations.
The classic allocation functions, like <code class="language-plaintext highlighter-rouge">malloc(3)</code>, don’t allow for this
kind of control. However, there are a couple of aligned allocation
alternatives.</p>

<p>The first is <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">posix_memalign</span><span class="p">(</span><span class="kt">void</span> <span class="o">**</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">alignment</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>By choosing page alignment and a size that’s a multiple of the page
size, it’s guaranteed to return whole pages. When done, pages are freed
with <code class="language-plaintext highlighter-rouge">free(3)</code>. Though, unlike unmapping, the original page protections
must first be restored since those pages may be reused.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">pagesize</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGE_SIZE</span><span class="p">);</span> <span class="c1">// TODO: cache this</span>
    <span class="kt">size_t</span> <span class="n">roundup</span> <span class="o">=</span> <span class="p">(</span><span class="n">len</span> <span class="o">+</span> <span class="n">pagesize</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">pagesize</span> <span class="o">*</span> <span class="n">pagesize</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">posix_memalign</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="n">pagesize</span><span class="p">,</span> <span class="n">roundup</span><span class="p">)</span> <span class="o">?</span> <span class="mi">0</span> <span class="o">:</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
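
<p>The note above about restoring protections before <code class="language-plaintext highlighter-rouge">free(3)</code> can be sketched as a matching release function. This is only an illustration: the name <code class="language-plaintext highlighter-rouge">anon_free</code>, its return convention, and the choice to leak the pages when the restore fails are assumptions, not code from the original tool.</p>

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical counterpart to the posix_memalign(3) anon_alloc() above.
 * Returns 0 on success, -1 if the protections could not be restored. */
int
anon_free(void *p, size_t len)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;

    /* Make the pages readable and writable again before the allocator
     * reuses them for ordinary data. */
    if (mprotect(p, roundup, PROT_READ | PROT_WRITE) == -1)
        return -1; /* assumption: leak rather than return unwritable pages */

    free(p);
    return 0;
}
```

<p>The length passed in must be the same one given to the allocator so that both functions round up to the same number of pages.</p>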

<p>If you’re using C11, there’s also <code class="language-plaintext highlighter-rouge">aligned_alloc(3)</code>. This is the most
uncommon of all since most C programmers refuse to switch to a new
standard until it’s at least old enough to drive a car.</p>
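
<p>For comparison, here is a minimal sketch of the same allocator built on <code class="language-plaintext highlighter-rouge">aligned_alloc(3)</code>. The function name <code class="language-plaintext highlighter-rouge">anon_alloc_c11</code> is made up for this example, and it assumes a C11 library plus POSIX <code class="language-plaintext highlighter-rouge">sysconf(3)</code> for the page size.</p>

```c
#define _POSIX_C_SOURCE 200112L /* for sysconf(3) */
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical C11 variant of anon_alloc(). Returns whole pages, or a
 * null pointer on failure. Release the result with free(3). */
void *
anon_alloc_c11(size_t len)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    /* C11 expects the size to be a multiple of the alignment. */
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
    return aligned_alloc(pagesize, roundup);
}
```

<p>As with <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code>, the pages come from the C allocator, so any protection changes should be undone before calling <code class="language-plaintext highlighter-rouge">free(3)</code>.</p>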

<h3 id="changing-page-protections">Changing page protections</h3>

<p>So we’ve allocated our memory, but it’s not going to start in an
executable state. Why? Because a <a href="https://en.wikipedia.org/wiki/W%5EX">W^X</a> (“write xor execute”)
policy is becoming increasingly common. Attempting to set both write
and execute protections at the same time may be denied. (In fact,
there’s an SELinux policy for this.)</p>

<p>As a JIT compiler, we need to write to a page <em>and</em> execute it. Again,
there are two strategies. The complicated strategy is to <a href="/blog/2016/04/10/">map the same
memory at two different places</a>, one with the execute protection,
one with the write protection. This allows the page to be modified as
it’s being executed without violating W^X.</p>

<p>The simpler and more secure strategy is to write the machine
instructions, then swap the page over to executable using <code class="language-plaintext highlighter-rouge">mprotect(2)</code>
once it’s ready. This is what I was doing in my tool.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">anon_alloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="cm">/* ... write instructions into the buffer ... */</span>
<span class="n">mprotect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="n">jit_func</span> <span class="n">func</span> <span class="o">=</span> <span class="p">(</span><span class="n">jit_func</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="n">func</span><span class="p">();</span>
</code></pre></div></div>

<p>At a high level, that’s pretty close to what I was actually doing. That
includes neglecting to check the result of <code class="language-plaintext highlighter-rouge">mprotect(2)</code>. This worked
fine and dandy for several years, when suddenly (shown here in the style
<a href="/blog/2018/06/23/">of strace</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mprotect(ptr, len, PROT_EXEC) = -1 EACCES (Permission denied)
</code></pre></div></div>

<p>Then the program would crash trying to execute the buffer. Suddenly it
wasn’t allowed to make this buffer executable. My program hadn’t
changed. What <em>had</em> changed was the SELinux security policy on this
particular system.</p>

<h3 id="asking-for-help">Asking for help</h3>

<p>The problem is that I don’t administer this (Red Hat) system. I can’t
access the logs and I didn’t set the policy. I don’t have any insight
on <em>why</em> this call was suddenly being denied. To make this more
challenging, the folks that manage this system didn’t have the
necessary knowledge to help with this either.</p>

<p>So to figure this out, I need to treat it like a black box and probe
at system calls until I can figure out just what SELinux policy I’m up
against. I only have practical experience administering Debian
systems (and derivatives like Ubuntu), which means I’ve hardly
ever had to deal with SELinux. I’m flying fairly blind here.</p>

<p>Since my real application is large and complicated, I code up a
minimal example, around a dozen lines of code: allocate a single page
of memory, write a single return (<code class="language-plaintext highlighter-rouge">ret</code>) instruction into it, set it
as executable, and call it. The program checks for errors, and I can
run it under strace if that’s not insightful enough. This program is
also something simple I could provide to the system administrators,
since they were willing to turn some of the knobs to help narrow down
the problem.</p>

<p>However, <strong>here’s where I made a major mistake</strong>. Assuming the problem
was solely in <code class="language-plaintext highlighter-rouge">mprotect(2)</code>, and wanting to keep this as absolutely
simple as possible, I used <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code> to allocate that page. I
saw the same <code class="language-plaintext highlighter-rouge">EACCES</code> as before, and assumed I was demonstrating the
same problem. Take note of this, too.</p>

<h3 id="finding-a-resolution">Finding a resolution</h3>

<p>Eventually I’d need to figure out what policy was blocking my JIT
compiler, then see if there was an alternative route. The system
loader still worked after all, and I could plainly see that with
strace. So it wasn’t a blanket policy that completely blocked the
execute protection. Perhaps the loader was given an exception?</p>

<p>However, the very first order of business was to actually check the
result from <code class="language-plaintext highlighter-rouge">mprotect(2)</code> and do something more graceful rather than
crash. In my case, that meant falling back to executing a byte-code
virtual machine. I added the check, and now the program ran slower
instead of crashing.</p>
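
<p>A rough sketch of that check follows. The helper name <code class="language-plaintext highlighter-rouge">make_executable</code> and the error message are assumptions for illustration, not the tool’s actual code. It also requests <code class="language-plaintext highlighter-rouge">PROT_READ</code> alongside <code class="language-plaintext highlighter-rouge">PROT_EXEC</code>, which differs slightly from the earlier snippet.</p>

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: try to make a finished code buffer executable.
 * Returns 1 on success. Returns 0 if the protection change was refused,
 * in which case the caller must not jump into buf. */
int
make_executable(void *buf, size_t len)
{
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) == -1) {
        fprintf(stderr, "mprotect: %s\n", strerror(errno));
        return 0;
    }
    return 1;
}
```

<p>The caller would branch on the result: call through the <code class="language-plaintext highlighter-rouge">jit_func</code> pointer when it returns 1, and hand the same input to the slower interpreter when it returns 0.</p>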

<p>The program runs on both Linux and Windows, and the allocation and
page protection management is abstracted. On Windows it uses
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> and <code class="language-plaintext highlighter-rouge">VirtualProtect()</code> instead of <code class="language-plaintext highlighter-rouge">mmap(2)</code> and
<code class="language-plaintext highlighter-rouge">mprotect(2)</code>. Neither implementation checked that the protection
change succeeded, so I fixed the Windows implementation while I was at
it.</p>

<p>Thanks to Mingw-w64, I actually do most of my <a href="/blog/2016/06/13/">Windows
development</a> on Linux. And, thanks to <a href="https://www.winehq.org/">Wine</a>, I mean
everything, including running and debugging. Calling
<code class="language-plaintext highlighter-rouge">VirtualProtect()</code> in Wine would ultimately call <code class="language-plaintext highlighter-rouge">mprotect(2)</code> in the
background, which I expected would be denied. So running the Windows
version with Wine under this SELinux policy would be the perfect test.
Right?</p>

<p><strong>Except that <code class="language-plaintext highlighter-rouge">mprotect(2)</code> succeeded under Wine!</strong> The Windows version
of my JIT compiler was working just fine on Linux. Huh?</p>

<p>This system doesn’t have Wine installed. I had built <a href="/blog/2018/03/27/">and packaged it
myself</a>. This Wine build definitely has no SELinux exceptions.
Not only did the Wine loader work correctly, it can change page
protections in ways my own Linux programs could not. What’s different?</p>

<p>Debugging this with all these layers is starting to look silly, but
this is exactly why doing Windows development on Linux is so useful. I
run my program under Wine under strace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace wine ./mytool.exe
</code></pre></div></div>

<p>I study the system calls around <code class="language-plaintext highlighter-rouge">mprotect(2)</code>. Perhaps there’s some
stricter alignment issue? No. Perhaps I need to include <code class="language-plaintext highlighter-rouge">PROT_READ</code>?
No. The only difference I can find is they’re using the
<code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> flag. So, armed with this knowledge, <strong>I modify my
minimal example to allocate 1024 pages instead of just one, and
suddenly it works correctly</strong>. I was most of the way to figuring this
all out.</p>

<h3 id="inside-glibc-allocation">Inside glibc allocation</h3>

<p>Why did increasing the allocation size change anything? This is a
typical Linux system, so my program is linked against the GNU C
library, glibc. This library allocates memory from two places
depending on the allocation size.</p>

<p>For small allocations, glibc uses <code class="language-plaintext highlighter-rouge">brk(2)</code> to move the program
break, growing the heap that sits just past the executable image’s <code class="language-plaintext highlighter-rouge">.bss</code> section. These resources are not
returned to the operating system after they’re freed with <code class="language-plaintext highlighter-rouge">free(3)</code>.
They’re reused.</p>

<p>For large allocations, glibc uses <code class="language-plaintext highlighter-rouge">mmap(2)</code> to create a new, anonymous
mapping for that allocation. When freed with <code class="language-plaintext highlighter-rouge">free(3)</code>, that memory is
unmapped and its resources are returned to the operating system.</p>

<p>By increasing the allocation size, it became a “large” allocation and
was backed by an anonymous mapping. Even though I didn’t use <code class="language-plaintext highlighter-rouge">mmap(2)</code>,
to the operating system this would be indistinguishable from what Wine was
doing (and succeeding at).</p>

<p>Consider this little example program:</p>

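<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span>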
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When <em>not</em> compiled as a Position Independent Executable (PIE), here’s
what the output looks like. The first pointer is near where the program
was loaded, low in memory. The second pointer is a randomly selected
address high in memory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x1077010
0x7fa9b998e010
</code></pre></div></div>

<p>And if you run it under strace, you’ll see that the first allocation
comes from <code class="language-plaintext highlighter-rouge">brk(2)</code> and the second comes from <code class="language-plaintext highlighter-rouge">mmap(2)</code>.</p>

<h3 id="two-selinux-policies">Two SELinux policies</h3>

<p>With a little bit of research, I found the <a href="https://akkadia.org/drepper/selinux-mem.html">two SELinux policies</a>
at play here. In my minimal example, I was blocked by <code class="language-plaintext highlighter-rouge">allow_execheap</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/selinux/booleans/allow_execheap
</code></pre></div></div>

<p>This prohibits programs from setting the execute protection on any
“heap” page.</p>

<blockquote>
  <p>The POSIX specification does not permit it, but the Linux
implementation of <code class="language-plaintext highlighter-rouge">mprotect</code> allows changing the access protection of
memory on the heap (e.g., allocated using <code class="language-plaintext highlighter-rouge">malloc</code>). This error
indicates that heap memory was supposed to be made executable. Doing
this is really a bad idea. If anonymous, executable memory is needed
it should be allocated using <code class="language-plaintext highlighter-rouge">mmap</code> which is the only portable
mechanism.</p>
</blockquote>

<p>Obviously this is pretty loose since I was still able to do it with
<code class="language-plaintext highlighter-rouge">posix_memalign(3)</code>, which, technically speaking, allocates from the
heap. So this policy applies to pages mapped by <code class="language-plaintext highlighter-rouge">brk(2)</code>.</p>

<p>The second policy was <code class="language-plaintext highlighter-rouge">allow_execmod</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/selinux/booleans/allow_execmod
</code></pre></div></div>

<blockquote>
  <p>The program mapped from a file with <code class="language-plaintext highlighter-rouge">mmap</code> and the <code class="language-plaintext highlighter-rouge">MAP_PRIVATE</code> flag
and write permission. Then the memory region has been written to,
resulting in copy-on-write (COW) of the affected page(s). This memory
region is then made executable […]. The <code class="language-plaintext highlighter-rouge">mprotect</code> call will fail
with <code class="language-plaintext highlighter-rouge">EACCES</code> in this case.</p>
</blockquote>

<p>I don’t understand what purpose this policy serves, but this is what
was causing my original problem. Pages mapped to <code class="language-plaintext highlighter-rouge">/dev/zero</code> are not
<em>actually</em> considered anonymous by Linux, at least as far as this
policy is concerned. I think this is a mistake, and that mapping the
special <code class="language-plaintext highlighter-rouge">/dev/zero</code> device should result in effectively anonymous
pages.</p>

<p>From this I learned a little lesson about baking assumptions — that
<code class="language-plaintext highlighter-rouge">mprotect(2)</code> was solely at fault — into my minimal debugging examples.
And the fix was ultimately easy: I just had to suck it up and use the
slightly less pure <code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> flag.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Brute Force Incognito Browsing</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/09/06/"/>
    <id>urn:uuid:376eff98-5b58-30fd-d101-3dac9052bf82</id>
    <updated>2018-09-06T14:07:13Z</updated>
    <category term="linux"/><category term="debian"/><category term="trick"/><category term="web"/>
    <content type="html">
      <![CDATA[<p>Both Firefox and Chrome have a feature for creating temporary private
browsing sessions. Firefox calls it <a href="https://support.mozilla.org/en-US/kb/private-browsing-use-firefox-without-history">Private Browsing</a> and Chrome
calls it <a href="https://support.google.com/chrome/answer/95464">Incognito Mode</a>. Both work essentially the same way. A
temporary browsing session is started without carrying over most
existing session state (cookies, etc.), and no state (cookies,
browsing history, cached data, etc.) is preserved after ending the
session. Depending on the configuration, some browser extensions will
be enabled in the private session, and their own internal state may be
preserved.</p>

<p>The most obvious use is for visiting websites that you don’t want
listed in your browsing history. Another use for more savvy users is
to visit websites with a fresh, empty cookie file. For example, some
news websites use a cookie to track the number of visits and require a
subscription after a certain number of “free” articles. Manually
deleting cookies is a pain (especially without a specialized
extension), but opening the same article in a private session is two
clicks away.</p>

<p>For web development there’s yet another use. A private session is a way
to view your website from the perspective of a first-time visitor.
You’ll be logged out and will have little or no existing state.</p>

<p>However, sometimes <em>it just doesn’t go far enough</em>. Some of those news
websites have adapted, and in addition to counting the number of visits,
they’ve figured out how to detect private sessions and block them. I
haven’t looked into <em>how</em> they do this — maybe something to do with
local storage, or detecting previously cached content. Sometimes I want
a private session that’s <em>truly</em> fully isolated. The existing private
session features just aren’t isolated enough or they behave differently,
which is how they’re being detected.</p>

<p>Some time ago I put together a couple of scripts to brute force my own
private sessions when I need them, generally for testing websites in a
guaranteed fresh, fully-functioning instance. It also lets me run
multiple such sessions in parallel. My scripts don’t rely on any
private session feature of the browser, so the behavior is identical
to a real browser, making it undetectable.</p>

<p>The downside is that, for better or worse, no browser extensions are
carried over. In some ways this can be considered a feature, but a lot
of the time I would like my ad-blocker to carry over. Your ad-blocker is
probably <em>the</em> most important security software on your computer, so you
should hesitate to give it up.</p>

<p>Another downside is that both Firefox and Chrome have some irritating
first-time behaviors that can’t be disabled. The intent is to be
newbie-friendly but it just gets in my way. For example, both bug me
about logging into their browser platforms. Firefox starts with two
tabs. Chrome creates a popup to ask me to configure a printer. Both
start with a junk URL in the location bar so I can’t just middle-click
paste (i.e. the X11 selection clipboard) into it. It’s definitely not
designed for my use case.</p>

<h3 id="firefox">Firefox</h3>

<p>Here’s my brute force private session script for Firefox:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh -e</span>
<span class="nv">DIR</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">XDG_CACHE_HOME</span><span class="k">:-</span><span class="nv">$HOME</span><span class="p">/.cache</span><span class="k">}</span><span class="s2">"</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$DIR</span><span class="s2">"</span>
<span class="nv">TEMP</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">mktemp</span> <span class="nt">-d</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$DIR</span><span class="s2">/firefox-XXXXXX"</span><span class="si">)</span><span class="s2">"</span>
<span class="nb">trap</span> <span class="s2">"rm -rf -- '</span><span class="nv">$TEMP</span><span class="s2">'"</span> INT TERM EXIT
firefox <span class="nt">-profile</span> <span class="s2">"</span><span class="nv">$TEMP</span><span class="s2">"</span> <span class="nt">-no-remote</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
</code></pre></div></div>

<p>It creates a temporary directory under <code class="language-plaintext highlighter-rouge">$XDG_CACHE_HOME</code> and tells
Firefox to use the profile in that directory. No such profile exists,
of course, so Firefox creates a fresh profile.</p>

<p>In theory I could just create a <em>new</em> profile alongside the default
within my existing <code class="language-plaintext highlighter-rouge">~/.mozilla</code> directory. However, I’ve never liked
Firefox’s profile feature, especially with the intentionally
unpredictable way it stores the profile itself: behind a random path. I
also don’t trust it to be fully isolated and to fully clean up when I’m
done.</p>

<p>Before starting Firefox, I register a trap with the shell to clean up
the profile directory regardless of what happens. It doesn’t matter if
Firefox exits cleanly, if it crashes, or if I CTRL-C it to death.</p>

<p>The <code class="language-plaintext highlighter-rouge">-no-remote</code> option prevents the new Firefox instance from joining
onto an existing Firefox instance, which it <em>really</em> prefers to do even
though it’s technically supposed to be a different profile.</p>

<p>Note the <code class="language-plaintext highlighter-rouge">"$@"</code>, which passes arguments through to Firefox — most often
the URL of the site I want to test.</p>

<h3 id="chromium">Chromium</h3>

<p>I don’t actually use Chrome but rather the open source version,
Chromium. I think this script will also work with Chrome.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh -e</span>
<span class="nv">DIR</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">XDG_CACHE_HOME</span><span class="k">:-</span><span class="nv">$HOME</span><span class="p">/.cache</span><span class="k">}</span><span class="s2">"</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$DIR</span><span class="s2">"</span>
<span class="nv">TEMP</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">mktemp</span> <span class="nt">-d</span> <span class="nt">--</span> <span class="s2">"</span><span class="nv">$DIR</span><span class="s2">/chromium-XXXXXX"</span><span class="si">)</span><span class="s2">"</span>
<span class="nb">trap</span> <span class="s2">"rm -rf -- '</span><span class="nv">$TEMP</span><span class="s2">'"</span> INT TERM EXIT
chromium <span class="nt">--user-data-dir</span><span class="o">=</span><span class="s2">"</span><span class="nv">$TEMP</span><span class="s2">"</span> <span class="se">\</span>
         <span class="nt">--no-default-browser-check</span> <span class="se">\</span>
         <span class="nt">--no-first-run</span> <span class="se">\</span>
         <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span> <span class="o">&gt;</span>/dev/null 2&gt;&amp;1
</code></pre></div></div>

<p>It’s exactly the same as the Firefox script and only the browser
arguments have changed. I tell it not to ask about being the default
browser, and <code class="language-plaintext highlighter-rouge">--no-first-run</code> disables <em>some</em> of the irritating
first-time behaviors.</p>

<p>Chromium is <em>very</em> noisy on the command line, so I also redirect all
output to <code class="language-plaintext highlighter-rouge">/dev/null</code>.</p>

<p>If you’re on Debian like me, its version of Chromium comes with a
<code class="language-plaintext highlighter-rouge">--temp-profile</code> option that handles the throwaway profile
automatically. So the script can be simplified:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh -e</span>
chromium <span class="nt">--temp-profile</span> <span class="se">\</span>
         <span class="nt">--no-default-browser-check</span> <span class="se">\</span>
         <span class="nt">--no-first-run</span> <span class="se">\</span>
         <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span> <span class="o">&gt;</span>/dev/null 2&gt;&amp;1
</code></pre></div></div>

<p>In my own use case, these scripts have fully replaced the built-in
private session features. In fact, since Chromium is not my primary
browser, my brute force private session script is how I usually launch
it. I only run it to test things, and I always want to test using a
fresh profile.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Intercepting and Emulating Linux System Calls with Ptrace</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/06/23/"/>
    <id>urn:uuid:a39b7709-d0a6-3b12-159f-7445d9524594</id>
    <updated>2018-06-23T20:41:08Z</updated>
    <category term="linux"/><category term="x86"/><category term="c"/><category term="bsd"/>
    <content type="html">
      <![CDATA[<p>The <code class="language-plaintext highlighter-rouge">ptrace(2)</code> (“process trace”) system call is usually associated with
debugging. It’s the primary mechanism through which native debuggers
monitor debuggees on unix-like systems. It’s also the usual approach for
implementing <a href="https://blog.plover.com/Unix/strace-groff.html">strace</a> — system call trace. With Ptrace, tracers
can pause tracees, <a href="/blog/2016/09/03/">inspect and set registers and memory</a>, monitor
system calls, or even <em>intercept</em> system calls.</p>

<p>By intercept, I mean that the tracer can mutate system call arguments,
mutate the system call return value, or even block certain system calls.
Reading between the lines, this means a tracer can fully service system
calls itself. This is particularly interesting because it also means <strong>a
tracer can emulate an entire foreign operating system</strong>. This is done
without any special help from the kernel beyond Ptrace.</p>

<p>The catch is that a process can only have one tracer attached at a time,
so it’s not possible to emulate a foreign operating system while also
debugging that process with, say, GDB. The other issue is that emulated
system calls will have higher overhead.</p>

<p>For this article I’m going to focus on <a href="http://man7.org/linux/man-pages/man2/ptrace.2.html">Linux’s Ptrace</a> on
x86-64, and I’ll be taking advantage of a few Linux-specific extensions.
For the article I’ll also be omitting error checks, but the full source
code listings will have them.</p>

<p>You can find runnable code for the examples in this article here:</p>

<p><strong><a href="https://github.com/skeeto/ptrace-examples">https://github.com/skeeto/ptrace-examples</a></strong></p>

<h3 id="strace">strace</h3>

<p>Before getting into the really interesting stuff, let’s start by
reviewing a bare bones implementation of strace. It’s <a href="/blog/2018/01/17/">no
DTrace</a>, but strace is still incredibly useful.</p>

<p>Ptrace has never been standardized. Its interface is similar across
different operating systems, especially in its core functionality, but
it’s still subtly different from system to system. The <code class="language-plaintext highlighter-rouge">ptrace(2)</code>
prototype generally looks something like this, though the specific
types may be different.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">ptrace</span><span class="p">(</span><span class="kt">int</span> <span class="n">request</span><span class="p">,</span> <span class="n">pid_t</span> <span class="n">pid</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">pid</code> is the tracee’s process ID. While a tracee can have only one
tracer attached at a time, a tracer can be attached to many tracees.</p>

<p>The <code class="language-plaintext highlighter-rouge">request</code> field selects a specific Ptrace function, just like the
<code class="language-plaintext highlighter-rouge">ioctl(2)</code> interface. For strace, only three are needed:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_TRACEME</code>: This process is to be traced by its parent.</li>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code>: Continue, but stop at the next system call
entrance or exit.</li>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_GETREGS</code>: Get a copy of the tracee’s registers.</li>
</ul>

<p>The other two fields, <code class="language-plaintext highlighter-rouge">addr</code> and <code class="language-plaintext highlighter-rouge">data</code>, serve as generic arguments for
the selected Ptrace function. One or both are often ignored, in which
case I pass zero.</p>

<p>The strace interface is essentially a prefix to another command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace [strace options] program [arguments]
</code></pre></div></div>

<p>My minimal strace doesn’t have any options, so the first thing to do —
assuming it has at least one argument — is <code class="language-plaintext highlighter-rouge">fork(2)</code> and <code class="language-plaintext highlighter-rouge">exec(2)</code> the
tracee process on the tail of <code class="language-plaintext highlighter-rouge">argv</code>. But before loading the target
program, the new process will inform the kernel that it’s going to be
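traced by its parent. Strictly speaking, the request itself doesn’t
pause the tracee: the stop happens at the following <code class="language-plaintext highlighter-rouge">exec(2)</code>, which
delivers <code class="language-plaintext highlighter-rouge">SIGTRAP</code> to a traced process.</p>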

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pid_t</span> <span class="n">pid</span> <span class="o">=</span> <span class="n">fork</span><span class="p">();</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">pid</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="cm">/* error */</span>
        <span class="n">FATAL</span><span class="p">(</span><span class="s">"%s"</span><span class="p">,</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">));</span>
    <span class="k">case</span> <span class="mi">0</span><span class="p">:</span>  <span class="cm">/* child */</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_TRACEME</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">execvp</span><span class="p">(</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">argv</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">FATAL</span><span class="p">(</span><span class="s">"%s"</span><span class="p">,</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The parent waits for the child’s <code class="language-plaintext highlighter-rouge">PTRACE_TRACEME</code> using <code class="language-plaintext highlighter-rouge">wait(2)</code>. When
<code class="language-plaintext highlighter-rouge">wait(2)</code> returns, the child will be paused.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Before allowing the child to continue, we tell the operating system that
the tracee should be terminated along with its parent. A real strace
implementation may want to set other options, such as
<code class="language-plaintext highlighter-rouge">PTRACE_O_TRACEFORK</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETOPTIONS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">PTRACE_O_EXITKILL</span><span class="p">);</span>
</code></pre></div></div>
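
<p>One such option is <code class="language-plaintext highlighter-rouge">PTRACE_O_TRACESYSGOOD</code>, which would be OR’ed with <code class="language-plaintext highlighter-rouge">PTRACE_O_EXITKILL</code> in the call above. The snippets in this article pass zero for the status pointer of <code class="language-plaintext highlighter-rouge">waitpid(2)</code>; the following is only a sketch of what inspecting that status could look like, assuming the option is set. The enum and function names are hypothetical.</p>

```c
#include <signal.h>
#include <sys/wait.h>

enum stop_kind { TRACEE_GONE, SYSCALL_STOP, OTHER_STOP };

/* Classify a status filled in by waitpid(pid, &status, 0). Assumes the
 * tracer set PTRACE_O_TRACESYSGOOD, so that system call stops are
 * reported as SIGTRAP with bit 7 set. */
enum stop_kind
classify_status(int status)
{
    if (WIFEXITED(status) || WIFSIGNALED(status)) {
        return TRACEE_GONE;   /* exited or killed: stop tracing */
    }
    if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
        return SYSCALL_STOP;  /* system call entrance or exit */
    }
    return OTHER_STOP;        /* signal delivery or another ptrace event */
}
```

<p>A tracing loop could break out on <code class="language-plaintext highlighter-rouge">TRACEE_GONE</code> rather than running forever, and skip register inspection on <code class="language-plaintext highlighter-rouge">OTHER_STOP</code>.</p>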

<p>All that’s left is a simple, endless loop that catches system calls
one at a time. The body of the loop has four steps:</p>

<ol>
  <li>Wait for the process to enter the next system call.</li>
  <li>Print a representation of the system call.</li>
  <li>Allow the system call to execute and wait for the return.</li>
  <li>Print the system call return value.</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code> request is used in both waiting for the next system
call to begin, and waiting for that system call to exit. As before, a
<code class="language-plaintext highlighter-rouge">wait(2)</code> is needed to wait for the tracee to enter the desired state.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">wait(2)</code> returns, the registers for the thread that made the
system call are filled with the system call number and its arguments.
However, <em>the operating system has not yet serviced this system call</em>.
This detail will be important later.</p>

<p>The next step is to gather the system call information. This is where
it gets architecture specific. On x86-64, <a href="/blog/2015/05/15/">the system call number is
passed in <code class="language-plaintext highlighter-rouge">rax</code></a>, and the arguments (up to 6) are passed in
<code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">r10</code>, <code class="language-plaintext highlighter-rouge">r8</code>, and <code class="language-plaintext highlighter-rouge">r9</code>. Reading the registers is
another Ptrace call, though there’s no need to <code class="language-plaintext highlighter-rouge">wait(2)</code> since the
tracee isn’t changing state.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
<span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
<span class="kt">long</span> <span class="n">syscall</span> <span class="o">=</span> <span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">;</span>

<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"%ld(%ld, %ld, %ld, %ld, %ld, %ld)"</span><span class="p">,</span>
        <span class="n">syscall</span><span class="p">,</span>
        <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rdi</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rsi</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rdx</span><span class="p">,</span>
        <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r10</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r8</span><span class="p">,</span>  <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r9</span><span class="p">);</span>
</code></pre></div></div>

<p>There’s one caveat. For <a href="https://web.archive.org/web/20190323050358/https://stackoverflow.com/a/6469069">internal kernel purposes</a>, the system
call number is stored in <code class="language-plaintext highlighter-rouge">orig_rax</code> rather than <code class="language-plaintext highlighter-rouge">rax</code>. All the other
system call arguments are straightforward.</p>

<p>Next it’s another <code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code> and <code class="language-plaintext highlighter-rouge">wait(2)</code>, then another
<code class="language-plaintext highlighter-rouge">PTRACE_GETREGS</code> to fetch the result. The result is stored in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">" = %ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rax</span><span class="p">);</span>
</code></pre></div></div>
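
<p>The raw value in <code class="language-plaintext highlighter-rouge">rax</code> also carries errors: on Linux a return value between -4095 and -1 is a negated <code class="language-plaintext highlighter-rouge">errno</code> code. A small sketch of decoding it, with hypothetical helper names, could look like this:</p>

```c
#include <stdio.h>
#include <string.h>

/* Linux reports failure in-band: a raw return value in [-4095, -1] is
 * a negated errno code. Returns that code, or 0 for a successful call. */
int
syscall_errno(long rax)
{
    return (rax < 0 && rax >= -4095) ? (int)-rax : 0;
}

/* Print the result roughly the way a libc caller would observe it. */
void
print_result(FILE *f, long rax)
{
    int err = syscall_errno(rax);
    if (err) {
        fprintf(f, " = -1 (%s)\n", strerror(err));
    } else {
        fprintf(f, " = %ld\n", rax);
    }
}
```

<p>Since <code class="language-plaintext highlighter-rouge">regs.rax</code> is unsigned, it would be cast to <code class="language-plaintext highlighter-rouge">long</code> before being passed in, just as in the <code class="language-plaintext highlighter-rouge">fprintf</code> call above.</p>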

<p>The output from this simple program is <em>very</em> crude. There is no
symbolic name for the system call and every argument is printed
numerically, even if it’s a pointer to a buffer. A more complete strace
would know which arguments are pointers and use <code class="language-plaintext highlighter-rouge">process_vm_readv(2)</code> to
read those buffers from the tracee in order to print them appropriately.</p>

<p>However, this does lay the groundwork for system call interception.</p>

<h3 id="system-call-interception">System call interception</h3>

<p>Suppose we want to use Ptrace to implement something like OpenBSD’s
<a href="https://man.openbsd.org/pledge.2"><code class="language-plaintext highlighter-rouge">pledge(2)</code></a>, in which <a href="http://www.openbsd.org/papers/hackfest2015-pledge/mgp00001.html">a process <em>pledges</em> to use only a
restricted set of system calls</a>. The idea is that many
programs typically have an initialization phase where they need lots
of system access (opening files, binding sockets, etc.). After
initialization they enter a main loop in which they process input
and only a small set of system calls are needed.</p>

<p>Before entering this main loop, a process can limit itself to the few
operations that it needs. If <a href="/blog/2017/07/19/">the program has a flaw</a> allowing it
to be exploited by bad input, the pledge significantly limits what the
exploit can accomplish.</p>

<p>Using the same strace model, rather than print out all system calls,
we could either block certain system calls or simply terminate the
tracee when it misbehaves. Termination is easy: just call <code class="language-plaintext highlighter-rouge">exit(2)</code> in
the tracer, since it’s configured to also terminate the tracee.
Blocking the system call and allowing the child to continue is a
little trickier.</p>

<p>The tricky part is that <strong>there’s no way to abort a system call once
it’s started</strong>. When the tracer returns from <code class="language-plaintext highlighter-rouge">wait(2)</code> on the entrance to
the system call, the only way to stop a system call from happening is
to terminate the tracee.</p>

<p>However, not only can we mess with the system call arguments, we can
change the system call number itself, converting it to a system call
that doesn’t exist. On return we can report a “friendly” <code class="language-plaintext highlighter-rouge">EPERM</code> error
in <code class="language-plaintext highlighter-rouge">errno</code> <a href="/blog/2016/09/23/">via the normal in-band signaling</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="cm">/* Enter next system call */</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>

    <span class="cm">/* Is this system call permitted? */</span>
    <span class="kt">int</span> <span class="n">blocked</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">is_syscall_blocked</span><span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">blocked</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// set to invalid syscall</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="cm">/* Run system call and stop on exit */</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">blocked</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* errno = EPERM */</span>
        <span class="n">regs</span><span class="p">.</span><span class="n">rax</span> <span class="o">=</span> <span class="o">-</span><span class="n">EPERM</span><span class="p">;</span> <span class="c1">// Operation not permitted</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This simple example only checks against a whitelist or blacklist of
system calls. And there’s no nuance, such as allowing files to be
opened (<code class="language-plaintext highlighter-rouge">open(2)</code>) read-only but not as writable, allowing anonymous
memory maps but not non-anonymous mappings, etc. There’s also no way
for the tracee to dynamically drop privileges.</p>
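
<p>The loop above leaves <code class="language-plaintext highlighter-rouge">is_syscall_blocked()</code> undefined. As a sketch only, a fixed blacklist version might look like the following; the particular entries are illustrative assumptions, not a recommended policy.</p>

```c
#include <stddef.h>
#include <sys/syscall.h>

/* Hypothetical blacklist; these entries are illustrative only. */
static const long blocked_syscalls[] = {
    SYS_execve, SYS_ptrace, SYS_kill,
};

/* Returns 1 if the system call number is on the blacklist, else 0. */
int
is_syscall_blocked(long nr)
{
    size_t n = sizeof(blocked_syscalls) / sizeof(blocked_syscalls[0]);
    for (size_t i = 0; i < n; i++) {
        if (blocked_syscalls[i] == nr) {
            return 1;
        }
    }
    return 0;
}
```

<p>A whitelist is the same lookup with the result inverted, and is usually the safer default since unknown system calls are denied.</p>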

<p>How <em>could</em> the tracee communicate to the tracer? Use an artificial
system call!</p>

<h3 id="creating-an-artificial-system-call">Creating an artificial system call</h3>

<p>For my new pledge-like system call — which I call <code class="language-plaintext highlighter-rouge">xpledge()</code> to
distinguish it from the real thing — I picked system call number 10000,
a nice high number that’s unlikely to ever be used for a real system
call.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYS_xpledge 10000
</span></code></pre></div></div>

<p>Just for demonstration purposes, I put together a minuscule interface
that’s not good for much in practice. It has little in common with
OpenBSD’s <code class="language-plaintext highlighter-rouge">pledge(2)</code>, which uses a <a href="https://www.tedunangst.com/flak/post/string-interfaces">string interface</a>.
<em>Actually</em> designing robust and secure sets of privileges is really
complicated, as the <code class="language-plaintext highlighter-rouge">pledge(2)</code> manpage shows. Here’s the entire
interface <em>and</em> implementation of the system call for the tracee:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="cp">#define XPLEDGE_RDWR  (1 &lt;&lt; 0)
#define XPLEDGE_OPEN  (1 &lt;&lt; 1)
</span>
<span class="cp">#define xpledge(arg) syscall(SYS_xpledge, arg)
</span></code></pre></div></div>

<p>If it passes zero for the argument, only a few basic system calls are
allowed, including those used to allocate memory (e.g. <code class="language-plaintext highlighter-rouge">brk(2)</code>). The
<code class="language-plaintext highlighter-rouge">XPLEDGE_RDWR</code> bit allows <a href="/blog/2017/03/01/">various</a> read and write system calls
(<code class="language-plaintext highlighter-rouge">read(2)</code>, <code class="language-plaintext highlighter-rouge">readv(2)</code>, <code class="language-plaintext highlighter-rouge">pread(2)</code>, <code class="language-plaintext highlighter-rouge">preadv(2)</code>, etc.). The
<code class="language-plaintext highlighter-rouge">XPLEDGE_OPEN</code> bit allows <code class="language-plaintext highlighter-rouge">open(2)</code>.</p>

<p>To prevent privileges from being escalated back, <code class="language-plaintext highlighter-rouge">xpledge()</code> blocks
itself — though this also prevents dropping more privileges later down
the line.</p>
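<p>As a rough illustration of how a tracer might interpret that bitmask, here is a minimal sketch of a policy check. The name <code class="language-plaintext highlighter-rouge">xpledge_allows()</code> and the exact set of system calls in each group are assumptions made for this example, not the tracer’s actual implementation.</p>

```c
#include <stdbool.h>
#include <sys/syscall.h>

#define SYS_xpledge   10000
#define XPLEDGE_RDWR  (1 << 0)
#define XPLEDGE_OPEN  (1 << 1)

/* Hypothetical policy check: after the tracee has pledged `mask`, may
 * it make system call number `nr`? The grouping below is illustrative. */
static bool
xpledge_allows(long mask, long nr)
{
    switch (nr) {
    /* Always allowed: memory allocation and exiting. */
    case SYS_brk:
    case SYS_exit:
    case SYS_exit_group:
        return true;

    /* Read and write family, gated on XPLEDGE_RDWR. */
    case SYS_read:  case SYS_readv:  case SYS_pread64:  case SYS_preadv:
    case SYS_write: case SYS_writev: case SYS_pwrite64: case SYS_pwritev:
        return (mask & XPLEDGE_RDWR) != 0;

    /* Opening files, gated on XPLEDGE_OPEN. Treating openat(2) the
     * same as open(2) is an assumption of this sketch. */
#ifdef SYS_open
    case SYS_open:
#endif
    case SYS_openat:
        return (mask & XPLEDGE_OPEN) != 0;

    /* xpledge() blocks itself so privileges cannot be regained. */
    case SYS_xpledge:
        return false;

    /* Anything unlisted is denied. */
    default:
        return false;
    }
}
```

<p>A tracer built this way would consult the check on each system call entrance once a pledge has been registered, and block the call with <code class="language-plaintext highlighter-rouge">EPERM</code> when it returns false.</p>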

<p>In the xpledge tracer, I just need to check for this system call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Handle entrance */</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">SYS_xpledge</span><span class="p">:</span>
        <span class="n">register_pledge</span><span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">rdi</span><span class="p">);</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The operating system will return <code class="language-plaintext highlighter-rouge">ENOSYS</code> (Function not implemented)
since this isn’t a <em>real</em> system call. So on the way out I overwrite
this with a success (0).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Handle exit */</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">SYS_xpledge</span><span class="p">:</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_POKEUSER</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="n">RAX</span> <span class="o">*</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I wrote a little test program that opens <code class="language-plaintext highlighter-rouge">/dev/urandom</code>, makes a read,
tries to pledge, then tries to open <code class="language-plaintext highlighter-rouge">/dev/urandom</code> a second time, then
confirms it can read from the original <code class="language-plaintext highlighter-rouge">/dev/urandom</code> file descriptor.
Running without a pledge tracer, the output looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604
</code></pre></div></div>
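<p>The pledging step in a test program like the one described above could be sketched as follows. The helper <code class="language-plaintext highlighter-rouge">try_xpledge()</code> is a hypothetical name for this example. It reports the failure and lets the caller carry on unsandboxed.</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SYS_xpledge   10000
#define XPLEDGE_RDWR  (1 << 0)
#define xpledge(arg)  syscall(SYS_xpledge, arg)

/* Hypothetical helper: attempt to pledge. Returns 0 if a tracer
 * accepted the pledge, or -1 (after printing the reason) if the
 * system call failed, e.g. because no tracer is attached. */
static int
try_xpledge(long mask)
{
    if (xpledge(mask) == -1) {
        fprintf(stderr, "XPledge failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

<p>Whether a failed pledge should be fatal is a design choice. This sketch leaves that decision to the caller.</p>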

<p>Making an invalid system call doesn’t crash an application. It just
fails, which is a rather convenient fallback. When run under the
tracer, it looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4
</code></pre></div></div>

<p>The pledge succeeds but the second <code class="language-plaintext highlighter-rouge">fopen(3)</code> does not since the tracer
blocked it with <code class="language-plaintext highlighter-rouge">EPERM</code>.</p>

<p>This concept could be taken much further, to, say, change file paths or
return fake results. A tracer could effectively chroot its tracee,
prepending some chroot path to the root of any path passed through a
system call. It could even lie to the process about what user it is,
claiming that it’s running as root. In fact, this is exactly how the
<a href="https://fakeroot-ng.lingnu.com/index.php/Home_Page">Fakeroot NG</a> program works.</p>
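<p>The path-rewriting idea can be sketched as a small string helper. This is only an illustration with a hypothetical name, <code class="language-plaintext highlighter-rouge">chroot_path()</code>. A real tracer would also have to copy the path out of the tracee’s memory and write the rewritten path back, and it would need to deal with relative paths, <code class="language-plaintext highlighter-rouge">..</code> components, and symbolic links, none of which is handled here.</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rewrite the absolute path `path` so that it
 * lands under `root`, storing the result in `out` (capacity `len`).
 * Returns 0 on success, or -1 if `path` is not absolute or the result
 * does not fit. Does not resolve ".." or symbolic links. */
static int
chroot_path(char *out, size_t len, const char *root, const char *path)
{
    size_t rlen = strlen(root);
    size_t plen = strlen(path);
    if (path[0] != '/' || rlen + plen + 1 > len)
        return -1;
    memcpy(out, root, rlen);
    memcpy(out + rlen, path, plen + 1);  /* includes the terminator */
    return 0;
}
```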

<h3 id="foreign-system-emulation">Foreign system emulation</h3>

<p>Suppose you don’t just want to intercept <em>some</em> system calls, but
<em>all</em> system calls. You’ve got <a href="/blog/2017/11/30/">a binary intended to run on another
operating system</a>, so none of the system calls it makes will ever
work.</p>

<p>You could manage all this using only what I’ve described so far. The
tracer would always replace the system call number with a dummy, allow
it to fail, then service the system call itself. But that’s really
inefficient. That’s essentially three context switches for each system
call: one to stop on the entrance, one to make the always-failing
system call, and one to stop on the exit.</p>

<p>The Linux version of PTrace has had a more efficient operation for
this technique since 2005: <code class="language-plaintext highlighter-rouge">PTRACE_SYSEMU</code>. PTrace stops only <em>once</em>
per system call, and it’s up to the tracer to service that system
call before allowing the tracee to continue.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSEMU</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>

    <span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="n">OS_read</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_write</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_open</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_exit</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="cm">/* ... and so on ... */</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To run binaries for the same architecture from any system with a
stable (enough) system call ABI, you just need this <code class="language-plaintext highlighter-rouge">PTRACE_SYSEMU</code>
tracer, a loader (to take the place of <code class="language-plaintext highlighter-rouge">exec(2)</code>), and whatever system
libraries the binary needs (or only run static binaries).</p>

<p>In fact, this sounds like a fun weekend project.</p>

<h3 id="see-also">See also</h3>

<ul>
  <li><a href="https://www.youtube.com/watch?v=uXgxMDglxVM">Implementing a clone of OpenBSD pledge into the Linux kernel</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>When FFI Function Calls Beat Native C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/27/"/>
    <id>urn:uuid:cb339e3b-382e-3762-4e5c-10cf049f7627</id>
    <updated>2018-05-27T20:03:15Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update: There’s a good discussion on <a href="https://news.ycombinator.com/item?id=17171252">Hacker News</a>.</em></p>

<p>Over on GitHub, David Yu has an interesting performance benchmark for
function calls of various Foreign Function Interfaces (<a href="https://en.wikipedia.org/wiki/Foreign_function_interface">FFI</a>):</p>

<p><a href="https://github.com/dyu/ffi-overhead">https://github.com/dyu/ffi-overhead</a></p>

<p>He created a shared object (<code class="language-plaintext highlighter-rouge">.so</code>) file containing a single, simple C
function. Then for each FFI he wrote a bit of code to call this function
many times, measuring how long it took.</p>

<p>For the C “FFI” he used standard dynamic linking, not <code class="language-plaintext highlighter-rouge">dlopen()</code>. This
distinction is important, since it really makes a difference in the
benchmark. There’s a potential argument about whether or not this is a
fair comparison to an actual FFI, but, regardless, it’s still
interesting to measure.</p>

<p>The most surprising result of the benchmark is that
<strong><a href="http://luajit.org/">LuaJIT’s</a> FFI is substantially faster than C</strong>. It’s about
25% faster than a native C function call to a shared object function.
How could a weakly and dynamically typed scripting language come out
ahead on a benchmark? Is this accurate?</p>

<p>It’s actually quite reasonable. The benchmark was run on Linux, so the
performance penalty we’re seeing comes from the <em>Procedure Linkage Table</em>
(PLT). I’ve put together a really simple experiment to demonstrate the
same effect in plain old C:</p>

<p><a href="https://github.com/skeeto/dynamic-function-benchmark">https://github.com/skeeto/dynamic-function-benchmark</a></p>

<p>Here are the results on an Intel i7-6700 (Skylake):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call
</code></pre></div></div>

<p>These are three different types of function calls:</p>

<ol>
  <li>Through the PLT</li>
  <li>An indirect function call (via <code class="language-plaintext highlighter-rouge">dlsym(3)</code>)</li>
  <li>A direct function call (via a JIT-compiled function)</li>
</ol>

<p>As shown, the last one is the fastest. It’s typically not an option
for C programs, but it’s natural in the presence of a JIT compiler,
including, apparently, LuaJIT.</p>

<p>In my benchmark, the function being called is named <code class="language-plaintext highlighter-rouge">empty()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">empty</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="p">}</span>
</code></pre></div></div>

<p>And to compile it into a shared object:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -fPIC -Os -o empty.so empty.c
</code></pre></div></div>

<p>Just as in my <a href="/blog/2017/09/21/">PRNG shootout</a>, the benchmark calls this function
repeatedly as many times as possible before an alarm goes off.</p>
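<p>The timing technique might be sketched like this. The names <code class="language-plaintext highlighter-rouge">count_calls()</code>, <code class="language-plaintext highlighter-rouge">on_alarm()</code>, and <code class="language-plaintext highlighter-rouge">empty_local()</code> are hypothetical, and using <code class="language-plaintext highlighter-rouge">setitimer(2)</code> for a sub-second interval is an assumption of this sketch rather than what the benchmark necessarily does. Since the target is called through a function pointer, it most closely resembles the indirect variant.</p>

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/time.h>

/* Cleared by the alarm signal handler. */
static volatile sig_atomic_t running;

static void
on_alarm(int sig)
{
    (void)sig;
    running = 0;
}

/* Stand-in target function for this sketch. */
static void empty_local(void) { }

/* Call f() repeatedly for roughly `msec` milliseconds and return the
 * number of calls made. Returns 0 without running if msec <= 0. */
static long
count_calls(void (*f)(void), long msec)
{
    if (msec <= 0)
        return 0;

    struct sigaction sa = {0};
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, 0);

    struct itimerval it = {0};
    it.it_value.tv_sec  = msec / 1000;
    it.it_value.tv_usec = (msec % 1000) * 1000;

    running = 1;
    setitimer(ITIMER_REAL, &it, 0);  /* one-shot SIGALRM */

    long count;
    for (count = 0; running; count++)
        f();
    return count;
}
```

<p>Dividing the elapsed time by the returned count gives a ns/call figure. An optimizing compiler may inline a locally defined target, which is one reason the benchmark’s function lives in a separate shared object.</p>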

<h3 id="procedure-linkage-tables">Procedure Linkage Tables</h3>

<p>When a program or library calls a function in another shared object,
the compiler cannot know where that function will be located in
memory. That information isn’t known until run time, after the program
and its dependencies are loaded into memory. These are usually at
randomized locations — e.g. <em>Address Space Layout Randomization</em>
(ASLR).</p>

<p>How is this resolved? Well, there are a couple of options.</p>

<p>One option is to make a note about each such call in the binary’s
metadata. The run-time dynamic linker can then <em>patch</em> in the correct
address at each call site. How exactly this would work depends on the
particular <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">code model</a> used when compiling the binary.</p>

<p>The downsides to this approach are slower loading, larger binaries, and
less <a href="/blog/2016/04/10/">sharing of code pages</a> between different processes. It’s
slower loading because every dynamic call site needs to be patched
before the program can begin execution. The binary is larger because
each of these call sites needs an entry in the relocation table. And the
lack of sharing is due to the code pages being modified.</p>

<p>On the other hand, the overhead for dynamic function calls would be
eliminated, giving JIT-like performance as seen in the benchmark.</p>

<p>The second option is to route all dynamic calls through a table. The
original call site calls into a stub in this table, which jumps to the
actual dynamic function. With this approach the code does not need to
be patched, meaning it’s <a href="/blog/2016/12/23/">trivially shared</a> between processes.
Only one place needs to be patched per dynamic function: the entries
in the table. Even more, these patches can be performed <em>lazily</em>, on
the first function call, making the load time even faster.</p>

<p>On systems using ELF binaries, this table is called the Procedure
Linkage Table (PLT). The PLT itself doesn’t actually get patched —
it’s mapped read-only along with the rest of the code. Instead the
<em>Global Offset Table</em> (GOT) gets patched. The PLT stub fetches the
dynamic function address from the GOT and <em>indirectly</em> jumps to that
address. To lazily load function addresses, these GOT entries are
initialized with an address of a function that locates the target
symbol, updates the GOT with that address, and then jumps to that
function. Subsequent calls use the lazily discovered address.</p>

<p><img src="/img/diagram/plt.svg" alt="" /></p>
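<p>Lazy binding can be imitated in plain C with a function pointer standing in for a GOT entry. This is only a simulation under made-up names (<code class="language-plaintext highlighter-rouge">got_add</code>, <code class="language-plaintext highlighter-rouge">plt_add()</code>, <code class="language-plaintext highlighter-rouge">add_resolver()</code>). A real dynamic linker looks the symbol up by name and the stub is a few machine instructions, but the control flow is similar.</p>

```c
static int resolve_count;  /* how many times the resolver has run */

/* The "real" function, as if it lived in a shared object. */
static int real_add(int a, int b) { return a + b; }

static int add_resolver(int a, int b);

/* Stand-in for a GOT entry: initially points at the resolver. */
static int (*got_add)(int, int) = add_resolver;

/* Stand-in for the lazy resolver: locate the target, patch the GOT
 * entry, then forward the original call to the target. */
static int
add_resolver(int a, int b)
{
    resolve_count++;
    got_add = real_add;
    return got_add(a, b);
}

/* Stand-in for the PLT stub: an indirect jump through the GOT entry. */
static int
plt_add(int a, int b)
{
    return got_add(a, b);
}
```

<p>The first call to <code class="language-plaintext highlighter-rouge">plt_add()</code> goes through the resolver. Every later call jumps straight through the patched entry.</p>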

<p>The downside of a PLT is extra overhead per dynamic function call,
which is what shows up in the benchmark. Since the benchmark <em>only</em>
measures function calls, this appears to be pretty significant, but in
practice it’s usually drowned out in noise.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Cleared by an alarm signal. */</span>
<span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">plt_benchmark</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">empty</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">empty()</code> is in the shared object, that call goes through the PLT.</p>

<h3 id="indirect-dynamic-calls">Indirect dynamic calls</h3>

<p>Another way to dynamically call functions is to bypass the PLT and
fetch the target function address within the program, e.g. via
<code class="language-plaintext highlighter-rouge">dlsym(3)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">h</span> <span class="o">=</span> <span class="n">dlopen</span><span class="p">(</span><span class="s">"path/to/lib.so"</span><span class="p">,</span> <span class="n">RTLD_NOW</span><span class="p">);</span>
<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">)</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"f"</span><span class="p">);</span>
<span class="n">f</span><span class="p">();</span>
</code></pre></div></div>

<p>Once the function address is obtained, the overhead is smaller than
function calls routed through the PLT. There’s no intermediate stub
function and no GOT access. (Caveat: If the program has a PLT entry for
the given function then <code class="language-plaintext highlighter-rouge">dlsym(3)</code> may actually return the address of
the PLT stub.)</p>

<p>However, this is still an <em>indirect</em> function call. On conventional
architectures, <em>direct</em> function calls have an immediate relative
address. That is, the target of the call is some hard-coded offset from
the call site. The CPU can see well ahead of time where the call is
going.</p>

<p>An indirect function call has more overhead. First, the address has to
be stored somewhere. Even if that somewhere is just a register, it
increases register pressure by using up a register. Second, it
provokes the CPU’s branch predictor since the call target isn’t
static, making for extra bookkeeping in the CPU. In the worst case the
function call may even cause a pipeline stall.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">indirect_benchmark</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">f</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function passed to this benchmark is fetched with <code class="language-plaintext highlighter-rouge">dlsym(3)</code> so the
compiler can’t <a href="/blog/2018/05/01/">do something tricky</a> like convert that indirect
call back into a direct call.</p>

<p>If the body of the loop were complicated enough that there was register
pressure, thereby requiring the address to be spilled onto the stack,
this benchmark might not fare as well against the PLT benchmark.</p>

<h3 id="direct-function-calls">Direct function calls</h3>

<p>The first two types of dynamic function calls are simple and easy to
use. <em>Direct</em> calls to dynamic functions are a trickier business since they
require modifying code at run time. In my benchmark I put together a
<a href="/blog/2015/03/19/">little JIT compiler</a> to generate the direct call.</p>

<p>There’s a gotcha to this: on x86-64 direct jumps are limited to a 2GB
range due to a signed 32-bit immediate. This means the JIT code has to
be placed virtually nearby the target function, <code class="language-plaintext highlighter-rouge">empty()</code>. If the JIT
code needed to call two different dynamic functions separated by more
than 2GB, then it’s not possible for both to be direct.</p>
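<p>The range limit can be checked before committing to an address. Here is a minimal sketch, assuming a 64-bit <code class="language-plaintext highlighter-rouge">uintptr_t</code> and a 5-byte <code class="language-plaintext highlighter-rouge">call rel32</code> instruction. The name <code class="language-plaintext highlighter-rouge">call_reachable()</code> is made up for this example.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Can a 5-byte "call rel32" placed at address `site` reach `target`?
 * The displacement is measured from the end of the instruction and
 * must fit in a signed 32-bit immediate. */
static bool
call_reachable(uintptr_t site, uintptr_t target)
{
    int64_t disp = (int64_t)(target - (site + 5));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

<p>A JIT that needs to call several dynamic functions could run this check against each target and fall back to an indirect call for any that are out of range.</p>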

<p>To keep things simple, my benchmark isn’t precise or very careful
about picking the JIT code address. After being given the target
function address, it blindly subtracts 4MB, rounds down to the nearest
page, allocates some memory, and writes code into it. To do this
correctly would mean inspecting the program’s own memory mappings to
find space, and there’s no clean, portable way to do this. On Linux
this <a href="/blog/2016/09/03/">requires parsing virtual files under <code class="language-plaintext highlighter-rouge">/proc</code></a>.</p>

<p>Here’s what my JIT’s memory allocation looks like. It assumes
<a href="/blog/2016/05/30/">reasonable behavior for <code class="language-plaintext highlighter-rouge">uintptr_t</code> casts</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">jit_compile</span><span class="p">(</span><span class="k">struct</span> <span class="n">jit_func</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">empty</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">uintptr_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">desired</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="n">addr</span> <span class="o">-</span> <span class="n">SAFETY_MARGIN</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">PAGEMASK</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">desired</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It allocates two pages, one writable and the other containing
non-writable code. Similar to <a href="/blog/2017/01/08/">my closure library</a>, the lower
page is writable and holds the <code class="language-plaintext highlighter-rouge">running</code> variable that gets cleared by
the alarm. It needed to be nearby the JIT code in order to be an
efficient RIP-relative access, just like the other two benchmark
functions. The upper page contains this assembly:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">jit_benchmark:</span>
        <span class="nf">push</span>  <span class="nb">rbx</span>
        <span class="nf">xor</span>   <span class="nb">ebx</span><span class="p">,</span> <span class="nb">ebx</span>
<span class="nl">.loop:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">running</span><span class="p">]</span>
        <span class="nf">test</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">je</span>    <span class="nv">.done</span>
        <span class="nf">call</span>  <span class="nv">empty</span>
        <span class="nf">inc</span>   <span class="nb">ebx</span>
        <span class="nf">jmp</span>   <span class="nv">.loop</span>
<span class="nl">.done:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebx</span>
        <span class="nf">pop</span>   <span class="nb">rbx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">call empty</code> is the only instruction that is dynamically generated
— necessary to fill out the relative address appropriately (the minus
5 is because it’s relative to the <em>end</em> of the instruction):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// call empty</span>
    <span class="kt">uintptr_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">p</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="mh">0xe8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">empty()</code> weren’t in a shared object and were instead located in the same
binary, this is essentially the direct call that the compiler would have
generated for <code class="language-plaintext highlighter-rouge">plt_benchmark()</code>, assuming somehow it didn’t inline
<code class="language-plaintext highlighter-rouge">empty()</code>.</p>

<p>Ironically, calling the JIT-compiled code requires an indirect call
(e.g. via a function pointer), and there’s no way around this. What
are you going to do, JIT compile another function that makes the
direct call? Fortunately this doesn’t matter since the part being
measured in the loop is only a direct call.</p>

<h3 id="its-no-mystery">It’s no mystery</h3>

<p>Given these results, it’s really no mystery that LuaJIT can generate
more efficient dynamic function calls than a PLT, <em>even if they still
end up being indirect calls</em>. In my benchmark, the non-PLT indirect
calls were 28% faster than the PLT, and the direct calls 43% faster
than the PLT. That’s a small edge that JIT-enabled programs have over
plain old native programs, though it comes at the cost of absolutely
no code sharing between processes.</p>
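<p>For reference, those percentages follow from the ns/call figures if “faster” is read as the reduction in time per call relative to the PLT. That reading is my assumption. A tiny helper with a hypothetical name, <code class="language-plaintext highlighter-rouge">percent_faster()</code>, makes the arithmetic explicit:</p>

```c
/* Percentage reduction in time per call relative to a baseline. */
static double
percent_faster(double baseline_ns, double ns)
{
    return 100.0 * (1.0 - ns / baseline_ns);
}
```

<p>With the results above, <code class="language-plaintext highlighter-rouge">percent_faster(1.759799, 1.257125)</code> comes to roughly 28.6 and <code class="language-plaintext highlighter-rouge">percent_faster(1.759799, 1.008108)</code> to roughly 42.7.</p>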

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>A Crude Personal Package Manager</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/03/27/"/>
    <id>urn:uuid:b100f50f-c8f8-3a08-e149-a04b2308226b</id>
    <updated>2018-03-27T02:10:35Z</updated>
    <category term="c"/><category term="posix"/><category term="linux"/><category term="bsd"/>
    <content type="html">
      <![CDATA[<p>For the past couple of months I’ve been using a custom package manager
to manage a handful of software packages within various unix-like
environments. Packages are <a href="/blog/2017/06/19/">installed in my home directory</a> under
<code class="language-plaintext highlighter-rouge">~/.local/bin</code>, and the package manager itself is just a 110-line Bourne
shell script. It’s not intended to replace the system’s package
manager but, instead, complement it in some cases where I need more
flexibility. I use it to run custom versions of specific pieces of
software — newer or older than the system-installed versions, or with my
own patches and modifications — without interfering with the rest of
the system, and without a need for root access. It’s worked out <em>really</em>
well so far and I expect to continue making heavy use of it in the
future.</p>

<p>It’s so simple that I haven’t even bothered putting the script in its
own repository. It sits unadorned within my dotfiles repository with the
name <em>qpkg</em> (“quick package”):</p>

<ul>
  <li><a href="https://github.com/skeeto/dotfiles/blob/master/bin/qpkg">https://github.com/skeeto/dotfiles/blob/master/bin/qpkg</a></li>
</ul>

<p>Sitting alongside my dotfiles means it’s always there when I need it,
just as if it was a built-in command.</p>

<p>I say it’s crude because its “install” (<code class="language-plaintext highlighter-rouge">-I</code>) procedure is little more
than a wrapper around tar. It doesn’t invoke libtool after installing a
library, and there’s no post-install script — or <code class="language-plaintext highlighter-rouge">postinst</code> as Debian
calls it. It doesn’t check for conflicts between packages, though
there’s a command for doing so manually ahead of time. It doesn’t manage
dependencies, nor even have them as a concept. That’s all on the user to
screw up.</p>

<p>In other words, it doesn’t attempt to solve most of the hard problems
tackled by package managers… <em>except</em> for three important issues:</p>

<ol>
  <li>
    <p>It provides a clean, guaranteed-to-work uninstall procedure. Some
Makefiles <em>do</em> have a token “uninstall” target, but it’s often
unreliable.</p>
  </li>
  <li>
    <p>Unlike blindly using a Makefile “install” target, I can check for
conflicts <em>before</em> installing the software. I’ll know if and how a
package clobbers an already-installed package, and I can manage, or
ignore, that conflict manually as needed.</p>
  </li>
  <li>
    <p>It produces a compact, reusable package file that I can reinstall
later, even on a different machine (with a couple of caveats). I
don’t need to keep around the original source and build directories
should I want to install or uninstall later. I can also rapidly
switch back and forth between different builds of the same software.</p>
  </li>
</ol>

<p>The first caveat is that the package will be configured for exactly my
own home directory, so I usually can’t share it with other users, or
install it on machines where I have a different home directory. Though I
could still create packages for different installation prefixes.</p>

<p>The second caveat is that some builds tailor themselves by default to
the host (e.g. <code class="language-plaintext highlighter-rouge">-march=native</code>). If care isn’t taken, those packages may
not be very portable. This is more common than I had expected and has
mildly annoyed me.</p>

<h3 id="birth-of-a-package-manager">Birth of a package manager</h3>

<p>While the package manager is new, I’ve been building and installing
software in my home directory for years. I’d follow the normal process
of setting the install <em>prefix</em> to <code class="language-plaintext highlighter-rouge">$HOME/.local</code>, running the build,
and then letting the “install” target do its thing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
</code></pre></div></div>

<p>This worked well enough for years. However, I’ve come to rely a lot on
this technique, and I’m using it for increasingly sophisticated
purposes, such as building custom cross-compiler toolchains.</p>

<p>A common difficulty has been handling the release of new versions of
software. I’d like to upgrade to the new version, but lack a way to
cleanly uninstall the previous version. Simply clobbering the old
version by installing it on top <em>usually</em> works. Occasionally it
wouldn’t, and I’d have to blow away <code class="language-plaintext highlighter-rouge">~/.local</code> and start all over again.
With more and more software installed in my home directory, restarting
has become more and more of a chore that I’d like to avoid.</p>

<p>What I needed was a way to track exactly which files were installed so
that I could remove them later when I needed to uninstall. Fortunately
there’s a widely-used convention for exactly this purpose: <code class="language-plaintext highlighter-rouge">DESTDIR</code>.</p>

<p>It’s expected that when a Makefile provides an “install” target, it
prefixes the installation path with the <code class="language-plaintext highlighter-rouge">DESTDIR</code> macro, which is
assigned to the empty string by default. This allows the user to install
the software to a temporary location for the purposes of packaging.
Unlike the installation prefix (<code class="language-plaintext highlighter-rouge">--prefix</code>) configured before the build
began, the software is not expected to function properly when run in the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> location.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=$DESTDIR install
</code></pre></div></div>

<p>A different tool will be used to copy these files into place and actually
install it. This tool can track what files were installed, allowing them
to be removed later when uninstalling. My package manager uses the tar
program for both purposes. First it creates a package by packing up the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> (at the root of the actual install prefix):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar czf package.tgz -C $DESTDIR$HOME/.local .
</code></pre></div></div>

<p>So a package is nothing more than a gzipped tarball. To install, it
unpacks the tarball in <code class="language-plaintext highlighter-rouge">~/.local</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd $HOME/.local
$ tar xzf ~/package.tgz
</code></pre></div></div>

<p>But how does it uninstall a package? It didn’t keep track of what was
installed. Easy! The tarball itself contains the package list, and it’s
printed with tar’s <code class="language-plaintext highlighter-rouge">t</code> mode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd $HOME/.local
for file in $(tar tzf ~/package.tgz | grep -v '/$'); do
    rm -f "$file"
done
</code></pre></div></div>

<p>I’m using <code class="language-plaintext highlighter-rouge">grep</code> to skip directories, which are conveniently listed with
a trailing slash. Note that in the example above, there are a couple of
issues with file names containing whitespace. If the file contains a
space character, it will word split incorrectly in the <code class="language-plaintext highlighter-rouge">for</code> loop. A
Makefile couldn’t handle such a file in the first place, but, in case
it’s still necessary, my package manager sets <code class="language-plaintext highlighter-rouge">IFS</code> to just a newline.</p>
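<p>Here is a minimal sketch of that newline-only splitting. It is not the actual <code class="language-plaintext highlighter-rouge">qpkg</code> source: the function name and its two arguments are made up for illustration, and it assumes the package path is absolute since the function changes directory.</p>

```shell
# Hypothetical helper, not taken from qpkg: remove every non-directory
# entry listed in a gzipped tarball from beneath the given prefix.
# The body runs in a subshell so cd, IFS, and set -f do not leak out.
uninstall_sketch() (
    pkg=$1      # absolute path to the package tarball
    prefix=$2   # where the package was unpacked, e.g. "$HOME/.local"
    cd "$prefix" || exit 1
    # Split the listing on newlines only, so names with spaces survive.
    IFS='
'
    # Disable globbing so names containing * or ? are left alone.
    set -f
    for file in $(tar tzf "$pkg" | grep -v '/$'); do
        rm -f -- "$file"
    done
)
```

<p>It would be called as <code class="language-plaintext highlighter-rouge">uninstall_sketch ~/package.tgz ~/.local</code>. Directories are left in place, just as in the loop above.</p>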

<p>If the file name contains a newline, then my package manager relies on
<a href="http://dinaburg.org/bitsquatting.html">a cosmic ray striking just the right bit at just the right
instant</a> to make it all work out, because no version of tar can
unambiguously print such file names. Crossing your fingers during this
process may help.</p>

<h3 id="commands">Commands</h3>

<p>There are five commands, each assigned to a capital letter: <code class="language-plaintext highlighter-rouge">-B</code>, <code class="language-plaintext highlighter-rouge">-C</code>,
<code class="language-plaintext highlighter-rouge">-I</code>, <code class="language-plaintext highlighter-rouge">-V</code>, and <code class="language-plaintext highlighter-rouge">-U</code>. It’s an interface pattern inspired by <a href="https://www.openbsd.org/papers/bsdcan-signify.html">Ted
Unangst’s signify</a> (see <a href="https://man.openbsd.org/signify.1"><code class="language-plaintext highlighter-rouge">signify(1)</code></a>). I also used this
pattern with <a href="/blog/2017/09/15/">Blowpipe</a> and, in retrospect, wish I had also used it
with <a href="/blog/2017/03/12/">Enchive</a>.</p>

<h4 id="build--b">Build (<code class="language-plaintext highlighter-rouge">-B</code>)</h4>

<p>Unlike the other four commands, the “build” command isn’t essential,
and is just for convenience. It assumes the build uses an Autoconf-like
configure script and runs it automatically, followed by <code class="language-plaintext highlighter-rouge">make</code> with the
appropriate <code class="language-plaintext highlighter-rouge">-j</code> (jobs) option. It automatically sets the <code class="language-plaintext highlighter-rouge">--prefix</code>
argument when running the configure script.</p>

<p>If the build uses something other than an Autoconf-like configure script,
such as CMake, then you can’t use the “build” command and must perform
the build yourself. For example, I must do this when building LLVM and
Clang.</p>
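<p>Roughly speaking, the “build” step boils down to something like the sketch below. This is an illustration rather than the real script: the function name and the <code class="language-plaintext highlighter-rouge">PREFIX</code> variable are assumptions, and it falls back to a single job where <code class="language-plaintext highlighter-rouge">nproc</code> isn’t available.</p>

```shell
# Hypothetical helper, not taken from qpkg: configure and compile a
# source tree from the current (build) directory. Any extra arguments
# are handed to the configure script.
build_sketch() {
    src=$1                                # path to the unpacked source tree
    shift
    prefix=${PREFIX:-$HOME/.local}        # PREFIX is an assumed override
    jobs=$(nproc 2>/dev/null || echo 1)   # nproc is not universal
    "$src/configure" --prefix="$prefix" "$@" || return 1
    make -j"$jobs"
}
```

<p>Run from an empty build directory, <code class="language-plaintext highlighter-rouge">build_sketch ../name-version/ --with-ncurses</code> would configure with the home directory prefix plus the extra option, then compile.</p>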

<p>Before using the “build” command, the package must first be unpacked and
patched if necessary. Then the package manager can take over to run the
build.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 &lt; ../0001.patch
$ patch -p1 &lt; ../0002.patch
$ patch -p1 &lt; ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/
</code></pre></div></div>

<p>In this example I’m doing an out-of-source build by invoking the
configure script from a different directory. Did you know Autoconf
scripts support this? I didn’t know until recently! Unfortunately some
hand-written Autoconf-like scripts don’t, though this will
be immediately obvious.</p>

<p>Once <code class="language-plaintext highlighter-rouge">qpkg</code> returns, the program will be fully built — or stuck on a
build error if you’re unlucky. If you need to pass custom configure
options, just tack them on the <code class="language-plaintext highlighter-rouge">qpkg</code> command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses
</code></pre></div></div>

<p>Since the second and third steps, creating the build directory and
moving into it, are so common, there’s an optional switch for them: <code class="language-plaintext highlighter-rouge">-d</code>.
This option’s argument is the build directory. <code class="language-plaintext highlighter-rouge">qpkg</code> creates that
directory and runs the build inside it. In practice I just use “x” for
the build directory since it’s so quick to add “dx” to the command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/
</code></pre></div></div>

<p>With the software compiled, the next step is creating the package.</p>

<h4 id="create--c">Create (<code class="language-plaintext highlighter-rouge">-C</code>)</h4>

<p>The “create” command creates the <code class="language-plaintext highlighter-rouge">DESTDIR</code> (<code class="language-plaintext highlighter-rouge">_destdir</code> in the working
directory) and runs the “install” Makefile target to fill it with files.
Continuing with the example above and its <code class="language-plaintext highlighter-rouge">x/</code> build directory:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -Cdx name
</code></pre></div></div>

<p>Where “name” is the name of the package, without any file name
extension. Like with “build”, extra arguments after the package name are
passed to <code class="language-plaintext highlighter-rouge">make</code> in case there needs to be any additional tweaking.</p>

<p>When the “create” command finishes, there will be a new package named
<code class="language-plaintext highlighter-rouge">name.tgz</code> in the working directory. At this point the source and build
directories are no longer needed, assuming everything went fine.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ rm -rf name-version/
$ rm -rf x/
</code></pre></div></div>
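<p>To make the “create” step concrete, here is a minimal sketch of the same idea using only <code class="language-plaintext highlighter-rouge">make</code> and <code class="language-plaintext highlighter-rouge">tar</code>. It is not the real implementation: the function name and the <code class="language-plaintext highlighter-rouge">PREFIX</code> variable are assumptions, and it works in the current directory instead of handling a separate build directory.</p>

```shell
# Hypothetical helper, not taken from qpkg: stage an install into a
# temporary DESTDIR, then pack the staged prefix into name.tgz.
create_sketch() {
    name=$1                          # package name, without extension
    shift                            # remaining arguments go to make
    prefix=${PREFIX:-$HOME/.local}   # PREFIX is an assumed override
    destdir=$PWD/_destdir            # absolute, in case make changes directory
    mkdir -p "$destdir" || return 1
    make DESTDIR="$destdir" "$@" install || return 1
    # The archive root is the install prefix inside the staging area.
    tar czf "$name.tgz" -C "$destdir$prefix" . || return 1
    rm -rf "$destdir"
}
```

<p>Packing from the prefix inside the staging directory is what lets the install step be a plain unpack into <code class="language-plaintext highlighter-rouge">~/.local</code>.</p>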

<p>This package is ready to install, though you may want to verify it
first.</p>

<h4 id="verify--v">Verify (<code class="language-plaintext highlighter-rouge">-V</code>)</h4>

<p>The “verify” command checks for collisions against installed packages.
It works like uninstallation, but rather than deleting files, it checks
if any of the files already exist. If they do, it means there’s a
conflict with an existing package. These file names are printed.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -V name.tgz
</code></pre></div></div>

<p>The most common conflict I’ve seen is in the info index (<code class="language-plaintext highlighter-rouge">info/dir</code>)
file, which is safe to ignore since I don’t care about it.</p>

<p>If the package has already been installed, there will of course be tons
of conflicts. This is the easiest way to check if a package has been
installed.</p>
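<p>Since “verify” works like uninstallation with an existence check in place of the delete, it can be sketched the same way. The function below is an illustration with made-up names, not the real script, and its exit status convention is an assumption.</p>

```shell
# Hypothetical helper, not taken from qpkg: print each file in the
# package that already exists beneath the prefix. Exits 0 when nothing
# collides and 1 when at least one file does.
verify_sketch() (
    pkg=$1      # absolute path to the package tarball
    prefix=$2   # install prefix, e.g. "$HOME/.local"
    cd "$prefix" || exit 2
    status=0
    # Same newline-only splitting used for uninstalling.
    IFS='
'
    set -f
    for file in $(tar tzf "$pkg" | grep -v '/$'); do
        # -L catches dangling symlinks, which -e reports as missing.
        if [ -e "$file" ] || [ -L "$file" ]; then
            printf '%s\n' "$file"
            status=1
        fi
    done
    exit "$status"
)
```

<p>An empty listing means the package can be unpacked without clobbering anything.</p>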

<h4 id="install--i">Install (<code class="language-plaintext highlighter-rouge">-I</code>)</h4>

<p>The “install” command is just the dumb <code class="language-plaintext highlighter-rouge">tar xzf</code> explained above. It
will clobber anything in its way without warning, which is why, if that
matters, “verify” should be used first.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -I name.tgz
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">qpkg</code> returns, the package has been installed and is probably
ready to go. A lot of packages complain that you need to run libtool to
finalize an installation, but I’ve never had a problem skipping it. This
dumb unpacking generally works fine.</p>

<h4 id="uninstall--u">Uninstall (<code class="language-plaintext highlighter-rouge">-U</code>)</h4>

<p>Obviously the last command is “uninstall”. As explained above, this
needs the original package that was given to the “install” command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -U name.tgz
</code></pre></div></div>

<p>Just as “install” is dumb, so is “uninstall,” blindly deleting anything
listed in the tarball. One thing I like about dumb tools is that there
are no surprises.</p>

<p>I typically suffix the package name with the version number to help keep
the packages organized. When upgrading to a new version of a piece of
software, I build the new package, which, thanks to the version suffix,
will have a distinct name. Then I uninstall the old package, and,
finally, install the new one in its place. So far I’ve been keeping the
old package around in case I still need it, though I could always
rebuild it in a pinch.</p>
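<p>That upgrade routine can be written down as a tiny wrapper. The function name is hypothetical, and it assumes <code class="language-plaintext highlighter-rouge">qpkg</code> is on the <code class="language-plaintext highlighter-rouge">PATH</code> and that both package files already exist. The extra “verify” pass is an addition for illustration, not part of the routine described above.</p>

```shell
# Hypothetical wrapper around the commands described above.
upgrade_sketch() {
    old=$1   # e.g. name-1.0.tgz, the currently installed package
    new=$2   # e.g. name-1.1.tgz, the freshly built package
    qpkg -U "$old" || return 1
    # With the old version gone, anything "verify" still prints belongs
    # to some other package and is about to be clobbered.
    qpkg -V "$new"
    qpkg -I "$new"
}
```

<p>For example: <code class="language-plaintext highlighter-rouge">upgrade_sketch name-1.0.tgz name-1.1.tgz</code>.</p>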

<h3 id="package-by-accumulation">Package by accumulation</h3>

<p>Building a GCC cross-compiler toolchain is a tricky case that doesn’t
fit so well with the build, create, and install process illustrated
above. It would be nice for the cross-compiler to be a single, big
package, but due to the way it’s built, it would need to be five or so
packages, a couple of which will conflict (one being a subset of
another):</p>

<ol>
  <li>binutils</li>
  <li>C headers</li>
  <li>core GCC</li>
  <li>C runtime</li>
  <li>rest of GCC</li>
</ol>

<p>Each step needs to be installed before the next step will work. (I don’t
even want to think about cross-compiling a cross-compiler.)</p>

<p>To deal with this, I added a “keep” (<code class="language-plaintext highlighter-rouge">-k</code>) option that leaves the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> around after creating the package. To keep things tidy, the
intermediate packages exist and are installed, but the final, big
cross-compiler package <em>accumulates</em> into the <code class="language-plaintext highlighter-rouge">DESTDIR</code>. The final
package at the end is actually the whole cross compiler in one package,
a superset of them all.</p>
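<p>The accumulation idea can be sketched with plain <code class="language-plaintext highlighter-rouge">make</code> and <code class="language-plaintext highlighter-rouge">tar</code>. This is a guess at the general shape, not how the “keep” option is actually implemented: the function name and <code class="language-plaintext highlighter-rouge">PREFIX</code> variable are assumptions, and it assumes every stage is packaged from the same working directory.</p>

```shell
# Hypothetical helper, not taken from qpkg: like a "create" step, except
# the staging directory is shared between stages and never removed, so
# each package holds everything staged so far.
keep_create_sketch() {
    name=$1                          # package name for this stage
    prefix=${PREFIX:-$HOME/.local}   # PREFIX is an assumed override
    destdir=$PWD/_destdir            # left in place for the next stage
    mkdir -p "$destdir" || return 1
    make DESTDIR="$destdir" install || return 1
    tar czf "$name.tgz" -C "$destdir$prefix" .
}
```

<p>Each stage’s package is then a superset of the one before it, and the package from the last stage is the complete toolchain.</p>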

<p>Complicated situations like these are where I can really understand the
value of Debian’s <a href="https://wiki.debian.org/FakeRoot">fakeroot</a> tool.</p>

<h3 id="my-use-case-and-an-alternative">My use case, and an alternative</h3>

<p>The role filled by my package manager is actually pretty well suited for
<a href="https://www.pkgsrc.org/">pkgsrc</a>, which is NetBSD’s ports system made available to other
unix-like systems. However, I just need something really lightweight
that gives me absolute control — even more than I get with pkgsrc — in
the dozen or so cases where I <em>really</em> need it.</p>

<p>All I need is a standard C toolchain in a unix-like environment (even a
really old one), the source tarballs for the software I need, my 110-line
shell script package manager, and one to two cans of elbow grease.
From there I can bootstrap everything I might need without root access,
even <a href="/blog/2017/04/01/">in a disaster</a>. If the software I need isn’t written in C, it
can ultimately get bootstrapped from some crusty old C compiler, which
might even involve building some newer C compilers in between. After a
certain point it’s C all the way down.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Initial Evaluation of the Windows Subsystem for Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/11/30/"/>
    <id>urn:uuid:3edd1b7d-74c3-3dab-83b6-aa07ee54460f</id>
    <updated>2017-11-30T21:03:53Z</updated>
    <category term="linux"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Recently I had my first experiences with the <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/wsl/2016/04/22/windows-subsystem-for-linux-overview/"><em>Windows Subsystem for
Linux</em></a> (WSL), evaluating its potential as an environment for
getting work done. This subsystem, introduced to Windows 10 in August
2016, allows Windows to natively run x86 and x86-64 Linux binaries.
It’s essentially the counterpart to Wine, which allows Linux to
natively run Windows binaries.</p>

<p>WSL interfaces with Linux programs only at the kernel level, servicing
system calls the same way <a href="/blog/2015/05/15/">the Linux kernel would</a>. The
subsystem’s main job is translating Linux system calls into NT
requests. There’s a <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/wsl/">series of articles about its internals</a> if
you’re interested in learning more.</p>

<p>I was honestly impressed by how well this all works, especially since
Microsoft has long had an affinity for producing flimsy imitations
(Windows console, PowerShell, Arial, etc.). WSL’s design allows
Microsoft to dump an Ubuntu system wholesale inside Windows — and,
more recently, other Linux distributions — bypassing a bunch of
annoying issues, particularly in regards to glibc.</p>

<p>WSL processes can <code class="language-plaintext highlighter-rouge">exec(2)</code> Windows binaries, which then run under
their appropriate subsystem, similar to <a href="https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/binfmt-misc.rst">binfmt</a> on Linux. In
theory this nice interop should allow for <em>some</em> automation
Linux-style even for Windows’ services and programs. More on that
later.</p>

<p>There are some notable issues, though.</p>

<h3 id="lack-of-device-emulation">Lack of device emulation</h3>

<p>No soundcard devices are exposed to the subsystem, so Linux programs
can’t play sound. There’s <a href="https://trzeci.eu/configure-graphic-and-sound-on-wsl/">a hack to talk PulseAudio</a> with a
Windows process that can access the sound hardware, but that’s about it. Generally
there’s not much reason to be playing media or games under WSL, but
this can be an annoyance if you’re, say, <a href="/blog/2017/11/03/">writing software that
synthesizes audio</a>.</p>

<p>Really, there’s almost no device emulation at all and <code class="language-plaintext highlighter-rouge">/proc</code> is
pretty empty. You won’t see hard drives or removable media under
<code class="language-plaintext highlighter-rouge">/dev</code>, nor will you see USB devices like webcams and
<a href="/blog/2016/11/05/">joysticks</a>. A lot of the useful things you might do on a Linux
system aren’t available under WSL.</p>

<h3 id="no-filesystem-in-userspace-fuse">No Filesystem in Userspace (FUSE)</h3>

<p>Microsoft hasn’t implemented any of the system calls for FUSE, so don’t
expect to use your favorite userspace filesystems. The biggest loss for
me is <a href="https://github.com/libfuse/sshfs">sshfs</a>, which I use frequently.</p>

<p>If FUSE <em>was</em> supported, it would be interesting to see how the rest of
Windows interacts with these mounted filesystems, if at all.</p>

<h3 id="fragile-services">Fragile services</h3>

<p>Services running under WSL are flaky. The big issue is that when the
initial WSL shell process exits, all WSL processes are killed and the
entire subsystem is torn down. This includes any services that are
running. That’s certainly surprising to anyone with experience running
services on any kind of unix system. This is probably the worst part
of WSL.</p>

<p>While systemd is the standard for Linux these days and may even be
“installed” in the WSL virtual filesystem, it’s not actually running
and you can’t use <code class="language-plaintext highlighter-rouge">systemctl</code> to interact with services. Services can
only be controlled the old fashioned way, and, per above, that initial
WSL console window has to remain open while services are running.</p>

<p>That’s a bit of a damper if you’re intending to spend a lot of time
remotely SSHing into your Windows 10 system. So yes, it’s trivial to run
an OpenSSH server under WSL, but it won’t feel like a proper system
service.</p>

<h3 id="limited-graphics-support">Limited graphics support</h3>

<p>WSL doesn’t come with an X server, so you have to supply one
separately (<a href="https://sourceforge.net/projects/xming/">Xming</a>, etc.) that runs outside WSL, as a normal
Windows process. WSL processes can connect to that server (<code class="language-plaintext highlighter-rouge">DISPLAY</code>)
allowing you to run most Linux graphical software.</p>

<p>However, this means there’s no hardware acceleration. There will be no
<a href="https://en.wikipedia.org/wiki/GLX">GLX extensions</a> available. If your goal is to run the Emacs or
Vim GUIs, that’s not a big deal, but it might matter if you were
interested in running a browser under WSL. It also means it’s not a
suitable environment for <a href="/blog/2015/06/06/">developing software using OpenGL</a>.</p>

<h3 id="filesystem-woes">Filesystem woes</h3>

<p>The filesystem manages to be both one of the smallest issues as well
as one of the biggest.</p>

<h4 id="filename-translation">Filename translation</h4>

<p>On the small issue side is filename translation. Under most Linux
filesystems — and even more broadly for unix — <a href="http://yarchive.net/comp/linux/case_insensitive_filenames.html">a filename is just a
bytestring</a>. They’re not necessarily UTF-8 or any other
particular encoding, and that’s partly why filenames are
case-sensitive — the meaning of case depends on the encoding.</p>

<p>However, Windows uses a <a href="/blog/2016/06/13/">pseudo-UTF-16 scheme</a> for filenames,
incompatible with bytestrings. Since WSL lives <em>within</em> a Windows
filesystem, there must be some bijection between bytestring filenames
and pseudo-UTF-16 filenames. It will also have to reject filenames
that can’t be mapped. WSL does both.</p>

<p>I couldn’t find any formal documentation about how filename
translation works, but most of it can be reverse engineered through
experimentation. In practice, Linux filenames are <a href="/blog/2017/10/06/">UTF-8 encoded
strings</a>, and WSL’s translation takes advantage of this.
Filenames are decoded as UTF-8 and re-encoded as UTF-16 for Windows.
Any byte that doesn’t decode as valid UTF-8 is silently converted to
REPLACEMENT CHARACTER (U+FFFD), and decoding continues from the next
byte.</p>

<p>I wonder if there are security consequences for different filenames
silently mapping to the same underlying file.</p>

<p>Exercise for the reader: How is an unmatched surrogate half from
Windows translated to WSL, where it doesn’t have a UTF-8 equivalent? I
haven’t tried this yet.</p>

<p>Even for valid UTF-8, there are many bytes that most Linux filesystems
allow in filenames that Windows does not. This ranges from simple things
like ASCII backslash and colon — special components of Windows’ paths —
to unusual characters like newlines, escape, and other ASCII control
characters. There are two different ways these are handled:</p>

<ol>
  <li>
    <p>The C drive is available under <code class="language-plaintext highlighter-rouge">/mnt/c</code>, and WSL processes can access
regular Windows files under this “mountpoint.” Attempting to access
filenames with invalid characters under this mountpoint always
results in ENOENT: “No such file or directory.”</p>
  </li>
  <li>
    <p>Outside of <code class="language-plaintext highlighter-rouge">/mnt/c</code> is WSL territory, and Windows processes aren’t
supposed to touch these files. This allows for more freedom when
translating filenames. REPLACEMENT CHARACTER is still used for
invalid UTF-8 sequences, but the forbidden characters, including
backslashes, are all permitted. They’re translated to <code class="language-plaintext highlighter-rouge">#XXXX</code> where X
is hexadecimal for the normally invalid character. For example, <code class="language-plaintext highlighter-rouge">a:b</code>
becomes <code class="language-plaintext highlighter-rouge">a#003Ab</code>.</p>
  </li>
</ol>
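<p>The <code class="language-plaintext highlighter-rouge">#XXXX</code> escaping from the second case can be imitated in a few lines of shell. This is only a sketch of the observed behavior, not WSL’s actual algorithm. It assumes ASCII-only input, treats just the characters named above (backslash, colon, and ASCII control characters) as forbidden, and does nothing special for a literal <code class="language-plaintext highlighter-rouge">#</code>.</p>

```shell
# Hypothetical imitation of the escaping described above: each forbidden
# character is replaced by '#' followed by four hexadecimal digits.
wsl_escape_sketch() {
    name=$1
    out=
    while [ -n "$name" ]; do
        rest=${name#?}          # everything after the first character
        c=${name%"$rest"}       # the first character itself
        name=$rest
        case $c in
            [[:cntrl:]]|:|\\)
                # A leading quote makes printf yield the character code.
                out=$out$(printf '#%04X' "'$c")
                ;;
            *)
                out=$out$c
                ;;
        esac
    done
    printf '%s\n' "$out"
}
```

<p>With this, <code class="language-plaintext highlighter-rouge">wsl_escape_sketch 'a:b'</code> prints <code class="language-plaintext highlighter-rouge">a#003Ab</code>, matching the example above.</p>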

<p>While WSL doesn’t let you get away with all the crazy, ill-advised
filenames that Linux allows, it’s still quite reasonable. Since Windows
and Linux filenames aren’t entirely compatible, there’s going to be some
trade-off no matter how this translation is done.</p>

<h4 id="filesystem-performance">Filesystem performance</h4>

<p>On the other hand, filesystem performance is abysmal, and I doubt the
subsystem is to blame. This isn’t a surprise to anyone who’s used
moderately-sized Git repositories on Windows, where the large number
of loose files brings things to a crawl. This has been a Windows issue
for years, and that’s even <em>before</em> you start plugging in the
“security” services (virus scanners, whitelists, etc.)
that are typically present on a Windows system and make this even
worse.</p>

<p>To test out WSL, I went around my normal business <a href="/blog/2017/06/19/">compiling
tools</a> and making myself at home, just as I would on Linux.
Doing nearly anything in WSL was noticeably slower than doing the same
on Linux on the exact same hardware. I didn’t run any benchmarks, but
I’d expect to see around an order of magnitude difference on average
for filesystem operations. Building LLVM and Clang took a couple
hours rather than the typical 20 minutes.</p>

<p>I don’t expect this issue to get fixed anytime soon, and it’s probably
always going to be a notable limitation of WSL.</p>

<h3 id="so-is-wsl-useful">So is WSL useful?</h3>

<p>One of my hopes for WSL appears to be unfeasible. I thought it might
be a way to avoid <a href="/blog/2017/03/30/">porting software from POSIX to Win32</a>. I
could just supply Windows users with the same Linux binary and they’d
be fine. <del>However, WSL requires switching Windows into a special
“developer mode,” putting it well out of reach of the vast majority of
users, especially considering the typical corporate computing
environment that will lock this down. In practice, WSL is only useful
to developers. I’m sure this is no accident.</del> (Developer mode is <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/commandline/2017/10/11/whats-new-in-wsl-in-windows-10-fall-creators-update/">no
longer required</a> as of October 2017.)</p>

<p>Mostly I see WSL as a Cygwin killer. <a href="https://sanctum.geek.nz/arabesque/series/unix-as-ide/">Unix is my IDE</a> and, on
Windows, Cygwin has been my preferred go-to for getting a solid unix
environment for software development. Unlike WSL, Cygwin processes can
make direct Win32 calls, which is occasionally useful. But, in exchange,
WSL will overall be better equipped. It has native Linux tools,
including a better suite of debugging tools — even better than you get
in Windows itself — Valgrind, strace, and properly-working GDB (always
been flaky in Cygwin). WSL is not nearly as good as actual Linux, but
it’s better than Cygwin <em>if</em> you can get access to it.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Building and Installing Software in $HOME</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/06/19/"/>
    <id>urn:uuid:ae490550-a3b8-3b8f-4338-c2aba7306c8f</id>
    <updated>2017-06-19T02:34:39Z</updated>
    <category term="linux"/><category term="tutorial"/><category term="debian"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>For more than 5 years now I’ve kept a private “root” filesystem within
my home directory under <code class="language-plaintext highlighter-rouge">$HOME/.local/</code>. Within are the standard
<code class="language-plaintext highlighter-rouge">/usr</code> directories, such as <code class="language-plaintext highlighter-rouge">bin/</code>, <code class="language-plaintext highlighter-rouge">include/</code>, <code class="language-plaintext highlighter-rouge">lib/</code>, etc.,
containing my own software, libraries, and man pages. These are
first-class citizens, indistinguishable from the system-installed
programs and libraries. With one exception (setuid programs), none of
this requires root privileges.</p>

<p>Installing software in $HOME serves two important purposes, both of
which are indispensable to me on a regular basis.</p>

<ul>
  <li><strong>No root access</strong>: Sometimes I’m using a system administered by
someone else, and I don’t have root access.</li>
</ul>

<p>This prevents me from installing packaged software myself through the
system’s package manager. Building and installing the software myself in
my home directory, without involvement from the system administrator,
neatly works around this issue. As a software developer, it’s already
perfectly normal for me to build and run custom software, and this is
just an extension of that behavior.</p>

<p>In the most desperate situation, all I need from the sysadmin is a
decent C compiler and at least a minimal POSIX environment. I can
<a href="/blog/2016/11/17/">bootstrap anything I might need</a>, both libraries and
programs, including a better C compiler along the way. This is one
major strength of open source software.</p>

<p>I have noticed one alarming trend: Both GCC (since 4.8) and Clang are
written in C++, so it’s becoming less and less reasonable to bootstrap
a C++ compiler from a C compiler, or even from a C++ compiler that’s
more than a few years old. So you may also need your sysadmin to
supply a fairly recent C++ compiler if you want to bootstrap an
environment that includes C++. I’ve had to avoid some C++ software
(such as CMake) for this reason.</p>

<ul>
  <li><strong>Custom software builds</strong>: Even if I <em>am</em> root, I may still want to
install software not available through the package manager, a version
not available in the package manager, or a version with custom
patches.</li>
</ul>

<p>In theory this is what <code class="language-plaintext highlighter-rouge">/usr/local</code> is all about. It’s typically the
location for software not managed by the system’s package manager.
However, I think it’s cleaner to put this in <code class="language-plaintext highlighter-rouge">$HOME/.local</code>, so long
as other system users don’t need it.</p>

<p>For example, I have an installation of each version of Emacs from
24.3 (the oldest version worth supporting) through the latest stable
release, each suffixed with its version number, under <code class="language-plaintext highlighter-rouge">$HOME/.local</code>.
This is useful for quickly running a test suite under different
releases.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/skeeto/elfeed
$ cd elfeed/
$ make EMACS=emacs24.3 clean test
...
$ make EMACS=emacs25.2 clean test
...
</code></pre></div></div>

<p>Another example is NetHack, which I prefer to play with a couple of
custom patches (<a href="https://bilious.alt.org/?11">Menucolors</a>, <a href="https://gist.github.com/skeeto/11fed852dbfe9889a5fce80e9f6576ac">wchar</a>). The install to
<code class="language-plaintext highlighter-rouge">$HOME/.local</code> <a href="https://gist.github.com/skeeto/5cb9d5e774ce62655aff3507cb806981">is also captured as a patch</a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf nethack-343-src.tar.gz
$ cd nethack-3.4.3/
$ patch -p1 &lt; ~/nh343-menucolor.diff
$ patch -p1 &lt; ~/nh343-wchar.diff
$ patch -p1 &lt; ~/nh343-home-install.diff
$ sh sys/unix/setup.sh
$ make -j$(nproc) install
</code></pre></div></div>

<p>Normally NetHack wants to be setuid (e.g. run as the “games” user) in
order to restrict access to high scores, saves, and bones — saved levels
where a player died, to be inserted randomly into other players’ games.
This prevents cheating, but requires root to set up. Fortunately, when I
install NetHack in my home directory, this isn’t a feature I actually
care about, so I can ignore it.</p>

<p><a href="/blog/2017/06/15/">Mutt</a> is in a similar situation, since it wants to install a
special setgid program (<code class="language-plaintext highlighter-rouge">mutt_dotlock</code>) that synchronizes mailbox
access. All MUAs need something like this.</p>

<p>Everything described below is relevant to basically any modern
unix-like system: Linux, BSD, etc. I personally install software in
$HOME across a variety of systems and, fortunately, it mostly works
the same way everywhere. This is probably in large part due to
everyone standardizing around the GCC and GNU binutils interfaces,
even if the system compiler is actually LLVM/Clang.</p>

<h3 id="configuring-for-home-installs">Configuring for $HOME installs</h3>

<p>Out of the box, installing things in <code class="language-plaintext highlighter-rouge">$HOME/.local</code> won’t do anything
useful. You need to set up some environment variables in your shell
configuration (e.g. <code class="language-plaintext highlighter-rouge">.profile</code>, <code class="language-plaintext highlighter-rouge">.bashrc</code>, etc.) to tell various
programs, such as your shell, about it. The most obvious variable is
$PATH:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<p>Notice I put it in the front of the list. This is because I want my
home directory programs to override system programs with the same
name. For what other reason would I install a program with the same
name if not to override the system program?</p>

<p>In the simplest situation this is good enough, but in practice you’ll
probably need to set a few more things. If you install libraries in
your home directory and expect to use them just as if they were
installed on the system, you’ll need to tell the compiler where else
to look for those headers and libraries, both for C and C++.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">C_INCLUDE_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/include
<span class="nb">export </span><span class="nv">CPLUS_INCLUDE_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/include
<span class="nb">export </span><span class="nv">LIBRARY_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/lib
</code></pre></div></div>

<p>The first two are like the <code class="language-plaintext highlighter-rouge">-I</code> compiler option and the third is like the
<code class="language-plaintext highlighter-rouge">-L</code> linker option, except you <em>usually</em> won’t need to use them
explicitly. Unfortunately <code class="language-plaintext highlighter-rouge">LIBRARY_PATH</code> doesn’t override the system
library paths, so in some cases, you will need to explicitly set
<code class="language-plaintext highlighter-rouge">-L</code>. Otherwise you will still end up linking against the system library
rather than the custom packaged version. I really wish GCC and Clang
didn’t behave this way.</p>

<p>Some software uses <code class="language-plaintext highlighter-rouge">pkg-config</code> to determine its compiler and linker
flags, and your home directory will contain some of the needed
information. So set that up too:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PKG_CONFIG_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/lib/pkgconfig
</code></pre></div></div>

<h4 id="run-time-linker">Run-time linker</h4>

<p>Finally, when you install libraries in your home directory, the run-time
dynamic linker will need to know where to find them. There are three
ways to deal with this:</p>

<ol>
  <li>The <a href="https://web.archive.org/web/20090312014334/http://blogs.sun.com/rie/entry/tt_ld_library_path_tt">crude, easy way</a>: <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>.</li>
  <li>The elegant, difficult way: ELF runpath.</li>
  <li>Screw it, just statically link the bugger. (Not always possible.)</li>
</ol>

<p>For the crude way, point the run-time linker at your <code class="language-plaintext highlighter-rouge">lib/</code> and you’re
done:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/lib
</code></pre></div></div>

<p>However, this is like using a shotgun to kill a fly. If you install a
library in your home directory that is also installed on the system,
and then run a system program, it may be linked against <em>your</em> library
rather than the library installed on the system as was originally
intended. This could have detrimental effects.</p>

<p>The precision method is to set the ELF “runpath” value. It’s like a
per-binary <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>. The run-time linker uses this path first
in its search for libraries, and it will only have an effect on that
particular program/library. This also applies to <code class="language-plaintext highlighter-rouge">dlopen()</code>.</p>

<p>Some software will configure the runpath by default in their build
system, but often you need to configure this yourself. The simplest way
is to set the <code class="language-plaintext highlighter-rouge">LD_RUN_PATH</code> environment variable when building software.
Another option is to manually pass <code class="language-plaintext highlighter-rouge">-rpath</code> options to the linker via
<code class="language-plaintext highlighter-rouge">LDFLAGS</code>. It’s used directly like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Wl,-rpath=$HOME/.local/lib -o foo bar.o baz.o -lquux
</code></pre></div></div>

<p>Verify with <code class="language-plaintext highlighter-rouge">readelf</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d foo | grep runpath
Library runpath: [/home/username/.local/lib]
</code></pre></div></div>

<p>ELF supports a special <code class="language-plaintext highlighter-rouge">$ORIGIN</code> “variable” set to the binary’s
location. This allows the program and associated libraries to be
installed anywhere without changes, so long as they have the same
relative position to each other. (Note the quotes to prevent shell
interpolation.)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Wl,-rpath='$ORIGIN/../lib' -o foo bar.o baz.o -lquux
</code></pre></div></div>

<p>There is one situation where <code class="language-plaintext highlighter-rouge">runpath</code> won’t work: when you want a
system-installed program to find a home directory library with
<code class="language-plaintext highlighter-rouge">dlopen()</code> — e.g. as an extension to that program. You either need to
ensure it uses a relative or absolute path (i.e. the argument to
<code class="language-plaintext highlighter-rouge">dlopen()</code> contains a slash) or you must use <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>.</p>
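<p>To make the slash rule concrete, here’s a minimal sketch of how an extension loader might build an explicit path to a home directory library. The helper names (<code class="language-plaintext highlighter-rouge">is_pathname</code>, <code class="language-plaintext highlighter-rouge">home_lib_path</code>) and the library name <code class="language-plaintext highlighter-rouge">libquux.so</code> are made up for illustration, and it assumes libraries live under <code class="language-plaintext highlighter-rouge">$HOME/.local/lib</code> as configured above.</p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if dlopen() would treat NAME as a pathname (it contains a
 * slash) and skip the library search, 0 if it would search instead. */
static int is_pathname(const char *name)
{
    return strchr(name, '/') != NULL;
}

/* Hypothetical helper: writes "$HOME/.local/lib/NAME" into BUF.
 * Returns 0 on success, -1 if HOME is unset or BUF is too small. */
static int home_lib_path(char *buf, size_t size, const char *name)
{
    const char *home = getenv("HOME");
    if (!home)
        return -1;
    int n = snprintf(buf, size, "%s/.local/lib/%s", home, name);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

<p>Passing the resulting string to <code class="language-plaintext highlighter-rouge">dlopen()</code> sidesteps the search entirely because it contains a slash, whereas a bare name like <code class="language-plaintext highlighter-rouge">libquux.so</code> goes through the normal search and won’t find anything in the home directory without <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>.</p>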

<p>Personally, I always use the <a href="https://www.jwz.org/doc/worse-is-better.html">Worse is Better</a> <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>
shotgun. Occasionally it’s caused some annoying issues, but the vast
majority of the time it gets the job done with little fuss. This is
just my personal development environment, after all, not a production
server.</p>

<h4 id="manual-pages">Manual pages</h4>

<p>Another potentially tricky issue is man pages. When a program or
library installs a man page in your home directory, it would certainly
be nice to access it with <code class="language-plaintext highlighter-rouge">man &lt;topic&gt;</code> just like it was installed on
the system. Fortunately, Debian and Debian-derived systems, using a
mechanism I haven’t yet figured out, discover home directory man pages
automatically without any assistance. No configuration needed.</p>

<p>It’s more complicated on other systems, such as the BSDs. You’ll need to
set the <code class="language-plaintext highlighter-rouge">MANPATH</code> variable to include <code class="language-plaintext highlighter-rouge">$HOME/.local/share/man</code>. It’s
unset by default and it overrides the system settings, which means you
need to manually include the system paths. The <code class="language-plaintext highlighter-rouge">manpath</code> program can
help with this … if it’s available.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">MANPATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/share/man:<span class="si">$(</span>manpath<span class="si">)</span>
</code></pre></div></div>

<p>I haven’t figured out a portable way to deal with this issue, so I
mostly ignore it.</p>

<h3 id="how-to-install-software-in-home">How to install software in $HOME</h3>

<p>While I’ve <a href="/blog/2017/03/30/">pooh-poohed autoconf</a> in the past, the standard
<code class="language-plaintext highlighter-rouge">configure</code> script usually makes it trivial to build and install
software in $HOME. The key ingredient is the <code class="language-plaintext highlighter-rouge">--prefix</code> option:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
</code></pre></div></div>

<p>Most of the time it’s that simple! If you’re linking against your own
libraries and want to use <code class="language-plaintext highlighter-rouge">runpath</code>, it’s a little more complicated:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./configure --prefix=$HOME/.local \
              LDFLAGS="-Wl,-rpath=$HOME/.local/lib"
</code></pre></div></div>

<p>For <a href="https://cmake.org/">CMake</a>, there’s <code class="language-plaintext highlighter-rouge">CMAKE_INSTALL_PREFIX</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..
</code></pre></div></div>

<p>The CMake builds I’ve seen use ELF runpath by default, and no further
configuration may be required to make that work. I’m sure that’s not
always the case, though.</p>

<p>Some software is just a single, static, standalone binary with
<a href="/blog/2016/11/15/">everything baked in</a>. It doesn’t need to be given a prefix, and
installation is as simple as copying the binary into place. For example,
<a href="https://github.com/skeeto/enchive">Enchive</a> works like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/skeeto/enchive
$ cd enchive/
$ make
$ cp enchive ~/.local/bin
</code></pre></div></div>

<p>Some software uses its own unique configuration interface. I can respect
that, but it does add some friction for users who now have something
additional and non-transferable to learn. I demonstrated a NetHack build
above, which has a configuration much more involved than it really
should be. Another example is LuaJIT, which uses <code class="language-plaintext highlighter-rouge">make</code> variables that
must be provided consistently on every invocation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf LuaJIT-2.0.5.tar.gz
$ cd LuaJIT-2.0.5/
$ make -j$(nproc) PREFIX=$HOME/.local
$ make PREFIX=$HOME/.local install
</code></pre></div></div>

<p>(You <em>can</em> use the “install” target to both build and install, but I
wanted to illustrate the repetition of <code class="language-plaintext highlighter-rouge">PREFIX</code>.)</p>

<p>Some libraries aren’t so smart about <code class="language-plaintext highlighter-rouge">pkg-config</code> and need some
handholding — for example, <a href="https://www.gnu.org/software/ncurses/">ncurses</a>. I mention it because
it’s required for both Vim and Emacs, among many others, so I’m often
building it myself. For its pkg-config files it ignores <code class="language-plaintext highlighter-rouge">--prefix</code> and needs to be told a
second time where to install things:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./configure --prefix=$HOME/.local \
              --enable-pc-files \
              --with-pkg-config-libdir=$PKG_CONFIG_PATH
</code></pre></div></div>

<p>Another issue is that a whole lot of software has been hardcoded for
ncurses 5.x (i.e. <code class="language-plaintext highlighter-rouge">ncurses5-config</code>), and it requires hacks/patching
to make it behave properly with ncurses 6.x. I’ve avoided ncurses 6.x
for this reason.</p>

<h3 id="learning-through-experience">Learning through experience</h3>

<p>I could go on and on like this, discussing the quirks for the various
libraries and programs that I use. Over the years I’ve gotten used to
many of these issues, committing the solutions to memory.
Unfortunately, even within the same version of a piece of software,
the quirks can change <a href="https://www.debian.org/News/2017/20170617.en.html">between major operating system
releases</a>, so I’m continuously learning my way around new
issues. It’s really given me an appreciation for all the hard work
that package maintainers put into customizing and maintaining software
builds to <a href="https://www.debian.org/doc/manuals/maint-guide/">fit properly into a larger ecosystem</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Asynchronous Requests from Emacs Dynamic Modules</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/02/14/"/>
    <id>urn:uuid:00a59e4f-268c-343f-e6c6-bb23cde265de</id>
    <updated>2017-02-14T02:30:00Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="c"/><category term="linux"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>A few months ago I had a discussion with Vladimir Kazanov about his
<a href="https://github.com/vkazanov/toy-orgfuse">Orgfuse</a> project: a Python script that exposes an Emacs
Org-mode document as a <a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace">FUSE filesystem</a>. It permits other
programs to navigate the structure of an Org-mode document through the
standard filesystem APIs. I suggested that, with the new dynamic
modules in Emacs 25, Emacs <em>itself</em> could serve a FUSE filesystem. In
fact, support for FUSE services in general could be a package of its
own.</p>

<p>So that’s what he did: <a href="https://github.com/vkazanov/elfuse"><strong>Elfuse</strong></a>. It’s an old joke that
Emacs is an operating system, and here it is handling system calls.</p>

<p>However, there’s a tricky problem to solve, an issue also present in <a href="/blog/2016/11/05/">my
joystick module</a>. Both modules handle asynchronous events —
filesystem requests or joystick events — but Emacs runs the event loop
and owns the main thread. The external events somehow need to feed
into the main event loop. It’s even more difficult with FUSE because
FUSE <em>also</em> wants control of its own thread for its own event loop.
This requires Elfuse to spawn a dedicated FUSE thread and negotiate a
request/response hand-off.</p>

<p>When a filesystem request or joystick event arrives, how does Emacs
know to handle it? The simple and obvious solution is to poll the
module from a timer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">queue</span> <span class="n">requests</span><span class="p">;</span>

<span class="n">emacs_value</span>
<span class="nf">Frequest_next</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">emacs_value</span> <span class="n">next</span> <span class="o">=</span> <span class="n">Qnil</span><span class="p">;</span>
    <span class="n">queue_lock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">queue_length</span><span class="p">(</span><span class="n">requests</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">request</span> <span class="o">=</span> <span class="n">queue_pop</span><span class="p">(</span><span class="n">requests</span><span class="p">,</span> <span class="n">env</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fin_empty</span><span class="p">,</span> <span class="n">request</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">queue_unlock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And then ask Emacs to check the module every, say, 10ms:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">request--poll</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">next</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">when</span> <span class="nv">next</span>
      <span class="p">(</span><span class="nv">request-handle</span> <span class="nv">next</span><span class="p">))))</span>

<span class="p">(</span><span class="nv">run-at-time</span> <span class="mi">0</span> <span class="mf">0.01</span> <span class="nf">#'</span><span class="nv">request--poll</span><span class="p">)</span>
</code></pre></div></div>

<p>Blocking directly on the module’s event pump with Emacs’ thread would
prevent Emacs from doing important things like, you know, <em>being a
text editor</em>. The timer allows it to handle its own events
uninterrupted. It gets the job done, but it’s far from perfect:</p>

<ol>
  <li>
    <p>It imposes an arbitrary latency on handling requests. Up to a full
poll period could pass before a request is handled.</p>
  </li>
  <li>
    <p>Polling the module 100 times per second is inefficient. Unless you
really enjoy recharging your laptop, that’s no good.</p>
  </li>
</ol>

<p>The poll period is a sliding trade-off between latency and battery
life. If only there was some mechanism to, ahem, <em>signal</em> the Emacs
thread, informing it that a request is waiting…</p>

<h3 id="sigusr1">SIGUSR1</h3>

<p>Emacs Lisp programs can handle the POSIX SIGUSR1 and SIGUSR2 signals,
which is exactly the mechanism we need. The interface is a “key”
binding on <code class="language-plaintext highlighter-rouge">special-event-map</code>, the keymap that handles these kinds of
events. When the signal arrives, Emacs queues it up for the main event
loop.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">define-key</span> <span class="nv">special-event-map</span> <span class="nv">[sigusr1]</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
    <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">request-handle</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">))))</span>
</code></pre></div></div>

<p>The module blocks on its own thread on its own event pump. When a
request arrives, it queues the request, rings the bell for Emacs to
come handle it (<code class="language-plaintext highlighter-rouge">raise()</code>), and waits on a semaphore. For illustration
purposes, assume the module reads requests from and writes responses
to a file descriptor, like a socket.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">event_fd</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">request</span> <span class="n">request</span><span class="p">;</span>
<span class="n">sem_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">sem</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="cm">/* Blocking read for request event */</span>
    <span class="n">read</span><span class="p">(</span><span class="n">event_fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">event</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">event</span><span class="p">));</span>

    <span class="cm">/* Put request on the queue */</span>
    <span class="n">queue_lock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="n">queue_push</span><span class="p">(</span><span class="n">requests</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">);</span>
    <span class="n">queue_unlock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="n">raise</span><span class="p">(</span><span class="n">SIGUSR1</span><span class="p">);</span>  <span class="c1">// TODO: Should raise() go inside the lock?</span>

    <span class="cm">/* Wait for Emacs */</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">sem</span><span class="p">))</span>
        <span class="p">;</span>

    <span class="cm">/* Reply with Emacs' response */</span>
    <span class="n">write</span><span class="p">(</span><span class="n">event_fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">response</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">response</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">sem_wait()</code> is in a loop because signals will wake it up
prematurely. In fact, it may even wake up due to its own signal on the
line before. This is the only way this particular use of <code class="language-plaintext highlighter-rouge">sem_wait()</code>
might fail, so there’s no need to check <code class="language-plaintext highlighter-rouge">errno</code>.</p>
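<p>If you’d rather not rely on that assumption, a slightly more defensive variant of the loop checks <code class="language-plaintext highlighter-rouge">errno</code> anyway and only retries on <code class="language-plaintext highlighter-rouge">EINTR</code>. This is just a sketch, and the name <code class="language-plaintext highlighter-rouge">sem_wait_retry</code> is my own invention, not part of any library.</p>

```c
#include <errno.h>
#include <semaphore.h>

/* Waits on SEM, retrying whenever a signal interrupts the wait.
 * Returns 0 once the semaphore has been acquired, or -1 with errno
 * set for any failure other than EINTR (e.g. EINVAL). */
static int sem_wait_retry(sem_t *sem)
{
    while (sem_wait(sem) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

<p>In the module thread above it would be a drop-in replacement for the bare <code class="language-plaintext highlighter-rouge">while</code> loop, with the difference that an unexpected error surfaces instead of spinning forever.</p>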

<p>If there are multiple module threads making requests to the same
global queue, the lock is necessary to protect the queue. The
semaphore is only for blocking the thread until Emacs has finished
writing its particular response. Each thread has its own semaphore.</p>
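<p>I’ve left the queue itself undefined in these examples. For the curious, here’s one way it might look: a fixed-capacity ring buffer guarded by a mutex. Everything here is an assumption for the sake of illustration, including the capacity, the <code class="language-plaintext highlighter-rouge">queue_init</code> function, and passing the queue by pointer, which differs slightly from the calls shown above.</p>

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 64  /* arbitrary capacity chosen for this sketch */

struct queue {
    pthread_mutex_t lock;
    void *items[QUEUE_CAP];
    size_t head;  /* index of the oldest item */
    size_t len;   /* number of queued items */
};

static void queue_init(struct queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = 0;
    q->len = 0;
}

static void queue_lock(struct queue *q)   { pthread_mutex_lock(&q->lock); }
static void queue_unlock(struct queue *q) { pthread_mutex_unlock(&q->lock); }

/* Caller must hold the lock. */
static size_t queue_length(const struct queue *q)
{
    return q->len;
}

/* Caller must hold the lock. Returns 0 on success, -1 if full. */
static int queue_push(struct queue *q, void *item)
{
    if (q->len == QUEUE_CAP)
        return -1;
    q->items[(q->head + q->len) % QUEUE_CAP] = item;
    q->len++;
    return 0;
}

/* Caller must hold the lock. Returns NULL if the queue is empty. */
static void *queue_pop(struct queue *q)
{
    if (q->len == 0)
        return NULL;
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->len--;
    return item;
}
```

<p>Keeping the locking separate from push and pop lets the caller check the length and pop under a single lock acquisition, which is the pattern the polling function uses.</p>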

<p>When Emacs is done writing the response, it releases the module thread
by incrementing the semaphore. It might look something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">emacs_value</span>
<span class="nf">Frequest_complete</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">request</span> <span class="o">*</span><span class="n">request</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">request</span><span class="p">)</span>
        <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="o">-&gt;</span><span class="n">sem</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">Qnil</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The top-level handler dispatches to the specific request handler,
calling <code class="language-plaintext highlighter-rouge">request-complete</code> above when it’s done.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">request-handle</span> <span class="p">(</span><span class="nv">next</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">condition-case</span> <span class="nv">e</span>
      <span class="p">(</span><span class="nv">cl-ecase</span> <span class="p">(</span><span class="nv">request-type</span> <span class="nv">next</span><span class="p">)</span>
        <span class="p">(</span><span class="ss">:open</span>  <span class="p">(</span><span class="nv">request-handle-open</span>  <span class="nv">next</span><span class="p">))</span>
        <span class="p">(</span><span class="ss">:close</span> <span class="p">(</span><span class="nv">request-handle-close</span> <span class="nv">next</span><span class="p">))</span>
        <span class="p">(</span><span class="ss">:read</span>  <span class="p">(</span><span class="nv">request-handle-read</span>  <span class="nv">next</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">error</span> <span class="p">(</span><span class="nv">request-respond-as-error</span> <span class="nv">next</span> <span class="nv">e</span><span class="p">)))</span>
  <span class="p">(</span><span class="nv">request-complete</span> <span class="nv">next</span><span class="p">))</span>
</code></pre></div></div>

<p>This SIGUSR1+semaphore mechanism is roughly how Elfuse currently
processes requests.</p>

<h3 id="making-it-work-on-windows">Making it work on Windows</h3>

<p>Windows doesn’t have signals. This isn’t a problem for Elfuse since
Windows doesn’t have FUSE either. Nor does it matter for Joymacs since
XInput isn’t event-driven and always requires polling. But someday
someone will need this mechanism for a dynamic module on Windows.</p>

<p>Fortunately there’s a solution: <em>input language change</em> events,
<code class="language-plaintext highlighter-rouge">WM_INPUTLANGCHANGE</code>. It’s also on <code class="language-plaintext highlighter-rouge">special-event-map</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">define-key</span> <span class="nv">special-event-map</span> <span class="nv">[language-change]</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
    <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">request-handle</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">))))</span>
</code></pre></div></div>

<p>Instead of <code class="language-plaintext highlighter-rouge">raise()</code> (or <code class="language-plaintext highlighter-rouge">pthread_kill()</code>), broadcast the window event
with <code class="language-plaintext highlighter-rouge">PostMessage()</code>. Outside of invoking the <code class="language-plaintext highlighter-rouge">language-change</code> key
binding, Emacs will ignore the event because WPARAM is 0 — it doesn’t
belong to any particular window. We don’t <em>really</em> want to change the
input language, after all.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PostMessageA</span><span class="p">(</span><span class="n">HWND_BROADCAST</span><span class="p">,</span> <span class="n">WM_INPUTLANGCHANGE</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Naturally you’ll also need to replace the POSIX threading primitives
with the Windows versions (<code class="language-plaintext highlighter-rouge">CreateThread()</code>, <code class="language-plaintext highlighter-rouge">CreateSemaphore()</code>,
etc.). With a bit of abstraction in the right places, it should be
pretty easy to support both POSIX and Windows in these asynchronous
dynamic module events.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Manual Control Flow Guard in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/01/21/"/>
    <id>urn:uuid:f185405a-3e30-3612-7a21-6d4ec450519d</id>
    <updated>2017-01-21T22:44:15Z</updated>
    <category term="c"/><category term="linux"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p>Recent versions of Windows have a new exploit mitigation feature
called <a href="http://sjc1-te-ftp.trendmicro.com/assets/wp/exploring-control-flow-guard-in-windows10.pdf"><em>Control Flow Guard</em></a> (CFG). Before an indirect function
call — e.g. function pointers and virtual functions — the target
address is checked against a table of valid call addresses. If the
address isn’t the entry point of a known function, then the program is
aborted.</p>

<p>If an application has a buffer overflow vulnerability, an attacker may
use it to overwrite a function pointer and, by the call through that
pointer, control the execution flow of the program. This is one way to
initiate a <a href="https://skeeto.s3.amazonaws.com/share/p15-coffman.pdf"><em>Return Oriented Programming</em></a> (ROP) attack, where
the attacker constructs <a href="https://github.com/JonathanSalwan/ROPgadget">a chain of <em>gadget</em> addresses</a> — a
gadget being a couple of instructions followed by a return
instruction, all in the original program — using the indirect call as
the starting point. The execution then flows from gadget to gadget so
that the program does what the attacker wants it to do, all without
the attacker supplying any code.</p>

<p>The two most widely practiced ROP attack mitigation techniques today
are <em>Address Space Layout Randomization</em> (ASLR) and <em>stack
protectors</em>. The former randomizes the base address of executable
images (programs, shared libraries) so that process memory layout is
unpredictable to the attacker. The addresses in the ROP attack chain
depend on the run-time memory layout, so the attacker must also find
and exploit an <a href="https://github.com/torvalds/linux/blob/4c9eff7af69c61749b9eb09141f18f35edbf2210/Documentation/sysctl/kernel.txt#L373">information leak</a> to bypass ASLR.</p>

<p>For stack protectors, the compiler allocates a <em>canary</em> on the stack
above other stack allocations and sets the canary to a per-thread
random value. If a buffer overflows to overwrite the function return
pointer, the canary value will also be overwritten. Before the
function returns by the return pointer, it checks the canary. If the
canary doesn’t match the known value, the program is aborted.</p>

<p><img src="/img/cfg/canary.svg" alt="" /></p>

<p>CFG works similarly — performing a check prior to passing control to
the address in a pointer — except that instead of checking a canary,
it checks the target address itself. This is a lot more sophisticated,
and, unlike a stack canary, essentially requires coordination by the
platform. The check must be informed of all valid call targets,
whether from the main program or from shared libraries.</p>

<p>While not (yet?) widely deployed, a worthy mention is <a href="http://clang.llvm.org/docs/SafeStack.html">Clang’s
SafeStack</a>. Each thread gets <em>two</em> stacks: a “safe stack” for
return pointers and other safely-accessed values, and an “unsafe
stack” for buffers and such. Buffer overflows will corrupt other
buffers but will not overwrite return pointers, limiting the effect of
their damage.</p>

<h3 id="an-exploit-example">An exploit example</h3>

<p>Consider this trivial C program, <code class="language-plaintext highlighter-rouge">demo.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
    <span class="n">gets</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Hello, %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It reads a name into a buffer and prints it back out with a greeting.
While trivial, it’s far from innocent. That naive call to <code class="language-plaintext highlighter-rouge">gets()</code>
doesn’t check the bounds of the buffer, introducing an exploitable
buffer overflow. It’s so obvious that both the compiler and linker
will yell about it.</p>

<p>For simplicity, suppose the program also contains a dangerous
function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">self_destruct</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"**** GO BOOM! ****"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The attacker can use the buffer overflow to call this dangerous
function.</p>

<p>To make this attack simpler for the sake of the article, assume the
program isn’t using ASLR (e.g. without <code class="language-plaintext highlighter-rouge">-fpie</code>/<code class="language-plaintext highlighter-rouge">-pie</code>, or with
<code class="language-plaintext highlighter-rouge">-fno-pie</code>/<code class="language-plaintext highlighter-rouge">-no-pie</code>). For this particular example, I’ll also
explicitly disable buffer overflow protections (e.g. <code class="language-plaintext highlighter-rouge">_FORTIFY_SOURCE</code>
and stack protectors).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fno-stack-protector \
      -o demo demo.c
</code></pre></div></div>

<p>First, find the address of <code class="language-plaintext highlighter-rouge">self_destruct()</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -a demo | grep self_destruct
46: 00000000004005c5  10 FUNC  GLOBAL DEFAULT 13 self_destruct
</code></pre></div></div>

<p>This is on x86-64, so it’s a 64-bit address. The size of the <code class="language-plaintext highlighter-rouge">name</code>
buffer is 8 bytes, and peeking at the assembly I see an extra 8 bytes
allocated above, so there’s 16 bytes to fill, then 8 bytes to
overwrite the return pointer with the address of <code class="language-plaintext highlighter-rouge">self_destruct</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo -ne 'xxxxxxxxyyyyyyyy\xc5\x05\x40\x00\x00\x00\x00\x00' &gt; boom
$ ./demo &lt; boom
Hello, xxxxxxxxyyyyyyyy?@.
**** GO BOOM! ****
Segmentation fault
</code></pre></div></div>
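
<p>The same payload layout (filler up to the return pointer, then the
target address as 8 little-endian bytes) can be assembled in C. Here’s a
minimal sketch, where <code class="language-plaintext highlighter-rouge">build_payload()</code> is a hypothetical helper of my
own naming and the filler is all <code class="language-plaintext highlighter-rouge">x</code> bytes rather than the mix used
above:</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the demo program): write `pad`
 * filler bytes followed by `addr` as 8 little-endian bytes, the byte
 * order x86-64 uses for the return pointer. Returns the payload
 * length, or 0 if `out` is too small to hold it. */
size_t
build_payload(unsigned char *out, size_t cap, size_t pad, uint64_t addr)
{
    if (cap < 8 || pad > cap - 8)
        return 0;
    memset(out, 'x', pad);
    for (int i = 0; i < 8; i++)
        out[pad + i] = (unsigned char)(addr >> (8 * i));
    return pad + 8;
}
```

<p>Calling it with a padding of 16 and the address 0x4005c5, then
writing the result to a file with <code class="language-plaintext highlighter-rouge">fwrite()</code>, should produce the same
kind of input as the <code class="language-plaintext highlighter-rouge">echo</code> command.</p>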

<p>With this input I’ve successfully exploited the buffer overflow to
divert control to <code class="language-plaintext highlighter-rouge">self_destruct()</code>. When <code class="language-plaintext highlighter-rouge">main</code> tries to return into
libc, it instead jumps to the dangerous function, and then crashes
when that function tries to return — though, presumably, the system
would have self-destructed already. Turning on the stack protector
stops this exploit.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fstack-protector \
      -o demo demo.c
$ ./demo &lt; boom
Hello, xxxxxxxxyyyyyyyy?@.
*** stack smashing detected ***: ./demo terminated
======= Backtrace: =========
... lots of backtrace stuff ...
</code></pre></div></div>

<p>The stack protector successfully blocks the exploit. To get around
this, I’d have to either guess the canary value or discover an
information leak that reveals it.</p>

<p>The stack protector transformed the program into something that looks
like the following:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">__canary</span> <span class="o">=</span> <span class="n">__get_thread_canary</span><span class="p">();</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
    <span class="n">gets</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Hello, %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">__canary</span> <span class="o">!=</span> <span class="n">__get_thread_canary</span><span class="p">())</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, it’s not actually possible to implement the stack protector
within C. Buffer overflows are undefined behavior, and a canary is
only affected by a buffer overflow, allowing the compiler to optimize
it away.</p>

<h3 id="function-pointers-and-virtual-functions">Function pointers and virtual functions</h3>

<p>After the attacker successfully self-destructed the last computer,
upper management has mandated password checks before all
self-destruction procedures. Here’s what it looks like now:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">self_destruct</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">password</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="s">"12345"</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">puts</span><span class="p">(</span><span class="s">"**** GO BOOM! ****"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The password is hardcoded, and it’s the kind of thing an idiot would
have on his luggage, but assume it’s actually unknown to the attacker.
Especially since, as I’ll show shortly, it won’t matter. Upper
management has also mandated stack protectors, so assume that’s
enabled from here on.</p>

<p>Additionally, the program has evolved a bit, and now <a href="/blog/2014/10/21/">uses a function
pointer for polymorphism</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">greeter</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">greet</span><span class="p">)(</span><span class="k">struct</span> <span class="n">greeter</span> <span class="o">*</span><span class="p">);</span>
<span class="p">};</span>

<span class="kt">void</span>
<span class="nf">greet_hello</span><span class="p">(</span><span class="k">struct</span> <span class="n">greeter</span> <span class="o">*</span><span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Hello, %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">g</span><span class="o">-&gt;</span><span class="n">name</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">greet_aloha</span><span class="p">(</span><span class="k">struct</span> <span class="n">greeter</span> <span class="o">*</span><span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Aloha, %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">g</span><span class="o">-&gt;</span><span class="n">name</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s now a greeter object and the function pointer makes its
behavior polymorphic. Think of it as a hand-coded virtual function for
C. Here’s the new (contrived) <code class="language-plaintext highlighter-rouge">main</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">greeter</span> <span class="n">greeter</span> <span class="o">=</span> <span class="p">{.</span><span class="n">greet</span> <span class="o">=</span> <span class="n">greet_hello</span><span class="p">};</span>
    <span class="n">gets</span><span class="p">(</span><span class="n">greeter</span><span class="p">.</span><span class="n">name</span><span class="p">);</span>
    <span class="n">greeter</span><span class="p">.</span><span class="n">greet</span><span class="p">(</span><span class="o">&amp;</span><span class="n">greeter</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(In a real program, something else provides <code class="language-plaintext highlighter-rouge">greeter</code> and picks its
own function pointer for <code class="language-plaintext highlighter-rouge">greet</code>.)</p>

<p>Rather than overwriting the return pointer, the attacker has the
opportunity to overwrite the function pointer on the struct. Let’s
reconstruct the exploit like before.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -a demo | grep self_destruct
54: 00000000004006a5  10 FUNC  GLOBAL DEFAULT  13 self_destruct
</code></pre></div></div>

<p>We don’t know the password, but we <em>do</em> know (from peeking at the
disassembly) that the password check is 16 bytes. The attack should
instead jump 16 bytes into the function, skipping over the check
(0x4006a5 + 16 = 0x4006b5).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo -ne 'xxxxxxxx\xb5\x06\x40\x00\x00\x00\x00\x00' &gt; boom
$ ./demo &lt; boom
**** GO BOOM! ****
</code></pre></div></div>

<p>Neither the stack protector nor the password was of any help. The
stack protector only protects the <em>return</em> pointer, not the function
pointer on the struct.</p>

<p><strong>This is where the Control Flow Guard comes into play.</strong> With CFG
enabled, the compiler inserts a check before calling the <code class="language-plaintext highlighter-rouge">greet()</code>
function pointer. It must point to the beginning of a known function,
otherwise it will abort just like the stack protector. Since the
middle of <code class="language-plaintext highlighter-rouge">self_destruct()</code> isn’t the <em>beginning</em> of a function, it
would abort if this exploit is attempted.</p>

<p>However, I’m on Linux and there’s no CFG on Linux (yet?). So I’ll
implement it myself, with manual checks.</p>

<h3 id="function-address-bitmap">Function address bitmap</h3>

<p>As described in the PDF linked at the top of this article, CFG on
Windows is implemented using a bitmap. Each bit in the bitmap
represents 8 bytes of memory. If those 8 bytes contain the beginning
of a function, the bit will be set to one. Checking a pointer means
checking its associated bit in the bitmap.</p>

<p>For my CFG, I’ve decided to keep the same 8-byte resolution: the
bottom three bits of the target address will be dropped. The next 24
bits will be used to index into the bitmap. All other bits in the
pointer will be ignored. A 24-bit bit index means the bitmap will only
be 2MB.</p>

<p>These 24 bits are perfectly sufficient for 32-bit systems, but it means
on 64-bit systems there may be false positives: some addresses will
not represent the start of a function, but will have their bit set
to 1. This is acceptable, especially because only functions known to
be targets of indirect calls will be registered in the table, reducing
the false positive rate.</p>

<p>Note: Relying on <a href="/blog/2016/05/30/">the bits of a pointer cast to an integer is
unspecified</a> and isn’t portable, but this implementation will
work fine anywhere I would care to use it.</p>

<p>Here are the CFG parameters. I’ve made them macros so that they can
easily be tuned at compile-time. The <code class="language-plaintext highlighter-rouge">cfg_bits</code> is the integer type
backing the bitmap array. The <code class="language-plaintext highlighter-rouge">CFG_RESOLUTION</code> is the number of bits
dropped, so “3” is a granularity of 8 bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">cfg_bits</span><span class="p">;</span>
<span class="cp">#define CFG_RESOLUTION  3
#define CFG_BITS        24
</span></code></pre></div></div>

<p>Given a function pointer <code class="language-plaintext highlighter-rouge">f</code>, this macro extracts the bitmap index.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define CFG_INDEX(f) \
    (((uintptr_t)(f) &gt;&gt; CFG_RESOLUTION) &amp; ((1UL &lt;&lt; CFG_BITS) - 1))
</span></code></pre></div></div>
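
<p>To make the index arithmetic concrete, here’s a small sketch that
applies the same shift-and-mask to raw addresses. The helper names
<code class="language-plaintext highlighter-rouge">cfg_index_of()</code> and <code class="language-plaintext highlighter-rouge">cfg_aliases()</code> are hypothetical, introduced only
for illustration, and the macro argument is parenthesized here as a
precaution:</p>

```c
#include <stdint.h>

#define CFG_RESOLUTION  3
#define CFG_BITS        24
#define CFG_INDEX(f) \
    (((uintptr_t)(f) >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))

/* Hypothetical helper: bitmap index for a raw address. */
unsigned long
cfg_index_of(uintptr_t addr)
{
    return CFG_INDEX(addr);
}

/* Hypothetical helper: nonzero when two addresses share a bitmap bit,
 * i.e. the check cannot tell them apart. */
int
cfg_aliases(uintptr_t a, uintptr_t b)
{
    return cfg_index_of(a) == cfg_index_of(b);
}
```

<p>With the addresses from the exploit, 0x4006a5 maps to index 0x800d4
and 0x4006b5 maps to index 0x800d6, so the two land on different bits.
Addresses within the same aligned 8-byte block share a bit, as do
addresses exactly 2<sup>27</sup> bytes (128MB) apart, which is where the false
positives come from.</p>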

<p>The CFG bitmap is just an array of integers. Zero it to initialize.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">cfg</span> <span class="p">{</span>
    <span class="n">cfg_bits</span> <span class="n">bitmap</span><span class="p">[(</span><span class="mi">1UL</span> <span class="o">&lt;&lt;</span> <span class="n">CFG_BITS</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">cfg_bits</span><span class="p">)</span> <span class="o">*</span> <span class="n">CHAR_BIT</span><span class="p">)];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Functions are manually registered in the bitmap using
<code class="language-plaintext highlighter-rouge">cfg_register()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">cfg_register</span><span class="p">(</span><span class="k">struct</span> <span class="n">cfg</span> <span class="o">*</span><span class="n">cfg</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="n">CFG_INDEX</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">z</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">cfg_bits</span><span class="p">)</span> <span class="o">*</span> <span class="n">CHAR_BIT</span><span class="p">;</span>
    <span class="n">cfg</span><span class="o">-&gt;</span><span class="n">bitmap</span><span class="p">[</span><span class="n">i</span> <span class="o">/</span> <span class="n">z</span><span class="p">]</span> <span class="o">|=</span> <span class="mi">1UL</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">i</span> <span class="o">%</span> <span class="n">z</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Because functions are registered at run-time, it’s fully compatible
with ASLR. If ASLR is enabled, the bitmap will be a little different
each run. On the same note, it may be worth XORing each bitmap element
with a random, run-time value — along the same lines as the stack
canary value — to make it harder for an attacker to manipulate the
bitmap should a vulnerability give him the ability to overwrite it.
Alternatively the bitmap could be switched to read-only (e.g.
<code class="language-plaintext highlighter-rouge">mprotect()</code>) once everything is registered.</p>

<p>And finally, the check function, used immediately before indirect
calls. It ensures <code class="language-plaintext highlighter-rouge">f</code> was previously passed to <code class="language-plaintext highlighter-rouge">cfg_register()</code>
(except for false positives, as discussed). Since it will be invoked
often, it needs to be fast and simple.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">cfg_check</span><span class="p">(</span><span class="k">struct</span> <span class="n">cfg</span> <span class="o">*</span><span class="n">cfg</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="n">CFG_INDEX</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">z</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">cfg_bits</span><span class="p">)</span> <span class="o">*</span> <span class="n">CHAR_BIT</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">((</span><span class="n">cfg</span><span class="o">-&gt;</span><span class="n">bitmap</span><span class="p">[</span><span class="n">i</span> <span class="o">/</span> <span class="n">z</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="n">i</span> <span class="o">%</span> <span class="n">z</span><span class="p">))</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">))</span>
        <span class="n">abort</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And that’s it! Now augment <code class="language-plaintext highlighter-rouge">main</code> to make use of it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">cfg</span> <span class="n">cfg</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">cfg_register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cfg</span><span class="p">,</span> <span class="n">self_destruct</span><span class="p">);</span>  <span class="c1">// to prove this works</span>
    <span class="n">cfg_register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cfg</span><span class="p">,</span> <span class="n">greet_hello</span><span class="p">);</span>
    <span class="n">cfg_register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cfg</span><span class="p">,</span> <span class="n">greet_aloha</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">greeter</span> <span class="n">greeter</span> <span class="o">=</span> <span class="p">{.</span><span class="n">greet</span> <span class="o">=</span> <span class="n">greet_hello</span><span class="p">};</span>
    <span class="n">gets</span><span class="p">(</span><span class="n">greeter</span><span class="p">.</span><span class="n">name</span><span class="p">);</span>
    <span class="n">cfg_check</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cfg</span><span class="p">,</span> <span class="n">greeter</span><span class="p">.</span><span class="n">greet</span><span class="p">);</span>
    <span class="n">greeter</span><span class="p">.</span><span class="n">greet</span><span class="p">(</span><span class="o">&amp;</span><span class="n">greeter</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And now attempting the exploit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./demo &lt; boom
Aborted
</code></pre></div></div>

<p>Normally <code class="language-plaintext highlighter-rouge">self_destruct()</code> wouldn’t be registered since it’s not a
legitimate target of an indirect call, but the exploit <em>still</em> didn’t
work because it called into the middle of <code class="language-plaintext highlighter-rouge">self_destruct()</code>, which
isn’t a valid address in the bitmap. The check aborts the program
before it can be exploited.</p>

<p>In a real application I would have a <a href="/blog/2016/12/23/">global <code class="language-plaintext highlighter-rouge">cfg</code> bitmap</a> for
the whole program, and define <code class="language-plaintext highlighter-rouge">cfg_check()</code> in a header as an <code class="language-plaintext highlighter-rouge">inline</code>
function.</p>

<p>Despite it being possible to implement in straight C without the help of the
toolchain, it would be far less cumbersome and error-prone to let the
compiler and platform handle Control Flow Guard. That’s the right
place to implement it.</p>

<p><em>Update</em>: Ted Unangst pointed out <a href="http://www.tedunangst.com/inks/l/849">OpenBSD performing a similar
check</a> in its mbuf library. Instead of a bitmap, the function
pointer is replaced with an index into an array of registered function
pointers. That approach is cleaner, more efficient, completely
portable, and has no false positives.</p>
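
<p>Applied to the greeter example, the index-based idea could be
sketched as below. This is my own illustration, not OpenBSD’s code: the
names <code class="language-plaintext highlighter-rouge">greet_register()</code>, <code class="language-plaintext highlighter-rouge">greet_lookup()</code>, <code class="language-plaintext highlighter-rouge">greeter_call()</code>, and the
fixed table size <code class="language-plaintext highlighter-rouge">GREET_MAX</code> are all assumptions.</p>

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct greeter {
    char name[8];
    size_t greet;  /* index into the table below, not a pointer */
};

typedef void (*greet_fn)(struct greeter *);

#define GREET_MAX 8  /* arbitrary capacity for this sketch */
static greet_fn greet_table[GREET_MAX];
static size_t greet_count;

/* Register a function. Returns its index, or -1 if the table is full. */
int
greet_register(greet_fn f)
{
    if (greet_count == GREET_MAX)
        return -1;
    greet_table[greet_count] = f;
    return (int)greet_count++;
}

/* Return the function at index, or NULL if index is out of range. */
greet_fn
greet_lookup(size_t index)
{
    return index < greet_count ? greet_table[index] : NULL;
}

void
greet_hello(struct greeter *g)
{
    printf("Hello, %s.\n", g->name);
}

/* Bounds-checked indirect call: abort on an unregistered index. */
void
greeter_call(struct greeter *g)
{
    greet_fn f = greet_lookup(g->greet);
    if (!f)
        abort();
    f(g);
}
```

<p>An overflow can still clobber the index, but the worst it can select
is some other registered function. A value like 0x4006b5 is simply out
of range and the call aborts.</p>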

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>C Closures as a Library</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/01/08/"/>
    <id>urn:uuid:a5f897bc-0510-3164-a949-fcb848d9279b</id>
    <updated>2017-01-08T22:45:38Z</updated>
    <category term="c"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>A common idiom in C is the callback function pointer, either to
deliver information (i.e. a <em>visitor</em> or <em>handler</em>) or to customize
the function’s behavior (e.g. a comparator). Examples of the latter in
the C standard library are <code class="language-plaintext highlighter-rouge">qsort()</code> and <code class="language-plaintext highlighter-rouge">bsearch()</code>, each requiring a
comparator function in order to operate on arbitrary types.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">qsort</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
           <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">));</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">bsearch</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span>
              <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
              <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">));</span>
</code></pre></div></div>

<p>A problem with these functions is that there’s no way to pass context
to the callback. The callback may need information beyond the two
element pointers when making its decision, or to <a href="/blog/2016/09/05/">update a
result</a>. For example, suppose I have a structure representing a
two-dimensional coordinate, and a coordinate distance function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">coord</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">y</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">float</span>
<span class="nf">distance</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dx</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">x</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">x</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">dy</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">y</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">y</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">dx</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">+</span> <span class="n">dy</span> <span class="o">*</span> <span class="n">dy</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I have an array of coordinates and I want to sort them based on
their distance from some target, the comparator needs to know the
target. However, the <code class="language-plaintext highlighter-rouge">qsort()</code> interface has no way to directly pass
this information. Instead it has to be passed by another means, such
as a global variable.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">coord_cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dist_a</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">dist_b</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&lt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&gt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And its usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">size_t</span> <span class="n">ncoords</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">coords</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">coord</span> <span class="n">current_target</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="c1">// ...</span>
    <span class="n">target</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">;</span>
    <span class="nf">qsort</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">coord_cmp</span><span class="p">);</span>
</code></pre></div></div>

<p>Potential problems are that it’s neither thread-safe nor re-entrant.
Two different threads cannot use this comparator <a href="/blog/2014/10/12/">at the same
time</a>. Also, on some platforms and configurations, repeatedly
accessing a global variable in a comparator <a href="/blog/2016/12/23/">may have a significant
cost</a>. A common workaround for thread safety is to make the
global variable thread-local by allocating it in thread-local storage
(TLS):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">_Thread_local</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>       <span class="c1">// C11</span>
<span class="kr">__thread</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>            <span class="c1">// GCC and Clang</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="kr">thread</span><span class="p">)</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>  <span class="c1">// Visual Studio</span>
</code></pre></div></div>

<p>This makes the comparator thread-safe. However, it’s still not
re-entrant (usually unimportant) and accessing thread-local variables
on some platforms is even more expensive — which is the situation for
Pthreads TLS, though not a problem for native x86-64 TLS.</p>

<p>Modern libraries usually provide some sort of “user data” pointer — a
generic pointer that is passed to the callback function as an
additional argument. For example, the GNU C Library has long had
<code class="language-plaintext highlighter-rouge">qsort_r()</code>: <em>re-entrant</em> qsort.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">qsort_r</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
           <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">),</span>
           <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>The new comparator looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">coord_cmp_r</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dist_a</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">dist_b</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&lt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&gt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And its usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">;</span>
    <span class="n">qsort_r</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">coord_cmp_r</span><span class="p">,</span> <span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>User data arguments are thread-safe, re-entrant, performant, and
perfectly portable. They completely and cleanly solve the entire
problem with virtually no drawbacks. If every library did this, there
would be nothing left to discuss and this article would be boring.</p>

<h3 id="the-closure-solution">The closure solution</h3>

<p>In order to make things more interesting, suppose you’re stuck calling
a function in some old library that takes a callback but doesn’t
support a user data argument. A global variable is insufficient, and
the thread-local storage solution isn’t viable for one reason or
another. What do you do?</p>

<p>The core problem is that a function pointer is just an address, and
it’s the same address no matter the context for any particular
callback. On any particular call, the callback has three ways to
distinguish this call from other calls. These align with the three
solutions above:</p>

<ol>
  <li>Inspect some global state: the <strong>global variable solution</strong>. The
caller will change this state for some other calls.</li>
  <li>Query its unique thread ID: the <strong>thread-local storage solution</strong>.
Calls on different threads will have different thread IDs.</li>
  <li>Examine a context argument: the <strong>user pointer solution</strong>.</li>
</ol>

<p>A wholly different approach is to <strong>use a unique function pointer for
each callback</strong>. The callback could then inspect its own address to
differentiate itself from other callbacks. Imagine defining multiple
instances of <code class="language-plaintext highlighter-rouge">coord_cmp</code> each getting their context from a different
global variable. Using a unique copy of <code class="language-plaintext highlighter-rouge">coord_cmp</code> on each thread for
each usage would be both re-entrant and thread-safe, and wouldn’t
require TLS.</p>
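
<p>As a rough static sketch of that idea, a macro can stamp out several
comparators, each reading its own global. The names
<code class="language-plaintext highlighter-rouge">DEFINE_COORD_SORT</code>, <code class="language-plaintext highlighter-rouge">dist2</code>, and <code class="language-plaintext highlighter-rouge">sort_by_distance_N</code> are made up for
illustration, the <code class="language-plaintext highlighter-rouge">struct coord</code> layout is assumed, and squared distance
stands in for <code class="language-plaintext highlighter-rouge">distance()</code> since it produces the same ordering:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdlib.h&gt;

struct coord { float x, y; };   /* assumed layout */

/* Squared distance: orders points the same way as the true distance. */
static float dist2(const struct coord *a, const struct coord *b)
{
    float dx = a-&gt;x - b-&gt;x;
    float dy = a-&gt;y - b-&gt;y;
    return dx * dx + dy * dy;
}

/* Stamp out one comparator, its private target, and a sort wrapper. */
#define DEFINE_COORD_SORT(n) \
    static const struct coord *target_##n; \
    static int coord_cmp_##n(const void *a, const void *b) \
    { \
        float da = dist2(a, target_##n); \
        float db = dist2(b, target_##n); \
        return (da &gt; db) - (da &lt; db); \
    } \
    static void sort_by_distance_##n(struct coord *c, size_t len, \
                                     const struct coord *t) \
    { \
        target_##n = t; \
        qsort(c, len, sizeof(c[0]), coord_cmp_##n); \
    }

DEFINE_COORD_SORT(0)
DEFINE_COORD_SORT(1)
</code></pre></div></div>

<p>Each instance is safe to use alongside the others, but the number of
instances is fixed at compile time, which is the limitation that
generating them at run time removes.</p>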

<p>Taking this idea further, I’d like to <strong>generate these new functions
on demand at run time</strong> akin to a JIT compiler. This can be done as a
library, mostly agnostic to the implementation of the callback. Here’s
an example of what its usage will be like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">closure_create</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">nargs</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">closure_destroy</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The callback to be converted into a closure is <code class="language-plaintext highlighter-rouge">f</code> and the number of
arguments it takes is <code class="language-plaintext highlighter-rouge">nargs</code>. A new closure is allocated and returned
as a function pointer. This closure takes <code class="language-plaintext highlighter-rouge">nargs - 1</code> arguments, and
it will call the original callback with the additional argument
<code class="language-plaintext highlighter-rouge">userdata</code>.</p>

<p>So, for example, this code uses a closure to convert <code class="language-plaintext highlighter-rouge">coord_cmp_r</code>
into a function suitable for <code class="language-plaintext highlighter-rouge">qsort()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">closure</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="n">closure</span> <span class="o">=</span> <span class="n">closure_create</span><span class="p">(</span><span class="n">coord_cmp_r</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">);</span>

<span class="n">qsort</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">closure</span><span class="p">);</span>

<span class="n">closure_destroy</span><span class="p">(</span><span class="n">closure</span><span class="p">);</span>
</code></pre></div></div>

<p><strong>Caveat</strong>: This API is <em>utterly insufficient</em> for any sort of
portability. The number of arguments isn’t nearly enough information
for the library to generate a closure. For practically every
architecture and ABI, it’s going to depend on the types of each of
those arguments. On x86-64 with the System V ABI — where I’ll be
implementing this — this argument will only count integer/pointer
arguments. To find out what it takes to do this properly, see the
<a href="https://www.gnu.org/software/libjit/">libjit</a> documentation.</p>

<h3 id="memory-design">Memory design</h3>

<p>This implementation will be for x86-64 Linux, though the high level
details will be the same for any program running in virtual memory. My
closures will span exactly two consecutive pages (typically 8kB),
though it’s possible to use exactly one page depending on the desired
trade-offs. I need two pages because each page will
have different protections.</p>

<p><img src="/img/diagram/closure-pages.svg" alt="" /></p>

<p>Native code — the <em>thunk</em> — lives in the upper page. The user data
pointer and callback function pointer live at the high end of the
lower page. The two pointers could really be anywhere in the lower
page, and they’re only at the end for aesthetic reasons. The thunk
code will be identical for all closures of the same number of
arguments.</p>

<p>The upper page will be executable and the lower page will be writable.
This allows new pointers to be set without writing to executable thunk
memory. In the future I expect operating systems to enforce W^X
(“write xor execute”), and this code will already be compliant.
Alternatively, the pointers could be “baked in” with the thunk page
and immutable, but since creating a closure requires two system calls, I
figure it’s better that the pointers be mutable and the closure object
reusable.</p>

<p>The address for the closure itself will be the upper page, being what
other functions will call. The thunk will load the user data pointer
from the lower page as an additional argument, then jump to the
actual callback function also given by the lower page.</p>

<h3 id="thunk-assembly">Thunk assembly</h3>

<p>The x86-64 thunk assembly for a 2-argument closure calling a
3-argument callback looks like this:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">user:</span>  <span class="kd">dq</span> <span class="mi">0</span>
<span class="nl">func:</span>  <span class="kd">dq</span> <span class="mi">0</span>
<span class="c1">;; --- page boundary here ---</span>
<span class="nl">thunk2:</span>
        <span class="nf">mov</span>  <span class="nb">rdx</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">user</span><span class="p">]</span>
        <span class="nf">jmp</span>  <span class="p">[</span><span class="nv">rel</span> <span class="nv">func</span><span class="p">]</span>
</code></pre></div></div>

<p>As a reminder, the integer/pointer argument register order for the
System V ABI calling convention is: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">r8</code>,
<code class="language-plaintext highlighter-rouge">r9</code>. The third argument is passed through <code class="language-plaintext highlighter-rouge">rdx</code>, so the user pointer
is loaded into this register. Then it jumps to the callback address
with the original arguments still in place, plus the new argument. The
<code class="language-plaintext highlighter-rouge">user</code> and <code class="language-plaintext highlighter-rouge">func</code> values are loaded <em>RIP-relative</em> (<code class="language-plaintext highlighter-rouge">rel</code>) to the
address of the code. The thunk is using the callback address (its own
address) to determine the context.</p>

<p>The assembled machine code for the thunk is just 13 bytes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">thunk2</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1">// mov  rdx, [rel user]</span>
    <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x15</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
    <span class="c1">// jmp  [rel func]</span>
    <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
<span class="p">};</span>
</code></pre></div></div>
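
<p>The displacement bytes in that listing can be checked by hand. A
RIP-relative operand is encoded as a signed 32-bit offset from the address
of the <em>next</em> instruction. Here is a small sketch of that arithmetic,
with all offsets measured from the start of the thunk (the helper name
<code class="language-plaintext highlighter-rouge">rel32</code> is my own, not part of the demo):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Offsets are relative to the start of the thunk (the page boundary).
 * RIP already points past the current instruction when the operand is
 * resolved, so the instruction's own length is subtracted as well. */
static int32_t rel32(int32_t target, int32_t insn_offset, int32_t insn_len)
{
    return target - (insn_offset + insn_len);
}
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">user</code> at offset -16 and the 7-byte <code class="language-plaintext highlighter-rouge">mov</code> at offset 0,
<code class="language-plaintext highlighter-rouge">rel32(-16, 0, 7)</code> is -23, or <code class="language-plaintext highlighter-rouge">e9 ff ff ff</code> in little endian. With
<code class="language-plaintext highlighter-rouge">func</code> at offset -8 and the 6-byte <code class="language-plaintext highlighter-rouge">jmp</code> at offset 7,
<code class="language-plaintext highlighter-rouge">rel32(-8, 7, 6)</code> is -21, or <code class="language-plaintext highlighter-rouge">eb ff ff ff</code>.</p>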

<p>All <code class="language-plaintext highlighter-rouge">closure_create()</code> has to do is allocate two pages, copy this
buffer into the upper page, adjust the protections, and return the
address of the thunk. Since <code class="language-plaintext highlighter-rouge">closure_create()</code> accepts any <code class="language-plaintext highlighter-rouge">nargs</code>
from 1 to 6, there are actually 6 slightly different thunks, one for
each register that may receive the user pointer (<code class="language-plaintext highlighter-rouge">rdi</code> through
<code class="language-plaintext highlighter-rouge">r9</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">thunk</span><span class="p">[</span><span class="mi">6</span><span class="p">][</span><span class="mi">13</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x3d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x15</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x0d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x4C</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x05</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x4C</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x0d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>
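
<p>The six rows differ only in the REX prefix and the ModRM byte of the
<code class="language-plaintext highlighter-rouge">mov</code>, so the table could also be computed. The following is just a
sketch to show where those bytes come from. <code class="language-plaintext highlighter-rouge">thunk_encode</code> and
<code class="language-plaintext highlighter-rouge">argreg</code> are hypothetical names, and the demo itself uses the static
table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;string.h&gt;

/* Hardware register numbers for rdi, rsi, rdx, rcx, r8, r9,
 * listed in argument order. */
static const unsigned char argreg[6] = {7, 6, 2, 1, 8, 9};

/* Write the 13-byte thunk for a callback taking nargs arguments (1..6).
 * Returns 0 on success, -1 if nargs is out of range. */
static int thunk_encode(unsigned char out[13], int nargs)
{
    static const unsigned char tail[10] = {
        0xe9, 0xff, 0xff, 0xff,             /* disp32 = -23 (user) */
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff  /* jmp [rip - 21] (func) */
    };
    if (nargs &lt; 1 || nargs &gt; 6)
        return -1;
    unsigned char reg = argreg[nargs - 1];
    out[0] = 0x48 | ((reg &amp; 8) ? 0x04 : 0); /* REX.W, plus REX.R for r8/r9 */
    out[1] = 0x8b;                          /* mov r64, r/m64 */
    out[2] = 0x05 | ((reg &amp; 7) &lt;&lt; 3);       /* ModRM: mod=00, rm=101 (RIP) */
    memcpy(out + 3, tail, sizeof(tail));
    return 0;
}
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">nargs</code> of 3 this yields the same <code class="language-plaintext highlighter-rouge">48 8b 15</code> prefix as the
<code class="language-plaintext highlighter-rouge">rdx</code> row in the table.</p>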

<p>Given a closure pointer returned from <code class="language-plaintext highlighter-rouge">closure_create()</code>, here are the
setter functions for setting the closure’s two pointers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">closure_set_data</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">**</span><span class="n">p</span> <span class="o">=</span> <span class="n">closure</span><span class="p">;</span>
    <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">closure_set_function</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">**</span><span class="n">p</span> <span class="o">=</span> <span class="n">closure</span><span class="p">;</span>
    <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
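
<p>Reading the pointers back works the same way. These getters are not part
of the demo. They are hypothetical counterparts to the setters, relying on
the same assumption that the two slots sit immediately below the thunk
address:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical getter: the user data slot is two pointers below the thunk. */
static void *closure_get_data(void *closure)
{
    void **p = closure;
    return p[-2];
}

/* Hypothetical getter: the callback slot is one pointer below the thunk. */
static void *closure_get_function(void *closure)
{
    void **p = closure;
    return p[-1];
}
</code></pre></div></div>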

<p>In <code class="language-plaintext highlighter-rouge">closure_create()</code>, allocation is done with an anonymous <code class="language-plaintext highlighter-rouge">mmap()</code>,
just like in <a href="/blog/2015/03/19/">my JIT compiler</a>. It’s initially mapped writable in
order to copy the thunk, then the thunk page is set to executable.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">closure_create</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">nargs</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">page_size</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGESIZE</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">page_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">==</span> <span class="n">MAP_FAILED</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="kt">void</span> <span class="o">*</span><span class="n">closure</span> <span class="o">=</span> <span class="n">p</span> <span class="o">+</span> <span class="n">page_size</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">thunk</span><span class="p">[</span><span class="n">nargs</span> <span class="o">-</span> <span class="mi">1</span><span class="p">],</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">thunk</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">page_size</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>

    <span class="n">closure_set_function</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
    <span class="n">closure_set_data</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">userdata</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">closure</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Destroying a closure is done by computing the lower page address and
calling <code class="language-plaintext highlighter-rouge">munmap()</code> on it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">closure_destroy</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">page_size</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGESIZE</span><span class="p">);</span>
    <span class="n">munmap</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">closure</span> <span class="o">-</span> <span class="n">page_size</span><span class="p">,</span> <span class="n">page_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And that’s it! You can see the entire demo here:</p>

<ul>
  <li><a href="/download/closure-demo.c" class="download">closure-demo.c</a></li>
</ul>

<p>It’s a lot simpler for x86-64 than it is for x86, where there’s no
RIP-relative addressing and arguments are passed on the stack. The
arguments must all be copied back onto the stack, above the new
argument, and it cannot be a tail call since the stack has to be fixed
before returning. Here’s what the thunk looks like for a 2-argument
closure:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">data:</span>	<span class="kd">dd</span> <span class="mi">0</span>
<span class="nl">func:</span>	<span class="kd">dd</span> <span class="mi">0</span>
<span class="c1">;; --- page boundary here ---</span>
<span class="nl">thunk2:</span>
        <span class="nf">call</span> <span class="nv">.rip2eax</span>
<span class="nl">.rip2eax:</span>
        <span class="nf">pop</span> <span class="nb">eax</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">eax</span> <span class="o">-</span> <span class="mi">13</span><span class="p">]</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">esp</span> <span class="o">+</span> <span class="mi">12</span><span class="p">]</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">esp</span> <span class="o">+</span> <span class="mi">12</span><span class="p">]</span>
        <span class="nf">call</span> <span class="p">[</span><span class="nb">eax</span> <span class="o">-</span> <span class="mi">9</span><span class="p">]</span>
        <span class="nf">add</span> <span class="nb">esp</span><span class="p">,</span> <span class="mi">12</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Exercise for the reader: Port the closure demo to a different
architecture or to the Windows x64 ABI.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Relocatable Global Data on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/23/"/>
    <id>urn:uuid:56be19e0-ce9a-3f37-dc85-578f397ed3e1</id>
    <updated>2016-12-23T22:50:51Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Relocatable code — program code that executes correctly from any
properly-aligned address — is an essential feature for shared
libraries. Otherwise all of a system’s shared libraries would need to
coordinate their virtual load addresses. Loading programs and
libraries to random addresses is also a valuable security feature:
Address Space Layout Randomization (ASLR). But how does a compiler
generate code for a function that accesses a global variable if that
variable’s address isn’t known at compile time?</p>

<p>Consider this simple C code sample.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function needs the base address of <code class="language-plaintext highlighter-rouge">values</code> in order to
dereference it for <code class="language-plaintext highlighter-rouge">values[x]</code>. The easiest way to find out how this
works, especially without knowing where to start, is to compile the
code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian
Jessie).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -Os -fPIC get_value.c
</code></pre></div></div>

<p>I optimized for size (<code class="language-plaintext highlighter-rouge">-Os</code>) to make the disassembly easier to follow.
Next, disassemble this pre-linked code with <code class="language-plaintext highlighter-rouge">objdump</code>. Alternatively I
could have asked for the compiler’s assembly output with <code class="language-plaintext highlighter-rouge">-S</code>, but
this will be good reverse engineering practice.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -d -Mintel get_value.o
0000000000000000 &lt;get_value&gt;:
   0:   83 ff 03                cmp    edi,0x3
   3:   0f 57 c0                xorps  xmm0,xmm0
   6:   77 0e                   ja     16 &lt;get_value+0x16&gt;
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
   f:   89 ff                   mov    edi,edi
  11:   f3 0f 10 04 b8          movss  xmm0,DWORD PTR [rax+rdi*4]
  16:   c3                      ret
</code></pre></div></div>

<p>There are a couple of interesting things going on, but let’s start
from the beginning.</p>

<ol>
  <li>
    <p><a href="https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-secure.pdf">The ABI</a> specifies that the first integer/pointer argument
(the 32-bit integer <code class="language-plaintext highlighter-rouge">x</code>) is passed through the <code class="language-plaintext highlighter-rouge">edi</code> register. The
function compares <code class="language-plaintext highlighter-rouge">x</code> to 3, to satisfy <code class="language-plaintext highlighter-rouge">x &lt; 4</code>.</p>
  </li>
  <li>
    <p>The ABI specifies that floating point values are returned through
the <a href="/blog/2015/07/10/">SSE2 SIMD register</a> <code class="language-plaintext highlighter-rouge">xmm0</code>. It’s cleared by XORing the
register with itself — the conventional way to clear registers on
x86 — setting up for a return value of <code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>It then uses the result of the previous comparison to perform a
jump, <code class="language-plaintext highlighter-rouge">ja</code> (“jump if above”). That is, jump to the relative address
specified by the jump’s operand if the first operand to <code class="language-plaintext highlighter-rouge">cmp</code>
(<code class="language-plaintext highlighter-rouge">edi</code>) is above the second operand (<code class="language-plaintext highlighter-rouge">0x3</code>) as <em>unsigned</em> values.
Its cousin, <code class="language-plaintext highlighter-rouge">jg</code> (“jump if greater”), is for signed values. If <code class="language-plaintext highlighter-rouge">x</code>
is outside the array bounds, it jumps straight to <code class="language-plaintext highlighter-rouge">ret</code>, returning
<code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>If <code class="language-plaintext highlighter-rouge">x</code> was in bounds, it uses a <code class="language-plaintext highlighter-rouge">lea</code> (“load effective address”) to
load <em>something</em> into the 64-bit <code class="language-plaintext highlighter-rouge">rax</code> register. This is the
complicated bit, and I’ll start by giving the answer: The value
loaded into <code class="language-plaintext highlighter-rouge">rax</code> is the address of the <code class="language-plaintext highlighter-rouge">values</code> array. More on
this in a moment.</p>
  </li>
  <li>
    <p>Finally it uses <code class="language-plaintext highlighter-rouge">x</code> as an index into the address in <code class="language-plaintext highlighter-rouge">rax</code>. The <code class="language-plaintext highlighter-rouge">movss</code>
(“move scalar single-precision”) instruction loads a 32-bit float
into the first lane of <code class="language-plaintext highlighter-rouge">xmm0</code>, where the caller expects to find the
return value. This is all preceded by a <code class="language-plaintext highlighter-rouge">mov edi, edi</code> which
<a href="/blog/2016/03/31/"><em>looks</em> like a hotpatch nop</a>, but it isn’t. x86-64 always uses
64-bit registers for addressing, meaning it uses <code class="language-plaintext highlighter-rouge">rdi</code> not <code class="language-plaintext highlighter-rouge">edi</code>.
All 32-bit register assignments clear the upper 32 bits, and so
this <code class="language-plaintext highlighter-rouge">mov</code> zero-extends <code class="language-plaintext highlighter-rouge">edi</code> into <code class="language-plaintext highlighter-rouge">rdi</code>. This is in case of the
unlikely event that the caller left garbage in those upper bits.</p>
  </li>
</ol>
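
<p>To see why the unsigned jump in step 3 is enough, here’s a minimal sketch
that puts the original function next to a hypothetical signed variant,
<code class="language-plaintext highlighter-rouge">get_value_signed</code>, which is not part of the original code. With an
unsigned parameter, a “negative” argument converts to a very large
value, so the single <code class="language-plaintext highlighter-rouge">x &lt; 4</code> comparison already rejects it.</p>

```c
#include <assert.h>

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

/* Same function as in the article: x is unsigned, so one comparison
 * covers both ends of the array. */
float get_value(unsigned x)
{
    return x < 4 ? values[x] : 0.0f;
}

/* Hypothetical signed variant, for comparison only. Written naively,
 * it needs an explicit lower-bound check as well. */
float get_value_signed(int x)
{
    return (x >= 0 && x < 4) ? values[x] : 0.0f;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">get_value((unsigned)-1)</code> passes <code class="language-plaintext highlighter-rouge">UINT_MAX</code>, which is
above 3 as an unsigned value, so the function returns <code class="language-plaintext highlighter-rouge">0.0f</code> without
touching the array.</p>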

<h3 id="clearing-xmm0">Clearing <code class="language-plaintext highlighter-rouge">xmm0</code></h3>

<p>The first interesting part: <code class="language-plaintext highlighter-rouge">xmm0</code> is cleared even when its first lane
is loaded with a value. There are two reasons to do this.</p>

<p>The obvious reason is that the alternative requires additional
instructions, and I told GCC to optimize for size. It would need
either an extra <code class="language-plaintext highlighter-rouge">ret</code> or an unconditional <code class="language-plaintext highlighter-rouge">jmp</code> over the “else” branch.</p>

<p>The less obvious reason is that it breaks a <em>data dependency</em>. For
over 20 years now, x86 micro-architectures have employed an
optimization technique called <a href="https://en.wikipedia.org/wiki/Register_renaming">register renaming</a>. <em>Architectural
registers</em> (<code class="language-plaintext highlighter-rouge">rax</code>, <code class="language-plaintext highlighter-rouge">edi</code>, etc.) are just temporary names for
underlying <em>physical registers</em>. This disconnect allows for more
aggressive out-of-order execution. Two instructions sharing an
architectural register can be executed independently so long as there
are no data dependencies between these instructions.</p>

<p>For example, take this assembly sample. It assembles to 9 bytes of
machine code.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>This reads a 32-bit value from the address stored in <code class="language-plaintext highlighter-rouge">rcx</code>, then
assigns <code class="language-plaintext highlighter-rouge">ecx</code> and uses <code class="language-plaintext highlighter-rouge">cl</code> (the lowest byte of <code class="language-plaintext highlighter-rouge">rcx</code>) in a shift
operation. Without register renaming, the shift couldn’t be performed
until the load in the first instruction completed. However, the second
instruction is a 32-bit assignment, which, as I mentioned before, also
clears the upper 32 bits of <code class="language-plaintext highlighter-rouge">rcx</code>, wiping the unused parts of
the register.</p>

<p>So after the second instruction, it’s guaranteed that the value in
<code class="language-plaintext highlighter-rouge">rcx</code> has no dependencies on code that comes before it. Because of
this, it’s likely a different physical register will be used for the
second and third instructions, allowing these instructions to be
executed out of order, <em>before</em> the load. Ingenious!</p>

<p>Compare it to this example, where the second instruction assigns to
<code class="language-plaintext highlighter-rouge">cl</code> instead of <code class="language-plaintext highlighter-rouge">ecx</code>. This assembles to just 6 bytes.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">cl</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>The result is 3 bytes smaller, but since it’s not a 32-bit assignment,
the upper bits of <code class="language-plaintext highlighter-rouge">rcx</code> still hold the original register contents.
This creates a false dependency and may prevent out-of-order
execution, reducing performance.</p>

<p>By clearing <code class="language-plaintext highlighter-rouge">xmm0</code>, instructions in <code class="language-plaintext highlighter-rouge">get_value</code> involving <code class="language-plaintext highlighter-rouge">xmm0</code> have
the opportunity to be executed prior to instructions in the caller
that use <code class="language-plaintext highlighter-rouge">xmm0</code>.</p>

<h3 id="rip-relative-addressing">RIP-relative addressing</h3>

<p>Going back to the instruction that computes the address of <code class="language-plaintext highlighter-rouge">values</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
</code></pre></div></div>

<p>Normally load/store addresses are absolute, based off an address
either in a general purpose register, or at some hard-coded base
address. The latter is not an option in relocatable code. With
<em>RIP-relative addressing</em> that’s still the case, but the register with
the absolute address is <code class="language-plaintext highlighter-rouge">rip</code>, the instruction pointer. This
addressing mode was introduced in x86-64 to make relocatable code more
efficient.</p>

<p>That means this instruction copies the instruction pointer (pointing
to the next instruction) into <code class="language-plaintext highlighter-rouge">rax</code>, plus a 32-bit displacement,
currently zero. This isn’t the right way to encode a displacement of
zero (unless you <em>want</em> a larger instruction). That’s because the
displacement will be filled in later by the linker. The compiler adds
a <em>relocation entry</em> to the object file so that the linker knows how
to do this.</p>

<p>On platforms that <a href="/blog/2016/11/17/">use ELF</a> we can inspect relocations with
<code class="language-plaintext highlighter-rouge">readelf</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rela.text' at offset 0x270 contains 1 entries:
  Offset          Info           Type       Sym. Value
00000000000b  000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4
</code></pre></div></div>

<p>The relocation type is <code class="language-plaintext highlighter-rouge">R_X86_64_PC32</code>. In the <a href="http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_amd64.pdf">AMD64 Architecture
Processor Supplement</a>, this is defined as “S + A - P”.</p>

<ul>
  <li>
    <p>S: Represents the value of the symbol whose index resides in the
relocation entry.</p>
  </li>
  <li>
    <p>A: Represents the addend used to compute the value of the
relocatable field.</p>
  </li>
  <li>
    <p>P: Represents the place of the storage unit being relocated.</p>
  </li>
</ul>

<p>The symbol, S, is <code class="language-plaintext highlighter-rouge">.rodata</code> — the final address for this object file’s
portion of <code class="language-plaintext highlighter-rouge">.rodata</code> (where <code class="language-plaintext highlighter-rouge">values</code> resides). The addend, A, is <code class="language-plaintext highlighter-rouge">-4</code>
since the instruction pointer points at the <em>next</em> instruction. That
is, this will be relative to four bytes after the relocation offset.
Finally, the address of the relocation, P, is the address of the last four
bytes of the <code class="language-plaintext highlighter-rouge">lea</code> instruction. These values are all known at
link-time, so no run-time support is necessary.</p>

<p>Being “S - P” (overall), this will be the displacement between these
two addresses: the 32-bit value is relative. It’s relocatable so long
as these two parts of the binary (code and data) maintain a fixed
distance from each other. The binary is relocated as a whole, so this
assumption holds.</p>
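
<p>The arithmetic can be checked with a short sketch. The load addresses
below are made up for illustration, and the helper names
(<code class="language-plaintext highlighter-rouge">reloc_pc32</code>, <code class="language-plaintext highlighter-rouge">lea_rip</code>, <code class="language-plaintext highlighter-rouge">resolve_values</code>) are hypothetical. Only the
relocation offset <code class="language-plaintext highlighter-rouge">0xb</code> and the addend <code class="language-plaintext highlighter-rouge">-4</code> come from the
<code class="language-plaintext highlighter-rouge">readelf</code> output above.</p>

```c
#include <assert.h>
#include <stdint.h>

/* R_X86_64_PC32: the 32-bit value the linker stores in the instruction.
 * Assumes the result fits in 32 bits, as the relocation type requires. */
static int32_t reloc_pc32(uint64_t s, int64_t a, uint64_t p)
{
    return (int32_t)(s + (uint64_t)a - p);
}

/* Address computed by `lea rax, [rip + disp]` at run time. rip holds the
 * address of the next instruction, which here is 4 bytes past the start
 * of the displacement field at p. */
static uint64_t lea_rip(uint64_t p, int32_t disp)
{
    uint64_t rip = p + 4;
    return rip + (uint64_t)(int64_t)disp;
}

/* text_base and rodata_base are made-up final addresses for get_value
 * and for this object's portion of .rodata. */
uint64_t resolve_values(uint64_t text_base, uint64_t rodata_base)
{
    uint64_t p = text_base + 0xb;                  /* offset from readelf */
    int32_t disp = reloc_pc32(rodata_base, -4, p); /* S + A - P */
    return lea_rip(p, disp);
}
```

<p>With code at <code class="language-plaintext highlighter-rouge">0x401000</code> and data at <code class="language-plaintext highlighter-rouge">0x402000</code>, the stored
displacement is <code class="language-plaintext highlighter-rouge">0xff1</code> and the <code class="language-plaintext highlighter-rouge">lea</code> yields <code class="language-plaintext highlighter-rouge">0x402000</code>. Shifting
both bases by the same amount leaves the displacement unchanged, which
is what makes the code relocatable.</p>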

<h3 id="32-bit-relocation">32-bit relocation</h3>

<p>Since RIP-relative addressing wasn’t introduced until x86-64, how did
this all work on x86? Again, let’s just see what the compiler does.
Add the <code class="language-plaintext highlighter-rouge">-m32</code> flag for a 32-bit target, and <code class="language-plaintext highlighter-rouge">-fomit-frame-pointer</code> to
make it simpler for explanatory purposes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 &lt;get_value&gt;:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   d9 ee                   fldz
   6:   e8 fc ff ff ff          call   7 &lt;get_value+0x7&gt;
   b:   81 c1 02 00 00 00       add    ecx,0x2
  11:   83 f8 03                cmp    eax,0x3
  14:   77 09                   ja     1f &lt;get_value+0x1f&gt;
  16:   dd d8                   fstp   st(0)
  18:   d9 84 81 00 00 00 00    fld    DWORD PTR [ecx+eax*4+0x0]
  1f:   c3                      ret

Disassembly of section .text.__x86.get_pc_thunk.cx:

00000000 &lt;__x86.get_pc_thunk.cx&gt;:
   0:   8b 0c 24                mov    ecx,DWORD PTR [esp]
   3:   c3                      ret
</code></pre></div></div>

<p>Hmm, this one includes an extra function.</p>

<ol>
  <li>
    <p>In this calling convention, arguments are passed on the stack. The
first instruction loads the argument, <code class="language-plaintext highlighter-rouge">x</code>, into <code class="language-plaintext highlighter-rouge">eax</code>.</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">fldz</code> instruction clears the x87 floating point return
register, just like clearing <code class="language-plaintext highlighter-rouge">xmm0</code> in the x86-64 version.</p>
  </li>
  <li>
    <p>Next it calls <code class="language-plaintext highlighter-rouge">__x86.get_pc_thunk.cx</code>. The call pushes the
instruction pointer, <code class="language-plaintext highlighter-rouge">eip</code>, onto the stack. This function reads
that value off the stack into <code class="language-plaintext highlighter-rouge">ecx</code> and returns. In other words,
calling this function copies <code class="language-plaintext highlighter-rouge">eip</code> into <code class="language-plaintext highlighter-rouge">ecx</code>. It’s setting up to
load data at an address relative to the code. Notice the function
name starts with two underscores — a name which is reserved for
exactly these sorts of implementation purposes.</p>
  </li>
  <li>
    <p>Next a 32-bit displacement is added to <code class="language-plaintext highlighter-rouge">ecx</code>. In this case it’s
<code class="language-plaintext highlighter-rouge">2</code>, but, like before, this is actually going to be filled in later by
the linker.</p>
  </li>
  <li>
    <p>Then it’s just like before: a branch to optionally load a value.
The floating point load (<code class="language-plaintext highlighter-rouge">fld</code>) is another relocation.</p>
  </li>
</ol>

<p>Let’s look at the relocations. There are three this time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
 Offset     Info    Type        Sym.Value  Sym. Name
00000007  00000e02 R_386_PC32    00000000   __x86.get_pc_thunk.cx
0000000d  00000f0a R_386_GOTPC   00000000   _GLOBAL_OFFSET_TABLE_
0000001b  00000709 R_386_GOTOFF  00000000   .rodata
</code></pre></div></div>

<p>The first relocation is the call-site for the thunk. The thunk has
external linkage and may be merged with a matching thunk in another
object file, and so may be relocated. (Clang inlines its thunk.) Calls
are relative, so its type is <code class="language-plaintext highlighter-rouge">R_386_PC32</code>: a code-relative
displacement just like on x86-64.</p>

<p>The next is of type <code class="language-plaintext highlighter-rouge">R_386_GOTPC</code> and sets the second operand in that
<code class="language-plaintext highlighter-rouge">add ecx</code>. It’s defined as “GOT + A - P” where “GOT” is the address of
the Global Offset Table — a table of addresses of the binary’s
relocated objects. Since <code class="language-plaintext highlighter-rouge">values</code> is static, the GOT won’t actually
hold an address for it, but the relative address of the GOT itself
will be useful.</p>

<p>The final relocation is of type <code class="language-plaintext highlighter-rouge">R_386_GOTOFF</code>. This is defined as
“S + A - GOT”. Another displacement between two addresses. This is the
displacement in the load, <code class="language-plaintext highlighter-rouge">fld</code>. Ultimately the load adds these last
two relocations together, canceling the GOT:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  (GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P
</code></pre></div></div>

<p>So the GOT isn’t relevant in this case. It’s just a mechanism for
constructing a custom relocation type.</p>
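
<p>Here’s a small sketch of that cancellation using made-up 32-bit
addresses. The function name <code class="language-plaintext highlighter-rouge">resolve_values32</code> is hypothetical. It
reads the <code class="language-plaintext highlighter-rouge">2</code> in the <code class="language-plaintext highlighter-rouge">add</code> instruction as the in-place addend A0 (the
usual convention for a <code class="language-plaintext highlighter-rouge">.rel</code> section) and the zero displacement in
the <code class="language-plaintext highlighter-rouge">fld</code> as A1. It also assumes <code class="language-plaintext highlighter-rouge">values</code> sits at the start of this
object’s portion of <code class="language-plaintext highlighter-rouge">.rodata</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

/* All arithmetic is on uint32_t so it wraps modulo 2^32, like the
 * 32-bit hardware. text, got, and rodata are made-up final addresses. */
uint32_t resolve_values32(uint32_t text, uint32_t got, uint32_t rodata,
                          uint32_t x)
{
    /* R_386_GOTPC at offset 0xd: GOT + A0 - P, with A0 = 2 */
    uint32_t p0 = text + 0xd;
    uint32_t add_imm = got + 2 - p0;

    /* R_386_GOTOFF at offset 0x1b: S + A1 - GOT, with A1 = 0 */
    uint32_t fld_disp = rodata + 0 - got;

    /* The call pushes its return address (offset 0xb) and the thunk
     * copies it into ecx. */
    uint32_t ecx = text + 0xb;
    ecx += add_imm;                 /* add ecx, imm32: ecx is now GOT */

    return ecx + x * 4 + fld_disp;  /* fld DWORD PTR [ecx+eax*4+disp] */
}
```

<p>The result is the address of <code class="language-plaintext highlighter-rouge">values[x]</code> no matter what value is
passed for <code class="language-plaintext highlighter-rouge">got</code>. It also shows where the <code class="language-plaintext highlighter-rouge">2</code> comes from: the immediate
operand sits two bytes past the return address that the thunk loads
into <code class="language-plaintext highlighter-rouge">ecx</code>.</p>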

<h3 id="branch-optimization">Branch optimization</h3>

<p>Notice in the x86 version the thunk is called before checking the
argument. What if it’s most likely that <code class="language-plaintext highlighter-rouge">x</code> will be out of bounds of
the array, and the function usually returns zero? That means it’s
usually wasting its time calling the thunk. Without profile-guided
optimization the compiler probably won’t know this.</p>

<p>The typical way to provide such a compiler hint is with a pair of
macros, <code class="language-plaintext highlighter-rouge">likely()</code> and <code class="language-plaintext highlighter-rouge">unlikely()</code>. With GCC and Clang, these would
be defined to use <code class="language-plaintext highlighter-rouge">__builtin_expect</code>. Compilers without this sort of
feature would have macros that do nothing instead. So I gave it a
shot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define likely(x)    __builtin_expect((x),1)
#define unlikely(x)  __builtin_expect((x),0)
</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">unlikely</span><span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately this makes no difference even in the latest version of
GCC. In Clang it changes branch fall-through (for <a href="http://www.agner.org/optimize/microarchitecture.pdf">static branch
prediction</a>), but still always calls the thunk. It seems
compilers <a href="https://ewontfix.com/18/">have difficulty</a> with <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54232">optimizing relocatable
code</a> on x86.</p>
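
<p>For completeness, here’s one way the portable form of those macros
might look, with the do-nothing fallback for compilers that lack
<code class="language-plaintext highlighter-rouge">__builtin_expect</code>. This is a sketch, not taken from the original code.
The <code class="language-plaintext highlighter-rouge">!!</code> is an addition that normalizes any truthy expression to 0 or
1, and the feature test on <code class="language-plaintext highlighter-rouge">__GNUC__</code> is an assumption about which
compilers provide the builtin.</p>

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
#  define likely(x)    __builtin_expect(!!(x), 1)
#  define unlikely(x)  __builtin_expect(!!(x), 0)
#else
/* Fallback: the hint disappears, the condition is unchanged. */
#  define likely(x)    (x)
#  define unlikely(x)  (x)
#endif

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

/* The hint affects code layout only, never the result. */
float get_value(unsigned x)
{
    return unlikely(x < 4) ? values[x] : 0.0f;
}
```

<p>Either branch of the <code class="language-plaintext highlighter-rouge">#if</code> produces a function with identical
behavior. Only the generated code layout may differ.</p>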

<h3 id="x86-64-isnt-just-about-more-memory">x86-64 isn’t just about more memory</h3>

<p>It’s commonly understood that the advantage of 64-bit versus 32-bit
systems is processes having access to more than 4GB of memory. But as
this shows, there’s more to it than that. Even programs that don’t
need that much memory can really benefit from newer features like
RIP-relative addressing.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>A Showerthoughts Fortune File</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/01/"/>
    <id>urn:uuid:0a266c4d-a224-3399-a851-848f71b47dc3</id>
    <updated>2016-12-01T23:58:15Z</updated>
    <category term="reddit"/><category term="linux"/><category term="emacs"/>
    <content type="html">
      <![CDATA[<p>I have created a <a href="https://en.wikipedia.org/wiki/Fortune_(Unix)"><code class="language-plaintext highlighter-rouge">fortune</code> file</a> for the all-time top 10,000
<a href="https://old.reddit.com/r/Showerthoughts/">/r/Showerthoughts</a> posts, as of October 2016. As a word of
warning: Many of these entries are adult humor and may not be
appropriate for your work computer. These fortunes would be
categorized as “offensive” (<code class="language-plaintext highlighter-rouge">fortune -o</code>).</p>

<p>Download: <a href="https://skeeto.s3.amazonaws.com/share/showerthoughts" class="download">showerthoughts</a> (1.3 MB)</p>

<p>The copyright status of this file is subject to each of its thousands
of authors. Since it’s not possible to contact many of these authors —
some may not even still live — it’s obviously never going to be under
an open source license (Creative Commons, etc.). Even more, some
quotes are probably from comedians and such, rather than by the
redditor who made the post. I distribute it only for fun.</p>

<h3 id="installation">Installation</h3>

<p>To install this into your <code class="language-plaintext highlighter-rouge">fortune</code> database, first process it with
<code class="language-plaintext highlighter-rouge">strfile</code> to create a random-access index, showerthoughts.dat, then
copy them to the directory with the rest.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strfile showerthoughts
"showerthoughts.dat" created
There were 10000 strings
Longest string: 343 bytes
Shortest string: 39 bytes

$ cp showerthoughts* /usr/share/games/fortunes/
</code></pre></div></div>

<p>Alternatively, <code class="language-plaintext highlighter-rouge">fortune</code> can be told to use this file directly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ fortune showerthoughts
Not once in my life have I stepped into somebody's house and
thought, "I sure hope I get an apology for 'the mess'."
        ―AndItsDeepToo, Aug 2016
</code></pre></div></div>

<p>If you didn’t already know, <code class="language-plaintext highlighter-rouge">fortune</code> is an old unix utility that
displays a random quotation from a quotation database — a digital
<em>fortune cookie</em>. I use it as an interactive login shell greeting on
my <a href="http://www.hardkernel.com/main/products/prdt_info.php">ODROID-C2</a> server:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if </span><span class="nb">shopt</span> <span class="nt">-q</span> login_shell<span class="p">;</span> <span class="k">then
    </span>fortune ~/.fortunes
<span class="k">fi</span>
</code></pre></div></div>

<h3 id="how-was-it-made">How was it made?</h3>

<p>Fortunately I didn’t have to do something crazy like scrape reddit for
weeks on end. Instead, I downloaded <a href="http://files.pushshift.io/reddit/">the pushshift.io submission
archives</a>, which are currently around 70 GB compressed. Each file
contains one month’s worth of JSON data, one object per submission,
one submission per line, all compressed with bzip2.</p>

<p>Unlike so many other datasets, especially when it’s made up of
arbitrary inputs from millions of people, the format of the
/r/Showerthoughts posts is surprisingly very clean and requires
virtually no touching up. It’s some really fantastic data.</p>

<p>A nice feature of bzip2 is concatenating compressed files also
concatenates the uncompressed files. Additionally, it’s easy to
parallelize bzip2 compression and decompression, which gives it <a href="/blog/2009/03/16/">an
edge over xz</a>. I strongly recommend using <a href="http://lbzip2.org/">lbzip2</a> to
decompress this data, should you want to process it yourself.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat </span>RS_<span class="k">*</span>.bz2 | lbunzip2 <span class="o">&gt;</span> everything.json
</code></pre></div></div>

<p><a href="https://stedolan.github.io/jq/">jq</a> is my favorite command line tool for processing JSON (and
<a href="/blog/2016/09/15/">rendering fractals</a>). To filter all the /r/Showerthoughts posts,
it’s a simple <code class="language-plaintext highlighter-rouge">select</code> expression. Just mind the capitalization of the
subreddit’s name. The <code class="language-plaintext highlighter-rouge">-c</code> tells <code class="language-plaintext highlighter-rouge">jq</code> to keep it one per line.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat </span>RS_<span class="k">*</span>.bz2 | <span class="se">\</span>
    lbunzip2 | <span class="se">\</span>
    jq <span class="nt">-c</span> <span class="s1">'select(.subreddit == "Showerthoughts")'</span> <span class="se">\</span>
    <span class="o">&gt;</span> showerthoughts.json
</code></pre></div></div>

<p>However, you’ll quickly find that jq is the bottleneck, parsing all
that JSON. Your cores won’t be exploited by lbzip2 as they should. So
I throw <code class="language-plaintext highlighter-rouge">grep</code> in front to dramatically decrease the workload for
<code class="language-plaintext highlighter-rouge">jq</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="k">*</span>.bz2 | <span class="se">\</span>
    lbunzip2 | <span class="se">\</span>
    <span class="nb">grep</span> <span class="nt">-a</span> Showerthoughts | <span class="se">\</span>
    jq <span class="nt">-c</span> <span class="s1">'select(.subreddit == "Showerthoughts")'</span> <span class="se">\</span>
    <span class="o">&gt;</span> showerthoughts.json
</code></pre></div></div>

<p>This will let some extra things through, but it’s a superset. The <code class="language-plaintext highlighter-rouge">-a</code>
option is necessary because the data contains some null bytes. Without
it, <code class="language-plaintext highlighter-rouge">grep</code> switches into binary mode and breaks everything. This is
incredibly frustrating when you’ve already waited half an hour for
results.</p>

<p>To reduce the workload further down the pipeline, I take
advantage of the fact that only four fields will be needed: <code class="language-plaintext highlighter-rouge">title</code>,
<code class="language-plaintext highlighter-rouge">score</code>, <code class="language-plaintext highlighter-rouge">author</code>, and <code class="language-plaintext highlighter-rouge">created_utc</code>. The rest can — and should, for
efficiency’s sake — be thrown away where it’s cheap to do so.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="k">*</span>.bz2 | <span class="se">\</span>
    lbunzip2 | <span class="se">\</span>
    <span class="nb">grep</span> <span class="nt">-a</span> Showerthoughts | <span class="se">\</span>
    jq <span class="nt">-c</span> <span class="s1">'select(.subreddit == "Showerthoughts") |
               {title, score, author, created_utc}'</span> <span class="se">\</span>
    <span class="o">&gt;</span> showerthoughts.json
</code></pre></div></div>

<p>This gathers all 1,199,499 submissions into a 185 MB JSON file (as of
this writing). Most of these submissions are terrible, so the next
step is narrowing it to the small set of good submissions and putting
them into the <code class="language-plaintext highlighter-rouge">fortune</code> database format.</p>

<p><strong>It turns out reddit already has a method for finding the best
submissions: a voting system.</strong> Just pick the highest scoring posts.
Through experimentation I arrived at 10,000 as the magic cut-off
number. After this the quality really starts to drop off. Over time
this should probably be scaled up with the total number of
submissions.</p>

<p>I did both steps at the same time using a bit of Emacs Lisp, which is
particularly well-suited to the task:</p>

<ul>
  <li><a href="https://github.com/skeeto/showerthoughts">https://github.com/skeeto/showerthoughts</a></li>
</ul>

<p>This Elisp program reads one JSON object at a time and sticks each
into an AVL tree sorted by score (descending), then timestamp
(ascending), then title (ascending). The AVL tree is limited to 10,000
items, with the lowest items being dropped. This was a lot faster than
the more obvious approach: collecting everything into a big list,
sorting it, and keeping the top 10,000 items.</p>
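
<p>The bounded top-N idea is not specific to AVL trees. As a rough Python analogue, and not the Elisp program itself, the standard <code class="language-plaintext highlighter-rouge">heapq.nsmallest</code> function consumes a stream while holding only the best items seen so far. The function name <code class="language-plaintext highlighter-rouge">top_submissions</code> is hypothetical, and the heap is swapped in for the tree.</p>

```python
import heapq
import json

def top_submissions(lines, limit):
    """Return the `limit` best submissions from an iterable of JSON lines."""
    def rank(post):
        # Score descending, then timestamp ascending, then title ascending.
        # created_utc is passed through int() in case it arrives as a string.
        return (-post["score"], int(post["created_utc"]), post["title"])

    posts = (json.loads(line) for line in lines)
    # nsmallest never holds more than `limit` items at once.
    return heapq.nsmallest(limit, posts, key=rank)
```

<p>The result comes back already sorted, best first.</p>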

<h4 id="formatting">Formatting</h4>

<p>The most complicated part is actually paragraph wrapping the
submissions. Most are too long for a single line, and letting the
terminal hard wrap them is visually unpleasing. The submissions are
encoded in UTF-8, some with characters beyond simple ASCII. Proper
wrapping requires not just Unicode awareness, but also some degree of
Unicode <em>rendering</em>. The algorithm needs to recognize grapheme
clusters and know the size of the rendered text. This is not so
trivial! Most paragraph wrapping tools and libraries get this wrong,
some counting width by bytes, others counting width by codepoints.</p>
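
<p>To make the difference concrete, here is a small Python sketch comparing the three measurements. The <code class="language-plaintext highlighter-rouge">display_width</code> function is a hypothetical helper and only an approximation: it skips combining marks and counts East Asian wide characters as two columns, which is far short of real grapheme cluster handling.</p>

```python
import unicodedata

def display_width(text):
    """Approximate the number of terminal columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks attach to the previous character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width

sample = "e\u0301"  # "e" followed by a combining acute accent
# Bytes, code points, and approximate columns for the same string.
counts = (len(sample.encode("utf-8")), len(sample), display_width(sample))
```

<p>For this sample the three numbers all differ, and only the last one is the number a wrapping algorithm wants.</p>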

<p>Emacs’ <code class="language-plaintext highlighter-rouge">M-x fill-paragraph</code> knows how to do all these things — only
for a monospace font, which is all I needed — and I decided to
leverage it when generating the <code class="language-plaintext highlighter-rouge">fortune</code> file. Here’s an example that
paragraph-wraps a string:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">string-fill-paragraph</span> <span class="p">(</span><span class="nv">s</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">with-temp-buffer</span>
    <span class="p">(</span><span class="nv">insert</span> <span class="nv">s</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">fill-paragraph</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)))</span>
</code></pre></div></div>

<p>For the file format, items are delimited by a <code class="language-plaintext highlighter-rouge">%</code> on a line by itself.
I put the wrapped content, followed by a <a href="http://www.fileformat.info/info/unicode/char/2015/index.htm">quotation dash</a>, the
author, and the date. A surprising number of these submissions have
date-sensitive content (“on this day X years ago”), so I found it was
important to include a date.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>April Fool's Day is the one day of the year when people critically
evaluate news articles before accepting them as true.
        ―kellenbrent, Apr 2015
%
Of all the bodily functions that could be contagious, thank god
it's the yawn.
        ―MKLV, Aug 2015
%
</code></pre></div></div>

<p>There’s the potential that a submission itself could end with a lone
<code class="language-plaintext highlighter-rouge">%</code> and, with a bit of bad luck, wrapping happens to put it on its own
line. Fortunately this hasn’t happened yet. But, now that I’ve
advertised it, someone could make such a submission, popular enough
for the top 10,000, with the intent to personally trip me up in a
future update. I accept this, though it’s unlikely, and it would be
fairly easy to work around if it happened.</p>
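
<p>As an illustration of the entry layout and the lone <code class="language-plaintext highlighter-rouge">%</code> hazard, here is a Python sketch. It is not the Elisp generator: <code class="language-plaintext highlighter-rouge">fortune_entry</code> is a hypothetical name, the width of 70 is an assumption, and <code class="language-plaintext highlighter-rouge">textwrap</code> counts code points, so it has exactly the width problem described earlier.</p>

```python
import textwrap
import time

def fortune_entry(title, author, created_utc, width=70):
    """Format one submission as a fortune entry ending in a % line."""
    body = textwrap.fill(title, width=width)
    # A wrapped line consisting only of "%" would split the entry in two.
    if "%" in body.split("\n"):
        raise ValueError("a wrapped line is a lone % delimiter")
    date = time.strftime("%b %Y", time.gmtime(int(created_utc)))
    # U+2015 is the quotation dash that precedes the author.
    return body + "\n        \u2015" + author + ", " + date + "\n%\n"
```

<p>Raising an error is only one possible workaround. Re-wrapping at a slightly different width would be another.</p>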

<p>The <code class="language-plaintext highlighter-rouge">strfile</code> program looks for the <code class="language-plaintext highlighter-rouge">%</code> delimiters and fills out a
table of file offsets. The header of the <code class="language-plaintext highlighter-rouge">.dat</code> file indicates the
number of strings along with some other metadata. What follows is a table
of 32-bit file offsets.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">str_version</span><span class="p">;</span>  <span class="cm">/* version number */</span>
    <span class="kt">uint32_t</span> <span class="n">str_numstr</span><span class="p">;</span>   <span class="cm">/* # of strings in the file */</span>
    <span class="kt">uint32_t</span> <span class="n">str_longlen</span><span class="p">;</span>  <span class="cm">/* length of longest string */</span>
    <span class="kt">uint32_t</span> <span class="n">str_shortlen</span><span class="p">;</span> <span class="cm">/* shortest string length */</span>
    <span class="kt">uint32_t</span> <span class="n">str_flags</span><span class="p">;</span>    <span class="cm">/* bit field for flags */</span>
    <span class="kt">char</span> <span class="n">str_delim</span><span class="p">;</span>        <span class="cm">/* delimiting character */</span>
<span class="p">}</span>
</code></pre></div></div>
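
<p>Reading such a file back could look like the following Python sketch. Two things here are assumptions on my part rather than something stated above: that the integers are stored big-endian (network order), and that the header is padded with three bytes after <code class="language-plaintext highlighter-rouge">str_delim</code> to make 24 bytes. Check both against your own <code class="language-plaintext highlighter-rouge">strfile</code> before relying on it. The name <code class="language-plaintext highlighter-rouge">read_dat</code> is hypothetical.</p>

```python
import struct

# Assumed layout: five big-endian uint32 fields, one delimiter byte,
# then three bytes of padding, for 24 bytes in total.
HEADER = struct.Struct("!5Ic3x")

def read_dat(data):
    """Split the bytes of a .dat file into header fields and offsets."""
    version, numstr, longlen, shortlen, flags, delim = HEADER.unpack_from(data, 0)
    # Everything after the header is treated as 32-bit file offsets.
    count = (len(data) - HEADER.size) // 4
    offsets = struct.unpack_from("!%dI" % count, data, HEADER.size)
    header = {
        "version": version, "numstr": numstr, "longlen": longlen,
        "shortlen": shortlen, "flags": flags, "delim": delim,
    }
    return header, offsets
```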

<p>Note that the table doesn’t necessarily need to list the strings in
the same order as they appear in the original file. In fact, recent
versions of <code class="language-plaintext highlighter-rouge">strfile</code> can sort the strings by sorting the table, all
without touching the original file. Though none of this is important to
<code class="language-plaintext highlighter-rouge">fortune</code>.</p>

<p>Now that you know how it all works, you can build your own <code class="language-plaintext highlighter-rouge">fortune</code>
file from your own inputs!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>A Magnetized Needle and a Steady Hand</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/17/"/>
    <id>urn:uuid:1abbb17d-9836-3efc-8493-52dd93a90736</id>
    <updated>2016-11-17T23:35:26Z</updated>
    <category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Now they’ve gone and done it. An unidentified agency has spread a
potent computer virus across all the world’s computers and deleted the
binaries for every copy of every software development tool. Even the
offline copies — it’s <em>that</em> potent.</p>

<p>Most of the source code still exists, even for the compilers, and most
computer systems will continue operating without disruption, but no
new software can be developed unless it’s written byte by byte in raw
machine code. Only <em>real programmers</em> can get anything done.</p>

<p><a href="http://xkcd.com/378/"><img src="http://imgs.xkcd.com/comics/real_programmers.png" alt="" /></a></p>

<p>The world’s top software developers have been put to work
bootstrapping a C compiler (and others) completely from scratch so
that we can get back to normal. Without even an assembler, it’s a
slow, tedious process.</p>

<p>In the meantime, rather than wait around for the bootstrap work to
complete, the rest of us have been assigned individual programs hit by
the virus. For example, many basic unix utilities have been wiped out,
and the bootstrap would benefit from having them. Having different
groups tackle each missing program will allow the bootstrap effort to
move forward somewhat in parallel. <em>At least that’s what the compiler
nerds told us.</em> The real reason is that they’re tired of being asked
if they’re done yet, and these tasks will keep the rest of us quietly
busy.</p>

<p>Fortunately you and I have been assigned the easiest task of all:
<strong>We’re to write the <code class="language-plaintext highlighter-rouge">true</code> command from scratch.</strong> We’ll have to
figure it out byte by byte. The target is x86-64 Linux, which means
we’ll need the following documentation:</p>

<ol>
  <li>
    <p><a href="http://refspecs.linuxbase.org/elf/elf.pdf">Executable and Linking Format (ELF) Specification</a>. This is
the binary format used by modern Unix-like systems, including
Linux. A more convenient way to access this document is <a href="http://man7.org/linux/man-pages/man5/elf.5.html"><code class="language-plaintext highlighter-rouge">man 5
elf</code></a>.</p>
  </li>
  <li>
    <p><a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">Intel 64 and IA-32 Architectures Software Developer’s
Manual</a> (Volume 2). This fully documents the instruction set
and its encoding. It’s all the information needed to write x86
machine code by hand. The AMD manuals would work too.</p>
  </li>
  <li>
    <p><a href="https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-secure.pdf">System V Application Binary Interface: AMD64 Architecture
Processor Supplement</a>. Only a few pieces of information are
needed from this document, but more would be needed for a more
substantial program.</p>
  </li>
  <li>
    <p>Some magic numbers from header files.</p>
  </li>
</ol>

<h3 id="manual-assembly">Manual Assembly</h3>

<p>The program we’re writing is <code class="language-plaintext highlighter-rouge">true</code>, whose behavior is documented as
“do nothing, successfully.” All command line arguments are ignored and
no input is read. The program only needs to perform the exit system
call, immediately terminating the process.</p>

<p>According to the ABI document (3) Appendix A, the registers for system
call arguments are: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">r10</code>, <code class="language-plaintext highlighter-rouge">r8</code>, <code class="language-plaintext highlighter-rouge">r9</code>. The system
call number goes in <code class="language-plaintext highlighter-rouge">rax</code>. The exit system call takes only one
argument, and that argument will be 0 (success), so <code class="language-plaintext highlighter-rouge">rdi</code> should be
set to zero. It’s likely that it’s already zero when the program
starts, but the ABI document says its contents are undefined (§3.4),
so we’ll set it explicitly.</p>

<p>For Linux on x86-64, the system call number for exit is 60
(/usr/include/asm/unistd_64.h), so <code class="language-plaintext highlighter-rouge">rax</code> will be set to 60, followed
by <code class="language-plaintext highlighter-rouge">syscall</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">xor</span>  <span class="nb">edi</span><span class="p">,</span> <span class="nb">edi</span>
    <span class="nf">mov</span>  <span class="nb">eax</span><span class="p">,</span> <span class="mi">60</span>
    <span class="nf">syscall</span>
</code></pre></div></div>

<p>There’s no assembler available to turn this into machine code, so it
has to be assembled by hand. For that we need the Intel manual (2).</p>

<p>The first instruction is <code class="language-plaintext highlighter-rouge">xor</code>, so look up that mnemonic in the
manual. Like most x86 mnemonics, there are many different opcodes and
multiple ways to encode the same operation. For <code class="language-plaintext highlighter-rouge">xor</code>, we have 22
opcodes to examine.</p>

<p><img src="/img/steady-hand/xor.png" alt="" /></p>

<p>The operands are two 32-bit registers, so there are two options:
opcodes 0x31 and 0x33.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>31 /r      XOR r/m32, r32
33 /r      XOR r32, r/m32
</code></pre></div></div>

<p>The “r/m32” means the operand can be either a register or the address
of a 32-bit region of memory. With two register operands, both
encodings are equally valid, both have the same length (2 bytes), and
neither is canonical, so the decision is entirely arbitrary. Let’s
pick the first one, opcode 0x31, since it’s listed first.</p>

<p>The “/r” after the opcode means the register-only operand (“r32” in
both cases) will be specified in the ModR/M byte. This is the byte
that immediately follows the opcode and specifies one or two of the
operands.</p>

<p>The ModR/M byte is broken into three parts: mod (2 bits), reg (3
bits), r/m (3 bits). This gets a little complicated, but if you stare
at Table 2-1 in the Intel manual for long enough it eventually makes
sense. In short, setting both mod bits (11) indicates we’re working
with a register rather than a memory operand. Here’s where we’re at for ModR/M:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11 ??? ???
</code></pre></div></div>

<p>The order of the x86 registers is unintuitive: <code class="language-plaintext highlighter-rouge">ax</code>, <code class="language-plaintext highlighter-rouge">cx</code>, <code class="language-plaintext highlighter-rouge">dx</code>, <code class="language-plaintext highlighter-rouge">bx</code>,
<code class="language-plaintext highlighter-rouge">sp</code>, <code class="language-plaintext highlighter-rouge">bp</code>, <code class="language-plaintext highlighter-rouge">si</code>, <code class="language-plaintext highlighter-rouge">di</code>. With 0-indexing, that gives <code class="language-plaintext highlighter-rouge">di</code> a value of 7
(111 in binary). With <code class="language-plaintext highlighter-rouge">edi</code> as both operands, this makes ModR/M:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11 111 111
</code></pre></div></div>

<p>Or, in hexadecimal, FF. And that’s it for this instruction. With the
opcode (0x31) and the ModR/M byte (0xFF):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>31 FF
</code></pre></div></div>
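
<p>Once compilers and interpreters are back, the same bit packing can be double-checked mechanically. Here is a minimal Python sketch of the steps above. The function names are my own, and it only covers the register-to-register case.</p>

```python
# Register numbers follow the x86 encoding order, 0 through 7.
REGISTERS = ("ax", "cx", "dx", "bx", "sp", "bp", "si", "di")

def modrm(mod, reg, rm):
    """Pack the mod (2 bits), reg (3 bits), and r/m (3 bits) fields."""
    return mod * 64 + reg * 8 + rm

def xor_r32_r32(dst, src):
    """Encode XOR r/m32, r32 (opcode 0x31) with two register operands."""
    # dst goes in the r/m field; src is the register-only operand in reg.
    # mod = 3 (binary 11) selects the register form.
    return bytes([0x31, modrm(3, REGISTERS.index(src), REGISTERS.index(dst))])
```

<p>Calling <code class="language-plaintext highlighter-rouge">xor_r32_r32("di", "di")</code> reproduces the two bytes worked out by hand.</p>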

<p>The encoding for <code class="language-plaintext highlighter-rouge">mov</code> is a bit different. Look it up and match the
operands. Like before, there are two possible options:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>B8+rd id   MOV r32, imm32
C7 /0 id   MOV r/m32, imm32
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">B8+rd</code> notation means the 32-bit register operand (<em>rd</em> for
“register double word”) is added to the opcode instead of having a
ModR/M byte. It’s followed by a 32-bit immediate value (<em>id</em> for
“immediate double word”). That’s a total of 5 bytes.</p>

<p>The “/0” in the second means 0 goes in the “reg” field of ModR/M, and the
whole instruction is followed by the 32-bit immediate (id). That’s a
total of 6 bytes. Since this is longer, we’ll use the first encoding.</p>

<p>So, that’s opcode <code class="language-plaintext highlighter-rouge">0xB8 + 0</code>, since <code class="language-plaintext highlighter-rouge">eax</code> is register number 0,
followed by 60 (0x3C) as a little endian, 4-byte value. Here’s the
encoding for the second instruction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>B8 3C 00 00 00
</code></pre></div></div>
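
<p>The opcode-plus-register form is simple enough to sketch in a few lines of Python. As before, the function name is hypothetical and this handles only the short <code class="language-plaintext highlighter-rouge">B8+rd</code> encoding for the eight original registers.</p>

```python
# Register numbers follow the x86 encoding order, 0 through 7.
REGISTERS = ("ax", "cx", "dx", "bx", "sp", "bp", "si", "di")

def mov_r32_imm32(reg, value):
    """Encode MOV r32, imm32 using the short B8+rd form."""
    # The register number is added directly to the base opcode.
    opcode = 0xB8 + REGISTERS.index(reg)
    # The immediate follows as a 4-byte little-endian value.
    return bytes([opcode]) + value.to_bytes(4, "little")
```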

<p>The final instruction is a cakewalk. There are no operands, and it comes
in only one form: two opcode bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0F 05   SYSCALL
</code></pre></div></div>

<p>So the encoding for this instruction is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0F 05
</code></pre></div></div>

<p>Putting it all together the program is 9 bytes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>31 FF B8 3C 00 00 00 0F 05
</code></pre></div></div>

<p>Aren’t you glad you don’t normally have to assemble entire programs by
hand?</p>

<h3 id="constructing-the-elf">Constructing the ELF</h3>

<p>Back in the old days you may have been able to simply drop these bytes
into a file and execute it. That’s how <a href="/blog/2014/12/09/">DOS COM programs worked</a>.
But this definitely won’t work if you try it on Linux. Binaries must
be in the Executable and Linking Format (ELF). This format tells the
loader how to initialize the program in memory and how to start it.</p>

<p>Fortunately for this program we’ll only need to fill out two
structures: the ELF header and one program header. The binary will be
the ELF header, followed immediately by the program header, followed
immediately by the program.</p>

<p><img src="/img/steady-hand/elf.svg" alt="" /></p>

<p>To fill this binary out, we’d use whatever method the virus left
behind for writing raw bytes to a file. For now I’ll assume the <code class="language-plaintext highlighter-rouge">echo</code>
command is still available, and we’ll use hexadecimal <code class="language-plaintext highlighter-rouge">\xNN</code> escapes
to write raw bytes. If this isn’t available, you might need to use the
magnetized needle and steady hand method, or the butterflies.</p>

<p>The very first structure in an ELF file must be the ELF header, from
the ELF specification (1):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">e_ident</span><span class="p">[</span><span class="n">EI_NIDENT</span><span class="p">];</span>
        <span class="kt">uint16_t</span>      <span class="n">e_type</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_machine</span><span class="p">;</span>
        <span class="kt">uint32_t</span>      <span class="n">e_version</span><span class="p">;</span>
        <span class="n">ElfN_Addr</span>     <span class="n">e_entry</span><span class="p">;</span>
        <span class="n">ElfN_Off</span>      <span class="n">e_phoff</span><span class="p">;</span>
        <span class="n">ElfN_Off</span>      <span class="n">e_shoff</span><span class="p">;</span>
        <span class="kt">uint32_t</span>      <span class="n">e_flags</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_ehsize</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_phentsize</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_phnum</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_shentsize</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_shnum</span><span class="p">;</span>
        <span class="kt">uint16_t</span>      <span class="n">e_shstrndx</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">ElfN_Ehdr</span><span class="p">;</span>
</code></pre></div></div>

<p>No other data is at a fixed location because this header specifies
where it can be found. If you’re writing a C program in the future,
once compilers have been bootstrapped back into existence, you can
access this structure in <code class="language-plaintext highlighter-rouge">elf.h</code>.</p>

<h4 id="the-elf-header">The ELF header</h4>

<p>The <code class="language-plaintext highlighter-rouge">EI_NIDENT</code> macro is 16, so <code class="language-plaintext highlighter-rouge">e_ident</code> is 16 bytes. The first 4
bytes are fixed: 0x7F, E, L, F.</p>

<p>The 5th byte is called <code class="language-plaintext highlighter-rouge">EI_CLASS</code>: a 32-bit program (<code class="language-plaintext highlighter-rouge">ELFCLASS32</code> =
1) or a 64-bit program (<code class="language-plaintext highlighter-rouge">ELFCLASS64</code> = 2). This will be a 64-bit
program (2).</p>

<p>The 6th byte indicates the integer format (<code class="language-plaintext highlighter-rouge">EI_DATA</code>). The one we want
for x86-64 is <code class="language-plaintext highlighter-rouge">ELFDATA2LSB</code> (1), two’s complement, little-endian.</p>

<p>The 7th byte is the ELF version (<code class="language-plaintext highlighter-rouge">EI_VERSION</code>), always 1 as of this
writing.</p>

<p>The 8th byte is the ABI (<code class="language-plaintext highlighter-rouge">EI_OSABI</code>), which in this case is
<code class="language-plaintext highlighter-rouge">ELFOSABI_SYSV</code> (0).</p>

<p>The 9th byte is the version (<code class="language-plaintext highlighter-rouge">EI_ABIVERSION</code>), which is just 0 again.</p>

<p>The rest is zero padding.</p>

<p>So writing the ELF header:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x7FELF\x02\x01\x01\x00' &gt; true
echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The next field is the <code class="language-plaintext highlighter-rouge">e_type</code>. This is an executable program, so it’s
<code class="language-plaintext highlighter-rouge">ET_EXEC</code> (2). Other options are object files (<code class="language-plaintext highlighter-rouge">ET_REL</code> = 1), shared
libraries (<code class="language-plaintext highlighter-rouge">ET_DYN</code> = 3), and core files (<code class="language-plaintext highlighter-rouge">ET_CORE</code> = 4).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x02\x00' &gt;&gt; true
</code></pre></div></div>

<p>The value for <code class="language-plaintext highlighter-rouge">e_machine</code> is <code class="language-plaintext highlighter-rouge">EM_X86_64</code> (0x3E). This value isn’t in
the ELF specification but rather the ABI document (§4.1.1). On BSD
this is instead named <code class="language-plaintext highlighter-rouge">EM_AMD64</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x3E\x00' &gt;&gt; true
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">e_version</code> it’s always 1, like in the header.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x01\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_entry</code> field will be 8 bytes because this is a 64-bit ELF. This
is the virtual address of the program’s entry point. It’s where the
loader will pass control and so it’s where we’ll load the program. The
typical entry address is somewhere around 0x400000. For a reason I’ll
explain shortly, our entry point will be 120 bytes (0x78) after a
similarly round number, 0x40000000, at 0x40000078.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_phoff</code> field holds the offset of the program header table. The
ELF header is 64 bytes (0x40) and this structure will immediately
follow. It’s also 8 bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x40\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_shoff</code> field holds the offset of the section table. In an
executable program we don’t need sections, so this is zero.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_flags</code> field has processor-specific flags, which in our case is
just 0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_ehsize</code> holds the size of the ELF header, which, as I said, is
64 bytes (0x40).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x40\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_phentsize</code> is the size of one program header, which is 56 bytes
(0x38).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x38\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_phnum</code> field indicates how many program headers there are. We
only need the one: the segment with the 9 program bytes, to be loaded
into memory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x01\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_shentsize</code> is the size of a section header. We’re not using
this, but we’ll do our due diligence. These are 64 bytes (0x40).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x40\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_shnum</code> field is the number of sections (0).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">e_shstrndx</code> is the index of the section with the string table. It
doesn’t exist, so it’s 0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00' &gt;&gt; true
</code></pre></div></div>
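
<p>For comparison, here is the whole ELF header assembled in one go as a Python sketch, in a world where an interpreter survived. The helper names <code class="language-plaintext highlighter-rouge">le</code> and <code class="language-plaintext highlighter-rouge">elf_header</code> are my own. The values are the ones chosen field by field above.</p>

```python
def le(value, size):
    """Encode an unsigned integer as `size` little-endian bytes."""
    return value.to_bytes(size, "little")

def elf_header(entry):
    """Build the 64-byte ELF header for a one-segment x86-64 executable."""
    # Magic, ELFCLASS64, ELFDATA2LSB, version 1, SysV ABI, then 8 zero bytes.
    ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
    fields = [
        (2, 2),      # e_type: ET_EXEC
        (0x3E, 2),   # e_machine: EM_X86_64
        (1, 4),      # e_version
        (entry, 8),  # e_entry
        (0x40, 8),   # e_phoff: program header follows this header
        (0, 8),      # e_shoff: no section table
        (0, 4),      # e_flags
        (0x40, 2),   # e_ehsize
        (0x38, 2),   # e_phentsize
        (1, 2),      # e_phnum
        (0x40, 2),   # e_shentsize
        (0, 2),      # e_shnum
        (0, 2),      # e_shstrndx
    ]
    return ident + b"".join(le(value, size) for value, size in fields)
```

<p>The result of <code class="language-plaintext highlighter-rouge">elf_header(0x40000078)</code> should match the bytes written by the <code class="language-plaintext highlighter-rouge">echo</code> commands so far.</p>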

<h4 id="the-program-header">The program header</h4>

<p>Next is our program header.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
        <span class="kt">uint32_t</span>   <span class="n">p_type</span><span class="p">;</span>
        <span class="kt">uint32_t</span>   <span class="n">p_flags</span><span class="p">;</span>
        <span class="n">Elf64_Off</span>  <span class="n">p_offset</span><span class="p">;</span>
        <span class="n">Elf64_Addr</span> <span class="n">p_vaddr</span><span class="p">;</span>
        <span class="n">Elf64_Addr</span> <span class="n">p_paddr</span><span class="p">;</span>
        <span class="kt">uint64_t</span>   <span class="n">p_filesz</span><span class="p">;</span>
        <span class="kt">uint64_t</span>   <span class="n">p_memsz</span><span class="p">;</span>
        <span class="kt">uint64_t</span>   <span class="n">p_align</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">Elf64_Phdr</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_type</code> field indicates the segment type. This segment will hold
the program and will be loaded into memory, so we want <code class="language-plaintext highlighter-rouge">PT_LOAD</code> (1).
Other kinds of segments set up dynamic loading and such.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x01\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_flags</code> field gives the memory protections. We want executable
(<code class="language-plaintext highlighter-rouge">PF_X</code> = 1) and readable (<code class="language-plaintext highlighter-rouge">PF_R</code> = 4). These are ORed together to
make 5.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x05\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_offset</code> is the file offset for the content of this segment.
This will be the program we assembled. It will immediately follow
this header. The ELF header was 64 bytes, plus a 56 byte program
header, which is 120 (0x78).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x78\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_vaddr</code> is the virtual address where this segment will be
loaded. This is the entry point from before. A restriction is that
this value must be congruent with <code class="language-plaintext highlighter-rouge">p_offset</code> modulo the page size.
That’s why the entry point was offset by 120 bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_paddr</code> is unused for this platform.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_filesz</code> is the size of the segment in the file: 9 bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_memsz</code> is the size of the segment in memory, also 9 bytes. It
might sound redundant, but these are allowed to differ, in which case
it’s either truncated or padded with zeroes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">p_align</code> indicates the segment’s alignment. We don’t care about
alignment.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' &gt;&gt; true
</code></pre></div></div>
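
<p>The program header can be sketched the same way in Python, this time with the congruence restriction on <code class="language-plaintext highlighter-rouge">p_vaddr</code> and <code class="language-plaintext highlighter-rouge">p_offset</code> checked explicitly. The names are hypothetical, and the 4096-byte page size is an assumption that holds for typical x86-64 Linux systems.</p>

```python
PAGE_SIZE = 4096  # assumed page size for x86-64 Linux

def le(value, size):
    """Encode an unsigned integer as `size` little-endian bytes."""
    return value.to_bytes(size, "little")

def program_header(offset, vaddr, size):
    """Build a 56-byte PT_LOAD program header for a read+execute segment."""
    if offset % PAGE_SIZE != vaddr % PAGE_SIZE:
        raise ValueError("p_vaddr and p_offset must agree modulo the page size")
    fields = [
        (1, 4),       # p_type: PT_LOAD
        (5, 4),       # p_flags: PF_X (1) ORed with PF_R (4)
        (offset, 8),  # p_offset
        (vaddr, 8),   # p_vaddr
        (0, 8),       # p_paddr: unused
        (size, 8),    # p_filesz
        (size, 8),    # p_memsz
        (0, 8),       # p_align
    ]
    return b"".join(le(value, width) for value, width in fields)
```

<p>With an offset of 0x78, an address such as 0x40000000 is rejected, which is the reason the entry point carries the extra 120 bytes.</p>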

<h4 id="append-the-program">Append the program</h4>

<p>Finally, append the program we assembled at the beginning.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -ne '\x31\xFF\xB8\x3C\x00\x00\x00\x0F\x05' &gt;&gt; true
</code></pre></div></div>

<p>Set it executable (hopefully <code class="language-plaintext highlighter-rouge">chmod</code> survived!):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod +x true
</code></pre></div></div>

<p>And test it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./true &amp;&amp; echo 'Success'
</code></pre></div></div>
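
<p>If the test fails, a mistyped byte is the likely culprit. As a debugging aid, here is a Python sketch that reads a few fields back out of the file contents and checks them against the layout described above. The function names are my own, and the checks cover only this particular 129-byte binary.</p>

```python
def read_le(data, offset, size):
    """Decode `size` little-endian bytes starting at `offset`."""
    return int.from_bytes(data[offset:offset + size], "little")

def check_true_binary(data):
    """Validate the hand-built layout and return the entry address."""
    if data[:4] != b"\x7fELF":
        raise ValueError("bad magic")
    if len(data) != 64 + 56 + 9:
        raise ValueError("expected 129 bytes in total")
    entry = read_le(data, 24, 8)          # e_entry
    phoff = read_le(data, 32, 8)          # e_phoff
    vaddr = read_le(data, phoff + 16, 8)  # p_vaddr of the first segment
    if entry != vaddr:
        raise ValueError("entry point is not the start of the segment")
    return entry
```

<p>Pass it the result of reading the file in binary mode, for example <code class="language-plaintext highlighter-rouge">open("true", "rb").read()</code>.</p>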

<p>Here’s the whole thing as a shell script:</p>

<ul>
  <li><a href="/download/make-true.sh">make-true.sh</a></li>
</ul>

<p>Is the C compiler done bootstrapping yet?</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Emacs, Dynamic Modules, and Joysticks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/05/"/>
    <id>urn:uuid:c53305bb-4770-3a7f-934c-31eea37d38eb</id>
    <updated>2016-11-05T04:01:51Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="c"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Two months ago Emacs 25 was released and introduced a <a href="http://diobla.info/blog-archive/modules-tut.html">new dynamic
module feature</a>. Emacs can now load shared libraries built
against Emacs’ module API, defined in <a href="http://git.savannah.gnu.org/cgit/emacs.git/tree/src/emacs-module.h?h=emacs-25.1">emacs-module.h</a>. What’s
interesting about this API is that it doesn’t require linking against
Emacs or any sort of library. Instead, at run time Emacs supplies the
module’s initialization function with function pointers for the entire
API.</p>

<p>As a demonstration, in this article I’ll build an Emacs joystick
interface (Linux only) using a dynamic module. It will allow Emacs to
read events from any joystick on the system. All the source code is
here:</p>

<ul>
  <li><a href="https://github.com/skeeto/joymacs">https://github.com/skeeto/joymacs</a></li>
</ul>

<p>It includes a calibration interface (<code class="language-plaintext highlighter-rouge">M-x joydemo</code>) within Emacs:</p>

<p><a href="/img/joymacs/joymacs.png"><img src="/img/joymacs/joymacs-thumb.png" alt="" /></a></p>

<p>Currently, Emacs’ emacs-module.h header is the entirety of the module
documentation. It’s a bit thin and leaves ambiguities that require
some reading of the Emacs source code. Even reading the source, it’s
not clear which behaviors are a reliable part of the interface. For
example, if there’s a pending non-local exit, it’s safe for a function
to return <code class="language-plaintext highlighter-rouge">NULL</code> since the return value is never inspected (Emacs
25.1), but will this always be the case? While mistakes are
unforgiving (a hard crash), the API is mostly intuitive and it’s been
pretty easy to feel my way around it.</p>

<p><em>Update</em>: Philipp Stephani has <a href="https://phst.github.io/emacs-modules">written thorough, reliable module
documentation</a>.</p>

<h3 id="dynamic-module-types">Dynamic Module Types</h3>

<p>All Emacs values — integers, floats, cons cells, vectors, strings,
etc. — are represented as the polymorphic, pointer-valued type,
<code class="language-plaintext highlighter-rouge">emacs_value</code>. Despite being a pointer, <code class="language-plaintext highlighter-rouge">NULL</code> is not a valid value,
as convenient as that would be. The API includes functions for
creating and extracting the fundamental types: integers, floats,
strings. Almost all other object types can only be accessed by making
Lisp function calls to regular Emacs functions from the module.</p>

<p>Modules also introduce a brand new Emacs object type: a <em>user
pointer</em>. These are <a href="/blog/2013/12/30/">non-readable</a>, opaque pointer values
returned by modules, typically representing a handle to some resource,
be it a memory block, database connection, or a joystick. These
objects include a finalizer function pointer — which, surprisingly, is
not permitted to be NULL — and their lifetime is managed by Emacs’
garbage collector.</p>

<p>User pointers are a somewhat dangerous feature since there’s little to
stop Emacs Lisp code from misusing them. A Lisp program can take a
user pointer from one module and pass it to a function in a different
module. Since it’s just a pointer, there’s no way to type check it. At
best, a module could maintain a table of all its live pointers,
checking all user pointer arguments against the table before
dereferencing. But I don’t expect this to be normal practice.</p>
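<p>For illustration, here’s a minimal sketch of what such a table might
look like. The names (<code class="language-plaintext highlighter-rouge">live_add</code>, <code class="language-plaintext highlighter-rouge">live_check</code>, <code class="language-plaintext highlighter-rouge">live_remove</code>) and the
fixed capacity are assumptions made up for this example, not part of
the module API or of joymacs.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-size registry of pointers this module handed out. */
#define LIVE_MAX 64
static void *live[LIVE_MAX];
static size_t live_count;

/* Record a pointer when wrapping it in a user pointer. False if full. */
static bool
live_add(void *p)
{
    if (live_count == LIVE_MAX)
        return false;
    live[live_count++] = p;
    return true;
}

/* Check an incoming user pointer argument before dereferencing it. */
static bool
live_check(const void *p)
{
    for (size_t i = 0; i < live_count; i++)
        if (live[i] == p)
            return true;
    return false;
}

/* Forget a pointer, e.g. from its finalizer. Order is not preserved. */
static void
live_remove(const void *p)
{
    for (size_t i = 0; i < live_count; i++) {
        if (live[i] == p) {
            live[i] = live[--live_count];
            return;
        }
    }
}
```

<p>A module function would call <code class="language-plaintext highlighter-rouge">live_check()</code> on the result of
<code class="language-plaintext highlighter-rouge">get_user_ptr</code> and signal an error on a miss. The linear scan is only
reasonable for a handful of handles.</p>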

<h3 id="module-initialization">Module Initialization</h3>

<p>After loading the module through the platform’s mechanism, the first
thing Emacs does is check for the symbol <code class="language-plaintext highlighter-rouge">plugin_is_GPL_compatible</code>.
While tacky, this is not surprising given the culture around Emacs.</p>

<p>Next it calls <code class="language-plaintext highlighter-rouge">emacs_module_init()</code>, passing it the first function
pointer. From this, the module can get a Lisp environment and start
doing Emacs things, such as binding module functions to Lisp symbols.</p>

<p>Here’s a complete “Hello, world!” example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"emacs-module.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="n">plugin_is_GPL_compatible</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">emacs_module_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">emacs_runtime</span> <span class="o">*</span><span class="n">ert</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">ert</span><span class="o">-&gt;</span><span class="n">get_environment</span><span class="p">(</span><span class="n">ert</span><span class="p">);</span>
    <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"message"</span><span class="p">);</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="n">hi</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!"</span><span class="p">;</span>
    <span class="n">emacs_value</span> <span class="n">string</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">hi</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">hi</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">env</span><span class="o">-&gt;</span><span class="n">funcall</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">string</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In a real module, it’s common to create function objects for native
functions, then fetch the <code class="language-plaintext highlighter-rouge">fset</code> symbol and make a Lisp call on it to
bind the newly-created function object to a name. You’ll see this in
action later.</p>

<h3 id="joystick-api">Joystick API</h3>

<p>The joystick API will closely resemble <a href="https://www.kernel.org/doc/Documentation/input/joystick-api.txt">Linux’s own joystick API</a>,
making for a fairly thin wrapper. It’s so thin that Emacs <em>almost</em>
doesn’t even need a dynamic module. This is because, on Linux,
joysticks are just files under <code class="language-plaintext highlighter-rouge">/dev/input/</code>. Want to see the input
events on the first joystick? Just read <code class="language-plaintext highlighter-rouge">/dev/input/js0</code>. So Plan 9.</p>

<p>Emacs already knows how to read files, but these virtual files are a
little <em>too</em> special for that. The header <code class="language-plaintext highlighter-rouge">linux/joystick.h</code> defines a
<code class="language-plaintext highlighter-rouge">struct js_event</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">js_event</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">time</span><span class="p">;</span>  <span class="cm">/* event timestamp in milliseconds */</span>
    <span class="kt">int16_t</span> <span class="n">value</span><span class="p">;</span>
    <span class="kt">uint8_t</span> <span class="n">type</span><span class="p">;</span>
    <span class="kt">uint8_t</span> <span class="n">number</span><span class="p">;</span> <span class="cm">/* axis/button number */</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The idea is to read from the joystick device into this structure. The
first several reads are initialization that define the axes and
buttons of the joystick and their initial state. Further events are
queued up for the file descriptor. This all means that the file can’t
just be opened each time joystick input is needed. It has to be held
open for the duration, and is typically configured non-blocking.</p>
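<p>As a rough sketch of how those initialization events can be told apart
from ordinary ones, the <code class="language-plaintext highlighter-rouge">type</code> field is a small bit set. The
constants below mirror the values in <code class="language-plaintext highlighter-rouge">linux/joystick.h</code> but are
redefined under made-up names so the example stands alone, and the
helper functions are hypothetical.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values mirror linux/joystick.h; redefined so the sketch stands alone. */
#define EV_BUTTON 0x01  /* JS_EVENT_BUTTON */
#define EV_AXIS   0x02  /* JS_EVENT_AXIS */
#define EV_INIT   0x80  /* JS_EVENT_INIT: synthetic initial-state event */

struct event {          /* same layout as struct js_event above */
    uint32_t time;
    int16_t value;
    uint8_t type;
    uint8_t number;
};

/* Copy one 8-byte record from a raw read buffer into an event. */
static struct event
event_decode(const unsigned char *raw)
{
    struct event e;
    assert(sizeof(e) == 8);
    memcpy(&e, raw, sizeof(e));
    return e;
}

/* The init flag is OR'd onto the button/axis type, so mask it off. */
static int event_is_init(struct event e)   { return (e.type & EV_INIT) != 0; }
static int event_is_button(struct event e) { return (e.type & ~EV_INIT) == EV_BUTTON; }
static int event_is_axis(struct event e)   { return (e.type & ~EV_INIT) == EV_AXIS; }
```

<p>An initial axis report arrives with both the axis and init bits set, so
a caller that only cares about live input can skip anything for which
<code class="language-plaintext highlighter-rouge">event_is_init()</code> is true.</p>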

<p>The Emacs package will be called <code class="language-plaintext highlighter-rouge">joymacs</code> and there will be three
functions:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">joymacs-open</span> <span class="nv">N</span><span class="p">)</span>
<span class="p">(</span><span class="nv">joymacs-close</span> <span class="nv">JOYSTICK</span><span class="p">)</span>
<span class="p">(</span><span class="nv">joymacs-read</span> <span class="nv">JOYSTICK</span> <span class="nv">EVENT-VECTOR</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="joymacs-open">joymacs-open</h4>

<p>The <code class="language-plaintext highlighter-rouge">joymacs-open</code> function will take an integer, opening the Nth
joystick (<code class="language-plaintext highlighter-rouge">/dev/input/jsN</code>). It will create a file descriptor for the
joystick device, returning it as a user pointer. Think of it as a sort
of “joystick handle.” Now, it <em>could</em> instead return the file
descriptor as an integer, but the user pointer has two significant
benefits:</p>

<ol>
  <li>
    <p><strong>The resource will be garbage collected.</strong> If the caller loses
track of a file descriptor returned as an integer, the joystick
device will be held open until Emacs shuts down, using up one of
Emacs’ file descriptors. By putting it in a user pointer, the
garbage collector will have the module release the file
descriptor if the user loses track of it.</p>
  </li>
  <li>
    <p><strong>It should be difficult for the user to make a dangerous call.</strong>
Emacs Lisp can’t create user pointers — they only come from modules
— and so the module is less likely to get passed the wrong thing.
In the case of <code class="language-plaintext highlighter-rouge">joymacs-close</code>, the module will be calling
<code class="language-plaintext highlighter-rouge">close(2)</code> on the argument. We definitely don’t want to make that
system call on file descriptors owned by Emacs. Further, since user
pointers are mutable, the module can ensure it doesn’t call
<code class="language-plaintext highlighter-rouge">close(2)</code> twice.</p>
  </li>
</ol>

<p>Here’s the implementation for <code class="language-plaintext highlighter-rouge">joymacs-open</code>. I’ll go over each part
in detail.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">emacs_value</span>
<span class="nf">joymacs_open</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ptr</span><span class="p">;</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">n</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">extract_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">buflen</span> <span class="o">=</span> <span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"/dev/input/js%d"</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">O_RDONLY</span> <span class="o">|</span> <span class="n">O_NONBLOCK</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">emacs_value</span> <span class="n">signal</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"file-error"</span><span class="p">);</span>
        <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">buflen</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_signal</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">signal</span><span class="p">,</span> <span class="n">message</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fin_close</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">fd</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The C function name doesn’t matter to Emacs. It’s <code class="language-plaintext highlighter-rouge">static</code> because it
doesn’t even matter whether the function is visible to Emacs. It will get the
function pointer later as part of initialization.</p>

<p>This is the prototype for all functions callable by Emacs Lisp,
regardless of arity. It has four arguments:</p>

<ol>
  <li>
    <p>It gets an environment, <code class="language-plaintext highlighter-rouge">env</code>, through which to call back into
Emacs.</p>
  </li>
  <li>
    <p>It gets <code class="language-plaintext highlighter-rouge">n</code>, the number of arguments. This is guaranteed to be the
correct number of arguments, as specified later when creating the
function object, so only variadic functions need to inspect this
argument.</p>
  </li>
  <li>
    <p>The Lisp arguments are passed as an array of values, <code class="language-plaintext highlighter-rouge">args</code>.
There’s no type declaration when declaring a function object, so
these may be of the wrong type. I’ll go over how to deal with this.</p>
  </li>
  <li>
    <p>Finally, it gets an arbitrary pointer, supplied at function object
creation time. This allows the module to create closures, but will
usually be ignored.</p>
  </li>
</ol>

<p>The first thing the function does is extract its integer argument.
This is actually an <code class="language-plaintext highlighter-rouge">intmax_t</code>, but I don’t think anyone has that many
USB ports. An <code class="language-plaintext highlighter-rouge">int</code> will suffice.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">extract_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
</code></pre></div></div>

<p>As for not underestimating fools, what if the user passed a value that
isn’t an integer? Will the world come crashing down? Fortunately Emacs
checks that in <code class="language-plaintext highlighter-rouge">extract_integer</code> and, if there’s a mismatch, sets a
pending error signal in the environment. This is really great because
checking types directly in the module is a <em>real pain in the ass</em>. So,
before committing to anything further, such as opening a file, I check
for this signal and bail out early if necessary. In Emacs 25.1 it’s
safe to return NULL since the return value will be completely ignored,
but I’d rather hedge my bets.</p>

<p>By the way, the <code class="language-plaintext highlighter-rouge">nil</code> here is a global variable set in initialization.
You don’t just get that for free!</p>

<p>The next step is opening the joystick device, read-only and
non-blocking. The non-blocking is vital because the module would
otherwise hang Emacs later if there are no events (well, except for
the read being quickly interrupted by a POSIX signal).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">buflen</span> <span class="o">=</span> <span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"/dev/input/js%d"</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">O_RDONLY</span> <span class="o">|</span> <span class="n">O_NONBLOCK</span><span class="p">);</span>
</code></pre></div></div>
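<p>The effect of <code class="language-plaintext highlighter-rouge">O_NONBLOCK</code> can be observed without a joystick. This
sketch uses a pipe as a stand-in for the device, and the helper names
are invented for the example. Reading from the empty, non-blocking
pipe fails immediately with <code class="language-plaintext highlighter-rouge">EAGAIN</code> instead of hanging.</p>

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a byte was read, 0 if nothing is pending, -1 on error or EOF. */
static int
poll_byte(int fd, char *out)
{
    ssize_t r = read(fd, out, 1);
    if (r == 1)
        return 1;
    if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;   /* would have blocked: no data yet */
    return -1;
}

/* Create a pipe whose read end is non-blocking. Returns 0 on success. */
static int
make_nonblocking_pipe(int fds[2])
{
    if (pipe(fds) == -1)
        return -1;
    int flags = fcntl(fds[0], F_GETFL);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}
```

<p>Without the flag, the first <code class="language-plaintext highlighter-rouge">read(2)</code> on the empty pipe would block
until a writer showed up, which is exactly the hang to avoid inside
Emacs.</p>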

<p>If the joystick fails to open (e.g. it doesn’t exist, or the user
lacks permission), manually set an error signal for a non-local exit.
I chose the <code class="language-plaintext highlighter-rouge">file-error</code> signal and I’m just using the filename as the
signal data.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">emacs_value</span> <span class="n">signal</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"file-error"</span><span class="p">);</span>
        <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">buflen</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_signal</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">signal</span><span class="p">,</span> <span class="n">message</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Otherwise create the user pointer. No need to allocate any memory;
just stuff it in the pointer itself. If the user mistakenly passes it
to another module, it will sure be in for a surprise when it tries to
dereference it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">return</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fin_close</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">fin_close()</code> function is defined as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">fin_close</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">fdptr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">fdptr</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The garbage collector will call this function when the user pointer is
lost. If the user closes it early with <code class="language-plaintext highlighter-rouge">joymacs-close</code>, that function
will set the user pointer to -1, an invalid file descriptor, so that
it doesn’t get closed a second time here.</p>

<h4 id="joymacs-close">joymacs-close</h4>

<p>Here’s <code class="language-plaintext highlighter-rouge">joymacs-close</code>, which is a bit simpler.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">emacs_value</span>
<span class="nf">joymacs_close</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ptr</span><span class="p">;</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">n</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">set_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Again, it starts by extracting its argument, relying on Emacs to do
the check:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
</code></pre></div></div>

<p>If the user pointer hasn’t been closed yet, then close it and strip
out the file descriptor to prevent further closes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">set_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>
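<p>The close-once logic can be exercised outside of Emacs. In this sketch a
plain <code class="language-plaintext highlighter-rouge">void *</code> variable stands in for the user pointer’s slot, and the
<code class="language-plaintext highlighter-rouge">handle</code> names are hypothetical. It’s only meant to show the
round trip through <code class="language-plaintext highlighter-rouge">intptr_t</code> and the -1 sentinel.</p>

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Stand-in for the value held by a user pointer (hypothetical name). */
typedef void *handle;

/* Store a file descriptor in the pointer itself, and get it back out. */
static handle handle_wrap(int fd) { return (void *)(intptr_t)fd; }
static int    handle_fd(handle h) { return (int)(intptr_t)h; }

/* Close the descriptor unless already closed, then mark the slot as -1.
 * Returns 1 if this call closed it, 0 if there was nothing to close. */
static int
handle_close(handle *slot)
{
    int fd = handle_fd(*slot);
    if (fd == -1)
        return 0;
    close(fd);
    *slot = handle_wrap(-1);
    return 1;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">handle_close()</code> a second time on the same slot is a no-op,
which is the property both the finalizer and the explicit close rely on.</p>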

<h4 id="joymacs-read">joymacs-read</h4>

<p>The <code class="language-plaintext highlighter-rouge">joymacs-read</code> function is doing something a little unusual for an
Emacs Lisp function. It takes two arguments: the joystick handle and a
5-element vector. Instead of returning the event in some
representation, it fills the vector with the event details. There are
two reasons for this:</p>

<ol>
  <li>
    <p>The API has no function for creating vectors … though the module
<em>could</em> get the <code class="language-plaintext highlighter-rouge">make-vector</code> symbol and call it to create a
vector.</p>
  </li>
  <li>
    <p>The idiom for event pumps is for the caller to supply a buffer to
the pump. This has better performance by avoiding lots of
unnecessary allocations, especially since events tend to be
message-like objects with a short, well-defined extent.</p>
  </li>
</ol>
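<p>In plain C, the caller-supplied buffer idiom looks roughly like the
following. The <code class="language-plaintext highlighter-rouge">pump</code> and <code class="language-plaintext highlighter-rouge">pump_event</code> types are invented for
this sketch, with an in-memory array standing in for the device.</p>

```c
#include <assert.h>
#include <stddef.h>

struct pump_event { int type, number, value; };  /* hypothetical record */

struct pump {                 /* hypothetical queue of pending events */
    const struct pump_event *pending;
    size_t len, pos;
};

/* Fill the caller's buffer with the next event. Returns 1 if *out was
 * written, 0 if the queue is drained. Nothing is allocated per event. */
static int
pump_next(struct pump *p, struct pump_event *out)
{
    assert(out != NULL);
    if (p->pos == p->len)
        return 0;
    *out = p->pending[p->pos++];
    return 1;
}
```

<p>The caller declares one <code class="language-plaintext highlighter-rouge">struct pump_event</code> and reuses it across a
<code class="language-plaintext highlighter-rouge">while (pump_next(...))</code> loop. <code class="language-plaintext highlighter-rouge">joymacs-read</code> follows the same
pattern, with the Lisp vector playing the role of the buffer.</p>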

<p>Here’s the full definition:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">emacs_value</span>
<span class="nf">joymacs_read</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">n</span><span class="p">;</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ptr</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">js_event</span> <span class="n">e</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">e</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">e</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">errno</span> <span class="o">==</span> <span class="n">EAGAIN</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* No more events. */</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* An actual read error (joystick unplugged, etc.). */</span>
        <span class="n">emacs_value</span> <span class="n">signal</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"file-error"</span><span class="p">);</span>
        <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">error</span> <span class="o">=</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">);</span>
        <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">error</span><span class="p">);</span>
        <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">error</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_signal</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">signal</span><span class="p">,</span> <span class="n">message</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="cm">/* Fill out event vector. */</span>
        <span class="n">emacs_value</span> <span class="n">v</span> <span class="o">=</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">emacs_value</span> <span class="n">type</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">&amp;</span> <span class="n">JS_EVENT_BUTTON</span> <span class="o">?</span> <span class="n">button</span> <span class="o">:</span> <span class="n">axis</span><span class="p">;</span>
        <span class="n">emacs_value</span> <span class="n">value</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">type</span> <span class="o">==</span> <span class="n">button</span><span class="p">)</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">value</span> <span class="o">?</span> <span class="n">t</span> <span class="o">:</span> <span class="n">nil</span><span class="p">;</span>
        <span class="k">else</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_float</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">value</span> <span class="o">/</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">INT16_MAX</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">time</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">type</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">number</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">&amp;</span> <span class="n">JS_EVENT_INIT</span> <span class="o">?</span> <span class="n">t</span> <span class="o">:</span> <span class="n">nil</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As before, extract the first argument and check for a signal. Then
call <code class="language-plaintext highlighter-rouge">read(2)</code> to get an event. If the read fails with <code class="language-plaintext highlighter-rouge">EAGAIN</code>, it’s
not a real failure. There are just no more events, so return nil.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">struct</span> <span class="n">js_event</span> <span class="n">e</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">e</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">e</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">errno</span> <span class="o">==</span> <span class="n">EAGAIN</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* No more events. */</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>If the read failed with something else — perhaps the joystick was
unplugged — signal an error. The <code class="language-plaintext highlighter-rouge">strerror(3)</code> string is used for the
signal data.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* An actual read error (joystick unplugged, etc.). */</span>
        <span class="n">emacs_value</span> <span class="n">signal</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"file-error"</span><span class="p">);</span>
        <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">error</span> <span class="o">=</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">);</span>
        <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">error</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">error</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_signal</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">signal</span><span class="p">,</span> <span class="n">message</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Otherwise fill out the event vector. If the second argument isn’t a
vector, or if it’s too short, the signal will automatically get raised
by Emacs. The module can keep plowing through the <code class="language-plaintext highlighter-rouge">vec_set()</code> calls
safely since it’s not committing to anything.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="cm">/* Fill out event vector. */</span>
        <span class="n">emacs_value</span> <span class="n">v</span> <span class="o">=</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">emacs_value</span> <span class="n">type</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">&amp;</span> <span class="n">JS_EVENT_BUTTON</span> <span class="o">?</span> <span class="n">button</span> <span class="o">:</span> <span class="n">axis</span><span class="p">;</span>
        <span class="n">emacs_value</span> <span class="n">value</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">type</span> <span class="o">==</span> <span class="n">button</span><span class="p">)</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">value</span> <span class="o">?</span> <span class="n">t</span> <span class="o">:</span> <span class="n">nil</span><span class="p">;</span>
        <span class="k">else</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_float</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">value</span> <span class="o">/</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">INT16_MAX</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">time</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">type</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">number</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">&amp;</span> <span class="n">JS_EVENT_INIT</span> <span class="o">?</span> <span class="n">t</span> <span class="o">:</span> <span class="n">nil</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
</code></pre></div></div>

<p>The Linux event struct has four fields and the function fills out five
values of the vector. This is because the <code class="language-plaintext highlighter-rouge">type</code> field has a bit flag
indicating initialization events. This is split out into an extra
t/nil value. It also normalizes axis values and converts button values
into t/nil, which makes more sense for Emacs Lisp. The event itself is
returned since it’s a truthy value and it’s convenient for the caller.</p>

<p>The astute programmer might notice that the negative side of the axis
could go just below -1.0, since <code class="language-plaintext highlighter-rouge">INT16_MIN</code> has one extra value over
<code class="language-plaintext highlighter-rouge">INT16_MAX</code> (two’s complement). It doesn’t seem to be documented, but
the joystick drivers I’ve seen never exactly return <code class="language-plaintext highlighter-rouge">INT16_MIN</code>, so
this is in fact the correct way to normalize it.</p>
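<p>To make the arithmetic concrete, here’s a minimal standalone sketch of the normalization. The <code class="language-plaintext highlighter-rouge">normalize_axis()</code> helper is hypothetical and not part of joymacs; it only isolates the division used above.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Two's complement: the negative side has one more value than the
 * positive side. */
static_assert(INT16_MIN == -INT16_MAX - 1, "int16_t range is asymmetric");

/* Hypothetical helper (not from joymacs): scale a raw 16-bit axis
 * reading by INT16_MAX, the same division used in joymacs_read(). */
double
normalize_axis(int16_t raw)
{
    return raw / (double)INT16_MAX;
}
```

<p>With this, <code class="language-plaintext highlighter-rouge">INT16_MAX</code> maps to exactly 1.0 and <code class="language-plaintext highlighter-rouge">-INT16_MAX</code> to exactly -1.0, while <code class="language-plaintext highlighter-rouge">INT16_MIN</code> would land a hair below -1.0.</p>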

<h4 id="initialization">Initialization</h4>

<p><em>Update 2021</em>: In a previous version of this article, I talked about
interning symbols during initialization so that they do not need to be
re-interned each time the module is called. This <a href="https://github.com/skeeto/joymacs/issues/1">no longer works</a>,
and it was probably never intended to work in the first place. The
lesson is simple: <strong>Do not reuse Emacs objects between module calls.</strong></p>

<p>First grab the <code class="language-plaintext highlighter-rouge">fset</code> symbol since this function will be needed to bind
names to the module’s functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">emacs_value</span> <span class="n">fset</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"fset"</span><span class="p">);</span>
</code></pre></div></div>

<p>Using <code class="language-plaintext highlighter-rouge">fset</code>, bind the functions. The second and third arguments to
<code class="language-plaintext highlighter-rouge">make_function</code> are the minimum and maximum number of arguments, which
<a href="/blog/2014/01/04/">may look familiar</a>. The last argument is that closure pointer
I mentioned at the beginning.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">emacs_value</span> <span class="n">args</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
    <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"joymacs-open"</span><span class="p">);</span>
    <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_function</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">joymacs_open</span><span class="p">,</span> <span class="n">doc</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">env</span><span class="o">-&gt;</span><span class="n">funcall</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fset</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">args</span><span class="p">);</span>
</code></pre></div></div>

<p>If the module is to be loaded with <code class="language-plaintext highlighter-rouge">require</code> like any other package,
it needs to provide: <code class="language-plaintext highlighter-rouge">(provide 'joymacs)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">emacs_value</span> <span class="n">provide</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"provide"</span><span class="p">);</span>
    <span class="n">emacs_value</span> <span class="n">joymacs</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"joymacs"</span><span class="p">);</span>
    <span class="n">env</span><span class="o">-&gt;</span><span class="n">funcall</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">provide</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">joymacs</span><span class="p">);</span>
</code></pre></div></div>

<p>And that’s it!</p>

<p>The source repository now includes a port to Windows (XInput). If
you’re on Linux or Windows, have Emacs 25 with modules enabled, and a
joystick is plugged in, then <code class="language-plaintext highlighter-rouge">make run</code> in the repository should bring
up Emacs running a joystick calibration demonstration. The module
can’t poke at Emacs when events are ready, so instead there’s a timer
that polls the module for events.</p>

<p>I’d like to someday see an Emacs Lisp game well-suited for a joystick.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>An Array of Pointers vs. a Multidimensional Array</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/27/"/>
    <id>urn:uuid:d1302ff9-f958-3486-134d-01c8ab84aa51</id>
    <updated>2016-10-27T21:01:33Z</updated>
    <category term="c"/><category term="linux"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In a C program, suppose I have a table of color names of similar
length. There are two straightforward ways to construct this table.
The most common would be an array of <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The other is a two-dimensional <code class="language-plaintext highlighter-rouge">char</code> array.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The initializers are identical, and the syntax by which these tables
are used is the same, but the underlying data structures are very
different. For example, suppose I had a lookup() function that
searches the table for a particular color.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lookup</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">color</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">ncolors</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ncolors</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">color</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Thanks to array decay (array arguments are implicitly converted to
pointers, §6.9.1-10), it doesn’t matter if the table is <code class="language-plaintext highlighter-rouge">char
colors[][7]</code> or <code class="language-plaintext highlighter-rouge">char *colors[]</code>. The identical syntax is a little
misleading, though, because the compiler generates different code
depending on the type.</p>
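<p>A quick way to see that the two tables really are different types is to compare element sizes. This is a minimal sketch; the helper function names are made up for illustration, and the 8-byte pointer size in the comment assumes a typical x86-64 target.</p>

```c
#include <assert.h>
#include <stddef.h>

char *colors_ptr[] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};
char colors_2d[][7] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

/* One element of colors_ptr is a pointer (8 bytes on x86-64). */
size_t ptr_elem_size(void) { return sizeof(colors_ptr[0]); }

/* One element of colors_2d is a whole 7-byte array. */
size_t arr_elem_size(void) { return sizeof(colors_2d[0]); }

/* Byte distance between row i and row i+1 of colors_2d. */
ptrdiff_t
arr_stride(int i)
{
    assert(i >= 0 && i < 5);
    return (char *)&colors_2d[i + 1] - (char *)&colors_2d[i];
}
```

<p>The same <code class="language-plaintext highlighter-rouge">table[i]</code> expression yields a stored pointer in one case and the address of a row inside the table in the other.</p>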

<h3 id="memory-layout">Memory Layout</h3>

<p>Here’s what <code class="language-plaintext highlighter-rouge">colors_ptr</code>, a <em>jagged array</em>, typically looks like in
memory.</p>

<p><img src="/img/colortab/pointertab.png" alt="" /></p>

<p>The array of six pointers will point into the program’s string table,
usually stored in a separate page. The strings aren’t in any
particular order and will be interspersed with the program’s other
string constants. The type of the expression <code class="language-plaintext highlighter-rouge">colors_ptr[n]</code> is <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<p>On x86-64, suppose the base of the table is in <code class="language-plaintext highlighter-rouge">rax</code>, the index of the
string I want to retrieve is <code class="language-plaintext highlighter-rouge">rcx</code>, and I want to put the string’s
address back into <code class="language-plaintext highlighter-rouge">rax</code>. It’s one load instruction.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Contrast this with <code class="language-plaintext highlighter-rouge">colors_2d</code>: six 7-byte elements in a row. No
pointers or addresses. Only strings.</p>

<p><img src="/img/colortab/arraytab.png" alt="" /></p>

<p>The strings are in their defined order, packed together. The type of
the expression <code class="language-plaintext highlighter-rouge">colors_2d[n]</code> is <code class="language-plaintext highlighter-rouge">char [7]</code>, an array rather than a
pointer. If this were a large table used by a hot function, it would
have friendlier cache characteristics — both in locality and
predictability.</p>

<p>In the same scenario as before with x86-64, it takes two instructions to
put the string’s address in <code class="language-plaintext highlighter-rouge">rax</code>, but neither is a load.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">imul</span>  <span class="nb">rcx</span><span class="p">,</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">7</span>
<span class="nf">add</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rcx</span>
</code></pre></div></div>

<p>In this particular case, the generated code can be slightly improved
by increasing the string size to 8 (e.g. <code class="language-plaintext highlighter-rouge">char colors_2d[][8]</code>). The
multiply turns into a simple shift and the ALU no longer needs to be
involved, cutting it to one instruction. This looks like a load due to
the LEA (Load Effective Address), but it’s not.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">lea</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>
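<p>The same address arithmetic can be written out by hand in C. This is only a sketch to mirror the instructions above: the table names <code class="language-plaintext highlighter-rouge">colors_2d7</code> and <code class="language-plaintext highlighter-rouge">colors_2d8</code> and the <code class="language-plaintext highlighter-rouge">row7()</code> and <code class="language-plaintext highlighter-rouge">row8()</code> helpers are hypothetical, and a compiler is free to emit different instructions for them.</p>

```c
#include <assert.h>
#include <stddef.h>

char colors_2d7[][7] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};
char colors_2d8[][8] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

/* 7-byte rows: base + n*7, the multiply-and-add case. */
char *
row7(size_t n)
{
    assert(n < 6);
    return (char *)colors_2d7 + n * 7;
}

/* 8-byte rows: base + (n << 3), a shift that fits in a single
 * scaled address computation. */
char *
row8(size_t n)
{
    assert(n < 6);
    return (char *)colors_2d8 + (n << 3);
}
```

<p>Neither function reads the table’s contents to find a row; both only compute an address. The padded table costs one extra byte per row, 48 bytes in total instead of 42.</p>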

<h3 id="relocation">Relocation</h3>

<p>There’s another factor to consider: relocation. Nearly every process
running on a modern system takes advantage of a security feature
called Address Space Layout Randomization (ASLR). The virtual address
of code and data is randomized at process load time. For shared
libraries, it’s not just a security feature, it’s essential to their
basic operation. Libraries cannot possibly coordinate their preferred
load addresses with every other library on the system, and so must be
relocatable.</p>

<p>If the program is compiled with GCC or Clang configured for position
independent code — <code class="language-plaintext highlighter-rouge">-fPIC</code> (for libraries) or <code class="language-plaintext highlighter-rouge">-fpie</code> + <code class="language-plaintext highlighter-rouge">-pie</code> (for
programs) — extra work has to be done to support <code class="language-plaintext highlighter-rouge">colors_ptr</code>. Those
are all addresses in the pointer array, but the compiler doesn’t know
what those addresses will be. The compiler fills the elements with
temporary values and adds six relocation entries to the binary, one
for each element. The loader will fill out the array at load time.</p>

<p>However, <code class="language-plaintext highlighter-rouge">colors_2d</code> doesn’t have any addresses other than the address
of the table itself. The loader doesn’t need to be involved with each
of its elements. Score another point for the two-dimensional array.</p>

<p>On x86-64, in both cases the table itself typically doesn’t need a
relocation entry because it will be <em>RIP-relative</em> (in the <a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">small code
model</a>). That is, code that uses the table will be at a fixed
offset from the table no matter where the program is loaded. It won’t
need to be looked up using the Global Offset Table (GOT).</p>

<p>In case you’re ever reading compiler output, in Intel syntax the
assembly for putting the table’s RIP-relative address in <code class="language-plaintext highlighter-rouge">rax</code> looks
like so:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; NASM:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">address</span><span class="p">]</span>
<span class="c1">;; Some others:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rip</span> <span class="o">+</span> <span class="nv">address</span><span class="p">]</span>
</code></pre></div></div>

<p>Or in AT&amp;T syntax:</p>

<pre><code class="language-gas">lea    address(%rip), %rax
</code></pre>

<h3 id="virtual-memory">Virtual Memory</h3>

<p>Besides (trivially) more work for the loader, there’s another
consequence to relocations: Pages containing relocations are not
shared between processes (except after fork()). When loading a
program, the loader doesn’t copy programs and libraries to memory so
much as it memory maps their binaries with copy-on-write semantics. If
another process is running with the same binaries loaded (e.g.
libc.so), they’ll share the same physical memory so long as those
pages haven’t been modified by either process. Modifying the page
creates a unique copy for that process.</p>

<p>Relocations modify parts of the loaded binary, so these pages aren’t
shared. This means <code class="language-plaintext highlighter-rouge">colors_2d</code> has the possibility of being shared
between processes, but <code class="language-plaintext highlighter-rouge">colors_ptr</code> (and its entire page) definitely
does not. Shucks.</p>

<p>This is one of the reasons why the Procedure Linkage Table (PLT)
exists. The PLT is an array of function stubs for shared library
functions, such as those in the C standard library. Sure, the loader
<em>could</em> go through the program and fill out the address of every
library function call, but this would modify lots and lots of code
pages, creating a unique copy of large parts of the program. Instead,
the dynamic linker <a href="https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html">lazily supplies jump addresses</a> for PLT
function stubs, one per accessed library function.</p>

<p>However, as I’ve written it above, it’s unlikely that even <code class="language-plaintext highlighter-rouge">colors_2d</code>
will be shared. It’s still missing an important ingredient: const.</p>

<h3 id="const">Const</h3>

<p>They say <a href="/blog/2016/07/25/">const isn’t for optimization</a> but, darnit, this
situation keeps coming up. Since <code class="language-plaintext highlighter-rouge">colors_ptr</code> and <code class="language-plaintext highlighter-rouge">colors_2d</code> are both
global, writable arrays, the compiler puts them in the same writable
data section of the program, and, in my test program, they end up
right next to each other in the same page. The other relocations doom
<code class="language-plaintext highlighter-rouge">colors_2d</code> to being a local copy.</p>

<p>Fortunately it’s trivial to fix by adding a const:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>Writing to this memory is now undefined behavior, so the compiler is
free to put it in read-only memory (<code class="language-plaintext highlighter-rouge">.rodata</code>) and separate from the
dirty relocations. On my system, this is close enough to the code to
wind up in executable memory.</p>

<p>Note, the equivalent for <code class="language-plaintext highlighter-rouge">colors_ptr</code> requires two const qualifiers,
one for the array and another for the strings. (Obviously the const
doesn’t apply to the loader.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="k">const</span> <span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>String literals are already effectively const, though the C
specification (unlike C++) doesn’t actually define them to be this
way. But, like setting your relationship status on Facebook, declaring
it makes it official.</p>

<h3 id="its-just-micro-optimization">It’s just micro-optimization</h3>

<p>These little details are all deep down the path of micro-optimization
and will rarely ever matter in practice, but perhaps you learned
something broader from all this. This stuff fascinates me.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Linux System Calls, Error Numbers, and In-Band Signaling</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/23/"/>
    <id>urn:uuid:ee8b3af5-ce09-3f9a-ef9c-0d95807bf95e</id>
    <updated>2016-09-23T01:07:40Z</updated>
    <category term="linux"/><category term="x86"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Today I got an e-mail asking about a previous article on <a href="/blog/2015/05/15/">creating
threads on Linux using raw system calls</a> (specifically x86-64).
The questioner was looking to use threads in a program without any
libc dependency. However, he was concerned about checking for mmap(2)
errors when allocating the thread’s stack. The <a href="http://man7.org/linux/man-pages/man2/mmap.2.html">mmap(2) man
page</a> says it returns -1 (a.k.a. <code class="language-plaintext highlighter-rouge">MAP_FAILED</code>) on error and sets
errno. But how do you check errno without libc?</p>

<p>As a reminder here’s what the (unoptimized) assembly looks like.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">stack_create:</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">0</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">STACK_SIZE</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nv">PROT_WRITE</span> <span class="o">|</span> <span class="nv">PROT_READ</span>
    <span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span> <span class="nv">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="nv">MAP_PRIVATE</span> <span class="o">|</span> <span class="nv">MAP_GROWSDOWN</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_mmap</span>
    <span class="nf">syscall</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>As usual, the system call return value is in <code class="language-plaintext highlighter-rouge">rax</code>, which becomes the
return value for <code class="language-plaintext highlighter-rouge">stack_create()</code>. Again, its C prototype would look
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>If you were to, say, intentionally botch the arguments to force an
error, you might notice that the system call isn’t returning -1, but
other negative values. What gives?</p>

<p>The trick is that <strong>errno is a C concept</strong>. That’s why it’s documented
as <a href="http://man7.org/linux/man-pages/man3/errno.3.html">errno(3)</a> — the 3 means it belongs to C. Just think about
how messy this thing is: it’s a thread-local value living in the
application’s address space. The kernel rightfully has nothing to do
with it. Instead, the mmap(2) wrapper in libc assigns errno (if
needed) after the system call returns. This is how <a href="http://man7.org/linux/man-pages/man2/intro.2.html"><em>all</em> system calls
through libc work</a>, even with the <a href="http://man7.org/linux/man-pages/man2/syscall.2.html">syscall(2)
wrapper</a>.</p>

<p>So how does the kernel report the error? It’s an old-fashioned return
value. If you have any doubts, take it straight from the horse’s
mouth: <a href="http://lxr.free-electrons.com/source/mm/mmap.c?v=4.6#L1143">mm/mmap.c:do_mmap()</a>. Here’s a sample of return
statements.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">len</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

<span class="cm">/* Careful about overflows.. */</span>
<span class="n">len</span> <span class="o">=</span> <span class="n">PAGE_ALIGN</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">len</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>

<span class="cm">/* offset overflow? */</span>
<span class="k">if</span> <span class="p">((</span><span class="n">pgoff</span> <span class="o">+</span> <span class="p">(</span><span class="n">len</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">))</span> <span class="o">&lt;</span> <span class="n">pgoff</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EOVERFLOW</span><span class="p">;</span>

<span class="cm">/* Too many mappings? */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mm</span><span class="o">-&gt;</span><span class="n">map_count</span> <span class="o">&gt;</span> <span class="n">sysctl_max_map_count</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
</code></pre></div></div>

<p>It’s returning the negated error number. Simple enough.</p>

<p>If you think about it a moment, you might notice a complication: This
is a form of in-band signaling. On success, mmap(2) returns a memory
address. All those negative error numbers are potentially addresses
that a caller might want to map. How can we tell the difference?</p>

<p>1) None of the possible error numbers align on a page boundary, so
   they’re not actually valid return values. NULL <em>does</em> lie on a page
   boundary, which is one reason why it’s not used as an error return
   value for mmap(2). The other is that you might actually want to map
   NULL, for better <a href="https://blogs.oracle.com/ksplice/entry/much_ado_about_null_exploiting1">or worse</a>.</p>

<p>2) Those low negative values lie in a region of virtual memory
   reserved exclusively for the kernel (sometimes called “<a href="https://linux-mm.org/HighMemory">low
   memory</a>”). On x86-64, any address with the most significant
   bit set (i.e. the sign bit of a signed integer) is one of these
   addresses. Processes aren’t allowed to map these addresses, and so
   mmap(2) will never return such a value on success.</p>

<p>So what’s a clean, safe way to go about checking for error values?
It’s a lot easier to read <a href="https://www.musl-libc.org/">musl</a> than glibc, so let’s take a
peek at how musl does it in its own mmap: <a href="https://git.musl-libc.org/cgit/musl/tree/src/mman/mmap.c?h=v1.1.15">src/mman/mmap.c</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">off</span> <span class="o">&amp;</span> <span class="n">OFF_MASK</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">errno</span> <span class="o">=</span> <span class="n">EINVAL</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">MAP_FAILED</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">&gt;=</span> <span class="n">PTRDIFF_MAX</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">errno</span> <span class="o">=</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">MAP_FAILED</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">MAP_FIXED</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">__vm_wait</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">syscall</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="n">start</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="n">off</span><span class="p">);</span>
</code></pre></div></div>

<p>Hmm, it looks like it’s returning the result directly. What happened
to setting errno? Well, syscall() is actually a macro that runs the
result through __syscall_ret().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define syscall(...) __syscall_ret(__syscall(__VA_ARGS__))
</span></code></pre></div></div>

<p>Looking a little deeper: <a href="https://git.musl-libc.org/cgit/musl/tree/src/internal/syscall_ret.c?h=v1.1.15">src/internal/syscall_ret.c</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">__syscall_ret</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">r</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">errno</span> <span class="o">=</span> <span class="o">-</span><span class="n">r</span><span class="p">;</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Bingo. As documented, if the value falls within that “high” (unsigned)
range of negative values for <em>any</em> system call, it’s an error number.</p>
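
<p>Without libc, the same range check is easy to carry along yourself. The
following is a small sketch modeled on the musl function above. The names
<code>syscall_failed</code> and <code>syscall_errno</code> are hypothetical,
and it assumes a raw return value held in a <code>long</code>.</p>

```c
#include <errno.h>

/* Hypothetical helper: true if a raw system call return value is a
 * negated error number, using the same test as musl's __syscall_ret(). */
static int syscall_failed(long r)
{
    /* Only the values -4095 through -1 are treated as errors. */
    return (unsigned long)r > -4096UL;
}

/* Hypothetical helper: recover the positive error number, or 0 if the
 * value was not an error. */
static int syscall_errno(long r)
{
    return syscall_failed(r) ? (int)-r : 0;
}
```

<p>For a function like <code>stack_create()</code> that returns a pointer,
cast the result to <code>long</code> before passing it to the check.</p>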

<p>Getting back to the original question, we could employ this same check
in the assembly code. However, since this is an anonymous memory map
with a kernel-selected address, <strong>there’s only one possible error:
ENOMEM</strong> (12). This error happens if the maximum number of memory maps
has been reached, or if there’s no contiguous region available for the
4MB stack. The check will only need to test the result against -12.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Inspecting C's qsort Through Animation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/05/"/>
    <id>urn:uuid:7d86c669-ff40-3210-7e28-78b801e35e50</id>
    <updated>2016-09-05T21:17:11Z</updated>
    <category term="c"/><category term="linux"/><category term="media"/><category term="video"/>
    <content type="html">
      <![CDATA[<p>The C standard library includes a qsort() function for sorting
arbitrary buffers given a comparator function. The name comes from its
<a href="https://gallium.inria.fr/~maranget/X/421/09/bentley93engineering.pdf">original Unix implementation, “quicker sort,”</a> a variation of
the well-known quicksort algorithm. The C standard doesn’t specify an
algorithm, except to say that it may be unstable (C99 §7.20.5.2¶4) —
equal elements have an unspecified order. As such, different C
libraries use different algorithms, and even when using the same
algorithm they make different implementation trade-offs.</p>

<p>I added a drawing routine to a comparison function to see what the
sort function was doing for different C libraries. Every time it’s
called for a comparison, it writes out a snapshot of the array as a
Netpbm PPM image. It’s <a href="/blog/2011/11/28/">easy to turn concatenated PPMs into a GIF or
video</a>. Here’s my code if you want to try it yourself:</p>

<ul>
  <li><a href="/download/qsort-animate.c">qsort-animate.c</a></li>
</ul>
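
<p>For just the gist of the technique, here is a rough sketch. It is not
the contents of qsort-animate.c: the names (<code>array</code>,
<code>frames</code>, <code>nframes</code>, <code>draw</code>,
<code>cmp_draw</code>), the 64-element size, and the one-pixel-per-element
rendering are all assumptions made for illustration.</p>

```c
#include <stdio.h>
#include <stdlib.h>

#define N 64              /* array length, also image width and height */

static int array[N];      /* the array being sorted, values 0..N-1 */
static FILE *frames;      /* where frames are written; NULL disables drawing */
static long nframes;      /* number of comparisons seen so far */

/* Write the current array as one binary PPM (P6) frame: column x gets a
 * white pixel at height array[x], everything else is black. */
static void draw(FILE *out)
{
    fprintf(out, "P6\n%d %d\n255\n", N, N);
    for (int y = N - 1; y >= 0; y--) {
        for (int x = 0; x < N; x++) {
            int v = array[x] == y ? 255 : 0;
            fputc(v, out);  /* red */
            fputc(v, out);  /* green */
            fputc(v, out);  /* blue */
        }
    }
}

/* Comparator that snapshots the whole array on every comparison. */
static int cmp_draw(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    nframes++;
    if (frames)
        draw(frames);
    return (x > y) - (x < y);
}
```

<p>To use it, fill <code>array</code> with a permutation of 0 through N-1,
set <code>frames = stdout</code>, and call
<code>qsort(array, N, sizeof(array[0]), cmp_draw)</code>. The comparator
draws the global array rather than its arguments, so it only shows what the
library has written back so far.</p>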

<p>Adjust the parameters at the top to taste. Rather than call rand() in
the standard library, I included xorshift64star() with a hard-coded
seed so that the array will be shuffled exactly the same across all
platforms. This makes for a better comparison.</p>
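
<p>A deterministic shuffle along those lines might look like the sketch
below. The generator is the commonly published xorshift64* variant, but the
seed value and the <code>shuffle</code> helper are assumptions and may not
match what qsort-animate.c actually does.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift64*: the state must be nonzero. */
static uint64_t xorshift64star(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * UINT64_C(0x2545F4914F6CDD1D);
}

/* Fisher-Yates shuffle driven by a fixed seed, so every platform
 * produces the same permutation. The seed value here is arbitrary. */
static void shuffle(int *a, size_t n)
{
    uint64_t state = UINT64_C(0x9E3779B97F4A7C15);
    for (size_t i = n; i > 1; i--) {
        size_t j = xorshift64star(&state) % i;  /* slight modulo bias */
        int tmp = a[i - 1];
        a[i - 1] = a[j];
        a[j] = tmp;
    }
}
```

<p>Since nothing depends on the C library’s rand(), two runs on two
different platforms start from the same shuffled array.</p>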

<p>To get an optimized GIF on unix-like systems, run it like so.
(Microsoft’s <a href="https://web.archive.org/web/20161126142829/http://radiance-online.org:82/pipermail/radiance-dev/2016-March/001578.html">UCRT currently has serious bugs</a> with pipes, so it
was run differently in that case.)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./a.out | convert -delay 10 ppm:- gif:- | gifsicle -O3 &gt; sort.gif
</code></pre></div></div>

<p>The number of animation frames reflects the efficiency of the sort,
but this isn’t really a benchmark. The input array is fully shuffled,
and real data often is not. For a benchmark, have a look at <a href="http://calmerthanyouare.org/2013/05/31/qsort-shootout.html">a libc
qsort() shootout of sorts</a> instead.</p>

<p>To help you follow along, <strong>clicking on any animation will restart it.</strong></p>

<h3 id="glibc">glibc</h3>

<p><img src="/img/qsort/glibc.gif" alt="" class="resetable" title="glibc" /></p>

<p>Sorted in <strong>307 frames</strong>. glibc prefers to use mergesort, which,
unlike quicksort, isn’t an in-place algorithm, so it has to allocate
memory. That allocation could fail for huge arrays, and, since qsort()
can’t fail, it uses quicksort as a backup. You can really see the
mergesort in action: changes are made that we cannot see until later,
when it’s copied back into the original array.</p>

<h3 id="dietlibc-032">dietlibc (0.32)</h3>

<p>Sorted in <strong>503 frames</strong>. <a href="https://www.fefe.de/dietlibc/">dietlibc</a> is an alternative C
standard library for Linux. It’s optimized for size, which shows
through its slower performance. It looks like a quicksort that always
chooses the last element as the pivot.</p>

<p><img src="/img/qsort/diet.gif" alt="" class="resetable" title="diet" /></p>

<p>Update: Felix von Leitner, the primary author of dietlibc, has alerted
me that, as of version 0.33, it now chooses a random pivot. This
comment from the source describes it:</p>

<blockquote>
  <p>We chose the rightmost element in the array to be sorted as pivot,
which is OK if the data is random, but which is horrible if the data
is already sorted. Try to improve by exchanging it with a random
other pivot.</p>
</blockquote>
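
<p>As a rough illustration of the two strategies, here is a small quicksort
sketch. It is not dietlibc’s source: it only sorts <code>int</code> arrays,
uses a Lomuto partition, and takes rand() as a stand-in for whatever random
source the library uses. The <code>randomize</code> parameter is
hypothetical.</p>

```c
#include <stddef.h>
#include <stdlib.h>

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Quicksort that uses the rightmost element as the pivot. When randomize
 * is nonzero, the rightmost element is first exchanged with a randomly
 * chosen one, which guards against already-sorted input. */
static void quicksort(int *a, size_t n, int randomize)
{
    while (n > 1) {
        if (randomize)
            swap_int(&a[n - 1], &a[(size_t)rand() % n]);
        int pivot = a[n - 1];
        size_t store = 0;
        for (size_t i = 0; i + 1 < n; i++)
            if (a[i] < pivot)
                swap_int(&a[i], &a[store++]);
        swap_int(&a[store], &a[n - 1]);  /* pivot lands in its final slot */

        quicksort(a, store, randomize);  /* recurse into the left part */
        a += store + 1;                  /* loop on the right part */
        n -= store + 1;
    }
}
```

<p>With <code>randomize</code> set to 0, already-sorted input puts every
remaining element on one side of each partition, which is the quadratic
case the comment warns about.</p>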

<h3 id="musl">musl</h3>

<p>Sorted in <strong>637 frames</strong>. <a href="https://www.musl-libc.org/">musl libc</a> is another alternative C
standard library for Linux. It’s my personal preference when I
statically link Linux binaries. Its qsort() looks a lot like a heapsort,
and with some research I see it’s actually <a href="http://www.keithschwarz.com/smoothsort/">smoothsort</a>, a
heapsort variant.</p>

<p><img src="/img/qsort/musl.gif" alt="" class="resetable" title="musl" /></p>

<h3 id="bsd">BSD</h3>

<p>Sorted in <strong>354 frames</strong>. I ran it on both OpenBSD and FreeBSD with
identical results, so, unsurprisingly, they share an implementation.
It’s quicksort, and what’s neat about it is at the beginning you can
see it searching for a median for use as the pivot. This helps avoid
the O(n^2) worst case.</p>

<p><img src="/img/qsort/bsd-qsort.gif" alt="" class="resetable" title="BSD qsort" /></p>

<p>BSD also includes a mergesort() with the same prototype, except with
an <code class="language-plaintext highlighter-rouge">int</code> return for reporting failures. This one sorted in <strong>247
frames</strong>. Like glibc before, there’s some behind-the-scenes that isn’t
captured. But even more, notice how the markers disappear during the
merge? It’s running the comparator against copies, stored outside the
original array. Sneaky!</p>

<p><img src="/img/qsort/bsd-mergesort.gif" alt="" class="resetable" title="BSD mergesort" /></p>

<p>Again, BSD also includes heapsort(), so I ran that too. It sorted in
<strong>418 frames</strong>. It definitely looks like a heapsort, and the worse
performance is similar to musl. It seems heapsort is a poor fit for
this data.</p>

<p><img src="/img/qsort/bsd-heapsort.gif" alt="" class="resetable" title="BSD heapsort" /></p>

<h3 id="cygwin">Cygwin</h3>

<p>It turns out Cygwin borrowed its qsort() from BSD. It’s pixel
identical to the above. I hadn’t noticed until I looked at the frame
counts.</p>

<p><img src="/img/qsort/cygwin.gif" alt="" class="resetable" title="Cygwin (BSD)" /></p>

<h3 id="msvcrtdll-mingw-and-ucrt-visual-studio">MSVCRT.DLL (MinGW) and UCRT (Visual Studio)</h3>

<p>MinGW builds against MSVCRT.DLL, found on every Windows system despite
its <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">unofficial status</a>. Until recently Microsoft didn’t
include a C standard library as part of the OS, but that changed with
their <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/vcblog/2015/03/03/introducing-the-universal-crt/">Universal CRT (UCRT) announcement</a>. I thought I’d try
them both.</p>

<p>Turns out they borrowed their old qsort() for the UCRT, and the result
is the same: sorted in <strong>417 frames</strong>. It chooses a pivot from the
median of the ends and the middle, swaps the pivot to the middle, then
partitions. Looking to the middle for the pivot makes sorting
pre-sorted arrays much more efficient.</p>

<p><img src="/img/qsort/ucrt.gif" alt="" class="resetable" title="Microsoft UCRT" /></p>

<h3 id="pelles-c">Pelles C</h3>

<p>Finally I ran it against <a href="http://www.smorgasbordet.com/pellesc/">Pelles C</a>, a C compiler for
Windows. It sorted in <strong>463 frames</strong>. I can’t find any information
about it, but it looks like some sort of hybrid between quicksort and
insertion sort. Like BSD qsort(), it finds a good median for the
pivot, partitions the elements, and if a partition is small enough, it
switches to insertion sort. This should behave well on mostly-sorted
arrays, but poorly on well-shuffled arrays (like this one).</p>

<p><img src="/img/qsort/pellesc.gif" alt="" class="resetable" title="Pelles C" /></p>

<h3 id="more-implementations">More Implementations</h3>

<p>That’s everything that was readily accessible to me. If you can run it
against something new, I’m certainly interested in seeing more
implementations.</p>

<script type="text/javascript">
(function() {
    var r = document.querySelectorAll('.resetable');
    for (var i = 0; i < r.length; i++) {
        r[i].onclick = function() {
            var src = this.src;
            var height = this.height;
            this.src = "";
            this.height = height;
            // setTimeout() required for IE
            var _this = this;
            setTimeout(function() { _this.src = src; }, 0);
        };
    }
}());
</script>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to Read and Write Other Process Memory</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/03/"/>
    <id>urn:uuid:205f20eb-a47e-3506-fd8f-4b416fc08133</id>
    <updated>2016-09-03T21:53:26Z</updated>
    <category term="win32"/><category term="linux"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>I recently put together a little game memory cheat tool called
<a href="https://github.com/skeeto/memdig">MemDig</a>. It can find the address of a particular game value
(score, lives, gold, etc.) after being given that value at different
points in time. With the address, it can then modify that value to
whatever is desired.</p>

<p>I’ve been using tools like this going back 20 years, but I never tried
to write one myself until now. There are many memory cheat tools to
pick from these days, the most prominent being <a href="http://www.cheatengine.org/">Cheat Engine</a>.
These tools use the platform’s debugging API, so of course any good
debugger could do the same thing, though a debugger won’t be
specialized appropriately (e.g. locating the particular address and
locking its value).</p>

<p>My motivation was bypassing an in-app purchase in a single player
Windows game. I wanted to convince the game I had made the purchase
when, in fact, I hadn’t. Once I had it working successfully, I ported
MemDig to Linux since I thought it would be interesting to compare.
I’ll start with Windows for this article.</p>

<h3 id="windows">Windows</h3>

<p>Only three Win32 functions are needed, and you could almost guess at
how it works.</p>

<ul>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms684320">OpenProcess()</a></li>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms680553">ReadProcessMemory()</a></li>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms681674">WriteProcessMemory()</a></li>
</ul>

<p>It’s very straightforward <s>and, for this purpose, is probably the
simplest API for any platform</s> (see update).</p>

<p>As you probably guessed, you first need to open the process, given its
process ID (integer). You’ll need to select the <em>desired access</em> as a
bit set. To read memory, you need the <code class="language-plaintext highlighter-rouge">PROCESS_VM_READ</code> and
<code class="language-plaintext highlighter-rouge">PROCESS_QUERY_INFORMATION</code> rights. To write memory, you need the
<code class="language-plaintext highlighter-rouge">PROCESS_VM_WRITE</code> and <code class="language-plaintext highlighter-rouge">PROCESS_VM_OPERATION</code> rights. Alternatively
you could just ask for all rights with <code class="language-plaintext highlighter-rouge">PROCESS_ALL_ACCESS</code>, but I
prefer to be precise.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">PROCESS_VM_READ</span> <span class="o">|</span>
               <span class="n">PROCESS_QUERY_INFORMATION</span> <span class="o">|</span>
               <span class="n">PROCESS_VM_WRITE</span> <span class="o">|</span>
               <span class="n">PROCESS_VM_OPERATION</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">proc</span> <span class="o">=</span> <span class="n">OpenProcess</span><span class="p">(</span><span class="n">access</span><span class="p">,</span> <span class="n">FALSE</span><span class="p">,</span> <span class="n">pid</span><span class="p">);</span>
</code></pre></div></div>

<p>And then to read or write:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">;</span> <span class="c1">// target process address</span>
<span class="n">SIZE_T</span> <span class="n">written</span><span class="p">;</span>
<span class="n">ReadProcessMemory</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">);</span>
<span class="c1">// or</span>
<span class="n">WriteProcessMemory</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">);</span>
</code></pre></div></div>

<p>Don’t forget to check the return value and verify <code class="language-plaintext highlighter-rouge">written</code>. Finally,
don’t forget to <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms724211">close it</a> when you’re done.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CloseHandle</span><span class="p">(</span><span class="n">proc</span><span class="p">);</span>
</code></pre></div></div>

<p>That’s all there is to it. For the full cheat tool you’d need to find
the mapped regions of memory, via <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa366907">VirtualQueryEx</a>. It’s not
as simple, but I’ll leave that for another article.</p>

<h3 id="linux">Linux</h3>

<p>Unfortunately there’s no standard, cross-platform debugging API for
unix-like systems. Most have a ptrace() system call, though each works
a little differently. Note that ptrace() is not part of POSIX, but
appeared in System V Release 4 (SVr4) and BSD, and was then copied elsewhere.
The following will all be specific to Linux, though the procedure is
similar on other unix-likes.</p>

<p>In typical Linux fashion, if it involves other processes, you use the
standard file API on the /proc filesystem. Each process has a
directory under /proc named as its process ID. In this directory is a
virtual file called “mem”, which is a file view of that process’
entire address space, including unmapped regions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">file</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s">"/proc/%ld/mem"</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">pid</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">O_RDWR</span><span class="p">);</span>
</code></pre></div></div>

<p>The catch is that while you can open this file, you can’t actually
read or write on that file without attaching to the process as a
debugger. You’ll just get EIO errors. To attach, use ptrace() with
<code class="language-plaintext highlighter-rouge">PTRACE_ATTACH</code>. This asynchronously delivers a <code class="language-plaintext highlighter-rouge">SIGSTOP</code> signal to
the target, which has to be waited on with waitpid().</p>

<p>You could select the target address with <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/lseek.html">lseek()</a>, but it’s
cleaner and more efficient just to do it all in one system call with
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html">pread()</a> and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html">pwrite()</a>. I’ve left out the error
checking, but the return value of each function should be checked:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_ATTACH</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="kt">off_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">...;</span> <span class="c1">// target process address</span>
<span class="n">pread</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">addr</span><span class="p">);</span>
<span class="c1">// or</span>
<span class="n">pwrite</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">addr</span><span class="p">);</span>

<span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_DETACH</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>The process will (and must) be stopped during this procedure, so do
your reads/writes quickly and get out. The kernel will deliver the
writes to the other process’ virtual memory.</p>

<p>Like before, don’t forget to close.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>To find the mapped regions in the real cheat tool, you would read and
parse the virtual text file /proc/<em>pid</em>/maps. I don’t know if I’d call
this stringly-typed method elegant — the kernel converts the data into
string form and the caller immediately converts it right back — but
that’s the official API.</p>
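<p>As a rough sketch of that parsing step, the leading fields of each line can be pulled apart with sscanf(3). The names <code class="language-plaintext highlighter-rouge">struct region</code>, <code class="language-plaintext highlighter-rouge">parse_maps_line</code>, and <code class="language-plaintext highlighter-rouge">count_writable</code> are hypothetical, and it assumes <code class="language-plaintext highlighter-rouge">unsigned long</code> is wide enough to hold an address, as it is on Linux:</p>

```c
#include <assert.h>
#include <stdio.h>

/* One mapped region from /proc/PID/maps (hypothetical struct). */
struct region {
    unsigned long start;  /* first address in the region */
    unsigned long end;    /* one past the last address */
    char perms[5];        /* e.g. "rw-p" */
};

/* Parse the leading "start-end perms" fields of one maps line.
 * Returns 1 on success, 0 if the line doesn't have that shape. */
int parse_maps_line(const char *line, struct region *r)
{
    assert(line != NULL && r != NULL);
    return sscanf(line, "%lx-%lx %4s", &r->start, &r->end, r->perms) == 3;
}

/* Count the writable regions listed in an already-open maps file. */
int count_writable(FILE *maps)
{
    char line[4096];
    struct region r;
    int count = 0;
    while (fgets(line, sizeof(line), maps))
        if (parse_maps_line(line, &r) && r.perms[1] == 'w')
            count++;
    return count;
}
```

<p>A cheat tool would likely keep only the writable regions, since those are the ones that can hold a changing value.</p>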

<p>Update: Konstantin Khlebnikov has pointed out the
<a href="http://man7.org/linux/man-pages/man2/process_vm_readv.2.html">process_vm_readv()</a> and <a href="http://man7.org/linux/man-pages/man2/process_vm_writev.2.html">process_vm_writev()</a>
system calls, available since Linux 3.2 (January 2012) and glibc 2.15
(March 2012). These system calls do not require ptrace(), nor does the
remote process need to be stopped. They’re equivalent to
ReadProcessMemory() and WriteProcessMemory(), except there’s no
requirement to first “open” the process.</p>
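<p>A minimal sketch of a read with process_vm_readv(), assuming glibc 2.15 or later. The wrapper names <code class="language-plaintext highlighter-rouge">read_remote</code> and <code class="language-plaintext highlighter-rouge">peek_self</code> are hypothetical, and the second one only targets the calling process to keep the example self-contained:</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Copy len bytes from remote_addr in process pid into buf.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t read_remote(pid_t pid, void *remote_addr, void *buf, size_t len)
{
    assert(buf != NULL);
    struct iovec local  = { .iov_base = buf,         .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

/* Demonstration on the calling process itself: fetch an int through
 * the system call rather than by dereferencing the pointer.
 * Returns 0 on success, -1 on failure. */
int peek_self(int *p, int *out)
{
    ssize_t n = read_remote(getpid(), p, out, sizeof(*out));
    return n == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

<p>The call can still fail, for example when the caller lacks permission over the target process, so the return value needs checking just like the ptrace() version.</p>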

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Automatic Deletion of Incomplete Output Files</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/08/07/"/>
    <id>urn:uuid:431fafe9-6630-363e-4596-85eb3a289ec2</id>
    <updated>2016-08-07T02:00:37Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Conventionally, a program that creates an output file will delete its
incomplete output should an error occur while writing the file. It’s
risky to leave behind a file that the user may rightfully confuse for
a valid file. They might not have noticed the error.</p>

<p>For example, compression programs such as gzip, bzip2, and xz when
given a compressed file as an argument will create a new file with the
compression extension removed. They write to this file as the
compressed input is being processed. If the compressed stream contains
an error in the middle, the partially-completed output is removed.</p>

<p>There are exceptions of course, such as programs that download files
over a network. The partial result has value, especially if the
transfer can be <a href="https://tools.ietf.org/html/rfc7233">continued from where it left off</a>. The
convention is to append another extension, such as “.part”, to
indicate a partial output.</p>

<p>The straightforward solution is to always delete the file as part of
error handling. A non-interactive program would report the error on
standard error, delete the file, and exit with an error code. However,
there are at least two situations where error handling would be unable
to operate: unhandled signals (usually including a segmentation fault)
and power failures. A partial or corrupted output file will be left
behind, possibly looking like a valid file.</p>

<p>A common, more complex approach is to name the file differently from
its final name while being written. If written successfully, the
completed file is renamed into place. This is already <a href="http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/">required for
durable replacement</a>, so it’s basically free for many
applications. In the worst case, where the program is unable to clean
up, the obviously incomplete file is left behind only wasting space.</p>
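<p>For reference, here’s a minimal POSIX sketch of that write-then-rename approach. The function name <code class="language-plaintext highlighter-rouge">write_then_rename</code> and the “.tmp” suffix are assumptions made for illustration, and a fully durable version would also sync the containing directory:</p>

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write len bytes to "PATH.tmp", then rename it to PATH.
 * Returns 0 on success, -1 on failure (the temporary is removed). */
int write_then_rename(const char *path, const void *data, size_t len)
{
    assert(path != NULL);
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;  /* name too long for this sketch */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd == -1)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w <= 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += w;
        len -= (size_t)w;
    }

    int ok = fsync(fd) == 0;    /* flush data before the rename */
    ok = close(fd) == 0 && ok;  /* always close, even if fsync failed */
    if (!ok || rename(tmp, path) == -1) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

<p>If the program dies partway through, only the “.tmp” file is left behind, and the final name never refers to partial data.</p>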

<p>Looking to be more robust, I had the following misguided idea: <strong>Rely
completely on the operating system to perform cleanup in the case of a
failure.</strong> Initially the file would be configured to be automatically
deleted when the final handle is closed. This takes care of all
abnormal exits, and possibly even power failures. The program can just
exit on error without deleting the file. Once written successfully,
the automatic-delete indicator is cleared so that the file survives.</p>

<p>The target application for this technique supports both Linux and
Windows, so I would need to figure it out for both systems. On
Windows, there’s the flag <code class="language-plaintext highlighter-rouge">FILE_FLAG_DELETE_ON_CLOSE</code>. I’d just need
to find a way to clear it. On POSIX, the file would be unlinked while
being written, and linked into the filesystem on success. The latter
turns out to be a lot harder than I expected.</p>

<h3 id="solution-for-windows">Solution for Windows</h3>

<p>I’ll start with Windows since the technique actually works fairly well
here — ignoring the usual, dumb Win32 filesystem caveats. This is a
little surprising, since it’s usually Win32 that makes these things
far more difficult than they should be.</p>

<p>The primary Win32 function for opening and creating files is
<a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx">CreateFile</a>. There are many options, but the key is
<code class="language-plaintext highlighter-rouge">FILE_FLAG_DELETE_ON_CLOSE</code>. Here’s how an application might typically
open a file for output.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">GENERIC_WRITE</span><span class="p">;</span>
<span class="n">DWORD</span> <span class="n">create</span> <span class="o">=</span> <span class="n">CREATE_ALWAYS</span><span class="p">;</span>
<span class="n">DWORD</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">FILE_FLAG_DELETE_ON_CLOSE</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">f</span> <span class="o">=</span> <span class="n">CreateFile</span><span class="p">(</span><span class="s">"out.tmp"</span><span class="p">,</span> <span class="n">access</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">create</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>This special flag asks Windows to delete the file as soon as the last
handle to the <em>file object</em> is closed. Notice I said file object, not
file, since <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20160108-00/?p=92821">these are different things</a>. The catch: This flag
is a property of the file object, not the file, and cannot be removed.</p>

<p>However, the solution is simple. Create a new link to the file so that
it survives deletion. This even works for files residing on network
shares.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CreateHardLink</span><span class="p">(</span><span class="s">"out"</span><span class="p">,</span> <span class="s">"out.tmp"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">CloseHandle</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>  <span class="c1">// deletes out.tmp file</span>
</code></pre></div></div>

<p>The gotcha is that the underlying filesystem must be NTFS. FAT32
doesn’t support hard links. Unfortunately, since FAT32 remains the
least common denominator and is still widely used for removable media,
depending on the application, your users may expect support for saving
files to FAT32. A workaround is probably required.</p>

<h3 id="solution-for-linux">Solution for Linux</h3>

<p>This is where things really fall apart. It’s just <em>barely</em> possible on
Linux, it’s messy, and it’s not portable anywhere else. There’s no way
to do this for POSIX in general.</p>

<p>My initial thought was to create a file then unlink it. Unlike the
situation on Windows, files can be unlinked while they’re currently
open by a process. These files are finally deleted when the last file
descriptor (the last reference) is closed. Unfortunately, using
unlink(2) to remove the last link to a file prevents that file from
being linked again.</p>

<p>Instead, the solution is to use the relatively new (since Linux 3.11),
Linux-specific <code class="language-plaintext highlighter-rouge">O_TMPFILE</code> flag when creating the file. Instead of a
filename, this variation of open(2) takes a directory and creates an
unnamed, temporary file in it. These files are special in that they’re
permitted to be given a name in the filesystem at some future point.</p>

<p>For this example, I’ll assume the output is relative to the current
working directory. If it’s not, you’ll need to open an additional file
descriptor for the parent directory, and also use openat(2) to avoid
possible race conditions (since paths can change from under you). The
number of ways this can fail is already rapidly multiplying.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="s">"."</span><span class="p">,</span> <span class="n">O_TMPFILE</span><span class="o">|</span><span class="n">O_WRONLY</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
</code></pre></div></div>

<p>The catch is that only a handful of filesystems support <code class="language-plaintext highlighter-rouge">O_TMPFILE</code>.
It’s like the FAT32 problem above, but worse. You could easily end up
in a situation where it’s not supported, so you will almost certainly
need a workaround.</p>

<p>Linking a file from a file descriptor is where things get messier. The
file descriptor must be linked with linkat(2) from its name on the
/proc virtual filesystem, constructed as a string. The following
snippet comes straight from the Linux open(2) manpage.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"/proc/self/fd/%d"</span><span class="p">,</span> <span class="n">fd</span><span class="p">);</span>
<span class="n">linkat</span><span class="p">(</span><span class="n">AT_FDCWD</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">AT_FDCWD</span><span class="p">,</span> <span class="s">"out"</span><span class="p">,</span> <span class="n">AT_SYMLINK_FOLLOW</span><span class="p">);</span>
</code></pre></div></div>
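<p>Putting the two steps together, a hedged end-to-end sketch might look like the following. The function name <code class="language-plaintext highlighter-rouge">write_then_link</code> is hypothetical, and it assumes the output belongs in the current working directory and that /proc is mounted:</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create an unnamed file in the current directory, write data to it,
 * and only then give it a name. Returns 0 on success, -1 on failure.
 * On failure this function leaves nothing called `name` behind. */
int write_then_link(const char *name, const void *data, size_t len)
{
    assert(name != NULL);
    int fd = open(".", O_TMPFILE | O_WRONLY, 0600);
    if (fd == -1)
        return -1;  /* e.g. EOPNOTSUPP: filesystem lacks O_TMPFILE */

    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);  /* the unnamed file vanishes with its last descriptor */
        return -1;
    }

    /* Name the file through its /proc entry, as in the snippet above. */
    char buf[64];
    snprintf(buf, sizeof(buf), "/proc/self/fd/%d", fd);
    int r = linkat(AT_FDCWD, buf, AT_FDCWD, name, AT_SYMLINK_FOLLOW);
    close(fd);
    return r;
}
```

<p>Every early return is a case where the caller has to fall back on some other strategy.</p>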

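<p>Even on Linux, /proc isn’t always available, such as within a chroot
or a container, so this part can fail as well. In theory there’s a way
to do this with the Linux-specific <code class="language-plaintext highlighter-rouge">AT_EMPTY_PATH</code> and avoid /proc,
but I couldn’t get it to work. The linkat(2) manpage says the caller
needs the <code class="language-plaintext highlighter-rouge">CAP_DAC_READ_SEARCH</code> capability to use this flag, which
would explain the failure for an unprivileged process.</p>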

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Note: this doesn't actually work for me.</span>
<span class="n">linkat</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">AT_FDCWD</span><span class="p">,</span> <span class="s">"out"</span><span class="p">,</span> <span class="n">AT_EMPTY_PATH</span><span class="p">);</span>
</code></pre></div></div>

<p>Given the poor portability (even within Linux), the number of ways
this can go wrong, and that a workaround is definitely needed anyway,
I’d say this technique is worthless. I’m going to stick with the
tried-and-true approach for this one.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Appending to a File from Multiple Processes</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/08/03/"/>
    <id>urn:uuid:93473b6d-3be3-3d0c-d7d5-6ad485c1e9a0</id>
    <updated>2016-08-03T16:17:44Z</updated>
    <category term="c"/><category term="linux"/><category term="posix"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you have multiple processes appending output to the same file
without explicit synchronization. These processes might be working in
parallel on different parts of the same problem, or these might be
threads blocked individually reading different external inputs. There
are two concerns that come into play:</p>

<p>1) <strong>The append must be atomic</strong> such that it doesn’t clobber previous
    appends by other threads and processes. For example, suppose a
    write requires two separate operations: first moving the file
    pointer to the end of the file, then performing the write. There
    would be a race condition should another process or thread
    intervene in between with its own write.</p>

<p>2) <strong>The output will be interleaved.</strong> The primary solution is to
   design the data format as atomic records, where the ordering of
   records is unimportant — like rows in a relational database. This
   could be as simple as a text file with each line as a record. The
   concern is then ensuring records are written atomically.</p>

<p>This article discusses processes, but the same applies to threads when
directly dealing with file descriptors.</p>

<h3 id="appending">Appending</h3>

<p>The first concern is solved by the operating system, with one caveat.
On POSIX systems, opening a file with the <code class="language-plaintext highlighter-rouge">O_APPEND</code> flag will
guarantee that <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html">writes always safely append</a>.</p>

<blockquote>
  <p>If the <code class="language-plaintext highlighter-rouge">O_APPEND</code> flag of the file status flags is set, the file
offset shall be set to the end of the file prior to each write and
no intervening file modification operation shall occur between
changing the file offset and the write operation.</p>
</blockquote>

<p>However, this says nothing about interleaving. <strong>Two processes
successfully appending to the same file will result in all their bytes
in the file in order, but not necessarily contiguously.</strong></p>
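<p>A small sketch of the append guarantee using open(2) directly. The helper names <code class="language-plaintext highlighter-rouge">open_for_append</code> and <code class="language-plaintext highlighter-rouge">append_record</code> are hypothetical, and each record goes out in a single write(2):</p>

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open path for appending, creating it if needed.
 * Returns a file descriptor, or -1 on failure. */
int open_for_append(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
}

/* Append one NUL-terminated record with a single write(2).
 * Returns 0 if the whole record was written, -1 otherwise. */
int append_record(int fd, const char *record)
{
    assert(record != NULL);
    size_t len = strlen(record);
    return write(fd, record, len) == (ssize_t)len ? 0 : -1;
}
```

<p>Two descriptors opened this way on the same file never overwrite each other’s data, even though each has its own file offset. Whether their records stay contiguous is a separate question.</p>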

<p>The caveat is that not all filesystems are POSIX-compatible. Two
famous examples are NFS and the Hadoop Distributed File System (HDFS).
On these networked filesystems, appends are simulated and subject to
race conditions.</p>

<p>On POSIX systems, fopen(3) with the <code class="language-plaintext highlighter-rouge">a</code> flag <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/fopen.html">will use
<code class="language-plaintext highlighter-rouge">O_APPEND</code></a>, so you don’t necessarily need to use open(2). On
Linux this can be verified for any language’s standard library with
strace.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">fopen</span><span class="p">(</span><span class="s">"/dev/null"</span><span class="p">,</span> <span class="s">"a"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And the result of the trace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace -e open ./a.out
open("/dev/null", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
</code></pre></div></div>

<p>For Win32, the equivalent is the <code class="language-plaintext highlighter-rouge">FILE_APPEND_DATA</code> access right, and
similarly <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/gg258116(v=vs.85).aspx">only applies to “local files.”</a></p>

<h3 id="interleaving-and-pipes">Interleaving and Pipes</h3>

<p>The interleaving problem has two layers, and gets more complicated the
more correct you want to be. Let’s start with pipes.</p>

<p>On POSIX, a pipe is unseekable and doesn’t have a file position, so
appends are the only kind of write possible. When writing to a pipe
(or FIFO), writes no larger than the system-defined <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> are
guaranteed to be atomic and non-interleaving.</p>

<blockquote>
  <p>Write requests of <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes or less shall not be interleaved
with data from other processes doing writes on the same pipe. Writes
of greater than <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes may have data interleaved, on
arbitrary boundaries, with writes by other processes, […]</p>
</blockquote>

<p>The minimum value for <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> for POSIX systems is 512 bytes. On
Linux it’s 4kB, and on other systems <a href="http://ar.to/notes/posix">it’s as high as 32kB</a>.
As long as each record is no more than 512 bytes, a simple write(2) will
do. None of this depends on a filesystem since no files are involved.</p>

<p>If <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes isn’t enough, the POSIX writev(2) can be
used to <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/writev.html">atomically write up to <code class="language-plaintext highlighter-rouge">IOV_MAX</code> buffers</a> of
<code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes. The minimum value for <code class="language-plaintext highlighter-rouge">IOV_MAX</code> is 16, but is
typically 1024. This means the maximum safe atomic write size for
pipes — and therefore the largest record size — for a perfectly
portable program is 8kB (16✕512). On Linux it’s 4MB.</p>
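<p>Here’s a hedged sketch of gathering a record from several buffers into one writev(2) call. The function name <code class="language-plaintext highlighter-rouge">write_pair</code> and the key/value record layout are made up for illustration, and the atomicity argument is the one described above rather than anything this code adds:</p>

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Emit "KEY VALUE\n" from separate buffers with one writev(2) call,
 * so the pieces never have to be copied into a single buffer first.
 * Returns 0 if everything was written, -1 otherwise. */
int write_pair(int fd, const char *key, const char *value)
{
    assert(key != NULL && value != NULL);
    struct iovec iov[] = {
        { .iov_base = (void *)key,   .iov_len = strlen(key)   },
        { .iov_base = (void *)" ",   .iov_len = 1             },
        { .iov_base = (void *)value, .iov_len = strlen(value) },
        { .iov_base = (void *)"\n",  .iov_len = 1             },
    };
    int count = sizeof(iov) / sizeof(iov[0]);
    size_t total = 0;
    for (int i = 0; i < count; i++)
        total += iov[i].iov_len;
    return writev(fd, iov, count) == (ssize_t)total ? 0 : -1;
}
```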

<p>That’s all at the system call level. There’s another layer to contend
with: buffered I/O in your language’s standard library. Your program
may pass data in appropriately-sized pieces for atomic writes to the
I/O library, but it may be undoing your hard work, concatenating all
these writes into a buffer, splitting apart your records. For this
part of the article, I’ll focus on single-threaded C programs.</p>

<p>Suppose you’re writing a simple space-separated format with one line
per record.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">baz</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">condition</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%d %d %f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">,</span> <span class="n">baz</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Whether or not this works depends on how <code class="language-plaintext highlighter-rouge">stdout</code> is buffered. C
standard library streams (<code class="language-plaintext highlighter-rouge">FILE *</code>) have three buffering modes:
unbuffered, line buffered, and fully buffered. Buffering is configured
through setbuf(3) and setvbuf(3), and the initial buffering state of a
stream depends on various factors. For buffered streams, the default
buffer is at least <code class="language-plaintext highlighter-rouge">BUFSIZ</code> bytes, itself at least 256 (C99
§7.19.2¶7). Note: threads share this buffer.</p>

<p>Since each record in the above program easily fits inside 256 bytes,
if stdout is a line buffered pipe then this program will interleave
correctly on any POSIX system without further changes.</p>

<p>If instead your output is comma-separated values (CSV) and <a href="https://tools.ietf.org/html/rfc4180">your
records may contain newline characters</a>, there are two
approaches. In each, the record must still be no larger than
<code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes.</p>

<ul>
  <li>
    <p>Unbuffered pipe: construct the record in a buffer (e.g. with sprintf(3))
and output the entire buffer in a single fwrite(3). While I believe
this will always work in practice, it’s not guaranteed by the C
specification, which defines fwrite(3) as a series of fputc(3) calls
(C99 §7.19.8.2¶2).</p>
  </li>
  <li>
    <p>Fully buffered pipe: set a sufficiently large stream buffer and
follow each record with a fflush(3). Unlike fwrite(3) on an
unbuffered stream, the specification says the buffer will be
“transmitted to the host environment as a block” (C99 §7.19.3¶3),
so this should be perfectly correct on any POSIX system.</p>
  </li>
</ul>
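<p>The second approach might be sketched like this. The helper names <code class="language-plaintext highlighter-rouge">use_full_buffering</code> and <code class="language-plaintext highlighter-rouge">emit_record</code> are hypothetical. The caller supplies the stream buffer, which must outlive the stream, and this sketch doesn’t escape quotes inside the text field:</p>

```c
#include <assert.h>
#include <stdio.h>

/* Switch a stream to full buffering using a caller-supplied buffer.
 * Call this before any other operation on the stream.
 * Returns 0 on success, nonzero on failure. */
int use_full_buffering(FILE *out, char *buf, size_t size)
{
    assert(out != NULL && buf != NULL);
    return setvbuf(out, buf, _IOFBF, size);
}

/* Format one CSV record, which may contain embedded newlines, into the
 * stream buffer and then flush it as a single block.
 * Returns 0 on success, -1 on failure. */
int emit_record(FILE *out, int id, const char *text)
{
    if (fprintf(out, "%d,\"%s\"\n", id, text) < 0)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

<p>The buffer has to be larger than the largest record, otherwise the library will flush partway through a record on its own.</p>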

<p>If your situation is more complicated than this, you’ll probably have
to bypass your standard library buffered I/O and call write(2) or
writev(2) yourself.</p>

<h4 id="practical-application">Practical Application</h4>

<p>If interleaving writes to a pipe stdout sounds contrived, here’s the
real life scenario: GNU xargs with its <code class="language-plaintext highlighter-rouge">--max-procs</code> (<code class="language-plaintext highlighter-rouge">-P</code>) option to
process inputs in parallel.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xargs -n1 -P$(nproc) myprogram &lt; inputs.txt | cat &gt; outputs.csv
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">| cat</code> ensures the output of each <code class="language-plaintext highlighter-rouge">myprogram</code> process is
connected to the same pipe rather than to the same file.</p>

<p>A non-portable alternative to <code class="language-plaintext highlighter-rouge">| cat</code>, especially if you’re
dispatching processes and threads yourself, is the splice(2) system
call on Linux. It efficiently moves the output from the pipe to the
output file without an intermediate copy to userspace. GNU Coreutils’
cat doesn’t use this.</p>
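<p>A rough sketch of what the splice(2) loop could look like. The function name <code class="language-plaintext highlighter-rouge">drain_pipe</code> and the 64kB chunk size are arbitrary choices for illustration. Note that splice(2) rejects an output descriptor opened with <code class="language-plaintext highlighter-rouge">O_APPEND</code>:</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Move everything readable from pipe_fd into out_fd without copying it
 * through user space. out_fd must not be opened with O_APPEND.
 * Returns the total bytes moved, or -1 on error. */
long drain_pipe(int pipe_fd, int out_fd)
{
    assert(pipe_fd >= 0 && out_fd >= 0);
    long total = 0;
    for (;;) {
        ssize_t n = splice(pipe_fd, NULL, out_fd, NULL, 1 << 16, 0);
        if (n == 0)
            return total;  /* every writer has closed the pipe */
        if (n == -1)
            return -1;
        total += n;
    }
}
```

<p>The parent would create the pipe, hand the write end to each child as its standard output, close its own copy of the write end, then call this function on the read end.</p>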

<h4 id="win32-pipes">Win32 Pipes</h4>

<p>On Win32, <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365152(v=vs.85).aspx">anonymous pipes</a> have no semantics regarding
interleaving. <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365150(v=vs.85).aspx">Named pipes</a> have per-client buffers that
prevent interleaving. However, the pipe buffer size is unspecified,
and requesting a particular size is only advisory, so it comes down to
trial and error, though the unstated limits should be comparatively
generous.</p>

<h3 id="interleaving-and-files">Interleaving and Files</h3>

<p>Suppose instead of a pipe we have an <code class="language-plaintext highlighter-rouge">O_APPEND</code> file on POSIX. Common
wisdom states that the same <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> atomic write rule applies.
While this often works, especially on Linux, this is not correct. The
POSIX specification doesn’t require it and <a href="http://www.notthewizard.com/2014/06/17/are-files-appends-really-atomic/">there are systems where it
doesn’t work</a>.</p>

<p>If you know the particular limits of your operating system <em>and</em>
filesystem, and you don’t care much about portability, then maybe you
can get away with interleaving appends. For full portability, pipes
are required.</p>

<p>On Win32, writes on local files up to the underlying drive’s sector
size (typically 512 bytes to 4kB) are atomic. Otherwise the only
options are deprecated Transactional NTFS (TxF), or manually
synchronizing your writes. All in all, it’s going to take more work to
get correct.</p>

<h3 id="conclusion">Conclusion</h3>

<p>My true use case for mucking around with clean, atomic appends is to
compute giant CSV tables in parallel, with the intention of later
loading into a SQL database (e.g. SQLite) for analysis. A more robust
and traditional approach would be to write results directly into the
database as they’re computed. But I like the platform-neutral
intermediate CSV files — good for archival and sharing — and the
simplicity of programs generating the data — concerned only with
atomic write semantics rather than calling into a particular SQL
database API.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Mapping Multiple Memory Views in User Space</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/04/10/"/>
    <id>urn:uuid:373e602e-0d43-3e03-f02c-2d169eb14df5</id>
    <updated>2016-04-10T21:59:16Z</updated>
    <category term="c"/><category term="linux"/><category term="win32"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>Modern operating systems run processes within <em>virtual memory</em> using a
piece of hardware called a <em>memory management unit</em> (MMU). The MMU
contains a <em>page table</em> that defines how virtual memory maps onto
<em>physical memory</em>. The operating system is responsible for maintaining
this page table, mapping and unmapping virtual memory to physical
memory as needed by the processes it’s running. If a process accesses
a page that is not currently mapped, it will trigger a <em>page fault</em>
and the execution of the offending thread will be paused until the
operating system maps that page.</p>

<p>This functionality allows for a neat hack: A physical memory address
can be mapped to multiple virtual memory addresses at the same time. A
process running with such a mapping will see these regions of memory
as aliased — views of the same physical memory. A store to one of
these addresses will simultaneously appear across all of them.</p>

<p>Some useful applications of this feature include:</p>

<ul>
  <li>An extremely fast, large memory “copy” by mapping the source memory
over top of the destination memory.</li>
  <li>Trivial interoperability between code instrumented with <a href="https://www.usenix.org/legacy/event/sec09/tech/full_papers/akritidis.pdf">baggy
bounds checking</a> [PDF] and non-instrumented code. A few bits
of each pointer are reserved to tag the pointer with the size of its
memory allocation. For compactness, the stored size is rounded up to
a power of two, making it “baggy.” Instrumented code checks this tag
before making a possibly-unsafe dereference. Normally, instrumented
code would need to clear (or set) these bits before dereferencing or
before passing it to non-instrumented code. Instead, the allocation
could be mapped simultaneously at each location for every possible
tag, making the pointer valid no matter its tag bits.</li>
  <li>Two responses to <a href="/blog/2016/03/31/">my last post on hotpatching</a> suggested
that, instead of modifying the instruction directly, memory
containing the modification could be mapped over top of the code. I
would copy the code to another place in memory, safely modify it in
private, switch the page protections from write to execute (both for
W^X and for <a href="https://web.archive.org/web/20190323050330/http://stackoverflow.com/a/18905927">other hardware limitations</a>), then map it over
the target. Restoring the original behavior would be as simple as
unmapping the change.</li>
</ul>

<p>Both POSIX and Win32 allow user space applications to create these
aliased mappings. The original purpose for these APIs is for shared
memory between processes, where the same physical memory is mapped
into two different processes’ virtual memory. But the OS doesn’t stop
us from mapping the shared memory to a different address within the
same process.</p>

<h3 id="posix-memory-mapping">POSIX Memory Mapping</h3>

<p>On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions
are <code class="language-plaintext highlighter-rouge">shm_open(3)</code>, <code class="language-plaintext highlighter-rouge">ftruncate(2)</code>, and <code class="language-plaintext highlighter-rouge">mmap(2)</code>.</p>

<p>First, create a file descriptor to shared memory using <code class="language-plaintext highlighter-rouge">shm_open</code>. It
has very similar semantics to <code class="language-plaintext highlighter-rouge">open(2)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">shm_open</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">int</span> <span class="n">oflag</span><span class="p">,</span> <span class="n">mode_t</span> <span class="n">mode</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">name</code> works much like a filesystem path, but is actually a
different namespace (though on Linux it <em>is</em> a tmpfs mounted at
<code class="language-plaintext highlighter-rouge">/dev/shm</code>). Resources created here (<code class="language-plaintext highlighter-rouge">O_CREAT</code>) will persist until
explicitly deleted (<code class="language-plaintext highlighter-rouge">shm_unlink(3)</code>) or until the system reboots. It’s
an oversight in POSIX that a name is required even if we never intend
to access it by name. File descriptors can be shared with other
processes via <code class="language-plaintext highlighter-rouge">fork(2)</code> or through UNIX domain sockets, so a name
isn’t strictly required.</p>

<p>OpenBSD introduced <a href="http://man.openbsd.org/OpenBSD-current/man3/shm_mkstemp.3"><code class="language-plaintext highlighter-rouge">shm_mkstemp(3)</code></a> to solve this problem,
but it’s not widely available. On Linux, as of this writing, the
<code class="language-plaintext highlighter-rouge">O_TMPFILE</code> flag may or may not provide a fix (<a href="http://comments.gmane.org/gmane.linux.man/9815">it’s
undocumented</a>).</p>

<p>The portable workaround is to attempt to choose a unique name, open
the file with <code class="language-plaintext highlighter-rouge">O_CREAT | O_EXCL</code> (either atomically create the file or
fail), <code class="language-plaintext highlighter-rouge">shm_unlink</code> the shared memory object as soon as possible, then
cross our fingers. The shared memory object will still exist (the file
descriptor keeps it alive) but will no longer be accessible by name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">shm_open</span><span class="p">(</span><span class="s">"/example"</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span> <span class="o">|</span> <span class="n">O_EXCL</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">handle_error</span><span class="p">();</span> <span class="c1">// non-local exit</span>
<span class="n">shm_unlink</span><span class="p">(</span><span class="s">"/example"</span><span class="p">);</span>
</code></pre></div></div>

<p>The shared memory object is brand new (<code class="language-plaintext highlighter-rouge">O_EXCL</code>) and is therefore of
zero size. <code class="language-plaintext highlighter-rouge">ftruncate</code> sets it to the desired size. This does <em>not</em>
need to be a multiple of the page size. Failing to allocate memory
will result in a bus error on access.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">);</span>
<span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally <code class="language-plaintext highlighter-rouge">mmap</code> the shared memory into place just as if it were a file.
We can choose an address (aligned to a page) or let the operating
system choose one for us (NULL). If we don’t plan on making any more
mappings, we can also close the file descriptor. The shared memory
object will be freed as soon as it is completely unmapped (<code class="language-plaintext highlighter-rouge">munmap(2)</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>At this point <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> have different addresses but point (via
the page table) to the same physical memory. Changes to one are
reflected in the other. So this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mh">0xdeafbeef</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%p %p 0x%x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="o">*</span><span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>Will print out something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x6ffffff0000 0x6fffffe0000 0xdeafbeef
</code></pre></div></div>

<p>It’s also possible to do all this only with <code class="language-plaintext highlighter-rouge">open(2)</code> and <code class="language-plaintext highlighter-rouge">mmap(2)</code> by
mapping the same file twice, but you’d need to worry about where to
put the file, where it’s going to be backed, and the operating system
will have certain obligations about syncing it to storage somewhere.
Using POSIX shared memory is simpler and faster.</p>
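
<p>For comparison, here’s a minimal sketch of that file-backed variant. The
function name and the <code class="language-plaintext highlighter-rouge">/tmp</code> template are assumptions made for
illustration, and <code class="language-plaintext highlighter-rouge">mkstemp(3)</code> stands in for picking a unique name by
hand. It maps one unlinked temporary file twice and checks that a store
through the first view shows up in the second.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one temporary file twice and check that a
 * store through the first view is visible through the second.
 * Returns 1 if the views alias, 0 if not, and -1 on error. */
static int file_alias_check(void)
{
    char path[] = "/tmp/alias-XXXXXX";  /* assumed location */
    int fd = mkstemp(path);             /* create and open a unique file */
    if (fd == -1)
        return -1;
    unlink(path);                       /* the descriptor keeps it alive */

    size_t size = sizeof(uint32_t);
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }

    int prot = PROT_READ | PROT_WRITE;
    uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    close(fd);                          /* no more mappings needed */
    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED)
            munmap(a, size);
        if (b != MAP_FAILED)
            munmap(b, size);
        return -1;
    }

    *a = 0xdeafbeef;
    int aliased = (a != b) && (*b == 0xdeafbeef);
    munmap(a, size);
    munmap(b, size);
    return aliased;
}
```

<p>Unlike the shared memory version, the kernel may eventually write these
pages out to wherever <code class="language-plaintext highlighter-rouge">/tmp</code> is stored, which is the cost mentioned
above.</p>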

<h3 id="windows-memory-mapping">Windows Memory Mapping</h3>

<p>Windows is very similar, but directly supports anonymous shared
memory. The key functions are <code class="language-plaintext highlighter-rouge">CreateFileMapping</code> and
<code class="language-plaintext highlighter-rouge">MapViewOfFileEx</code>.</p>

<p>First create a file mapping object from an invalid handle value. Like
POSIX, the word “file” is used without actually involving files.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">);</span>
<span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">CreateFileMapping</span><span class="p">(</span><span class="n">INVALID_HANDLE_VALUE</span><span class="p">,</span>
                             <span class="nb">NULL</span><span class="p">,</span>
                             <span class="n">PAGE_READWRITE</span><span class="p">,</span>
                             <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
                             <span class="nb">NULL</span><span class="p">);</span>
</code></pre></div></div>

<p>There’s no truncate step because the space is allocated at creation
time via the two-part size argument.</p>

<p>Then, just like <code class="language-plaintext highlighter-rouge">mmap</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">MapViewOfFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">MapViewOfFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>If I wanted to choose the target address myself, I’d call
<code class="language-plaintext highlighter-rouge">MapViewOfFileEx</code> instead, which takes the address as an additional
argument.</p>

<p>From here on it’s the same as above.</p>

<h3 id="generalizing-the-api">Generalizing the API</h3>

<p>Having some fun with this, I came up with a general API to allocate an
aliased mapping at an arbitrary number of addresses.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">);</span>
</code></pre></div></div>

<p>Values in the address array must either be page-aligned or NULL to
allow the operating system to choose, in which case the map address is
written to the array.</p>

<p>It returns 0 on success. It may fail if the size is too small (0) or
too large, if there are too many open file descriptors, etc.</p>

<p>Pass the same pointers back to <code class="language-plaintext highlighter-rouge">memory_alias_unmap</code> to free the
mappings. When called correctly it cannot fail, so there’s no return
value.</p>

<p>The full source is here: <a href="/download/memalias.c">memalias.c</a></p>

<h4 id="posix">POSIX</h4>

<p>Starting with the simpler of the two functions, the POSIX
implementation looks like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">munmap</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The complex part is creating the mapping:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">path</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">path</span><span class="p">),</span> <span class="s">"/%s(%ld,%p)"</span><span class="p">,</span>
             <span class="n">__FUNCTION__</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">getpid</span><span class="p">(),</span> <span class="n">addrs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">shm_open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span> <span class="o">|</span> <span class="n">O_EXCL</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">shm_unlink</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
    <span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">,</span>
                        <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span>
                        <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">MAP_FAILED</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memory_alias_unmap</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">addrs</span><span class="p">);</span>
            <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The shared object name includes the process ID and pointer array
address, so there really shouldn’t be any non-malicious name
collisions, even if called from multiple threads in the same process.</p>

<p>Otherwise it just walks the array setting up the mappings.</p>

<h4 id="windows">Windows</h4>

<p>The Windows version is very similar.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">size</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">UnmapViewOfFile</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since Windows tracks the size internally, the size argument is unneeded and ignored.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">m</span> <span class="o">=</span> <span class="n">CreateFileMapping</span><span class="p">(</span><span class="n">INVALID_HANDLE_VALUE</span><span class="p">,</span>
                                 <span class="nb">NULL</span><span class="p">,</span>
                                 <span class="n">PAGE_READWRITE</span><span class="p">,</span>
                                 <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
                                 <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">m</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">MapViewOfFileEx</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">access</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memory_alias_unmap</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">addrs</span><span class="p">);</span>
            <span class="n">CloseHandle</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">CloseHandle</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the future I’d like to find some unique applications of these
multiple memory views.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Hotpatching a C Function on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/03/31/"/>
    <id>urn:uuid:49f6ea3c-d44a-3bed-1aad-70ad47e325c6</id>
    <updated>2016-03-31T23:59:59Z</updated>
    <category term="x86"/><category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In this post I’m going to do a silly, but interesting, exercise that
should never be done in any program that actually matters. I’m going
to write a program that changes one of its function definitions while
it’s actively running and using that function. Unlike <a href="/blog/2014/12/23/">last
time</a>, this won’t involve shared libraries, but it will require
x86_64 and GCC. Most of the time it will work with Clang, too, but
it’s missing an important compiler option that makes it stable.</p>

<p>If you want to see it all up front, here’s the full source:
<a href="/download/hotpatch.c">hotpatch.c</a></p>

<p>Here’s the function that I’m going to change:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s dead simple, but that’s just for demonstration purposes. This
will work with any function of arbitrary complexity. The definition
will be changed to this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"goodbye %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">x</span><span class="o">++</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I was only going to change the string, but I figured I should make it a
little more interesting.</p>

<p>Here’s how it’s going to work: I’m going to overwrite the beginning of
the function with an unconditional jump that immediately moves control
to the new definition of the function. It’s vital that the function
prototype does not change, since that would be a <em>far</em> more complex
problem.</p>

<p><strong>But first there’s some preparation to be done.</strong> The target needs to
be augmented with some GCC function attributes to prepare it for its
redefinition. As is, there are three possible problems that need to be
dealt with:</p>

<ul>
  <li>I want to hotpatch this function <em>while it is being used</em> by another
thread <em>without</em> any synchronization. It may even be executing the
function at the same time I clobber its first instructions with my
jump. If it’s in between these instructions, disaster will strike.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">ms_hook_prologue</code> function attribute. This tells
GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP
that I can safely clobber. This idea originated in Microsoft’s Win32
API, hence the “ms” in the name.</p>

<ul>
  <li>The prologue NOP needs to be updated atomically. I can’t let the
other thread see a half-written instruction or, again, disaster. On
x86 this means I have an alignment requirement. Since I’m
overwriting an 8-byte instruction, I’m specifically going to need
8-byte alignment to get an atomic write.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">aligned</code> function attribute, ensuring the
hotpatch prologue is properly aligned.</p>

<ul>
  <li>The final problem is that there must be exactly one copy of this
function in the compiled program. It must never be inlined or
cloned, since these won’t be hotpatched.</li>
</ul>

<p>As you might have guessed, this is primarily fixed with the <code class="language-plaintext highlighter-rouge">noinline</code>
function attribute. GCC may also clone the function and call
that instead, so it also needs the <code class="language-plaintext highlighter-rouge">noclone</code> attribute.</p>

<p>Even further, if GCC determines there are no side effects, it may
cache the return value and only ever call the function once. To
convince GCC that there’s a side effect, I added an empty inline
assembly string (<code class="language-plaintext highlighter-rouge">__asm("")</code>). Since <code class="language-plaintext highlighter-rouge">puts()</code> has a side effect
(output), this isn’t truly necessary for this particular example, but
I’m being thorough.</p>

<p>What does the function look like now?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span> <span class="p">((</span><span class="n">ms_hook_prologue</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">8</span><span class="p">)))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noinline</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noclone</span><span class="p">))</span>
<span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span><span class="p">(</span><span class="s">""</span><span class="p">);</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And what does the assembly look like?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -Mintel -d hotpatch
0000000000400848 &lt;hello&gt;:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 &lt;puts@plt&gt;
</code></pre></div></div>

<p>It’s 8-byte aligned and it has the 8-byte NOP: that <code class="language-plaintext highlighter-rouge">lea</code> instruction
does nothing. It copies <code class="language-plaintext highlighter-rouge">rsp</code> into itself and changes no flags. Why
not 8 1-byte NOPs? I need to replace exactly one instruction with
exactly one other instruction. I can’t have another thread in between
those NOPs.</p>

<h3 id="hotpatching">Hotpatching</h3>

<p>Next, let’s take a look at the function that will perform the
hotpatch. I’ve written a generic patching function for this purpose.
This part is entirely specific to x86.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hotpatch</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">target</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">replacement</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// 8-byte aligned?</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">page</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="o">~</span><span class="mh">0xfff</span><span class="p">);</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_WRITE</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">replacement</span> <span class="o">-</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="kt">uint8_t</span> <span class="n">bytes</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
        <span class="kt">uint64_t</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">instruction</span> <span class="o">=</span> <span class="p">{</span> <span class="p">{</span><span class="mh">0xe9</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">}</span> <span class="p">};</span>
    <span class="o">*</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">=</span> <span class="n">instruction</span><span class="p">.</span><span class="n">value</span><span class="p">;</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It takes the address of the function to be patched and the address of
the function to replace it. As mentioned, the target <em>must</em> be 8-byte
aligned (enforced by the assert). It’s also important this function is
only called by one thread at a time, even on different targets. If
that was a concern, I’d wrap it in a mutex to create a critical
section.</p>

<p>There are a number of things going on here, so let’s go through them
one at a time:</p>

<h4 id="make-the-function-writeable">Make the function writeable</h4>

<p>The .text segment will not be writeable by default. This is for both
security and safety. Before I can hotpatch the function I need to make
the function writeable. To make the function writeable, I need to make
its page writeable. To make its page writeable, I need to call
<code class="language-plaintext highlighter-rouge">mprotect()</code>. If there was another thread monkeying with the page
attributes of this page at the same time (another thread calling
<code class="language-plaintext highlighter-rouge">hotpatch()</code>) I’d be in trouble.</p>

<p>It finds the page by rounding the target address down to the nearest
4096, the assumed page size (sorry hugepages). <em>Warning</em>: I’m being a
bad programmer and not checking the result of <code class="language-plaintext highlighter-rouge">mprotect()</code>. If it
fails, the program will crash and burn. It will always fail on systems
with W^X enforcement, which will likely become the standard <a href="https://marc.info/?t=145942649500004">in the
future</a>. Under W^X (“write XOR execute”), memory can either
be writeable or executable, but never both at the same time.</p>

<p>What if the function straddles pages? Well, I’m only patching the
first 8 bytes, which, thanks to alignment, will sit entirely inside
the page I just found. It’s not an issue.</p>

<p>At the end of the function, I <code class="language-plaintext highlighter-rouge">mprotect()</code> the page back to
non-writeable.</p>

<h4 id="create-the-instruction">Create the instruction</h4>

<p>I’m assuming the replacement function is within 2GB of the original in
virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s
no 64-bit relative jump, and I only have 8 bytes to work within
anyway. Looking that up in <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">the Intel manual</a>, I see this:</p>

<p><img src="/img/misc/jmp-e9.png" alt="" /></p>

<p>Fortunately it’s a really simple instruction. It’s opcode 0xE9 and
it’s followed immediately by the 32-bit displacement. The instruction
is 5 bytes wide.</p>

<p>To compute the relative jump, I take the difference between the
functions, minus 5. Why the 5? The jump address is computed from the
position <em>after</em> the jump instruction and, as I said, it’s 5 bytes
wide.</p>

<p>I put 0xE9 in a byte array, followed by the little endian
displacement. The astute may notice that the displacement is signed
(it can go “up” or “down”) and I used an unsigned integer. That’s
because it will overflow nicely to the right value and make those
shifts clean.</p>
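
<p>The displacement arithmetic can be checked in isolation. This is a
minimal sketch, not the function above: <code class="language-plaintext highlighter-rouge">encode_jmp()</code> is a hypothetical
name, it takes plain integer addresses so it can be exercised without
patching anything, and it still assumes the two addresses are within
2GB of each other.</p>

```c
#include <stdint.h>
#include <string.h>

/* Build the 8 bytes to store at target: opcode 0xE9, then the 32-bit
 * displacement in little endian order, then three zero padding bytes.
 * Assumes replacement is within 2GB of target. */
uint64_t
encode_jmp(uintptr_t target, uintptr_t replacement)
{
    /* Relative to the end of the 5-byte instruction. Unsigned
     * arithmetic wraps to the two's complement bit pattern. */
    uint32_t rel = (uint32_t)(replacement - target - 5);
    uint8_t bytes[8] = {
        0xe9,
        (uint8_t)(rel >> 0),
        (uint8_t)(rel >> 8),
        (uint8_t)(rel >> 16),
        (uint8_t)(rel >> 24),
    };
    uint64_t value;
    memcpy(&value, bytes, sizeof(value));
    return value;
}
```

<p>For a target at 0x1000 and a replacement at 0x2000 the displacement
is 0xFFB, so the bytes come out as E9 FB 0F 00 00. Swapping the two
gives a negative displacement, E9 FB EF FF FF.</p>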

<p>Finally, the instruction byte array I just computed is written over
the hotpatch NOP as a single, atomic, 64-bit store.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    *(uint64_t *)target = instruction.value;
</code></pre></div></div>

<p>Other threads will see either the NOP or the jump, nothing in between.
There’s no synchronization, so other threads may continue to execute
the NOP for a brief moment even though I’ve clobbered it, but that’s
fine.</p>
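
<p>The cast-and-assign relies on the compiler emitting one 64-bit
move. A sketch that states the intent explicitly, assuming GCC or
Clang and their <code class="language-plaintext highlighter-rouge">__atomic_store_n</code> builtin, might look like this.
The name <code class="language-plaintext highlighter-rouge">store64()</code> is hypothetical, and relaxed ordering is chosen
only because nothing here depends on synchronization.</p>

```c
#include <stdint.h>

/* Store 8 bytes with a single atomic store.
 * target must be 8-byte aligned. */
void
store64(void *target, uint64_t value)
{
    __atomic_store_n((uint64_t *)target, value, __ATOMIC_RELAXED);
}
```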

<h3 id="trying-it-out">Trying it out</h3>

<p>Here’s what my test program looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">worker</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">arg</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">hello</span><span class="p">();</span>
        <span class="n">usleep</span><span class="p">(</span><span class="mi">100000</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">pthread_t</span> <span class="kr">thread</span><span class="p">;</span>
    <span class="n">pthread_create</span><span class="p">(</span><span class="o">&amp;</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">worker</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="n">getchar</span><span class="p">();</span>
    <span class="n">hotpatch</span><span class="p">(</span><span class="n">hello</span><span class="p">,</span> <span class="n">new_hello</span><span class="p">);</span>
    <span class="n">pthread_join</span><span class="p">(</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I fire off the other thread to keep it pinging at <code class="language-plaintext highlighter-rouge">hello()</code>. The
main thread waits until I hit enter to give the program input,
after which it calls <code class="language-plaintext highlighter-rouge">hotpatch()</code> and changes the function called by
the “worker” thread. I’ve now changed the behavior of the worker
thread without its knowledge. In a more practical situation, this
could be used to update parts of a running program without restarting
or even synchronizing.</p>

<h3 id="further-reading">Further Reading</h3>

<p>These related articles have been shared with me since publishing this
article:</p>

<ul>
  <li><a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583">Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?</a></li>
  <li><a href="http://jbremer.org/x86-api-hooking-demystified/">x86 API Hooking Demystified</a></li>
  <li><a href="http://conf.researchr.org/event/pldi-2016/pldi-2016-papers-living-on-the-edge-rapid-toggling-probes-with-cross-modification-on-x86">Living on the edge: Rapid-toggling probes with cross modification on x86</a></li>
  <li><a href="https://lwn.net/Articles/620640/">arm64: alternatives runtime patching</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Small, Freestanding Windows Executables</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/01/31/"/>
    <id>urn:uuid:8eddc701-52d3-3b0c-a8a8-dd13da6ead2c</id>
    <updated>2016-01-31T22:53:03Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>Update</strong>: This is old and <a href="/blog/2023/02/15/">was <strong>updated in 2023</strong></a>!</p>

<p>Recently I’ve been experimenting with freestanding C programs on
Windows. <em>Freestanding</em> refers to programs that don’t link, either
statically or dynamically, against a standard library (i.e. libc).
This is typical for operating systems and <a href="/blog/2014/12/09/">similar, bare metal
situations</a>. Normally a C compiler can make assumptions about the
semantics of functions provided by the C standard library. For
example, the compiler will likely replace a call to a small,
fixed-size <code class="language-plaintext highlighter-rouge">memmove()</code> with move instructions. Since a freestanding
program would supply its own, it may have different semantics.</p>

<p>My usual go-to for C/C++ on Windows is <a href="http://mingw-w64.org/">Mingw-w64</a>, which has
greatly suited my needs the past couple of years. It’s <a href="https://packages.debian.org/search?keywords=mingw-w64">packaged on
Debian</a>, and, when combined with Wine, allows me to fully develop
Windows applications on Linux. Being GCC, it’s also great for
cross-platform development since it’s essentially the same compiler as
the other platforms. The primary difference is the interface to the
operating system (POSIX vs. Win32).</p>

<p>However, it has one glaring flaw inherited from MinGW: it links
against msvcrt.dll, an ancient version of the Microsoft C runtime
library that currently ships with Windows. Besides being dated and
quirky, <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">it’s not an official part of Windows</a> and never has
been, despite its inclusion with every release since Windows 95.
Mingw-w64 doesn’t have a C library of its own, instead patching over
some of the flaws of msvcrt.dll and linking against it.</p>

<p>Since so much depends on msvcrt.dll despite its unofficial nature,
it’s unlikely Microsoft will ever drop it from future releases of
Windows. However, if strict correctness is a concern, we must ask
Mingw-w64 not to link against it. An alternative would be
<a href="http://plibc.sourceforge.net/">PlibC</a>, though the LGPL licensing is unfortunate. Another is
Cygwin, which is a very complete POSIX environment, but is heavy and
GPL-encumbered.</p>

<p>Sometimes I’d prefer to be more direct: <a href="https://hero.handmade.network/forums/code-discussion/t/94-guide_-_how_to_avoid_c_c++_runtime_on_windows">skip the C standard library
altogether</a> and talk directly to the operating system. On Windows
that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only
links against system DLLs.</p>

<h3 id="linux-vs-windows">Linux vs. Windows</h3>

<p>The most important benefit of a standard library like libc is a
portable, uniform interface to the host system. So long as the
standard library suits its needs, the same program can run anywhere.
Without it, the program needs an implementation of each
host-specific interface.</p>

<p>On Linux, operating system requests at the lowest level are made
directly via system calls. This requires a bit of assembly language
for each supported architecture (<code class="language-plaintext highlighter-rouge">int 0x80</code> on x86, <code class="language-plaintext highlighter-rouge">syscall</code> on
x86-64, <code class="language-plaintext highlighter-rouge">swi</code> on ARM, etc.). The POSIX functions of the various Linux
libc implementations are built on top of this mechanism.</p>

<p>For example, here’s a function for a 1-argument system call on x86-64.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">syscall1</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">result</span><span class="p">;</span>
    <span class="n">__asm__</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
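        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">arg</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>  <span class="c1">// syscall clobbers rcx and r11</span>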
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">exit()</code> is implemented on top. Note: A <em>real</em> libc would do
cleanup before exiting, like calling registered <code class="language-plaintext highlighter-rouge">atexit()</code> functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;syscall.h&gt;</span><span class="c1">  // defines SYS_exit</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">code</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">code</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
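
<p>The same pattern extends to more arguments. Here’s a hedged sketch
of a 3-argument wrapper with a <code class="language-plaintext highlighter-rouge">write()</code>-like function on top, for
x86-64 Linux only. The names <code class="language-plaintext highlighter-rouge">syscall3()</code> and <code class="language-plaintext highlighter-rouge">raw_write()</code> are
hypothetical, and the system call number 1 for write is hard-coded on
the assumption that the target is x86-64.</p>

```c
/* 3-argument system call on x86-64 Linux. The arguments go in rdi,
 * rsi, and rdx. The syscall instruction itself overwrites rcx and
 * r11, so they are listed as clobbers. */
long
syscall3(long n, long a, long b, long c)
{
    long result;
    __asm__ volatile (
        "syscall"
        : "=a"(result)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return result;
}

/* Returns the number of bytes written, or a negated errno value. */
long
raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall3(1 /* write on x86-64 */, fd, (long)buf, (long)len);
}
```

<p>Unlike the libc <code class="language-plaintext highlighter-rouge">write()</code>, this returns the kernel’s result as-is, so
writing to an invalid file descriptor yields -9 (<code class="language-plaintext highlighter-rouge">-EBADF</code>) rather
than -1.</p>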

<p>The situation is simpler on Windows. Its low level system calls are
undocumented and unstable, changing across even minor updates. The
formal, stable interface is through the exported functions in
kernel32.dll. In fact, kernel32.dll is essentially a standard library
on its own (making the term “freestanding” in this case dubious). It
includes functions usually found only in user-space, like string
manipulation, formatted output, font handling, and heap management
(similar to <code class="language-plaintext highlighter-rouge">malloc()</code>). It’s not POSIX, but it has analogs to much of
the same functionality.</p>

<h3 id="program-entry">Program Entry</h3>

<p>The standard entry for a C program is <code class="language-plaintext highlighter-rouge">main()</code>. However, this is not
the application’s <em>true</em> entry. The entry is in the C library, which
does some initialization before calling your <code class="language-plaintext highlighter-rouge">main()</code>. When <code class="language-plaintext highlighter-rouge">main()</code>
returns, it performs cleanup and exits. Without a C library, programs
don’t start at <code class="language-plaintext highlighter-rouge">main()</code>.</p>

<p>On Linux the default entry is the symbol <code class="language-plaintext highlighter-rouge">_start</code>. Its prototype
would look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Returning from this function leads to a segmentation fault, so it’s up
to your application to perform the exit system call rather than
return.</p>

<p>On Windows, the entry depends on the type of application. The two
relevant subsystems today are the <em>console</em> and <em>windows</em> subsystems.
The former is for console applications (duh). These programs may still
create windows and such, but must always have a controlling console.
The latter is primarily for programs that don’t run in a console,
though they can still create an associated console if they like. In
Mingw-w64, give <code class="language-plaintext highlighter-rouge">-mconsole</code> (default) or <code class="language-plaintext highlighter-rouge">-mwindows</code> to the linker to
choose the subsystem.</p>

<p>The default <a href="https://msdn.microsoft.com/en-us/library/f9t8842e.aspx">entry for each is slightly different</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike Linux’s <code class="language-plaintext highlighter-rouge">_start</code>, Windows programs can safely return from these
functions, similar to <code class="language-plaintext highlighter-rouge">main()</code>, hence the <code class="language-plaintext highlighter-rouge">int</code> return. The <code class="language-plaintext highlighter-rouge">WINAPI</code>
macro means the function may have a special calling convention,
depending on the platform.</p>

<p>On any system, you can choose a different entry symbol or address
using the <code class="language-plaintext highlighter-rouge">--entry</code> option to the GNU linker.</p>

<h3 id="disabling-libgcc">Disabling libgcc</h3>

<p>One problem I’ve run into is Mingw-w64 generating code that calls
<code class="language-plaintext highlighter-rouge">__chkstk_ms()</code> from libgcc. I believe this is a long-standing bug,
since <code class="language-plaintext highlighter-rouge">-ffreestanding</code> should prevent these sorts of helper functions
from being used. The workaround I’ve found is to disable <a href="https://metricpanda.com/rival-fortress-update-45-dealing-with-__chkstk-__chkstk_ms-when-cross-compiling-for-windows/">the stack
probe</a> and pre-commit the whole stack.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000
</code></pre></div></div>

<p>Alternatively you could link against libgcc (statically) with <code class="language-plaintext highlighter-rouge">-lgcc</code>,
but, again, I’m going for a tiny executable.</p>

<h3 id="a-freestanding-example">A freestanding example</h3>

<p>Here’s an example of a Windows “Hello, World” that doesn’t use a C
library.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="n">WINAPI</span>
<span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">msg</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="n">DWORD</span><span class="p">[]){</span><span class="mi">0</span><span class="p">},</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To build it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
    -o example.exe example.c \
    -lkernel32
</code></pre></div></div>

<p>Notice I manually linked against kernel32.dll. The stripped final
result is only 4kB, mostly PE padding. There are <a href="http://www.phreedom.org/research/tinype/">techniques to trim
this down even further</a>, but for a substantial program it
wouldn’t make a significant difference.</p>

<p>From here you could create a GUI by linking against <code class="language-plaintext highlighter-rouge">user32.dll</code> and
<code class="language-plaintext highlighter-rouge">gdi32.dll</code> (both also part of Win32) and calling the appropriate
functions. I already <a href="/blog/2015/06/06/">ported my OpenGL demo</a> to a freestanding
.exe, dropping GLFW and directly using Win32 and WGL. It’s much less
portable, but the final .exe is only 4kB, down from the original 104kB
(static linking against GLFW).</p>

<p>I may go this route for <a href="http://7drl.org/2016/01/13/7drl-2016-announced-for-5-13-march/">the upcoming 7DRL 2016</a> in March.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Raw Linux Threads via System Calls</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/05/15/"/>
    <id>urn:uuid:9d5de15b-9308-3715-2bd7-565d6649ab2f</id>
    <updated>2015-05-15T17:33:40Z</updated>
    <category term="x86"/><category term="linux"/><category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><em>This article has <a href="/blog/2016/09/23/">a followup</a>.</em></p>

<p>Linux has an elegant and beautiful design when it comes to threads:
threads are nothing more than processes that share a virtual address
space and file descriptor table. Threads spawned by a process are
additional child processes of the main “thread’s” parent process.
They’re manipulated through the same process management system calls,
eliminating the need for a separate set of thread-related system
calls. It’s elegant in the same way file descriptors are elegant.</p>

<p>Normally on Unix-like systems, processes are created with fork(). The
new process gets its own address space and file descriptor table that
starts as a copy of the original. (Linux uses copy-on-write to do this
part efficiently.) However, this is too high level for creating
threads, so Linux has a separate <a href="http://man7.org/linux/man-pages/man2/clone.2.html">clone()</a> system call. It
works just like fork() except that it accepts a number of flags to
adjust its behavior, primarily to share parts of the parent’s
execution context with the child.</p>

<p>It’s <em>so</em> simple that <strong>it takes less than 15 instructions to spawn a
thread with its own stack</strong>, no libraries needed, and no need to call
Pthreads! In this article I’ll demonstrate how to do this on x86-64.
All of the code will be written in <a href="http://www.nasm.us/">NASM</a> syntax since, IMHO,
it’s by far the best (see: <a href="/blog/2015/04/19/">nasm-mode</a>).</p>

<p>I’ve put the complete demo here if you want to see it all at once:</p>

<ul>
  <li><a href="https://github.com/skeeto/pure-linux-threads-demo">Pure assembly, library-free Linux threading demo</a></li>
</ul>

<h3 id="an-x86-64-primer">An x86-64 Primer</h3>

<p>I want you to be able to follow along even if you aren’t familiar with
x86-64 assembly, so here’s a short primer of the relevant pieces. If
you already know x86-64 assembly, feel free to skip to the next
section.</p>

<p>x86-64 has 16 64-bit <em>general purpose registers</em>, primarily used to
manipulate integers, including memory addresses. There are <em>many</em> more
registers than this with more specific purposes, but we won’t need
them for threading.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rsp</code> : stack pointer</li>
  <li><code class="language-plaintext highlighter-rouge">rbp</code> : “base” pointer (still used in debugging and profiling)</li>
  <li><code class="language-plaintext highlighter-rouge">rax</code> <code class="language-plaintext highlighter-rouge">rbx</code> <code class="language-plaintext highlighter-rouge">rcx</code> <code class="language-plaintext highlighter-rouge">rdx</code> : general purpose (notice: a, b, c, d)</li>
  <li><code class="language-plaintext highlighter-rouge">rdi</code> <code class="language-plaintext highlighter-rouge">rsi</code> : “destination” and “source”, now meaningless names</li>
  <li><code class="language-plaintext highlighter-rouge">r8</code> <code class="language-plaintext highlighter-rouge">r9</code> <code class="language-plaintext highlighter-rouge">r10</code> <code class="language-plaintext highlighter-rouge">r11</code> <code class="language-plaintext highlighter-rouge">r12</code> <code class="language-plaintext highlighter-rouge">r13</code> <code class="language-plaintext highlighter-rouge">r14</code> <code class="language-plaintext highlighter-rouge">r15</code> : added for x86-64</li>
</ul>

<p><img src="/img/x86/register.png" alt="" /></p>

<p>The “r” prefix indicates that they’re 64-bit registers. It won’t be
relevant in this article, but the same name prefixed with “e”
indicates the lower 32 bits of these same registers, and no prefix
indicates the lowest 16 bits. This is because x86 was <a href="/blog/2014/12/09/">originally a
16-bit architecture</a>, extended to 32 bits, then to 64 bits.
Historically each of these registers had a specific, unique
purpose, but on x86-64 they’re almost completely interchangeable.</p>

<p>There’s also a “rip” instruction pointer register that conceptually
walks along the machine instructions as they’re being executed, but,
unlike the other registers, it can only be manipulated indirectly.
Remember that data and code <a href="http://en.wikipedia.org/wiki/Von_Neumann_architecture">live in the same address space</a>, so
rip is not much different than any other data pointer.</p>

<h4 id="the-stack">The Stack</h4>

<p>The rsp register points to the “top” of the call stack. The stack
keeps track of who called the current function, in addition to local
variables and other function state (a <em>stack frame</em>). I put “top” in
quotes because the stack actually grows <em>downward</em> on x86 towards
lower addresses, so the stack pointer points to the lowest address on
the stack. This piece of information is critical when talking about
threads, since we’ll be allocating our own stacks.</p>
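
<p>In practical terms, if a stack is just a buffer, the initial stack
pointer is the address one past its <em>highest</em> byte, not its start. A
small sketch in C, where <code class="language-plaintext highlighter-rouge">stack_top()</code> is a hypothetical name and the
rounding to 16 bytes reflects the stack alignment the System V ABI
expects at a call:</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Given the lowest address of a stack buffer and its size, return the
 * initial stack pointer: one past the highest byte, rounded down to a
 * 16-byte boundary. */
void *
stack_top(void *base, size_t size)
{
    uintptr_t top = (uintptr_t)base + size;
    return (void *)(top & ~(uintptr_t)15);
}
```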

<p>The stack is also sometimes used to pass arguments to another
function. This happens much less frequently on x86-64, especially with
the <a href="http://wiki.osdev.org/System_V_ABI">System V ABI</a> used by Linux, where the first 6 arguments are
passed via registers. The return value is passed back via rax. When
calling another function, integer/pointer arguments are
passed in these registers in this order:</p>

<ul>
  <li>rdi, rsi, rdx, rcx, r8, r9</li>
</ul>

<p>So, for example, to perform a function call like <code class="language-plaintext highlighter-rouge">foo(1, 2, 3)</code>, store
1, 2 and 3 in rdi, rsi, and rdx, then <code class="language-plaintext highlighter-rouge">call</code> the function. The <code class="language-plaintext highlighter-rouge">mov</code>
instruction stores the source (second) operand in its destination
(first) operand. The <code class="language-plaintext highlighter-rouge">call</code> instruction pushes the current value of
rip onto the stack, then sets rip (<em>jumps</em>) to the address of the
target function. When the callee is ready to return, it uses the <code class="language-plaintext highlighter-rouge">ret</code>
instruction to <em>pop</em> the original rip value off the stack and back
into rip, returning control to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="mi">2</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">3</span>
    <span class="nf">call</span> <span class="nv">foo</span>
</code></pre></div></div>

<p>Called functions <em>must</em> preserve the contents of these registers (the
same value must be stored when the function returns):</p>

<ul>
  <li>rbx, rsp, rbp, r12, r13, r14, r15</li>
</ul>

<h4 id="system-calls">System Calls</h4>

<p>When making a <em>system call</em>, the argument registers are <a href="http://man7.org/linux/man-pages/man2/syscall.2.html">slightly
different</a>. Notice rcx has been changed to r10.</p>

<ul>
  <li>rdi, rsi, rdx, r10, r8, r9</li>
</ul>
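
<p>For readers more comfortable in C, the same register assignment can
be sketched with GCC inline assembly. This is an illustration for
x86-64 Linux, not code from the demo: <code class="language-plaintext highlighter-rouge">syscall6()</code> is a hypothetical
name, and it relies on GCC’s explicit register variables because
there are no single-letter constraints for r10, r8, and r9.</p>

```c
/* 6-argument system call on x86-64 Linux. The fourth argument goes in
 * r10 rather than rcx because the syscall instruction overwrites rcx
 * (and r11). */
long
syscall6(long n, long a, long b, long c, long d, long e, long f)
{
    long result;
    register long r10 __asm__("r10") = d;
    register long r8  __asm__("r8")  = e;
    register long r9  __asm__("r9")  = f;
    __asm__ volatile (
        "syscall"
        : "=a"(result)
        : "a"(n), "D"(a), "S"(b), "d"(c), "r"(r10), "r"(r8), "r"(r9)
        : "rcx", "r11", "memory"
    );
    return result;
}
```

<p>System calls that take fewer arguments can pass zeros for the
rest, since the kernel ignores registers it doesn’t need.</p>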

<p>Each system call has an integer identifying it. This number is
different on each platform, but, in Linux’s case, <a href="https://www.youtube.com/watch?v=1Mg5_gxNXTo#t=8m28">it will <em>never</em>
change</a>. Instead of <code class="language-plaintext highlighter-rouge">call</code>, rax is set to the number of the
desired system call and the <code class="language-plaintext highlighter-rouge">syscall</code> instruction makes the request to
the OS kernel. Prior to x86-64, this was done with an old-fashioned
interrupt. Because interrupts are slow, a special,
statically-positioned “vsyscall” page (now deprecated as a <a href="http://en.wikipedia.org/wiki/Return-oriented_programming">security
hazard</a>), later <a href="https://lwn.net/Articles/446528/">vDSO</a>, is provided to allow certain system
calls to be made as function calls. We’ll only need the <code class="language-plaintext highlighter-rouge">syscall</code>
instruction in this article.</p>

<p>So, for example, the write() system call has this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">);</span>
</code></pre></div></div>

<p>On x86-64, the write() system call is at the top of <a href="https://filippo.io/linux-syscall-table/">the system call
table</a> as call 1 (read() is 0). Standard output is file
descriptor 1 by default (standard input is 0). The following bit of
code will write 10 bytes of data from the memory address <code class="language-plaintext highlighter-rouge">buffer</code> (a
symbol defined elsewhere in the assembly program) to standard output.
The number of bytes written, or a negated errno value on error, will be returned in rax.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">1</span>        <span class="c1">; fd</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">buffer</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">10</span>       <span class="c1">; 10 bytes</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">1</span>        <span class="c1">; SYS_write</span>
    <span class="nf">syscall</span>
</code></pre></div></div>
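
<p>At this level, with no C library wrapper in between, the kernel
signals failure by returning a negated errno value in rax. Translating
that into -1 and <code class="language-plaintext highlighter-rouge">errno</code> is the job of the libc wrapper. A minimal
sketch of the check in C, with hypothetical helper names:</p>

```c
/* Raw Linux system calls report failure as a negated errno value in
 * the range [-4095, -1]. Anything else is a successful result. */
int
syscall_failed(long ret)
{
    return (unsigned long)ret >= (unsigned long)-4095L;
}

/* The errno value for a failed call, or 0 if the call succeeded. */
int
syscall_errno(long ret)
{
    return syscall_failed(ret) ? (int)-ret : 0;
}
```

<p>The range check, rather than a plain test for a negative number,
matters for calls whose successful result may look negative when
treated as a signed integer, such as a high address.</p>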

<h4 id="effective-addresses">Effective Addresses</h4>

<p>There’s one last thing you need to know: registers often hold a memory
address (i.e. a pointer), and you need a way to read the data behind
that address. In NASM syntax, wrap the register in brackets (e.g.
<code class="language-plaintext highlighter-rouge">[rax]</code>), which, if you’re familiar with C, would be the same as
<em>dereferencing</em> the pointer.</p>

<p>These bracket expressions, called an <em>effective address</em>, may be
limited mathematical expressions to offset that <em>base</em> address
entirely within a single instruction. This expression can include
another register (<em>index</em>), a power-of-two <em>scale</em> (bit shift), and
an immediate signed <em>offset</em>. For example, <code class="language-plaintext highlighter-rouge">[rax + rdx*8 + 12]</code>. If
rax is a pointer to a struct, and rdx is an array index to an element
in an array in that struct, only a single instruction is needed to read
that element. NASM is smart enough to allow the assembly programmer to
break this mold a little bit with more complex expressions, so long as
it can reduce it to the <code class="language-plaintext highlighter-rouge">[base + index*2^exp + offset]</code> form.</p>
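
<p>Spelled out as ordinary arithmetic, the address computation looks
like the following sketch. The function name is hypothetical, and the
hardware only accepts exponents 0 through 3 (scales of 1, 2, 4, and
8), which the sketch doesn’t enforce.</p>

```c
#include <stdint.h>

/* Compute base + index*2^exp + offset, the arithmetic behind an
 * effective address such as [rax + rdx*8 + 12]. */
uintptr_t
effective_address(uintptr_t base, uintptr_t index, unsigned exp,
                  int32_t offset)
{
    /* The signed offset is sign-extended before being added. */
    return base + (index << exp) + (uintptr_t)(intptr_t)offset;
}
```

<p>With rax holding 0x1000 and rdx holding 3, <code class="language-plaintext highlighter-rouge">[rax + rdx*8 + 12]</code>
refers to address 0x1024.</p>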

<p>The details of addressing aren’t important for this article, so
don’t worry too much about it if that didn’t make sense.</p>

<h3 id="allocating-a-stack">Allocating a Stack</h3>

<p>Threads share everything except for registers, a stack, and
thread-local storage (TLS). The OS and underlying hardware will
automatically ensure that registers are per-thread. Since it’s not
essential, I won’t cover thread-local storage in this article. In
practice, the stack is often used for thread-local data anyway. That
leaves the stack, and before we can spawn a new thread, we need to
allocate a stack, which is nothing more than a memory buffer.</p>

<p>The trivial way to do this would be to reserve some fixed .bss
(zero-initialized) storage for threads in the executable itself, but I
want to do it the Right Way and allocate the stack dynamically, just
as Pthreads, or any other threading library, would. Otherwise the
application would be limited to a compile-time fixed number of
threads.</p>

<p>You <a href="http://marek.vavrusa.com/c/memory/2015/02/20/memory/">can’t just read from and write to arbitrary addresses</a> in
virtual memory; you first <a href="/blog/2015/03/19/">have to ask the kernel to allocate
pages</a>. There are two system calls on Linux to do this:</p>

<ul>
  <li>
    <p>brk(): Extends (or shrinks) the heap of a running process, typically
located somewhere shortly after the .bss segment. Many allocators
will do this for small or initial allocations. This is a less
optimal choice for thread stacks because the stacks will be very
near other important data, near other stacks, and lack a guard page
(by default). It would be somewhat easier for an attacker to exploit
a buffer overflow. A guard page is a locked-down page just past the
absolute end of the stack that will trigger a segmentation fault on
a stack overflow, rather than allow a stack overflow to trash other
memory undetected. A guard page could still be created manually with
mprotect(). There’s also no room for these stacks to grow.</p>
  </li>
  <li>
    <p>mmap(): Use an anonymous mapping to allocate a contiguous set of
pages at some randomized memory location. As we’ll see, you can even
tell the kernel specifically that you’re going to use this memory as
a stack. Also, this is simpler than using brk() anyway.</p>
  </li>
</ul>

<p>On x86-64, mmap() is system call 9. I’ll define a function to allocate
a stack with this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>The mmap() system call takes 6 arguments, but when creating an
anonymous memory map the last two arguments are ignored. For our
purposes, it looks like this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">mmap</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">length</span><span class="p">,</span> <span class="kt">int</span> <span class="n">prot</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">);</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">flags</code>, we’ll choose a private, anonymous mapping that, being a
stack, grows downward. Even with that last flag, the system call will
still return the bottom address of the mapping, which will be
important to remember later. It’s just a simple matter of setting the
arguments in the registers and making the system call.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">%define SYS_mmap	9
%define STACK_SIZE	(4096 * 1024)	</span><span class="c1">; 4 MB
</span>
<span class="nl">stack_create:</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">0</span>  <span class="c1">; addr: let the kernel choose</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">STACK_SIZE</span>  <span class="c1">; length</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nv">PROT_WRITE</span> <span class="o">|</span> <span class="nv">PROT_READ</span>  <span class="c1">; prot</span>
    <span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span> <span class="nv">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="nv">MAP_PRIVATE</span> <span class="o">|</span> <span class="nv">MAP_GROWSDOWN</span>  <span class="c1">; flags</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_mmap</span>
    <span class="nf">syscall</span>  <span class="c1">; rax = low address of the mapping</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Now we can allocate new stacks (or stack-sized buffers) as needed.</p>

<h3 id="spawning-a-thread">Spawning a Thread</h3>

<p>Spawning a thread is so simple that it doesn’t even require a branch
instruction! It’s a call to clone() with two arguments: clone flags
and a pointer to the new thread’s stack. It’s important to note that,
as in many cases, the glibc wrapper function has the arguments in a
different order than the system call. With the set of flags we’re
using, the system call itself needs only these two arguments.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">sys_clone</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">child_stack</span><span class="p">);</span>
</code></pre></div></div>

<p>Our thread spawning function will have this C prototype. It takes a
function as its argument and starts the thread running that function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">thread_create</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="p">)(</span><span class="kt">void</span><span class="p">));</span>
</code></pre></div></div>

<p>The function pointer argument is passed via rdi, per the ABI. Store
this for safekeeping on the stack (<code class="language-plaintext highlighter-rouge">push</code>) in preparation for calling
stack_create(). When it returns, the address of the low end of the
stack will be in rax.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">thread_create:</span>
    <span class="nf">push</span> <span class="nb">rdi</span>
    <span class="nf">call</span> <span class="nv">stack_create</span>
    <span class="nf">lea</span> <span class="nb">rsi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nv">STACK_SIZE</span> <span class="o">-</span> <span class="mi">8</span><span class="p">]</span>
    <span class="nf">pop</span> <span class="kt">qword</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="nb">CL</span><span class="nv">ONE_VM</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_FS</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_FILES</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_SIGHAND</span> <span class="o">|</span> <span class="err">\</span>
             <span class="nf">CLONE_PARENT</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_THREAD</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_IO</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_clone</span>
    <span class="nf">syscall</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The second argument to clone() is a pointer to the <em>high address</em> of
the stack (specifically, just above the stack). So we need to add
<code class="language-plaintext highlighter-rouge">STACK_SIZE</code> to rax to get the high end. This is done with the <code class="language-plaintext highlighter-rouge">lea</code>
instruction: <strong>l</strong>oad <strong>e</strong>ffective <strong>a</strong>ddress. Despite the brackets,
it doesn’t actually read memory at that address, but instead stores
the address in the destination register (rsi). I’ve moved it back by 8
bytes because I’m going to place the thread function pointer at the
“top” of the new stack in the next instruction. You’ll see why in a
moment.</p>

<p><img src="/img/x86/clone.png" alt="" /></p>

<p>Remember that the function pointer was pushed onto the stack for
safekeeping. This is popped off the current stack and written to that
reserved space on the new stack.</p>

<p>As you can see, it takes a lot of flags to create a thread with
clone(). Most things aren’t shared with the caller by default, so lots
of options need to be enabled. See the clone(2) man page for full
details on these flags.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CLONE_THREAD</code>: Put the new process in the same thread group.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_VM</code>: Runs in the same virtual memory space.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_PARENT</code>: Share a parent with the caller.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_SIGHAND</code>: Share signal handlers.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_FS</code>, <code class="language-plaintext highlighter-rouge">CLONE_FILES</code>, <code class="language-plaintext highlighter-rouge">CLONE_IO</code>: Share filesystem information, the file descriptor table, and the I/O context, respectively.</li>
</ul>

<p>A new thread will be created and the syscall will return in each of
the two threads at the same instruction, <em>exactly</em> like fork(). All
registers will be identical between the threads, except for rax, which
will be 0 in the new thread, and rsp which has the same value as rsi
in the new thread (the pointer to the new stack).</p>

<p><strong>Now here’s the really cool part</strong>, and the reason branching isn’t
needed. There’s no reason to check rax to determine if we are the
original thread (in which case we return to the caller) or if we’re
the new thread (in which case we jump to the thread function).
Remember how we seeded the new stack with the thread function? When
the new thread returns (<code class="language-plaintext highlighter-rouge">ret</code>), it will jump to the thread function
with a completely empty stack. The original thread, using the original
stack, will return to the caller.</p>

<p>The value returned by thread_create() is the process ID of the new
thread, which is essentially the thread object (e.g. Pthreads’
<code class="language-plaintext highlighter-rouge">pthread_t</code>).</p>

<h3 id="cleaning-up">Cleaning Up</h3>

<p>The thread function has to be careful not to return (<code class="language-plaintext highlighter-rouge">ret</code>) since
there’s nowhere to return to. It would fall off the stack and
terminate the program with a segmentation fault. Remember that threads
are just processes? It must use the exit() syscall to terminate. This
won’t terminate the other threads.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">%define SYS_exit	60
</span>
<span class="nl">exit:</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_exit</span>
    <span class="nf">syscall</span>
</code></pre></div></div>

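<p>Before exiting, it should free its stack with the munmap() system
call, so that no resources are leaked by the terminated thread. The
equivalent of pthread_join() takes more work: according to clone(2), a
thread created with <code class="language-plaintext highlighter-rouge">CLONE_THREAD</code> cannot be waited on with wait4(),
so the main thread needs another way to learn that the thread has
finished, such as a futex or a shared flag.</p>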

<h3 id="more-exploration">More Exploration</h3>

<p>If you found this interesting, be sure to check out the full demo link
at the top of this article. Now with the ability to spawn threads,
it’s a great opportunity to explore and experiment with x86’s
synchronization primitives, such as the <code class="language-plaintext highlighter-rouge">lock</code> instruction prefix,
<code class="language-plaintext highlighter-rouge">xadd</code>, and <a href="/blog/2014/09/02/">compare-and-exchange</a> (<code class="language-plaintext highlighter-rouge">cmpxchg</code>). I’ll discuss
these in a future article.</p>

]]>
    </content>
  </entry>

</feed>
