<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged posix at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/posix/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/posix/feed/"/>
  <updated>2026-03-30T21:58:42Z</updated>
  <id>urn:uuid:bba994f0-2941-4a75-b490-04cd48f44829</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Conventions for Command Line Options</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/08/01/"/>
    <id>urn:uuid:9be2ce0e-298e-4085-8789-49674aecfeeb</id>
    <updated>2020-08-01T00:34:23Z</updated>
    <category term="tutorial"/><category term="posix"/><category term="c"/><category term="python"/><category term="go"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=24020952">on Hacker News</a> and critiqued <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/MyOptionsConventions">on
Wandering Thoughts</a> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/UnixOptionsConventions">2</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/python/ArgparseSomeUnixNotes">3</a>).</em></p>

<p>Command line interfaces have varied throughout their brief history but
have largely converged to some common, sound conventions. The core
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap12.html">originates from unix</a>, and the Linux ecosystem extended it,
particularly via the GNU project. Unfortunately some tools initially
<em>appear</em> to follow the conventions, but subtly get them wrong, usually
for no practical benefit. I believe in many cases the authors simply
didn’t know any better, so I’d like to review the conventions.</p>

<!--more-->

<h3 id="short-options">Short options</h3>

<p>The simplest case is the <em>short option</em> flag. An option is a hyphen —
specifically HYPHEN-MINUS U+002D — followed by one alphanumeric
character. Capital letters are acceptable. The letters themselves <a href="http://www.catb.org/~esr/writings/taoup/html/ch10s05.html">have
conventional meanings</a> and are worth following if possible.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -c
</code></pre></div></div>

<p>Flags can be grouped together into one program argument. This is both
convenient and unambiguous. It’s also one of those often missed details
when programs use hand-coded argument parsers, and the lack of support
irritates me.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -abc
program -acb
</code></pre></div></div>

<p>The next simplest case is a short option that takes an argument. The
argument follows the option.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -i input.txt -o output.txt
</code></pre></div></div>

<p>The space is optional, so the option and argument can be packed together
into one program argument. Since the argument is required, this is still
unambiguous. This is another often-missed feature in hand-coded parsers.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -iinput.txt -ooutput.txt
</code></pre></div></div>

<p>This does not prohibit grouping. When grouped, the option accepting an
argument must be last.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -abco output.txt
program -abcooutput.txt
</code></pre></div></div>

<p>This technique is used to create another category, <em>optional option
arguments</em>. The option’s argument can be optional but still unambiguous
so long as the space is always omitted when the argument is present.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -c       # omitted
program -cblue   # provided
program -c blue  # omitted (blue is a new argument)

program -c -x   # two separate flags
program -c-x    # -c with argument "-x"
</code></pre></div></div>

<p>Optional option arguments should be used judiciously since they can be
surprising, but they have their uses.</p>
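<p>The short option rules above fit in a small parser. What follows is a
minimal sketch, not any particular library's implementation: the function
name <code class="language-plaintext highlighter-rouge">parse_short</code> and its three option-letter strings are hypothetical,
chosen only for illustration. It handles grouping, attached and detached
required arguments, and optional arguments that count only when attached.</p>

```python
def parse_short(argv, flags, required, optional):
    """Parse leading short options (hypothetical helper, illustration only).

    flags, required, optional: strings of option letters that take no
    argument, a required argument, or an optional argument.
    Returns (options, rest): a list of (letter, argument-or-None) pairs
    and the remaining non-option arguments.
    """
    options = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--":                  # explicit end of options
            i += 1
            break
        if arg == "-" or not arg.startswith("-"):
            break                        # first non-option ends parsing
        i += 1
        for j, letter in enumerate(arg[1:], start=1):
            tail = arg[j + 1:]           # whatever follows this letter
            if letter in flags:
                options.append((letter, None))
            elif letter in optional:
                # The argument counts only when attached: -cblue
                options.append((letter, tail or None))
                break
            elif letter in required:
                if tail:                 # attached: -ooutput.txt
                    options.append((letter, tail))
                elif i < len(argv):      # detached: -o output.txt
                    options.append((letter, argv[i]))
                    i += 1
                else:
                    raise ValueError("option requires an argument: -" + letter)
                break
            else:
                raise ValueError("unknown option: -" + letter)
    return options, argv[i:]
```

<p>Because an option that takes an argument consumes the rest of its group,
it must come last in the group, which is the rule stated earlier.</p>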

<p>Options can typically appear in any order — something parsers often
achieve via <em>permutation</em> — but non-options typically follow options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b foo bar
program -b -a foo bar
</code></pre></div></div>

<p>GNU-style programs usually allow options and non-options to be mixed,
though I don’t consider this to be essential.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a foo -b bar
program foo -a -b bar
program foo bar -a -b
</code></pre></div></div>

<p>If a non-option looks like an option because it starts with a hyphen,
use <code class="language-plaintext highlighter-rouge">--</code> to demarcate options from non-options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -- -x foo bar
</code></pre></div></div>

<p>An advantage of requiring that non-options follow options is that the
first non-option demarcates the two groups, so <code class="language-plaintext highlighter-rouge">--</code> is less often
needed.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># note: without argument permutation
program -a -b foo -x bar  # 2 options, 3 non-options
</code></pre></div></div>
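<p>The difference between permuting and non-permuting parsers can be
sketched in a few lines. This is an illustration under simplifying
assumptions: the function name <code class="language-plaintext highlighter-rouge">split_args</code> and its <code class="language-plaintext highlighter-rouge">permute</code> parameter
are hypothetical, and only argument-less options are considered.</p>

```python
def split_args(argv, permute=False):
    """Separate flag-style options from non-options (hypothetical helper).

    Only argument-less options are handled, to keep the sketch short.
    With permute=False, the first non-option ends option parsing.
    With permute=True, options may be mixed with non-options.
    """
    options, operands = [], []
    for i, arg in enumerate(argv):
        if arg == "--":
            operands.extend(argv[i + 1:])   # everything after -- is literal
            break
        if arg.startswith("-") and arg != "-":
            options.append(arg)
        elif permute:
            operands.append(arg)            # keep scanning for more options
        else:
            operands.extend(argv[i:])       # first non-option ends options
            break
    return options, operands


# Without permutation, -x is an ordinary non-option here.
print(split_args(["-a", "-b", "foo", "-x", "bar"]))
# With permutation, -- is needed to protect -x.
print(split_args(["-a", "-b", "--", "-x", "foo", "bar"], permute=True))
```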

<h3 id="long-options">Long options</h3>

<p>Since short options can be cryptic, and there are such a limited number
of them, more complex programs support long options. A long option
starts with two hyphens followed by one or more alphanumeric, lowercase
words. Hyphens separate words. Using two hyphens prevents long options
from being confused with grouped short options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --reverse --ignore-backups
</code></pre></div></div>

<p>Occasionally flags are paired with a mutually exclusive inverse flag
that begins with <code class="language-plaintext highlighter-rouge">--no-</code>. Providing both from the start avoids a future
<em>flag day</em>: if the default ever changes, the flag that restores the
original behavior already exists in earlier releases.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --sort
program --no-sort
</code></pre></div></div>

<p>Long options can similarly accept arguments.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --output output.txt --block-size 1024
</code></pre></div></div>

<p>These may optionally be connected to the argument with an equals sign
<code class="language-plaintext highlighter-rouge">=</code>, much like omitting the space for a short option argument.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --output=output.txt --block-size=1024
</code></pre></div></div>

<p>As before, this opens the door to optional option arguments. Because the
<code class="language-plaintext highlighter-rouge">=</code> is required whenever the argument is present, this is still unambiguous.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --color --reverse
program --color=never --reverse
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">--</code> retains its original behavior of disambiguating option-like
non-option arguments:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --reverse -- --foo bar
</code></pre></div></div>
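<p>The long option rules can be sketched the same way. This is only an
illustration: the function name <code class="language-plaintext highlighter-rouge">parse_long</code> and its three name sets are
hypothetical, and abbreviation and permutation are deliberately left out.</p>

```python
def parse_long(argv, flags, required, optional):
    """Parse leading long options (hypothetical helper, illustration only).

    flags, required, optional: sets of option names, without the "--",
    that take no argument, a required argument, or an optional argument.
    Returns (options, rest). options maps each name to its argument;
    flags and omitted optional arguments map to None.
    """
    options = {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--":                     # explicit end of options
            i += 1
            break
        if not arg.startswith("--"):
            break                           # first non-option ends parsing
        i += 1
        name, eq, value = arg[2:].partition("=")
        if name in flags:
            if eq:
                raise ValueError("option takes no argument: --" + name)
            options[name] = None
        elif name in optional:
            # The argument counts only when joined with "=".
            options[name] = value if eq else None
        elif name in required:
            if eq:                          # --output=output.txt
                options[name] = value
            elif i < len(argv):             # --output output.txt
                options[name] = argv[i]
                i += 1
            else:
                raise ValueError("option requires an argument: --" + name)
        else:
            raise ValueError("unknown option: --" + name)
    return options, argv[i:]
```

<p>Note that <code class="language-plaintext highlighter-rouge">--color never</code> leaves <code class="language-plaintext highlighter-rouge">never</code> as a non-option in this
sketch, since an optional argument is only recognized after <code class="language-plaintext highlighter-rouge">=</code>.</p>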

<h3 id="subcommands">Subcommands</h3>

<p>Some programs, such as Git, have subcommands each with their own
options. The main program itself may still have its own options distinct
from subcommand options. The program’s options come before the
subcommand and subcommand options follow the subcommand. Options are
never permuted around the subcommand.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -c subcommand -x -y -z
program -abc subcommand -xyz
</code></pre></div></div>

<p>Above, the <code class="language-plaintext highlighter-rouge">-a</code>, <code class="language-plaintext highlighter-rouge">-b</code>, and <code class="language-plaintext highlighter-rouge">-c</code> options are for <code class="language-plaintext highlighter-rouge">program</code>, and the
others are for <code class="language-plaintext highlighter-rouge">subcommand</code>. So, really, the subcommand is another
command line of its own.</p>
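<p>Treating the subcommand as a command line of its own suggests a simple
split. The sketch below assumes, for brevity, that the program's own
options are all argument-less flags, so the first argument that does not
start with a hyphen is the subcommand. The name <code class="language-plaintext highlighter-rouge">split_subcommand</code> is
hypothetical.</p>

```python
def split_subcommand(argv):
    """Split argv into (program options, subcommand, subcommand arguments).

    Hypothetical helper. Assumes the program's own options take no
    arguments, so the first argument not starting with "-" is the
    subcommand. Nothing is permuted across that boundary.
    """
    for i, arg in enumerate(argv):
        if not arg.startswith("-"):
            return argv[:i], arg, argv[i + 1:]
    return argv, None, []                   # no subcommand given


print(split_subcommand(["-abc", "subcommand", "-xyz"]))
```

<p>Each half can then be handed to its own option parser.</p>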

<h3 id="option-parsing-libraries">Option parsing libraries</h3>

<p>There’s little excuse for not getting these conventions right assuming
you’re interested in following the conventions. Short options can be
parsed correctly in <a href="https://github.com/skeeto/getopt">just ~60 lines of C code</a>. Long options are
<a href="https://github.com/skeeto/optparse">just slightly more complex</a>.</p>

<p>GNU’s <code class="language-plaintext highlighter-rouge">getopt_long()</code> supports long option abbreviation — with no way to
disable it (!) — but <a href="https://utcc.utoronto.ca/~cks/space/blog/python/ArgparseAbbreviatedOptions">this should be avoided</a>.</p>

<p>Go’s <a href="https://golang.org/pkg/flag/">flag package</a> intentionally deviates from the conventions.
It only supports long option semantics, via a single hyphen. This makes
it impossible to support grouping even if all options are only one
letter. Also, the only way to combine option and argument into a single
command line argument is with <code class="language-plaintext highlighter-rouge">=</code>. It’s sound, but I miss both features
every time I write programs in Go. That’s why I <a href="https://github.com/skeeto/optparse-go">wrote my own argument
parser</a>. Not only does it have a nicer feature set, I like the API a
lot more, too.</p>

<p>Python’s primary option parsing library is <code class="language-plaintext highlighter-rouge">argparse</code>, and I just can’t
stand it. Despite appearing to follow convention, it actually breaks
convention <em>and</em> its behavior is unsound. For instance, the following
program has two options, <code class="language-plaintext highlighter-rouge">--foo</code> and <code class="language-plaintext highlighter-rouge">--bar</code>. The <code class="language-plaintext highlighter-rouge">--foo</code> option accepts
an optional argument, and the <code class="language-plaintext highlighter-rouge">--bar</code> option is a simple flag.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">sys</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--foo'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">nargs</span><span class="o">=</span><span class="s">'?'</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s">'X'</span><span class="p">)</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--bar'</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s">'store_true'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">:]))</span>
</code></pre></div></div>

<p>Here are some example runs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py
Namespace(bar=False, foo='X')

$ python parse.py --foo
Namespace(bar=False, foo=None)

$ python parse.py --foo=arg
Namespace(bar=False, foo='arg')

$ python parse.py --bar --foo
Namespace(bar=True, foo=None)

$ python parse.py --foo arg
Namespace(bar=False, foo='arg')
</code></pre></div></div>

<p>Everything looks good except the last. If the <code class="language-plaintext highlighter-rouge">--foo</code> argument is
optional then why did it consume <code class="language-plaintext highlighter-rouge">arg</code>? What happens if I follow it with
<code class="language-plaintext highlighter-rouge">--bar</code>? Will it consume it as the argument?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py --foo --bar
Namespace(bar=True, foo=None)
</code></pre></div></div>

<p>Nope! Unlike <code class="language-plaintext highlighter-rouge">arg</code>, it left <code class="language-plaintext highlighter-rouge">--bar</code> alone, so instead of following the
unambiguous conventions, it has its own ambiguous semantics and attempts
to remedy them with a “smart” heuristic: “If an optional argument <em>looks
like</em> an option, then it must be an option!” Non-option arguments can
never follow an option with an optional argument, which makes that
feature pretty useless. Since <code class="language-plaintext highlighter-rouge">argparse</code> does not properly support <code class="language-plaintext highlighter-rouge">--</code>,
that does not help.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py --foo -- arg
usage: parse.py [-h] [--foo [FOO]] [--bar]
parse.py: error: unrecognized arguments: -- arg
</code></pre></div></div>

<p>Please, stick to the conventions unless you have <em>really</em> good reasons
to break them!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Fibers: the Most Elegant Windows API</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/03/28/"/>
    <id>urn:uuid:abad2340-99e5-4d72-857c-848e37b4af73</id>
    <updated>2019-03-28T22:26:05Z</updated>
    <category term="win32"/><category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=19520078">on Hacker News</a>.</em></p>

<p>The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly,
and lacking good taste. Microsoft has done a pretty commendable job with
backwards compatibility, but the trade-off is that the API is filled to
the brim with historical cruft. Every hasty, poor design over the
decades is carried forward forever, and, in many cases, even built upon,
which essentially doubles down on past mistakes. POSIX certainly has its
own ugly corners, but those are the exceptions. In the Windows API,
elegance is the exception.</p>

<!--more-->

<p>That’s why, when I recently revisited the <a href="https://docs.microsoft.com/en-us/windows/desktop/procthread/fibers">Fibers API</a>, I was
pleasantly surprised. It’s one of the exceptions — much cleaner than the
optional, deprecated, and now obsolete <a href="/blog/2017/06/21/#coroutines">POSIX equivalent</a>. It’s
not quite an apples-to-apples comparison since the POSIX version is
slightly more powerful, and more complicated as a result. I’ll cover the
difference in this article.</p>

<p>For the last part of this article, I’ll walk through an async/await
framework built on top of fibers. The framework allows coroutines in C
programs to await on arbitrary kernel objects.</p>

<p><a href="https://github.com/skeeto/fiber-await"><strong>Fiber Async/await Demo</strong></a></p>

<h3 id="fibers">Fibers</h3>

<p>Windows fibers are really just <a href="https://blog.varunramesh.net/posts/stackless-vs-stackful-coroutines/">stackful</a>, symmetric coroutines.
From a different point of view, they’re cooperatively scheduled threads,
which is the source of the analogous name, <em>fibers</em>. They’re symmetric
because all fibers are equal, and no fiber is the “main” fiber. If <em>any</em>
fiber returns from its start routine, the program exits. (Older versions
of Wine will crash when this happens, but it was recently fixed.) It’s
equivalent to the process’ main thread returning from <code class="language-plaintext highlighter-rouge">main()</code>. The
initial fiber is free to create a second fiber, yield to it, then the
second fiber destroys the first.</p>

<p>For now I’m going to focus on the core set of fiber functions. There are
some additional capabilities I’m going to ignore, including support for
<em>fiber local storage</em>. The important functions are just these five:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">CreateFiber</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">stack_size</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">proc</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">SwitchToFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span><span class="p">);</span>
<span class="n">bool</span>  <span class="nf">ConvertFiberToThread</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">ConvertThreadToFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">DeleteFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span><span class="p">);</span>
</code></pre></div></div>

<p>To emphasize their simplicity, I’ve shown them here with more standard
prototypes than seen in their formal documentation. That documentation
uses the clunky Windows API typedefs still burdened with its 16-bit
heritage — e.g. <code class="language-plaintext highlighter-rouge">LPVOID</code> being a “long pointer” from the segmented memory
of the 8086:</p>

<ul>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-createfiber">CreateFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-switchtofiber">SwitchToFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-convertfibertothread">ConvertFiberToThread</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-convertthreadtofiber">ConvertThreadToFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-deletefiber">DeleteFiber</a></li>
</ul>

<p>Fibers are represented using opaque, void pointers. Maybe that’s a little
<em>too</em> simple since it’s easy to misuse in C, but I like it. The return
values for <code class="language-plaintext highlighter-rouge">CreateFiber()</code> and <code class="language-plaintext highlighter-rouge">ConvertThreadToFiber()</code> are void pointers
since these both create fibers.</p>

<p>The fiber start routine returns nothing and takes a void “user pointer”.
That’s nearly what I’d expect, except that it would probably make more
sense for a fiber to return <code class="language-plaintext highlighter-rouge">int</code>, which is <a href="/blog/2016/01/31/">more in line with</a>
<code class="language-plaintext highlighter-rouge">main</code> / <code class="language-plaintext highlighter-rouge">WinMain</code> / <code class="language-plaintext highlighter-rouge">mainCRTStartup</code> / <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code>. As I said,
when any fiber returns from its start routine, it’s like returning from
the main function, so it should probably have returned an integer.</p>

<p>A fiber may delete itself, which is the same as exiting the thread.
However, a fiber cannot yield (e.g. <code class="language-plaintext highlighter-rouge">SwitchToFiber()</code>) to itself. That’s
undefined behavior.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">coup</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">king</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"Long live the king!"</span><span class="p">);</span>
    <span class="n">DeleteFiber</span><span class="p">(</span><span class="n">king</span><span class="p">);</span>
    <span class="n">ConvertFiberToThread</span><span class="p">();</span> <span class="cm">/* seize the main thread */</span>
    <span class="cm">/* ... */</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">king</span> <span class="o">=</span> <span class="n">ConvertThreadToFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">pretender</span> <span class="o">=</span> <span class="n">CreateFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">coup</span><span class="p">,</span> <span class="n">king</span><span class="p">);</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">pretender</span><span class="p">);</span>
    <span class="n">abort</span><span class="p">();</span> <span class="cm">/* unreachable */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Only fibers can yield to fibers, but when the program starts up, there
are no fibers. At least one thread must first convert itself into a
fiber using <code class="language-plaintext highlighter-rouge">ConvertThreadToFiber()</code>, which returns the fiber object
that represents itself. It takes one argument analogous to the last
argument of <code class="language-plaintext highlighter-rouge">CreateFiber()</code>, except that there’s no start routine to
accept it. The process is reversed with <code class="language-plaintext highlighter-rouge">ConvertFiberToThread()</code>.</p>

<p>Fibers don’t belong to any particular thread and can be scheduled on any
thread <em>if</em> properly synchronized. Obviously one should never yield to the
same fiber in two different threads at the same time.</p>

<h3 id="contrast-with-posix">Contrast with POSIX</h3>

<p>The equivalent on POSIX systems was context switching. It’s also stackful
and symmetric, but it has just three important functions:
<a href="http://man7.org/linux/man-pages/man3/setcontext.3.html"><code class="language-plaintext highlighter-rouge">getcontext(3)</code></a>, <a href="http://man7.org/linux/man-pages/man3/makecontext.3.html"><code class="language-plaintext highlighter-rouge">makecontext(3)</code></a>, and
<a href="http://man7.org/linux/man-pages/man3/makecontext.3.html"><code class="language-plaintext highlighter-rouge">swapcontext(3)</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">getcontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">makecontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(),</span> <span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="p">...);</span>
<span class="kt">int</span>  <span class="nf">swapcontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">oucp</span><span class="p">,</span> <span class="k">const</span> <span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">);</span>
</code></pre></div></div>

<p>These are roughly equivalent to <a href="https://docs.microsoft.com/en-us/windows/desktop/api/winnt/nf-winnt-getcurrentfiber"><code class="language-plaintext highlighter-rouge">GetCurrentFiber()</code></a>,
<code class="language-plaintext highlighter-rouge">CreateFiber()</code>, and <code class="language-plaintext highlighter-rouge">SwitchToFiber()</code>. There is no need for
<code class="language-plaintext highlighter-rouge">ConvertThreadToFiber()</code> or its inverse since threads can context switch without
preparation. There’s also no <code class="language-plaintext highlighter-rouge">DeleteFiber()</code> because the resources are
managed by the program itself. That’s where POSIX contexts are a little
bit more powerful.</p>

<p>The first argument to <code class="language-plaintext highlighter-rouge">CreateFiber()</code> is the desired stack size, with
zero indicating the default stack size. The stack is allocated and freed
by the operating system. The downside is that the caller doesn’t have a
choice in managing the lifetime of this stack and how it’s allocated. If
you’re frequently creating and destroying coroutines, those stacks are
constantly being allocated and freed.</p>

<p>In <code class="language-plaintext highlighter-rouge">makecontext(3)</code>, the caller allocates and supplies the stack. Freeing
that stack is equivalent to destroying the context. A program that
frequently creates and destroys contexts can maintain a stack pool or
otherwise more efficiently manage their allocation. This makes it more
powerful, but it also makes it a little more complicated. It would be hard
to remember how to do all this without a careful reading of the
documentation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Create a context */</span>
<span class="n">ucontext_t</span> <span class="n">ctx</span><span class="p">;</span>
<span class="n">getcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">);</span> <span class="cm">/* initialize ctx before customizing it */</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">SIGSTKSZ</span><span class="p">);</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_size</span> <span class="o">=</span> <span class="n">SIGSTKSZ</span><span class="p">;</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_link</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">makecontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">proc</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="cm">/* Destroy a context */</span>
<span class="n">free</span><span class="p">(</span><span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span><span class="p">);</span>
</code></pre></div></div>

<p>Note how <code class="language-plaintext highlighter-rouge">makecontext(3)</code> is variadic (<code class="language-plaintext highlighter-rouge">...</code>), passing its arguments on
to the start routine of the context. This seems like it might be better
than a user pointer. Unfortunately it’s not, since those arguments are
strictly limited to <em>integers</em>.</p>

<p>Ultimately I like the fiber API better. The first time I tried it out, I
could guess my way through it without looking closely at the
documentation.</p>

<h3 id="async--await-with-fibers">Async / await with fibers</h3>

<p>Why was I looking at the Fiber API? I’ve known about coroutines for
years but I didn’t understand how they could be useful. Sure, the
function can yield, but what other coroutine should it yield to? It
wasn’t until I was <a href="/blog/2019/03/10/">recently bitten by the async/await bug</a> that I
finally saw a “killer feature” that justified their use. Generators come
pretty close, though.</p>

<p>Windows fibers are a coroutine primitive suitable for async/await in C
programs, where <a href="/blog/2019/03/22/">it can also be useful</a>. To prove that it’s
possible, I built async/await on top of fibers in <a href="https://github.com/skeeto/fiber-await/blob/master/async.c">95 lines of code</a>.</p>

<p>The alternatives are to use a <a href="https://www.gnu.org/software/pth/">third-party coroutine library</a> or to
do it myself <a href="/blog/2015/05/15/">with some assembly programming</a>. However, having it
built into the operating system is quite convenient! It’s unfortunate
that it’s limited to Windows. Ironically, though, everything I wrote for
this article, including the async/await demonstration, was originally
written on Linux using Mingw-w64 and tested using <a href="https://www.winehq.org/">Wine</a>. Only
after I was done did I even try it on Windows.</p>

<p>Before diving into how it works, there’s a general concept about the
Windows API that must be understood: <strong>All kernel objects can be in
either a signaled or unsignaled state.</strong> The API provides functions that
block on a kernel object until it is signaled. The two important ones
are <a href="https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitforsingleobject"><code class="language-plaintext highlighter-rouge">WaitForSingleObject()</code></a> and <a href="https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitformultipleobjects"><code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code></a>.
The latter behaves very much like <code class="language-plaintext highlighter-rouge">poll(2)</code> in POSIX.</p>

<p>Usually the signal is tied to some useful event, like a process or
thread exiting, the completion of an I/O operation (i.e. asynchronous
overlapped I/O), a semaphore being incremented, etc. It’s a generic way
to wait for some event. <strong>However, instead of blocking the thread,
wouldn’t it be nice to <em>await</em> on the kernel object?</strong> In my <code class="language-plaintext highlighter-rouge">aio</code>
library for Emacs, the fundamental “wait” object was a promise. For this
API it’s a kernel object handle.</p>

<p>So, the await function will take a kernel object, register it with the
scheduler, then yield to the scheduler. The scheduler — which is a
global variable, so there’s only one scheduler per process — looks like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">main_fiber</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">handles</span><span class="p">[</span><span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">];</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">fibers</span><span class="p">[</span><span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">];</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">dead_fiber</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span> <span class="n">async_loop</span><span class="p">;</span>
</code></pre></div></div>

<p>While fibers are symmetric, coroutines in my async/await implementation
are not. One fiber is the scheduler, <code class="language-plaintext highlighter-rouge">main_fiber</code>, and the other fibers
always yield to it.</p>

<p>There is an array of kernel object handles, <code class="language-plaintext highlighter-rouge">handles</code>, and an array of
<code class="language-plaintext highlighter-rouge">fibers</code>. The elements in these arrays are paired with each other, but
it’s convenient to store them separately, as I’ll show soon. <code class="language-plaintext highlighter-rouge">fibers[0]</code>
is waiting on <code class="language-plaintext highlighter-rouge">handles[0]</code>, and so on.</p>

<p>The array is a fixed size, <code class="language-plaintext highlighter-rouge">MAXIMUM_WAIT_OBJECTS</code> (64), because there’s
a hard limit on the number of fibers that can wait at once. This
pathetically small limitation is an unfortunate, hard-coded restriction
of the Windows API. It kills most practical uses of my little library.
Fortunately there’s no limit on the number of handles we might want to
wait on, just the number of co-existing fibers.</p>

<p>When a fiber is about to return from its start routine, it yields one
last time and registers itself on the <code class="language-plaintext highlighter-rouge">dead_fiber</code> member. The scheduler
will delete this fiber as soon as it’s given control. Fibers never
<em>truly</em> return since that would terminate the program.</p>

<p>With this, the await function, <code class="language-plaintext highlighter-rouge">async_await()</code>, is pretty simple. It
registers the handle with the scheduler, then yields to the scheduler
fiber.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_await</span><span class="p">(</span><span class="n">HANDLE</span> <span class="n">h</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">h</span><span class="p">;</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">GetCurrentFiber</span><span class="p">();</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="o">++</span><span class="p">;</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">main_fiber</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Caveat: The scheduler destroys this handle with <code class="language-plaintext highlighter-rouge">CloseHandle()</code> after it
signals, so don’t try to reuse it. This made my demonstration simpler,
but it might be better to not do this.</p>

<p>A fiber can exit at any time. Such an exit is inserted implicitly before
a fiber actually returns:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span> <span class="o">=</span> <span class="n">GetCurrentFiber</span><span class="p">();</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">main_fiber</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The start routine given to <code class="language-plaintext highlighter-rouge">async_start()</code> is actually wrapped in the
real start routine. This is how <code class="language-plaintext highlighter-rouge">async_exit()</code> is injected:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">fiber_wrapper</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="o">*</span><span class="n">fw</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="n">fw</span><span class="o">-&gt;</span><span class="n">func</span><span class="p">(</span><span class="n">fw</span><span class="o">-&gt;</span><span class="n">arg</span><span class="p">);</span>
    <span class="n">async_exit</span><span class="p">();</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">async_start</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span> <span class="o">==</span> <span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="n">fw</span> <span class="o">=</span> <span class="p">{</span><span class="n">func</span><span class="p">,</span> <span class="n">arg</span><span class="p">};</span>
        <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">CreateFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">fiber_wrapper</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fw</span><span class="p">));</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The library provides a single awaitable function, <code class="language-plaintext highlighter-rouge">async_sleep()</code>. It
creates a “waitable timer” object, starts the countdown, and returns it.
(Notice how <code class="language-plaintext highlighter-rouge">SetWaitableTimer()</code> is a typically-ugly Win32 function with
excessive parameters.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span>
<span class="nf">async_sleep</span><span class="p">(</span><span class="kt">double</span> <span class="n">seconds</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">promise</span> <span class="o">=</span> <span class="n">CreateWaitableTimer</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">LARGE_INTEGER</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">t</span><span class="p">.</span><span class="n">QuadPart</span> <span class="o">=</span> <span class="p">(</span><span class="kt">long</span> <span class="kt">long</span><span class="p">)(</span><span class="n">seconds</span> <span class="o">*</span> <span class="o">-</span><span class="mi">10000000</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>
    <span class="n">SetWaitableTimer</span><span class="p">(</span><span class="n">promise</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">t</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">promise</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
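
<p>The sign and scale of that <code class="language-plaintext highlighter-rouge">QuadPart</code> value are easy to get wrong: a negative due time tells <code class="language-plaintext highlighter-rouge">SetWaitableTimer()</code> that the time is relative rather than absolute, and the unit is 100 nanoseconds, hence the factor of ten million. Here is that conversion on its own as a portable sketch. The helper name <code class="language-plaintext highlighter-rouge">relative_due_time()</code> is invented for illustration and is not part of the library.</p>

```c
#include <stdint.h>

/* Convert a delay in seconds into the relative due time expected by
 * SetWaitableTimer(): a negative count of 100-nanosecond intervals.
 * Hypothetical helper, not part of the original library. */
int64_t
relative_due_time(double seconds)
{
    return (int64_t)(seconds * -10000000.0);
}
```

<p>So a half-second sleep becomes -5,000,000, and the cast truncates any fraction of a 100-nanosecond tick.</p>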

<p>A more realistic example would be overlapped I/O. For example, you’d
open a file (<code class="language-plaintext highlighter-rouge">CreateFile()</code>) in overlapped mode, then when you, say,
read from that file (<code class="language-plaintext highlighter-rouge">ReadFile()</code>) you create an event object
(<code class="language-plaintext highlighter-rouge">CreateEvent()</code>), populate an overlapped I/O structure with the event,
offset, and length, then finally await on the event object. The fiber
will be resumed when the operation is complete.</p>

<p>Side note: Unfortunately <a href="https://blog.libtorrent.org/2012/10/asynchronous-disk-io/">overlapped I/O doesn’t work correctly for
files</a>, and many operations can’t be done asynchronously, like
opening files. When it comes to files, you’re <a href="https://blog.libtorrent.org/2012/10/asynchronous-disk-io/">better off using
dedicated threads</a> as <a href="http://docs.libuv.org/en/v1.x/design.html#file-i-o">libuv does</a> instead of overlapped I/O.
You can still await on these operations. You’d just await on the signal
from the thread doing synchronous I/O, not from overlapped I/O.</p>

<p>The most complex part is the scheduler, and it’s really not complex at
all:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_run</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Wait for next event */</span>
        <span class="n">DWORD</span> <span class="n">nhandles</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">;</span>
        <span class="n">HANDLE</span> <span class="o">*</span><span class="n">handles</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">;</span>
        <span class="n">DWORD</span> <span class="n">r</span> <span class="o">=</span> <span class="n">WaitForMultipleObjects</span><span class="p">(</span><span class="n">nhandles</span><span class="p">,</span> <span class="n">handles</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">INFINITE</span><span class="p">);</span>

        <span class="cm">/* Remove event and fiber from waiting array */</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">r</span><span class="p">];</span>
        <span class="n">CloseHandle</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">r</span><span class="p">]);</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">nhandles</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">nhandles</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="o">--</span><span class="p">;</span>

        <span class="cm">/* Run the fiber */</span>
        <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">fiber</span><span class="p">);</span>

        <span class="cm">/* Destroy the fiber if it exited */</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">DeleteFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span><span class="p">);</span>
            <span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is why the handles are in their own array. The array can be passed
directly to <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code>. The return value indicates which
handle was signaled. The handle is closed, the entry removed from the
scheduler, and then the fiber is resumed.</p>
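
<p>The removal step is a swap-remove across two parallel arrays. To show it in isolation, here is a minimal, portable sketch that uses plain integers in place of <code class="language-plaintext highlighter-rouge">HANDLE</code> values and fiber pointers. The names <code class="language-plaintext highlighter-rouge">struct waitlist</code>, <code class="language-plaintext highlighter-rouge">waitlist_remove()</code>, and <code class="language-plaintext highlighter-rouge">MAX_WAITERS</code> are assumptions made up for this example.</p>

```c
#include <assert.h>

#define MAX_WAITERS 64  /* stand-in for MAXIMUM_WAIT_OBJECTS */

/* Simplified stand-in for the scheduler state: two parallel arrays
 * where waiters[i] is waiting on handles[i]. Plain ints replace
 * HANDLE and fiber pointers so this compiles anywhere. */
struct waitlist {
    int handles[MAX_WAITERS];
    int waiters[MAX_WAITERS];
    int count;
};

/* Remove entry i in constant time by moving the last entry into its
 * slot. Order is not preserved. Returns the waiter that was removed. */
int
waitlist_remove(struct waitlist *w, int i)
{
    assert(i >= 0 && i < w->count);
    int waiter = w->waiters[i];
    w->handles[i] = w->handles[w->count - 1];
    w->waiters[i] = w->waiters[w->count - 1];
    w->count--;
    return waiter;
}
```

<p>One side effect worth knowing: the entries get reordered. Since <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> reports the lowest signaled index when several handles are ready at once, the order influences which fiber runs first.</p>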

<p>That <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> is what limits the number of fibers.
It’s not possible to wait on more than 64 handles at once! This is
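hard-coded into the API. How? The return value doubles as the index of the
signaled handle, and higher values are already reserved for other results
(<code class="language-plaintext highlighter-rouge">WAIT_ABANDONED_0</code> is 128, for instance), so the index range can’t
simply grow without breaking the API. Remember what I said about being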
locked into bad design decisions of the past?</p>

<p>To be fair, <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> was a doomed API anyway, just
like <code class="language-plaintext highlighter-rouge">select(2)</code> and <code class="language-plaintext highlighter-rouge">poll(2)</code> in POSIX. It scales very poorly since the
entire array of objects being waited on must be traversed on each call.
That’s terribly inefficient when waiting on large numbers of objects.
This sort of problem is solved by interfaces like kqueue (BSD), epoll
(Linux), and IOCP (Windows). Unfortunately <a href="https://news.ycombinator.com/item?id=11866562">IOCP doesn’t really fit this
particular problem well</a> — awaiting on kernel objects — so I
couldn’t use it.</p>

<p>When the awaiting fiber count is zero and the scheduler has control, all
fibers must have completed and there’s nothing left to do. However, the
caller can schedule more fibers and then restart the scheduler if
desired.</p>

<p>That’s all there is to it. Have a look at <a href="https://github.com/skeeto/fiber-await/blob/master/demo.c"><code class="language-plaintext highlighter-rouge">demo.c</code></a> to see how
the API looks in some trivial examples. On Linux you can see it in
action with <code class="language-plaintext highlighter-rouge">make check</code>. On Windows, you just <a href="/blog/2016/06/13/">need to compile
it</a>, then run it like a normal program. If there was a better
function than <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> in the Windows API, I would
have considered turning this demonstration into a real library.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Endlessh: an SSH Tarpit</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/03/22/"/>
    <id>urn:uuid:5429ee15-3d42-4af2-8690-f7f402870dd0</id>
    <updated>2019-03-22T17:26:45Z</updated>
    <category term="netsec"/><category term="python"/><category term="c"/><category term="posix"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=19465967">on Hacker News</a> (<a href="https://news.ycombinator.com/item?id=24491453">later</a>), <a href="https://old.reddit.com/r/programming/comments/b4iq00/endlessh_an_ssh_tarpit/">on
reddit</a> (<a href="https://old.reddit.com/r/netsec/comments/b4dwjl/endlessh_an_ssh_tarpit/">also</a>), featured in <a href="https://www.youtube.com/watch?v=bM65iyRRW0A&amp;t=3m52s">BSD Now 294</a>.
Also check out <a href="https://github.com/bediger4000/ssh-tarpit-behavior">this Endlessh analysis</a>.</em></p>

<p>I’m a big fan of tarpits: a network service that intentionally inserts
delays in its protocol, slowing down clients by forcing them to wait.
This arrests the speed at which a bad actor can attack or probe the
host system, and it ties up some of the attacker’s resources that
might otherwise be spent attacking another host. When done well, a
tarpit imposes more cost on the attacker than the defender.</p>

<!--more-->

<p>The Internet is a very hostile place, and anyone who’s ever stood up
an Internet-facing IPv4 host has witnessed the immediate and
continuous attacks against their server. I’ve maintained <a href="/blog/2017/06/15/">such a
server</a> for nearly six years now, and more than 99% of my
incoming traffic has ill intent. One part of my defenses has been
tarpits in various forms. The latest addition is an SSH tarpit I wrote
a couple of months ago:</p>

<p><a href="https://github.com/skeeto/endlessh"><strong>Endlessh: an SSH tarpit</strong></a></p>

<p>This program opens a socket and pretends to be an SSH server. However,
it actually just ties up SSH clients with false promises indefinitely
— or at least until the client eventually gives up. After cloning the
repository, here’s how you can try it out for yourself (default port
2222):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make
$ ./endlessh &amp;
$ ssh -p2222 localhost
</code></pre></div></div>

<p>Your SSH client will hang there and wait for at least several days
before finally giving up. Like a mammoth in the La Brea Tar Pits, it
got itself stuck and can’t get itself out. As I write, my
Internet-facing SSH tarpit currently has 27 clients trapped in it. A
few of these have been connected for weeks. In one particular spike it
had 1,378 clients trapped at once, lasting about 20 hours.</p>

<p>My Internet-facing Endlessh server listens on port 22, which is the
standard SSH port. I long ago moved my real SSH server off to another
port where it sees a whole lot less SSH traffic — essentially none.
This makes the logs a whole lot more manageable. And (hopefully)
Endlessh convinces attackers not to look around for an SSH server on
another port.</p>

<p>How does it work? Endlessh exploits <a href="https://tools.ietf.org/html/rfc4253#section-4.2">a little paragraph in RFC
4253</a>, the SSH protocol specification. Immediately after the TCP
connection is established, and before negotiating the cryptography,
both ends send an identification string:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SSH-protoversion-softwareversion SP comments CR LF
</code></pre></div></div>

<p>The RFC also notes:</p>

<blockquote>
  <p>The server MAY send other lines of data before sending the version
string.</p>
</blockquote>

<p>There is no limit on the number of lines, just that these lines must
not begin with “SSH-” since that would be ambiguous with the
identification string, and lines must not be longer than 255
characters including CRLF. So <strong>Endlessh sends an <em>endless</em> stream of
randomly-generated “other lines of data”</strong> without ever intending to
send a version string. By default it waits 10 seconds between each
line. This slows down the protocol, but prevents it from actually
timing out.</p>
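
<p>Those two constraints are all a line generator has to respect. As a rough sketch, not necessarily how Endlessh itself does it, a generator could look like the following. The name <code class="language-plaintext highlighter-rouge">random_line()</code>, its parameters, and the use of <code class="language-plaintext highlighter-rouge">rand()</code> as the random source are all assumptions for illustration.</p>

```c
#include <stdlib.h>
#include <string.h>

/* Fill buf with one random line of printable ASCII ending in CRLF and
 * return its length. maxlen caps the total length including CRLF and
 * is clamped to 3..255, the upper bound coming from RFC 4253. buf must
 * have room for maxlen + 1 bytes. Illustrative only: rand() is just a
 * placeholder random source. */
int
random_line(char *buf, int maxlen)
{
    if (maxlen < 3) maxlen = 3;
    if (maxlen > 255) maxlen = 255;
    int len = 3 + rand() % (maxlen - 2);  /* 3..maxlen, CRLF included */
    for (int i = 0; i < len - 2; i++) {
        buf[i] = 32 + rand() % 95;        /* ' ' through '~' */
    }
    buf[len - 2] = '\r';
    buf[len - 1] = '\n';
    buf[len] = 0;
    if (!strncmp(buf, "SSH-", 4)) {
        buf[0] = 'X';  /* must not be mistaken for a version string */
    }
    return len;
}
```

<p>The result is null-terminated, so it can be handed to <code class="language-plaintext highlighter-rouge">write(2)</code> with <code class="language-plaintext highlighter-rouge">strlen()</code> or with the returned length.</p>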

<p>This means Endlessh need not know anything about cryptography or the
vast majority of the SSH protocol. It’s dead simple.</p>

<h3 id="implementation-strategies">Implementation strategies</h3>

<p>Ideally the tarpit’s resource footprint should be as small as
possible. It’s just a security tool, and the server does have an
actual purpose that doesn’t include being a tarpit. It should tie up
the attacker’s resources, not the server’s, and should generally be
unnoticeable. (Take note all those who write the awful “security”
products I have to tolerate at my day job.)</p>

<p>Even when many clients have been trapped, Endlessh spends more than
99.999% of its time waiting around, doing nothing. It wouldn’t even be
accurate to call it I/O-bound. If anything, it’s <em>timer-bound</em>,
waiting around before sending off the next line of data. <strong>The most
precious resource to conserve is <em>memory</em>.</strong></p>

<h4 id="processes">Processes</h4>

<p>The most straightforward way to implement something like Endlessh is a
fork server: accept a connection, fork, and the child simply alternates
between <code class="language-plaintext highlighter-rouge">sleep(3)</code> and <code class="language-plaintext highlighter-rouge">write(2)</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">ssize_t</span> <span class="n">r</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">line</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>

    <span class="n">sleep</span><span class="p">(</span><span class="n">DELAY</span><span class="p">);</span>
    <span class="n">generate_line</span><span class="p">(</span><span class="n">line</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">write</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">line</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">line</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">errno</span> <span class="o">!=</span> <span class="n">EINTR</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A process per connection is a lot of overhead when connections are
expected to be up for hours or even weeks at a time. An attacker who knows
about this could exhaust the server’s resources with little effort by
opening up lots of connections.</p>

<h4 id="threads">Threads</h4>

<p>A better option is, instead of processes, to create a thread per
connection. On Linux <a href="/blog/2015/05/15/">this is practically the same thing</a>, but it’s
still better. However, you still have to allocate a stack for the thread
and the kernel will have to spend some resources managing the thread.</p>

<h4 id="poll">Poll</h4>

<p>For Endlessh I went for an even more lightweight version: a
single-threaded <code class="language-plaintext highlighter-rouge">poll(2)</code> server, analogous to stackless green threads.
The overhead per connection is about as low as it gets.</p>

<p>Clients that are being delayed are not registered in <code class="language-plaintext highlighter-rouge">poll(2)</code>. Their
only overhead is the socket object in the kernel, and another 78 bytes
to track them in Endlessh. Most of those bytes are used only for
accurate logging. Only those clients that are overdue for a new line
are registered for <code class="language-plaintext highlighter-rouge">poll(2)</code>.</p>

<p>When clients are waiting, but no clients are overdue, <code class="language-plaintext highlighter-rouge">poll(2)</code> is
essentially used in place of <code class="language-plaintext highlighter-rouge">sleep(3)</code>. Though since it still needs
to manage the <em>accept</em> server socket, it (almost) never actually waits
on <em>nothing</em>.</p>
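
<p>Using <code class="language-plaintext highlighter-rouge">poll(2)</code> as a sleep comes down to choosing its timeout argument. Here is one way that choice might be sketched, assuming deadlines are tracked as millisecond timestamps. The function <code class="language-plaintext highlighter-rouge">poll_timeout()</code> and its parameters are invented for this example and are not taken from Endlessh.</p>

```c
#include <limits.h>

/* Compute the timeout argument for poll(2) from the current time and
 * the earliest pending deadline, both in milliseconds.
 *   -1  no client is queued, so block until something happens
 *    0  a client is already overdue, so return immediately
 *   >0  milliseconds until the next client becomes overdue
 * Hypothetical helper for illustration; has_deadline is 0 when no
 * client is queued. */
int
poll_timeout(long long now_ms, long long next_deadline_ms, int has_deadline)
{
    if (!has_deadline) {
        return -1;
    }
    long long remaining = next_deadline_ms - now_ms;
    if (remaining <= 0) {
        return 0;
    }
    return remaining > INT_MAX ? INT_MAX : (int)remaining;
}
```

<p>The result goes straight into the third argument of <code class="language-plaintext highlighter-rouge">poll(2)</code>. With only the accept socket registered, a positive timeout behaves like a sleep that a new connection can cut short.</p>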

<p>There’s an option to limit the total number of client connections so
that it doesn’t get out of hand. In this case it will stop polling the
accept socket until a client disconnects. I probably shouldn’t have
bothered with this option and instead relied on <code class="language-plaintext highlighter-rouge">ulimit</code>, a feature
already provided by the operating system.</p>

<p>I could have used epoll (Linux) or kqueue (BSD), which would be much
more efficient than <code class="language-plaintext highlighter-rouge">poll(2)</code>. The problem with <code class="language-plaintext highlighter-rouge">poll(2)</code> is that it’s
constantly registering and unregistering Endlessh on each of the
overdue sockets each time around the main loop. This is by far the
most CPU-intensive part of Endlessh, and it’s all inflicted on the
kernel. Most of the time, even with thousands of clients trapped in
the tarpit, only a small number of them are polled at once, so I opted
for better portability instead.</p>

<p>One consequence of not polling connections that are waiting is that
disconnections aren’t noticed in a timely fashion. This makes the logs
less accurate than I like, but otherwise it’s pretty harmless.
Unfortunately, even if I wanted to fix this, the <code class="language-plaintext highlighter-rouge">poll(2)</code> interface
isn’t quite equipped for it anyway.</p>

<h4 id="raw-sockets">Raw sockets</h4>

<p>With a <code class="language-plaintext highlighter-rouge">poll(2)</code> server, the biggest overhead remaining is in the
kernel, where it allocates send and receive buffers for each client
and manages the proper TCP state. The next step to reducing this
overhead is Endlessh opening a <em>raw socket</em> and speaking TCP itself,
bypassing most of the operating system’s TCP/IP stack.</p>

<p>Much of the TCP connection state doesn’t matter to Endlessh and doesn’t
need to be tracked. For example, it doesn’t care about any data sent by
the client, so no receive buffer is needed, and any data that arrives
could be dropped on the floor.</p>

<p>Even more, raw sockets would allow for some even nastier tarpit tricks.
Despite the long delays between data lines, the kernel itself responds
very quickly on the TCP layer and below. ACKs are sent back quickly and
so on. An astute attacker could detect that the delay is artificial,
imposed above the TCP layer by an application.</p>

<p>If Endlessh worked at the TCP layer, it could <a href="https://nyman.re/super-simple-ssh-tarpit/">tarpit the TCP protocol
itself</a>. It could introduce artificial “noise” to the connection
that requires packet retransmissions, delay ACKs, etc. It would look a
lot more like network problems than a tarpit.</p>

<p>I haven’t taken Endlessh this far, nor do I plan to do so. At the
moment attackers either have a hard timeout, so this wouldn’t matter,
or they’re pretty dumb and Endlessh already works well enough.</p>

<h3 id="asyncio-and-other-tarpits">asyncio and other tarpits</h3>

<p>Since writing Endlessh <a href="/blog/2019/03/10/">I’ve learned about Python’s <code class="language-plaintext highlighter-rouge">asyncio</code></a>, and
it’s actually a near perfect fit for this problem. I should have just
used it in the first place. The hard part is already implemented within
<code class="language-plaintext highlighter-rouge">asyncio</code>, and the problem isn’t CPU-bound, so being written in Python
<a href="/blog/2019/02/24/">doesn’t matter</a>.</p>

<p>Here’s a simplified (no logging, no configuration, etc.) version of
Endlessh implemented in about 20 lines of Python 3.7:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">_reader</span><span class="p">,</span> <span class="n">writer</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
            <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'%x</span><span class="se">\r\n</span><span class="s">'</span> <span class="o">%</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">))</span>
            <span class="k">await</span> <span class="n">writer</span><span class="p">.</span><span class="n">drain</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">ConnectionResetError</span><span class="p">:</span>
        <span class="k">pass</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">server</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">start_server</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="s">'0.0.0.0'</span><span class="p">,</span> <span class="mi">2222</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">server</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">server</span><span class="p">.</span><span class="n">serve_forever</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>Since Python coroutines are stackless, the per-connection memory
overhead is comparable to the C version. So it seems asyncio is
perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up
attackers trying to exploit HTTP servers. It slowly sends a random,
endless HTTP header:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">_reader</span><span class="p">,</span> <span class="n">writer</span><span class="p">):</span>
    <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'HTTP/1.1 200 OK</span><span class="se">\r\n</span><span class="s">'</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
            <span class="n">header</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'X-%x: %x</span><span class="se">\r\n</span><span class="s">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">header</span><span class="p">,</span> <span class="n">value</span><span class="p">))</span>
            <span class="k">await</span> <span class="n">writer</span><span class="p">.</span><span class="n">drain</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">ConnectionResetError</span><span class="p">:</span>
        <span class="k">pass</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">server</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">start_server</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="s">'0.0.0.0'</span><span class="p">,</span> <span class="mi">8080</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">server</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">server</span><span class="p">.</span><span class="n">serve_forever</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>Try it out for yourself. Firefox and Chrome will spin on that server
for hours before giving up. I have yet to see curl actually time out on
its own in the default settings (<code class="language-plaintext highlighter-rouge">--max-time</code>/<code class="language-plaintext highlighter-rouge">-m</code> does work
correctly, though).</p>

<p>Parting exercise for the reader: Using the examples above as a starting
point, implement an SMTP tarpit using asyncio. Bonus points for using
TLS connections and testing it against real spammers.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A JIT Compiler Skirmish with SELinux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/11/15/"/>
    <id>urn:uuid:d4fa35ad-05c3-3b86-1083-d533dfacfb15</id>
    <updated>2018-11-15T18:57:47Z</updated>
    <category term="c"/><category term="linux"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>This is a debugging war story.</p>

<p>Once upon a time I wrote a fancy data conversion utility. The input
was a complex binary format defined by a data dictionary supplied at
run time by the user alongside the input data. Since the converter was
typically used to process massive quantities of input, and the nature
of that input wasn’t known until run time, I wrote <a href="/blog/2015/03/19/">an x86-64 JIT
compiler</a> to speed it up. The converter generated a fast, native
binary parser in memory according to the data dictionary
specification. Processing data now took much less time and everyone
rejoiced.</p>

<p>Then along came SELinux, Sheriff of Pedantry. Not liking all the
shenanigans with page protections, SELinux huffed and puffed and made
<code class="language-plaintext highlighter-rouge">mprotect(2)</code> return <code class="language-plaintext highlighter-rouge">EACCES</code> (“Permission denied”). Believing I was
following all the rules and so this would never happen, I foolishly
did not check the result and the converter was now crashing for its
users. What made SELinux so unhappy, and could this somehow be
resolved?</p>

<h3 id="allocating-memory">Allocating memory</h3>

<p>Before going further, let’s back up and review how this works. Suppose I
want to generate code at run time and execute it. In the old days this
was as simple as writing some machine code into a buffer and jumping to
that buffer — e.g. by converting the buffer to a function pointer and
calling it.</p>

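<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span>
<span class="k">typedef</span> <span class="nf">int</span> <span class="p">(</span><span class="o">*</span><span class="n">jit_func</span><span class="p">)(</span><span class="kt">void</span><span class="p">);</span>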

<span class="cm">/* NOTE: This doesn't work anymore! */</span>
<span class="n">jit_func</span>
<span class="nf">jit_compile</span><span class="p">(</span><span class="kt">int</span> <span class="n">retval</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">6</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* mov eax, retval */</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xb8</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
        <span class="cm">/* ret */</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xc3</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">jit_func</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">jit_func</span> <span class="n">f</span> <span class="o">=</span> <span class="n">jit_compile</span><span class="p">(</span><span class="mi">1001</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"f() = %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">f</span><span class="p">());</span>
    <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This situation was far too easy for malicious actors to abuse. An
attacker could supply instructions of their own choosing — i.e. <em>shell
code</em> — as input and exploit a buffer overflow vulnerability to execute
the input buffer. These exploits were trivial to craft.</p>

<p>Modern systems have hardware checks to prevent this from happening.
Memory containing instructions must have its execute protection bit
set before those instructions can be executed. This is useful both for
making attackers work harder and for catching bugs in programs — no more
executing data by accident.</p>

<p>This is further complicated by the fact that memory protections have
page granularity. You can’t adjust the protections for a 6-byte
buffer. You do it for the entire surrounding page — typically 4kB, but
sometimes as large as 2MB. This requires replacing that <code class="language-plaintext highlighter-rouge">malloc(3)</code>
with a more careful allocation strategy. There are a few ways to go
about this.</p>
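
<p>To make the granularity concrete, here is a minimal sketch of the arithmetic involved: ask the system for its page size, then round a requested length up to a whole number of pages. The helper names <code class="language-plaintext highlighter-rouge">query_pagesize</code> and <code class="language-plaintext highlighter-rouge">round_up_to_page</code> are made up for this illustration, and the bit trick assumes the page size is a power of two.</p>

```c
#include <stddef.h>
#include <unistd.h>

/* Query the page size at run time. Returns 0 if sysconf() fails. */
size_t
query_pagesize(void)
{
    long n = sysconf(_SC_PAGE_SIZE);
    return n > 0 ? (size_t)n : 0;
}

/* Round len up to a whole number of pages. Assumes pagesize is a
 * power of two, so the low bits can simply be masked off. */
size_t
round_up_to_page(size_t len, size_t pagesize)
{
    return (len + pagesize - 1) & ~(pagesize - 1);
}
```

<p>With 4kB pages, the 6-byte buffer from earlier rounds up to a full 4,096 bytes, so changing its protection affects everything else that happens to share the page.</p>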

<h4 id="anonymous-memory-mapping">Anonymous memory mapping</h4>

<p>The most common and most sensible is to create an anonymous memory
mapping: a file memory map that’s not actually backed by a file. The
<code class="language-plaintext highlighter-rouge">mmap(2)</code> function has a flag specifically for this purpose:
<code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">anon_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately, <code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> is not part of POSIX. If you’re being super
strict with your includes — <a href="/blog/2017/03/30/">as I tend to be</a> — this flag won’t be
defined, even on systems where it’s supported.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span><span class="c1">// MAP_ANONYMOUS undefined!</span>
</code></pre></div></div>

<p>To get the flag, you must use the <code class="language-plaintext highlighter-rouge">_BSD_SOURCE</code>, or, more recently,
the <code class="language-plaintext highlighter-rouge">_DEFAULT_SOURCE</code> feature test macro to explicitly enable that
feature.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#define _DEFAULT_SOURCE </span><span class="cm">/* for MAP_ANONYMOUS */</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>The POSIX way to do this is to instead map <code class="language-plaintext highlighter-rouge">/dev/zero</code>. <strong>So, wanting to
be Mr. Portable, this is what I did in my tool.</strong> Take careful note of
this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="s">"/dev/zero"</span><span class="p">,</span> <span class="n">O_RDWR</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="aligned-allocation">Aligned allocation</h4>

<p>Another, less common (and less portable) strategy is to lean on the
existing C memory allocator, being careful to allocate on page
boundaries so that the page protections don’t affect other allocations.
The classic allocation functions, like <code class="language-plaintext highlighter-rouge">malloc(3)</code>, don’t allow for this
kind of control. However, there are a couple of aligned allocation
alternatives.</p>

<p>The first is <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">posix_memalign</span><span class="p">(</span><span class="kt">void</span> <span class="o">**</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">alignment</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>By choosing page alignment and a size that’s a multiple of the page
size, it’s guaranteed to return whole pages. When done, pages are freed
with <code class="language-plaintext highlighter-rouge">free(3)</code>. Though, unlike unmapping, the original page protections
must first be restored since those pages may be reused.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">pagesize</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGE_SIZE</span><span class="p">);</span> <span class="c1">// TODO: cache this</span>
    <span class="kt">size_t</span> <span class="n">roundup</span> <span class="o">=</span> <span class="p">(</span><span class="n">len</span> <span class="o">+</span> <span class="n">pagesize</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">pagesize</span> <span class="o">*</span> <span class="n">pagesize</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">posix_memalign</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="n">pagesize</span><span class="p">,</span> <span class="n">roundup</span><span class="p">)</span> <span class="o">?</span> <span class="mi">0</span> <span class="o">:</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
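
<p>The matching release function has to undo any protection change before calling <code class="language-plaintext highlighter-rouge">free(3)</code>. A minimal sketch, assuming the buffer came from the <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code> allocator and that <code class="language-plaintext highlighter-rouge">len</code> is the rounded-up size; the name <code class="language-plaintext highlighter-rouge">aligned_free</code> and the choice to leak on failure are assumptions of this sketch:</p>

```c
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical counterpart to a posix_memalign(3)-based allocator.
 * p and len must describe whole pages. Restores read/write protection
 * so the allocator can safely reuse the pages, then frees them.
 * Returns 0 on success, or -1 if the protections could not be
 * restored, in which case the pages are deliberately leaked rather
 * than handed back to the allocator in an unusable state. */
int
aligned_free(void *p, size_t len)
{
    if (mprotect(p, len, PROT_READ | PROT_WRITE) == -1)
        return -1;
    free(p);
    return 0;
}
```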

<p>If you’re using C11, there’s also <code class="language-plaintext highlighter-rouge">aligned_alloc(3)</code>. This is the most
uncommon of all since most C programmers refuse to switch to a new
standard until it’s at least old enough to drive a car.</p>
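
<p>For completeness, a sketch of the same allocator on top of <code class="language-plaintext highlighter-rouge">aligned_alloc(3)</code>. The name <code class="language-plaintext highlighter-rouge">c11_page_alloc</code> is hypothetical. C11 requires the size to be a multiple of the alignment, so the length is rounded up to whole pages first:</p>

```c
#include <stdlib.h>
#include <unistd.h>

/* Page-aligned allocation using C11 aligned_alloc(3).
 * Returns a null pointer on failure or when len is zero. */
void *
c11_page_alloc(size_t len)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    if (pagesize <= 0 || len == 0)
        return 0;
    size_t ps = (size_t)pagesize;
    if (len > (size_t)-1 - (ps - 1))
        return 0;  /* rounding up would overflow */
    size_t roundup = (len + ps - 1) / ps * ps;
    return aligned_alloc(ps, roundup);
}
```

<p>As with <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code>, the result is released with <code class="language-plaintext highlighter-rouge">free(3)</code> once its original protections are back in place.</p>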

<h3 id="changing-page-protections">Changing page protections</h3>

<p>So we’ve allocated our memory, but it’s not going to start in an
executable state. Why? Because a <a href="https://en.wikipedia.org/wiki/W%5EX">W^X</a> (“write xor execute”)
policy is becoming increasingly common. Attempting to set both write
and execute protections at the same time may be denied. (In fact,
there’s an SELinux policy for this.)</p>

<p>As a JIT compiler, we need to write to a page <em>and</em> execute it. Again,
there are two strategies. The complicated strategy is to <a href="/blog/2016/04/10/">map the same
memory at two different places</a>, one with the execute protection,
one with the write protection. This allows the page to be modified as
it’s being executed without violating W^X.</p>

<p>The simpler and more secure strategy is to write the machine
instructions, then swap the page over to executable using <code class="language-plaintext highlighter-rouge">mprotect(2)</code>
once it’s ready. This is what I was doing in my tool.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">anon_alloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="cm">/* ... write instructions into the buffer ... */</span>
<span class="n">mprotect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="n">jit_func</span> <span class="n">func</span> <span class="o">=</span> <span class="p">(</span><span class="n">jit_func</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="n">func</span><span class="p">();</span>
</code></pre></div></div>

<p>At a high level, that’s pretty close to what I was actually doing. That
includes neglecting to check the result of <code class="language-plaintext highlighter-rouge">mprotect(2)</code>. This worked
fine and dandy for several years, when suddenly (shown here in the style
<a href="/blog/2018/06/23/">of strace</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mprotect(ptr, len, PROT_EXEC) = -1 EACCES (Permission denied)
</code></pre></div></div>

<p>Then the program would crash trying to execute the buffer. Suddenly it
wasn’t allowed to make this buffer executable. My program hadn’t
changed. What <em>had</em> changed was the SELinux security policy on this
particular system.</p>

<h3 id="asking-for-help">Asking for help</h3>

<p>The problem is that I don’t administer this (Red Hat) system. I can’t
access the logs and I didn’t set the policy. I don’t have any insight
into <em>why</em> this call was suddenly being denied. To make this more
challenging, the folks that manage this system didn’t have the
necessary knowledge to help with this either.</p>

<p>So to figure this out, I need to treat it like a black box and probe
at system calls until I can figure out just what SELinux policy I’m up
against. I only have practical experience administrating Debian
systems (and its derivatives like Ubuntu), which means I’ve hardly
ever had to deal with SELinux. I’m flying fairly blind here.</p>

<p>Since my real application is large and complicated, I code up a
minimal example, around a dozen lines of code: allocate a single page
of memory, write a single return (<code class="language-plaintext highlighter-rouge">ret</code>) instruction into it, set it
as executable, and call it. The program checks for errors, and I can
run it under strace if that’s not insightful enough. This program is
also something simple I could provide to the system administrators,
since they were willing to turn some of the knobs to help narrow down
the problem.</p>

<p>However, <strong>here’s where I made a major mistake</strong>. Assuming the problem
was solely in <code class="language-plaintext highlighter-rouge">mprotect(2)</code>, and wanting to keep this as absolutely
simple as possible, I used <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code> to allocate that page. I
saw the same <code class="language-plaintext highlighter-rouge">EACCES</code> as before, and assumed I was demonstrating the
same problem. Take note of this, too.</p>
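
<p>For illustration, a probe along those lines might look like the following. This is a reconstruction from the description, not the actual program: the name <code class="language-plaintext highlighter-rouge">probe_exec</code> and its <code class="language-plaintext highlighter-rouge">npages</code> parameter are hypothetical, it requests <code class="language-plaintext highlighter-rouge">PROT_READ | PROT_EXEC</code> rather than <code class="language-plaintext highlighter-rouge">PROT_EXEC</code> alone, and it only jumps into the buffer when built for x86.</p>

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Allocate npages pages with posix_memalign(3), store a single x86
 * `ret`, and try to make the memory executable. Returns 0 on success,
 * otherwise an errno value (EACCES is what the denial looks like). */
int
probe_exec(size_t npages)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    if (pagesize <= 0 || npages == 0)
        return EINVAL;
    size_t len = npages * (size_t)pagesize;

    void *p;
    int err = posix_memalign(&p, (size_t)pagesize, len);
    if (err)
        return err;
    unsigned char *buf = p;
    buf[0] = 0xc3;  /* ret */

    if (mprotect(buf, len, PROT_READ | PROT_EXEC) == -1) {
        err = errno;
        free(buf);  /* protections unchanged, so freeing is safe */
        return err;
    }

#if defined(__x86_64__) || defined(__i386__)
    /* Only call into the buffer where `ret` is a valid instruction. */
    ((void (*)(void))buf)();
#endif

    /* Restore the original protections before handing the pages back. */
    if (mprotect(buf, len, PROT_READ | PROT_WRITE) == -1)
        return errno;  /* leak rather than free unusable pages */
    free(buf);
    return 0;
}
```

<p>Passing the return value to <code class="language-plaintext highlighter-rouge">strerror(3)</code> gives the same message strace would show for a failed call.</p>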

<h3 id="finding-a-resolution">Finding a resolution</h3>

<p>Eventually I’d need to figure out what policy was blocking my JIT
compiler, then see if there was an alternative route. The system
loader still worked after all, and I could plainly see that with
strace. So it wasn’t a blanket policy that completely blocked the
execute protection. Perhaps the loader was given an exception?</p>

<p>However, the very first order of business was to actually check the
result from <code class="language-plaintext highlighter-rouge">mprotect(2)</code> and do something more graceful rather than
crash. In my case, that meant falling back to executing a byte-code
virtual machine. I added the check, and now the program ran slower
instead of crashing.</p>
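
<p>The shape of that fix can be sketched as follows, under some loud assumptions: <code class="language-plaintext highlighter-rouge">interpret</code> is a trivial stand-in for the byte-code virtual machine, <code class="language-plaintext highlighter-rouge">run_constant</code> and <code class="language-plaintext highlighter-rouge">used_jit</code> are hypothetical names, and the generated code is the toy <code class="language-plaintext highlighter-rouge">mov eax</code>/<code class="language-plaintext highlighter-rouge">ret</code> pair from earlier rather than a real parser. The JIT path is only attempted on x86.</p>

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

typedef int (*jit_func)(void);

/* Stand-in for the slower byte-code virtual machine. */
static int
interpret(int retval)
{
    return retval;
}

/* Produce retval via JIT-compiled code when possible, otherwise via
 * the interpreter. *used_jit reports which path actually ran. */
int
run_constant(int retval, int *used_jit)
{
    *used_jit = 0;
#if defined(__x86_64__) || defined(__i386__)
    long pagesize = sysconf(_SC_PAGE_SIZE);
    if (pagesize > 0) {
        size_t len = (size_t)pagesize;
        unsigned char *buf = mmap(0, len, PROT_READ | PROT_WRITE,
                                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (buf != MAP_FAILED) {
            unsigned u = (unsigned)retval;
            buf[0] = 0xb8;               /* mov eax, retval */
            buf[1] = (u >>  0) & 0xff;
            buf[2] = (u >>  8) & 0xff;
            buf[3] = (u >> 16) & 0xff;
            buf[4] = (u >> 24) & 0xff;
            buf[5] = 0xc3;               /* ret */
            /* Check the result instead of assuming success. */
            if (mprotect(buf, len, PROT_READ | PROT_EXEC) == 0) {
                int r = ((jit_func)buf)();
                munmap(buf, len);
                *used_jit = 1;
                return r;
            }
            munmap(buf, len);
        }
    }
#endif
    return interpret(retval);
}
```

<p>Either path yields the same answer. The only observable difference is speed, which is exactly the degradation described above.</p>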

<p>The program runs on both Linux and Windows, and the allocation and
page protection management is abstracted. On Windows it uses
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> and <code class="language-plaintext highlighter-rouge">VirtualProtect()</code> instead of <code class="language-plaintext highlighter-rouge">mmap(2)</code> and
<code class="language-plaintext highlighter-rouge">mprotect(2)</code>. Neither implementation checked that the protection
change succeeded, so I fixed the Windows implementation while I was at
it.</p>

<p>Thanks to Mingw-w64, I actually do most of my <a href="/blog/2016/06/13/">Windows
development</a> on Linux. And, thanks to <a href="https://www.winehq.org/">Wine</a>, I mean
everything, including running and debugging. Calling
<code class="language-plaintext highlighter-rouge">VirtualProtect()</code> in Wine would ultimately call <code class="language-plaintext highlighter-rouge">mprotect(2)</code> in the
background, which I expected would be denied. So running the Windows
version with Wine under this SELinux policy would be the perfect test.
Right?</p>

<p><strong>Except that <code class="language-plaintext highlighter-rouge">mprotect(2)</code> succeeded under Wine!</strong> The Windows version
of my JIT compiler was working just fine on Linux. Huh?</p>

<p>This system doesn’t have Wine installed. I had built <a href="/blog/2018/03/27/">and packaged it
myself</a>. This Wine build definitely has no SELinux exceptions.
Not only did the Wine loader work correctly, it can change page
protections in ways my own Linux programs could not. What’s different?</p>

<p>Debugging this with all these layers is starting to look silly, but
this is exactly why doing Windows development on Linux is so useful. I
run my program under Wine under strace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace wine ./mytool.exe
</code></pre></div></div>

<p>I study the system calls around <code class="language-plaintext highlighter-rouge">mprotect(2)</code>. Perhaps there’s some
stricter alignment issue? No. Perhaps I need to include <code class="language-plaintext highlighter-rouge">PROT_READ</code>?
No. The only difference I can find is they’re using the
<code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> flag. So, armed with this knowledge, <strong>I modify my
minimal example to allocate 1024 pages instead of just one, and
suddenly it works correctly</strong>. I was most of the way to figuring this
all out.</p>

<h3 id="inside-glibc-allocation">Inside glibc allocation</h3>

<p>Why did increasing the allocation size change anything? This is a
typical Linux system, so my program is linked against the GNU C
library, glibc. This library allocates memory from two places
depending on the allocation size.</p>

<p>For small allocations, glibc uses <code class="language-plaintext highlighter-rouge">brk(2)</code> to extend the executable
image — i.e. to extend the <code class="language-plaintext highlighter-rouge">.bss</code> section. These resources are not
returned to the operating system after they’re freed with <code class="language-plaintext highlighter-rouge">free(3)</code>.
They’re reused.</p>

<p>For large allocations, glibc uses <code class="language-plaintext highlighter-rouge">mmap(2)</code> to create a new, anonymous
mapping for that allocation. When freed with <code class="language-plaintext highlighter-rouge">free(3)</code>, that memory is
unmapped and its resources are returned to the operating system.</p>

<p>By increasing the allocation size, it became a “large” allocation and
was backed by an anonymous mapping. Even though I didn’t use <code class="language-plaintext highlighter-rouge">mmap(2)</code>,
to the operating system this would be indistinguishable from what Wine was
doing (and succeeding at).</p>

<p>Consider this little example program:</p>

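<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span>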
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When <em>not</em> compiled as a Position Independent Executable (PIE), here’s
what the output looks like. The first pointer is near where the program
was loaded, low in memory. The second pointer is a randomly selected
address high in memory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x1077010
0x7fa9b998e010
</code></pre></div></div>

<p>And if you run it under strace, you’ll see that the first allocation
comes from <code class="language-plaintext highlighter-rouge">brk(2)</code> and the second comes from <code class="language-plaintext highlighter-rouge">mmap(2)</code>.</p>
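
<p>The same distinction can be approximated from inside the program by comparing a pointer against the current program break. This is only a heuristic sketch that assumes the usual glibc and Linux layout, where the <code class="language-plaintext highlighter-rouge">brk(2)</code> heap sits below the break and anonymous mappings sit above it. The function name <code class="language-plaintext highlighter-rouge">malloc_used_brk</code> is made up for the example.</p>

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for sbrk(2) */
#endif
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if a malloc(3) of the given size lands below the current
 * program break (suggesting brk(2)), 0 if it lands above it
 * (suggesting mmap(2)), or -1 if the allocation or sbrk(2) fails.
 * Heuristic only: other allocators and layouts may behave differently. */
int
malloc_used_brk(size_t size)
{
    void *p = malloc(size);
    if (!p)
        return -1;
    void *brk_now = sbrk(0);
    int result = -1;
    if (brk_now != (void *)-1)
        result = (uintptr_t)p < (uintptr_t)brk_now;
    free(p);
    return result;
}
```

<p>On a typical glibc system I would expect a 1-byte request to report the break side and a 1MB request to report the mapping side, though the exact threshold is an allocator detail and can change.</p>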

<h3 id="two-selinux-policies">Two SELinux policies</h3>

<p>With a little bit of research, I found the <a href="https://akkadia.org/drepper/selinux-mem.html">two SELinux policies</a>
at play here. In my minimal example, I was blocked by <code class="language-plaintext highlighter-rouge">allow_execheap</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/selinux/booleans/allow_execheap
</code></pre></div></div>

<p>This prohibits programs from setting the execute protection on any
“heap” page.</p>

<blockquote>
  <p>The POSIX specification does not permit it, but the Linux
implementation of <code class="language-plaintext highlighter-rouge">mprotect</code> allows changing the access protection of
memory on the heap (e.g., allocated using <code class="language-plaintext highlighter-rouge">malloc</code>). This error
indicates that heap memory was supposed to be made executable. Doing
this is really a bad idea. If anonymous, executable memory is needed
it should be allocated using <code class="language-plaintext highlighter-rouge">mmap</code> which is the only portable
mechanism.</p>
</blockquote>

<p>Obviously this is pretty loose since I was still able to do it with
<code class="language-plaintext highlighter-rouge">posix_memalign(3)</code>, which, technically speaking, allocates from the
heap. So, in practice, this policy only applies to pages mapped by <code class="language-plaintext highlighter-rouge">brk(2)</code>.</p>

<p>The second policy was <code class="language-plaintext highlighter-rouge">allow_execmod</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/selinux/booleans/allow_execmod
</code></pre></div></div>

<blockquote>
  <p>The program mapped from a file with <code class="language-plaintext highlighter-rouge">mmap</code> and the <code class="language-plaintext highlighter-rouge">MAP_PRIVATE</code> flag
and write permission. Then the memory region has been written to,
resulting in copy-on-write (COW) of the affected page(s). This memory
region is then made executable […]. The <code class="language-plaintext highlighter-rouge">mprotect</code> call will fail
with <code class="language-plaintext highlighter-rouge">EACCES</code> in this case.</p>
</blockquote>

<p>I don’t understand what purpose this policy serves, but this is what
was causing my original problem. Pages mapped to <code class="language-plaintext highlighter-rouge">/dev/zero</code> are not
<em>actually</em> considered anonymous by Linux, at least as far as this
policy is concerned. I think this is a mistake, and that mapping the
special <code class="language-plaintext highlighter-rouge">/dev/zero</code> device should result in effectively anonymous
pages.</p>

<p>From this I learned a little lesson about baking assumptions — that
<code class="language-plaintext highlighter-rouge">mprotect(2)</code> was solely at fault — into my minimal debugging examples.
And the fix was ultimately easy: I just had to suck it up and use the
slightly less pure <code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> flag.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Crude Personal Package Manager</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/03/27/"/>
    <id>urn:uuid:b100f50f-c8f8-3a08-e149-a04b2308226b</id>
    <updated>2018-03-27T02:10:35Z</updated>
    <category term="c"/><category term="posix"/><category term="linux"/><category term="bsd"/>
    <content type="html">
      <![CDATA[<p>For the past couple of months I’ve been using a custom package manager
to manage a handful of software packages within various unix-like
environments. Packages are <a href="/blog/2017/06/19/">installed in my home directory</a> under
<code class="language-plaintext highlighter-rouge">~/.local/bin</code>, and the package manager itself is just a 110 line Bourne
shell script. It’s not intended to replace the system’s package
manager but, instead, complement it in some cases where I need more
flexibility. I use it to run custom versions of specific pieces of
software — newer or older than the system-installed versions, or with my
own patches and modifications — without interfering with the rest of
the system, and without a need for root access. It’s worked out <em>really</em>
well so far and I expect to continue making heavy use of it in the
future.</p>

<p>It’s so simple that I haven’t even bothered putting the script in its
own repository. It sits unadorned within my dotfiles repository with the
name <em>qpkg</em> (“quick package”):</p>

<ul>
  <li><a href="https://github.com/skeeto/dotfiles/blob/master/bin/qpkg">https://github.com/skeeto/dotfiles/blob/master/bin/qpkg</a></li>
</ul>

<p>Sitting alongside my dotfiles means it’s always there when I need it,
just as if it was a built-in command.</p>

<p>I say it’s crude because its “install” (<code class="language-plaintext highlighter-rouge">-I</code>) procedure is little more
than a wrapper around tar. It doesn’t invoke libtool after installing a
library, and there’s no post-install script — or <code class="language-plaintext highlighter-rouge">postinst</code> as Debian
calls it. It doesn’t check for conflicts between packages, though
there’s a command for doing so manually ahead of time. It doesn’t manage
dependencies, nor even have them as a concept. That’s all on the user to
screw up.</p>

<p>In other words, it doesn’t attempt to solve most of the hard problems
tackled by package managers… <em>except</em> for three important issues:</p>

<ol>
  <li>
    <p>It provides a clean, guaranteed-to-work uninstall procedure. Some
Makefiles <em>do</em> have a token “uninstall” target, but it’s often
unreliable.</p>
  </li>
  <li>
    <p>Unlike blindly using a Makefile “install” target, I can check for
conflicts <em>before</em> installing the software. I’ll know if and how a
package clobbers an already-installed package, and I can manage, or
ignore, that conflict manually as needed.</p>
  </li>
  <li>
    <p>It produces a compact, reusable package file that I can reinstall
later, even on a different machine (with a couple of caveats). I
don’t need to keep around the original source and build directories
should I want to install or uninstall later. I can also rapidly
switch back and forth between different builds of the same software.</p>
  </li>
</ol>

<p>The first caveat is that the package will be configured for exactly my
own home directory, so I usually can’t share it with other users, or
install it on machines where I have a different home directory. Though I
could still create packages for different installation prefixes.</p>

<p>The second caveat is that some builds tailor themselves by default to
the host (e.g. <code class="language-plaintext highlighter-rouge">-march=native</code>). If care isn’t taken, those packages may
not be very portable. This is more common than I had expected and has
mildly annoyed me.</p>

<h3 id="birth-of-a-package-manager">Birth of a package manager</h3>

<p>While the package manager is new, I’ve been building and installing
software in my home directory for years. I’d follow the normal process
of setting the install <em>prefix</em> to <code class="language-plaintext highlighter-rouge">$HOME/.local</code>, running the build,
and then letting the “install” target do its thing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
</code></pre></div></div>

<p>This worked well enough for years. However, I’ve come to rely a lot on
this technique, and I’m using it for increasingly sophisticated
purposes, such as building custom cross-compiler toolchains.</p>

<p>A common difficulty has been handling the release of new versions of
software. I’d like to upgrade to the new version, but lack a way to
cleanly uninstall the previous version. Simply clobbering the old
version by installing it on top <em>usually</em> works. Occasionally it
wouldn’t, and I’d have to blow away <code class="language-plaintext highlighter-rouge">~/.local</code> and start all over again.
With more and more software installed in my home directory, restarting
has become more and more of a chore that I’d like to avoid.</p>

<p>What I needed was a way to track exactly which files were installed so
that I could remove them later when I needed to uninstall. Fortunately
there’s a widely-used convention for exactly this purpose: <code class="language-plaintext highlighter-rouge">DESTDIR</code>.</p>

<p>It’s expected that when a Makefile provides an “install” target, it
prefixes the installation path with the <code class="language-plaintext highlighter-rouge">DESTDIR</code> macro, which is
assigned to the empty string by default. This allows the user to install
the software to a temporary location for the purposes of packaging.
Unlike the installation prefix (<code class="language-plaintext highlighter-rouge">--prefix</code>) configured before the build
began, the software is not expected to function properly when run in the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> location.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=$DESTDIR install
</code></pre></div></div>

<p>A different tool will be used to copy these files into place and actually
install it. This tool can track what files were installed, allowing them
to be removed later when uninstalling. My package manager uses the tar
program for both purposes. First it creates a package by packing up the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> (at the root of the actual install prefix):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar czf package.tgz -C $DESTDIR$HOME/.local .
</code></pre></div></div>

<p>So a package is nothing more than a gzipped tarball. To install, it
unpacks the tarball in <code class="language-plaintext highlighter-rouge">~/.local</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd $HOME/.local
$ tar xzf ~/package.tgz
</code></pre></div></div>
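
<p>The packaging and install steps above can be sketched as a pair of
shell functions. This is only an illustration under my own assumptions:
the names <code class="language-plaintext highlighter-rouge">pkg_create</code> and <code class="language-plaintext highlighter-rouge">pkg_install</code> are hypothetical and are not
taken from the actual script.</p>

```shell
# Sketch only: pkg_create and pkg_install are hypothetical helper names,
# not functions taken from the qpkg script.

# pkg_create DESTDIR PREFIX PACKAGE
# Pack everything staged under DESTDIR/PREFIX, with member paths relative
# to the install prefix rather than to the staging directory.
pkg_create() {
    tar czf "$3" -C "$1$2" .
}

# pkg_install PREFIX PACKAGE
# Unpack the package at the root of the install prefix, clobbering
# whatever is already there.
pkg_install() {
    mkdir -p "$1" || return 1
    tar xzf "$2" -C "$1"
}
```

<p>Because the member paths are relative to the prefix, the same tarball
can be unpacked under any directory, even though the software inside was
still configured for one particular prefix.</p>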

<p>But how does it uninstall a package? It didn’t keep track of what was
installed. Easy! The tarball itself contains the package list, and it’s
printed with tar’s <code class="language-plaintext highlighter-rouge">t</code> mode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd $HOME/.local
for file in $(tar tzf ~/package.tgz | grep -v '/$'); do
    rm -f "$file"
done
</code></pre></div></div>

<p>I’m using <code class="language-plaintext highlighter-rouge">grep</code> to skip directories, which are conveniently listed with
a trailing slash. Note that in the example above, there are a couple of
issues with file names containing whitespace. If the file contains a
space character, it will word split incorrectly in the <code class="language-plaintext highlighter-rouge">for</code> loop. A
Makefile couldn’t handle such a file in the first place, but, in case
it’s still necessary, my package manager sets <code class="language-plaintext highlighter-rouge">IFS</code> to just a newline.</p>

<p>If the file name contains a newline, then my package manager relies on
<a href="http://dinaburg.org/bitsquatting.html">a cosmic ray striking just the right bit at just the right
instant</a> to make it all work out, because no version of tar can
unambiguously print such file names. Crossing your fingers during this
process may help.</p>
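
<p>Here is a minimal sketch of an uninstall that splits the listing on
newlines only, as described above. The function name <code class="language-plaintext highlighter-rouge">pkg_uninstall</code> is
hypothetical, and disabling globbing is my own addition rather than
something the text above claims the script does. File names containing
a newline are still not handled.</p>

```shell
# Sketch only: pkg_uninstall is a hypothetical name, and this is not the
# actual code from qpkg. It assumes no file name contains a newline.

# pkg_uninstall PREFIX PACKAGE
# The body runs in a subshell so the cd, IFS, and set -f changes do not
# leak into the caller.
pkg_uninstall() (
    # List non-directory members before changing directory, so a
    # relative PACKAGE path still resolves.
    list=$(tar tzf "$2" | grep -v '/$')
    cd "$1" || exit 1
    # Split the listing on newlines only, so names with spaces stay whole.
    IFS='
'
    # Disable globbing so names containing * or ? are not expanded.
    set -f
    for file in $list; do
        rm -f "$file"
    done
)
```

<p>Directories are deliberately left behind, since another package may
still have files in them.</p>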

<h3 id="commands">Commands</h3>

<p>There are five commands, each assigned to a capital letter: <code class="language-plaintext highlighter-rouge">-B</code>, <code class="language-plaintext highlighter-rouge">-C</code>,
<code class="language-plaintext highlighter-rouge">-I</code>, <code class="language-plaintext highlighter-rouge">-V</code>,  and <code class="language-plaintext highlighter-rouge">-U</code>. It’s an interface pattern inspired by <a href="https://www.openbsd.org/papers/bsdcan-signify.html">Ted
Unangst’s signify</a> (see <a href="https://man.openbsd.org/signify.1"><code class="language-plaintext highlighter-rouge">signify(1)</code></a>). I also used this
pattern with <a href="/blog/2017/09/15/">Blowpipe</a> and, in retrospect, wish I had also used it
with <a href="/blog/2017/03/12/">Enchive</a>.</p>

<h4 id="build--b">Build (<code class="language-plaintext highlighter-rouge">-B</code>)</h4>

<p>Unlike the other four commands, the “build” command isn’t essential,
and is just for convenience. It assumes the build uses an Autoconf-like
configure script and runs it automatically, followed by <code class="language-plaintext highlighter-rouge">make</code> with the
appropriate <code class="language-plaintext highlighter-rouge">-j</code> (jobs) option. It automatically sets the <code class="language-plaintext highlighter-rouge">--prefix</code>
argument when running the configure script.</p>

<p>If the build uses something other than an Autoconf-like configure script,
such as CMake, then you can’t use the “build” command and must perform
the build yourself. For example, I must do this when building LLVM and
Clang.</p>
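
<p>Roughly, the “build” command described above amounts to the following.
This is a sketch based only on that description, with a hypothetical
function name, and is not the script’s actual code. It assumes <code class="language-plaintext highlighter-rouge">nproc</code>
is available to pick the job count.</p>

```shell
# Sketch only: pkg_build is a hypothetical name, not code from qpkg.

# pkg_build SRCDIR [CONFIGURE-ARG...]
# Run from the build directory, which may differ from the source directory.
pkg_build() {
    src=$1
    shift
    # Configure for the home-directory prefix, passing extra options along.
    "$src/configure" --prefix="$HOME/.local" "$@" || return 1
    # Build with one job per available processor (nproc is GNU coreutils).
    make -j"$(nproc)"
}
```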

<p>Before using the “build” command, the package must first be unpacked and
patched if necessary. Then the package manager can take over to run the
build.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 &lt; ../0001.patch
$ patch -p1 &lt; ../0002.patch
$ patch -p1 &lt; ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/
</code></pre></div></div>

<p>In this example I’m doing an out-of-source build by invoking the
configure script from a different directory. Did you know Autoconf
scripts support this? I didn’t know until recently! Unfortunately some
hand-written Autoconf-like scripts don’t, though this will
be immediately obvious.</p>

<p>Once <code class="language-plaintext highlighter-rouge">qpkg</code> returns, the program will be fully built — or stuck on a
build error if you’re unlucky. If you need to pass custom configure
options, just tack them on the <code class="language-plaintext highlighter-rouge">qpkg</code> command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses
</code></pre></div></div>

<p>Since creating the build directory and moving into it are such common
steps, there’s an optional switch for them: <code class="language-plaintext highlighter-rouge">-d</code>.
This option’s argument is the build directory. <code class="language-plaintext highlighter-rouge">qpkg</code> creates that
directory and runs the build inside it. In practice I just use “x” for
the build directory since it’s so quick to add “dx” to the command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/
</code></pre></div></div>

<p>With the software compiled, the next step is creating the package.</p>

<h4 id="create--c">Create (<code class="language-plaintext highlighter-rouge">-C</code>)</h4>

<p>The “create” command creates the <code class="language-plaintext highlighter-rouge">DESTDIR</code> (<code class="language-plaintext highlighter-rouge">_destdir</code> in the working
directory) and runs the “install” Makefile target to fill it with files.
Continuing with the example above and its <code class="language-plaintext highlighter-rouge">x/</code> build directory:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -Cdx name
</code></pre></div></div>

<p>Where “name” is the name of the package, without any file name
extension. Like with “build”, extra arguments after the package name are
passed to <code class="language-plaintext highlighter-rouge">make</code> in case there needs to be any additional tweaking.</p>

<p>When the “create” command finishes, there will be a new package named
<code class="language-plaintext highlighter-rouge">name.tgz</code> in the working directory. At this point the source and build
directories are no longer needed, assuming everything went fine.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ rm -rf name-version/
$ rm -rf x/
</code></pre></div></div>

<p>This package is ready to install, though you may want to verify it
first.</p>

<h4 id="verify--v">Verify (<code class="language-plaintext highlighter-rouge">-V</code>)</h4>

<p>The “verify” command checks for collisions against installed packages.
It works like uninstallation, but rather than deleting files, it checks
if any of the files already exist. If they do, it means there’s a
conflict with an existing package. These file names are printed.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -V name.tgz
</code></pre></div></div>

<p>The most common conflict I’ve seen is in the info index (<code class="language-plaintext highlighter-rouge">info/dir</code>)
file, which is safe to ignore since I don’t care about it.</p>

<p>If the package has already been installed, there will of course be tons
of conflicts. This is the easiest way to check if a package has been
installed.</p>
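
<p>The collision check can be sketched the same way as uninstallation,
testing for existence instead of deleting. The function name
<code class="language-plaintext highlighter-rouge">pkg_verify</code> and its exit status convention are my own assumptions for
illustration, not the script’s documented behavior.</p>

```shell
# Sketch only: pkg_verify is a hypothetical name, not code from qpkg.
# It assumes no file name in the package contains a newline.

# pkg_verify PREFIX PACKAGE
# Print each non-directory member of PACKAGE that already exists under
# PREFIX. Exits 0 when nothing collides, 1 when something does, and 2
# when PREFIX cannot be entered.
pkg_verify() (
    list=$(tar tzf "$2" | grep -v '/$')
    cd "$1" || exit 2
    # Split on newlines only and disable globbing, as when uninstalling.
    IFS='
'
    set -f
    status=0
    for file in $list; do
        # -L also catches dangling symbolic links, which -e reports as absent.
        if [ -e "$file" ] || [ -L "$file" ]; then
            printf '%s\n' "$file"
            status=1
        fi
    done
    exit $status
)
```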

<h4 id="install--i">Install (<code class="language-plaintext highlighter-rouge">-I</code>)</h4>

<p>The “install” command is just the dumb <code class="language-plaintext highlighter-rouge">tar xzf</code> explained above. It
will clobber anything in its way without warning, which is why, if that
matters, “verify” should be used first.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -I name.tgz
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">qpkg</code> returns, the package has been installed and is probably
ready to go. A lot of packages complain that you need to run libtool to
finalize an installation, but I’ve never had a problem skipping it. This
dumb unpacking generally works fine.</p>

<h4 id="uninstall--u">Uninstall (<code class="language-plaintext highlighter-rouge">-U</code>)</h4>

<p>Obviously the last command is “uninstall”. As explained above, this
needs the original package that was given to the “install” command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -U name.tgz
</code></pre></div></div>

<p>Just as “install” is dumb, so is “uninstall,” blindly deleting anything
listed in the tarball. One thing I like about dumb tools is that there
are no surprises.</p>

<p>I typically suffix the package name with the version number to help keep
the packages organized. When upgrading to a new version of a piece of
software, I build the new package, which, thanks to the version suffix,
will have a distinct name. Then I uninstall the old package, and,
finally, install the new one in its place. So far I’ve been keeping the
old package around in case I still need it, though I could always
rebuild it in a pinch.</p>

<h3 id="package-by-accumulation">Package by accumulation</h3>

<p>Building a GCC cross-compiler toolchain is a tricky case that doesn’t
fit so well with the build, create, and install process illustrated
above. It would be nice for the cross-compiler to be a single, big
package, but due to the way it’s built, it would need to be five or so
packages, a couple of which will conflict (one being a subset of
another):</p>

<ol>
  <li>binutils</li>
  <li>C headers</li>
  <li>core GCC</li>
  <li>C runtime</li>
  <li>rest of GCC</li>
</ol>

<p>Each step needs to be installed before the next step will work. (I don’t
even want to think about cross-compiling a cross-compiler.)</p>

<p>To deal with this, I added a “keep” (<code class="language-plaintext highlighter-rouge">-k</code>) option that leaves the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> around after creating the package. To keep things tidy, the
intermediate packages exist and are installed, but the final, big
cross-compiler package <em>accumulates</em> into the <code class="language-plaintext highlighter-rouge">DESTDIR</code>. The final
package at the end is actually the whole cross compiler in one package,
a superset of them all.</p>
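
<p>To illustrate the accumulation, here is a toy sketch in which each
build step stages into the same <code class="language-plaintext highlighter-rouge">DESTDIR</code> and a package is created after
every step. Both function names are hypothetical, and <code class="language-plaintext highlighter-rouge">stage_step</code> merely
stands in for a real “install” target. Installing each intermediate
package before the next build is left out.</p>

```shell
# Sketch only: stage_step stands in for "make DESTDIR=... install" and is
# hypothetical; a real toolchain build would run its own install target.

# stage_step DESTDIR NAME
# Pretend to install one file for the build step NAME.
stage_step() {
    mkdir -p "$1/bin" || return 1
    echo "$2" > "$1/bin/$2"
}

# accumulate DESTDIR OUTDIR STEP...
# Stage every step into the same DESTDIR, never cleaning it in between,
# and write OUTDIR/STEP.tgz after each one. Each package is therefore a
# superset of the one before, and the last holds the files of all steps.
accumulate() {
    destdir=$1
    outdir=$2
    shift 2
    mkdir -p "$outdir" || return 1
    for step in "$@"; do
        stage_step "$destdir" "$step" || return 1
        tar czf "$outdir/$step.tgz" -C "$destdir" . || return 1
    done
}
```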

<p>Complicated situations like these are where I can really understand the
value of Debian’s <a href="https://wiki.debian.org/FakeRoot">fakeroot</a> tool.</p>

<h3 id="my-use-case-and-an-alternative">My use case, and an alternative</h3>

<p>The role filled by my package manager is actually pretty well suited for
<a href="https://www.pkgsrc.org/">pkgsrc</a>, which is NetBSD’s ports system made available to other
unix-like systems. However, I just need something really lightweight
that gives me absolute control — even more than I get with pkgsrc — in
the dozen or so cases where I <em>really</em> need it.</p>

<p>All I need is a standard C toolchain in a unix-like environment (even a
really old one), the source tarballs for the software I need, my 110
line shell script package manager, and one to two cans of elbow grease.
From there I can bootstrap everything I might need without root access,
even <a href="/blog/2017/04/01/">in a disaster</a>. If the software I need isn’t written in C, it
can ultimately get bootstrapped from some crusty old C compiler, which
might even involve building some newer C compilers in between. After a
certain point it’s C all the way down.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Blowpipe: a Blowfish-encrypted, Authenticated Pipe</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/09/15/"/>
    <id>urn:uuid:1cddecb9-44b1-346c-ded6-c099069ce013</id>
    <updated>2017-09-15T23:59:59Z</updated>
    <category term="crypto"/><category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p><a href="https://github.com/skeeto/blowpipe"><strong>Blowpipe</strong></a> is a <em>toy</em> crypto tool that creates a
<a href="https://www.schneier.com/academic/blowfish/">Blowfish</a>-encrypted pipe. It doesn’t open any files and instead
encrypts and decrypts from standard input to standard output. This
pipe can encrypt individual files or even encrypt a network
connection (à la netcat).</p>

<p>Most importantly, since Blowpipe is intended to be used as a pipe
(duh), it will <em>never</em> output decrypted plaintext that hasn’t been
<em>authenticated</em>. That is, it will detect tampering of the encrypted
stream and truncate its output, reporting an error, without producing
the manipulated data. Some very similar tools that <em>aren’t</em> considered
toys lack this important feature, such as <a href="http://loop-aes.sourceforge.net/aespipe.README">aespipe</a>.</p>

<h3 id="purpose">Purpose</h3>

<p>Blowpipe came about because I wanted to study Blowfish, a 64-bit block
cipher designed by Bruce Schneier in 1993. It’s played an important
role in the history of cryptography and has withstood cryptanalysis
for 24 years. Its major weakness is its small block size, leaving it
vulnerable to birthday attacks regardless of any other property of the
cipher. Even in 1993 the 64-bit block size was a bit on the small
side, but Blowfish was intended as a drop-in replacement for the Data
Encryption Standard (DES) and the International Data Encryption
Algorithm (IDEA), other 64-bit block ciphers.</p>

<p>The main reason I’m calling this program a toy is that, outside of
legacy interfaces, it’s simply <a href="https://sweet32.info/">not appropriate to deploy a 64-bit
block cipher in 2017</a>. Blowpipe shouldn’t be used to encrypt
more than a few tens of GBs of data at a time. Otherwise I’m <em>fairly</em>
confident in both my message construction and my implementation. One
detail is a little uncertain, and I’ll discuss it later when
describing message format.</p>

<p>A tool that I <em>am</em> confident about is <a href="https://github.com/skeeto/enchive">Enchive</a>, though since
it’s <a href="/blog/2017/03/12/">intended for file encryption</a>, it’s not appropriate for use
as a pipe. It doesn’t authenticate until after it has produced most of
its output. Enchive does try its best to delete files containing
unauthenticated output when authentication fails, but this doesn’t
prevent you from consuming this output before it can be deleted,
particularly if you pipe the output into another program.</p>

<h3 id="usage">Usage</h3>

<p>As you might expect, there are two modes of operation: encryption (<code class="language-plaintext highlighter-rouge">-E</code>)
and decryption (<code class="language-plaintext highlighter-rouge">-D</code>). The simplest usage is encrypting and decrypting a
file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ blowpipe -E &lt; data.gz &gt; data.gz.enc
$ blowpipe -D &lt; data.gz.enc | gunzip &gt; data.txt
</code></pre></div></div>

<p>In both cases you will be prompted for a passphrase which can be up to
72 bytes in length. The only verification for the key is the first
Message Authentication Code (MAC) in the datastream, so Blowpipe
cannot tell the difference between damaged ciphertext and an incorrect
key.</p>

<p>In a script it would be smart to check Blowpipe’s exit code after
decrypting. The output will be truncated should authentication fail
somewhere in the middle. Since Blowpipe isn’t aware of files, it can’t
clean up for you.</p>
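
<p>One way a script might do that cleanup itself is sketched below. The
helper name <code class="language-plaintext highlighter-rouge">decrypt_to</code> and the <code class="language-plaintext highlighter-rouge">.partial</code> suffix are hypothetical. The
only thing assumed of Blowpipe is what’s stated above: a non-zero exit
code when authentication fails.</p>

```shell
# Sketch only: decrypt_to is a hypothetical helper, not part of Blowpipe.

# decrypt_to OUTPUT COMMAND [ARG...]
# Run a decrypting command with its output sent to a temporary file, and
# only move that file into place if the command exits with status 0.
decrypt_to() {
    out=$1
    shift
    if "$@" > "$out.partial"; then
        mv "$out.partial" "$out"
    else
        # Authentication failed partway: discard the truncated output.
        rm -f "$out.partial"
        return 1
    fi
}
```

<p>It would be invoked with the ciphertext on standard input, for example
<code class="language-plaintext highlighter-rouge">decrypt_to data.gz blowpipe -D -k keyfile &lt; data.gz.enc</code>. This only
protects the final file. Anything that reads the partial file early, or
that sits directly on the pipe, still sees output up to the point of
failure.</p>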

<p>Another use case is securely transmitting files over a network with
netcat. In this example I’ll use a pre-shared key file, <code class="language-plaintext highlighter-rouge">keyfile</code>.
Rather than prompt for a key, Blowpipe will use the raw bytes of a given
file. Here’s how I would create a key file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ head -c 32 /dev/urandom &gt; keyfile
</code></pre></div></div>

<p>First the receiver listens on a socket (<code class="language-plaintext highlighter-rouge">bind(2)</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nc -lp 2000 | blowpipe -D -k keyfile &gt; data.zip
</code></pre></div></div>

<p>Then the sender connects (<code class="language-plaintext highlighter-rouge">connect(2)</code>) and pipes Blowpipe through:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ blowpipe -E -k keyfile &lt; data.zip | nc -N hostname 2000
</code></pre></div></div>

<p>If all went well, Blowpipe will exit with 0 on the receiver side.</p>

<p>Blowpipe doesn’t buffer its output (but see <code class="language-plaintext highlighter-rouge">-w</code>). It performs one
<code class="language-plaintext highlighter-rouge">read(2)</code>, encrypts whatever it got, prepends a MAC, and calls
<code class="language-plaintext highlighter-rouge">write(2)</code> on the result. This means it can comfortably transmit live
sensitive data across the network:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nc -lp 2000 | blowpipe -D

# dmesg -w | blowpipe -E | nc -N hostname 2000
</code></pre></div></div>

<p>Kernel messages will appear on the other end as they’re produced by
<code class="language-plaintext highlighter-rouge">dmesg</code>. Though keep in mind that the size of each line will be known to
eavesdroppers. Blowpipe doesn’t pad it with noise or otherwise try to
disguise the length. Those lengths may leak useful information.</p>

<h3 id="blowfish">Blowfish</h3>

<p>This whole project started when I wanted to <a href="/blog/2017/09/21/">play with Blowfish</a>
as a small drop-in library. I wasn’t satisfied with <a href="https://www.schneier.com/academic/blowfish/download.html">the
selection</a>, so I figured it would be a good exercise to write my
own. Besides, the <a href="https://www.schneier.com/academic/archives/1994/09/description_of_a_new.html">specification</a> is both an enjoyable and easy
read (and recommended). It justifies the need for a new cipher and
explains the various design decisions.</p>

<p>I coded from the specification, including writing <a href="https://github.com/skeeto/blowpipe/blob/master/gen-tables.sh">a script</a>
to generate the subkey initialization tables. Subkeys are initialized
to the binary representation of pi (the first ~10,000 decimal digits).
After a couple hours of work I hooked up the official test vectors to
see how I did, and all the tests passed on the first run. This wasn’t
reasonable, so I spent a while longer figuring out how I screwed up my
tests. Turns out I absolutely <em>nailed it</em> on my first shot. It’s a
really great sign for Blowfish that it’s so easy to implement
correctly.</p>

<p>Blowfish’s key schedule produces five subkeys requiring 4,168 bytes of
storage. The key schedule is unusually complex: Subkeys are repeatedly
encrypted with themselves as they are being computed. This complexity
inspired the <a href="https://www.usenix.org/legacy/events/usenix99/provos/provos_html/node1.html">bcrypt</a> password hashing scheme, which
essentially works by iterating the key schedule many times in a loop,
then encrypting a constant 24-byte string. My bcrypt implementation
wasn’t nearly as successful on my first attempt, and it took hours of
debugging in order to match OpenBSD’s outputs.</p>

<p>The encryption and decryption algorithms are nearly identical, as is
typical for, and a feature of, Feistel ciphers. There are no branches
(preventing some side-channel attacks), and the only operations are
32-bit XOR and 32-bit addition. This makes it ideal for implementation
on 32-bit computers.</p>

<p>One tricky point is that encryption and decryption operate on a pair
of 32-bit integers (another giveaway that it’s a Feistel cipher). To
put the cipher to practical use, these integers have to be <a href="/blog/2016/11/22/">serialized
into a byte stream</a>. The specification doesn’t choose a byte
order, even for mixing the key into the subkeys. The official test
vectors are also 32-bit integers, not byte arrays. An implementer
could choose little endian, big endian, or even something else.</p>

<p>However, there’s one place in which this decision <em>is</em> formally made:
the official test vectors mix the key into the first subkey in big
endian byte order. By luck I happened to choose big endian as well,
which is why my tests passed on the first try. OpenBSD’s version of
bcrypt also uses big endian for all integer encoding steps, further
cementing big endian as the standard way to encode Blowfish integers.</p>

<h3 id="blowfish-library">Blowfish library</h3>

<p>The Blowpipe repository contains a ready-to-use, public domain Blowfish
library written in strictly conforming C99. The interface is just three
functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">blowfish_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">blowfish</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">blowfish_encrypt</span><span class="p">(</span><span class="k">struct</span> <span class="n">blowfish</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">blowfish_decrypt</span><span class="p">(</span><span class="k">struct</span> <span class="n">blowfish</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Technically the key can be up to 72 bytes long, but the last 16 bytes
have an incomplete effect on the subkeys, so only the first 56 bytes
should matter. Since bcrypt runs the key schedule multiple times, all
72 bytes have full effect.</p>

<p>The library also includes a bcrypt implementation, though it will only
produce the raw password hash, not the base-64 encoded form. The main
reason for including bcrypt is to support Blowpipe.</p>

<h3 id="message-format">Message format</h3>

<p>The main goal of Blowpipe was to build a robust, authenticated
encryption tool using <em>only</em> Blowfish as a cryptographic primitive.</p>

<ol>
  <li>
    <p>It uses bcrypt with a moderately high cost as a key derivation
function (KDF). Not terrible, but bcrypt is not a memory-hard KDF, and
memory hardness is important for protecting against brute force attacks
on cheap hardware.</p>
  </li>
  <li>
    <p>Encryption is Blowfish in “counter” <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29">CTR mode</a>. A 64-bit
counter is incremented and encrypted, producing a keystream. The
plaintext is XORed with this keystream like a stream cipher. This
allows the last block to be truncated when output and eliminates
some padding issues. Since CTR mode is trivially malleable, the MAC
becomes even more important. In CTR mode, <code class="language-plaintext highlighter-rouge">blowfish_decrypt()</code> is
never called. In fact, Blowpipe never uses it.</p>
  </li>
  <li>
    <p>The authentication scheme is Blowfish-CBC-MAC with a unique key and
<a href="https://moxie.org/blog/the-cryptographic-doom-principle/">encrypt-then-authenticate</a> (something I harmlessly got wrong
with Enchive). It essentially encrypts the ciphertext again with a
different key, but in <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#CBC">Cipher Block Chaining mode</a> (CBC), and
it only saves the final block. The final block is prepended to the
ciphertext as the MAC. On decryption the same block is computed
again to ensure that it matches. Only someone who knows the MAC key
can compute it.</p>
  </li>
</ol>
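<p>To illustrate the CTR mode item in the list above, here’s a rough
sketch of the keystream XOR. It is not Blowpipe’s code: <code class="language-plaintext highlighter-rouge">toy_encrypt()</code>
is a hypothetical, insecure stand-in for <code class="language-plaintext highlighter-rouge">blowfish_encrypt()</code> so that the
example is self-contained, and the big endian encoding of the counter
and keystream is an assumption.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for blowfish_encrypt(): a toy 4-round Feistel
 * network. It is NOT a secure cipher and exists only for illustration. */
void
toy_encrypt(uint32_t key, uint32_t *xl, uint32_t *xr)
{
    uint32_t l = *xl, r = *xr;
    for (int i = 0; i < 4; i++) {
        uint32_t t = r;
        r = l ^ ((r + key + (uint32_t)i) * 0x9e3779b9u);
        l = t;
    }
    *xl = l;
    *xr = r;
}

/* XOR buf[0..len) in place with the keystream, starting at block number
 * "counter". Returns the counter value for the next call. Encryption
 * and decryption are the same operation, so the block cipher's decrypt
 * function is never needed. */
uint64_t
ctr_xor(uint32_t key, uint64_t counter, unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 8) {
        /* Encrypt the 64-bit counter to get one keystream block. */
        uint32_t hi = (uint32_t)(counter >> 32);
        uint32_t lo = (uint32_t)counter;
        toy_encrypt(key, &hi, &lo);
        counter++;

        /* Serialize the keystream block (big endian, an assumption). */
        unsigned char ks[8];
        for (int j = 0; j < 4; j++) {
            ks[j]     = (unsigned char)(hi >> (24 - 8*j));
            ks[j + 4] = (unsigned char)(lo >> (24 - 8*j));
        }

        /* The last block may be short: use only what's needed. */
        size_t n = len - i < 8 ? len - i : 8;
        for (size_t j = 0; j < n; j++) {
            buf[i + j] ^= ks[j];
        }
    }
    return counter;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">ctr_xor()</code> a second time with the same key and starting
counter restores the original bytes. No padding is involved since the
final keystream block is simply truncated.</p>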

<p>Of all three Blowfish uses, I’m least confident about authentication.
<a href="https://blog.cryptographyengineering.com/2013/02/15/why-i-hate-cbc-mac/">CBC-MAC is tricky to get right</a>, though I am following the
rules: fixed length messages using a different key than encryption.</p>
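<p>A bare-bones sketch of CBC-MAC under those rules might look like the
following. As before, <code class="language-plaintext highlighter-rouge">toy_encrypt()</code> is a hypothetical, insecure stand-in
for <code class="language-plaintext highlighter-rouge">blowfish_encrypt()</code>, and the names, the zero initial state, and the
big endian loads are assumptions rather than Blowpipe’s actual code.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for blowfish_encrypt(): a toy 4-round Feistel
 * network. It is NOT a secure cipher and exists only for illustration. */
void
toy_encrypt(uint32_t key, uint32_t *xl, uint32_t *xr)
{
    uint32_t l = *xl, r = *xr;
    for (int i = 0; i < 4; i++) {
        uint32_t t = r;
        r = l ^ ((r + key + (uint32_t)i) * 0x9e3779b9u);
        l = t;
    }
    *xl = l;
    *xr = r;
}

/* Load a 32-bit integer from four big endian bytes. */
uint32_t
load_u32be(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3] <<  0;
}

/* Compute the CBC-MAC of a message whose length is a multiple of the
 * 8-byte block size. Each block is XORed into the running state, which
 * is then encrypted. Only the final state is kept as the MAC. */
void
cbc_mac(uint32_t key, const unsigned char *msg, size_t len, uint32_t mac[2])
{
    assert(len % 8 == 0);
    uint32_t l = 0, r = 0;  /* zero initial state (an assumption) */
    for (size_t i = 0; i < len; i += 8) {
        l ^= load_u32be(msg + i);
        r ^= load_u32be(msg + i + 4);
        toy_encrypt(key, &l, &r);
    }
    mac[0] = l;
    mac[1] = r;
}

/* Compare two MACs without branching on individual words. */
int
mac_equal(const uint32_t a[2], const uint32_t b[2])
{
    uint32_t diff = (a[0] ^ b[0]) | (a[1] ^ b[1]);
    return diff == 0;
}
```

<p>The receiver recomputes the MAC over the ciphertext with the MAC key
and rejects the message unless <code class="language-plaintext highlighter-rouge">mac_equal()</code> returns non-zero.</p>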

<p>Wait a minute. Blowpipe is pipe-oriented and can output data without
buffering the entire pipe. How can there be fixed-length messages?</p>

<p>The pipe datastream is broken into 64kB <em>chunks</em>. Each chunk is
authenticated with its own MAC. Both the MAC and chunk length are
written in the chunk header, and the length is authenticated by the
MAC. Furthermore, just like the keystream, the MAC is continued from
the previous chunk, preventing chunks from being reordered. Blowpipe can
output the content of a chunk and discard it once it’s been
authenticated. If any chunk fails to authenticate, it aborts.</p>

<p><img src="/img/diagram/blowpipe.svg" alt="" /></p>

<p>This also leads to another useful trick: The pipe is terminated with a
zero length chunk, preventing an attacker from appending to the
datastream. Everything after the zero-length chunk is discarded. Since
the length is authenticated by the MAC, the attacker also cannot
truncate the pipe since that would require knowledge of the MAC key.</p>

<p>The pipe itself has a 17 byte header: a 16 byte random bcrypt salt and 1
byte for the bcrypt cost. The salt is like an initialization vector (IV)
that allows keys to be safely reused in different Blowpipe instances.
The cost byte is the only distinguishing byte in the stream. Since even
the chunk lengths are encrypted, everything else in the datastream
should be indistinguishable from random data.</p>
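<p>Reading that header could be sketched as below. The struct, the
function name, and the assumption that the salt comes before the cost
byte are mine, for illustration only.</p>

```c
#include <stddef.h>
#include <string.h>

#define SALT_LEN   16
#define HEADER_LEN (SALT_LEN + 1)  /* 17 bytes total */

struct pipe_header {
    unsigned char salt[SALT_LEN];  /* random bcrypt salt */
    int cost;                      /* bcrypt cost, stored as one byte */
};

/* Parse the pipe header from buf. Returns 1 on success, or 0 when
 * fewer than HEADER_LEN bytes are available. */
int
header_parse(struct pipe_header *h, const unsigned char *buf, size_t len)
{
    if (len < HEADER_LEN) {
        return 0;
    }
    memcpy(h->salt, buf, SALT_LEN);
    h->cost = buf[SALT_LEN];
    return 1;
}
```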

<h3 id="portability">Portability</h3>

<p>Blowpipe runs on POSIX systems and Windows (Mingw-w64 and MSVC). I
initially wrote it for POSIX (on Linux) of course, but I took an unusual
approach when it came time to port it to Windows. Normally I’d <a href="/blog/2017/03/01/">invent a
generic OS interface</a> that makes the appropriate host system
calls. This time I kept the POSIX interface (<code class="language-plaintext highlighter-rouge">read(2)</code>, <code class="language-plaintext highlighter-rouge">write(2)</code>,
<code class="language-plaintext highlighter-rouge">open(2)</code>, etc.) and implemented the tiny subset of POSIX that I needed
in terms of Win32. That implementation can be found under <code class="language-plaintext highlighter-rouge">w32-compat/</code>.
I even dropped in a copy of <a href="https://github.com/skeeto/getopt">my own <code class="language-plaintext highlighter-rouge">getopt()</code></a>.</p>

<p>One really cool feature of this technique is that, on Windows, Blowpipe
will still “open” <code class="language-plaintext highlighter-rouge">/dev/urandom</code>. It’s intercepted by my own <code class="language-plaintext highlighter-rouge">open(2)</code>,
which in response to that filename actually calls
<a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa379886(v=vs.85).aspx"><code class="language-plaintext highlighter-rouge">CryptAcquireContext()</code></a> and pretends like it’s a file. It’s all
hidden behind the file descriptor. That’s the unix way.</p>

<p>I’m considering giving Enchive the same treatment since it would simplify
and reduce much of the interface code. In fact, this project has taught
me a number of ways that Enchive could be improved. That’s the value of
writing “toys” such as Blowpipe.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Tutorial on Portable Makefiles</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/08/20/"/>
    <id>urn:uuid:dc6580f0-1703-389b-7bb2-ac29899fd22c</id>
    <updated>2017-08-20T03:03:51Z</updated>
    <category term="tutorial"/><category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>In my first decade writing Makefiles, I developed the bad habit of
liberally using GNU Make’s extensions. I didn’t know the line between
GNU Make and the portable features guaranteed by POSIX. Usually it
didn’t matter much, but it would become an annoyance when building on
non-Linux systems, such as on the various BSDs. I’d have to specifically
install GNU Make, then remember to invoke it (i.e. as <code class="language-plaintext highlighter-rouge">gmake</code>) instead
of the system’s make.</p>

<p>I’ve since become familiar and comfortable with <a href="http://pubs.opengroup.org/onlinepubs/9699919799/utilities/make.html">make’s official
specification</a>, and I’ve spent the last year writing strictly
portable Makefiles. Not only are my builds now portable across all
unix-like systems, my Makefiles are also cleaner and more robust. Many of the
common make extensions — conditionals in particular — lead to fragile,
complicated Makefiles and are best avoided anyway. It’s important to be
able to trust your build system to do its job correctly.</p>

<p><strong>This tutorial should be suitable for make beginners who have never
written their own Makefiles before, as well as experienced developers
who want to learn how to write portable Makefiles.</strong> Regardless, in
order to understand the examples you must be familiar with the usual
steps for building programs on the command line (compiler, linker,
object files, etc.). I’m not going to suggest any fancy tricks nor
provide any sort of standard starting template. Makefiles should be dead
simple when the project is small, and grow in a predictable, clean
fashion alongside the project.</p>

<p>I’m not going to cover every feature. You’ll need to read the
specification for yourself to learn it all. This tutorial will go over
the important features as well as the common conventions. It’s important
to follow established conventions so that people using your Makefiles
will know what to expect and how to accomplish the basic tasks.</p>

<p>If you’re running Debian, or a Debian derivative such as Ubuntu, the
<code class="language-plaintext highlighter-rouge">bmake</code> and <code class="language-plaintext highlighter-rouge">freebsd-buildutils</code> packages will provide the <code class="language-plaintext highlighter-rouge">bmake</code> and
<code class="language-plaintext highlighter-rouge">fmake</code> programs respectively. These alternative make implementations
are very useful for testing your Makefiles’ portability, should you
accidentally make use of a GNU Make feature. It’s not perfect since each
implements some of the same extensions as GNU Make, but it will catch
some common mistakes.</p>

<h3 id="whats-in-a-makefile">What’s in a Makefile?</h3>

<blockquote>
  <p>I am free, no matter what rules surround me. If I find them tolerable,
I tolerate them; if I find them too obnoxious, I break them. I am free
because I know that I alone am morally responsible for everything I
do. ―Robert A. Heinlein</p>
</blockquote>

<p>At make’s core are one or more dependency trees, constructed from
<em>rules</em>. Each vertex in the tree is called a <em>target</em>. The final
products of the build (executable, document, etc.) are the tree roots. A
Makefile specifies the dependency trees and supplies the shell commands
to produce a target from its <em>prerequisites</em>.</p>

<p><img src="/img/make/game.svg" alt="" /></p>

<p>In this illustration, the “.c” files are source files that are written
by hand, not generated by commands, so they have no prerequisites. The
syntax for specifying one or more edges in this dependency tree is
simple:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>target [target...]: [prerequisite...]
</code></pre></div></div>

<p>While technically multiple targets can be specified in a single rule,
this is unusual. Typically each target is specified in its own rule. To
specify the tree in the illustration above:</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c</span>
</code></pre></div></div>

<p>The order of these rules doesn’t matter. The entire Makefile is parsed
before any actions are taken, so the tree’s vertices and edges can be
specified in any order. There’s one exception: the first non-special
target in a Makefile is the <em>default target</em>. This target is selected
implicitly when make is invoked without choosing a target. It should be
something sensible, so that a user can blindly run make and get a useful
result.</p>

<p>A target can be specified more than once. Any new prerequisites are
appended to the previously-given prerequisites. For example, this
Makefile is identical to the previous, though it’s typically not written
this way:</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">physics.o</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c</span>
</code></pre></div></div>

<p>There are seven <em>special targets</em> that are used to change the behavior
of make itself. All have uppercase names and start with a period.
Names fitting this pattern are reserved for use by make. According to
the standard, in order to get reliable POSIX behavior, the first
non-comment line of the Makefile <em>must</em> be <code class="language-plaintext highlighter-rouge">.POSIX</code>. Since this is a
special target, it’s not a candidate for the default target, so <code class="language-plaintext highlighter-rouge">game</code>
will remain the default target:</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c</span>
</code></pre></div></div>

<p>In practice, even a simple program will have header files, and sources
that include a header file should also have an edge on the dependency
tree for it. If the header file changes, targets that include it should
also be rebuilt.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
</code></pre></div></div>

<h3 id="adding-commands-to-rules">Adding commands to rules</h3>

<p>We’ve constructed a dependency tree, but we still haven’t told make how
to actually build any targets from its prerequisites. The rules also
need to specify the shell commands that produce a target from its
prerequisites.</p>

<p>If you were to create the source files in the example and invoke make,
you would find that it actually <em>does</em> know how to build the object
files. This is because make is initially configured with certain
<em>inference rules</em>, a topic which will be covered later. For now, we’ll
add the <code class="language-plaintext highlighter-rouge">.SUFFIXES</code> special target to the top, erasing all the built-in
inference rules.</p>

<p>Commands immediately follow the target/prerequisite line in a rule. Each
command line must start with a tab character. This can be awkward if
your text editor isn’t configured for it, and it will be awkward if you
try to copy the examples from this page.</p>

<p>Each line is run in its own shell, so be mindful of using commands like
<code class="language-plaintext highlighter-rouge">cd</code>, which won’t affect later lines.</p>

<p>The simplest thing to do is literally specify the same commands you’d
type at the shell:</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">.SUFFIXES</span><span class="o">:</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">cc</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">input.c</span>
</code></pre></div></div>

<h3 id="invoking-make-and-choosing-targets">Invoking make and choosing targets</h3>

<blockquote>
  <p>I tried to walk into Target, but I missed. ―Mitch Hedberg</p>
</blockquote>

<p>When invoked, make accepts zero or more targets from the dependency
tree, and it will build these targets (i.e. run the commands in each
target’s rule) if the target is <em>out-of-date</em>. A target is out-of-date
if it does not exist or is older than any of its prerequisites.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># build the "game" binary (default target)
$ make

# build just the object files
$ make graphics.o physics.o input.o
</code></pre></div></div>
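<p>The out-of-date test itself comes down to comparing file modification
times. Here’s a rough sketch of the idea in C using POSIX
<code class="language-plaintext highlighter-rouge">stat(2)</code>. The function name is hypothetical, and a real make
implementation does more, such as complaining about a missing
prerequisite that it has no rule to build.</p>

```c
#include <stddef.h>
#include <sys/stat.h>

/* Return 1 if the target needs to be rebuilt: either it doesn't exist,
 * or it is older than at least one of its n prerequisites. Return 0
 * otherwise. Prerequisites that cannot be examined are skipped, a
 * simplification for this sketch. Timestamps are compared with
 * whole-second granularity. */
int
out_of_date(const char *target, const char **prereqs, size_t n)
{
    struct stat t;
    if (stat(target, &t) != 0) {
        return 1;  /* target doesn't exist yet */
    }
    for (size_t i = 0; i < n; i++) {
        struct stat p;
        if (stat(prereqs[i], &p) == 0 && t.st_mtime < p.st_mtime) {
            return 1;  /* target is older than this prerequisite */
        }
    }
    return 0;
}
```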

<p>This effect cascades up the dependency tree and causes further targets
to be rebuilt until all of the requested targets are up-to-date. There’s
a lot of room for parallelism since different branches of the tree can
be updated independently. It’s common for make implementations to
support parallel builds with the <code class="language-plaintext highlighter-rouge">-j</code> option. This is non-standard, but
it’s a fantastic feature that doesn’t require anything special in the
Makefile to work correctly.</p>

<p>Similar to parallel builds is make’s <code class="language-plaintext highlighter-rouge">-k</code> (“keep going”) option, which
<em>is</em> standard. This tells make not to stop on the first error, and to
continue updating targets that are unaffected by the error. This is nice
for fully populating <a href="http://vimdoc.sourceforge.net/htmldoc/quickfix.html">Vim’s quickfix list</a> or <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Compilation.html">Emacs’ compilation
buffer</a>.</p>

<p>It’s common to have multiple targets that should be built by default. If
the first rule selects the default target, how do we solve the problem
of needing multiple default targets? The convention is to use <em>phony
targets</em>. These are called “phony” because there is no corresponding
file, and so phony targets are never up-to-date. It’s convention for a
phony “all” target to be the default target.</p>

<p>I’ll make <code class="language-plaintext highlighter-rouge">game</code> a prerequisite of a new “all” target. More real targets
could be added as necessary to turn them into defaults. Users of this
Makefile will also expect <code class="language-plaintext highlighter-rouge">make all</code> to build the entire project.</p>

<p>Another common phony target is “clean” which removes all of the built
files. Users will expect <code class="language-plaintext highlighter-rouge">make clean</code> to delete all generated files.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">.SUFFIXES</span><span class="o">:</span>
<span class="nl">all</span><span class="o">:</span> <span class="nf">game</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">cc</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">input.c</span>
<span class="nl">clean</span><span class="o">:</span>
    <span class="err">rm</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
</code></pre></div></div>

<h3 id="customize-the-build-with-macros">Customize the build with macros</h3>

<p>So far the Makefile hardcodes <code class="language-plaintext highlighter-rouge">cc</code> as the compiler, and doesn’t use any
compiler flags (warnings, optimization, hardening, etc.). The user
should be able to easily control all these things, but right now they’d
have to edit the entire Makefile to do so. Perhaps the user has both
<code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">clang</code> installed, and wants to choose one or the other
without changing which is installed as <code class="language-plaintext highlighter-rouge">cc</code>.</p>

<p>To solve this, make has <em>macros</em> that expand into strings when
referenced. The convention is to use the macro named <code class="language-plaintext highlighter-rouge">CC</code> when talking
about the C compiler, <code class="language-plaintext highlighter-rouge">CFLAGS</code> when talking about flags passed to the C
compiler, <code class="language-plaintext highlighter-rouge">LDFLAGS</code> for flags passed to the C compiler when linking, and
<code class="language-plaintext highlighter-rouge">LDLIBS</code> for flags about libraries when linking. The Makefile should
supply defaults as needed.</p>

<p>A macro is expanded with <code class="language-plaintext highlighter-rouge">$(...)</code>. It’s valid (and normal) to reference
a macro that hasn’t been defined, which will be an empty string. This
will be the case with <code class="language-plaintext highlighter-rouge">LDFLAGS</code> below.</p>

<p>Macro values can contain other macros, which will be expanded
recursively each time the macro is expanded. Some make implementations
allow the name of the macro being expanded to itself be a macro, which
<a href="/blog/2016/04/30/">is Turing complete</a>, but this behavior is non-standard.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">.SUFFIXES</span><span class="o">:</span>
<span class="nv">CC</span>     <span class="o">=</span> cc
<span class="nv">CFLAGS</span> <span class="o">=</span> <span class="nt">-W</span> <span class="nt">-O</span>
<span class="nv">LDLIBS</span> <span class="o">=</span> <span class="nt">-lm</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">game</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">$(CC)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span> <span class="err">$(LDLIBS)</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="err">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="err">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="err">input.c</span>
<span class="nl">clean</span><span class="o">:</span>
    <span class="err">rm</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
</code></pre></div></div>

<p>Macros are overridden by macro definitions given as command line
arguments in the form <code class="language-plaintext highlighter-rouge">name=value</code>. This allows the user to select their
own build configuration. <strong>This is one of make’s most powerful and
under-appreciated features.</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make CC=clang CFLAGS='-O3 -march=native'
</code></pre></div></div>

<p>If the user doesn’t want to specify these macros on every invocation,
they can (cautiously) use make’s <code class="language-plaintext highlighter-rouge">-e</code> flag to set overriding macro
definitions from the environment.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export CC=clang
$ export CFLAGS=-O3
$ make -e all
</code></pre></div></div>

<p>Some make implementations have other special kinds of macro assignment
operators beyond simple assignment (<code class="language-plaintext highlighter-rouge">=</code>). These are unnecessary, so
don’t worry about them.</p>

<h3 id="inference-rules-so-that-you-can-stop-repeating-yourself">Inference rules so that you can stop repeating yourself</h3>

<blockquote>
  <p>The road itself tells us far more than signs do. ―Tom Vanderbilt,
Traffic: Why We Drive the Way We Do</p>
</blockquote>

<p>There’s repetition across the three different object files. Wouldn’t it
be nice if there were a way to communicate this pattern? Fortunately
there is, in the form of <em>inference rules</em>. An inference rule says that a target with
a certain extension, with a prerequisite with another certain extension,
is built a certain way. This will make more sense with an example.</p>

<p>In an inference rule, the target indicates the extensions. The <code class="language-plaintext highlighter-rouge">$&lt;</code>
macro expands to the prerequisite, which is essential to making
inference rules work generically. Unfortunately this macro is not
available in target rules, as much as that would be useful.</p>

<p>For example, here’s an inference rule that teaches make how to build an
object file from a C source file. This particular rule is one that
is pre-defined by make, so you’ll never need to write this one yourself.
I’ll include it for completeness.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.c.o</span><span class="o">:</span>
    <span class="err">$(CC)</span> <span class="err">$(CFLAGS)</span> <span class="err">-c</span> <span class="err">$&lt;</span>
</code></pre></div></div>

<p>These extensions must be added to <code class="language-plaintext highlighter-rouge">.SUFFIXES</code> before they will work.
With that, the commands for the rules about object files can be omitted.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">.SUFFIXES</span><span class="o">:</span>
<span class="nv">CC</span>     <span class="o">=</span> cc
<span class="nv">CFLAGS</span> <span class="o">=</span> <span class="nt">-W</span> <span class="nt">-O</span>
<span class="nv">LDLIBS</span> <span class="o">=</span> <span class="nt">-lm</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">game</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">$(CC)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span> <span class="err">$(LDLIBS)</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
<span class="nl">clean</span><span class="o">:</span>
    <span class="err">rm</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>

<span class="nl">.SUFFIXES</span><span class="o">:</span> <span class="nf">.c .o</span>
<span class="nl">.c.o</span><span class="o">:</span>
    <span class="err">$(CC)</span> <span class="err">$(CFLAGS)</span> <span class="err">-c</span> <span class="err">$&lt;</span>
</code></pre></div></div>

<p>The first empty <code class="language-plaintext highlighter-rouge">.SUFFIXES</code> clears the suffix list. The second one adds
<code class="language-plaintext highlighter-rouge">.c</code> and <code class="language-plaintext highlighter-rouge">.o</code> to the now-empty suffix list.</p>

<h3 id="other-target-conventions">Other target conventions</h3>

<blockquote>
  <p>Conventions are, indeed, all that shield us from the shivering void,
though often they do so but poorly and desperately. ―Robert Aickman</p>
</blockquote>

<p>Users usually expect an “install” target that installs the built
program, libraries, man pages, etc. By convention this target should use
the <code class="language-plaintext highlighter-rouge">PREFIX</code> and <code class="language-plaintext highlighter-rouge">DESTDIR</code> macros.</p>

<p>The <code class="language-plaintext highlighter-rouge">PREFIX</code> macro should default to <code class="language-plaintext highlighter-rouge">/usr/local</code>, and since it’s a
macro the user can override it to install elsewhere, <a href="/blog/2017/06/19/">such as in their
home directory</a>. The user should override it for both building and
installing, since the prefix may need to be built into the binary (e.g.
<code class="language-plaintext highlighter-rouge">-DPREFIX=$(PREFIX)</code>).</p>
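<p>On the C side, building the prefix into the binary might look like
this sketch. It assumes the compile command passes the prefix as a quoted
string literal, e.g. <code class="language-plaintext highlighter-rouge">-DPREFIX='"$(PREFIX)"'</code>, and the
<code class="language-plaintext highlighter-rouge">share/game</code> directory and <code class="language-plaintext highlighter-rouge">data_path()</code> function are hypothetical.</p>

```c
#include <stdio.h>

/* Fall back to the conventional default when the build doesn't pass
 * -DPREFIX on the compiler command line. */
#ifndef PREFIX
#  define PREFIX "/usr/local"
#endif

/* Write the installed location of a data file into buf, e.g.
 * "<prefix>/share/game/<name>". Returns what snprintf() returns: the
 * length the full path needs, not counting the null terminator. */
int
data_path(char *buf, size_t len, const char *name)
{
    return snprintf(buf, len, "%s/share/game/%s", PREFIX, name);
}
```

<p>Since the path is fixed at compile time, building with one prefix and
installing with another would leave the program looking in the wrong
place.</p>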

<p>The <code class="language-plaintext highlighter-rouge">DESTDIR</code> macro is used for <em>staged builds</em>, so that the program gets
installed under a fake root directory for the sake of packaging. Unlike
with <code class="language-plaintext highlighter-rouge">PREFIX</code>, the program will not actually be run from this directory.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nv">CC</span>     <span class="o">=</span> cc
<span class="nv">CFLAGS</span> <span class="o">=</span> <span class="nt">-W</span> <span class="nt">-O</span>
<span class="nv">LDLIBS</span> <span class="o">=</span> <span class="nt">-lm</span>
<span class="nv">PREFIX</span> <span class="o">=</span> /usr/local

<span class="nl">all</span><span class="o">:</span> <span class="nf">game</span>
<span class="nl">install</span><span class="o">:</span> <span class="nf">game</span>
    <span class="err">mkdir</span> <span class="err">-p</span> <span class="err">$(DESTDIR)$(PREFIX)/bin</span>
    <span class="err">mkdir</span> <span class="err">-p</span> <span class="err">$(DESTDIR)$(PREFIX)/share/man/man1</span>
    <span class="err">cp</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">$(DESTDIR)$(PREFIX)/bin</span>
    <span class="err">gzip</span> <span class="err">&lt;</span> <span class="err">game.1</span> <span class="err">&gt;</span> <span class="err">$(DESTDIR)$(PREFIX)/share/man/man1/game.1.gz</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">$(CC)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span> <span class="err">$(LDLIBS)</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
<span class="nl">clean</span><span class="o">:</span>
    <span class="err">rm</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
</code></pre></div></div>

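<p>You may also want to provide an “uninstall” phony target that does the
opposite. To install somewhere other than the default, the user overrides
<code class="language-plaintext highlighter-rouge">PREFIX</code> on the command line:</p>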

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make PREFIX=$HOME/.local install
</code></pre></div></div>

<p>Other common targets are “mostlyclean” (like “clean” but don’t delete
some slow-to-build targets), “distclean” (delete even more than
“clean”), “test” or “check” (run the test suite), and “dist” (create a
package).</p>

<h3 id="complexity-and-growing-pains">Complexity and growing pains</h3>

<p>One of make’s big weak points is scaling up as a project grows in size.</p>

<h4 id="recursive-makefiles">Recursive Makefiles</h4>

<p>As your growing project is broken into subdirectories, you may be
tempted to put a Makefile in each subdirectory and invoke them
recursively.</p>

<p><a href="http://aegis.sourceforge.net/auug97.pdf"><strong>Don’t use recursive Makefiles</strong></a>. Recursion breaks the dependency
tree across separate instances of make and typically results in a
fragile build. There’s nothing good about it. Have one Makefile at the
root of your project and invoke make there. You may have to teach your
text editor how to do this.</p>

<p>When talking about files in subdirectories, just include the
subdirectory in the name. Everything will work the same as far as make
is concerned, including inference rules.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">src/graphics.o</span><span class="o">:</span> <span class="nf">src/graphics.c</span>
<span class="nl">src/physics.o</span><span class="o">:</span> <span class="nf">src/physics.c</span>
<span class="nl">src/input.o</span><span class="o">:</span> <span class="nf">src/input.c</span>
</code></pre></div></div>

<h4 id="out-of-source-builds">Out-of-source builds</h4>

<p>Keeping your object files separate from your source files is a nice
idea. When it comes to make, there’s good news and bad news.</p>

<p>The good news is that make can do this. You can pick whatever file names
you like for targets and prerequisites.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">obj/input.o</span><span class="o">:</span> <span class="nf">src/input.c</span>
</code></pre></div></div>

<p>The bad news is that inference rules are not compatible with
out-of-source builds. You’ll need to repeat the same commands for each
rule as if inference rules didn’t exist. This is tedious for large
projects, so you may want to have some sort of “configure” script, even
if hand-written, to generate all this for you. This is essentially what
CMake is all about. That, plus dependency management.</p>
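<p>A hand-written generator can be very small. The following C sketch formats one explicit rule for an out-of-source object file. The <code class="language-plaintext highlighter-rouge">format_rule()</code> name and the fixed <code class="language-plaintext highlighter-rouge">src/</code> and <code class="language-plaintext highlighter-rouge">obj/</code> directories are assumptions for illustration, and the generated command assumes a compiler that accepts <code class="language-plaintext highlighter-rouge">-o</code> together with <code class="language-plaintext highlighter-rouge">-c</code>, as GCC and Clang do.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical generator step: format the explicit rule that builds
 * obj/<name>.o from src/<name>.c, including its command line.
 * Returns 0 on success, -1 if name is empty or buf is too small. */
int format_rule(char *buf, size_t len, const char *name)
{
    assert(buf && name);
    if (strlen(name) == 0)
        return -1;
    int n = snprintf(buf, len,
                     "obj/%s.o: src/%s.c\n"
                     "\t$(CC) -c $(CFLAGS) -o obj/%s.o src/%s.c\n",
                     name, name, name, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

<p>Printing one such rule per source file, after the usual macro definitions, would give a Makefile that doesn’t rely on inference rules at all.</p>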

<h4 id="dependency-management">Dependency management</h4>

<p>Another problem with scaling up is tracking the project’s ever-changing
dependencies across all the source files. Missing a dependency means the
build may not be correct unless you <code class="language-plaintext highlighter-rouge">make clean</code> first.</p>

<p>If you go the route of using a script to generate the tedious parts of
the Makefile, both GCC and Clang have a nice feature for generating all
the Makefile dependencies for you (<code class="language-plaintext highlighter-rouge">-MM</code>, <code class="language-plaintext highlighter-rouge">-MT</code>), at least for C and
C++. There are lots of tutorials for doing this dependency generation on
the fly as part of the build, but it’s fragile and slow. Much better to
do it all up front and “bake” the dependencies into the Makefile so that
make can do its job properly. If the dependencies change, rebuild your
Makefile.</p>

<p>For example, here’s what it looks like invoking gcc’s dependency
generator against the imaginary <code class="language-plaintext highlighter-rouge">input.c</code> for an out-of-source build:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc $CFLAGS -MM -MT '$(BUILD)/input.o' input.c
$(BUILD)/input.o: input.c input.h graphics.h physics.h
</code></pre></div></div>

<p>Notice that the output is already in Makefile rule format.</p>

<p>Unfortunately this feature strips the leading paths from the target, so,
in practice, using it is always more complicated than it should be (e.g.
it requires the use of <code class="language-plaintext highlighter-rouge">-MT</code>).</p>

<h4 id="microsofts-nmake">Microsoft’s Nmake</h4>

<p>Microsoft has an implementation of make called Nmake, which <a href="/blog/2016/06/13/">comes with
Visual Studio</a>. It’s <em>nearly</em> a POSIX-compatible make, but
necessarily breaks from the standard in some places. Their cl.exe
compiler uses <code class="language-plaintext highlighter-rouge">.obj</code> as the object file extension and <code class="language-plaintext highlighter-rouge">.exe</code> for
binaries, both of which differ from the unix world, so it has different
built-in inference rules. Windows also lacks a Bourne shell and the
standard unix tools, so all of the commands will necessarily be
different.</p>

<p>There’s no equivalent of <code class="language-plaintext highlighter-rouge">rm -f</code> on Windows, so good luck writing a
proper “clean” target. No, <code class="language-plaintext highlighter-rouge">del /f</code> isn’t the same.</p>

<p>So while it’s close to POSIX make, it’s not practical to write a
Makefile that will simultaneously work properly with both POSIX make
and Nmake. These need to be separate Makefiles.</p>

<h3 id="may-your-makefiles-be-portable">May your Makefiles be portable</h3>

<p>It’s nice to have reliable, portable Makefiles that just work anywhere.
<a href="/blog/2017/03/30/">Code to the standards</a> and you don’t need feature tests or
other sorts of special treatment.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Stack Clashing for Fun and Profit</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/06/21/"/>
    <id>urn:uuid:43402771-3340-3dff-c18f-7110caeedb7d</id>
    <updated>2017-06-21T05:28:56Z</updated>
    <category term="c"/><category term="posix"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p><em>Stack clashing</em> has been in the news lately due to <a href="https://blog.qualys.com/securitylabs/2017/06/19/the-stack-clash">some recently
discovered vulnerabilities</a> along with proof-of-concept
exploits. As the announcement itself notes, this is not a new issue,
though this appears to be the first time it’s been given this
particular name. I do know of one “good” use of stack clashing, where
it’s used for something productive rather than as part of an attack. In this
article I’ll explain how it works.</p>

<p>You can find the complete code for this article here, ready to run:</p>

<ul>
  <li><a href="https://github.com/skeeto/stack-clash-coroutine">https://github.com/skeeto/stack-clash-coroutine</a></li>
</ul>

<p>But first, what is a stack clash? Here’s a rough picture of the
typical way process memory is laid out. The stack starts at a high
memory address and grows downwards. Code and static data sit at low
memory, with a <code class="language-plaintext highlighter-rouge">brk</code> pointer growing upward to make small allocations.
In the middle is the heap, where large allocations and memory mappings
take place.</p>

<p><img src="/img/diagram/process-memory.svg" alt="" /></p>

<p>Below the stack is a slim <em>guard page</em> that divides the stack and the
region of memory reserved for the heap. Reading or writing to that
memory will trap, causing the program to crash or some special action
to be taken. The goal is to prevent the stack from growing into the
heap, which could cause all sorts of trouble, like security issues.</p>

<p>The problem is that this thin guard page isn’t enough. It’s possible to
put a large allocation on the stack, never read or write to it, and
completely skip over the guard page, such that the heap and stack
overlap without detection.</p>
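<p>The arithmetic behind skipping the guard page can be modeled in a few lines. This is only a toy model, not an exploit: the <code class="language-plaintext highlighter-rouge">skips_guard()</code> function and its address parameters are hypothetical, and it assumes a downward-growing stack with the guard region occupying the half-open range from <code class="language-plaintext highlighter-rouge">guard_lo</code> up to <code class="language-plaintext highlighter-rouge">guard_hi</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a downward-growing stack. Returns 1 when moving the stack
 * pointer down by alloc bytes places it below the guard region
 * [guard_lo, guard_hi). If the allocation itself is never read or
 * written, nothing traps. Returns 0 when the new stack pointer is still
 * above the guard or inside it. */
int skips_guard(uintptr_t sp, uintptr_t alloc,
                uintptr_t guard_lo, uintptr_t guard_hi)
{
    assert(guard_lo < guard_hi && sp >= guard_hi);
    if (alloc > sp)
        return 0;  /* would wrap around zero; not modeled here */
    return sp - alloc < guard_lo;
}
```

<p>With a single 4KiB guard page, any untouched stack allocation larger than the remaining stack plus 4KiB is enough to make this return 1.</p>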

<p>Once this happens, writes into the heap will change memory on the
stack and vice versa. If an attacker can cause the program to make
such a large allocation on the stack, then legitimate writes into
memory on the heap can manipulate local variables or <a href="/blog/2017/01/21/">return pointers,
changing the program’s control flow</a>. This can bypass buffer
overflow protections, such as stack canaries.</p>

<h3 id="binary-trees-and-coroutines">Binary trees and coroutines</h3>

<p><img src="/img/diagram/binary-search-tree.svg" alt="" /></p>

<p>Now, I’m going to abruptly change topics to discuss binary search
trees. We’ll get back to stack clash in a bit. Suppose we have a
binary tree which we would like to iterate depth-first. For this
demonstration, here’s the C interface to the binary tree.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">left</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">right</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">value</span><span class="p">;</span>
<span class="p">};</span>

<span class="kt">void</span>  <span class="nf">tree_insert</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">**</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">);</span>
<span class="kt">char</span> <span class="o">*</span><span class="nf">tree_find</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">tree_visit</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">));</span>
<span class="kt">void</span>  <span class="nf">tree_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>An empty tree is the NULL pointer, hence the double-pointer for
insert. In the demonstration it’s an unbalanced search tree, but this
could very well be a balanced search tree with the addition of another
field on the structure.</p>
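<p>For reference, insert and find for the unbalanced case might look something like the following sketch. It is not necessarily the code in the repository linked above: it stores the caller’s pointers without copying them, overwrites the value when the key already exists, and aborts on allocation failure, all of which are choices made for brevity.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct tree {
    struct tree *left;
    struct tree *right;
    char *key;
    char *value;
};

/* Walk down to the empty slot where k belongs and attach a new node.
 * The double pointer lets an empty tree (NULL) be handled like any
 * other empty child slot. */
void tree_insert(struct tree **t, char *k, char *v)
{
    assert(t && k);
    while (*t) {
        int cmp = strcmp(k, (*t)->key);
        if (cmp == 0) {
            (*t)->value = v;  /* existing key: replace the value */
            return;
        }
        t = cmp < 0 ? &(*t)->left : &(*t)->right;
    }
    struct tree *node = malloc(sizeof(*node));
    if (!node)
        abort();
    node->left = node->right = NULL;
    node->key = k;
    node->value = v;
    *t = node;
}

/* Return the value stored under k, or NULL if k is absent. */
char *tree_find(struct tree *t, char *k)
{
    assert(k);
    while (t) {
        int cmp = strcmp(k, t->key);
        if (cmp == 0)
            return t->value;
        t = cmp < 0 ? t->left : t->right;
    }
    return NULL;
}
```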

<p>For the traversal, first visit the root node, then traverse its left
tree, and finally traverse its right tree. It makes for a simple,
recursive definition — the sort of thing you’d teach a beginner.
Here’s a definition that accepts a callback, which the caller will use
to <em>visit</em> each key/value in the tree. This really is as simple as it
gets.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">tree_visit</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">))</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">f</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">);</span>
        <span class="n">tree_visit</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
        <span class="n">tree_visit</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately this isn’t so convenient for the caller, who has to
split off a callback function that <a href="/blog/2017/01/08/">lacks context</a>, then hand
over control to the traversal function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">printer</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%s = %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">print_tree</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">tree</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">tree_visit</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">printer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Usually it’s much nicer for the caller if instead it’s provided an
<em>iterator</em>, which the caller can invoke at will. Here’s an interface
for it, just two functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">int</span>             <span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">);</span>
</code></pre></div></div>

<p>The first constructs an iterator object, and the second one visits a
key/value pair each time it’s called. It returns 0 when traversal is
complete, automatically freeing any resources associated with the
iterator.</p>

<p>The caller now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">,</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">tree_iterator</span><span class="p">(</span><span class="n">tree</span><span class="p">);</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">tree_next</span><span class="p">(</span><span class="n">it</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">k</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">v</span><span class="p">))</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%s = %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">);</span>
</code></pre></div></div>

<p>Notice I haven’t defined <code class="language-plaintext highlighter-rouge">struct tree_it</code>. That’s because I’ve got
four different implementations, each taking a different approach. The
last one will use stack clashing.</p>

<h4 id="manual-state-tracking">Manual state tracking</h4>

<p>With just the standard facilities provided by C, there’s some manual
bookkeeping that has to take place in order to convert the recursive
definition into an iterator. Depth-first traversal is a stack-oriented
process, and with recursion the stack is implicit in the call stack.
As an iterator, the traversal stack needs to be <a href="/blog/2016/11/13/">managed
explicitly</a>. The iterator needs to keep track of the path it
took so that it can backtrack, which means keeping track of parent
nodes as well as which branch was taken.</p>

<p>Here’s my little implementation, which, to keep things simple, has a
hard depth limit of 32. Its structure definition includes a stack of
node pointers, and 2 bits of information per visited node, stored
across a 64-bit integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">stack</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">state</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">nstack</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span>
<span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="p">));</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">it</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The 2 bits track three different states for each visited node:</p>

<ol>
  <li>Visit the current node</li>
  <li>Traverse the left tree</li>
  <li>Traverse the right tree</li>
</ol>
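<p>Packing 2 bits per stack level into one integer comes down to shifts and masks. Here is a minimal sketch, where the <code class="language-plaintext highlighter-rouge">get_state()</code> and <code class="language-plaintext highlighter-rouge">set_state()</code> helpers and the <code class="language-plaintext highlighter-rouge">DEPTH_MAX</code> constant are hypothetical names rather than part of the iterator itself:</p>

```c
#include <assert.h>

enum { DEPTH_MAX = 32 };  /* 32 levels * 2 bits = 64 bits */

/* Extract the 2-bit state stored for the given stack depth. */
unsigned get_state(unsigned long long packed, int depth)
{
    assert(depth >= 0 && depth < DEPTH_MAX);
    return 3u & (packed >> (depth * 2));
}

/* Return packed with the 2-bit state at the given depth replaced by s. */
unsigned long long set_state(unsigned long long packed, int depth, unsigned s)
{
    assert(depth >= 0 && depth < DEPTH_MAX && s < 4);
    int shift = depth * 2;
    packed &= ~(3ull << shift);                       /* clear the old state */
    return packed | ((unsigned long long)s << shift); /* store the new one */
}
```

<p>At 2 bits per level, 32 levels exactly fill a 64-bit integer, which matches the hard depth limit of 32.</p>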

<p>It works out to the following. Don’t worry too much about trying to
understand how this works. My point is to demonstrate that converting
the recursive definition into an iterator complicates the
implementation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">3u</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="n">shift</span><span class="p">);</span>
        <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">+=</span> <span class="mi">1ull</span> <span class="o">&lt;&lt;</span> <span class="n">shift</span><span class="p">;</span>
        <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">case</span> <span class="mi">0</span><span class="p">:</span>
                <span class="o">*</span><span class="n">k</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
                <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">)</span> <span class="p">{</span>
                    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">;</span>
                    <span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="p">(</span><span class="mi">3ull</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">shift</span> <span class="o">+</span> <span class="mi">2</span><span class="p">));</span>
                <span class="p">}</span>
                <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">)</span> <span class="p">{</span>
                    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">;</span>
                    <span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="p">(</span><span class="mi">3ull</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">shift</span> <span class="o">+</span> <span class="mi">2</span><span class="p">));</span>
                <span class="p">}</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
                <span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span><span class="o">--</span><span class="p">;</span>
                <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Wouldn’t it be nice to keep the recursive definition while also
getting an iterator? There’s an exact solution to that: coroutines.</p>

<h4 id="coroutines">Coroutines</h4>

<p>C doesn’t come with coroutines, but there are a number of libraries
available. We can also build our own coroutines. One way to do that is
with <em>user contexts</em> (<code class="language-plaintext highlighter-rouge">&lt;ucontext.h&gt;</code>) provided by the X/Open System
Interfaces Extension (XSI), an extension to POSIX. This set of
functions allows programs to create their own call stacks and switch
between them. That’s the key ingredient for coroutines. Caveat: These
functions aren’t widely available, and probably shouldn’t be used in
new code.</p>
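<p>To make context switching concrete before applying it to the tree, here is a minimal sketch of a coroutine that yields the numbers 1 through 3. All of the names (<code class="language-plaintext highlighter-rouge">counter()</code>, <code class="language-plaintext highlighter-rouge">counter_start()</code>, <code class="language-plaintext highlighter-rouge">counter_next()</code>) are made up for this example, the 64KiB stack size is arbitrary, and the state lives in file-scope variables purely to keep it short, so only one such coroutine can exist at a time.</p>

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t caller_ctx;   /* where to resume when the coroutine yields */
static ucontext_t counter_ctx;  /* the coroutine's own saved context */
static void *counter_stack;
static int counter_value;
static int counter_done;

/* Runs on its own stack, yielding 1, 2, 3 and then finishing. */
static void counter(void)
{
    for (int i = 1; i <= 3; i++) {
        counter_value = i;
        swapcontext(&counter_ctx, &caller_ctx);  /* yield to the caller */
    }
    counter_done = 1;
    /* Returning from here resumes uc_link, which is caller_ctx. */
}

/* Prepare the coroutine. Returns 0 on success, -1 on failure. */
int counter_start(void)
{
    size_t size = 64 * 1024;  /* arbitrary stack size for this sketch */
    counter_stack = malloc(size);
    if (!counter_stack)
        return -1;
    if (getcontext(&counter_ctx) != 0) {
        free(counter_stack);
        return -1;
    }
    counter_ctx.uc_stack.ss_sp = counter_stack;
    counter_ctx.uc_stack.ss_size = size;
    counter_ctx.uc_link = &caller_ctx;
    counter_done = 0;
    makecontext(&counter_ctx, counter, 0);
    return 0;
}

/* Resume the coroutine. Returns 1 and stores the next value in *out,
 * or frees the stack and returns 0 once the coroutine has finished.
 * It must not be called again after it has returned 0. */
int counter_next(int *out)
{
    assert(out);
    swapcontext(&caller_ctx, &counter_ctx);
    if (counter_done) {
        free(counter_stack);
        counter_stack = NULL;
        return 0;
    }
    *out = counter_value;
    return 1;
}
```

<p>The caller’s loop has the same shape as the iterator interface above: call <code class="language-plaintext highlighter-rouge">counter_start()</code> once, then call <code class="language-plaintext highlighter-rouge">counter_next()</code> until it returns 0.</p>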

<p>Here’s my iterator structure definition.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _XOPEN_SOURCE 600
#include</span> <span class="cpf">&lt;ucontext.h&gt;</span><span class="cp">
</span>
<span class="k">struct</span> <span class="n">tree_it</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
    <span class="n">ucontext_t</span> <span class="n">coroutine</span><span class="p">;</span>
    <span class="n">ucontext_t</span> <span class="n">yield</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>It needs one context for the original stack and one context for the
iterator’s stack. Each time the iterator is invoked, the program
will switch to the other stack, find the next value, then switch back.
This process is called <em>yielding</em>. Values are passed between contexts
using the <code class="language-plaintext highlighter-rouge">k</code> (key) and <code class="language-plaintext highlighter-rouge">v</code> (value) fields on the iterator.</p>

<p>Before I get into initialization, here’s the actual traversal
coroutine. It’s nearly the same as the original recursive definition
except for the <code class="language-plaintext highlighter-rouge">swapcontext()</code>. This is the <em>yield</em>, pausing execution
and sending control back to the caller. The current context is saved
in the first argument, and the second argument becomes the current
context.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">coroutine</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="n">swapcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">);</span>
        <span class="n">coroutine</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
        <span class="n">coroutine</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>While the actual traversal is simple again, initialization is more
complicated. The first problem is that there’s no way to pass pointer
arguments to the coroutine. Technically only <code class="language-plaintext highlighter-rouge">int</code> arguments are
permitted. (All the online tutorials get this wrong.) To work around
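this problem, I smuggle the arguments in as global variables. This
would cause problems should two different threads try to create
iterators at the same time, even on different trees. The globals
aren’t read until the coroutine first runs, so creating a second
iterator before advancing the first breaks things even in a single
thread.</p>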

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">tree_arg</span><span class="p">;</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">tree_it_arg</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">coroutine_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">coroutine</span><span class="p">(</span><span class="n">tree_arg</span><span class="p">,</span> <span class="n">tree_it_arg</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The stack has to be allocated manually, which I do with a call to
<code class="language-plaintext highlighter-rouge">malloc()</code>. Nothing <a href="/blog/2015/05/15/">fancy is needed</a>, though this means the new
stack won’t have a guard page. For the stack size, I use the suggested
value of <code class="language-plaintext highlighter-rouge">SIGSTKSZ</code>. The <code class="language-plaintext highlighter-rouge">makecontext()</code> function is what creates the
new context from scratch, but the new context must first be
initialized with <code class="language-plaintext highlighter-rouge">getcontext()</code>, even though that particular snapshot
won’t actually be used.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span>
<span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="p">));</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">SIGSTKSZ</span><span class="p">);</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_size</span> <span class="o">=</span> <span class="n">SIGSTKSZ</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">.</span><span class="n">uc_link</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">;</span>
    <span class="n">getcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">);</span>
    <span class="n">makecontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">,</span> <span class="n">coroutine_init</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">tree_arg</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">tree_it_arg</span> <span class="o">=</span> <span class="n">it</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">it</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice I gave it a function pointer, a lot like I’m starting a new
thread. This is no coincidence. There’s a lot of similarity between
coroutines and multiple threads, as you’ll soon see.</p>

<p>Finally the iterator function itself. Since NULL isn’t a valid key, it
initializes the key to NULL before yielding to the iterator context.
If the iterator has no more nodes to visit, it doesn’t set the key,
which can be detected when control returns.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">swapcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">k</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">;</span>
        <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span><span class="p">);</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That’s all it takes to create and operate a coroutine in C, provided
you’re on a system with these XSI extensions.</p>
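
<p>To see the same sequence of calls outside the tree code, here’s a
rough, self-contained sketch of a generator that counts from zero. The
names <code class="language-plaintext highlighter-rouge">struct counter</code>, <code class="language-plaintext highlighter-rouge">counter_create()</code>, <code class="language-plaintext highlighter-rouge">counter_next()</code>, and the
64kB stack size are all made up for illustration. It assumes a system
that still ships the XSI context functions, and it sets the smuggled
global right before each resume so that several generators can coexist
in one thread.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600
#endif
#include &lt;stdlib.h&gt;
#include &lt;ucontext.h&gt;

#define COUNTER_STACK_SIZE (64 * 1024)  /* assumed size, not tuned */

struct counter {
    int value;             /* last value produced, -1 once exhausted */
    int limit;
    ucontext_t coroutine;
    ucontext_t yield;
};

static struct counter *counter_arg;  /* smuggled argument */

static void
counter_run(void)
{
    struct counter *c = counter_arg;
    for (int i = 0; i &lt; c-&gt;limit; i++) {
        c-&gt;value = i;
        swapcontext(&amp;c-&gt;coroutine, &amp;c-&gt;yield);  /* yield one value */
    }
    /* Returning resumes uc_link, which is c-&gt;yield. */
}

static struct counter *
counter_create(int limit)
{
    struct counter *c = malloc(sizeof(*c));
    char *stack = malloc(COUNTER_STACK_SIZE);
    if (!c || !stack) {
        free(c);
        free(stack);
        return 0;
    }
    c-&gt;limit = limit;
    getcontext(&amp;c-&gt;coroutine);  /* initialize before filling in fields */
    c-&gt;coroutine.uc_stack.ss_sp = stack;
    c-&gt;coroutine.uc_stack.ss_size = COUNTER_STACK_SIZE;
    c-&gt;coroutine.uc_link = &amp;c-&gt;yield;
    makecontext(&amp;c-&gt;coroutine, counter_run, 0);
    return c;
}

/* Returns the next value, or -1 (and frees c) when exhausted. */
static int
counter_next(struct counter *c)
{
    c-&gt;value = -1;
    counter_arg = c;  /* only read on the very first resume */
    swapcontext(&amp;c-&gt;yield, &amp;c-&gt;coroutine);
    if (c-&gt;value &lt; 0) {
        free(c-&gt;coroutine.uc_stack.ss_sp);
        free(c);
        return -1;
    }
    return c-&gt;value;
}
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">counter_next()</code> on a generator created with a limit of 3
yields 0, 1, and 2 before reporting exhaustion, at which point the
generator has freed itself, just like the tree iterator.</p>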

<h4 id="semaphores">Semaphores</h4>

<p>Instead of a coroutine, we could just use actual threads and a couple
of semaphores to synchronize them. This is a heavy implementation and
also probably shouldn’t be used in practice, but at least it needs
nothing beyond POSIX threads and semaphores.</p>

<p>Here’s the structure definition:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
    <span class="n">sem_t</span> <span class="n">visitor</span><span class="p">;</span>
    <span class="n">sem_t</span> <span class="n">main</span><span class="p">;</span>
    <span class="n">pthread_t</span> <span class="kr">thread</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The main thread will wait on one semaphore and the iterator thread
will wait on the other. This <a href="/blog/2017/02/14/">should sound very familiar</a>.</p>

<p>The actual traversal function looks the same, but with <code class="language-plaintext highlighter-rouge">sem_post()</code>
and <code class="language-plaintext highlighter-rouge">sem_wait()</code> as the yield.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">visit</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">);</span>
        <span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">);</span>
        <span class="n">visit</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
        <span class="n">visit</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Again, there’s a separate function to initialize the iterator context.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">thread_entrance</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">);</span>
    <span class="n">visit</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">t</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Creating the iterator only requires initializing the semaphores and
creating the thread:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span>
<span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="p">));</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">sem_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">sem_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">pthread_create</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="kr">thread</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">thread_entrance</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">it</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The iterator function looks just like the coroutine version.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">);</span>
    <span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">k</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">;</span>
        <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">pthread_join</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="kr">thread</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">sem_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">);</span>
        <span class="n">sem_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">);</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Overall, this is almost identical to the coroutine version.</p>
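
<p>The two-semaphore handoff is easier to follow in isolation, so
here’s a minimal sketch of the same pattern driving a counting
generator instead of a tree traversal. The names (<code class="language-plaintext highlighter-rouge">struct counter</code>,
<code class="language-plaintext highlighter-rouge">counter_create()</code>, <code class="language-plaintext highlighter-rouge">counter_next()</code>) are hypothetical, error handling is
mostly omitted, and it assumes a platform where unnamed POSIX
semaphores (<code class="language-plaintext highlighter-rouge">sem_init()</code>) actually work. Build with <code class="language-plaintext highlighter-rouge">-pthread</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;pthread.h&gt;
#include &lt;semaphore.h&gt;
#include &lt;stdlib.h&gt;

struct counter {
    int value;        /* last value produced, -1 once exhausted */
    int limit;
    sem_t producer;   /* posted by the caller to request a value */
    sem_t consumer;   /* posted by the thread when a value is ready */
    pthread_t thread;
};

static void *
counter_thread(void *arg)
{
    struct counter *c = arg;
    sem_wait(&amp;c-&gt;producer);          /* wait for the first request */
    for (int i = 0; i &lt; c-&gt;limit; i++) {
        c-&gt;value = i;
        sem_post(&amp;c-&gt;consumer);      /* "yield" the value */
        sem_wait(&amp;c-&gt;producer);      /* sleep until asked again */
    }
    sem_post(&amp;c-&gt;consumer);          /* done: value is still -1 */
    return 0;
}

static struct counter *
counter_create(int limit)
{
    struct counter *c = malloc(sizeof(*c));
    if (!c)
        return 0;
    c-&gt;limit = limit;
    sem_init(&amp;c-&gt;producer, 0, 0);
    sem_init(&amp;c-&gt;consumer, 0, 0);
    if (pthread_create(&amp;c-&gt;thread, 0, counter_thread, c)) {
        sem_destroy(&amp;c-&gt;consumer);
        sem_destroy(&amp;c-&gt;producer);
        free(c);
        return 0;
    }
    return c;
}

/* Returns the next value, or -1 (and frees c) when exhausted. */
static int
counter_next(struct counter *c)
{
    c-&gt;value = -1;
    sem_post(&amp;c-&gt;producer);
    sem_wait(&amp;c-&gt;consumer);
    if (c-&gt;value &lt; 0) {
        pthread_join(c-&gt;thread, 0);
        sem_destroy(&amp;c-&gt;consumer);
        sem_destroy(&amp;c-&gt;producer);
        free(c);
        return -1;
    }
    return c-&gt;value;
}
</code></pre></div></div>

<p>Only one of the two threads is ever runnable at a time, which is
what makes this behave like a coroutine rather than like real
concurrency.</p>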

<h4 id="coroutines-using-stack-clashing">Coroutines using stack clashing</h4>

<p>Finally I can tie this back into the topic at hand. Without either XSI
extensions or Pthreads, we can (usually) create coroutines by abusing
<code class="language-plaintext highlighter-rouge">setjmp()</code> and <code class="language-plaintext highlighter-rouge">longjmp()</code>. Technically this violates two of C’s
rules and relies on undefined behavior, but it generally works. This
<a href="http://fanf.livejournal.com/105413.html">is not my own invention</a>, and it dates back to at least 2010.</p>

<p>From the very beginning, C has provided a crude “exception” mechanism
that allows the stack to be abruptly unwound back to a previous state.
It’s a sort of non-local goto. Call <code class="language-plaintext highlighter-rouge">setjmp()</code> to capture an opaque
<code class="language-plaintext highlighter-rouge">jmp_buf</code> object to be used in the future. This function returns 0
this first time. Hand that <code class="language-plaintext highlighter-rouge">jmp_buf</code> to <code class="language-plaintext highlighter-rouge">longjmp()</code> later, even in a
different function, and <code class="language-plaintext highlighter-rouge">setjmp()</code> will return again, this time with a
non-zero value.</p>
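
<p>For reference, here’s a small sketch of the ordinary, well-defined
use: bailing out of a nested call back to a frame that is still live.
The function names <code class="language-plaintext highlighter-rouge">parse_digit()</code> and <code class="language-plaintext highlighter-rouge">sum_digits()</code> are invented for
this example.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;setjmp.h&gt;

static jmp_buf on_error;

static int
parse_digit(char ch)
{
    if (ch &lt; '0' || ch &gt; '9')
        longjmp(on_error, 1);  /* unwind straight back to setjmp() */
    return ch - '0';
}

/* Returns the sum of the decimal digits in s, or -1 on a non-digit. */
static int
sum_digits(const char *s)
{
    int sum = 0;
    if (setjmp(on_error))
        return -1;             /* second return, via longjmp() */
    for (; *s; s++)
        sum += parse_digit(*s);
    return sum;
}
</code></pre></div></div>

<p>The jump only ever goes <em>up</em> the call stack, into a function that
hasn’t returned yet. That’s the property the coroutine trick gives
up.</p>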

<p>It’s technically unsuitable for coroutines because the jump is a
one-way trip. The unwound stack invalidates any <code class="language-plaintext highlighter-rouge">jmp_buf</code> that was
created after the target of the jump. In practice, though, you can
still use these jumps, which is one rule being broken.</p>

<p>That’s where stack clashing comes into play. In order for it to be a
proper coroutine, it needs to have its own stack. But how can we do
that with these primitive C utilities? <strong>Extend the stack to overlap
the heap, call <code class="language-plaintext highlighter-rouge">setjmp()</code> to capture a coroutine on it, then return.</strong>
Generally we can get away with using <code class="language-plaintext highlighter-rouge">longjmp()</code> to return to this
heap-allocated stack.</p>

<p>Here’s my iterator definition for this one. Like the XSI context
struct, this has two <code class="language-plaintext highlighter-rouge">jmp_buf</code> “contexts.” The <code class="language-plaintext highlighter-rouge">stack</code> holds the
iterator’s stack buffer so that it can be freed, and the <code class="language-plaintext highlighter-rouge">gap</code> field
will be used to prevent the optimizer from spoiling our plans.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">stack</span><span class="p">;</span>
    <span class="k">volatile</span> <span class="kt">char</span> <span class="o">*</span><span class="n">gap</span><span class="p">;</span>
    <span class="kt">jmp_buf</span> <span class="n">coroutine</span><span class="p">;</span>
    <span class="kt">jmp_buf</span> <span class="n">yield</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The coroutine looks familiar again. This time the yield is performed
with <code class="language-plaintext highlighter-rouge">setjmp()</code> and <code class="language-plaintext highlighter-rouge">longjmp()</code>, just like <code class="language-plaintext highlighter-rouge">swapcontext()</code>. Remember
that <code class="language-plaintext highlighter-rouge">setjmp()</code> returns twice, hence the branch. The <code class="language-plaintext highlighter-rouge">longjmp()</code> never
returns.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">coroutine</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">setjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">))</span>
            <span class="n">longjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">coroutine</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
        <span class="n">coroutine</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Next is the tricky part to cause the stack clash. First, allocate the
new stack with <code class="language-plaintext highlighter-rouge">malloc()</code> so that we can get its address. Then use a
local variable on the stack to determine how much the stack needs to
grow in order to overlap with the allocation. Taking the difference
between these pointers is illegal as far as the language is concerned,
making this the second rule I’m breaking. I can <a href="/blog/2017/05/03/">imagine an
implementation</a> where the stack and heap are in two separate
kinds of memory, and it would be meaningless to take the difference. I
don’t have to imagine very hard, because this is how it used to work
on the 8086 with its <a href="https://en.wikipedia.org/wiki/X86_memory_segmentation">segmented memory
architecture</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span>
<span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="p">));</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">STACK_SIZE</span><span class="p">);</span>
    <span class="kt">char</span> <span class="n">marker</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">gap</span><span class="p">[</span><span class="o">&amp;</span><span class="n">marker</span> <span class="o">-</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span> <span class="o">-</span> <span class="n">STACK_SIZE</span><span class="p">];</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">gap</span> <span class="o">=</span> <span class="n">gap</span><span class="p">;</span> <span class="c1">// prevent optimization</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">setjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">))</span>
        <span class="n">coroutine_init</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">it</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

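<p>I’m using a variable-length array (VLA) named <code class="language-plaintext highlighter-rouge">gap</code> to indirectly
control the stack pointer, moving it over the heap. I’m assuming the
stack grows downward, since otherwise the sign would be wrong. The
two-argument <code class="language-plaintext highlighter-rouge">coroutine_init()</code> called here is not the XSI helper of the
same name. It runs with the stack pointer already inside the heap
allocation, so presumably it captures <code class="language-plaintext highlighter-rouge">it-&gt;coroutine</code> with <code class="language-plaintext highlighter-rouge">setjmp()</code>,
hands control back to <code class="language-plaintext highlighter-rouge">tree_iterator()</code>, and begins the traversal only
once <code class="language-plaintext highlighter-rouge">tree_next()</code> resumes it.</p>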

<p>The compiler is smart and will notice I’m not actually using <code class="language-plaintext highlighter-rouge">gap</code>,
and it’s happy to throw it away. In fact, it’s vitally important that
I <em>don’t</em> touch it since the guard page, along with a bunch of
unmapped memory, is actually somewhere in the middle of that array. I
only want the array for its side effect, but that side effect isn’t
officially supported, which means the optimizer doesn’t need to
consider it in its decisions. To inhibit the optimizer, I store the
array’s address where someone might potentially look at it, meaning
the array has to exist.</p>

<p>Finally, the iterator function looks just like the others, again.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">setjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">))</span>
        <span class="n">longjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">k</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">;</span>
        <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">);</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And that’s it: a nasty hack using a stack clash to create a context
for a <code class="language-plaintext highlighter-rouge">setjmp()</code>+<code class="language-plaintext highlighter-rouge">longjmp()</code> coroutine.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>How to Write Portable C Without Complicating Your Build</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/03/30/"/>
    <id>urn:uuid:e1651834-8033-3bfa-6eaf-00bc38a0584a</id>
    <updated>2017-03-30T04:06:58Z</updated>
    <category term="c"/><category term="posix"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re writing a non-GUI C application intended to run on a
number of operating systems: Linux, the various BSDs, macOS, <a href="https://en.wikipedia.org/wiki/Illumos">classical
unix</a>, and perhaps even something as exotic as Windows. It might
sound like a rather complicated problem. These operating systems have
slightly different interfaces (or <em>very</em> different in one case), and they
run different variants of the standard unix tools — a problem for
portable builds.</p>

<p>With some up-front attention to detail, this is actually not terribly
difficult. <strong>Unix-like systems are probably the least diverse and least
buggy they’ve ever been.</strong> Writing portable code is really just a matter
of <strong>coding to the standards</strong> and ignoring extensions unless
<em>absolutely</em> necessary. Knowing what’s standard and what’s extension is
the tricky part, but I’ll explain how to find this information.</p>

<p>You might be tempted to reach for an <a href="https://undeadly.org/cgi?action=article;sid=20170930133438">overly complicated</a> solution
such as GNU Autoconf. Sure, it creates a configure script with the
familiar, conventional interface. This has real value. But do you
<em>really</em> need to run a single-threaded gauntlet of hundreds of
feature/bug tests for things that sometimes worked incorrectly in some
weird unix variant back in the 1990s? On a machine with many cores
(parallel build, <code class="language-plaintext highlighter-rouge">-j</code>), this may very well be the slowest part of the
whole build process.</p>

<p>For example, the configure script for Emacs checks that the compiler
supplies <code class="language-plaintext highlighter-rouge">stdlib.h</code>, <code class="language-plaintext highlighter-rouge">string.h</code>, and <code class="language-plaintext highlighter-rouge">getenv</code> — things that were
standardized nearly 30 years ago. It also checks for a slew of POSIX
functions that have been standard since 2001.</p>

<p>There’s a much easier solution: Document that the application requires,
say, C99 and POSIX.1-2001. It’s the responsibility of the person
building the application to supply these implementations, so there’s no
reason to waste time testing for them.</p>

<h3 id="how-to-code-to-the-standards">How to code to the standards</h3>

<p>Suppose there’s some function you want to use, but you’re not sure if
it’s standard or an extension. Or maybe you don’t know what standard it
comes from. Luckily the man pages document this stuff very well,
especially on Linux. Check the friendly “CONFORMING TO” section. For
example, look at <a href="https://manpages.debian.org/jessie/manpages-dev/getenv.3.en.html">getenv(3)</a>. Here’s what that section has to
say:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFORMING TO
    getenv(): SVr4, POSIX.1-2001, 4.3BSD, C89, C99.

    secure_getenv() is a GNU extension.
</code></pre></div></div>

<p>This says this function comes from the original C standard. It’s <em>always</em>
available on anything that claims to be a C implementation. The man page
also documents <code class="language-plaintext highlighter-rouge">secure_getenv()</code>, which is a GNU extension: to be avoided
in anything intended to be portable.</p>

<p>What about <a href="https://manpages.debian.org/jessie/manpages-dev/sleep.3.en.html">sleep(3)</a>?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFORMING TO
    POSIX.1-2001.
</code></pre></div></div>

<p>This function isn’t part of standard C, but it’s available on any system
claiming to implement POSIX.1-2001 (the POSIX standard from 2001). If the
program needs to run on an operating system not implementing this POSIX
standard (e.g. Windows), you’ll need to call an alternative function,
probably inside a different <code class="language-plaintext highlighter-rouge">#if .. #endif</code> branch. More on this in a
moment.</p>

<p>If you’re coding to POSIX, you <a href="http://pubs.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_02.html"><em>must</em> define the <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code>
feature test macro</a> to the standard you intend to use prior to
any system header includes:</p>

<blockquote>
  <p>A POSIX-conforming application should ensure that the feature test
macro <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code> is defined before inclusion of any header.</p>
</blockquote>

<p>For example, to properly access POSIX.1-2001 functions in your
application, define <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code> to <code class="language-plaintext highlighter-rouge">200112L</code>. With this defined,
it’s safe to assume access to all of C and everything from that standard
of POSIX. You can do this at the top of your sources, but I personally
like the tidiness of a global <code class="language-plaintext highlighter-rouge">config.h</code> that gets included before
everything.</p>
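
<p>As a minimal sketch of that arrangement, assuming a hypothetical <code class="language-plaintext highlighter-rouge">config.h</code>-style prologue at the very top of a source file: the macro is defined first, the system headers come after it, and a hypothetical helper <code class="language-plaintext highlighter-rouge">have_posix_2001()</code> reports whether the headers advertise at least POSIX.1-2001 through <code class="language-plaintext highlighter-rouge">_POSIX_VERSION</code>. The <code class="language-plaintext highlighter-rouge">#ifndef</code> guard is an assumption of this sketch, added so that a definition passed on the compiler command line can still take precedence.</p>

```c
/* config.h-style prologue: this must come before any system header. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* request POSIX.1-2001 */
#endif

#include <unistd.h>  /* defines _POSIX_VERSION on POSIX systems */

/* Hypothetical helper: returns 1 if the headers advertise at least
 * POSIX.1-2001, and 0 otherwise. */
int
have_posix_2001(void)
{
#if defined(_POSIX_VERSION) && _POSIX_VERSION >= 200112L
    return 1;
#else
    return 0;
#endif
}
```

<p>In a real project the first three lines would probably live in <code class="language-plaintext highlighter-rouge">config.h</code>, with every source file including that header before anything else.</p>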

<h3 id="how-to-create-a-portable-build">How to create a portable build</h3>

<p>So you’ve written clean, portable C to the standards. How do you build
this application? The natural choice is <code class="language-plaintext highlighter-rouge">make</code>. It’s available
everywhere and it’s part of POSIX.</p>

<p>Again, the tricky part is teasing apart the standard from the extension.
I’m a long-time sinner in this regard, having far too often written
Makefiles that depend on GNU Make extensions. This is a real pain when
building programs on systems without the GNU utilities. I’ve been making
amends (and <a href="https://marc.info/?l=openbsd-bugs&amp;m=148815538325392&amp;w=2">finding</a> some <a href="https://marc.info/?l=openbsd-bugs&amp;m=148734102504016&amp;w=2">bugs</a> as a result).</p>

<p>No implementation makes the division clear in its documentation, and
especially don’t bother looking at the GNU Make manual. Your best
resource is <a href="http://pubs.opengroup.org/onlinepubs/009695399/utilities/make.html">the standard itself</a>. If you’re already familiar with
<code class="language-plaintext highlighter-rouge">make</code>, coding to the standard is largely a matter of <em>unlearning</em> the
various extensions you know.</p>

<p>Outside of <a href="/blog/2016/04/30/">some hacks</a>, this means you don’t get conditionals
(<code class="language-plaintext highlighter-rouge">if</code>, <code class="language-plaintext highlighter-rouge">else</code>, etc.). With some practice, both with sticking to portable
code and writing portable Makefiles, you’ll find that you <em>don’t really
need them</em>. Following the macro conventions will cover most situations.
For example:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CC</code>: the C compiler program</li>
  <li><code class="language-plaintext highlighter-rouge">CFLAGS</code>: flags to pass to the C compiler</li>
  <li><code class="language-plaintext highlighter-rouge">LDFLAGS</code>: flags to pass to the linker (via the C compiler)</li>
  <li><code class="language-plaintext highlighter-rouge">LDLIBS</code>: libraries to pass to the linker</li>
</ul>

<p>You don’t need to do anything weird with the assignments. The user
invoking <code class="language-plaintext highlighter-rouge">make</code> can override them easily. For example, here’s part of a
Makefile:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CC     = c99
CFLAGS = -Wall -Wextra -Os
</code></pre></div></div>

<p>But the user wants to use <code class="language-plaintext highlighter-rouge">clang</code>, and their system needs to explicitly
link <code class="language-plaintext highlighter-rouge">-lsocket</code> (e.g. Solaris). The user can override the macro
definitions on the command line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make CC=clang LDLIBS=-lsocket
</code></pre></div></div>

<p>The same rules apply to the programs you invoke from the Makefile. Read
the standards documents and ignore your system’s man pages so as to avoid
accidentally using an extension. It’s especially valuable to learn <a href="http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html">the
Bourne shell language</a> and avoid any accidental bashisms in your
Makefiles and scripts. The <code class="language-plaintext highlighter-rouge">dash</code> shell is good for testing your scripts.</p>

<p>Makefiles conforming to the standard will, unfortunately, be more verbose
than those taking advantage of a particular implementation. If you know
how to code Bourne shell — which is not terribly difficult to learn —
then you might even consider hand-writing a <code class="language-plaintext highlighter-rouge">configure</code> script to
generate the Makefile (a la metaprogramming). This gives you a more
flexible language with conditionals, and, being generated, redundancy in
the Makefile no longer matters.</p>

<p>As someone who frequently dabbles with BSD systems, my life has gotten a
lot easier since learning to write portable Makefiles and scripts.</p>

<h3 id="but-what-about-windows">But what about Windows</h3>

<p>It’s the elephant in the room and I’ve avoided talking about it so far.
If you want to <a href="/blog/2016/06/13/">build with Visual Studio’s command line tools</a> —
something I do on occasion — build portability goes out the window.
Visual Studio has <code class="language-plaintext highlighter-rouge">nmake.exe</code>, which nearly conforms to POSIX <code class="language-plaintext highlighter-rouge">make</code>.
However, without the standard unix utilities and with the completely
foreign compiler interface for <code class="language-plaintext highlighter-rouge">cl.exe</code>, there’s absolutely no hope of
writing a Makefile portable to this situation.</p>

<p>The nice alternative is MinGW(-w64) with MSYS or Cygwin supplying the
unix utilities, though it has <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">the problem</a> of linking against
<code class="language-plaintext highlighter-rouge">msvcrt.dll</code>. Another option is a separate Makefile dedicated to
<code class="language-plaintext highlighter-rouge">nmake.exe</code> and the Visual Studio toolchain. Good luck defining a
correctly working “clean” target with <code class="language-plaintext highlighter-rouge">del.exe</code>.</p>

<p>My preferred approach lately is an amalgamation build (as seen in
<a href="https://github.com/skeeto/enchive">Enchive</a>): Carefully concatenate all the application’s sources
into one giant source file. First concatenate all the headers in the
right order, followed by all the C files. Use <code class="language-plaintext highlighter-rouge">sed</code> to remove any local
includes. You can do this all on a unix system with the nice utilities,
then point <code class="language-plaintext highlighter-rouge">cl.exe</code> at the amalgamation for the Visual Studio build.
It’s not very useful for actual development (i.e. you don’t want to edit
the amalgamation), but that’s what MinGW-w64 resolves.</p>

<p>What about all those POSIX functions? You’ll need to find Win32
replacements on MSDN. I prefer to do this by abstracting those
operating system calls. For example, compare POSIX <code class="language-plaintext highlighter-rouge">sleep(3)</code> and <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx">Win32
<code class="language-plaintext highlighter-rouge">Sleep()</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if defined(_WIN32)
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">my_sleep</span><span class="p">(</span><span class="kt">int</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Sleep</span><span class="p">(</span><span class="n">s</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">);</span>  <span class="c1">// TODO: handle overflow, maybe</span>
<span class="p">}</span>

<span class="cp">#else </span><span class="cm">/* __unix__ */</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">my_sleep</span><span class="p">(</span><span class="kt">int</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sleep</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>  <span class="c1">// TODO: fix signal interruption</span>
<span class="p">}</span>
<span class="cp">#endif
</span></code></pre></div></div>

<p>Then the rest of the program calls <code class="language-plaintext highlighter-rouge">my_sleep()</code>. There’s another example
in <a href="/blog/2017/03/01/">the OpenMP article</a> with <code class="language-plaintext highlighter-rouge">pwrite(2)</code> and <code class="language-plaintext highlighter-rouge">WriteFile()</code>. This
demonstrates that supporting a bunch of different unix-like systems is
really easy, but introducing Windows portability adds a disproportionate
amount of complexity.</p>
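
<p>The two TODO comments above could be addressed roughly as follows. This is only a sketch: <code class="language-plaintext highlighter-rouge">seconds_to_ms()</code> is a hypothetical helper, and whether to resume sleeping after a signal is a design choice that depends on the program. The POSIX branch relies on <code class="language-plaintext highlighter-rouge">sleep()</code> returning the number of unslept seconds when interrupted. The conversion saturates just below <code class="language-plaintext highlighter-rouge">0xFFFFFFFF</code>, since Win32 treats that value (<code class="language-plaintext highlighter-rouge">INFINITE</code>) as “never time out.”</p>

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdint.h>

/* Hypothetical helper: convert seconds to milliseconds, saturating
 * instead of overflowing. The cap stays one below 0xFFFFFFFF so the
 * result is never mistaken for Win32's INFINITE. */
uint32_t
seconds_to_ms(int s)
{
    if (s <= 0) {
        return 0;
    }
    if ((uint32_t)s > (UINT32_MAX - 1) / 1000) {
        return UINT32_MAX - 1;
    }
    return (uint32_t)s * 1000;
}

#if defined(_WIN32)
#include <windows.h>

void
my_sleep(int s)
{
    Sleep(seconds_to_ms(s));
}

#else /* POSIX */
#include <unistd.h>

void
my_sleep(int s)
{
    unsigned left = s > 0 ? (unsigned)s : 0;
    while (left > 0) {
        left = sleep(left);  /* nonzero means a signal cut it short */
    }
}
#endif
```

<p>A program that wants a signal to cut the sleep short would drop the loop and return the remaining seconds to the caller instead.</p>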

<h4 id="caveat-paths-and-filenames">Caveat: paths and filenames</h4>

<p>There’s one major complication with filenames for applications portable
to Windows. In the unix world, filenames are null-terminated bytestrings.
Typically these are Unicode strings encoded as UTF-8, but it’s not
necessarily so. The kernel just sees bytestrings. A bytestring doesn’t
necessarily have a formal Unicode representation, which can be a problem
for <a href="https://www.python.org/dev/peps/pep-0383/">languages that want filenames to be Unicode strings</a>
(<a href="http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html">also</a>).</p>

<p>On Windows, filenames are somewhere between UCS-2 and UTF-16, but end up
being neither. They’re really null-terminated unsigned 16-bit integer
arrays. It’s <em>almost</em> UTF-16 except that Windows allows unpaired
surrogates. This means Windows filenames <em>also</em> don’t have a formal
Unicode representation, but in a completely different way than unix. Some
<a href="https://simonsapin.github.io/wtf-8/">heroic efforts have gone into working around this issue</a>.</p>

<p>As a result, it’s highly non-trivial to correctly support all possible
filenames on both systems in the same program, <em>especially</em> when they’re
passed as command line arguments.</p>
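
<p>To make the “almost UTF-16” point concrete, here is a small sketch of a check for unpaired surrogates. The function name <code class="language-plaintext highlighter-rouge">utf16_is_well_formed()</code> is hypothetical, and it takes a plain <code class="language-plaintext highlighter-rouge">uint16_t</code> array rather than a Win32 type so that it compiles anywhere. A Windows filename that fails this check is still a legal filename, which is exactly the trouble.</p>

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the null-terminated 16-bit string
 * is well-formed UTF-16, or 0 if it contains an unpaired surrogate. */
int
utf16_is_well_formed(const uint16_t *s)
{
    for (; *s; s++) {
        if (*s >= 0xD800 && *s <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. A
             * terminator here also fails the range test. */
            if (s[1] < 0xDC00 || s[1] > 0xDFFF) {
                return 0;
            }
            s++;  /* skip the low half of the pair */
        } else if (*s >= 0xDC00 && *s <= 0xDFFF) {
            return 0;  /* low surrogate with no high surrogate before it */
        }
    }
    return 1;
}
```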

<h3 id="summary">Summary</h3>

<p>The key points are:</p>

<ol>
  <li>Document the standards your application requires and strictly stick
to them.</li>
  <li>Ignore the vendor documentation if it doesn’t clearly delineate
extensions.</li>
</ol>

<p>This was all a discussion of non-GUI applications, and I didn’t really
touch on libraries. Many libraries are simple to access in the build
(just add them to <code class="language-plaintext highlighter-rouge">LDLIBS</code>), but some libraries, GUI toolkits in particular, are
complicated to manage portably and will require a more
complex solution (pkg-config, CMake, Autoconf, etc.).</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>OpenMP and pwrite()</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/03/01/"/>
    <id>urn:uuid:dfdf8ca6-51aa-3a15-6bf0-98b39f20652a</id>
    <updated>2017-03-01T21:22:24Z</updated>
    <category term="c"/><category term="posix"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>The most common way I introduce multi-threading to <a href="/blog/2015/07/10/">small C
programs</a> is with OpenMP (Open Multi-Processing). It’s typically
used as compiler pragmas to parallelize computationally expensive
loops — iterations are processed by different threads in some
arbitrary order.</p>

<p>Here’s an example that computes the <a href="/blog/2011/11/28/">frames of a video</a> in
parallel. Despite being computed out of order, each frame is written
in order to a large buffer, then written to standard output all at
once at the end.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_frames</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">output</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">beta</span> <span class="o">=</span> <span class="n">DEFAULT_BETA</span><span class="p">;</span>

<span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="o">&amp;</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">output</span><span class="p">);</span>
</code></pre></div></div>

<p>Adding OpenMP to this program is much simpler than introducing
low-level threading semantics with, say, Pthreads. With care, there’s
often no need for explicit thread synchronization. It’s also fairly
well supported by many vendors, even Microsoft (up to OpenMP 2.0), so
a multi-threaded OpenMP program is quite portable without <code class="language-plaintext highlighter-rouge">#ifdef</code>.</p>

<p>There’s real value in this pragma API: <strong>The above example would still
compile and run correctly even when OpenMP isn’t available.</strong> The
pragma is ignored and the program just uses a single core like it
normally would. It’s a slick fallback.</p>
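
<p>Another small illustration of the fallback, as a sketch rather than anything from a real program: the hypothetical <code class="language-plaintext highlighter-rouge">sum_squares()</code> below uses a <code class="language-plaintext highlighter-rouge">reduction</code> clause. Compiled with OpenMP enabled (e.g. <code class="language-plaintext highlighter-rouge">-fopenmp</code> for GCC and Clang), the loop is split across threads. Compiled without it, the pragma is ignored, perhaps with a warning, and the same code runs serially with the same result.</p>

```c
/* Hypothetical example: sum of i*i for i in [0, n). Runs in parallel
 * under OpenMP and serially when the pragma is ignored. */
long
sum_squares(int n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (long)i * i;
    }
    return sum;
}
```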

<p>When a program really <em>does</em> require synchronization there’s
<code class="language-plaintext highlighter-rouge">omp_lock_t</code> (mutex lock) and the expected set of functions to operate
on them. This doesn’t have the nice fallback, so I don’t like to use
it. Instead, I prefer <code class="language-plaintext highlighter-rouge">#pragma omp critical</code>. It nicely maintains the
OpenMP-unsupported fallback.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="cp">#pragma omp critical
</span>    <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="p">}</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This would append the output to some output file in an arbitrary
order. The critical section <a href="/blog/2016/08/03/">prevents interleaving of
outputs</a>.</p>

<p>There are a couple of problems with this example:</p>

<ol>
  <li>
    <p>Only one thread can write at a time. If the write takes too long,
other threads will queue up behind the critical section and wait.</p>
  </li>
  <li>
    <p>The output frames will be out of order, which is probably
inconvenient for consumers. If the output is seekable this can be
solved with <code class="language-plaintext highlighter-rouge">lseek()</code>, but that only makes the critical section
even more important.</p>
  </li>
</ol>

<p>There’s an easy fix for both, one that also eliminates the need for a critical
section: POSIX <code class="language-plaintext highlighter-rouge">pwrite()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">pwrite</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">off_t</span> <span class="n">offset</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s like <code class="language-plaintext highlighter-rouge">write()</code> but has an offset parameter. Unlike <code class="language-plaintext highlighter-rouge">lseek()</code>
followed by a <code class="language-plaintext highlighter-rouge">write()</code>, multiple threads and processes can, in
parallel, safely write to the same file descriptor at different file
offsets. The catch is that <strong>the output must be a file, not a pipe</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span> <span class="o">*</span> <span class="n">i</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s no critical section, the writes can interleave, and the output
is in order.</p>
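
<p>One thing the loop above glosses over is the return value of <code class="language-plaintext highlighter-rouge">pwrite()</code>: like <code class="language-plaintext highlighter-rouge">write()</code>, it can fail or write fewer bytes than requested. Here is a hedged sketch of a helper that retries short writes, plus a tiny self-check that writes two records out of order into a temporary file. The names <code class="language-plaintext highlighter-rouge">write_record()</code> and <code class="language-plaintext highlighter-rouge">records_land_in_order()</code> are hypothetical. The sketch requests POSIX.1-2008 because <code class="language-plaintext highlighter-rouge">pwrite()</code> was still part of the XSI option in POSIX.1-2001, where <code class="language-plaintext highlighter-rouge">_XOPEN_SOURCE</code> would be needed instead.</p>

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* pwrite()/pread() are base POSIX here */
#endif
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes of record i at offset
 * len * i, retrying short writes. Returns 1 on success, 0 on failure. */
int
write_record(int fd, const void *rec, size_t len, int i)
{
    const char *p = rec;
    off_t off = (off_t)len * i;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0) {
            return 0;  /* error, or no progress: give up */
        }
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 1;
}

/* Self-check: write record 1 before record 0, then read the file back
 * and confirm the records ended up in index order. */
int
records_land_in_order(void)
{
    char buf[4] = {0};
    FILE *tmp = tmpfile();
    if (!tmp) {
        return 0;
    }
    int fd = fileno(tmp);
    int ok = write_record(fd, "BB", 2, 1) &&
             write_record(fd, "AA", 2, 0) &&
             pread(fd, buf, sizeof(buf), 0) == (ssize_t)sizeof(buf) &&
             memcmp(buf, "AABB", 4) == 0;
    fclose(tmp);
    return ok;
}
```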

<p>If you’re concerned about standard output not being seekable (it often
isn’t), keep in mind that it will work just fine when invoked like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./compute_frames &gt; frames.ppm
</code></pre></div></div>

<h3 id="windows-portability">Windows Portability</h3>

<p>I talked about OpenMP being really portable, then used POSIX
functions. Fortunately the Win32 <code class="language-plaintext highlighter-rouge">WriteFile()</code> function has an
“overlapped” parameter that works just like <code class="language-plaintext highlighter-rouge">pwrite()</code>. Typically
rather than call either directly, I’d wrap the write like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">out</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">written</span><span class="p">;</span>
    <span class="n">OVERLAPPED</span> <span class="n">offset</span> <span class="o">=</span> <span class="p">{.</span><span class="n">Offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">};</span>
    <span class="k">return</span> <span class="n">WriteFile</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">offset</span><span class="p">);</span>
<span class="p">}</span>

<span class="cp">#else </span><span class="cm">/* POSIX */</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="n">offset</span><span class="p">)</span> <span class="o">==</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#endif
</span></code></pre></div></div>

<p>Except for switching to <code class="language-plaintext highlighter-rouge">write_frame()</code>, the OpenMP part remains
untouched.</p>

<h3 id="real-world-example">Real World Example</h3>

<p>Here’s an example in a real program:</p>

<p><a href="https://gist.github.com/skeeto/d7e17bb2aa40907a3405c3933cb1f936" class="download">julia.c</a></p>

<p>Notice that, because of <code class="language-plaintext highlighter-rouge">pwrite()</code>, the output can’t be piped directly into
<code class="language-plaintext highlighter-rouge">ppmtoy4m</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./julia &gt; output.ppm
$ ppmtoy4m -F 60:1 &lt; output.ppm &gt; output.y4m
$ x264 -o output.mp4 output.y4m
</code></pre></div></div>

<p><a href="/video/?v=julia-256" class="download">output.mp4</a></p>

<video src="https://skeeto.s3.amazonaws.com/share/julia-256.mp4" controls="" loop="" crossorigin="anonymous">
</video>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Portable Structure Access with Member Offset Constants</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/22/"/>
    <id>urn:uuid:81ff4064-17f1-3a9b-a5ec-61acb03385b9</id>
    <updated>2016-11-22T12:55:29Z</updated>
    <category term="c"/><category term="posix"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you need to write a C program to access a long sequence of
structures from a binary file in a specified format. These structures
have different lengths and contents, but they share a common header
identifying each structure’s type and size. Here’s the definition of that header
(no padding):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">event</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">time</span><span class="p">;</span>   <span class="c1">// unix epoch (microseconds)</span>
    <span class="kt">uint32_t</span> <span class="n">size</span><span class="p">;</span>   <span class="c1">// including this header (bytes)</span>
    <span class="kt">uint16_t</span> <span class="n">source</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">type</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">size</code> member is used to find the offset of the next structure in
the file without knowing anything else about the current structure.
Just add <code class="language-plaintext highlighter-rouge">size</code> to the offset of the current structure.</p>
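
<p>A sketch of that hop from header to header, under two assumptions that go beyond the text so far: the integers are stored little-endian, and the buffer holds the whole file. With no padding, <code class="language-plaintext highlighter-rouge">size</code> sits at byte offset 8, right after the 8-byte <code class="language-plaintext highlighter-rouge">time</code>. The names <code class="language-plaintext highlighter-rouge">load_u32le()</code> and <code class="language-plaintext highlighter-rouge">count_events()</code>, and the two constants, are hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define EVENT_HEADER_SIZE 16  /* 8 + 4 + 2 + 2 bytes, no padding */
#define EVENT_SIZE_OFFSET  8  /* size follows the 8-byte time */

/* Load a little-endian 32-bit integer one byte at a time
 * (assumption: the file format is little-endian). */
uint32_t
load_u32le(const unsigned char *p)
{
    return (uint32_t)p[0]       | (uint32_t)p[1] <<  8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Count the events in buf by hopping from header to header. */
size_t
count_events(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    size_t off = 0;
    while (len - off >= EVENT_HEADER_SIZE) {
        uint32_t size = load_u32le(buf + off + EVENT_SIZE_OFFSET);
        if (size < EVENT_HEADER_SIZE || size > len - off) {
            break;  /* corrupt or truncated event: stop walking */
        }
        off += size;  /* next header = current offset + size */
        count++;
    }
    return count;
}
```

<p>The bounds check matters because <code class="language-plaintext highlighter-rouge">size</code> comes from the file: a value smaller than the header would loop forever or walk backwards, and a value larger than the remaining data would run off the end of the buffer.</p>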

<p>The <code class="language-plaintext highlighter-rouge">type</code> member indicates what kind of data follows this structure.
The program is likely to <code class="language-plaintext highlighter-rouge">switch</code> on this value.</p>

<p>The actual structures might look something like this (in the spirit of
<a href="http://openxcom.org/">X-COM</a>). Note how each structure begins with <code class="language-plaintext highlighter-rouge">struct event</code> as
header. All angles are expressed using <a href="https://en.wikipedia.org/wiki/Binary_scaling">binary scaling</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EVENT_TYPE_OBSERVER            10
#define EVENT_TYPE_UFO_SIGHTING        20
#define EVENT_TYPE_SUSPICIOUS_SIGNAL   30
</span>
<span class="k">struct</span> <span class="n">observer</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">latitude</span><span class="p">;</span>   <span class="c1">// binary scaled angle</span>
    <span class="kt">uint32_t</span> <span class="n">longitude</span><span class="p">;</span>  <span class="c1">//</span>
    <span class="kt">uint16_t</span> <span class="n">source_id</span><span class="p">;</span>  <span class="c1">// later used for event source</span>
    <span class="kt">uint16_t</span> <span class="n">name_size</span><span class="p">;</span>  <span class="c1">// not including null terminator</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[];</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">ufo_sighting</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">azimuth</span><span class="p">;</span>    <span class="c1">// binary scaled angle</span>
    <span class="kt">uint32_t</span> <span class="n">elevation</span><span class="p">;</span>  <span class="c1">//</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">suspicious_signal</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">num_channels</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">sample_rate</span><span class="p">;</span>  <span class="c1">// Hz</span>
    <span class="kt">uint32_t</span> <span class="n">num_samples</span><span class="p">;</span>  <span class="c1">// per channel</span>
    <span class="kt">int16_t</span> <span class="n">samples</span><span class="p">[];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>If all integers are stored in little endian byte order (least
significant byte first), there’s a strong temptation to lay the
structures directly over the data. After all, this will work correctly
on most computers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">event</span> <span class="n">header</span><span class="p">;</span>
<span class="n">fread</span><span class="p">(</span><span class="o">&amp;</span><span class="n">header</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">header</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">header</span><span class="p">.</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This code will not work correctly when:</p>

<ol>
  <li>
    <p>The host machine doesn’t use little endian byte order, though this
is now uncommon. Sometimes developers will attempt to detect the
byte order at compile time and use the preprocessor to byte-swap if
needed. This is a mistake.</p>
  </li>
  <li>
    <p>The host machine has different alignment requirements and so
introduces additional padding to the structure. Sometimes this can
be resolved with a non-standard <a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.4/gcc/Structure_002dPacking-Pragmas.html"><code class="language-plaintext highlighter-rouge">#pragma pack</code></a>.</p>
  </li>
</ol>

<h3 id="integer-extraction-functions">Integer extraction functions</h3>

<p>Fortunately it’s easy to write fast, correct, portable code for this
situation. First, define some functions to extract little endian
integers from an octet buffer (<code class="language-plaintext highlighter-rouge">uint8_t</code>). These will work correctly
regardless of the host’s alignment and byte order.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint16_t</span>
<span class="nf">extract_u16le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint16_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint16_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint64_t</span>
<span class="nf">extract_u64le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">7</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The big endian version is identical, but with shifts in reverse order.</p>
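<p>As a rough sketch of what "reverse order" means, the 16-bit and 64-bit
big endian extractors might look like the following. The names
<code class="language-plaintext highlighter-rouge">extract_u16be</code> and <code class="language-plaintext highlighter-rouge">extract_u64be</code> are assumed here by analogy with
the little endian functions above.</p>

```c
#include <stdint.h>

// Big endian counterparts of the little endian extractors: buf[0] is
// now the most significant byte, so it receives the largest shift.
static inline uint16_t
extract_u16be(const uint8_t *buf)
{
    return (uint16_t)((uint16_t)buf[0] << 8 |
                      (uint16_t)buf[1] << 0);
}

static inline uint64_t
extract_u64be(const uint8_t *buf)
{
    return (uint64_t)buf[0] << 56 |
           (uint64_t)buf[1] << 48 |
           (uint64_t)buf[2] << 40 |
           (uint64_t)buf[3] << 32 |
           (uint64_t)buf[4] << 24 |
           (uint64_t)buf[5] << 16 |
           (uint64_t)buf[6] <<  8 |
           (uint64_t)buf[7] <<  0;
}
```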

<p>A common concern is that these functions are a lot less efficient than
they could be. On x86 where alignment is very relaxed, each could be
implemented as a single load instruction. However, on GCC 4.x and
earlier, <code class="language-plaintext highlighter-rouge">extract_u32le</code> compiles to something like this:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32le:</span>
        <span class="nf">movzx</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">3</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">eax</span><span class="p">,</span> <span class="mi">24</span>
        <span class="nf">mov</span>     <span class="nb">edx</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">movzx</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">2</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">eax</span><span class="p">,</span> <span class="mi">16</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">movzx</span>   <span class="nb">edx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">movzx</span>   <span class="nb">edx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">edx</span><span class="p">,</span> <span class="mi">8</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>It’s tempting to fix the problem with the following definition:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Note: Don't do this.</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="o">*</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s unportable, it’s undefined behavior, and worst of all, it <a href="http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html">might
not work correctly even on x86</a>. Fortunately I have some great
news. On GCC 5.x and above, the correct definition compiles to the
desired, fast version. It’s the best of both worlds.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32le:</span>
        <span class="nf">mov</span>     <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>It’s even smart about the big endian version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32be</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Is compiled to exactly what you’d want:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32be:</span>
        <span class="nf">mov</span>     <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">bswap</span>   <span class="nb">eax</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Or, even better, if your system supports <code class="language-plaintext highlighter-rouge">movbe</code> (<code class="language-plaintext highlighter-rouge">gcc -mmovbe</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32be:</span>
        <span class="nf">movbe</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Unfortunately, Clang/LLVM is <em>not</em> this smart as of 3.9, but I’m
betting it will eventually learn how to do this, too.</p>

<h3 id="member-offset-constants">Member offset constants</h3>

<p>For this next technique, that <code class="language-plaintext highlighter-rouge">struct event</code> from above need not
actually be in the source. It’s purely documentation. Instead, let’s
define the structure in terms of <em>member offset constants</em> — a term I
just made up for this article. I’ve included the integer types as part
of the name to aid in their correct use.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14
</span></code></pre></div></div>

<p>Given a buffer, the integer extraction functions, and these offsets,
structure members can be plucked out on demand.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="kt">uint64_t</span> <span class="n">time</span>   <span class="o">=</span> <span class="n">extract_u64le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U64LE_TIME</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="n">size</span>   <span class="o">=</span> <span class="n">extract_u32le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U32LE_SIZE</span><span class="p">);</span>
<span class="kt">uint16_t</span> <span class="n">source</span> <span class="o">=</span> <span class="n">extract_u16le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U16LE_SOURCE</span><span class="p">);</span>
<span class="kt">uint16_t</span> <span class="n">type</span>   <span class="o">=</span> <span class="n">extract_u16le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U16LE_TYPE</span><span class="p">);</span>
</code></pre></div></div>

<p>On x86 with GCC 5.x, each member access will be inlined and compiled
to a one-instruction extraction. As far as performance is concerned,
it’s identical to using a structure overlay, but this time the C code
is clean and portable. A slight downside is the lack of type checking
on member access: it’s easy to mismatch the types and accidentally
read garbage.</p>
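<p>One way to soften that downside, offered only as a sketch, is to pair
each offset with its extractor exactly once inside a small accessor
function. The accessor names (<code class="language-plaintext highlighter-rouge">event_time</code>, <code class="language-plaintext highlighter-rouge">event_size</code>, and so on)
are hypothetical and not part of the original technique. The extractors
are repeated so the block stands alone.</p>

```c
#include <stdint.h>

#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14

static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)((uint16_t)buf[1] << 8 | (uint16_t)buf[0] << 0);
}

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 | (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 | (uint32_t)buf[0] <<  0;
}

static inline uint64_t
extract_u64le(const uint8_t *buf)
{
    return (uint64_t)buf[7] << 56 | (uint64_t)buf[6] << 48 |
           (uint64_t)buf[5] << 40 | (uint64_t)buf[4] << 32 |
           (uint64_t)buf[3] << 24 | (uint64_t)buf[2] << 16 |
           (uint64_t)buf[1] <<  8 | (uint64_t)buf[0] <<  0;
}

// Hypothetical accessors: each one is the only place where a given
// offset constant meets its extractor, so a width mismatch can only
// be made (and fixed) here.
static inline uint64_t
event_time(const uint8_t *event)
{
    return extract_u64le(event + EVENT_U64LE_TIME);
}

static inline uint32_t
event_size(const uint8_t *event)
{
    return extract_u32le(event + EVENT_U32LE_SIZE);
}

static inline uint16_t
event_source(const uint8_t *event)
{
    return extract_u16le(event + EVENT_U16LE_SOURCE);
}

static inline uint16_t
event_type(const uint8_t *event)
{
    return extract_u16le(event + EVENT_U16LE_TYPE);
}
```

<p>Callers then write <code class="language-plaintext highlighter-rouge">event_type(buf)</code> and never touch the offsets
directly. Since everything is <code class="language-plaintext highlighter-rouge">static inline</code>, this should cost nothing
extra with an optimizing compiler, though that is worth confirming on
the compiler at hand.</p>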

<h3 id="memory-mapping-and-iteration">Memory mapping and iteration</h3>

<p>There’s a real advantage to memory mapping the input file and using
its contents directly. On a system with a huge virtual address space,
such as x86-64 or AArch64, this memory is almost “free.” Already being
backed by a file, paging out this memory costs nothing (i.e. it’s
discarded). The input file can comfortably be much larger than
physical memory without straining the system.</p>

<p>Unportable structure overlay can take advantage of memory mapping this
way, but has the previously-described issues. An approach with member
offset constants will take advantage of it just as well, all while
remaining clean and portable.</p>

<p>I like to wrap the memory mapping code into a simple interface, which
makes porting to non-POSIX platforms, such as Windows, easier. Caveat:
This won’t work with files whose size exceeds the available contiguous
virtual memory of the system — a real problem for 32-bit systems.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/stat.h&gt;</span><span class="cp">
</span>
<span class="kt">uint8_t</span> <span class="o">*</span>
<span class="nf">map_file</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="kt">size_t</span> <span class="o">*</span><span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">O_RDONLY</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="k">struct</span> <span class="n">stat</span> <span class="n">stat</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fstat</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">stat</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="o">*</span><span class="n">length</span> <span class="o">=</span> <span class="n">stat</span><span class="p">.</span><span class="n">st_size</span><span class="p">;</span>  <span class="c1">// TODO: possible overflow</span>
    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">*</span><span class="n">length</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">unmap_file</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Next, here’s an example that iterates over all the structures in
<code class="language-plaintext highlighter-rouge">input_file</code>, in this case counting each. The <code class="language-plaintext highlighter-rouge">size</code> member is
extracted in order to stride to the next structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">length</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="n">map_file</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">length</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">data</span><span class="p">)</span>
    <span class="n">FATAL</span><span class="p">();</span>

<span class="kt">size_t</span> <span class="n">event_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">&lt;</span> <span class="n">data</span> <span class="o">+</span> <span class="n">length</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">event_count</span><span class="o">++</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">extract_u32le</span><span class="p">(</span><span class="n">p</span> <span class="o">+</span> <span class="n">EVENT_U32LE_SIZE</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">&gt;</span> <span class="n">length</span> <span class="o">-</span> <span class="p">(</span><span class="n">p</span> <span class="o">-</span> <span class="n">data</span><span class="p">))</span>
        <span class="n">FATAL</span><span class="p">();</span>  <span class="c1">// invalid size</span>
    <span class="n">p</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"I see %zu events.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">event_count</span><span class="p">);</span>

<span class="n">unmap_file</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span>
</code></pre></div></div>

<p>This is the basic structure for navigating this kind of data. A deeper
dive would involve a <code class="language-plaintext highlighter-rouge">switch</code> inside the loop, extracting the relevant
members for whatever use is needed.</p>
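<p>A minimal sketch of such a loop follows. It makes several assumptions:
that the header is 16 bytes (as the offset constants imply), that
<code class="language-plaintext highlighter-rouge">size</code> counts the whole structure including the header, and that the
<code class="language-plaintext highlighter-rouge">ufo_sighting</code> offsets follow directly from the structure listing. The
<code class="language-plaintext highlighter-rouge">UFO_SIGHTING_*</code> constants, <code class="language-plaintext highlighter-rouge">struct summary</code>, and <code class="language-plaintext highlighter-rouge">summarize_events</code>
are made up for illustration. It also rejects a truncated header and a
<code class="language-plaintext highlighter-rouge">size</code> smaller than the header, which would otherwise read past the
end of the mapping or never advance.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define EVENT_U32LE_SIZE              8
#define EVENT_U16LE_TYPE              14
#define EVENT_HEADER_SIZE             16  // assumed: 8 + 4 + 2 + 2 bytes

#define EVENT_TYPE_OBSERVER           10
#define EVENT_TYPE_UFO_SIGHTING       20

// Assumed offsets, derived from the struct ufo_sighting listing.
#define UFO_SIGHTING_U32LE_AZIMUTH    16
#define UFO_SIGHTING_U32LE_ELEVATION  20
#define UFO_SIGHTING_SIZE             24

static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)((uint16_t)buf[1] << 8 | (uint16_t)buf[0] << 0);
}

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 | (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 | (uint32_t)buf[0] <<  0;
}

// Hypothetical result type for this example.
struct summary {
    size_t observers;
    size_t sightings;
    size_t other;
    uint32_t last_azimuth;  // raw binary scaled angle of the last sighting
};

// Walks every event in data[0..length). Returns 0 on success, or -1 if
// the data is truncated or contains an invalid size.
static int
summarize_events(const uint8_t *data, size_t length, struct summary *out)
{
    struct summary s = {0, 0, 0, 0};
    size_t off = 0;
    while (off < length) {
        size_t remaining = length - off;
        if (remaining < EVENT_HEADER_SIZE)
            return -1;  // truncated header
        const uint8_t *p = data + off;
        uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
        if (size < EVENT_HEADER_SIZE || size > remaining)
            return -1;  // invalid size (also rules out a zero stride)

        switch (extract_u16le(p + EVENT_U16LE_TYPE)) {
        case EVENT_TYPE_OBSERVER:
            s.observers++;
            break;
        case EVENT_TYPE_UFO_SIGHTING:
            if (size < UFO_SIGHTING_SIZE)
                return -1;  // too short to hold its members
            s.last_azimuth = extract_u32le(p + UFO_SIGHTING_U32LE_AZIMUTH);
            s.sightings++;
            break;
        default:
            s.other++;  // unknown types are skipped via size
            break;
        }
        off += size;
    }
    *out = s;
    return 0;
}
```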

<p>Fast, correct, simple. Pick three.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Appending to a File from Multiple Processes</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/08/03/"/>
    <id>urn:uuid:93473b6d-3be3-3d0c-d7d5-6ad485c1e9a0</id>
    <updated>2016-08-03T16:17:44Z</updated>
    <category term="c"/><category term="linux"/><category term="posix"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you have multiple processes appending output to the same file
without explicit synchronization. These processes might be working in
parallel on different parts of the same problem, or these might be
threads blocked individually reading different external inputs. There
are two concerns that come into play:</p>

<p>1) <strong>The append must be atomic</strong> such that it doesn’t clobber previous
    appends by other threads and processes. For example, suppose a
    write requires two separate operations: first moving the file
    pointer to the end of the file, then performing the write. There
    would be a race condition should another process or thread
    intervene in between with its own write.</p>

<p>2) <strong>The output will be interleaved.</strong> The primary solution is to
   design the data format as atomic records, where the ordering of
   records is unimportant — like rows in a relational database. This
   could be as simple as a text file with each line as a record. The
   concern is then ensuring records are written atomically.</p>

<p>This article discusses processes, but the same applies to threads when
directly dealing with file descriptors.</p>

<h3 id="appending">Appending</h3>

<p>The first concern is solved by the operating system, with one caveat.
On POSIX systems, opening a file with the <code class="language-plaintext highlighter-rouge">O_APPEND</code> flag will
guarantee that <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html">writes always safely append</a>.</p>

<blockquote>
  <p>If the <code class="language-plaintext highlighter-rouge">O_APPEND</code> flag of the file status flags is set, the file
offset shall be set to the end of the file prior to each write and
no intervening file modification operation shall occur between
changing the file offset and the write operation.</p>
</blockquote>

<p>However, this says nothing about interleaving. <strong>Two processes
successfully appending to the same file will result in all their bytes
in the file in order, but not necessarily contiguously.</strong></p>

<p>The caveat is that not all filesystems are POSIX-compatible. Two
famous examples are NFS and the Hadoop Distributed File System (HDFS).
On these networked filesystems, appends are simulated and subject to
race conditions.</p>

<p>On POSIX systems, fopen(3) with the <code class="language-plaintext highlighter-rouge">a</code> flag <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/fopen.html">will use
<code class="language-plaintext highlighter-rouge">O_APPEND</code></a>, so you don’t necessarily need to use open(2). On
Linux this can be verified for any language’s standard library with
strace.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">fopen</span><span class="p">(</span><span class="s">"/dev/null"</span><span class="p">,</span> <span class="s">"a"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And the result of the trace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace -e open ./a.out
open("/dev/null", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
</code></pre></div></div>
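<p>For programs that do call open(2) directly, a small sketch of the same
idea might look like this. The helper names <code class="language-plaintext highlighter-rouge">open_for_append</code> and
<code class="language-plaintext highlighter-rouge">append_record</code> are hypothetical. The flags mirror those in the trace
above.</p>

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

// Opens path for appending, creating it if necessary. With O_APPEND
// set, each write(2) lands at the current end of the file.
static int
open_for_append(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
}

// Appends one null-terminated record with a single write(2).
// Returns 0 on success, -1 on error or a short write.
static int
append_record(int fd, const char *record)
{
    size_t len = strlen(record);
    ssize_t r = write(fd, record, len);
    return r == (ssize_t)len ? 0 : -1;
}
```

<p>Two descriptors opened this way on the same local file can take turns
calling <code class="language-plaintext highlighter-rouge">append_record</code> without overwriting each other, even though each
has its own file offset.</p>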

<p>For Win32, the equivalent is the <code class="language-plaintext highlighter-rouge">FILE_APPEND_DATA</code> access right, and
similarly <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/gg258116(v=vs.85).aspx">only applies to “local files.”</a></p>

<h3 id="interleaving-and-pipes">Interleaving and Pipes</h3>

<p>The interleaving problem has two layers, and gets more complicated the
more correct you want to be. Let’s start with pipes.</p>

<p>On POSIX, a pipe is unseekable and doesn’t have a file position, so
appends are the only kind of write possible. When writing to a pipe
(or FIFO), writes of at most the system-defined <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes are
guaranteed to be atomic and non-interleaving.</p>

<blockquote>
  <p>Write requests of <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes or less shall not be interleaved
with data from other processes doing writes on the same pipe. Writes
of greater than <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes may have data interleaved, on
arbitrary boundaries, with writes by other processes, […]</p>
</blockquote>

<p>The minimum value for <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> for POSIX systems is 512 bytes. On
Linux it’s 4kB, and on other systems <a href="http://ar.to/notes/posix">it’s as high as 32kB</a>.
As long as each record is no larger than 512 bytes, a simple write(2) will
do. None of this depends on a filesystem since no files are involved.</p>
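<p>As an illustration, a record writer for pipes could enforce that limit
itself. This is only a sketch: <code class="language-plaintext highlighter-rouge">write_record</code> and <code class="language-plaintext highlighter-rouge">PORTABLE_PIPE_BUF</code>
are hypothetical names, and the constant is simply the POSIX minimum
quoted above rather than the actual limit of the host system.</p>

```c
#include <stddef.h>
#include <unistd.h>

// The smallest PIPE_BUF that POSIX permits. A record that fits within
// this is written atomically to a pipe on any conforming system.
#define PORTABLE_PIPE_BUF 512

// Writes one complete record to a pipe with a single write(2).
// Returns 0 on success, or -1 if the record is empty, is too large to
// be portably atomic, or the write fails.
static int
write_record(int fd, const void *record, size_t len)
{
    if (len == 0 || len > PORTABLE_PIPE_BUF)
        return -1;
    ssize_t r = write(fd, record, len);
    return r == (ssize_t)len ? 0 : -1;
}
```

<p>The important part is that the whole record goes out in one call.
Splitting it across two calls to write(2) would give other writers a
chance to slip their own data in between.</p>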

<p>If <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes isn’t enough, the POSIX writev(2) can be
used to <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/writev.html">atomically write up to <code class="language-plaintext highlighter-rouge">IOV_MAX</code> buffers</a> of
<code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes. The minimum value for <code class="language-plaintext highlighter-rouge">IOV_MAX</code> is 16, but is
typically 1024. This means the maximum safe atomic write size for
pipes — and therefore the largest record size — for a perfectly
portable program is 8kB (16✕512). On Linux it’s 4MB.</p>

<p>That’s all at the system call level. There’s another layer to contend
with: buffered I/O in your language’s standard library. Your program
may pass data in appropriately-sized pieces for atomic writes to the
I/O library, but it may be undoing your hard work, concatenating all
these writes into a buffer, splitting apart your records. For this
part of the article, I’ll focus on single-threaded C programs.</p>

<p>Suppose you’re writing a simple space-separated format with one line
per record.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">baz</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">condition</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%d %d %f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">,</span> <span class="n">baz</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Whether or not this works depends on how <code class="language-plaintext highlighter-rouge">stdout</code> is buffered. C
standard library streams (<code class="language-plaintext highlighter-rouge">FILE *</code>) have three buffering modes:
unbuffered, line buffered, and fully buffered. Buffering is configured
through setbuf(3) and setvbuf(3), and the initial buffering state of a
stream depends on various factors. For buffered streams, the default
buffer is at least <code class="language-plaintext highlighter-rouge">BUFSIZ</code> bytes, itself at least 256 (C99
§7.19.2¶7). Note: threads share this buffer.</p>

<p>Since each record in the above program easily fits inside 256 bytes,
if stdout is a line buffered pipe then this program will interleave
correctly on any POSIX system without further changes.</p>
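<p>A stdout connected to a pipe typically starts out fully buffered, so
line buffering usually has to be requested. The following is a sketch
under that assumption; the helper names <code class="language-plaintext highlighter-rouge">use_line_buffering</code> and
<code class="language-plaintext highlighter-rouge">emit_line</code> are made up for illustration:</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: switch a stream to line buffering. Call it
 * before any other operation on the stream. Returns 0 on success. */
int use_line_buffering(FILE *stream)
{
    assert(stream != NULL);
    return setvbuf(stream, NULL, _IOLBF, BUFSIZ);
}

/* Emit one space-separated record. In line buffered mode the trailing
 * newline causes the buffered line to be written out. */
int emit_line(FILE *stream, int foo, int bar, float baz)
{
    return fprintf(stream, "%d %d %f\n", foo, bar, baz) < 0 ? -1 : 0;
}
```

<p>A program would call <code class="language-plaintext highlighter-rouge">use_line_buffering(stdout)</code> once at startup and
then <code class="language-plaintext highlighter-rouge">emit_line(stdout, foo, bar, baz)</code> inside its loop.</p>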

<p>If instead your output is comma-separated values (CSV) and <a href="https://tools.ietf.org/html/rfc4180">your
records may contain new line characters</a>, there are two
approaches. In each, the record must still be no larger than
<code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes.</p>

<ul>
  <li>
    <p>Unbuffered pipe: construct the record in a buffer (e.g. with sprintf(3))
and output the entire buffer in a single fwrite(3). While I believe
this will always work in practice, it’s not guaranteed by the C
specification, which defines fwrite(3) as a series of fputc(3) calls
(C99 §7.19.8.2¶2).</p>
  </li>
  <li>
    <p>Fully buffered pipe: set a sufficiently large stream buffer and
follow each record with a fflush(3). Unlike fwrite(3) on an
unbuffered stream, the specification says the buffer will be
“transmitted to the host environment as a block” (C99 §7.19.3¶3),
so this should be perfectly correct on any POSIX system.</p>
  </li>
</ul>
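<p>The second approach might look like the sketch below. The helper
names, the two-field record, and the 512-byte <code class="language-plaintext highlighter-rouge">RECORD_MAX</code> are
assumptions made for illustration, and embedded quotes are not escaped:</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum { RECORD_MAX = 512 };  /* POSIX minimum PIPE_BUF */

/* Hypothetical helper: fully buffer the stream with RECORD_MAX bytes
 * of caller-supplied storage, which must outlive the stream. */
int use_full_buffering(FILE *stream, char *storage)
{
    assert(stream != NULL && storage != NULL);
    return setvbuf(stream, storage, _IOFBF, RECORD_MAX);
}

/* Write one two-field CSV record, whose fields may contain newlines,
 * then flush so the whole record leaves the buffer as one block.
 * Note: this sketch does not escape quotes inside the fields. */
int emit_csv_record(FILE *stream, const char *a, const char *b)
{
    /* 6 = four quotes, one comma, one newline */
    if (strlen(a) + strlen(b) + 6 > RECORD_MAX)
        return -1;  /* would not fit in the buffer, so refuse */
    if (fprintf(stream, "\"%s\",\"%s\"\n", a, b) < 0)
        return -1;
    return fflush(stream) == 0 ? 0 : -1;
}
```

<p>Rejecting oversized records up front matters here: a record larger
than the stream buffer would be flushed early by the library, before
the explicit fflush(3), and so could be split.</p>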

<p>If your situation is more complicated than this, you’ll probably have
to bypass your standard library buffered I/O and call write(2) or
writev(2) yourself.</p>

<h4 id="practical-application">Practical Application</h4>

<p>If interleaving writes to a pipe stdout sounds contrived, here’s the
real life scenario: GNU xargs with its <code class="language-plaintext highlighter-rouge">--max-procs</code> (<code class="language-plaintext highlighter-rouge">-P</code>) option to
process inputs in parallel.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xargs -n1 -P$(nproc) myprogram &lt; inputs.txt | cat &gt; outputs.csv
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">| cat</code> ensures the output of each <code class="language-plaintext highlighter-rouge">myprogram</code> process is
connected to the same pipe rather than to the same file.</p>

<p>A non-portable alternative to <code class="language-plaintext highlighter-rouge">| cat</code>, especially if you’re
dispatching processes and threads yourself, is the splice(2) system
call on Linux. It efficiently moves the output from the pipe to the
output file without an intermediate copy to userspace. GNU Coreutils’
cat doesn’t use this.</p>

<h4 id="win32-pipes">Win32 Pipes</h4>

<p>On Win32, <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365152(v=vs.85).aspx">anonymous pipes</a> have no semantics regarding
interleaving. <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365150(v=vs.85).aspx">Named pipes</a> have per-client buffers that
prevent interleaving. However, the pipe buffer size is unspecified,
and requesting a particular size is only advisory, so it comes down to
trial and error, though the unstated limits should be comparatively
generous.</p>

<h3 id="interleaving-and-files">Interleaving and Files</h3>

<p>Suppose instead of a pipe we have an <code class="language-plaintext highlighter-rouge">O_APPEND</code> file on POSIX. Common
wisdom states that the same <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> atomic write rule applies.
While this often works, especially on Linux, this is not correct. The
POSIX specification doesn’t require it and <a href="http://www.notthewizard.com/2014/06/17/are-files-appends-really-atomic/">there are systems where it
doesn’t work</a>.</p>

<p>If you know the particular limits of your operating system <em>and</em>
filesystem, and you don’t care much about portability, then maybe you
can get away with interleaving appends. For full portability, pipes
are required.</p>

<p>On Win32, writes on local files up to the underlying drive’s sector
size (typically 512 bytes to 4kB) are atomic. Otherwise the only
options are deprecated Transactional NTFS (TxF), or manually
synchronizing your writes. All in all, it’s going to take more work to
get correct.</p>

<h3 id="conclusion">Conclusion</h3>

<p>My true use case for mucking around with clean, atomic appends is to
compute giant CSV tables in parallel, with the intention of later
loading into a SQL database (e.g. SQLite) for analysis. A more robust
and traditional approach would be to write results directly into the
database as they’re computed. But I like the platform-neutral
intermediate CSV files — good for archival and sharing — and the
simplicity of programs generating the data — concerned only with
atomic write semantics rather than calling into a particular SQL
database API.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Makefile Assignments are Turing-Complete</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/04/30/"/>
    <id>urn:uuid:49f54bce-b7da-374e-1e0e-1724b92e3e1f</id>
    <updated>2016-04-30T03:01:22Z</updated>
    <category term="lang"/><category term="compsci"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>For over a decade now, GNU Make has almost exclusively been my build
system of choice, either directly or indirectly. Unfortunately this
means I unnecessarily depend on some GNU extensions — an annoyance when
porting to the BSDs. In an effort to increase the portability of my
Makefiles, I recently read <a href="http://pubs.opengroup.org/onlinepubs/9699919799/utilities/make.html">the POSIX make specification</a>. I
learned two important things: 1) <del>POSIX make is so barren it’s not
really worth striving for</del> (<em>update</em>: I’ve <a href="/blog/2017/08/20/">changed my mind</a>),
and 2) <strong>make’s macro assignment mechanism is Turing-complete</strong>.</p>

<p>If you want to see it in action for yourself before reading further,
here’s a Makefile that implements Conway’s Game of Life (40x40) using
only macro assignments.</p>

<ul>
  <li><a href="/download/life.mak"><strong>life.mak</strong></a> (174kB) [<a href="https://github.com/skeeto/makefile-game-of-life">or generate your own</a>]</li>
</ul>

<p>Run it with any make program in an ANSI terminal. It <em>must</em> literally
be named <code class="language-plaintext highlighter-rouge">life.mak</code>. Beware: if you run it longer than a few minutes,
your computer may begin thrashing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make -f life.mak
</code></pre></div></div>

<p>It’s 100% POSIX-compatible except for the <code class="language-plaintext highlighter-rouge">sleep 0.1</code> (fractional
sleep), which is only needed for visual effect.</p>

<h3 id="a-posix-workaround">A POSIX workaround</h3>

<p>Unlike virtually every real world implementation, POSIX make doesn’t
support conditional parts. For example, you might want your Makefile’s
behavior to change depending on the value of certain variables. In GNU
Make it looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
else
    EXTRA_FLAGS = -Wbar
endif
</code></pre></div></div>

<p>Or BSD-style:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
.else
    EXTRA_FLAGS = -Wbar
.endif
</code></pre></div></div>

<p>If the goal is to write a strictly POSIX Makefile, how could I work
around the lack of conditional parts and maintain a similar interface?
The macro/variable to evaluate can be selected dynamically, allowing
for some useful tricks. First define the option’s default:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>USE_FOO = 0
</code></pre></div></div>

<p>Then define both sets of flags:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EXTRA_FLAGS_0 = -Wbar
EXTRA_FLAGS_1 = -ffoo -lfoo
</code></pre></div></div>

<p>Now dynamically select one of these macros for assignment to
<code class="language-plaintext highlighter-rouge">EXTRA_FLAGS</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EXTRA_FLAGS = $(EXTRA_FLAGS_$(USE_FOO))
</code></pre></div></div>

<p>The assignment on the command line overrides the assignment in the
Makefile, so the user gets to override <code class="language-plaintext highlighter-rouge">USE_FOO</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make              # EXTRA_FLAGS = -Wbar
$ make USE_FOO=0    # EXTRA_FLAGS = -Wbar
$ make USE_FOO=1    # EXTRA_FLAGS = -ffoo -lfoo
</code></pre></div></div>

<p>Before reading the POSIX specification, I didn’t realize that the
<em>left</em> side of an assignment can get the same treatment. If I really
want the “if defined” behavior back, I can use the macro to mangle the
left-hand side. For example,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EXTRA_FLAGS = -O0 -g3
EXTRA_FLAGS$(DEBUG) = -O3 -DNDEBUG
</code></pre></div></div>

<p>Caveat: If <code class="language-plaintext highlighter-rouge">DEBUG</code> is set to empty, it may still result in true for
<code class="language-plaintext highlighter-rouge">ifdef</code> depending on which make flavor you’re using, but will always
<em>appear</em> to be unset in this hack.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make             # EXTRA_FLAGS = -O3 -DNDEBUG
$ make DEBUG=yes   # EXTRA_FLAGS = -O0 -g3
</code></pre></div></div>

<p>This last case had me thinking: This is very similar to the (ab)use of
the x86 <code class="language-plaintext highlighter-rouge">mov</code> instruction in <a href="https://www.cl.cam.ac.uk/~sd601/papers/mov.pdf">mov is Turing-complete</a>. These
macro assignments alone should be enough to compute <em>any</em> algorithm.</p>

<h3 id="macro-operations">Macro Operations</h3>

<p>Macro names are just keys to a global associative array. This can be
used to build lookup tables. Here’s a Makefile to “compute” the square
root of integers between 0 and 10.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sqrt_0  = 0.000000
sqrt_1  = 1.000000
sqrt_2  = 1.414214
sqrt_3  = 1.732051
sqrt_4  = 2.000000
sqrt_5  = 2.236068
sqrt_6  = 2.449490
sqrt_7  = 2.645751
sqrt_8  = 2.828427
sqrt_9  = 3.000000
sqrt_10 = 3.162278
result := $(sqrt_$(n))
</code></pre></div></div>

<p>The BSD flavors of make have a <code class="language-plaintext highlighter-rouge">-V</code> option for printing variables,
which is an easy way to retrieve output. I used an “immediate”
assignment (<code class="language-plaintext highlighter-rouge">:=</code>) for <code class="language-plaintext highlighter-rouge">result</code> since some versions of make won’t
evaluate the expression before <code class="language-plaintext highlighter-rouge">-V</code> printing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make -f sqrt.mak -V result n=8
2.828427
</code></pre></div></div>

<p>Without <code class="language-plaintext highlighter-rouge">-V</code>, a default target could be used instead:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output :
        @printf "$(result)\n"
</code></pre></div></div>

<p>There are no math operators, so performing arithmetic <a href="/blog/2008/03/15/">requires some
creativity</a>. For example, integers could be represented as a
series of x characters. The number 4 is <code class="language-plaintext highlighter-rouge">xxxx</code>, the number 6 is
<code class="language-plaintext highlighter-rouge">xxxxxx</code>, etc. Addition is concatenation (note: macros can have <code class="language-plaintext highlighter-rouge">+</code> in
their names):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A      = xxx
B      = xxxx
A+B    = $(A)$(B)
</code></pre></div></div>

<p>However, since there’s no way to “slice” a value, subtraction isn’t
possible. A more realistic approach to arithmetic would require lookup
tables.</p>

<h3 id="branching">Branching</h3>

<p>Branching could be achieved through more lookup tables. For example,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>square_0  = 0
square_1  = 1
square_2  = 4
# ...
result := $($(op)_$(n))
</code></pre></div></div>

<p>And called as:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make n=5 op=sqrt    # 2.236068
$ make n=5 op=square  # 25
</code></pre></div></div>

<p>Or using the <code class="language-plaintext highlighter-rouge">DEBUG</code> trick above, use the condition to mask out the
results of the unwanted branch. This is similar to the <code class="language-plaintext highlighter-rouge">mov</code> paper.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>result           := $(op)($(n)) = $($(op)_$(n))
result$(verbose) := $($(op)_$(n))
</code></pre></div></div>

<p>And its usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make n=5 op=square             # 25
$ make n=5 op=square verbose=1   # square(5) = 25
</code></pre></div></div>

<h3 id="what-about-loops">What about loops?</h3>

<p>Looping is a tricky problem. However, one of the most common build
(<a href="http://aegis.sourceforge.net/auug97.pdf">anti</a>?)patterns is the recursive Makefile. Borrowing from the
<code class="language-plaintext highlighter-rouge">mov</code> paper, which used an unconditional jump to restart the program
from the beginning, for Makefile Turing-completeness I can invoke
the Makefile recursively, restarting the program with a new set of
inputs.</p>

<p>Remember the print target above? I can loop by invoking make again
with new inputs in this target,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output :
    @printf "$(result)\n"
    @$(MAKE) $(args)
</code></pre></div></div>

<p>Before going any further, now that loops have been added, the natural
next question is halting. In reality, the operating system will take
care of that after some millions of make processes have carelessly
been invoked by this horribly inefficient scheme. However, we can do
better. The program can clobber the <code class="language-plaintext highlighter-rouge">MAKE</code> variable when it’s ready to
halt. Let’s formalize it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loop = $(MAKE) $(args)
output :
    @printf "$(result)\n"
    @$(loop)
</code></pre></div></div>

<p>To halt, the program just needs to clear <code class="language-plaintext highlighter-rouge">loop</code>.</p>

<p>Suppose we want to count down to 0. There will be an initial count:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>count = 6
</code></pre></div></div>

<p>A decrement table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>6  = 5
5  = 4
4  = 3
3  = 2
2  = 1
1  = 0
0  = loop
</code></pre></div></div>

<p>The last line will be used to halt by clearing the name on the right
side. This is <a href="http://c2.com/cgi/wiki?ThreeStarProgrammer">three star</a> territory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$($($(count))) =
</code></pre></div></div>

<p>The result (current iteration) loop value is computed from the lookup
table.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>result = $($(count))
</code></pre></div></div>

<p>The next loop value is passed via <code class="language-plaintext highlighter-rouge">args</code>. If <code class="language-plaintext highlighter-rouge">loop</code> was cleared above,
this result will be discarded.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>args = count=$(result)
</code></pre></div></div>

<p>With all that in place, invoking the Makefile will print a countdown
from 5 to 0 and quit. This is the general structure for the Game of
Life macro program.</p>

<h3 id="game-of-life">Game of Life</h3>

<p>A universal Turing machine has <a href="http://rendell-attic.org/gol/tm.htm">been implemented in Conway’s Game of
Life</a>. With all that heavy lifting done, one of the easiest
methods today to prove a language’s Turing-completeness is to
implement Conway’s Game of Life. Ignoring the criminal inefficiency of
it, the Game of Life Turing machine could be run on the Game of Life
simulation running on make’s macro assignments.</p>

<p>In the Game of Life program — the one linked at the top of this
article — each cell is stored in a macro named xxyy, after its
position. The top-left most cell is named 0000, then going left to
right, 0100, 0200, etc. Providing input is a matter of assigning each
of these macros. I chose <code class="language-plaintext highlighter-rouge">X</code> for alive and <code class="language-plaintext highlighter-rouge">-</code> for dead, but, as
you’ll see, any two characters permitted in macro names would work as
well.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make 0000=X 0100=- 0200=- 0300=X ...
</code></pre></div></div>

<p>The next part should be no surprise: The rules of the Game of Life are
encoded as a 512-entry lookup table. The key is formed by
concatenating the cell’s value along with all its neighbors, with
itself in the center.</p>

<p><img src="/img/diagram/make-gol.png" alt="" /></p>

<p>The “beginning” of the table looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--------- = -
X-------- = -
-X------- = -
XX------- = -
--X------ = -
X-X------ = -
-XX------ = -
XXX------ = X
---X----- = -
X--X----- = -
-X-X----- = -
XX-X----- = X
# ...
</code></pre></div></div>

<p>Note: The two right-hand <code class="language-plaintext highlighter-rouge">X</code> values here are the cell coming to life
(exactly three living neighbors). Computing the <em>next</em> value (n0101)
for 0101 is done like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n0101 = $($(0000)$(0100)$(0200)$(0001)$(0101)$(0201)$(0002)$(0102)$(0202))
</code></pre></div></div>
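<p>For illustration, a table with this layout could be produced by a
short C program. This is a sketch, not the generator linked at the top
of the article: the function names are hypothetical, and the key layout
(first cell in the lowest bit, the cell itself at position 4) is inferred
from the excerpt and the <code class="language-plaintext highlighter-rouge">n0101</code> expression above.</p>

```c
#include <assert.h>
#include <stdio.h>

/* Fill key[0..8] with the 9-cell neighborhood for table index i
 * (0..511), first cell in the lowest bit, and return the center
 * cell's next state. key is NUL-terminated, so it needs 10 bytes. */
char life_rule(int i, char key[10])
{
    assert(i >= 0 && i < 512);
    int neighbors = 0;
    for (int bit = 0; bit < 9; bit++) {
        int alive = (i >> bit) & 1;
        key[bit] = alive ? 'X' : '-';
        if (bit != 4)           /* position 4 is the cell itself */
            neighbors += alive;
    }
    key[9] = 0;
    int self = (i >> 4) & 1;
    /* Birth on exactly three neighbors, survival on two or three. */
    return (neighbors == 3 || (self && neighbors == 2)) ? 'X' : '-';
}

/* Print the whole table as make macro assignments, one per line. */
void print_life_table(FILE *out)
{
    for (int i = 0; i < 512; i++) {
        char key[10];
        char next = life_rule(i, key);
        fprintf(out, "%s = %c\n", key, next);
    }
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">print_life_table(stdout)</code> emits all 512 assignments in the
same order as the excerpt.</p>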

<p>Given these results, constructing the input to the next loop is
simple:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>args = 0000=$(n0000) 0100=$(n0100) 0200=$(n0200) ...
</code></pre></div></div>

<p>The display output, to be given to <code class="language-plaintext highlighter-rouge">printf</code>, is built similarly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output = $(n0000)$(n0100)$(n0200)$(n0300)...
</code></pre></div></div>

<p>In the real version, this is decorated with an ANSI escape code that
clears the terminal. The <code class="language-plaintext highlighter-rouge">printf</code> interprets the escape byte (<code class="language-plaintext highlighter-rouge">\033</code>)
so that it doesn’t need to appear literally in the source.</p>

<p>And that’s all there is to it: Conway’s Game of Life running in a
Makefile. <a href="https://www.youtube.com/watch?v=dMjQ3hA9mEA">Life, uh, finds a way</a>.</p>

<!-- Obviously the following image is not public domain. -->
<p><img src="/img/humor/life-finds-a-way.jpg" alt="" /></p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Mapping Multiple Memory Views in User Space</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/04/10/"/>
    <id>urn:uuid:373e602e-0d43-3e03-f02c-2d169eb14df5</id>
    <updated>2016-04-10T21:59:16Z</updated>
    <category term="c"/><category term="linux"/><category term="win32"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>Modern operating systems run processes within <em>virtual memory</em> using a
piece of hardware called a <em>memory management unit</em> (MMU). The MMU
consults a <em>page table</em> that defines how virtual memory maps onto
<em>physical memory</em>. The operating system is responsible for maintaining
this page table, mapping and unmapping virtual memory to physical
memory as needed by the processes it’s running. If a process accesses
a page that is not currently mapped, it will trigger a <em>page fault</em>
and the execution of the offending thread will be paused until the
operating system maps that page.</p>

<p>This functionality allows for a neat hack: A physical memory address
can be mapped to multiple virtual memory addresses at the same time. A
process running with such a mapping will see these regions of memory
as aliased — views of the same physical memory. A store to one of
these addresses will simultaneously appear across all of them.</p>

<p>Some useful applications of this feature include:</p>

<ul>
  <li>An extremely fast, large memory “copy” by mapping the source memory
over top of the destination memory.</li>
  <li>Trivial interoperability between code instrumented with <a href="https://www.usenix.org/legacy/event/sec09/tech/full_papers/akritidis.pdf">baggy
bounds checking</a> [PDF] and non-instrumented code. A few bits
of each pointer are reserved to tag the pointer with the size of its
memory allocation. For compactness, the stored size is rounded up to
a power of two, making it “baggy.” Instrumented code checks this tag
before making a possibly-unsafe dereference. Normally, instrumented
code would need to clear (or set) these bits before dereferencing or
before passing it to non-instrumented code. Instead, the allocation
could be mapped simultaneously at each location for every possible
tag, making the pointer valid no matter its tag bits.</li>
  <li>Two responses to <a href="/blog/2016/03/31/">my last post on hotpatching</a> suggested
that, instead of modifying the instruction directly, memory
containing the modification could be mapped over top of the code. I
would copy the code to another place in memory, safely modify it in
private, switch the page protections from write to execute (both for
W^X and for <a href="https://web.archive.org/web/20190323050330/http://stackoverflow.com/a/18905927">other hardware limitations</a>), then map it over
the target. Restoring the original behavior would be as simple as
unmapping the change.</li>
</ul>

<p>Both POSIX and Win32 allow user space applications to create these
aliased mappings. The original purpose for these APIs is for shared
memory between processes, where the same physical memory is mapped
into two different processes’ virtual memory. But the OS doesn’t stop
us from mapping the shared memory to a different address within the
same process.</p>

<h3 id="posix-memory-mapping">POSIX Memory Mapping</h3>

<p>On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions
are <code class="language-plaintext highlighter-rouge">shm_open(3)</code>, <code class="language-plaintext highlighter-rouge">ftruncate(2)</code>, and <code class="language-plaintext highlighter-rouge">mmap(2)</code>.</p>

<p>First, create a file descriptor to shared memory using <code class="language-plaintext highlighter-rouge">shm_open</code>. It
has very similar semantics to <code class="language-plaintext highlighter-rouge">open(2)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">shm_open</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">int</span> <span class="n">oflag</span><span class="p">,</span> <span class="n">mode_t</span> <span class="n">mode</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">name</code> works much like a filesystem path, but is actually a
different namespace (though on Linux it <em>is</em> a tmpfs mounted at
<code class="language-plaintext highlighter-rouge">/dev/shm</code>). Resources created here (<code class="language-plaintext highlighter-rouge">O_CREAT</code>) will persist until
explicitly deleted (<code class="language-plaintext highlighter-rouge">shm_unlink(3)</code>) or until the system reboots. It’s
an oversight in POSIX that a name is required even if we never intend
to access it by name. File descriptors can be shared with other
processes via <code class="language-plaintext highlighter-rouge">fork(2)</code> or through UNIX domain sockets, so a name
isn’t strictly required.</p>

<p>OpenBSD introduced <a href="http://man.openbsd.org/OpenBSD-current/man3/shm_mkstemp.3"><code class="language-plaintext highlighter-rouge">shm_mkstemp(3)</code></a> to solve this problem,
but it’s not widely available. On Linux, as of this writing, the
<code class="language-plaintext highlighter-rouge">O_TMPFILE</code> flag may or may not provide a fix (<a href="http://comments.gmane.org/gmane.linux.man/9815">it’s
undocumented</a>).</p>

<p>The portable workaround is to attempt to choose a unique name, open
the file with <code class="language-plaintext highlighter-rouge">O_CREAT | O_EXCL</code> (either atomically create the file or
fail), <code class="language-plaintext highlighter-rouge">shm_unlink</code> the shared memory object as soon as possible, then
cross our fingers. The shared memory object will still exist (the file
descriptor keeps it alive) but will no longer be accessible by name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">shm_open</span><span class="p">(</span><span class="s">"/example"</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span> <span class="o">|</span> <span class="n">O_EXCL</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">handle_error</span><span class="p">();</span> <span class="c1">// non-local exit</span>
<span class="n">shm_unlink</span><span class="p">(</span><span class="s">"/example"</span><span class="p">);</span>
</code></pre></div></div>

<p>The shared memory object is brand new (<code class="language-plaintext highlighter-rouge">O_EXCL</code>) and is therefore of
zero size. <code class="language-plaintext highlighter-rouge">ftruncate</code> sets it to the desired size. This does <em>not</em>
need to be a multiple of the page size. Failing to allocate memory
will result in a bus error on access.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">);</span>
<span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally <code class="language-plaintext highlighter-rouge">mmap</code> the shared memory into place just as if it were a file.
We can choose an address (aligned to a page) or let the operating
system choose one for us (NULL). If we don’t plan on making any more
mappings, we can also close the file descriptor. The shared memory
object will be freed as soon as it is completely unmapped (<code class="language-plaintext highlighter-rouge">munmap(2)</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>At this point both <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> have different addresses but point (via
the page table) to the same physical memory. Changes to one are
reflected in the other. So this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mh">0xdeafbeef</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%p %p 0x%x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="o">*</span><span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>Will print out something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x6ffffff0000 0x6fffffe0000 0xdeafbeef
</code></pre></div></div>

<p>It’s also possible to do all this with only <code class="language-plaintext highlighter-rouge">open(2)</code> and <code class="language-plaintext highlighter-rouge">mmap(2)</code> by
mapping the same file twice, but you’d need to worry about where to
put the file and where it’s going to be backed, and the operating system
would have certain obligations about syncing it to storage somewhere.
Using POSIX shared memory is simpler and faster.</p>
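
<p>For comparison, here is a minimal sketch of that file-backed
alternative. It assumes a temporary file created with <code class="language-plaintext highlighter-rouge">mkstemp(3)</code> under
<code class="language-plaintext highlighter-rouge">/tmp</code>, and the function name <code class="language-plaintext highlighter-rouge">alias_via_file</code> is made up for
illustration. It is not taken from the original program.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one temporary file twice and check that a store through the
 * first view is visible through the second. Returns 0 on success and
 * -1 on any error. The /tmp template is an assumption. */
int
alias_via_file(void)
{
    char path[] = "/tmp/alias-XXXXXX";
    int fd = mkstemp(path);          /* creates and opens a unique file */
    if (fd == -1)
        return -1;
    unlink(path);                    /* the name is no longer needed */

    size_t size = sizeof(uint32_t);
    if (ftruncate(fd, size) == -1) {
        close(fd);
        return -1;
    }

    int prot = PROT_READ | PROT_WRITE;
    uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    close(fd);
    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED) munmap(a, size);
        if (b != MAP_FAILED) munmap(b, size);
        return -1;
    }

    *a = 0xdeafbeef;                 /* write through one view */
    int ok = a != b && *b == 0xdeafbeef;
    munmap(a, size);
    munmap(b, size);
    return ok ? 0 : -1;
}
```

<p>Whether those pages ever reach a disk depends on what backs
<code class="language-plaintext highlighter-rouge">/tmp</code> on a given system, which is exactly the kind of detail the
shared memory route avoids.</p>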

<h3 id="windows-memory-mapping">Windows Memory Mapping</h3>

<p>Windows is very similar, but directly supports anonymous shared
memory. The key functions are <code class="language-plaintext highlighter-rouge">CreateFileMapping</code> and
<code class="language-plaintext highlighter-rouge">MapViewOfFileEx</code>.</p>

<p>First create a file mapping object from an invalid handle value. Like
POSIX, the word “file” is used without actually involving files.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">);</span>
<span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">CreateFileMapping</span><span class="p">(</span><span class="n">INVALID_HANDLE_VALUE</span><span class="p">,</span>
                             <span class="nb">NULL</span><span class="p">,</span>
                             <span class="n">PAGE_READWRITE</span><span class="p">,</span>
                             <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
                             <span class="nb">NULL</span><span class="p">);</span>
</code></pre></div></div>

<p>There’s no truncate step because the space is allocated at creation
time via the two-part size argument.</p>

<p>Then, just like <code class="language-plaintext highlighter-rouge">mmap</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">MapViewOfFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">MapViewOfFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>If I wanted to choose the target address myself, I’d call
<code class="language-plaintext highlighter-rouge">MapViewOfFileEx</code> instead, which takes the address as an additional
argument.</p>

<p>From here on it’s the same as above.</p>

<h3 id="generalizing-the-api">Generalizing the API</h3>

<p>Having some fun with this, I came up with a general API to allocate an
aliased mapping at an arbitrary number of addresses.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">);</span>
</code></pre></div></div>

<p>Values in the address array must either be page-aligned or NULL to
allow the operating system to choose, in which case the map address is
written to the array.</p>

<p>It returns 0 on success. It may fail if the size is too small (0) or too
large, if too many file descriptors are open, etc.</p>

<p>Pass the same pointers back to <code class="language-plaintext highlighter-rouge">memory_alias_unmap</code> to free the
mappings. When called correctly it cannot fail, so there’s no return
value.</p>

<p>The full source is here: <a href="/download/memalias.c">memalias.c</a></p>

<h4 id="posix">POSIX</h4>

<p>Starting with the simpler of the two functions, the POSIX
implementation looks like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">munmap</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The complex part is creating the mapping:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">path</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">path</span><span class="p">),</span> <span class="s">"/%s(%ld,%p)"</span><span class="p">,</span>
             <span class="n">__FUNCTION__</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">getpid</span><span class="p">(),</span> <span class="n">addrs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">shm_open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span> <span class="o">|</span> <span class="n">O_EXCL</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">shm_unlink</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
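    <span class="k">if</span> <span class="p">(</span><span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">size</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>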
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">,</span>
                        <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span>
                        <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">MAP_FAILED</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memory_alias_unmap</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">addrs</span><span class="p">);</span>
            <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The shared object name includes the process ID and pointer array
address, so there really shouldn’t be any non-malicious name
collisions, even if called from multiple threads in the same process.</p>

<p>Otherwise it just walks the array setting up the mappings.</p>

<h4 id="windows">Windows</h4>

<p>The Windows version is very similar.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">size</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">UnmapViewOfFile</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since Windows tracks the size internally, the <code class="language-plaintext highlighter-rouge">size</code> argument is unneeded and ignored.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">m</span> <span class="o">=</span> <span class="n">CreateFileMapping</span><span class="p">(</span><span class="n">INVALID_HANDLE_VALUE</span><span class="p">,</span>
                                 <span class="nb">NULL</span><span class="p">,</span>
                                 <span class="n">PAGE_READWRITE</span><span class="p">,</span>
                                 <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
                                 <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">m</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">MapViewOfFileEx</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">access</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memory_alias_unmap</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">addrs</span><span class="p">);</span>
            <span class="n">CloseHandle</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">CloseHandle</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the future I’d like to find some unique applications of these
multiple memory views.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Basic Just-In-Time Compiler</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/03/19/"/>
    <id>urn:uuid:95e0437f-61f0-3932-55b7-f828e171d9ca</id>
    <updated>2015-03-19T04:57:55Z</updated>
    <category term="c"/><category term="tutorial"/><category term="netsec"/><category term="x86"/><category term="posix"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=17747759">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/akxq8q/a_basic_justintime_compiler/">on reddit</a>.</em></p>

<p><a href="http://redd.it/2z68di">Monday’s /r/dailyprogrammer challenge</a> was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term
(<code class="language-plaintext highlighter-rouge">u(0)</code>) and a sequence of operations, <code class="language-plaintext highlighter-rouge">f</code>, to apply to the previous
term (<code class="language-plaintext highlighter-rouge">u(n + 1) = f(u(n))</code>) to compute the next term. Since it’s an
easy challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.</p>

<!--more-->

<p>For example, the relation <code class="language-plaintext highlighter-rouge">u(n + 1) = (u(n) + 2) * 3 - 5</code> would be
input as <code class="language-plaintext highlighter-rouge">+2 *3 -5</code>. If <code class="language-plaintext highlighter-rouge">u(0) = 0</code> then,</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">u(1) = 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(2) = 4</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(3) = 13</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(4) = 40</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(5) = 121</code></li>
  <li>…</li>
</ul>

<p>Rather than write an interpreter to apply the sequence of operations,
for <a href="https://gist.github.com/skeeto/3a1aa3df31896c9956dc">my submission</a> (<a href="/download/jit.c">mirror</a>) I took the opportunity to
write a simple x86-64 Just-In-Time (JIT) compiler. So rather than
stepping through the operations one by one, my program converts the
operations into native machine code and lets the hardware do the work
directly. In this article I’ll go through how it works and how I did
it.</p>

<p><strong>Update</strong>: The <a href="http://redd.it/2zna5q">follow-up challenge</a> uses Reverse Polish
notation to allow for more complicated expressions. I wrote another
JIT compiler for <a href="https://gist.github.com/anonymous/f7e4a5086a2b0acc83aa">my submission</a> (<a href="/download/rpn-jit.c">mirror</a>).</p>

<h3 id="allocating-executable-memory">Allocating Executable Memory</h3>

<p>Modern operating systems have page-granularity protections for
different parts of <a href="http://marek.vavrusa.com/c/memory/2015/02/20/memory/">process memory</a>: read, write, and execute.
Code can only be executed from memory with the execute bit set on its
page, memory can only be changed when its write bit is set, and some
pages aren’t allowed to be read. In a running process, the pages
holding program code and loaded libraries will have their write bit
cleared and execute bit set. Most of the other pages will have their
execute bit cleared and their write bit set.</p>

<p>The reason for this is twofold. First, it significantly increases the
security of the system. If untrusted input was read into executable
memory, an attacker could input machine code (<em>shellcode</em>) into the
buffer, then exploit a flaw in the program to cause control flow to
jump to and execute that code. If the attacker is only able to write
code to non-executable memory, this attack becomes a lot harder. The
attacker has to rely on code already loaded into executable pages
(<a href="http://en.wikipedia.org/wiki/Return-oriented_programming"><em>return-oriented programming</em></a>).</p>

<p>Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, <code class="language-plaintext highlighter-rouge">NULL</code>
points to a special page with read, write, and execute disabled.</p>

<h4 id="an-instruction-buffer">An Instruction Buffer</h4>

<p>Memory returned by <code class="language-plaintext highlighter-rouge">malloc()</code> and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through <code class="language-plaintext highlighter-rouge">malloc()</code>, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an <code class="language-plaintext highlighter-rouge">asmbuf</code> struct.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PAGE_SIZE 4096
</span>
<span class="k">struct</span> <span class="n">asmbuf</span> <span class="p">{</span>
    <span class="kt">uint8_t</span> <span class="n">code</span><span class="p">[</span><span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint64_t</span><span class="p">)];</span>
    <span class="kt">uint64_t</span> <span class="n">count</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use <code class="language-plaintext highlighter-rouge">sysconf(_SC_PAGESIZE)</code> to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.</p>
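
<p>As a rough sketch of that run-time query, a small helper might look
like the following. The name <code class="language-plaintext highlighter-rouge">page_size</code> is made up for illustration,
and a buffer sized this way could no longer use the fixed-size struct
above as-is.</p>

```c
#include <stddef.h>
#include <unistd.h>

/* Ask the system for its page size instead of hard-coding 4096.
 * Returns 0 if sysconf() does not report a usable value. */
size_t
page_size(void)
{
    long n = sysconf(_SC_PAGESIZE);
    return n > 0 ? (size_t)n : 0;
}
```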

<p>Instead of <code class="language-plaintext highlighter-rouge">malloc()</code>, the compiler allocates memory as an anonymous
memory map (<code class="language-plaintext highlighter-rouge">mmap()</code>). It’s anonymous because it’s not backed by a
file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Windows doesn’t have POSIX <code class="language-plaintext highlighter-rouge">mmap()</code>, so on that platform we use
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> instead. Here’s the equivalent in Win32.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">type</span> <span class="o">=</span> <span class="n">MEM_RESERVE</span> <span class="o">|</span> <span class="n">MEM_COMMIT</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">VirtualAlloc</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">PAGE_READWRITE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Anyone reading closely should notice that I haven’t actually requested
that the memory be executable, which is, like, the whole point of all
this! This was intentional. Some operating systems employ a security
feature called W^X: “write xor execute.” That is, memory is either
writable or executable, but never both at the same time. This makes
the shellcode attack I described before even harder. For <a href="http://www.tedunangst.com/flak/post/now-or-never-exec">well-behaved
JIT compilers</a> it means memory protections need to be adjusted
after code generation and before execution.</p>

<p>The POSIX <code class="language-plaintext highlighter-rouge">mprotect()</code> function is used to change memory protections.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or on Win32 (that last parameter is not allowed to be <code class="language-plaintext highlighter-rouge">NULL</code>),</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">old</span><span class="p">;</span>
    <span class="n">VirtualProtect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PAGE_EXECUTE_READ</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">old</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, instead of <code class="language-plaintext highlighter-rouge">free()</code> it gets unmapped.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And on Win32,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">VirtualFree</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MEM_RELEASE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I won’t list the definitions here, but there are two “methods” for
inserting instructions and immediate values into the buffer. This will
be raw machine code, so the caller will be acting a bit like an
assembler.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_ins</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">ins</span><span class="p">);</span>
<span class="n">asmbuf_immediate</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
</code></pre></div></div>
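
<p>For readers who want something concrete, here is one plausible way
to define them. This is a sketch under assumptions, and the real
definitions in the linked source may differ. It assumes the instruction
is packed into <code class="language-plaintext highlighter-rouge">ins</code> most significant byte first, so that a hex literal
reads in the same order as disassembler output, and it does no bounds
checking.</p>

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Append the low `size` bytes of `ins`, most significant byte first,
 * so that 0x4889f8 lands in memory as 48 89 f8. */
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

/* Append `size` bytes of an immediate operand exactly as they sit in
 * the caller's memory (little-endian on x86-64). */
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}
```

<p>Taking the immediate by pointer lets the caller pass operands of
different widths through the same function.</p>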

<h3 id="calling-conventions">Calling Conventions</h3>

<p>We’re only going to be concerned with three of x86-64’s many
registers: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rax</code>, and <code class="language-plaintext highlighter-rouge">rdx</code>. These are 64-bit (<code class="language-plaintext highlighter-rouge">r</code>) extensions
of <a href="/blog/2014/12/09/">the original 16-bit 8086 registers</a>. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here’s what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">recurrence</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>
</code></pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions">The System V AMD64 ABI calling convention</a> says that the first
integer/pointer function argument is passed in the <code class="language-plaintext highlighter-rouge">rdi</code> register.
When our JIT compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in <code class="language-plaintext highlighter-rouge">rax</code> when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy <code class="language-plaintext highlighter-rouge">rdi</code> to <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rdi</span>
</code></pre></div></div>

<p>There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in <code class="language-plaintext highlighter-rouge">asmbuf</code>. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in <code class="language-plaintext highlighter-rouge">rcx</code> rather than <code class="language-plaintext highlighter-rouge">rdi</code>. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.</p>
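
<p>To make the difference concrete, here’s a minimal sketch of choosing that
first instruction at compile time. The helper name <code class="language-plaintext highlighter-rouge">emit_load_arg()</code> is
hypothetical and writes into a plain byte array rather than an <code class="language-plaintext highlighter-rouge">asmbuf</code>.
The Windows encoding follows from the same instruction with <code class="language-plaintext highlighter-rouge">rcx</code> as the
source register.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the 3-byte instruction that copies the
 * first integer argument into rax, and returns the number of bytes. */
size_t emit_load_arg(uint8_t *out)
{
    out[0] = 0x48;  /* REX.W prefix: 64-bit operand size */
    out[1] = 0x89;  /* opcode: mov r/m64, r64 */
#ifdef _WIN32
    out[2] = 0xc8;  /* ModRM: mov rax, rcx (Microsoft x64 convention) */
#else
    out[2] = 0xf8;  /* ModRM: mov rax, rdi (System V AMD64 convention) */
#endif
    return 3;
}
```

<p>Only the final byte changes between the two conventions, so nothing
after this instruction needs to know which platform it’s on.</p>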

<p>The very last thing it will do, assuming the result is in <code class="language-plaintext highlighter-rouge">rax</code>, is
return to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">ret</span>
</code></pre></div></div>

<p>So we know the assembly, but what do we pass to <code class="language-plaintext highlighter-rouge">asmbuf_ins()</code>? This
is where we get our hands dirty.</p>

<h4 id="finding-the-code">Finding the Code</h4>

<p>If you want to do this the Right Way, you go download the x86-64
documentation, look up the instructions we’re using, and manually work
out the bytes we need and how the operands fit into it. You know, like
they used to do <a href="/blog/2016/11/17/">out of necessity</a> back in the ’60s.</p>

<p>Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file <code class="language-plaintext highlighter-rouge">peek.s</code> and hand it to <code class="language-plaintext highlighter-rouge">nasm</code>. It will produce a raw binary
with the machine code, which we’ll disassemble with <code class="language-plaintext highlighter-rouge">ndisasm</code> (the
NASM disassembler).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret
</code></pre></div></div>

<p>That’s straightforward. The first instruction is 3 bytes and the
return is 1 byte.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4889f8</span><span class="p">);</span>  <span class="c1">// mov   rax, rdi</span>
<span class="c1">// ... generate code ...</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mh">0xc3</span><span class="p">);</span>      <span class="c1">// ret</span>
</code></pre></div></div>

<p>For each operation, we’ll set it up so the operand will already be
loaded into <code class="language-plaintext highlighter-rouge">rdi</code> regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32 bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use <code class="language-plaintext highlighter-rouge">0x0123456789abcdef</code> as the
operand.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rdi</span><span class="p">,</span> <span class="mh">0x0123456789abcdef</span>
</code></pre></div></div>

<p>Disassembled with <code class="language-plaintext highlighter-rouge">ndisasm</code>, that is,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301
</code></pre></div></div>

<p>Notice the operand listed little endian immediately after the
instruction. That’s also easy!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">operand</span><span class="p">;</span>
<span class="n">scanf</span><span class="p">(</span><span class="s">"%ld"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mh">0x48bf</span><span class="p">);</span>         <span class="c1">// mov   rdi, operand</span>
<span class="n">asmbuf_immediate</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
</code></pre></div></div>
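
<p>As a cross-check on the byte order, here’s a small sketch that builds
the whole 10-byte instruction by hand. The function name
<code class="language-plaintext highlighter-rouge">emit_mov_rdi_imm64()</code> is made up for illustration, and it writes to a
plain array instead of an <code class="language-plaintext highlighter-rouge">asmbuf</code>. It emits the immediate one byte at a
time, least significant first, so it doesn’t depend on the byte order of the
machine doing the compiling.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes "mov rdi, imm64" into out (10 bytes)
 * and returns the number of bytes written. */
size_t emit_mov_rdi_imm64(uint8_t *out, uint64_t imm)
{
    size_t n = 0;
    out[n++] = 0x48;  /* REX.W prefix */
    out[n++] = 0xbf;  /* opcode: mov rdi, imm64 */
    for (int i = 0; i < 8; i++) {
        /* little endian: least significant byte comes first */
        out[n++] = (uint8_t)(imm >> (8 * i));
    }
    return n;
}
```

<p>Feeding it <code class="language-plaintext highlighter-rouge">0x0123456789abcdef</code> should reproduce the bytes in the
disassembly above: <code class="language-plaintext highlighter-rouge">48 BF EF CD AB 89 67 45 23 01</code>.</p>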

<p>Apply the same discovery process individually for each operator you
want to support, accumulating the result in <code class="language-plaintext highlighter-rouge">rax</code> for each.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">switch</span> <span class="p">(</span><span class="n">operator</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="sc">'+'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4801f8</span><span class="p">);</span>   <span class="c1">// add   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'-'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4829f8</span><span class="p">);</span>   <span class="c1">// sub   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'*'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mh">0x480fafc7</span><span class="p">);</span> <span class="c1">// imul  rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'/'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mh">0x4899</span><span class="p">);</span>     <span class="c1">// cqo   (sign-extend rax into rdx:rax)</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x48f7ff</span><span class="p">);</span>   <span class="c1">// idiv  rdi</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As an exercise, try adding support for the modulus operator (<code class="language-plaintext highlighter-rouge">%</code>), XOR
(<code class="language-plaintext highlighter-rouge">^</code>), and bit shifts (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the <a href="https://old.reddit.com/r/dailyprogrammer/comments/2z68di/_/cpgkcx7">closed form solution</a> to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.</p>

<h3 id="calling-the-generated-code">Calling the Generated Code</h3>

<p>Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a <code class="language-plaintext highlighter-rouge">void *</code> just to avoid repeating myself, since that will implicitly
convert to the correct function pointer type.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_finalize</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">recurrence</span><span class="p">)(</span><span class="kt">long</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="o">-&gt;</span><span class="n">code</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="n">x</span><span class="p">[</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">recurrence</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]);</span>
</code></pre></div></div>

<p>That’s pretty cool if you ask me! Now this was an extremely simplified
situation. There’s no branching, no intermediate values, no function
calls, and I didn’t even touch the stack (push, pop). The recurrence
relation definition in this challenge is practically an assembly
language itself, so after the initial setup it’s a 1:1 translation.</p>
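
<p>For reference, here’s the whole round trip for the identity program,
squeezed into one function. This is a sketch under a few assumptions: it uses
<code class="language-plaintext highlighter-rouge">mmap()</code> and <code class="language-plaintext highlighter-rouge">mprotect()</code> directly rather than the <code class="language-plaintext highlighter-rouge">asmbuf</code>
functions, the name <code class="language-plaintext highlighter-rouge">jit_identity()</code> is hypothetical, and it only
attempts anything on x86-64 hosts using the System V convention. Everywhere
else it reports failure.</p>

```c
#define _DEFAULT_SOURCE
#include <stddef.h>

#if defined(__x86_64__) && !defined(_WIN32) && !defined(__CYGWIN__)
#include <string.h>
#include <sys/mman.h>
#define JIT_SUPPORTED 1
#else
#define JIT_SUPPORTED 0
#endif

/* Hypothetical helper: runs the two-instruction identity program on
 * `in`. Returns 0 and stores the result in *out on success, or -1 if
 * the host is unsupported or the memory cannot be mapped. */
int jit_identity(long in, long *out)
{
#if JIT_SUPPORTED
    static const unsigned char code[] = {
        0x48, 0x89, 0xf8,  /* mov rax, rdi */
        0xc3               /* ret */
    };
    size_t size = 4096;

    /* Start with writable, non-executable memory. */
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    memcpy(mem, code, sizeof(code));

    /* "Finalize": drop write permission, add execute permission. */
    if (mprotect(mem, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, size);
        return -1;
    }

    long (*fn)(long) = (long (*)(long))mem;
    *out = fn(in);
    munmap(mem, size);
    return 0;
#else
    (void)in;
    (void)out;
    return -1;
#endif
}
```

<p>The memory is never writable and executable at the same time, which is
the same discipline the finalize step enforces.</p>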

<p>I’d like to build a JIT compiler more advanced than this in the
future. I just need to find a suitable problem that’s more complicated
than this one, warrants having a JIT compiler, but is still simple
enough that I could, on some level, justify not using LLVM.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Global State: a Tale of Two Bad C APIs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/10/12/"/>
    <id>urn:uuid:8a1c5135-e669-308b-6605-58c86be3003b</id>
    <updated>2014-10-12T22:48:00Z</updated>
    <category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>Mutable global variables are evil. You’ve almost certainly heard that
before, but it’s worth repeating. It makes programs, and libraries
especially, harder to understand, harder to optimize, more fragile,
more error prone, and less useful. If you’re using global state in a
way that’s visible to users of your API, and it’s not essential to the
domain, you’re almost certainly doing something wrong.</p>

<p>In this article I’m going to use two well-established C APIs to
demonstrate why global state is bad for APIs: BSD regular expressions
and POSIX Getopt.</p>

<h3 id="bsd-regular-expressions">BSD Regular Expressions</h3>

<p>The <a href="http://man7.org/linux/man-pages/man3/re_comp.3.html">BSD regular expression API</a> dates back to 4.3BSD, released
in 1986. It’s just a pair of functions: one compiles the regex, the
other executes it on a string.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="nf">re_comp</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">regex</span><span class="p">);</span>
<span class="kt">int</span>   <span class="nf">re_exec</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s immediately obvious that there’s hidden internal state. Where
else would the resulting compiled regex object be? Also notice there’s
no <code class="language-plaintext highlighter-rouge">re_free()</code>, or similar, for releasing resources held by the
compiled result. That’s because, due to its limited design, it doesn’t
hold any. It’s entirely in static memory, which means there’s some
upper limit on the complexity of the regex given to this API. Suppose
an implementation <em>does</em> use dynamically allocated memory. It seems
this might not matter when only one compiled regex is allowed.
However, this would create warnings in Valgrind and make it harder to
use for bug testing.</p>

<p>This API is not thread-safe. Only one thread can use it at a time.
It’s not reentrant. While using a regex, calling another function that
might use a regex means you have to recompile when it returns, just in
case. The global state being entirely hidden, there’s no way to tell
if another part of the program used it.</p>

<h4 id="fixing-bsd-regular-expressions">Fixing BSD Regular Expressions</h4>

<p>This API has been deprecated for some time now, so hopefully no one’s
using it anymore. 15 years after the BSD regex API came out, POSIX
standardized <a href="http://man7.org/linux/man-pages/man3/regcomp.3.html">a much better API</a>. It operates on an opaque
<code class="language-plaintext highlighter-rouge">regex_t</code> object, on which all state is stored. There’s no global
state.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>    <span class="nf">regcomp</span><span class="p">(</span><span class="n">regex_t</span> <span class="o">*</span><span class="n">preg</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">regex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">cflags</span><span class="p">);</span>
<span class="kt">int</span>    <span class="nf">regexec</span><span class="p">(</span><span class="k">const</span> <span class="n">regex_t</span> <span class="o">*</span><span class="n">preg</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">,</span> <span class="p">...);</span>
<span class="kt">size_t</span> <span class="nf">regerror</span><span class="p">(</span><span class="kt">int</span> <span class="n">errcode</span><span class="p">,</span> <span class="k">const</span> <span class="n">regex_t</span> <span class="o">*</span><span class="n">preg</span><span class="p">,</span> <span class="p">...);</span>
<span class="kt">void</span>   <span class="nf">regfree</span><span class="p">(</span><span class="n">regex_t</span> <span class="o">*</span><span class="n">preg</span><span class="p">);</span>
</code></pre></div></div>

<p>This is what a good API looks like.</p>
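
<p>Here’s a short sketch of what that buys you: two compiled regexes alive
at the same time, which the BSD API can’t do. The function
<code class="language-plaintext highlighter-rouge">classify()</code> and its return codes are made up for this example; the
<code class="language-plaintext highlighter-rouge">regcomp()</code>, <code class="language-plaintext highlighter-rouge">regexec()</code>, and <code class="language-plaintext highlighter-rouge">regfree()</code> calls are the
standard POSIX ones.</p>

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s is a decimal number, 2 if it is
 * a C-style identifier, 0 if neither, and -1 if a pattern fails to
 * compile. */
int classify(const char *s)
{
    regex_t number, ident;  /* each object owns its own compiled state */
    int flags = REG_EXTENDED | REG_NOSUB;

    if (regcomp(&number, "^[0-9]+$", flags) != 0)
        return -1;
    if (regcomp(&ident, "^[A-Za-z_][A-Za-z0-9_]*$", flags) != 0) {
        regfree(&number);
        return -1;
    }

    int kind = 0;
    if (regexec(&number, s, 0, NULL, 0) == 0)
        kind = 1;
    else if (regexec(&ident, s, 0, NULL, 0) == 0)
        kind = 2;

    /* Resources are released explicitly, one object at a time. */
    regfree(&number);
    regfree(&ident);
    return kind;
}
```

<p>Since nothing is shared between the two objects, another thread or a
nested call could compile and run its own regex without disturbing
either one.</p>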

<h3 id="getopt">Getopt</h3>

<p>POSIX defines a C API called Getopt for parsing command line
arguments. It’s a single function that operates on the <code class="language-plaintext highlighter-rouge">argc</code> and
<code class="language-plaintext highlighter-rouge">argv</code> values provided to <code class="language-plaintext highlighter-rouge">main()</code>. An option string specifies which
options are valid and whether or not they require an argument. Typical
use looks like this,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">option</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">option</span> <span class="o">=</span> <span class="n">getopt</span><span class="p">(</span><span class="n">argc</span><span class="p">,</span> <span class="n">argv</span><span class="p">,</span> <span class="s">"ab:c:d"</span><span class="p">))</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">switch</span> <span class="p">(</span><span class="n">option</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">case</span> <span class="sc">'a'</span><span class="p">:</span>
            <span class="cm">/* ... */</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="cm">/* ... */</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">b</code> and <code class="language-plaintext highlighter-rouge">c</code> options require an argument, indicated by the colons.
When encountered, this argument is passed through a global variable
<code class="language-plaintext highlighter-rouge">optarg</code>. There are four external global variables in total.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">char</span> <span class="o">*</span><span class="n">optarg</span><span class="p">;</span>
<span class="k">extern</span> <span class="kt">int</span> <span class="n">optind</span><span class="p">,</span> <span class="n">opterr</span><span class="p">,</span> <span class="n">optopt</span><span class="p">;</span>
</code></pre></div></div>

<p>If an invalid option is found, <code class="language-plaintext highlighter-rouge">getopt()</code> will automatically print a
locale-specific error message and return <code class="language-plaintext highlighter-rouge">?</code>. The <code class="language-plaintext highlighter-rouge">opterr</code> variable
can be used to disable this message and the <code class="language-plaintext highlighter-rouge">optopt</code> variable is used
to get the actual invalid option character.</p>
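
<p>A minimal sketch of that arrangement, assuming the same option string as
above. The function name <code class="language-plaintext highlighter-rouge">check_options()</code> is hypothetical. It silences
the automatic message and hands the offending character back to the
caller.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Hypothetical helper: scans argv against "ab:c:d" with getopt's own
 * diagnostics turned off. Returns 0 if every option is valid, or the
 * rejected option character otherwise. */
int check_options(int argc, char **argv)
{
    int option;
    opterr = 0;  /* global: suppress the automatic error message */
    while ((option = getopt(argc, argv, "ab:c:d")) != -1) {
        if (option == '?')
            return optopt;  /* global: the character getopt rejected */
    }
    return 0;
}
```

<p>Both the input (<code class="language-plaintext highlighter-rouge">opterr</code>) and the output (<code class="language-plaintext highlighter-rouge">optopt</code>) of this
exchange travel through globals, so the function is only dependable the
first time it’s called in a process.</p>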

<p>The <code class="language-plaintext highlighter-rouge">optind</code> variable keeps track of Getopt’s progress. It slides
along <code class="language-plaintext highlighter-rouge">argv</code> as each option is processed. In a minimal, strictly
POSIX-compliant Getopt, this is all the global state required.</p>

<p>The <code class="language-plaintext highlighter-rouge">argc</code> value in <code class="language-plaintext highlighter-rouge">main()</code>, and therefore the same parameter in
<code class="language-plaintext highlighter-rouge">getopt()</code>, is completely redundant and serves no real purpose. Just
like the C strings it points to, the <code class="language-plaintext highlighter-rouge">argv</code> vector is guaranteed to be
NULL-terminated. At best it’s a premature optimization.</p>

<h4 id="threading-an-reentrancy">Threading and Reentrancy</h4>

<p>The most immediate problem is that the entire program can only parse
one argument vector at a time. It’s not thread-safe. This leaves out
the possibility of parsing argument vectors in other threads. For
example, if the program is a server that exposes a shell-like
interface to remote users, and multiple threads are used to handle
those requests, it won’t be able to take advantage of Getopt.</p>

<p>The second problem is that, even in a single-threaded application, the
program can’t pause to parse a different argument vector before
returning. It’s not reentrant. For example, suppose one of the
arguments to the program is a string containing more arguments to be
parsed for some subsystem.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#  -s    Provide a set of sub-options to pass to XXX.
$ myprogram -s "-a -b -c foo"
</code></pre></div></div>

<p>In theory, the value of <code class="language-plaintext highlighter-rouge">optind</code> could be saved and restored. However,
this isn’t portable. POSIX doesn’t explicitly declare that the entire
state is captured by <code class="language-plaintext highlighter-rouge">optind</code>, nor is it required to be.
Implementations are allowed to have internal, hidden global state.
This has implications in resetting Getopt.</p>

<h4 id="resetting-getopt">Resetting Getopt</h4>

<p>In a minimal, strict Getopt, resetting Getopt for parsing another
argument vector is just a matter of setting <code class="language-plaintext highlighter-rouge">optind</code> back to its
original value of 1. However, this idiom isn’t portable, and POSIX
provides no portable method for resetting the global parser state.</p>

<p>Real implementations of Getopt go beyond POSIX. Probably the most
popular extra feature is option grouping. Typically, multiple options
can be grouped into a single argument, so long as only the final
option requires an argument.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ myprogram -adb foo
</code></pre></div></div>

<p>After processing <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">optind</code> cannot be incremented, because it’s
still working on the first argument. This means there’s another
internal counter for stepping across the group. In glibc this is
called <code class="language-plaintext highlighter-rouge">nextchar</code>. Setting <code class="language-plaintext highlighter-rouge">optind</code> to 1 will not reset this internal
counter, nor would it be detectable by Getopt if it was already set
to 1. The glibc way to reset Getopt is to set <code class="language-plaintext highlighter-rouge">optind</code> to 0, which is
otherwise an invalid value. Some other Getopt implementations follow
this idiom, but it’s not entirely portable.</p>

<p>Not only does Getopt have nasty global state, the user has no way to
reliably control it!</p>

<h4 id="error-printing">Error Printing</h4>

<p>I mentioned that Getopt will automatically print an error message
unless disabled with <code class="language-plaintext highlighter-rouge">opterr</code>. There’s no way to get at this error
message, should you want to redirect it somewhere else. It’s more
hidden, internal state. You could write your own message, but you’d
lose out on the automatic locale support.</p>

<h4 id="fixing-getopt">Fixing Getopt</h4>

<p>The way Getopt <em>should</em> have been designed was to accept a context
argument and store all state on that context. Following other POSIX
APIs (pthreads, regex), the context itself would be an opaque object.
In typical use it would have automatic (i.e. stack) duration. The
context would either be zero initialized or a function would be
provided to initialize it. It might look something like this (in the
zero-initialized case).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">getopt</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">optstring</span><span class="p">);</span>
</code></pre></div></div>

<p>Instead of <code class="language-plaintext highlighter-rouge">optarg</code> and <code class="language-plaintext highlighter-rouge">optopt</code> global variables, these values would
be obtained by interrogating the context. The same applies for
<code class="language-plaintext highlighter-rouge">optind</code> and the diagnostic message.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">getopt_optarg</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
<span class="kt">int</span>         <span class="nf">getopt_optopt</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
<span class="kt">int</span>         <span class="nf">getopt_optind</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">getopt_opterr</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
</code></pre></div></div>

<p>Alternatively, instead of <code class="language-plaintext highlighter-rouge">getopt_optind()</code> the API could have a
function that continues processing, but returns non-option arguments
instead of options. It would return NULL when no more arguments are
left. This is the API I’d prefer, because it would allow for argument
permutation (allow options to come after non-options, per GNU Getopt)
without actually modifying the argument vector. This common extension
to Getopt could be added cleanly. The real Getopt isn’t designed well
for extension.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">getopt_next_arg</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
</code></pre></div></div>

<p>This API eliminates the global state and, as a result, solves <em>all</em> of
the problems listed above. It’s essentially the same API defined by
<a href="http://linux.die.net/man/3/popt">Popt</a> and my own embeddable <a href="https://github.com/skeeto/optparse">Optparse</a>. They’re much
better options if the limitations of POSIX-style Getopt are an issue.</p>
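
<p>To show how little is needed, here’s a rough sketch of a context-based
parser for short options. Everything in it is hypothetical: the
<code class="language-plaintext highlighter-rouge">struct optctx</code> layout and the <code class="language-plaintext highlighter-rouge">optctx_init()</code> and
<code class="language-plaintext highlighter-rouge">optctx_next()</code> names are invented for illustration and are not the Popt
or Optparse interfaces. For brevity the context is a plain struct rather than
an opaque object, and there are no diagnostic messages or argument
permutation.</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical context: every bit of parser state lives here. */
struct optctx {
    int   ind;  /* index of the next argv element to examine */
    int   sub;  /* offset within a grouped argument such as -adb */
    int   opt;  /* option character most recently examined */
    char *arg;  /* argument of the last option, or NULL */
};

void optctx_init(struct optctx *ctx)
{
    ctx->ind = 1;
    ctx->sub = 0;
    ctx->opt = 0;
    ctx->arg = NULL;
}

/* Returns the next option character, '?' for an unknown option, ':'
 * for a missing argument, or -1 when the options are exhausted. */
int optctx_next(struct optctx *ctx, char **argv, const char *optstring)
{
    char *cur = argv[ctx->ind];
    ctx->arg = NULL;
    if (cur == NULL || cur[0] != '-' || cur[1] == '\0')
        return -1;                /* end of argv, or a non-option */
    if (cur[1] == '-' && cur[2] == '\0') {
        ctx->ind++;               /* "--" ends option parsing */
        return -1;
    }

    char *p = cur + 1 + ctx->sub; /* current character in the group */
    int c = (unsigned char)*p;
    int last = p[1] == '\0';      /* is it the final one in the group? */
    const char *spec = c == ':' ? NULL : strchr(optstring, c);
    ctx->opt = c;

    if (spec != NULL && spec[1] == ':') {
        /* Option takes an argument: rest of this element, or the next. */
        ctx->ind++;
        ctx->sub = 0;
        if (!last)
            ctx->arg = p + 1;
        else if (argv[ctx->ind] != NULL)
            ctx->arg = argv[ctx->ind++];
        else
            return ':';
        return c;
    }

    /* Flag or unknown option: step to the next character or element. */
    if (last) {
        ctx->ind++;
        ctx->sub = 0;
    } else {
        ctx->sub++;
    }
    return spec != NULL ? c : '?';
}
```

<p>Resetting is just another call to <code class="language-plaintext highlighter-rouge">optctx_init()</code>, and a second
context can parse a different argument vector in the middle of the first
parse, since the group counter that glibc hides is an ordinary struct
member here.</p>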

]]>
    </content>
  </entry>
  <entry>
    <title>Pseudo-terminals</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/04/23/"/>
    <id>urn:uuid:269799fd-3a67-3a22-433a-c5224447e614</id>
    <updated>2012-04-23T00:00:00Z</updated>
    <category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>My dad recently had an interesting problem at work related to serial
ports. Since I use serial ports at work, he asked me for advice. They
have third-party software which reads and analyzes sensor data from
the serial port. It’s the only method this program has of inputting a
stream of data and they’re unable to patch it. Unfortunately, they
have another piece of software that needs to massage the data before
this final program gets it. The data needs to be intercepted coming on
the serial port somehow.</p>

<p><img src="/img/diagram/pseudo-terminals.png" alt="" /></p>

<p>The solution they were aiming for was to create a pair of virtual
serial ports. The filter software would read data in on the real
serial port, output the filtered data into a virtual serial port which
would be virtually connected to a second virtual serial port. The
analysis software would then read from this second serial port. They
couldn’t figure out how to set this up, short of buying a couple of
USB/serial port adapters and plugging them into each other.</p>

<p>It turns out this is very easy to do on Unix-like systems. POSIX
defines two functions, <code class="language-plaintext highlighter-rouge">posix_openpt(3)</code> and <code class="language-plaintext highlighter-rouge">ptsname(3)</code>. The first
one creates a pseudo-terminal — a virtual serial port — and returns
a “master” <em>file descriptor</em> used to talk to it. The second provides
the name of the pseudo-terminal device on the filesystem, usually
named something like <code class="language-plaintext highlighter-rouge">/dev/pts/5</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">posix_openpt</span><span class="p">(</span><span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_NOCTTY</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">ptsname</span><span class="p">(</span><span class="n">fd</span><span class="p">));</span>
    <span class="cm">/* ... read and write to fd ... */</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The printed device name can be opened by software that’s expecting to
access a serial port, such as
<a href="http://en.wikipedia.org/wiki/Minicom">minicom</a>, and it can be
communicated with as if by a pipe. This could be useful in testing a
program’s serial port communication logic virtually.</p>
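
<p>One detail the short program above skips: POSIX expects
<code class="language-plaintext highlighter-rouge">grantpt(3)</code> and <code class="language-plaintext highlighter-rouge">unlockpt(3)</code> to be called on the master before
anything opens the slave device. Here’s a slightly fuller sketch with those
calls and some error checking. The helper name
<code class="language-plaintext highlighter-rouge">open_pty_master()</code> is made up for this example.</p>

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: opens a pseudo-terminal master, makes the slave
 * side openable, and copies the slave device name into name (len
 * bytes). Returns the master file descriptor, or -1 on failure. */
int open_pty_master(char *name, size_t len)
{
    int fd = posix_openpt(O_RDWR | O_NOCTTY);
    if (fd == -1)
        return -1;

    /* Set up ownership/permissions of the slave, then unlock it. */
    if (grantpt(fd) == -1 || unlockpt(fd) == -1) {
        close(fd);
        return -1;
    }

    /* Copy the name out right away rather than holding the pointer. */
    const char *slave = ptsname(fd);
    int n = slave ? snprintf(name, len, "%s", slave) : -1;
    if (n < 0 || (size_t)n >= len) {
        close(fd);
        return -1;
    }
    return fd;
}
```

<p>The caller prints or passes along <code class="language-plaintext highlighter-rouge">name</code>, then reads and writes the
returned descriptor while the other program opens the named device.</p>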

<p>The reason for the unusually long name is that the function wasn’t
added to POSIX until 1998 (Unix98). They were probably afraid of name
collisions with software already using <code class="language-plaintext highlighter-rouge">openpt()</code> as a function
name. The GNU C Library provides an extension <code class="language-plaintext highlighter-rouge">getpt(3)</code>, which is
just shorthand for the above.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">getpt</span><span class="p">();</span>
</code></pre></div></div>

<p>Pseudo-terminal functionality was available much earlier, of
course. It could be done through the poorly designed <code class="language-plaintext highlighter-rouge">openpty(3)</code>,
added in BSD Unix.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">openpty</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">amaster</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">aslave</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span>
            <span class="k">const</span> <span class="k">struct</span> <span class="n">termios</span> <span class="o">*</span><span class="n">termp</span><span class="p">,</span>
            <span class="k">const</span> <span class="k">struct</span> <span class="n">winsize</span> <span class="o">*</span><span class="n">winp</span><span class="p">);</span>
</code></pre></div></div>

<p>It accepts <code class="language-plaintext highlighter-rouge">NULL</code> for the last three arguments, allowing the caller to
ignore them. What makes it so bad is the <code class="language-plaintext highlighter-rouge">name</code> string. The caller
passes a buffer with no way to state its size and hopes it’s long enough
for the file name. If it isn’t, <code class="language-plaintext highlighter-rouge">openpty()</code> writes past the end of the
buffer and trashes some memory. The name is highly unlikely to ever exceed
something like 32 bytes, but it’s still a correctness problem.</p>
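
<p>One way to sidestep the unsized buffer, sketched here under a few
assumptions, is to pass <code class="language-plaintext highlighter-rouge">NULL</code> for <code class="language-plaintext highlighter-rouge">name</code> and ask for the slave’s
name afterwards with <code class="language-plaintext highlighter-rouge">ttyname_r()</code>, which takes the buffer length and
fails rather than overflowing. The helper name <code class="language-plaintext highlighter-rouge">open_pty_pair()</code> is
made up for illustration. The <code class="language-plaintext highlighter-rouge">&lt;pty.h&gt;</code> header is where glibc
declares <code class="language-plaintext highlighter-rouge">openpty()</code>; BSD systems use <code class="language-plaintext highlighter-rouge">&lt;util.h&gt;</code> or
<code class="language-plaintext highlighter-rouge">&lt;libutil.h&gt;</code>, and older glibc versions need <code class="language-plaintext highlighter-rouge">-lutil</code> at link time.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;pty.h&gt;     /* openpty(): glibc location, an assumption */
#include &lt;stddef.h&gt;
#include &lt;unistd.h&gt;  /* ttyname_r(), close() */

/* Hypothetical helper: open a pty pair and copy the slave's device
 * name into name, never writing more than len bytes.
 * Returns 0 on success, -1 on failure. */
int open_pty_pair(int *master, int *slave, char *name, size_t len)
{
    /* NULL for name, termp and winp: openpty() writes no string. */
    if (openpty(master, slave, NULL, NULL, NULL) == -1) {
        return -1;
    }
    /* ttyname_r() returns 0 on success and an error number
     * (ERANGE for a short buffer) otherwise. */
    if (ttyname_r(*slave, name, len) != 0) {
        close(*master);
        close(*slave);
        return -1;
    }
    return 0;
}
</code></pre></div></div>

<p>A caller would declare something like <code class="language-plaintext highlighter-rouge">char name[64]</code> and pass
<code class="language-plaintext highlighter-rouge">sizeof(name)</code>; a buffer that is too small produces an error
instead of trashed memory.</p>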

<p>The newer <code class="language-plaintext highlighter-rouge">ptsname()</code> is only slightly better, however. It returns a
string that doesn’t need to be <code class="language-plaintext highlighter-rouge">free()</code>d because it lives in static
memory. That means the function is not re-entrant: in multi-threaded
programs the string could be overwritten at any instant by another call
to <code class="language-plaintext highlighter-rouge">ptsname()</code>. The problem shows up even in a single thread. Consider this case:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd0</span> <span class="o">=</span> <span class="n">getpt</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">fd1</span> <span class="o">=</span> <span class="n">getpt</span><span class="p">();</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%s %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">ptsname</span><span class="p">(</span><span class="n">fd0</span><span class="p">),</span> <span class="n">ptsname</span><span class="p">(</span><span class="n">fd1</span><span class="p">));</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">ptsname()</code> returns the same <code class="language-plaintext highlighter-rouge">char *</code> pointer each time it’s
called, merely filling the pointed-to space before returning. Both calls
finish before <code class="language-plaintext highlighter-rouge">printf()</code> runs, so rather than printing two different
device filenames, the above prints the same filename twice. The GNU C
Library provides an extension to correct this flaw, <code class="language-plaintext highlighter-rouge">ptsname_r()</code>, where
the caller provides the memory, as with <code class="language-plaintext highlighter-rouge">openpty()</code>, but also indicates its size.</p>
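
<p>As a rough sketch of how the two-name example could be repaired with
<code class="language-plaintext highlighter-rouge">ptsname_r()</code>, each name gets its own caller-owned buffer. The
function name <code class="language-plaintext highlighter-rouge">print_two_names()</code> and the 64-byte buffer size are
arbitrary choices for illustration, and the code assumes glibc, where
<code class="language-plaintext highlighter-rouge">_GNU_SOURCE</code> must be defined before any header is included.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define _GNU_SOURCE  /* exposes ptsname_r() in glibc's stdlib.h */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

/* Hypothetical helper: print the slave names of two pseudo-terminal
 * masters. Returns 0 on success, -1 if either lookup fails. */
int print_two_names(int fd0, int fd1)
{
    char name0[64];
    char name1[64];

    /* ptsname_r() returns 0 on success and an error number otherwise. */
    if (ptsname_r(fd0, name0, sizeof(name0)) != 0 ||
        ptsname_r(fd1, name1, sizeof(name1)) != 0) {
        return -1;
    }
    /* Separate buffers, so the second call cannot clobber the first. */
    printf("%s %s\n", name0, name1);
    return 0;
}
</code></pre></div></div>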

<p>To make a one-way virtual connection between our pseudo-terminals,
create two of them and copy bytes from one file descriptor to the other
(for succinctness, the only error handling is to stop when <code class="language-plaintext highlighter-rouge">read()</code>
fails or hits end of input):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">buffer</span><span class="p">;</span>
    <span class="kt">ssize_t</span> <span class="n">in</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">pt0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">buffer</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">in</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span>  <span class="cm">/* EOF or error: never pass a bad length to write() */</span>
    <span class="n">write</span><span class="p">(</span><span class="n">pt1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">buffer</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Making a two-way connection would require the use of threads or
<code class="language-plaintext highlighter-rouge">select(2)</code>, but it wouldn’t be much more complicated.</p>
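
<p>Here is a minimal sketch of the <code class="language-plaintext highlighter-rouge">select(2)</code> variant. The names
<code class="language-plaintext highlighter-rouge">pump()</code> and <code class="language-plaintext highlighter-rouge">relay()</code> and the 256-byte buffer are invented for
illustration, and <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> stand for two already-open descriptors
such as the two pseudo-terminal masters. It stops at the first EOF or
error rather than trying to recover.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;sys/select.h&gt;
#include &lt;unistd.h&gt;

/* Copy whatever is currently readable on src to dst. Returns the
 * number of bytes moved, 0 on EOF, or -1 on error. */
static ssize_t pump(int src, int dst)
{
    char buffer[256];
    ssize_t in = read(src, buffer, sizeof(buffer));
    if (in &lt;= 0) {
        return in;
    }
    ssize_t done = 0;
    while (done &lt; in) {  /* write() may accept only part of the data */
        ssize_t out = write(dst, buffer + done, (size_t)(in - done));
        if (out &lt;= 0) {
            return -1;
        }
        done += out;
    }
    return in;
}

/* Hypothetical helper: relay bytes in both directions between a and b
 * until either side reports EOF or an error. Returns 0 when a side
 * closes or a copy fails, -1 if select() itself fails. */
int relay(int a, int b)
{
    int nfds = (a &gt; b ? a : b) + 1;
    for (;;) {
        fd_set readable;
        FD_ZERO(&amp;readable);
        FD_SET(a, &amp;readable);
        FD_SET(b, &amp;readable);
        /* Block until at least one side has something to read. */
        if (select(nfds, &amp;readable, NULL, NULL, NULL) == -1) {
            return -1;
        }
        if (FD_ISSET(a, &amp;readable) &amp;&amp; pump(a, b) &lt;= 0) {
            return 0;
        }
        if (FD_ISSET(b, &amp;readable) &amp;&amp; pump(b, a) &lt;= 0) {
            return 0;
        }
    }
}
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">select()</code> reports which descriptor is ready, a single thread
can service both directions without blocking on the idle one.</p>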

<p>While all this was new and interesting to me, it didn’t help my dad at
all because they’re using Windows. These functions don’t exist there
and creating virtual serial ports is a highly non-trivial,
less-interesting process. Buying the two adapters and connecting them
together is my recommended solution for Windows.</p>
]]>
    </content>
  </entry>

</feed>
