<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged win32 at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/win32/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/win32/feed/"/>
  <updated>2026-04-07T03:24:16Z</updated>
  <id>urn:uuid:33f52ccf-2d02-4f2f-b574-8dd1468b3e7b</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  
    
  
    
  
    
  <entry>
    <title>Frankenwine: Multiple personas in a Wine process</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/01/19/"/>
    <id>urn:uuid:d2b53f8d-88a6-400b-a748-693a758741c5</id>
    <updated>2026-01-19T21:51:38Z</updated>
    <category term="c"/><category term="win32"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>I came across a recent article on <a href="https://gpfault.net/posts/drunk-exe.html">making Linux system calls from a Wine
process</a>. Windows programs running under Wine are still normal Linux
processes and may interact with the Linux kernel like any other process.
None of this was surprising, and the demonstration works just as I expect.
Still, it got the wheels spinning and I realized an <em>almost</em> practical
application: build <a href="/blog/2023/01/18/">my pkg-config implementation</a> such that on Windows
<code class="language-plaintext highlighter-rouge">pkg-config.exe</code> behaves as a native pkg-config, but when run under Wine
this same binary takes the persona of a Linux program and becomes a cross
toolchain pkg-config, bypassing Win32 and talking directly with the Linux
kernel. <a href="https://justine.lol/cosmopolitan/">Cosmopolitan Libc</a> cleverly does this out-of-the-box, but
in this article we’ll mash together a couple of existing sources with a bit
of glue.</p>

<p>The results are in <a href="https://github.com/skeeto/u-config/commit/e0008d7e">the merge-demo branch</a> of u-config, and took
hardly any work:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)
</code></pre></div></div>

<p>A platform layer, <code class="language-plaintext highlighter-rouge">main_wine.c</code>, is a merge of two existing platform
layers, one of which required unavoidable tweaks. We’ll get to those
details in a moment. First we’ll need to detect if we’re running under
Wine, and <a href="https://web.archive.org/web/20250923061634/https://stackoverflow.com/questions/7372388/determine-whether-a-program-is-running-under-wine-at-runtime/42333249#42333249">the best solution I found</a> was to locate
<code class="language-plaintext highlighter-rouge">ntdll!wine_get_version</code>. If this function exists, we’re in Wine. That
works out to a pretty one-liner because <code class="language-plaintext highlighter-rouge">ntdll.dll</code> is already loaded:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bool</span> <span class="nf">running_on_wine</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">GetProcAddress</span><span class="p">(</span><span class="n">GetModuleHandleA</span><span class="p">(</span><span class="s">"ntdll"</span><span class="p">),</span> <span class="s">"wine_get_version"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>An x86-64 Linux syscall wrapper with <a href="/blog/2024/12/20/">thorough inline assembly</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ptrdiff_t</span> <span class="nf">syscall3</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">b</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">r</span><span class="p">;</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">ptrdiff_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_write</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="p">(</span><span class="kt">ptrdiff_t</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’d normally use <code class="language-plaintext highlighter-rouge">long</code> for all these integers because Linux is <a href="https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models">LP64</a>
(<code class="language-plaintext highlighter-rouge">long</code> is pointer-sized), but Windows is LLP64 (only <code class="language-plaintext highlighter-rouge">long long</code> is 64
bits). It’s so bizarre to interface with Linux from LLP64, and this will
have consequences later. With these pieces we can see the basic shape of a
split personality program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">running_on_wine</span><span class="p">())</span> <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"hello, wine</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">12</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
        <span class="n">WriteFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"hello, windows</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>We can cram two programs into this binary and select which program at run
time depending on what we see. In typical programs locating and calling
into glibc would be a challenge, particularly with the incompatible ABIs
involved. We’re avoiding it here by interfacing directly with the kernel.</p>

<h3 id="application-to-u-config">Application to u-config</h3>

<p>Luckily u-config has completely-optional platform layers implemented with
Linux system calls. The POSIX platform layer works fine, and that’s what
distributions should generally use, but these bonus platforms are unhosted
and do not require libc. That means we can shove it into a Windows build
with relatively little trouble.</p>

<p>Before we do that, let’s think about what we’re doing. <a href="/blog/2021/08/21/">Debian has great
cross toolchain support</a>, including Mingw-w64. There are even a few
Windows libraries in the Debian package repository, <a href="https://packages.debian.org/trixie/x32/libz-mingw-w64">such as zlib</a>, and
we can build Windows programs against them. If you’re cross-building and
using pkg-config, you ought to use the cross toolchain pkg-config, which
in GNU ecosystems gets an architecture prefix like the other cross tools.
Debian cross toolchains each include a cross pkg-config, and it sometimes
<em>almost</em> works correctly! Here’s what I get on Debian 13:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz
</code></pre></div></div>

<p>Note the architecture in the <code class="language-plaintext highlighter-rouge">-I</code> and <code class="language-plaintext highlighter-rouge">-L</code> options. It really is querying
the <a href="https://peter0x44.github.io/posts/cross-compilers/">cross sysroot</a>. However, these paths are in the cross sysroot,
where the cross toolchain already searches by default, so pkg-config
should not list them. It’s suboptimal and indicates this pkg-config is
probably misconfigured. In other cases it’s far from correct:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...
</code></pre></div></div>

<p>A tool prefixed <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-</code> should not produce paths containing
<code class="language-plaintext highlighter-rouge">x86_64-linux-gnu</code> (the host architecture in this case). Our version won’t
have these issues.</p>
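
<p>To make the first complaint concrete, here’s a minimal sketch of the rule
a pkg-config applies to system directories: an <code class="language-plaintext highlighter-rouge">-I</code> flag naming the
configured system include path is dropped unless the user asks to keep it.
The names <code class="language-plaintext highlighter-rouge">SYSTEM_INCLUDE</code> and <code class="language-plaintext highlighter-rouge">keep_cflag</code> are hypothetical, and this is
not u-config’s actual code, only an illustration of the behavior.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;string.h&gt;

// Assumed system include directory, matching Debian's cross sysroot.
#define SYSTEM_INCLUDE "/usr/x86_64-w64-mingw32/include"

// Returns 1 if the flag belongs in the output, 0 if it should be dropped.
// keep_system mimics the effect of --keep-system-cflags.
static int keep_cflag(const char *flag, int keep_system)
{
    if (strncmp(flag, "-I", 2) != 0) {
        return 1;  // not an include path, always keep
    }
    if (strcmp(flag + 2, SYSTEM_INCLUDE) == 0) {
        return keep_system;  // system path: keep only on request
    }
    return 1;  // some other include path
}
</code></pre></div></div>

<p>Under this rule the sysroot’s own <code class="language-plaintext highlighter-rouge">-I</code> flag is omitted by default, and a
real implementation would treat <code class="language-plaintext highlighter-rouge">-L</code> and the system library path the same
way.</p>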

<p>The u-config platform interface is five functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// read whole files</span>
<span class="n">s8node</span> <span class="o">*</span><span class="nf">os_listing</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// list directories</span>
<span class="kt">void</span>    <span class="nf">os_write</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>          <span class="c1">// standard out/err</span>
<span class="kt">void</span>    <span class="nf">os_fail</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">);</span>                       <span class="c1">// non-zero exit</span>

<span class="kt">void</span> <span class="nf">uconfig</span><span class="p">(</span><span class="n">config</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Platforms implement the first four functions, and call <code class="language-plaintext highlighter-rouge">uconfig()</code> with
the platform’s configuration, context pointer (<code class="language-plaintext highlighter-rouge">os *</code>), command line
arguments, environment, and some memory (all in the <code class="language-plaintext highlighter-rouge">config</code> object). My
strategy is to link two platforms into the binary, and the first challenge
is they both define <code class="language-plaintext highlighter-rouge">os_write</code>, etc. I did not plan nor intend for one
binary to contain more than one platform layer. Unity builds offer a fix
without changing a single line of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include</span> <span class="cpf">"main_windows.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span>
<span class="cp">#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include</span> <span class="cpf">"main_linux_amd64.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span></code></pre></div></div>

<p>This dirty but effective trick <a href="/blog/2025/02/05/">may look familiar</a>. It also doesn’t
interfere with the other builds. Next I define the real platform functions
as a dispatch based on our run-time situation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b32</span> <span class="n">wine_detected</span><span class="p">;</span>

<span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">linux_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">win32_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
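
<p>To see the rename and the dispatch together without the real platform
sources, here’s a single-file sketch. The hypothetical <code class="language-plaintext highlighter-rouge">os_name</code> stands in
for a platform function, and each definition stands in for an <code class="language-plaintext highlighter-rouge">#include</code>
of a whole platform layer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Stand-in for #include "main_windows.c"
#define os_name win32_name
static const char *os_name(void) { return "win32"; }
#undef os_name

// Stand-in for #include "main_linux_amd64.c"
#define os_name linux_name
static const char *os_name(void) { return "linux"; }
#undef os_name

static int wine_detected;  // set once at startup

// The real platform function dispatches on the run-time situation.
const char *os_name(void)
{
    return wine_detected ? linux_name() : win32_name();
}
</code></pre></div></div>

<p>Both definitions are written under the same name, yet the compiler only
ever sees <code class="language-plaintext highlighter-rouge">win32_name</code> and <code class="language-plaintext highlighter-rouge">linux_name</code>, leaving <code class="language-plaintext highlighter-rouge">os_name</code> free for the
dispatcher.</p>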

<p>If I were serious about keeping this experiment, I’d lift <code class="language-plaintext highlighter-rouge">os</code> as I did
the functions (as <code class="language-plaintext highlighter-rouge">win32_os</code>, <code class="language-plaintext highlighter-rouge">linux_os</code>) and include <code class="language-plaintext highlighter-rouge">wine_detected</code> in
the context, eliminating this global variable. That cannot be done with
simple hacks and macros.</p>
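
<p>As a rough sketch of that refactor, the merged context could carry the
detection flag alongside both platform contexts. Every name here
(<code class="language-plaintext highlighter-rouge">merged_os</code>, the struct fields, the stub writers) is hypothetical, and the
real <code class="language-plaintext highlighter-rouge">os</code> types in u-config look different.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct { int writes; } win32_os;  // hypothetical Win32 context
typedef struct { int writes; } linux_os;  // hypothetical Linux context

typedef struct {
    int      wine_detected;  // replaces the global variable
    win32_os win32;
    linux_os lnx;
} merged_os;

// Stubs standing in for the renamed platform functions.
static void win32_write(win32_os *ctx) { ctx-&gt;writes++; }
static void linux_write(linux_os *ctx) { ctx-&gt;writes++; }

// Dispatch using only what the context carries.
static void os_write(merged_os *ctx)
{
    if (ctx-&gt;wine_detected) {
        linux_write(&amp;ctx-&gt;lnx);
    } else {
        win32_write(&amp;ctx-&gt;win32);
    }
}
</code></pre></div></div>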

<p>The next challenge is that I wrote the Linux platform layer assuming LP64,
and so it uses <code class="language-plaintext highlighter-rouge">long</code> instead of an equivalent platform-agnostic type like
<code class="language-plaintext highlighter-rouge">ptrdiff_t</code>. I never thought this would be an issue because this source
literally contains <code class="language-plaintext highlighter-rouge">asm</code> blocks and no conditional compilation, yet here
we are. Lesson learned. I wanted to try an extremely janky <code class="language-plaintext highlighter-rouge">#define</code> on
<code class="language-plaintext highlighter-rouge">long</code> to fix it, but this source file has a couple <code class="language-plaintext highlighter-rouge">long long</code> that won’t
play along. These multi-token type names of C are antithetical to its
preprocessor! So I adjusted the source manually instead.</p>
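
<p>The data model difference is easy to pin down at compile time. Here’s a
small sketch, assuming a C11 compiler and an x86-64 target; <code class="language-plaintext highlighter-rouge">word</code> is a
hypothetical name, not one used in u-config.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

// ptrdiff_t is pointer-sized under both LP64 (Linux) and LLP64 (Windows),
// while long is only 4 bytes under LLP64.
_Static_assert(sizeof(ptrdiff_t) == sizeof(void *), "not pointer-sized");
_Static_assert(sizeof(long long) == 8, "long long should be 64 bits");

typedef ptrdiff_t word;  // use where the Linux layer used long

// Why the macro hack fails: after
//     #define long ptrdiff_t
// the declaration
//     long long x;
// expands to
//     ptrdiff_t ptrdiff_t x;
// which does not compile.
</code></pre></div></div>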

<p>The Windows and Linux platform entry points are completely different, both
in name and form, and so co-exist naturally. The merged platform layer is
a new entry point that will pass control to the appropriate entry point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">entrypoint</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>  <span class="c1">// Linux</span>
<span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">mainCRTStartup</span><span class="p">();</span>    <span class="c1">// Windows</span>
</code></pre></div></div>

<p>On Linux <code class="language-plaintext highlighter-rouge">stack</code> is <a href="/blog/2025/03/06/">the initial value of the stack pointer</a>, which
<a href="https://articles.manugarg.com/aboutelfauxiliaryvectors">points to <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, <code class="language-plaintext highlighter-rouge">envp</code>, and <code class="language-plaintext highlighter-rouge">auxv</code></a>. We’ll need to construct
an artificial “stack” for the Linux platform layer to harvest. On Windows
this is <a href="/blog/2023/02/15/">the process entry point</a>, and it will find the rest on its
own as a normal Windows process. Ultimately this ended up simpler than I
expected:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">merge_entrypoint</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">wine_detected</span> <span class="o">=</span> <span class="n">running_on_wine</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">u8</span> <span class="o">*</span><span class="n">fakestack</span><span class="p">[</span><span class="n">CMDLINE_ARGV_MAX</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">c16</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="n">GetCommandLineW</span><span class="p">();</span>
        <span class="n">fakestack</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="p">)(</span><span class="n">iz</span><span class="p">)</span><span class="n">cmdline_to_argv8</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">fakestack</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="c1">// TODO: append envp to the fake stack</span>
        <span class="n">entrypoint</span><span class="p">((</span><span class="n">iz</span> <span class="o">*</span><span class="p">)</span><span class="n">fakestack</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">mainCRTStartup</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Where <a href="/blog/2022/02/18/"><code class="language-plaintext highlighter-rouge">cmdline_to_argv8</code> is my Windows argument parser</a>, already
used by u-config, and I reserve one element at the front to store <code class="language-plaintext highlighter-rouge">argc</code>.
Since this is just a proof-of-concept I didn’t bother fabricating and
pushing <code class="language-plaintext highlighter-rouge">envp</code> onto the fake stack. The Linux entry point doesn’t need
<code class="language-plaintext highlighter-rouge">auxv</code>, so it can be omitted. Once in the Linux entry point it’s essentially
a Linux process from then on, except that the x64 calling convention is
still in use internally.</p>
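
<p>For reference, the layout a Linux entry point reads is <code class="language-plaintext highlighter-rouge">argc</code>, then the
<code class="language-plaintext highlighter-rouge">argv</code> pointers and a null terminator, then the <code class="language-plaintext highlighter-rouge">envp</code> pointers and another
null. Here’s a sketch of harvesting such a stack image. The names
<code class="language-plaintext highlighter-rouge">startup</code> and <code class="language-plaintext highlighter-rouge">harvest</code> are hypothetical, and unlike the proof-of-concept
it assumes the <code class="language-plaintext highlighter-rouge">envp</code> terminator is present.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

typedef struct {
    int    argc;
    char **argv;
    char **envp;
} startup;

// stack[0] holds argc stored as a pointer-sized integer, followed by
// argv[0..argc-1], a null, the envp entries, and a final null.
static startup harvest(char **stack)
{
    startup s;
    s.argc = (int)(ptrdiff_t)stack[0];
    s.argv = stack + 1;
    s.envp = s.argv + s.argc + 1;  // skip argv and its terminator
    return s;
}
</code></pre></div></div>

<p>A fake stack for <code class="language-plaintext highlighter-rouge">pkg-config zlib</code> would then be an array holding <code class="language-plaintext highlighter-rouge">2</code>, the
two argument strings, a null, any environment strings, and a closing null.</p>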

<p>Finally, I configure the Linux platform layer for Debian’s cross sysroot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include</span><span class="cpf">"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "</span><span class="c1">/usr/x86_64-w64-mingw32/lib"</span><span class="cp">
</span></code></pre></div></div>

<p>And that’s it! We have our platform merge. Build (<a href="https://github.com/skeeto/w64devkit">w64devkit</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c
</code></pre></div></div>

<p>On Debian use <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-gcc</code> for <code class="language-plaintext highlighter-rouge">cc</code>. The <code class="language-plaintext highlighter-rouge">-e</code> linker option
selects the new, higher level entry point. After installing <a href="https://packages.debian.org/trixie/wine-binfmt">Wine
binfmt</a>, here’s how it looks on Debian:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs zlib
-lz
</code></pre></div></div>

<p>That’s the correct output, but is it using the cross sysroot? Ask it to
include the <code class="language-plaintext highlighter-rouge">-I</code> argument despite it being in the cross sysroot:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz
</code></pre></div></div>

<p>Looking good! It passes the <code class="language-plaintext highlighter-rouge">pc_path</code> test, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig
</code></pre></div></div>

<p>Running <em>this same binary</em> on Windows after installing zlib in w64devkit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz
</code></pre></div></div>

<p>Also:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig
</code></pre></div></div>

<p>My Frankenwine is a success!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Closures as Win32 window procedures</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/12/12/"/>
    <id>urn:uuid:7bf46ec6-a8b2-4ffa-857a-86c040357702</id>
    <updated>2025-12-12T19:52:10Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Back in 2017 I wrote <a href="/blog/2017/01/08/">about a technique for creating closures in C</a>
using a <a href="/blog/2015/03/19/">JIT-compiled</a> wrapper. It’s neat, though rarely necessary in
real programs, so I don’t think about it often. I applied it to <code class="language-plaintext highlighter-rouge">qsort</code>,
which <a href="/blog/2023/02/11/">sadly</a> accepts no context pointer. More practical would be
working around <a href="/blog/2023/12/17/">insufficient custom allocator interfaces</a>, to
create allocation functions at run-time bound to a particular allocation
region. I’ve learned a lot since I last wrote about this subject, and <a href="https://lowkpro.com/blog/creating-c-closures-from-lua-closures.html">a
recent article</a> had me thinking about it again, and how I could do
better than before. In this article I will enhance Win32 window procedure
callbacks with a fifth argument, allowing us to more directly pass extra
context. I’m using <a href="https://github.com/skeeto/w64devkit">w64devkit</a> on x64, but everything here should
work out-of-the-box with any x64 toolchain that speaks GNU assembly.</p>

<p>A <a href="https://learn.microsoft.com/en-us/windows/win32/api/winuser/nc-winuser-wndproc">window procedure</a> has this prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LRESULT</span> <span class="nf">Wndproc</span><span class="p">(</span>
  <span class="n">HWND</span> <span class="n">hWnd</span><span class="p">,</span>
  <span class="n">UINT</span> <span class="n">Msg</span><span class="p">,</span>
  <span class="n">WPARAM</span> <span class="n">wParam</span><span class="p">,</span>
  <span class="n">LPARAM</span> <span class="n">lParam</span>
<span class="p">);</span>
</code></pre></div></div>

<p>To create a window we must first register a class with <code class="language-plaintext highlighter-rouge">RegisterClass</code>,
which accepts a set of properties describing a window class, including a
pointer to one of these functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="p">...;</span>

    <span class="n">RegisterClassA</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">WNDCLASSA</span><span class="p">){</span>
        <span class="c1">// ...</span>
        <span class="p">.</span><span class="n">lpfnWndProc</span>   <span class="o">=</span> <span class="n">my_wndproc</span><span class="p">,</span>
        <span class="p">.</span><span class="n">lpszClassName</span> <span class="o">=</span> <span class="s">"my_class"</span><span class="p">,</span>
        <span class="c1">// ...</span>
    <span class="p">});</span>

    <span class="n">HWND</span> <span class="n">hwnd</span> <span class="o">=</span> <span class="n">CreateWindowExA</span><span class="p">(</span><span class="s">"my_class"</span><span class="p">,</span> <span class="p">...,</span> <span class="n">state</span><span class="p">);</span>
</code></pre></div></div>

<p>The thread drives a message pump with events from the operating system,
dispatching them to this procedure, which then manipulates the program
state in response:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="n">MSG</span> <span class="n">msg</span><span class="p">;</span> <span class="n">GetMessageW</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);)</span> <span class="p">{</span>
        <span class="n">TranslateMessage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">);</span>
        <span class="n">DispatchMessageW</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">);</span>  <span class="c1">// calls the window procedure</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>All four <code class="language-plaintext highlighter-rouge">WNDPROC</code> parameters are determined by Win32. There is no context
pointer argument. So how does this procedure access the program state? We
generally have two options:</p>

<ol>
  <li>Global variables. Yucky but easy. Frequently seen in tutorials.</li>
  <li>A <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code> pointer attached to the window.</li>
</ol>

<p>The second option takes some setup. Win32 passes the last <code class="language-plaintext highlighter-rouge">CreateWindowEx</code>
argument to the window procedure when the window is created, via <code class="language-plaintext highlighter-rouge">WM_CREATE</code>.
The procedure attaches the pointer to its window as <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code>. This
pointer is passed indirectly, through a <code class="language-plaintext highlighter-rouge">CREATESTRUCT</code>. So ultimately it
looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">case</span> <span class="n">WM_CREATE</span><span class="p">:</span>
        <span class="n">CREATESTRUCT</span> <span class="o">*</span><span class="n">cs</span> <span class="o">=</span> <span class="p">(</span><span class="n">CREATESTRUCT</span> <span class="o">*</span><span class="p">)</span><span class="n">lParam</span><span class="p">;</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">state</span> <span class="o">*</span><span class="p">)</span><span class="n">cs</span><span class="o">-&gt;</span><span class="n">lpCreateParams</span><span class="p">;</span>
        <span class="n">SetWindowLongPtr</span><span class="p">(</span><span class="n">hwnd</span><span class="p">,</span> <span class="n">GWLP_USERDATA</span><span class="p">,</span> <span class="p">(</span><span class="n">LONG_PTR</span><span class="p">)</span><span class="n">arg</span><span class="p">);</span>
        <span class="c1">// ...</span>
</code></pre></div></div>

<p>In future messages we can retrieve it with <code class="language-plaintext highlighter-rouge">GetWindowLongPtr</code>. Every time
I go through this I wish there was a better way. What if there was a fifth
window procedure parameter through which we could pass a context?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef LRESULT Wndproc5(HWND, UINT, WPARAM, LPARAM, void *);
</code></pre></div></div>

<p>We’ll build just this as a trampoline. The <a href="https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention">x64 calling convention</a>
passes the first four arguments in registers, and the rest are pushed on
the stack, including this new parameter. Our trampoline cannot just stuff
the extra parameter in a register, but will actually have to build a
stack frame. Slightly more complicated, but barely so.</p>

<h3 id="allocating-executable-memory">Allocating executable memory</h3>

<p>In previous articles, and in the programs where I’ve applied techniques
like this, I’ve allocated executable memory with <code class="language-plaintext highlighter-rouge">VirtualAlloc</code> (or <code class="language-plaintext highlighter-rouge">mmap</code>
elsewhere). This introduces a small challenge for solving the problem
generally: Allocations may be arbitrarily far from our code and data, out
of reach of relative addressing. If they’re further than 2G apart, we need
to encode absolute addresses, and in the simple case would just assume
they’re always too far apart.</p>

<p>These days I’ve more experience with executable formats, and allocation,
and I immediately see a better solution: Request a block of writable,
executable memory from the loader, then allocate our trampolines from it.
Other than being executable, this memory isn’t special, and <a href="/blog/2025/01/19/">allocation
works the usual way</a>, using functions unaware it’s executable. By
allocating through the loader, this memory will be part of our loaded
image, guaranteed to be close to our other code and data, allowing our JIT
compiler to assume <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models#small-code-model">a small code model</a>.</p>

<p>There are a number of ways to do this, and here’s one way to do it with
GNU-styled toolchains targeting COFF:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="nf">.section</span> <span class="nv">.exebuf</span><span class="p">,</span><span class="s">"bwx"</span>
        <span class="nf">.globl</span> <span class="nv">exebuf</span>
<span class="nl">exebuf:</span>	<span class="nf">.space</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">21</span>
</code></pre></div></div>

<p>This assembly program defines a new section named <code class="language-plaintext highlighter-rouge">.exebuf</code> containing 2M
of writable (<code class="language-plaintext highlighter-rouge">"w"</code>), executable (<code class="language-plaintext highlighter-rouge">"x"</code>) memory, allocated at run time just
like <code class="language-plaintext highlighter-rouge">.bss</code> (<code class="language-plaintext highlighter-rouge">"b"</code>). We’ll treat this like an arena out of which we can
allocate all trampolines we’ll probably ever need. With careful use of
<code class="language-plaintext highlighter-rouge">.pushsection</code> this could be basic inline assembly, but I’ve left it as a
separate source. On the C side I retrieve this like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="n">Arena</span> <span class="nf">get_exebuf</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">extern</span> <span class="kt">char</span> <span class="n">exebuf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">21</span><span class="p">];</span>
    <span class="n">Arena</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="n">exebuf</span><span class="p">,</span> <span class="n">exebuf</span><span class="o">+</span><span class="k">sizeof</span><span class="p">(</span><span class="n">exebuf</span><span class="p">)};</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately I have to repeat myself on the size. There are different
ways to deal with this, but this is simple enough for now. I would have
loved to define the array in C with the GCC <a href="https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variable-Attributes.html"><code class="language-plaintext highlighter-rouge">section</code> attribute</a>,
but as is usually the case with this attribute, it’s not up to the task,
lacking the ability to set section flags. Besides, by not relying on the
attribute, any C compiler could compile this source, and we only need a
GNU-style toolchain to create the tiny COFF object containing <code class="language-plaintext highlighter-rouge">exebuf</code>.</p>

<p>While we’re at it, a reminder of some other basic definitions we’ll need:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define S(s)            (Str){s, sizeof(s)-1}
#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>

<span class="n">Str</span> <span class="nf">clone</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="kt">char</span><span class="p">);</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which have been discussed at length in previous articles.</p>
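
<p>The <code class="language-plaintext highlighter-rouge">new</code> macro leans on an <code class="language-plaintext highlighter-rouge">alloc</code> function that isn’t shown above. As a
reminder of its general shape, here’s a minimal sketch of a bump allocator
with a matching signature. It’s an assumption for illustration, not
necessarily the definition used in the complete example: it allocates from
the front of the arena, zeroes the result, and returns null on exhaustion.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

// Hypothetical bump allocator: carve count*size bytes off the front of
// the arena, aligned to align (a power of two). Returns null when the
// request does not fit. The real alloc may behave differently.
static void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    // Bytes needed to round beg up to the next multiple of align
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg - pad;
    // Dividing instead of multiplying keeps the size check overflow-free
    if (count < 0 || avail < 0 || (size > 0 && count > avail/size)) {
        return 0;
    }
    char *r = a->beg + pad;
    a->beg = r + count*size;
    return memset(r, 0, (size_t)(count*size));
}
```

<p>Nothing here cares whether the arena’s memory is executable, which is why
the same allocator serves both ordinary data and trampolines.</p>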

<h3 id="trampoline-compiler">Trampoline compiler</h3>

<p>From here the plan is to create a function that accepts a <code class="language-plaintext highlighter-rouge">Wndproc5</code> and a
context pointer to bind, and returns a classic <code class="language-plaintext highlighter-rouge">WNDPROC</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WNDPROC</span> <span class="nf">make_wndproc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">Wndproc5</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>Our window procedure now gets a fifth argument with the program state:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LRESULT</span> <span class="nf">my_wndproc</span><span class="p">(</span><span class="n">HWND</span><span class="p">,</span> <span class="n">UINT</span><span class="p">,</span> <span class="n">WPARAM</span><span class="p">,</span> <span class="n">LPARAM</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When registering the class we wrap it in a trampoline compatible with
<code class="language-plaintext highlighter-rouge">RegisterClass</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">RegisterClassA</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">WNDCLASSA</span><span class="p">){</span>
        <span class="c1">// ...</span>
        <span class="p">.</span><span class="n">lpfnWndProc</span>   <span class="o">=</span> <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">),</span>
        <span class="p">.</span><span class="n">lpszClassName</span> <span class="o">=</span> <span class="s">"my_class"</span><span class="p">,</span>
        <span class="c1">// ...</span>
    <span class="p">});</span>
</code></pre></div></div>

<p>All windows using this class will readily have access to this state object
through their fifth parameter. It turns out setting up <code class="language-plaintext highlighter-rouge">exebuf</code> was the
more complicated part, and <code class="language-plaintext highlighter-rouge">make_wndproc</code> is quite simple!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WNDPROC</span> <span class="nf">make_wndproc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Wndproc5</span> <span class="n">proc</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">thunk</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span>
        <span class="s">"</span><span class="se">\x48\x83\xec\x28</span><span class="s">"</span>      <span class="c1">// sub   $40, %rsp</span>
        <span class="s">"</span><span class="se">\x48\xb8</span><span class="s">........"</span>      <span class="c1">// movq  $arg, %rax</span>
        <span class="s">"</span><span class="se">\x48\x89\x44\x24\x20</span><span class="s">"</span>  <span class="c1">// mov   %rax, 32(%rsp)</span>
        <span class="s">"</span><span class="se">\xe8</span><span class="s">...."</span>              <span class="c1">// call  proc</span>
        <span class="s">"</span><span class="se">\x48\x83\xc4\x28</span><span class="s">"</span>      <span class="c1">// add   $40, %rsp</span>
        <span class="s">"</span><span class="se">\xc3</span><span class="s">"</span>                  <span class="c1">// ret</span>
    <span class="p">);</span>
    <span class="n">Str</span> <span class="n">r</span>   <span class="o">=</span> <span class="n">clone</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">thunk</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">proc</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="mi">24</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span> <span class="mi">6</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">arg</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="mi">20</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rel</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">rel</span><span class="p">));</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">WNDPROC</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The assembly allocates a new stack frame, with callee shadow space, and
with room for the new argument, which also happens to re-align the stack.
It stores the new argument for the <code class="language-plaintext highlighter-rouge">Wndproc5</code> just above the shadow space,
then calls into the <code class="language-plaintext highlighter-rouge">Wndproc5</code> without touching the other parameters. There
are two “patches” to fill out, which I’ve initially filled with dots: the
context pointer itself, and a 32-bit signed relative address for the call.
Since the trampoline lives in <code class="language-plaintext highlighter-rouge">exebuf</code>, it’s always near the callee, so
32 bits suffice. The only thing I don’t like about this function is that
I’ve manually worked out the patch offsets.</p>
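
<p>One way to avoid counting bytes by hand is to let the program locate the
dot placeholders itself. The sketch below uses a hypothetical <code class="language-plaintext highlighter-rouge">find_patch</code>
helper that isn’t part of the linked example. It only works as long as no
run of real instruction bytes looks like a run of <code class="language-plaintext highlighter-rouge">0x2e</code> (<code class="language-plaintext highlighter-rouge">.</code>) bytes,
which happens to hold for this thunk.</p>

```c
#include <stddef.h>

// Same template as make_wndproc, with dots marking the two patches.
static const char thunk[] =
    "\x48\x83\xec\x28"      // sub   $40, %rsp
    "\x48\xb8........"      // movq  $arg, %rax
    "\x48\x89\x44\x24\x20"  // mov   %rax, 32(%rsp)
    "\xe8...."              // call  proc
    "\x48\x83\xc4\x28"      // add   $40, %rsp
    "\xc3";                 // ret

// Hypothetical helper: offset of the first run of n placeholder dots at
// or after start in a template of length len, or -1 if there is none.
static ptrdiff_t find_patch(const char *code, ptrdiff_t len,
                            ptrdiff_t start, ptrdiff_t n)
{
    for (ptrdiff_t i = start; i+n <= len; i++) {
        ptrdiff_t run = 0;
        while (run < n && code[i+run] == '.') {
            run++;
        }
        if (run == n) {
            return i;
        }
    }
    return -1;
}
```

<p>Searching for eight dots from the start yields 6, searching for four dots
after that patch yields 20, and the call’s return address sits four bytes
later at 24, matching the constants in <code class="language-plaintext highlighter-rouge">make_wndproc</code>.</p>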

<p>It’s probably not useful, but it’s easy to update the context pointer at
any time if you hold onto the trampoline pointer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">set_wndproc_arg</span><span class="p">(</span><span class="n">WNDPROC</span> <span class="n">p</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">memcpy</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="o">+</span><span class="mi">6</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">arg</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So, for instance:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">...;</span>  <span class="c1">// multiple states</span>
    <span class="n">WNDPROC</span> <span class="n">proc</span> <span class="o">=</span> <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="c1">// ...</span>
    <span class="n">set_wndproc_arg</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>  <span class="c1">// switch states</span>
</code></pre></div></div>

<p>Though I expect the most common case is just creating multiple procedures:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">WNDPROC</span> <span class="n">procs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
        <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span>
    <span class="p">};</span>
</code></pre></div></div>

<p>To my slight surprise these trampolines still work with an active <a href="https://learn.microsoft.com/en-us/windows/win32/secbp/control-flow-guard">Control
Flow Guard</a> system policy. Trampolines do not have stack unwind
entries, and I thought Windows might refuse to pass control to them.</p>

<p>Here’s a complete, runnable example if you’d like to try it yourself:
<a href="https://gist.github.com/skeeto/13363b78489b26bed7485ec0d6b2c7f8"><code class="language-plaintext highlighter-rouge">main.c</code> and <code class="language-plaintext highlighter-rouge">exebuf.s</code></a></p>

<h3 id="better-cases">Better cases</h3>

<p>This is more work than going through <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code>, and real programs
have a small, fixed number of window procedures — typically one — so this
isn’t the best example, but I wanted to illustrate with a real interface.
Again, perhaps the best real use is a library with a weak custom allocator
interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">malloc</span><span class="p">)(</span><span class="kt">size_t</span><span class="p">);</span>   <span class="c1">// no context pointer!</span>
    <span class="kt">void</span>  <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>     <span class="c1">// "</span>
<span class="p">}</span> <span class="n">Allocator</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">arena_malloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>

<span class="c1">// ...</span>

    <span class="n">Allocator</span> <span class="n">perm_allocator</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">make_trampoline</span><span class="p">(</span><span class="n">exearena</span><span class="p">,</span> <span class="n">arena_malloc</span><span class="p">,</span> <span class="n">perm</span><span class="p">),</span>
        <span class="p">.</span><span class="n">free</span>   <span class="o">=</span> <span class="n">noop_free</span><span class="p">,</span>
    <span class="p">};</span>
    <span class="n">Allocator</span> <span class="n">scratch_allocator</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">make_trampoline</span><span class="p">(</span><span class="n">exearena</span><span class="p">,</span> <span class="n">arena_malloc</span><span class="p">,</span> <span class="n">scratch</span><span class="p">),</span>
        <span class="p">.</span><span class="n">free</span>   <span class="o">=</span> <span class="n">noop_free</span><span class="p">,</span>
    <span class="p">};</span>
</code></pre></div></div>

<p>Something to keep in my back pocket for the future.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Meet the new xxd for w64devkit: rexxd</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/02/17/"/>
    <id>urn:uuid:a3ad2465-f53c-43d3-acc7-988d9d4d3989</id>
    <updated>2025-02-17T00:49:49Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>xxd is a versatile hexdump utility with a “reverse” feature, originally
written between 1990–1996. The Vim project soon adopted it, and it’s lived
there ever since. If you have Vim, you also have xxd. Its primary use
cases are (1) the basis for a hex editor due to its <code class="language-plaintext highlighter-rouge">-r</code> reverse option
that can <em>unhexdump</em> its previous output, and (2) a data embedding tool
for C and C++ (<code class="language-plaintext highlighter-rouge">-i</code>). The former provides Vim’s rudimentary hex editor
functionality. The second case is of special interest to <a href="https://github.com/skeeto/w64devkit">w64devkit</a>:
<code class="language-plaintext highlighter-rouge">xxd -i</code> appears in many builds that <a href="/blog/2016/11/15/">embed arbitrary data</a>. It’s
important that w64devkit has a compatible implementation, and a freshly
rewritten, improved xxd, <strong><a href="https://github.com/skeeto/w64devkit/blob/master/src/rexxd.c">rexxd</a></strong>, now replaces the original xxd (as
<code class="language-plaintext highlighter-rouge">xxd</code>).</p>

<p>For those unfamiliar with xxd, examples are in order. Its default hexdump
output looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo hello world | xxd | tee dump
00000000: 6865 6c6c 6f20 776f 726c 640a            hello world.
</code></pre></div></div>

<p>Octets display in pairs with an ASCII text listing on the right. All
configurable. I can run this in reverse (<code class="language-plaintext highlighter-rouge">-r</code>), recovering the original
input:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ xxd -r dump
hello world
</code></pre></div></div>

<p>The tool reads the offset before the colon, the hexadecimal octets, and
ignores the text column. By editing <code class="language-plaintext highlighter-rouge">dump</code> with a text editor, I can
change the raw octets of the original input. From this point of view, the
hexdump is actually a program of two alternating instructions: seek and
write. xxd <em>seeks</em> to the offset, <em>writes</em> the octets, then repeats. It
also doesn’t truncate the output file, so a hexdump can express binary
patches as a seek/write program.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo hello world &gt;hello
$ echo 6: 65766572796f6e650a | xxd -r - hello
$ cat hello
hello everyone
</code></pre></div></div>

<p>That seeks to offset <code class="language-plaintext highlighter-rouge">0x6</code>, then writes the 9 octets. The xxd parser is
flexible, and I did not need to follow the default format. It figured out
the format on its own, and rexxd further improves on this. We can use it
to create large files out of thin air, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo 3fffffff: 00 | xxd -r - &gt;1G
</code></pre></div></div>

<p>This command creates an all-zero, 1GiB file, <code class="language-plaintext highlighter-rouge">1G</code>, by seeking to just
before 1GiB then writing a zero. I used <code class="language-plaintext highlighter-rouge">&gt;1G</code> so that the shell would
truncate the file before starting <code class="language-plaintext highlighter-rouge">xxd</code> — in case it was larger or
contained non-zeros.</p>

<p>This is a “smart seek” of course, and it’s not literally seeking on every
line. The tool tracks its file position and only seeks when necessary. If
seeking fails, it simulates the seek using a write if possible. When would
it not be possible? Lines need not be in order, of course, and so it may
need to seek backwards. Lines can also overlap in contents. If it weren’t
for buffering — or if rexxd had a <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/UnifiedBufferCache">unified buffer cache</a> — then by
using the same file for input and output an “xxd program” could write new
instructions for itself and <a href="/blog/2016/04/30/">accidentally become Turing-complete</a>.</p>
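
<p>To make the seek/write model concrete, here’s a minimal sketch that runs
such a program against an in-memory buffer instead of a file. The <code class="language-plaintext highlighter-rouge">File</code>
struct and <code class="language-plaintext highlighter-rouge">run_line</code> function are hypothetical names for illustration, and
the parser is far simpler than either xxd’s or rexxd’s: it handles a single
<code class="language-plaintext highlighter-rouge">offset: octets</code> line, stops at the first non-hexadecimal character, and
assumes the backing store starts out zeroed.</p>

```c
#include <stddef.h>

typedef struct {
    unsigned char *data;   // backing store standing in for the output file
    size_t         cap;    // capacity of the backing store
    size_t         len;    // current "file" length
    size_t         pos;    // current write position
    int            seeks;  // number of explicit seeks performed
} File;

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Run one "offset: octets" line as a seek followed by writes. Returns 0
// on success, -1 on a malformed line or when the store is too small.
static int run_line(File *f, const char *line)
{
    size_t off = 0;
    int ndigits = 0;
    for (; hexval(*line) >= 0; line++, ndigits++) {
        off = off*16 + (size_t)hexval(*line);
    }
    if (!ndigits || *line++ != ':') {
        return -1;
    }

    if (off != f->pos) {  // "smart seek": skip it when already there
        f->pos = off;
        f->seeks++;
    }

    for (;;) {
        while (*line == ' ') {
            line++;
        }
        int hi = hexval(line[0]);
        if (hi < 0) {
            break;  // end of line, or start of something non-hex
        }
        int lo = hexval(line[1]);
        if (lo < 0 || f->pos >= f->cap) {
            return -1;
        }
        f->data[f->pos++] = (unsigned char)(hi<<4 | lo);
        if (f->pos > f->len) {
            f->len = f->pos;  // writing past the end grows the "file"
        }
        line += 2;
    }
    return 0;
}
```

<p>Running <code class="language-plaintext highlighter-rouge">6: 65766572796f6e650a</code> over a buffer holding <code class="language-plaintext highlighter-rouge">hello world</code>
performs one seek and nine writes. A following line at offset <code class="language-plaintext highlighter-rouge">f</code> needs no
seek at all, since the position is already there.</p>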

<p>The other common mode, <code class="language-plaintext highlighter-rouge">-i</code>, looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo hello world &gt;hello
$ xxd -i hello hello.c
</code></pre></div></div>

<p>Which produces this <code class="language-plaintext highlighter-rouge">hello.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">hello</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="mh">0x68</span><span class="p">,</span> <span class="mh">0x65</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x6f</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="mh">0x77</span><span class="p">,</span> <span class="mh">0x6f</span><span class="p">,</span> <span class="mh">0x72</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x64</span><span class="p">,</span> <span class="mh">0x0a</span>
<span class="p">};</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">hello_len</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span>
</code></pre></div></div>

<p>Note how it converted the file name into variable names. Characters
disallowed in variable names become underscores <code class="language-plaintext highlighter-rouge">_</code>. When reading from
standard input, xxd only emits the octets. Unless the new-ish <code class="language-plaintext highlighter-rouge">-n</code> name
option is given, in which case that becomes the variable name. This
remains popular because, <a href="https://en.cppreference.com/w/c/preprocessor/embed"><code class="language-plaintext highlighter-rouge">#embed</code></a> notwithstanding, as of this
writing all major toolchains remain stubborn about embedding data on their
own.</p>
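
<p>The name conversion can be sketched in a few lines. This version is an
ASCII-only approximation using a hypothetical <code class="language-plaintext highlighter-rouge">make_ident</code> function. It
doesn’t attempt any special handling for names beginning with a digit, nor
the Unicode identifiers that rexxd produces later in this article.</p>

```c
#include <stddef.h>

// Copy name into out (capacity cap, terminator included), replacing every
// byte that is not an ASCII letter, digit, or underscore with '_'.
// Truncates if name does not fit.
static void make_ident(char *out, size_t cap, const char *name)
{
    size_t i = 0;
    for (; i+1 < cap && name[i]; i++) {
        char c  = name[i];
        int  ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                  (c >= '0' && c <= '9') || c == '_';
        out[i] = ok ? c : '_';
    }
    if (cap) {
        out[i] = 0;
    }
}
```

<p>Given <code class="language-plaintext highlighter-rouge">my-data.bin</code>, this yields <code class="language-plaintext highlighter-rouge">my_data_bin</code>, to which a <code class="language-plaintext highlighter-rouge">_len</code> suffix
can be appended for the length variable.</p>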

<h3 id="the-case-for-replacement">The case for replacement</h3>

<p>The idea of replacing it began with backporting the <code class="language-plaintext highlighter-rouge">-n</code> name option to
Vim 9.0 xxd. The feature did not appear in a release until a year ago, 28
years after <code class="language-plaintext highlighter-rouge">-i</code>, despite its obviousness. I’ve also felt that <code class="language-plaintext highlighter-rouge">xxd</code> is
slower than it could be, and a momentary examination reveals it’s buggier
than it ought to be. <a href="/blog/2025/02/05/">As expected</a>, a few seconds of fuzz testing
<code class="language-plaintext highlighter-rouge">xxd -r</code> reveals bugs, and it doesn’t even require writing a single line
of code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ afl-gcc -fsanitize=address,undefined xxd.c
$ mkdir inputs
$ echo &gt;inputs/sample
$ afl-fuzz -i inputs/ -o fuzzout/ ./a.out -r
</code></pre></div></div>

<p>The Windows port is lacking in the usual ways, unable to handle Unicode
paths. The new Vim 9.1 xxd <code class="language-plaintext highlighter-rouge">-R</code> color feature broke the Windows port, and
if w64devkit included Vim 9.1 then I’d need to patch out the new bugs. As
demonstrated above, at least it’s trivial to compile! It’s a single source
file, <code class="language-plaintext highlighter-rouge">xxd.c</code>, and requires no configuration. I love that.</p>

<p>The more I looked, the more problems I found. It’s not doing anything
terribly complex, so I expected it wouldn’t be difficult to rewrite it
with a better foundation. So I did. Ignoring tests and documentation, my
rewrite is about twice as long. In exchange, it’s <em>substantially faster</em>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dd if=/dev/urandom of=bigfile bs=1M count=64

$ time orig-xxd bigfile dump
real    0m 4.40s
user    0m 2.89s
sys     0m 1.46s

$ time rexxd bigfile dump
real    0m 0.31s
user    0m 0.07s
sys     0m 0.21s
</code></pre></div></div>

<p>Same in reverse:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time orig-xxd -r dump nul
real    0m 5.81s
user    0m 5.67s
sys     0m 0.07s

$ time rexxd -r dump nul
real    0m 0.33s
user    0m 0.23s
sys     0m 0.09s
</code></pre></div></div>

<p>Or when embedding data:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time orig-xxd -i bigfile bigfile.c
real    0m 10.32s
user    0m 9.85s
sys     0m 0.37s

$ time rexxd -i bigfile bigfile.c
real    0m 0.40s
user    0m 0.07s
sys     0m 0.34s
</code></pre></div></div>

<p>I wanted to keep it portable and simple, so that’s without <a href="/blog/2021/12/04/">fancy SIMD
processing</a>. Just <a href="/blog/2022/04/30/">SWAR parsing</a>, <a href="/blog/2017/10/06/">branch avoidance</a>,
no division on hot paths, and sound architecture. I also optimized for the
typical case at the cost of the atypical case. It’s a little unfair to
compare it to a program probably first written on a 16-bit machine, but
there was time for it to pick up these techniques over the decades, too.</p>
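
<p>As a small illustration of branch avoidance, and not necessarily how
rexxd does it, here’s a hypothetical <code class="language-plaintext highlighter-rouge">hexdigit</code> function that formats a
nibble with neither a conditional nor a table lookup. It relies only on
unsigned wraparound.</p>

```c
// Map a nibble (0..15) to its lowercase hex digit without branching.
// For n > 9 the unsigned subtraction wraps around, so the shifted value
// has all its low bits set and the mask contributes 'a' - '0' - 10 = 39.
// For n <= 9 the shift leaves zero and nothing is added.
static char hexdigit(unsigned n)
{
    return (char)('0' + n + ((9u - n) >> 8 & 39u));
}
```

<p>The same shape of trick extends to operating on several nibbles at once
inside one wide integer, which is the essence of SWAR.</p>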

<p>Unicode support works well:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat π
3.14159265358979323846264338327950288419716939937510582097494
$ rexxd -i π π.c
</code></pre></div></div>

<p>Producing this source with Unicode variables:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="err">π</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="mh">0x33</span><span class="p">,</span> <span class="mh">0x2e</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span> <span class="mh">0x39</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x36</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span> <span class="mh">0x33</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span>
  <span class="c1">// ...</span>
  <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x0a</span>
<span class="p">};</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="err">π</span><span class="n">_len</span> <span class="o">=</span> <span class="mi">62</span><span class="p">;</span>
</code></pre></div></div>

<p>Whereas the original xxd on Windows has the <a href="/blog/2021/12/30/">usual CRT problems</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ orig-xxd -i π
orig-xxd: p: No such file or directory
</code></pre></div></div>

<p>It also struggles with 64-bit offsets, particularly on 32-bit hosts and
LLP64 hosts like Windows. In contrast, I designed rexxd to robustly
process file offsets as 64-bit on all hosts. Its tests operate on a
virtual file system with virtual files at those sizes, so those paths
really have been tested, too.</p>

<p>The original xxd only uses static allocation, which places small range
limits on the configuration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ orig-xxd -c 1000
orig-xxd: invalid number of columns (max. 256).
</code></pre></div></div>

<p>In rexxd everything is <a href="/blog/2023/09/27/">arena allocated</a> of course, and options are
limited only by the available memory, so the above, and more, would work.
The arena helps make the SWAR tricks possible, too, providing a fast
runway to load more data at a time.</p>

<p>While reverse engineering the original, I documented the bugs I discovered,
marking each with a <code class="language-plaintext highlighter-rouge">BUG:</code> comment if you want to see more. I’m not aiming
for bug compatibility, so these are not present in rexxd.</p>

<h3 id="platform-layer">Platform layer</h3>

<p>The <a href="https://manpages.debian.org/bookworm/xxd/xxd.1.en.html">xxd man page</a> suggests using strace to examine the execution of
<code class="language-plaintext highlighter-rouge">-r</code> reverse. That is, to monitor the seeks and writes of a binary patch
in order to debug it. That’s so insightful that I decided to build that as
a new <code class="language-plaintext highlighter-rouge">-x</code> option (think <code class="language-plaintext highlighter-rouge">sh -x</code>). That is, <em>rexxd has a built-in strace
on all platforms!</em> The trace is expressed in terms of unix system calls,
even on Windows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ printf '00:41 \n02:42 \n04:43' | rexxd -x -r - data.bin
open("data.bin", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 19
write(1, "A", 1) = 1
lseek(1, 2, SEEK_SET) = 2
read(0, ..., 4096) = 0
write(1, "B", 1) = 1
lseek(1, 4, SEEK_SET) = 4
write(1, "C", 1) = 1
exit(0) = ?
</code></pre></div></div>

<p>Is this doing some kind of self-<a href="/blog/2018/06/23/">ptrace</a> debugger voodoo? Nope. Like
<a href="/blog/2023/01/18/">u-config</a>, it has a <em>platform layer</em>, and it simply logs the platform
layer calls — except for the trace printout itself of course. While the
intention is to debug binary patches, it was also quite insightful in
examining rexxd itself. It helped me spot that rexxd flushed more often
than strictly necessary.</p>
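<p>As a rough sketch of that idea, and not rexxd’s actual code, a traced call can be a thin wrapper that calls into the platform and then logs the call in unix terms. The names <code class="language-plaintext highlighter-rouge">traced_write</code> and the in-memory <code class="language-plaintext highlighter-rouge">Plt</code> below are assumptions for illustration, and the log goes to a caller-supplied buffer without escaping special characters:</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Hypothetical platform state: an in-memory "file" standing in for
// whatever the real platform layer keeps in Plt.
typedef struct {
    char data[256];
    int  len;
} Plt;

// Hypothetical platform function: appends to the in-memory file and
// returns nonzero on success.
int plt_write(Plt *plt, int fd, char *buf, int len)
{
    (void)fd;
    assert(len >= 0);
    if (len > (int)sizeof(plt->data) - plt->len) {
        return 0;
    }
    memcpy(plt->data + plt->len, buf, len);
    plt->len += len;
    return 1;
}

// Application-layer wrapper: call through, then describe the call the
// way strace would. The platform never sees the logging.
int traced_write(Plt *plt, int fd, char *buf, int len, char *log, int cap)
{
    int ok = plt_write(plt, fd, buf, len);
    snprintf(log, (size_t)cap, "write(%d, \"%.*s\", %d) = %d",
             fd, len, buf, len, ok ? len : -1);
    return ok;
}
```

<p>Since the wrapper sits above the platform layer, the same trace format falls out on every platform, whatever the platform function does underneath.</p>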

<p>To port rexxd to any system, define <code class="language-plaintext highlighter-rouge">Plt</code> as needed, implement these five
<code class="language-plaintext highlighter-rouge">plt_</code> functions, then call <code class="language-plaintext highlighter-rouge">xxd</code>. The five functions mostly have the
expected unix-like semantics:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">Plt</span> <span class="n">Plt</span><span class="p">;</span>
<span class="n">b32</span>  <span class="nf">plt_open</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="n">b32</span> <span class="n">trunc</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>
<span class="n">i64</span>  <span class="nf">plt_seek</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">i64</span> <span class="n">off</span><span class="p">,</span> <span class="n">i32</span> <span class="n">whence</span><span class="p">);</span>
<span class="n">i32</span>  <span class="nf">plt_read</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="n">i32</span> <span class="n">len</span><span class="p">);</span>
<span class="n">b32</span>  <span class="nf">plt_write</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="n">i32</span> <span class="n">len</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">plt_exit</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span><span class="p">);</span>
<span class="n">i32</span>  <span class="nf">xxd</span><span class="p">(</span><span class="n">i32</span> <span class="n">argc</span><span class="p">,</span> <span class="n">u8</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">byte</span> <span class="o">*</span><span class="n">heap</span><span class="p">,</span> <span class="n">iz</span> <span class="n">heapsize</span><span class="p">);</span>
</code></pre></div></div>

<p>If the platform wants these functions to be “virtual” then it can put
function pointers in the <code class="language-plaintext highlighter-rouge">Plt</code> struct. Otherwise it stores anything it
might need in <code class="language-plaintext highlighter-rouge">Plt</code>. Global variables are never necessary. The application
layer doesn’t use the standard library except (indirectly) <code class="language-plaintext highlighter-rouge">memset</code> and
<code class="language-plaintext highlighter-rouge">memcpy</code>, and it allocates everything it uses from the provided <code class="language-plaintext highlighter-rouge">heap</code>
parameter.</p>

<p><code class="language-plaintext highlighter-rouge">plt_open</code> is a little unusual in that it picks the file descriptor: 0 to
replace standard input, or 1 to replace standard output. All platforms
currently use a virtual file descriptor table, and these do not map onto
the real process file descriptors. But they could! Calls are straced in
the application layer, so they log virtual file descriptors as seen by
rexxd. The arena parameter offers scratch space for the Windows platform
layer to convert paths from narrow to wide for <code class="language-plaintext highlighter-rouge">CreateFileW</code>, so it can
handle <a href="https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation">long path names</a> with ease.</p>

<p><code class="language-plaintext highlighter-rouge">plt_read</code> doesn’t accept a file descriptor because there’s only one from
which to read, 0. <code class="language-plaintext highlighter-rouge">plt_write</code> on the other hand allows writing to standard
error, 2.</p>

<p><code class="language-plaintext highlighter-rouge">plt_exit</code> doesn’t return, of course. In tests it <a href="/blog/2023/02/12/">longjmps</a> back
to the top level, as though returning from <code class="language-plaintext highlighter-rouge">xxd</code> with a status. This lets
me skip allocation null pointer checks, with OOM unwinding safely back to
the top level. Since rexxd allocates everything from the arena, it’s all
automatically deallocated, so it’s a clean exit.</p>
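<p>A minimal sketch of the longjmp arrangement, assuming a test-only <code class="language-plaintext highlighter-rouge">Plt</code> that carries the jump buffer. The <code class="language-plaintext highlighter-rouge">app</code> and <code class="language-plaintext highlighter-rouge">run</code> functions are hypothetical stand-ins, not rexxd’s test harness:</p>

```c
#include <assert.h>
#include <setjmp.h>

typedef struct {
    jmp_buf top;     // where plt_exit unwinds to
    int     status;  // exit status captured for the test
} Plt;

// Test-harness plt_exit: record the status and unwind. Never returns.
void plt_exit(Plt *plt, int status)
{
    plt->status = status;
    longjmp(plt->top, 1);
}

// Stand-in for the application: always leaves through plt_exit,
// possibly from deep in the call stack.
void app(Plt *plt, int fail)
{
    if (fail) {
        plt_exit(plt, 3);  // e.g. out of memory, no error return needed
    }
    plt_exit(plt, 0);
}

// Runs the application and reports its status as though it returned.
int run(Plt *plt, int fail)
{
    if (!setjmp(plt->top)) {
        app(plt, fail);
        assert(0 && "unreachable: app always exits via plt_exit");
    }
    return plt->status;
}
```

<p>The status lives in <code class="language-plaintext highlighter-rouge">Plt</code> rather than in a local variable of <code class="language-plaintext highlighter-rouge">run</code>, which sidesteps the usual caveat about locals modified between <code class="language-plaintext highlighter-rouge">setjmp</code> and <code class="language-plaintext highlighter-rouge">longjmp</code>.</p>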

<p>On Windows, <code class="language-plaintext highlighter-rouge">plt_seek</code> calls <a href="https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-setfilepointerex"><code class="language-plaintext highlighter-rouge">SetFilePointerEx</code></a>. I learned the
hard way that the behavior of calling it on a non-file is undefined, not
an error, so at least one <code class="language-plaintext highlighter-rouge">GetFileType</code> call is mandatory. I also learned
that Windows will successfully seek all the way to <code class="language-plaintext highlighter-rouge">INT64_MAX</code>. If the
file system doesn’t support that offset, it’s a write failure <em>later</em>.
Because Windows allows seeks to operate at that edge until the first flush,
rexxd must take care not to overflow its own internal file position
tracking near these offsets. Tests run on a virtual file
system thanks to the platform layer, and some tests permit huge seeks and
simulate impossibly enormous files in order to probe behavior at the
extremes.</p>

<p>This is in contrast to Linux, where a seek beyond the underlying file
system’s supported file size is a seek error. For example, on ext4 with
the default configuration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo ffffffff000: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040320, SEEK_SET) = 17592186040320
read(0, ..., 4096) = 0
write(1, "\0", 1) = -1
exit(3) = ?
</code></pre></div></div>

<p>We can see the seek succeeded, then the write failed because it went one
byte beyond the file system limit. Seeking one byte further causes the seek
itself to fail (22 <code class="language-plaintext highlighter-rouge">EINVAL</code>), and rexxd falls back on writes until
it fills the storage and runs out of space:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo ffffffff001: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040321, SEEK_SET) = -1
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
...
</code></pre></div></div>

<p>Mostly for fun, I wrote a libc-free platform layer using <a href="/blog/2023/03/23/">raw Linux system
calls</a>, and it maps <em>almost</em> perfectly onto the kernel interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">Plt</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">fds</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span> <span class="p">};</span>

<span class="n">b32</span> <span class="nf">plt_open</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="n">plt</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="n">b32</span> <span class="n">trunc</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">i32</span> <span class="n">mode</span> <span class="o">=</span> <span class="n">fd</span> <span class="o">?</span> <span class="n">O_CREAT</span><span class="o">|</span><span class="n">O_WRONLY</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">mode</span> <span class="o">|=</span> <span class="n">trunc</span> <span class="o">?</span> <span class="n">O_TRUNC</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="n">fd</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">i32</span><span class="p">)</span><span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_open</span><span class="p">,</span> <span class="p">(</span><span class="n">uz</span><span class="p">)</span><span class="n">path</span><span class="p">,</span> <span class="n">mode</span><span class="p">,</span> <span class="mo">0666</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="n">fd</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">i64</span> <span class="nf">plt_seek</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="n">plt</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">i64</span> <span class="n">off</span><span class="p">,</span> <span class="n">i32</span> <span class="n">whence</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_lseek</span><span class="p">,</span> <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="n">fd</span><span class="p">],</span> <span class="n">off</span><span class="p">,</span> <span class="n">whence</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">i32</span> <span class="nf">plt_read</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="n">plt</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="n">i32</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">i32</span><span class="p">)</span><span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_read</span><span class="p">,</span> <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="n">uz</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">b32</span> <span class="nf">plt_write</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="n">plt</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="n">i32</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">len</span> <span class="o">==</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_write</span><span class="p">,</span> <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="n">fd</span><span class="p">],</span> <span class="p">(</span><span class="n">uz</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">plt_exit</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">r</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On Windows I use the <a href="/blog/2023/05/31/">artisanal function prototypes</a> of which I’ve
grown so fond. It’s also my first time using w64devkit’s <code class="language-plaintext highlighter-rouge">-lmemory</code> in a
serious application. I’m using <a href="/blog/2024/02/05/"><code class="language-plaintext highlighter-rouge">-lchkstk</code></a> in the “xxd as a DLL”
platform layer, too, but that one’s just a toy. In that one I use <code class="language-plaintext highlighter-rouge">alloca</code>
to allocate an arena, which is a rather novel combination, and the large
stack frame requires a stack probe. Otherwise none of rexxd requires stack
probes.</p>

<p>w64devkit’s new <code class="language-plaintext highlighter-rouge">xxd.exe</code> is delightfully tidy as viewed by <a href="/blog/2024/06/30/">peports</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ du -h xxd.exe
28.0K   xxd.exe
$ peports xxd.exe
KERNEL32.dll
        0       CreateFileW
        0       ExitProcess
        0       GetCommandLineW
        0       GetFileType
        0       GetStdHandle
        0       MultiByteToWideChar
        0       ReadFile
        0       SetFilePointerEx
        0       VirtualAlloc
        0       WideCharToMultiByte
        0       WriteFile
SHELL32.dll
        0       CommandLineToArgvW
</code></pre></div></div>

<h3 id="other-notes">Other notes</h3>

<p><a href="/blog/2023/02/13/">Buffered output</a> and buffered input are custom tailored for rexxd.
When parsing line-oriented input, like <code class="language-plaintext highlighter-rouge">-r</code>, it attempts to parse out of
a <em>view</em> of the input buffer, no copying. The view is the <a href="/blog/2025/01/19/">usual string
representation</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct {
    u8 *data;
    iz  len;
} Str;
</code></pre></div></div>

<p>Does it fail if the line is longer than the buffer? If it straddles reads,
does that hurt efficiency? The answer to both is “no” due to the spillover
arena. <code class="language-plaintext highlighter-rouge">Input</code> is the buffered input struct, and here’s the interface to
get the next line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Str nextline(Input *, Arena *);
</code></pre></div></div>

<p>If the line isn’t entirely contained in the input buffer, the complete
line is <a href="/blog/2024/05/25/">concatenated</a> in the arena. So it comfortably handles
huge lines while no-copy optimizing for typical short, non-straddling
lines. With a per-iteration arena, any arena-backed line is automatically
freed at the end of the iteration, so it’s all transparent:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">Arena</span> <span class="n">scratch</span> <span class="o">=</span> <span class="n">perm</span><span class="p">;</span>
        <span class="n">Str</span> <span class="n">line</span> <span class="o">=</span> <span class="n">nextline</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
        <span class="c1">// ... line may point into an Input or scratch ...</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>If the line doesn’t fit in the arena, it triggers OOM handling. That is,
it calls <code class="language-plaintext highlighter-rouge">plt_exit</code> and something platform-appropriate happens without
returning. Beats the pants off <a href="https://man7.org/linux/man-pages/man3/getline.3.html">old <code class="language-plaintext highlighter-rouge">getline</code></a>!</p>
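<p>To make the spillover concrete, here is a simplified sketch of how such a <code class="language-plaintext highlighter-rouge">nextline</code> might work. It is not rexxd’s implementation: the <code class="language-plaintext highlighter-rouge">Input</code> layout, the <code class="language-plaintext highlighter-rouge">refill</code> helper reading from an in-memory source, the tiny 8-byte buffer, and <code class="language-plaintext highlighter-rouge">abort</code> in place of <code class="language-plaintext highlighter-rouge">plt_exit</code> are all assumptions for illustration, and it uses standard types instead of <code class="language-plaintext highlighter-rouge">u8</code> and <code class="language-plaintext highlighter-rouge">iz</code>:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct { unsigned char *data; ptrdiff_t len; } Str;
typedef struct { unsigned char *beg, *end; } Arena;

typedef struct {
    unsigned char *src;     // stand-in for the file behind plt_read
    ptrdiff_t      srclen;
    unsigned char  buf[8];  // deliberately tiny to force straddling
    ptrdiff_t      len, off;
} Input;

// Stand-in for a plt_read call: load the next chunk into the buffer.
void refill(Input *in)
{
    assert(in->off == in->len);  // only refill a fully consumed buffer
    ptrdiff_t cap = sizeof(in->buf);
    ptrdiff_t n = in->srclen < cap ? in->srclen : cap;
    memcpy(in->buf, in->src, n);
    in->src    += n;
    in->srclen -= n;
    in->len = n;
    in->off = 0;
}

// Returns the next line without its newline. A null data pointer means
// end of input. The result is a view into in->buf when the whole line
// is already buffered, otherwise a copy built up in the arena.
Str nextline(Input *in, Arena *a)
{
    Str line = {0};
    for (;;) {
        if (in->off == in->len) {
            refill(in);
            if (!in->len) {
                return line;  // end of input, maybe a final partial line
            }
        }

        unsigned char *beg = in->buf + in->off;
        unsigned char *nl  = memchr(beg, '\n', in->len - in->off);
        ptrdiff_t n = nl ? nl - beg : in->len - in->off;
        in->off += n + !!nl;

        if (nl && !line.data) {
            line.data = beg;  // fast path: no copy
            line.len  = n;
            return line;
        }

        if (!line.data) {
            line.data = a->beg;  // begin spilling into the arena
        }
        if (a->end - a->beg < n) {
            abort();  // OOM: rexxd would call plt_exit here
        }
        memcpy(a->beg, beg, n);
        a->beg   += n;
        line.len += n;
        if (nl) {
            return line;
        }
    }
}
```

<p>In this sketch a returned view is only valid until the next call, since a refill overwrites the buffer. That matches the per-iteration usage in the loop above, where each line is consumed before the next is requested.</p>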

<p>I came up with a <code class="language-plaintext highlighter-rouge">maxof</code> macro that evaluates the maximum of any integral
type, signed or unsigned. It appears in <a href="/blog/2024/05/24/">overflow checks</a> and more,
and I really like how it turned out. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">pos</span> <span class="o">&gt;</span> <span class="n">maxof</span><span class="p">(</span><span class="n">i64</span><span class="p">)</span> <span class="o">-</span> <span class="n">off</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// overflow</span>
    <span class="p">}</span>
    <span class="n">pos</span> <span class="o">+=</span> <span class="n">off</span><span class="p">;</span>
</code></pre></div></div>

<p>Or:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">i32</span> <span class="nf">trunc32</span><span class="p">(</span><span class="n">iz</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">n</span><span class="o">&gt;</span><span class="n">maxof</span><span class="p">(</span><span class="n">i32</span><span class="p">)</span> <span class="o">?</span> <span class="n">maxof</span><span class="p">(</span><span class="n">i32</span><span class="p">)</span> <span class="o">:</span> <span class="p">(</span><span class="n">i32</span><span class="p">)</span><span class="n">n</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
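<p>The macro’s definition isn’t shown above. One way such a macro could be written, offered as a sketch and not necessarily rexxd’s definition, assumes 8-bit bytes and integer types without padding bits. The <code class="language-plaintext highlighter-rouge">add_i64</code> helper is hypothetical and only wraps the position check from the first example:</p>

```c
#include <assert.h>
#include <stdint.h>

// (t)-1<1 holds only for signed types. The signed branch computes
// 2^(N-1)-1 in steps that never overflow; the unsigned branch is
// simply all bits set.
#define maxof(t) ((t)-1<1 ? (((t)1<<(sizeof(t)*8-2))-1)*2+1 : (t)-1)

// Overflow-checked add for non-negative operands. Returns nonzero on
// success, and leaves *pos unchanged if the sum would overflow.
int add_i64(int64_t *pos, int64_t off)
{
    assert(*pos >= 0 && off >= 0);
    if (*pos > maxof(int64_t) - off) {
        return 0;
    }
    *pos += off;
    return 1;
}
```

<p>Because the macro takes a type, not a value, the check names its own limit and stays correct if the variable’s type later changes along with it.</p>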

<p>Now that I have <code class="language-plaintext highlighter-rouge">-lmemory</code> and generally solved string function issues for
myself, I leaned into <code class="language-plaintext highlighter-rouge">__builtin_memset</code> and <code class="language-plaintext highlighter-rouge">__builtin_memcpy</code> for this
project. Despite <code class="language-plaintext highlighter-rouge">restrict</code>, it’s surprisingly difficult to get compilers
to optimize loops into semantically equivalent string function calls. An
explicit built-in solves that. It also produces faster debug builds, which
is what I run while I work. At <code class="language-plaintext highlighter-rouge">-O0</code>, rexxd is about half the speed of a
release build.</p>

<p>Other than <code class="language-plaintext highlighter-rouge">-x</code>, I don’t plan on inventing new features. I’d like to
maintain compatibility with the <code class="language-plaintext highlighter-rouge">xxd</code> found everywhere else, and I don’t
expect adoption beyond w64devkit. Overall the project took about twice as
long as I anticipated — two weekends instead of one — but it turned out
better than I expected and I’m very pleased with the results.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Windows dynamic linking depends on the active code page</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/10/07/"/>
    <id>urn:uuid:cc7861a5-aaa0-4a27-8867-9f48cf72e444</id>
    <updated>2024-10-07T19:50:17Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Windows paths have been <a href="https://simonsapin.github.io/wtf-8/#ill-formed-utf-16">WTF-16</a>-encoded for decades, but module names
in the <a href="/blog/2024/06/30/">import tables</a> of <a href="https://learn.microsoft.com/en-us/windows/win32/debug/pe-format">Portable Executable</a> are octets.
If a name contains values beyond ASCII — technically out of spec — then
the dynamic linker must somehow decode those octets into Unicode in order
to construct a lookup path. There are multiple ways this could be done,
and the most obvious is the process’s active code page (ACP), which is
exactly what happens. As a consequence, the specific DLL loaded by the
linker may depend on the system code page. In this article I’ll contrive
such a situation.</p>

<p><a href="https://learn.microsoft.com/en-us/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibrarya">LoadLibraryA</a> is a similar situation, and potentially applies the code
page to a longer portion of the module path. <a href="https://learn.microsoft.com/en-us/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibraryw">LoadLibraryW</a> is
unaffected, at least for the directly-named module, because it’s Unicode
all the way through.</p>

<p>For my contrived demonstration I came up with two names that to
English-reading eyes appear as two words with extraneous markings:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">Ãµral.dll</code>: CP-1252=<code class="language-plaintext highlighter-rouge">"C3 B5 …"</code></li>
  <li><code class="language-plaintext highlighter-rouge">õral.dll</code>: CP-1252=<code class="language-plaintext highlighter-rouge">"F5 …"</code>; UTF-8=<code class="language-plaintext highlighter-rouge">"C3 B5 …"</code></li>
</ul>

<p>Both end with <code class="language-plaintext highlighter-rouge">ral.dll</code>. I’ve included the <a href="https://en.wikipedia.org/wiki/Windows-1252">CP-1252</a> encoding for the
differing prefixes, and the UTF-8 encoding for the second. I’m using
CP-1252 because it’s the most common system code page in the world,
especially in the Western hemisphere. Due to case insensitivity, the actual
DLL may be named <code class="language-plaintext highlighter-rouge">ãµral.dll</code> — i.e. to match the second library case — but
the module name <em>must</em> be encoded as uppercase when <a href="/blog/2021/05/31/">building the import
library</a>. Alternatively the second could be <code class="language-plaintext highlighter-rouge">Õral.dll</code>, particularly
because I won’t use it when constructing an import library.</p>

<p>The plan is to store the octets <code class="language-plaintext highlighter-rouge">C3 B5 …</code> in the import table. A process
using CP-1252 decodes it to <code class="language-plaintext highlighter-rouge">Ãµral.dll</code>. In the UTF-8 code page it decodes
to <code class="language-plaintext highlighter-rouge">õral.dll</code>. For testing we can use an <a href="/blog/2021/12/30/">application manifest</a> to
control the code page for a particular PE image — a lot easier than
changing the system code page. Otherwise, this trick could dynamically
change the behavior of a program in response to the system code page
without actually inspecting the active code page.</p>
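<p>The two decodings can be checked by hand with a small sketch. These are toy decoders written for this example, not what the Windows loader runs: the CP-1252 one handles only ASCII and the <code class="language-plaintext highlighter-rouge">A0</code> to <code class="language-plaintext highlighter-rouge">FF</code> range, where CP-1252 agrees with Latin-1, and the UTF-8 one handles only well-formed one- and two-byte sequences:</p>

```c
#include <assert.h>

// Decode as CP-1252, handling only ASCII and 0xA0..0xFF, where each
// octet is its own code point. Returns the number of code points.
int decode_cp1252(const unsigned char *s, int len, unsigned *out)
{
    for (int i = 0; i < len; i++) {
        assert(s[i] < 0x80 || s[i] >= 0xa0);  // 0x80..0x9F not handled
        out[i] = s[i];
    }
    return len;
}

// Decode as UTF-8, handling only well-formed 1- and 2-byte sequences.
// Returns the number of code points.
int decode_utf8(const unsigned char *s, int len, unsigned *out)
{
    int n = 0;
    for (int i = 0; i < len; n++) {
        if (s[i] < 0x80) {
            out[n] = s[i++];
        } else {
            assert((s[i] & 0xe0) == 0xc0 && i + 1 < len);
            assert((s[i+1] & 0xc0) == 0x80);
            out[n] = (unsigned)(s[i] & 0x1f) << 6 | (s[i+1] & 0x3f);
            i += 2;
        }
    }
    return n;
}
```

<p>Fed the octets <code class="language-plaintext highlighter-rouge">C3 B5</code>, the first yields U+00C3 U+00B5 (<code class="language-plaintext highlighter-rouge">Ãµ</code>) and the second yields the single code point U+00F5 (<code class="language-plaintext highlighter-rouge">õ</code>), so the same import table entry names two different files.</p>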

<p>The libraries will have a single function <code class="language-plaintext highlighter-rouge">get</code>, which returns a string
indicating which library was loaded:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define X(s) #s
#define S(s) X(s)
</span><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">get</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">S</span><span class="p">(</span><span class="n">V</span><span class="p">);</span> <span class="p">}</span>
</code></pre></div></div>

<p>Constructing the import library can be tricky because you must consider
how the toolchain, editors, and shells decode and encode text, which may
involve the build system’s code page. It’s shockingly difficult to script!
Binutils <code class="language-plaintext highlighter-rouge">dlltool</code> cannot process these names and cannot be used at all.
With bleeding edge <a href="https://github.com/skeeto/w64devkit">w64devkit</a> I could reliably construct the DLLs and
import library like so, even in a script (Windows 10 and later only):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -DV=UTF-8 -o Õral.dll  detect.c
$ gcc -shared -DV=ANSI  -o Ãµral.dll detect.c -Wl,--out-implib=detect.lib
</code></pre></div></div>

<p>That produces two DLLs and one import library, <code class="language-plaintext highlighter-rouge">detect.lib</code>, with the
desired module name octets. A straightforward MSVC <code class="language-plaintext highlighter-rouge">cl</code> invocation also
works so long as it’s not from a batch file. It will quite correctly warn
about the strange name situation, which I like. My test program, <code class="language-plaintext highlighter-rouge">main.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">char</span> <span class="o">*</span><span class="nf">get</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="n">puts</span><span class="p">(</span><span class="n">get</span><span class="p">());</span> <span class="p">}</span>
</code></pre></div></div>

<p>I link <code class="language-plaintext highlighter-rouge">detect.lib</code> when I build it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -o main.exe main.c detect.lib
</code></pre></div></div>

<p>I designed <a href="/blog/2024/06/30/"><code class="language-plaintext highlighter-rouge">peports</code></a> to print non-ASCII octets unambiguously
(<code class="language-plaintext highlighter-rouge">\xXX</code>), and it’s the only tool I know that does so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ peports main.exe | tail -n 2
\xc3\xb5ral.dll
        1       get
</code></pre></div></div>

<p>The module name has the <code class="language-plaintext highlighter-rouge">C3 B5 …</code> prefix octets. When I run it under my
system code page, CP-1252:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./main
ANSI
</code></pre></div></div>

<p>If I <a href="/blog/2021/12/30/">add a UTF-8 manifest</a>, even just a “side-by-side” manifest, it
loads the other library despite an identical import table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -o main.exe main.c detect.lib libwinsane.o
$ ./main
UTF-8
</code></pre></div></div>

<p>Again, without the manifest, if I switched my system code page to UTF-8
then <code class="language-plaintext highlighter-rouge">UTF-8</code> would still be the result.</p>

<p>I can’t think of much practical use for this trick outside of malware. In
a real program it would be simpler to inspect the code page, and there’s no
benefit to avoiding such a check if it’s needed. Malware could use it to
trick inspection tools and scanners that decode module names differently
than the dynamic linker. Such tools often incorrectly assume UTF-8, which
is what motivated this article.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Slim Reader/Writer Locks are neato</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/10/03/"/>
    <id>urn:uuid:0bbd925e-c012-4711-b513-b34cd0357bfa</id>
    <updated>2024-10-03T22:40:13Z</updated>
    <category term="win32"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>I’m 18 years late, but <a href="https://learn.microsoft.com/en-us/windows/win32/sync/slim-reader-writer--srw--locks">Slim Reader/Writer Locks</a> have a fantastic
interface: pointer-sized (“slim”), zero-initialized, and non-allocating.
Lacking cleanup, they compose naturally with <a href="/blog/2023/09/27/">arena allocation</a>.
Sounds like a futex? That’s because they’re built on futexes introduced at
the same time. They’re also complemented by <a href="https://learn.microsoft.com/en-us/windows/win32/sync/condition-variables">condition variables</a>
with the same desirable properties. My only quibble is that slim locks
<a href="/blog/2022/10/05/">could easily have been 32-bit objects</a>, but it hardly matters. This
article, while treating <a href="/blog/2023/05/31/">Win32 as a foreign interface</a>, discusses a
paper-thin C++ wrapper interface around lock and condition variables, in
<a href="/blog/2024/04/14/">my own style</a>.</p>

<p>If you’d like to see/try a complete, working demonstration before diving
into the details: <a href="https://gist.github.com/skeeto/42adc0c90a156d4457422e034be697e8"><code class="language-plaintext highlighter-rouge">demo.cpp</code></a>. We’re going to build this from the
ground up, so let’s establish a few primitive integer definitions:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">b32</span> <span class="o">=</span> <span class="kt">signed</span><span class="p">;</span>
<span class="k">using</span> <span class="n">i32</span> <span class="o">=</span> <span class="kt">signed</span><span class="p">;</span>
<span class="k">using</span> <span class="n">uz</span>  <span class="o">=</span> <span class="k">decltype</span><span class="p">(</span><span class="mi">0u</span><span class="n">z</span><span class="p">);</span>
</code></pre></div></div>

<p>Think of <code class="language-plaintext highlighter-rouge">uz</code> as being like <code class="language-plaintext highlighter-rouge">uintptr_t</code> (the <code class="language-plaintext highlighter-rouge">0uz</code> literal requires C++23). This
implementation will support both 32-bit and 64-bit targets, and we’ll need
it as the basis for locks and condition variables:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="n">Lock</span> <span class="o">:</span> <span class="n">uz</span><span class="p">;</span>
<span class="k">enum</span> <span class="n">Cond</span> <span class="o">:</span> <span class="n">uz</span><span class="p">;</span>
</code></pre></div></div>

<p>Opaque enums provide additional type safety: They have the properties of
an integer, including trivial destruction, but are distinct types which
compilers forbid mixing with other integers. We can’t, say, accidentally
cross condition variable and lock parameters — my main concern. Aside from
zero-initialization, we do not actually care about the values of these
variables, so enumerators are unnecessary. (Caveat: GDB cannot display
opaque enums, which is slightly irritating.)</p>
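
<p>To make those properties concrete, here’s a small portable sketch that
checks them at compile time. It is not part of the demo. It assumes
<code class="language-plaintext highlighter-rouge">std::size_t</code> as a stand-in for <code class="language-plaintext highlighter-rouge">decltype(0uz)</code> so that it builds before
C++23, and <code class="language-plaintext highlighter-rouge">kind</code> is a hypothetical pair of overloads that exists only to
show selection by type:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

using uz = std::size_t;  // assumption: stand-in for decltype(0uz) (C++23)

enum Lock : uz;  // opaque: fixed underlying type, no enumerators
enum Cond : uz;

// Distinct types, even though both are "just" a uz underneath.
static_assert(!std::is_same<Lock, Cond>::value, "");
// A plain integer will not implicitly become a Lock...
static_assert(!std::is_convertible<uz, Lock>::value, "");
// ...and a Lock* cannot be passed where a Cond* is expected.
static_assert(!std::is_convertible<Lock *, Cond *>::value, "");
// No destructor to run, so nothing to clean up.
static_assert(std::is_trivially_destructible<Lock>::value, "");

// Hypothetical overloads: the types alone pick different functions.
inline int kind(Lock *) { return 1; }
inline int kind(Cond *) { return 2; }

// Empty braces value-initialize an enum, which means zero.
inline bool zero_initialized()
{
    Lock l = {};
    Cond c = {};
    return uz(l) == 0 && uz(c) == 0 && kind(&l) == 1 && kind(&c) == 2;
}
```

<p>Swapping the arguments in a call like <code class="language-plaintext highlighter-rouge">kind(&amp;l)</code> for a function that only
accepts <code class="language-plaintext highlighter-rouge">Cond *</code> would be a compile error, which is the whole point.</p>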

<p>The documentation doesn’t explicitly mention zero initialization, but the
official <code class="language-plaintext highlighter-rouge">*_INIT</code> constants are defined as zero. That locks in zero at the
ABI level, so we can count on it.</p>

<p>All the functions we’ll need are exported by <code class="language-plaintext highlighter-rouge">kernel32.dll</code>. Locks have
two variations on lock/unlock: “exclusive” (write) and “shared” (read).
There are also “try” versions, but I won’t be using them.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define W32(r, p) extern "C" __declspec(dllimport) r __stdcall p noexcept
</span><span class="n">W32</span><span class="p">(</span><span class="kt">void</span><span class="p">,</span> <span class="n">AcquireSRWLockExclusive</span><span class="p">(</span><span class="n">Lock</span> <span class="o">*</span><span class="p">));</span>
<span class="n">W32</span><span class="p">(</span><span class="kt">void</span><span class="p">,</span> <span class="n">AcquireSRWLockShared</span><span class="p">(</span><span class="n">Lock</span> <span class="o">*</span><span class="p">));</span>
<span class="n">W32</span><span class="p">(</span><span class="kt">void</span><span class="p">,</span> <span class="n">ReleaseSRWLockExclusive</span><span class="p">(</span><span class="n">Lock</span> <span class="o">*</span><span class="p">));</span>
<span class="n">W32</span><span class="p">(</span><span class="kt">void</span><span class="p">,</span> <span class="n">ReleaseSRWLockShared</span><span class="p">(</span><span class="n">Lock</span> <span class="o">*</span><span class="p">));</span>
</code></pre></div></div>

<p>Declaring Win32 functions in C++ is a mouthful, and everything must be
written in just the right order, but it’s mostly tucked away in a macro.
Usually there’s a stack discipline to these locks, so an RAII scoped guard
is in order:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Guard</span> <span class="p">{</span>
    <span class="n">Lock</span> <span class="o">*</span><span class="n">l</span><span class="p">;</span>
    <span class="n">Guard</span><span class="p">(</span><span class="n">Lock</span> <span class="o">*</span><span class="n">l</span><span class="p">)</span> <span class="o">:</span> <span class="n">l</span><span class="p">{</span><span class="n">l</span><span class="p">}</span> <span class="p">{</span> <span class="n">AcquireSRWLockExclusive</span><span class="p">(</span><span class="n">l</span><span class="p">);</span> <span class="p">}</span>
    <span class="o">~</span><span class="n">Guard</span><span class="p">()</span>              <span class="p">{</span> <span class="n">ReleaseSRWLockExclusive</span><span class="p">(</span><span class="n">l</span><span class="p">);</span> <span class="p">}</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="nc">RGuard</span> <span class="p">{</span>
    <span class="n">Lock</span> <span class="o">*</span><span class="n">l</span><span class="p">;</span>
    <span class="n">RGuard</span><span class="p">(</span><span class="n">Lock</span> <span class="o">*</span><span class="n">l</span><span class="p">)</span> <span class="o">:</span> <span class="n">l</span><span class="p">{</span><span class="n">l</span><span class="p">}</span> <span class="p">{</span> <span class="n">AcquireSRWLockShared</span><span class="p">(</span><span class="n">l</span><span class="p">);</span> <span class="p">}</span>
    <span class="o">~</span><span class="n">RGuard</span><span class="p">()</span>              <span class="p">{</span> <span class="n">ReleaseSRWLockShared</span><span class="p">(</span><span class="n">l</span><span class="p">);</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Dead simple. (What about <a href="https://en.cppreference.com/w/cpp/language/rule_of_three">rule of three</a>? Instead of working around
this language design flaw, <a href="https://quuxplusone.github.io/blog/2023/05/05/deprecated-copy-with-dtor/">reach into the distant future</a> where
it’s been fixed: <code class="language-plaintext highlighter-rouge">-Werror=deprecated-copy-dtor</code>.) Usage might look like:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Example</span> <span class="p">{</span>
    <span class="n">Lock</span> <span class="n">lock</span> <span class="o">=</span> <span class="p">{};</span>
    <span class="n">i32</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>

<span class="n">i32</span> <span class="n">incr</span><span class="p">(</span><span class="n">Example</span> <span class="o">*</span><span class="n">e</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Guard</span> <span class="n">g</span><span class="p">(</span><span class="o">&amp;</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">lock</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">++</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note the <code class="language-plaintext highlighter-rouge">= {}</code> to guarantee the lock is always ready for use. It gets
more interesting with condition variables in the mix. That’s three more
functions:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">W32</span><span class="p">(</span><span class="n">b32</span><span class="p">,</span>  <span class="n">SleepConditionVariableSRW</span><span class="p">(</span><span class="n">Cond</span> <span class="o">*</span><span class="p">,</span> <span class="n">Lock</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span><span class="p">,</span> <span class="n">b32</span><span class="p">));</span>
<span class="n">W32</span><span class="p">(</span><span class="kt">void</span><span class="p">,</span> <span class="n">WakeAllConditionVariable</span><span class="p">(</span><span class="n">Cond</span> <span class="o">*</span><span class="p">));</span>
<span class="n">W32</span><span class="p">(</span><span class="kt">void</span><span class="p">,</span> <span class="n">WakeConditionVariable</span><span class="p">(</span><span class="n">Cond</span> <span class="o">*</span><span class="p">));</span>
</code></pre></div></div>

<p>The last parameter on <a href="https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-sleepconditionvariablesrw">SleepConditionVariableSRW</a> indicates if the
lock was acquired shared. Why do locks have distinct acquire and release
functions while condition variables use a flag for the same purpose? Beats
me. I’ll unfold it into two functions, selected by type, with a default
infinite wait:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b32</span> <span class="nf">wait</span><span class="p">(</span><span class="n">Cond</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="n">Guard</span> <span class="o">*</span><span class="n">g</span><span class="p">,</span> <span class="n">i32</span> <span class="n">ms</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">SleepConditionVariableSRW</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">g</span><span class="o">-&gt;</span><span class="n">l</span><span class="p">,</span> <span class="n">ms</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">b32</span> <span class="n">wait</span><span class="p">(</span><span class="n">Cond</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="n">RGuard</span> <span class="o">*</span><span class="n">g</span><span class="p">,</span> <span class="n">i32</span> <span class="n">ms</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">SleepConditionVariableSRW</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">g</span><span class="o">-&gt;</span><span class="n">l</span><span class="p">,</span> <span class="n">ms</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Usage might look like:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="n">RGuard</span> <span class="nf">g</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lock</span><span class="p">);</span> <span class="n">remaining</span><span class="p">;)</span> <span class="p">{</span>
    <span class="n">wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">done</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">g</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The other side is nothing more than a rename (but could also be
<a href="/blog/2023/08/27/">accomplished through linking</a>):</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">signal</span><span class="p">(</span><span class="n">Cond</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">WakeConditionVariable</span><span class="p">(</span><span class="n">c</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="n">broadcast</span><span class="p">(</span><span class="n">Cond</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">WakeAllConditionVariable</span><span class="p">(</span><span class="n">c</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And a couple examples of its usage:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">Guard</span> <span class="nf">g</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lock</span><span class="p">);</span> <span class="o">!--</span><span class="n">remaining</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">signal</span><span class="p">(</span><span class="o">&amp;</span><span class="n">done</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// Or:</span>

<span class="n">Guard</span> <span class="n">g</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lock</span><span class="p">);</span>
<span class="n">ready</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">broadcast</span><span class="p">(</span><span class="o">&amp;</span><span class="n">init</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">remaining</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">done</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">g</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
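
<p>One detail worth spelling out is how long a guard declared in an <code class="language-plaintext highlighter-rouge">if</code>
or <code class="language-plaintext highlighter-rouge">for</code> initializer lives. The sketch below is not from the demo: it
swaps the SRW lock for a hypothetical <code class="language-plaintext highlighter-rouge">FakeLock</code> that records events in a
string, so it runs anywhere with C++17 and no threads. The
<code class="language-plaintext highlighter-rouge">acquire</code>/<code class="language-plaintext highlighter-rouge">release</code> names are likewise made up for illustration:</p>

```cpp
#include <cassert>
#include <string>

// Assumption: a fake lock that logs events instead of calling Win32.
struct FakeLock { std::string log; };

inline void acquire(FakeLock *l) { l->log += 'A'; }
inline void release(FakeLock *l) { l->log += 'R'; }

struct Guard {
    FakeLock *l;
    Guard(FakeLock *l) : l{l} { acquire(l); }
    ~Guard()                  { release(l); }
};

// Trace legend: A = acquire, R = release, b = body, e = else, x = after
inline std::string if_trace(int remaining)
{
    FakeLock lock;
    if (Guard g(&lock); !--remaining) {
        lock.log += 'b';   // lock is held here
    } else {
        lock.log += 'e';   // and here, too
    }
    lock.log += 'x';       // guard already destroyed
    return lock.log;
}

inline std::string for_trace(int remaining)
{
    FakeLock lock;
    for (Guard g(&lock); remaining;) {
        lock.log += 'b';   // a single guard spans every iteration
        remaining--;
    }
    lock.log += 'x';
    return lock.log;
}
```

<p>The guard in an <code class="language-plaintext highlighter-rouge">if</code> initializer covers both branches, and the one in a
<code class="language-plaintext highlighter-rouge">for</code> initializer is constructed once for the whole loop. With a real lock,
only the wait inside the loop lets go of it temporarily.</p>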

<p>A satisfying, powerful synchronization interface with hardly any code!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Giving C++ std::regex a C makeover</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/09/04/"/>
    <id>urn:uuid:83fb81ed-290e-4bc7-87bd-d0bbc6c01d25</id>
    <updated>2024-09-04T17:15:07Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re working in C using one of the major toolchains — that is,
it’s mainly a C++ implementation — and you need regular expressions. You
could integrate a library, but there’s a regex implementation in the C++
standard library included with your compiler, just within reach. As a
resourceful engineer, using an asset already in hand seems prudent. But
it’s a C++ interface, and you’re using C instead of C++ for a reason,
perhaps <em>to avoid dealing with C++</em>. Have no worries. This article is
about wrapping <a href="https://en.cppreference.com/w/cpp/regex"><code class="language-plaintext highlighter-rouge">std::regex</code></a> in a tidy C interface which not only
hides all the C++ machinery, but <em>utterly tames it</em>. It’s not so much
practical as a potpourri of interesting techniques.</p>

<p>If you’d like to skip ahead, here’s the full source up front. Tested with
<a href="https://github.com/skeeto/w64devkit">w64devkit</a>, MSVC <code class="language-plaintext highlighter-rouge">cl</code>, and <code class="language-plaintext highlighter-rouge">clang-cl</code>: <strong><a href="https://github.com/skeeto/scratch/tree/master/regex-wrap">scratch/regex-wrap</a></strong></p>

<h3 id="interface-design">Interface design</h3>

<p>The C interface I came up with, <code class="language-plaintext highlighter-rouge">regex.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma once
#include</span> <span class="cpf">&lt;stddef.h&gt;</span><span class="cp">
</span>
<span class="cp">#define S(s) (str){s, sizeof(s)-1}
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">str</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">arena</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">regex</span> <span class="n">regex</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">str</span>      <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">strlist</span><span class="p">;</span>

<span class="n">regex</span>  <span class="o">*</span><span class="nf">regex_new</span><span class="p">(</span><span class="n">str</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>
<span class="n">strlist</span> <span class="nf">regex_match</span><span class="p">(</span><span class="n">regex</span> <span class="o">*</span><span class="p">,</span> <span class="n">str</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Longtime readers will find it familiar: <a href="/blog/2023/10/08/">my favorite</a> non-owning,
counted string form in place of null-terminated strings (similar to C++
<code class="language-plaintext highlighter-rouge">std::string_view</code>) and <a href="/blog/2023/09/27/">arena allocation</a>. Yes, such fundamental
types wouldn’t “belong” to a regex library like this, but imagine they’re
standardized by the project or whatever. Also, this is purely a C header,
not a C/C++ polyglot, and will not be used by the C++ portion.</p>

<p>In particular note the lack of “free” functions. <strong>The regex engine
allocates everything in the arena</strong>, including all temporary working
memory used while compiling, matching, etc. So in a sense, it could be
called <a href="/blog/2018/06/10/">a <em>non-allocating library</em></a>. This requires a bit of C++
abuse: I will not call some C++ regex destructors. It shouldn’t matter
because they only redundantly manage memory in the arena. (If regex
objects are holding file handles or something else unnecessary then the
implementation is so poor as to not be worth using, and we should just use a
better regex library.)</p>

<p>Now’s a good time to mention a caveat: In order to pull this off the regex
library lives in its own Dynamic-Link Library with its own copy of the C++
standard library, i.e. statically linked. My demo is Windows-only, but
this concept theoretically extends to shared objects on Linux. Since it’s
a C interface that doesn’t expose standard library objects, the DLL can be
used by programs compiled with different toolchains. Though that wouldn’t
apply to my inciting hypothetical.</p>

<p>Example usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">regex</span>  <span class="o">*</span><span class="n">re</span> <span class="o">=</span> <span class="n">regex_new</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="s">"(</span><span class="se">\\</span><span class="s">w+)"</span><span class="p">),</span> <span class="n">perm</span><span class="p">);</span>
<span class="n">str</span>     <span class="n">s</span>  <span class="o">=</span> <span class="n">S</span><span class="p">(</span><span class="s">"Hello, world! This is a test."</span><span class="p">);</span>
<span class="n">strlist</span> <span class="n">m</span>  <span class="o">=</span> <span class="n">regex_match</span><span class="p">(</span><span class="n">re</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">m</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%2td = %.*s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">m</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">len</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This program prints:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 0 = Hello
 1 = world
 2 = This
 3 = is
 4 = a
 5 = test
</code></pre></div></div>

<p>If matching lots of source strings, scope the arena to the loop and then
the results, and any regex working memory, are automatically freed in O(1)
at the end of each iteration:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ninputs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">arena</span>   <span class="n">scratch</span> <span class="o">=</span> <span class="o">*</span><span class="n">perm</span><span class="p">;</span>
    <span class="n">strlist</span> <span class="n">matches</span> <span class="o">=</span> <span class="n">regex_match</span><span class="p">(</span><span class="n">re</span><span class="p">,</span> <span class="n">inputs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="c1">// ... consume matches ...</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="c-implementation">C++ implementation</h3>

<p>On the C++ side the first thing I do is replace <code class="language-plaintext highlighter-rouge">new</code> and <code class="language-plaintext highlighter-rouge">delete</code>, which
is how I force it to allocate from the arena. This replaces <code class="language-plaintext highlighter-rouge">new</code>/<code class="language-plaintext highlighter-rouge">delete</code>
<em>globally</em>, but recall that the regex library has its own, private C++
implementation. Replacements apply only to itself even if there’s other
C++ present in the process. If this is the only C++ in the process then it
doesn’t require such careful isolation.</p>

<p>I can’t tell <code class="language-plaintext highlighter-rouge">std::regex</code> about the arena — it calls <code class="language-plaintext highlighter-rouge">operator new</code> the
usual way, without extra arguments — so I have to smuggle it in through a
thread-local variable:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">thread_local</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">;</span>
</code></pre></div></div>

<p>If I’m sure the library is only used by a single thread then I can omit
<code class="language-plaintext highlighter-rouge">thread_local</code>, but it’s useful here to demonstrate and measure. Using it
in my operator replacements:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="k">operator</span> <span class="nf">new</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span> <span class="n">align</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">arena</span>    <span class="o">*</span><span class="n">a</span>     <span class="o">=</span> <span class="n">perm</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">ssize</span> <span class="o">=</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">pad</span>   <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">&amp;</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ssize</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">ssize</span> <span class="o">&gt;</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">-</span> <span class="n">pad</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">bad_alloc</span><span class="p">{};</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-=</span> <span class="n">size</span> <span class="o">+</span> <span class="n">pad</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="o">*</span><span class="k">operator</span> <span class="k">new</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="k">operator</span> <span class="k">new</span><span class="p">(</span>
        <span class="n">size</span><span class="p">,</span>
        <span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="p">(</span><span class="n">__STDCPP_DEFAULT_NEW_ALIGNMENT__</span><span class="p">)</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Starting in C++17, replacing the global allocator requires definitions for
both plain <code class="language-plaintext highlighter-rouge">new</code>/<code class="language-plaintext highlighter-rouge">delete</code> and aligned <code class="language-plaintext highlighter-rouge">new</code>/<code class="language-plaintext highlighter-rouge">delete</code>. The <a href="https://en.cppreference.com/w/cpp/memory/new/operator_new">many other
variants</a>, including arrays, call these four and so may be skipped.
Allocating over-aligned objects isn’t a special case for arenas, so I
implemented plain <code class="language-plaintext highlighter-rouge">new</code> by calling aligned <code class="language-plaintext highlighter-rouge">new</code>. I’d prefer to <a href="/blog/2024/04/14/">allocate
through a template</a> so that I can “see” the type, but that’s not an
option in this case.</p>

<p>After converting to signed sizes <a href="/blog/2024/05/24/">because they’re simpler</a>, it’s the
usual from-the-end allocation. I prefer <code class="language-plaintext highlighter-rouge">-fno-exceptions</code> but <code class="language-plaintext highlighter-rouge">std::regex</code>
is inherently <em>exceptional</em> — and I mean that in at least two bad ways —
so they’re required. The good news is this library gracefully and reliably
handles out-of-memory errors. (The arena makes this trivial to test, so
try it for yourself!)</p>
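
<p>One way to take up that invitation, sketched under some assumptions:
run the same workload against every arena capacity from zero upward and
check that each run either completes or fails cleanly. Here <code class="language-plaintext highlighter-rouge">arena_new</code>
is a hypothetical stand-in for the replaced operator that takes the arena
as a parameter, so the sketch leaves the global allocator alone, and
<code class="language-plaintext highlighter-rouge">work</code> is a made-up workload standing in for regex compilation:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

struct arena {
    char *beg;
    char *end;
};

// Hypothetical stand-in for the replaced operator new: same arithmetic,
// but with the arena passed explicitly.
inline void *arena_new(arena *a, std::size_t size, std::size_t align)
{
    std::ptrdiff_t ssize = size;
    std::ptrdiff_t pad   = (std::uintptr_t)a->end & (align - 1);
    if (ssize < 0 || ssize > a->end - a->beg - pad) {
        throw std::bad_alloc{};
    }
    return a->end -= ssize + pad;
}

// Hypothetical workload: three allocations, 96 bytes in total.
inline bool work(arena a)
{
    try {
        arena_new(&a, 64, 16);
        arena_new(&a, 16, 16);
        arena_new(&a, 16, 16);
        return true;
    } catch (const std::bad_alloc &) {
        return false;   // report failure instead of unwinding into the caller
    }
}

// Try every capacity; return the smallest that succeeds, or -1 if none does.
inline std::ptrdiff_t smallest_passing_capacity()
{
    alignas(16) static char buf[256];
    for (std::ptrdiff_t cap = 0; cap <= (std::ptrdiff_t)sizeof(buf); cap++) {
        arena a = {buf + sizeof(buf) - cap, buf + sizeof(buf)};
        if (work(a)) {
            return cap;
        }
    }
    return -1;
}
```

<p>Every capacity below the threshold exercises a different failure point,
which is the kind of coverage that’s tedious to get from a general purpose
allocator.</p>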

<p>I added a little extra flair replacing <code class="language-plaintext highlighter-rouge">delete</code>:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="k">operator</span> <span class="k">delete</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span> <span class="k">noexcept</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="k">operator</span> <span class="k">delete</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="p">)</span> <span class="k">noexcept</span> <span class="p">{}</span>

<span class="kt">void</span> <span class="k">operator</span> <span class="k">delete</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span> <span class="k">noexcept</span>
<span class="p">{</span>
    <span class="n">arena</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">perm</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">==</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The two mandatory replacements are no-ops because that’s simply how arenas
work. We don’t free individual objects, but many at once. It’s <em>completely
optional</em>, but I also replaced sized <code class="language-plaintext highlighter-rouge">delete</code> for little other reason than
<a href="/blog/2023/12/17/">sized deallocation is cool</a>. C++ destructs in reverse order, so
this is likely to work out. At least with GCC libstdc++, it freed about a
third of the workspace memory before returning to C. I’d rather it didn’t
try to free anything at all, but since it’s going to call <code class="language-plaintext highlighter-rouge">delete</code> anyway
I can get some use out of it.</p>
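
<p>To see why reverse-order destruction lets sized deallocation reclaim memory, here is a minimal sketch in plain C. The <code class="language-plaintext highlighter-rouge">arena</code> layout and the helper names are assumptions for illustration: it allocates from the high end, as the <code class="language-plaintext highlighter-rouge">a-&gt;end</code> comparison above suggests, and it ignores alignment.</p>

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical arena: allocations are carved off the high end, so "end"
// moves down on each allocation.
typedef struct {
    char *beg;
    char *end;
} arena;

// Allocate size bytes from the high end, or return null when full.
// Alignment is omitted to keep the sketch short.
static void *arena_alloc(arena *a, ptrdiff_t size)
{
    assert(size >= 0);
    if (size > a->end - a->beg) {
        return 0;
    }
    a->end -= size;
    return a->end;
}

// Sized free: reclaim only when p is the most recent allocation.
// Returns 1 if memory was reclaimed, 0 if the call was a no-op.
static int arena_free_sized(arena *a, void *p, ptrdiff_t size)
{
    if (a->end == (char *)p) {
        a->end += size;
        return 1;
    }
    return 0;
}
```

<p>Freeing in allocation order reclaims nothing, while freeing in reverse order unwinds the arena completely. Anything freed out of order simply stays allocated until the whole arena is discarded.</p>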

<p>Interesting side note: In a rough benchmark these replacements made MSVC
<code class="language-plaintext highlighter-rouge">std::regex</code> matching four times faster! I expected a <em>small</em> speedup, but
not that. In the typical case it appears to be wasting most of its time on
allocation. On the other hand, libstdc++ <code class="language-plaintext highlighter-rouge">std::regex</code> is overall quite a
bit slower than MSVC, and my replacements had no performance effect. It’s
spending its time elsewhere, and the small gains are lost interacting with
the thread-local.</p>

<p>Finally the meat:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="s">"C"</span> <span class="n">std</span><span class="o">::</span><span class="n">regex</span> <span class="o">*</span><span class="nf">regex_new</span><span class="p">(</span><span class="n">str</span> <span class="n">re</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">perm</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="k">return</span> <span class="k">new</span> <span class="n">std</span><span class="o">::</span><span class="n">regex</span><span class="p">(</span><span class="n">re</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">re</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="n">re</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">catch</span> <span class="p">(...)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="p">{};</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It sets the thread-local to the arena, then constructs with “iterators” at
each end of the input. All exceptions are caught and turned into a null
return. Depending on need, we may want to indicate <em>why</em> it failed — out
of memory, invalid regex, etc. — by returning an error value of some sort.
An exercise for the reader.</p>

<p>The matcher is a little more complicated:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="s">"C"</span> <span class="n">strlist</span> <span class="nf">regex_match</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">regex</span> <span class="o">*</span><span class="n">re</span><span class="p">,</span> <span class="n">str</span> <span class="n">s</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">perm</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cregex_iterator</span> <span class="n">it</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="o">*</span><span class="n">re</span><span class="p">);</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cregex_iterator</span> <span class="n">end</span><span class="p">;</span>

        <span class="n">strlist</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{};</span>
        <span class="n">r</span><span class="p">.</span><span class="n">len</span>  <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">distance</span><span class="p">(</span><span class="n">it</span><span class="p">,</span> <span class="n">end</span><span class="p">);</span>
        <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="k">new</span> <span class="n">str</span><span class="p">[</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">]();</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">it</span> <span class="o">!=</span> <span class="n">end</span><span class="p">;</span> <span class="n">it</span><span class="o">++</span><span class="p">,</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">position</span><span class="p">();</span>
            <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">len</span>  <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">length</span><span class="p">();</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="n">r</span><span class="p">;</span>

    <span class="p">}</span> <span class="k">catch</span> <span class="p">(...)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="p">{};</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I create a <code class="language-plaintext highlighter-rouge">char *</code> “cregex” iterator, again giving it each end of the
input. I hope it’s not just making a copy (MSVC <code class="language-plaintext highlighter-rouge">std::regex</code> does <em>grumble
grumble</em>). The result is allocated out of the arena. As before, exceptions
convert to a null return. Callers can distinguish errors because no-match
results have a non-null pointer. The iterator, being a local variable, is
destroyed before returning, uselessly calling <code class="language-plaintext highlighter-rouge">delete</code>. I could avoid this
by allocating it with <code class="language-plaintext highlighter-rouge">new</code>, but in practice it doesn’t matter.</p>
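
<p>On the C side, telling the three outcomes apart might look like the sketch below. The struct layouts are assumed from how the wrapper uses them (a pointer plus a length), and <code class="language-plaintext highlighter-rouge">match_status</code> and <code class="language-plaintext highlighter-rouge">classify</code> are hypothetical names, not part of the interface.</p>

```c
#include <assert.h>
#include <stddef.h>

// Assumed C views of the wrapper's types: a pointer plus a length.
typedef struct { char *data; ptrdiff_t len; } str;
typedef struct { str  *data; ptrdiff_t len; } strlist;

typedef enum { MATCH_ERROR, MATCH_NONE, MATCH_FOUND } match_status;

// A null pointer signals an error. A non-null pointer with zero length
// means the regex ran but matched nothing.
static match_status classify(strlist r)
{
    assert(r.len >= 0);
    if (!r.data) {
        return MATCH_ERROR;
    }
    return r.len ? MATCH_FOUND : MATCH_NONE;
}
```

<p>This relies on the zero-length <code class="language-plaintext highlighter-rouge">new</code> in the matcher still producing a non-null pointer, as described above.</p>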

<p>You might have noticed the lack of <code class="language-plaintext highlighter-rouge">__declspec(dllexport)</code>. <a href="/blog/2023/08/27/">DEF files are
great</a>, and I’ve come to appreciate and prefer them. GCC and MSVC
accept them as another input on the command line, and the source need not
be aware of exports. My <code class="language-plaintext highlighter-rouge">regex.def</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LIBRARY regex
EXPORTS
regex_new
regex_match
</code></pre></div></div>

<p>In w64devkit, the command to build the DLL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ g++ -shared -std=c++17 -o regex.dll regex.cpp regex.def
</code></pre></div></div>

<p>The MSVC command almost maps 1:1 to the GCC command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /LD /std:c++17 /EHsc regex.cpp regex.def
</code></pre></div></div>

<p>In either case only the C interface is exported (via <a href="/blog/2024/06/30/">peports</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ peports -e regex.dll
EXPORTS
        1       regex_match
        2       regex_new
</code></pre></div></div>

<h3 id="reasons-against">Reasons against</h3>

<p>Though this library is conveniently on hand, and my minimalist C wrapper
interface is nicer than a typical C regex library interface, and even
hides some <code class="language-plaintext highlighter-rouge">std::regex</code> problems, trade-offs must be considered:</p>

<ul>
  <li>No Unicode support, particularly UTF-8</li>
  <li><code class="language-plaintext highlighter-rouge">std::regex</code> implementations are universally poor and slow</li>
  <li>libstdc++ <code class="language-plaintext highlighter-rouge">std::regex</code> is especially slow to compile</li>
  <li>Isolating in a DLL (if needed) is inconvenient</li>
  <li>DLL is 200K (MSVC) to 700K (GCC) or so</li>
</ul>

<p>Depending on what I’m doing, some of these may have me looking elsewhere.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>An improved chkstk function on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/02/05/"/>
    <id>urn:uuid:381be450-559c-4521-911a-ba524dca7b64</id>
    <updated>2024-02-05T17:56:05Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="rant"/>
    <content type="html">
      <![CDATA[<p>If you’ve spent much time developing with Mingw-w64 you’ve likely seen the
symbol <code class="language-plaintext highlighter-rouge">___chkstk_ms</code>, perhaps in an error message. It’s a little piece of
runtime provided by GCC via libgcc which ensures enough of the stack is
committed for the caller’s stack frame. The “function” uses a custom ABI
and is implemented in assembly. So is the subject of this article, a
slightly improved implementation soon to be included in <a href="/blog/2020/05/15/">w64devkit</a> as
libchkstk (<code class="language-plaintext highlighter-rouge">-lchkstk</code>).</p>

<p>The MSVC toolchain has an identical (x64) or similar (x86) function named
<code class="language-plaintext highlighter-rouge">__chkstk</code>. We’ll discuss that as well, and w64devkit will include x86 and
x64 implementations, useful when linking with MSVC object files. The new
x86 <code class="language-plaintext highlighter-rouge">__chkstk</code> in particular is also better than the MSVC definition.</p>

<p>A note on spelling: <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> is spelled with three underscores, and
<code class="language-plaintext highlighter-rouge">__chkstk</code> is spelled with two. On x86, <a href="https://learn.microsoft.com/en-us/cpp/build/reference/decorated-names#FormatC"><code class="language-plaintext highlighter-rouge">cdecl</code> functions</a> are
decorated with a leading underscore, and so may be rendered, e.g. in error
messages, with one fewer underscore. The true name is undecorated, and the
raw symbol name is identical on x86 and x64. Further complicating matters,
libgcc defines a <code class="language-plaintext highlighter-rouge">___chkstk</code> with three underscores. As far as I can tell,
this spelling arose from confusion regarding name decoration, but nobody’s
noticed for the past 28 years. libgcc’s x64 <code class="language-plaintext highlighter-rouge">___chkstk</code> is obviously and
badly broken, so I’m sure nobody has ever used it anyway, not even by
accident thanks to the misspelling. I’ll touch on that below.</p>

<p>When referring to a particular instance, I will use a specific spelling.
Otherwise the term “chkstk” refers to the family. If you’d like to skip
ahead to the source for libchkstk: <strong><a href="https://github.com/skeeto/w64devkit/blob/master/src/libchkstk.S"><code class="language-plaintext highlighter-rouge">libchkstk.S</code></a></strong>.</p>

<h3 id="a-gradually-committed-stack">A gradually committed stack</h3>

<p>The header of a Windows executable lists two stack sizes: a <em>reserve</em> size
and an initial <em>commit</em> size. The first is the largest the main thread
stack can grow, and the second is the amount <a href="https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc">committed</a> when the
program starts. A program gradually commits stack pages <em>as needed</em> up to
the reserve size. Binutils <code class="language-plaintext highlighter-rouge">objdump</code> option <code class="language-plaintext highlighter-rouge">-p</code> lists the sizes. Typical
output for a Mingw-w64 program:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve      0000000000200000
SizeOfStackCommit       0000000000001000
</code></pre></div></div>

<p>The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB
initially committed. With the Binutils linker, <code class="language-plaintext highlighter-rouge">ld</code>, you can set them at
link time using <code class="language-plaintext highlighter-rouge">--stack</code>. Via <code class="language-plaintext highlighter-rouge">gcc</code>, use <code class="language-plaintext highlighter-rouge">-Xlinker</code>. For example, to
reserve an 8MiB stack and commit half of it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Xlinker --stack=$((8&lt;&lt;20)),$((4&lt;&lt;20)) ...
</code></pre></div></div>

<p>MSVC <code class="language-plaintext highlighter-rouge">link.exe</code> similarly has <a href="https://learn.microsoft.com/en-us/cpp/build/reference/stack-stack-allocations"><code class="language-plaintext highlighter-rouge">/stack</code></a>.</p>
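
<p>The two fields listed by <code class="language-plaintext highlighter-rouge">objdump</code> live in the PE optional header, so they can also be read straight out of the image. A rough sketch, assuming the whole file is already in memory; the function and type names are made up for illustration, and the offsets are those of the published PE layout:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t reserve;
    uint64_t commit;
    int      ok;
} stack_sizes;

// Read an n-byte little endian integer.
static uint64_t load_le(const unsigned char *p, int n)
{
    uint64_t v = 0;
    for (int i = n - 1; i >= 0; i--) {
        v = v<<8 | p[i];
    }
    return v;
}

// Extract SizeOfStackReserve and SizeOfStackCommit from a PE image.
static stack_sizes pe_stack_sizes(const unsigned char *img, size_t len)
{
    stack_sizes r = {0, 0, 0};
    if (len < 0x40) {
        return r;
    }
    size_t pe  = (size_t)load_le(img + 0x3c, 4);  // e_lfanew
    size_t opt = pe + 4 + 20;  // skip signature and COFF header
    if (pe > len || len - pe < 24 + 88) {
        return r;  // optional header fields out of bounds
    }
    if (load_le(img + pe, 4) != 0x00004550) {  // "PE\0\0"
        return r;
    }
    switch (load_le(img + opt, 2)) {
    case 0x10b:  // PE32: 4-byte fields
        r.reserve = load_le(img + opt + 72, 4);
        r.commit  = load_le(img + opt + 76, 4);
        break;
    case 0x20b:  // PE32+: 8-byte fields
        r.reserve = load_le(img + opt + 72, 8);
        r.commit  = load_le(img + opt + 80, 8);
        break;
    default:
        return r;
    }
    r.ok = 1;
    return r;
}
```

<p>For the Mingw-w64 program above this would report a reserve of <code class="language-plaintext highlighter-rouge">0x200000</code> and a commit of <code class="language-plaintext highlighter-rouge">0x1000</code>.</p>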

<p>The purpose of this mechanism is to avoid paying the <em>commit charge</em> for
unused stack. It made sense 30 years ago when stacks were a potentially
large portion of physical memory. These days it’s a rounding error and
silly we’re still dealing with it. Using the above options you can choose
to commit the entire stack up front, at which point a chkstk helper is no
longer needed (<a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59532"><code class="language-plaintext highlighter-rouge">-mno-stack-arg-probe</code></a>, <a href="https://learn.microsoft.com/en-us/cpp/build/reference/gs-control-stack-checking-calls"><code class="language-plaintext highlighter-rouge">/Gs2147483647</code></a>). This
requires link-time control of the main module, which isn’t always an
option, like when supplying a DLL for someone else to run.</p>

<p>The program grows the stack by touching the singular <a href="https://devblogs.microsoft.com/oldnewthing/20220203-00/?p=106215">guard page</a>
mapped between the committed and uncommitted portions of the stack. This
action triggers a page fault, and the default fault handler commits the
guard page and maps a new guard page just below. In other words, the stack
grows one page at a time, in order.</p>
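
<p>A toy model of that behavior, written as ordinary C rather than anything that touches a real stack. The struct and function are hypothetical, and the real fault handler is more involved, but the rule is the same: touching the guard page moves it down one page, and touching anything below it is an access violation.</p>

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE = 4096 };

// Toy thread stack: pages at or above guard+PAGE are committed, the
// page at guard is the guard page, and everything below is reserved.
typedef struct {
    uintptr_t low;    // lowest reserved address
    uintptr_t guard;  // low address of the current guard page
    uintptr_t high;   // one past the highest stack address
} stack_model;

// Returns 1 if the access succeeds, 0 if it would crash the program.
static int touch(stack_model *s, uintptr_t addr)
{
    assert(s->low <= s->guard && s->guard < s->high);
    if (addr >= s->high || addr < s->guard) {
        return 0;  // outside the stack, or leaped over the guard page
    }
    if (addr < s->guard + PAGE) {
        if (s->guard == s->low) {
            return 0;  // no room for a new guard page: stack overflow
        }
        s->guard -= PAGE;  // commit this page, new guard just below
    }
    return 1;
}
```

<p>Walking down a page at a time always succeeds until the reserve runs out. Skipping even one page ahead of the guard fails immediately.</p>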

<p>In most cases nothing special needs to happen. The guard page mechanism is
transparent and in the background. However, if a function stack frame
exceeds the page size then there’s a chance that it might leap over the
guard page, crashing the program. To prevent this, compilers insert a
chkstk call in the function prologue. Before local variable allocation,
chkstk walks down the stack — that is, towards lower addresses — nudging
the guard page with each step. (As a side effect it provides <a href="/blog/2017/06/21/">stack clash
protection</a> — the only security aspect of chkstk.) For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">callee</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">large</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">];</span>
    <span class="n">callee</span><span class="p">(</span><span class="n">large</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled with 64-bit <code class="language-plaintext highlighter-rouge">gcc -O</code>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">example:</span>
    <span class="nf">movl</span>    <span class="kc">$</span><span class="mi">1048616</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">call</span>    <span class="nv">___chkstk_ms</span>
    <span class="nf">subq</span>    <span class="o">%</span><span class="nb">rax</span><span class="p">,</span> <span class="o">%</span><span class="nb">rsp</span>
    <span class="nf">leaq</span>    <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rcx</span>
    <span class="nf">call</span>    <span class="nv">callee</span>
    <span class="nf">addq</span>    <span class="kc">$</span><span class="mi">1048616</span><span class="p">,</span> <span class="o">%</span><span class="nb">rsp</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>I used GCC, but this is practically identical to the code generated by
MSVC and Clang. Note the call to <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> in the function prologue
before allocating the stack frame (<code class="language-plaintext highlighter-rouge">subq</code>). Also note that it sets <code class="language-plaintext highlighter-rouge">eax</code>.
As a volatile register, this would normally accomplish nothing because
it’s done just before a function call, but recall that <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> has
a custom ABI. That’s the argument to chkstk. Further note that it uses
<code class="language-plaintext highlighter-rouge">rax</code> after the call returns. That’s not a value returned by chkstk.
Rather, x64 <em>chkstk preserves all registers</em>, so the frame size is still there.</p>

<p>Well, maybe. The official documentation says that registers <a href="https://learn.microsoft.com/en-us/cpp/build/prolog-and-epilog">r10 and r11
are volatile</a>, but that information conflicts with Microsoft’s own
implementation. Just in case, I choose a conservative interpretation that
all registers are preserved.</p>

<h3 id="implementing-chkstk">Implementing chkstk</h3>

<p>In a high level language, chkstk might look something like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NOTE: hypothetical implementation</span>
<span class="kt">void</span> <span class="nf">___chkstk_ms</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">frame_size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">volatile</span> <span class="kt">char</span> <span class="n">frame</span><span class="p">[</span><span class="n">frame_size</span><span class="p">];</span>  <span class="c1">// NOTE: variable-length array</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">frame_size</span> <span class="o">-</span> <span class="n">PAGE_SIZE</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">-=</span> <span class="n">PAGE_SIZE</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">frame</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// touch the guard page</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This wouldn’t work for a number of reasons, but if it did, <code class="language-plaintext highlighter-rouge">volatile</code>
would serve two purposes. The first is forcing the side effect to occur. The
second is more subtle: The loop must happen in exactly this order, from
high to low. Without <code class="language-plaintext highlighter-rouge">volatile</code> nothing would order one store relative to
another, and so a compiler could
reverse the loop direction.</p>

<p>The store can happen anywhere within the guard page, so it’s not necessary
to align <code class="language-plaintext highlighter-rouge">frame</code> to the page. Simply touching at least one byte per page
is enough. This is essentially the definition of libgcc <code class="language-plaintext highlighter-rouge">___chkstk_ms</code>.</p>
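
<p>A runnable analog of that loop, operating on an ordinary buffer instead of the stack. The <code class="language-plaintext highlighter-rouge">probe</code> name and the returned count are additions for illustration; the loop itself mirrors the hypothetical implementation above.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { PAGE = 4096 };

// Store one byte per page of buf, walking from high to low, and return
// the number of stores. The volatile qualifier keeps the stores, and
// their order, from being optimized away.
static ptrdiff_t probe(volatile char *buf, ptrdiff_t size)
{
    assert(size >= 0);
    ptrdiff_t count = 0;
    for (ptrdiff_t i = size - PAGE; i >= 0; i -= PAGE) {
        buf[i] = 0;
        count++;
    }
    return count;
}
```

<p>With a 1MiB buffer and 4KiB pages the loop runs 256 times, and it runs that many times on every call regardless of whether the pages were already touched.</p>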

<p>How many iterations occur? In <code class="language-plaintext highlighter-rouge">example</code> above, the stack frame will be
around 1MiB (2<sup>20</sup>). With pages of 4KiB (2<sup>12</sup>) that’s
256 iterations. The loop happens unconditionally, meaning <em>every function
call</em> requires 256 iterations of this loop. Wouldn’t it be better if the
loop ran only as needed, i.e. the first time? MSVC x64 <code class="language-plaintext highlighter-rouge">__chkstk</code> skips
iterations if possible, and the same goes for my new <code class="language-plaintext highlighter-rouge">___chkstk_ms</code>. Much
like <a href="/blog/2022/02/18/#my-getcommandlinew">the command line string</a>, the low address of the current
thread’s guard page is accessible through the <a href="https://en.wikipedia.org/wiki/Win32_Thread_Information_Block">Thread Information
Block</a> (TIB). A chkstk can cheaply query this address, looping only
when a frame reaches below the committed region. (<a href="/blog/2023/03/23/">In contrast to Linux</a>, a thread’s
stack is fundamentally managed by the operating system.)</p>

<p>Taking that into account, an improved algorithm:</p>

<ol>
  <li>Push registers that will be used</li>
  <li>Compute the low address of the new stack frame (F)</li>
  <li>Retrieve the low address of the committed stack (C)</li>
  <li>Go to 7</li>
  <li>Subtract the page size from C</li>
  <li>Touch memory at C</li>
  <li>If C &gt; F, go to 5</li>
  <li>Pop registers to restore them and return</li>
</ol>
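
<p>The same steps as a C model, for readers who would rather not trace jumps. It is a simulation under assumptions: <code class="language-plaintext highlighter-rouge">tib_model</code> stands in for the TIB field, register saving (steps 1 and 8) has no equivalent, and “touching” a page is modeled by lowering the committed address, which is what the fault handler would do.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE = 4096 };

// Stand-in for the TIB field holding the low address of the committed
// stack.
typedef struct {
    uintptr_t committed_low;
} tib_model;

// Follows steps 2-7 and returns the number of pages touched. On a real
// system each touch is a page fault.
static int chkstk_model(tib_model *tib, uintptr_t sp, size_t frame_size)
{
    assert(frame_size <= sp);          // overflow is out of scope here
    uintptr_t f = sp - frame_size;     // 2. low address of the frame
    uintptr_t c = tib->committed_low;  // 3. low address of committed stack
    int touched = 0;
    while (c > f) {                    // 7.
        c -= PAGE;                     // 5.
        touched++;                     // 6. touch memory at c
        tib->committed_low = c;        // fault handler commits the page
    }
    return touched;
}
```

<p>The first call with a 1MiB frame touches 256 pages. A second call with the same frame touches none, which is the entire point.</p>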

<p>An unconditional forward jump is a little unusual in pseudo-code, but
this closely matches my assembly. The loop causes page faults, and it’s
the slow, uncommon path. The common, fast path never executes 5–6. I
also chose smaller instructions in order to keep the function small and
reduce instruction cache pressure. My x64 implementation as of this
writing:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">___chkstk_ms:</span>
    <span class="nf">push</span> <span class="o">%</span><span class="nb">rax</span>              <span class="o">//</span> <span class="mi">1</span><span class="nv">.</span>
    <span class="nf">push</span> <span class="o">%</span><span class="nb">rcx</span>              <span class="o">//</span> <span class="mi">1</span><span class="nv">.</span>
    <span class="nf">neg</span>  <span class="o">%</span><span class="nb">rax</span>              <span class="o">//</span> <span class="mi">2</span><span class="nv">.</span> <span class="nb">rax</span> <span class="err">=</span> <span class="nv">frame</span> <span class="nv">low</span> <span class="nv">address</span>
    <span class="nf">add</span>  <span class="o">%</span><span class="nb">rsp</span><span class="p">,</span> <span class="o">%</span><span class="nb">rax</span>        <span class="o">//</span> <span class="mi">2</span><span class="nv">.</span> <span class="err">"</span>
    <span class="nf">mov</span>  <span class="o">%</span><span class="nb">gs</span><span class="p">:(</span><span class="mh">0x10</span><span class="p">),</span> <span class="o">%</span><span class="nb">rcx</span>  <span class="o">//</span> <span class="mi">3</span><span class="nv">.</span> <span class="nb">rcx</span> <span class="err">=</span> <span class="nv">stack</span> <span class="nv">low</span> <span class="nv">address</span>
    <span class="nf">jmp</span>  <span class="mi">1</span><span class="nv">f</span>                <span class="o">//</span> <span class="mi">4</span><span class="nv">.</span>
<span class="err">0:</span>  <span class="nf">sub</span>  <span class="kc">$</span><span class="mh">0x1000</span><span class="p">,</span> <span class="o">%</span><span class="nb">rcx</span>     <span class="o">//</span> <span class="mi">5</span><span class="nv">.</span>
    <span class="nf">test</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>      <span class="o">//</span> <span class="mi">6</span><span class="nv">.</span> <span class="nv">page</span> <span class="nv">fault</span> <span class="p">(</span><span class="nv">very</span> <span class="nv">slow</span><span class="err">!</span><span class="p">)</span>
<span class="err">1:</span>  <span class="nf">cmp</span>  <span class="o">%</span><span class="nb">rax</span><span class="p">,</span> <span class="o">%</span><span class="nb">rcx</span>        <span class="o">//</span> <span class="mi">7</span><span class="nv">.</span>
    <span class="nf">ja</span>   <span class="mb">0b</span>                <span class="o">//</span> <span class="mi">7</span><span class="nv">.</span>
    <span class="nf">pop</span>  <span class="o">%</span><span class="nb">rcx</span>              <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span>
    <span class="nf">pop</span>  <span class="o">%</span><span class="nb">rax</span>              <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span>
    <span class="nf">ret</span>                    <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span>
</code></pre></div></div>

<p>I’ve labeled each instruction with its corresponding pseudo-code. Step 6
is unusual among chkstk implementations: It’s not a <em>store</em>, but a <em>load</em>,
still sufficient to fault the page. That <code class="language-plaintext highlighter-rouge">test</code> instruction is just two
bytes, and unlike other two-byte options, doesn’t write garbage onto the
stack — which <em>would</em> be allowed — nor use an extra register. I searched
through single byte instructions that can page fault, all of which involve
implicit addressing through <code class="language-plaintext highlighter-rouge">rdi</code> or <code class="language-plaintext highlighter-rouge">rsi</code>, but they increment <code class="language-plaintext highlighter-rouge">rdi</code> or
<code class="language-plaintext highlighter-rouge">rsi</code>, and would require another instruction to correct it.</p>

<p>Because of the return address and two <code class="language-plaintext highlighter-rouge">push</code> operations, the low stack
frame address is technically <em>too low</em> by 24 bytes. That’s fine. If this
exhausts the stack, the program is really cutting it close and the stack
is too small anyway. I could be more precise — which, as we’ll soon see,
is required for x86 <code class="language-plaintext highlighter-rouge">__chkstk</code> — but it would cost an extra instruction
byte.</p>

<p>On x64, <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> and <code class="language-plaintext highlighter-rouge">__chkstk</code> have identical semantics, so name it
<code class="language-plaintext highlighter-rouge">__chkstk</code> — which I’ve done in libchkstk — and it works with MSVC. The
only practical difference between my chkstk and MSVC <code class="language-plaintext highlighter-rouge">__chkstk</code> is that
mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking
the optimization, is libgcc <code class="language-plaintext highlighter-rouge">___chkstk_ms</code>, weighing 50 bytes, or in
practice, due to an unfortunate Binutils default of padding sections, 64
bytes.</p>

<p>I’m no assembly guru, and I bet this can be even smaller without hurting
the fast path, but this is the best I could come up with at this time.</p>

<p><strong>Update</strong>: Stefan Kanthak, who has <a href="https://skanthak.homepage.t-online.de/msvcrt.html">extensively explored this
topic</a>, points out that large stack frame requests might overflow
my low frame address calculation at (2), effectively disabling the probe.
Such requests might occur from alloca calls or variable-length arrays
(VLAs) with untrusted sizes. As far as I’m concerned, such programs are
already broken, but it only cost a two-byte instruction to deal with it. I
have not changed this article, but the source in w64devkit <a href="https://github.com/skeeto/w64devkit/commit/50b343db">has been
updated</a>.</p>
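
<p>The failure mode is easy to reproduce with unsigned arithmetic in C. The sketch below is only an illustration of the wraparound. The saturating variant is one possible remedy and is an assumption, not the fix applied in w64devkit.</p>

```c
#include <assert.h>
#include <stdint.h>

// Naive low frame address: wraps around when frame_size exceeds sp,
// producing an address above the stack pointer.
static uintptr_t frame_low_naive(uintptr_t sp, uintptr_t frame_size)
{
    return sp - frame_size;
}

// Saturating variant: an impossible request clamps to address zero so
// that the comparison below still sends it into the probe loop.
static uintptr_t frame_low_checked(uintptr_t sp, uintptr_t frame_size)
{
    return frame_size > sp ? 0 : sp - frame_size;
}

// The probe loop only runs while the committed low address is above the
// frame's low address.
static int would_probe(uintptr_t committed_low, uintptr_t frame_low)
{
    return committed_low > frame_low;
}
```

<p>With the naive calculation an enormous request skips the loop entirely and returns as though the frame were already committed. With the clamped address the loop walks down until it runs off the reserved stack and faults, which is the desired outcome for a request that can never be satisfied.</p>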

<h3 id="32-bit-chkstk">32-bit chkstk</h3>

<p>On x86 <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> has identical semantics to x64. Mine is a copy-paste
of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC
was ahead of the curve on this design.</p>

<p>However, x86 <code class="language-plaintext highlighter-rouge">__chkstk</code> is <em>bonkers</em>. It not only commits the stack, but
also allocates the stack frame. That is, it returns with a different stack
pointer. The return pointer is initially <em>inside the new stack frame</em>, so
chkstk must retrieve it and return by other means. It must also precisely
compute the low frame address.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">__chkstk:</span>
    <span class="nf">push</span> <span class="o">%</span><span class="nb">ecx</span>               <span class="o">//</span> <span class="mi">1</span><span class="nv">.</span>
    <span class="nf">neg</span>  <span class="o">%</span><span class="nb">eax</span>               <span class="o">//</span> <span class="mi">2</span><span class="nv">.</span>
    <span class="nf">lea</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">esp</span><span class="p">,</span><span class="o">%</span><span class="nb">eax</span><span class="p">),</span> <span class="o">%</span><span class="nb">eax</span> <span class="o">//</span> <span class="mi">2</span><span class="nv">.</span>
    <span class="nf">mov</span>  <span class="o">%</span><span class="nb">fs</span><span class="p">:(</span><span class="mh">0x08</span><span class="p">),</span> <span class="o">%</span><span class="nb">ecx</span>   <span class="o">//</span> <span class="mi">3</span><span class="nv">.</span>
    <span class="nf">jmp</span>  <span class="mi">1</span><span class="nv">f</span>                 <span class="o">//</span> <span class="mi">4</span><span class="nv">.</span>
<span class="err">0:</span>  <span class="nf">sub</span>  <span class="kc">$</span><span class="mh">0x1000</span><span class="p">,</span> <span class="o">%</span><span class="nb">ecx</span>      <span class="o">//</span> <span class="mi">5</span><span class="nv">.</span>
    <span class="nf">test</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="p">(</span><span class="o">%</span><span class="nb">ecx</span><span class="p">)</span>       <span class="o">//</span> <span class="mi">6</span><span class="nv">.</span> <span class="nv">page</span> <span class="nv">fault</span> <span class="p">(</span><span class="nv">very</span> <span class="nv">slow</span><span class="err">!</span><span class="p">)</span>
<span class="err">1:</span>  <span class="nf">cmp</span>  <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="o">%</span><span class="nb">ecx</span>         <span class="o">//</span> <span class="mi">7</span><span class="nv">.</span>
    <span class="nf">ja</span>   <span class="mb">0b</span>                 <span class="o">//</span> <span class="mi">7</span><span class="nv">.</span>
    <span class="nf">pop</span>  <span class="o">%</span><span class="nb">ecx</span>               <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span>
    <span class="nf">xchg</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="o">%</span><span class="nb">esp</span>         <span class="o">//</span> <span class="nv">?.</span> <span class="nb">al</span><span class="nv">locate</span> <span class="nv">frame</span>
    <span class="nf">jmp</span>  <span class="o">*</span><span class="p">(</span><span class="o">%</span><span class="nb">eax</span><span class="p">)</span>            <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span> <span class="nv">return</span>
</code></pre></div></div>

<p>The main differences are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">eax</code> is treated as volatile, so it is not saved</li>
  <li>The low frame address is precisely computed with <code class="language-plaintext highlighter-rouge">lea</code> (2)</li>
  <li>The frame is allocated at step (?) by swapping F and the stack pointer</li>
  <li>Post-swap F now points at the return address, so jump through it</li>
</ul>

<p>MSVC x86 <code class="language-plaintext highlighter-rouge">__chkstk</code> does not query the TIB (3), and so unconditionally
runs the loop. So there’s an advantage to my implementation besides size.</p>

<p>libgcc x86 <code class="language-plaintext highlighter-rouge">___chkstk</code> has this behavior, and so it’s also a suitable
<code class="language-plaintext highlighter-rouge">__chkstk</code> aside from the misspelling. Strangely, libgcc x64 <code class="language-plaintext highlighter-rouge">___chkstk</code>
<em>also</em> allocates the stack frame, which is never how chkstk was supposed
to work on x64. I can only conclude it’s never been used.</p>

<h3 id="optimization-in-practice">Optimization in practice</h3>

<p>Does the skip-the-loop optimization matter in practice? Consider a
function using a large-ish, stack-allocated array, perhaps to process
<a href="/blog/2023/08/23/">environment variables</a> or <a href="https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation">long paths</a>, each of which max out
around 64KiB.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">path_contains</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">wchar_t</span> <span class="n">var</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">15</span><span class="p">];</span>
    <span class="n">GetEnvironmentVariableW</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">var</span><span class="p">,</span> <span class="n">countof</span><span class="p">(</span><span class="n">var</span><span class="p">));</span>
    <span class="c1">// ... search for path in var ...</span>
<span class="p">}</span>

<span class="kt">int64_t</span> <span class="nf">getfilesize</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">wchar_t</span> <span class="n">wide</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">15</span><span class="p">];</span>
    <span class="n">MultiByteToWideChar</span><span class="p">(</span><span class="n">CP_UTF8</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">wide</span><span class="p">,</span> <span class="n">countof</span><span class="p">(</span><span class="n">wide</span><span class="p">));</span>
    <span class="c1">// ... look up file size via wide path ...</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">path_contains</span><span class="p">(</span><span class="s">L"PATH"</span><span class="p">,</span> <span class="s">L"c:</span><span class="se">\\</span><span class="s">windows</span><span class="se">\\</span><span class="s">system32"</span><span class="p">))</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>

    <span class="kt">int64_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">getfilesize</span><span class="p">(</span><span class="s">"π.txt"</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Each call to these functions with such large local arrays is also a call
to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely
detectable in a benchmark. If the function touches the file system, which
is likely when processing paths, then chkstk doesn’t matter at all. My
starting example had a 1MiB array, or 256 chkstk iterations. That starts
to become measurable, though it’s also pushing the limits. At that point
you <a href="/blog/2023/09/27/">ought to be using a scratch arena</a>.</p>
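
<p>The iteration counts above follow from 4KiB pages. As a rough C model of the probe loop (steps 5 to 7 in the listing), here is a sketch using a hypothetical <code class="language-plaintext highlighter-rouge">chkstk_iterations</code> helper. It is an illustration, not part of libchkstk: it takes the stack limit and low frame address as plain arguments instead of reading the TIB, and it assumes the stack pointer sits right at the limit.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHKSTK_PAGE ((uintptr_t)0x1000)

// Hypothetical model, not the real chkstk: counts how many pages the probe
// loop would touch to lower the stack limit down to the low frame address.
size_t chkstk_iterations(uintptr_t limit, uintptr_t low)
{
    assert(limit % CHKSTK_PAGE == 0);  // the stack limit is page-aligned
    size_t n = 0;
    while (limit > low) {      // 7. is the frame still below the limit?
        limit -= CHKSTK_PAGE;  // 5. step down one page
        n++;                   // 6. the real loop touches the page here
    }
    return n;
}
```

<p>With a limit of <code class="language-plaintext highlighter-rouge">0x200000</code>, a 64KiB frame takes 16 iterations and a 1MiB frame takes 256. A frame that lies entirely above the limit takes zero, which is the skip-the-loop case.</p>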

<p>So ultimately after writing an improved <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> I could only
measure a tiny difference in contrived programs, and none in any real
application. Though there’s still one more benefit I haven’t yet
mentioned…</p>

<h3 id="the-first-thing-we-do-lets-kill-all-the-lawyers">“The first thing we do, let’s <a href="/blog/2023/06/22/#119-henry-vi">kill all the lawyers</a>”.</h3>

<p>My original motivation for this project wasn’t the optimization — which I
didn’t even discover until after I had started — but <em>licensing</em>. I hate
software licenses, and the <a href="/blog/2023/01/18/">tools I’ve written for w64devkit</a>
are dedicated to the public domain. Both source <em>and</em> binaries (as
distributed). I can do so because <a href="/blog/2023/02/15/">I don’t link runtime components</a>,
not even libgcc. Not <a href="/blog/2023/05/31/">even header files</a>. Every byte of code in those
binaries is my work or the work of my collaborators.</p>

<p>Every once in a while <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> rears its ugly head, and I have to
make a decision. Do I re-work my code to avoid it? Do I take the reins of
the linker and disable stack probes? I haven’t necessarily allocated a
large local array: A bit of luck with function inlining can combine
several smaller stack frames into one that’s just large enough to require
chkstk.</p>

<p>Since libgcc falls under the <a href="https://www.gnu.org/licenses/gcc-exception-3.1.html">GCC Runtime Library Exception</a>, if it’s
linked into my program through an “Eligible Compilation Process” — which I
believe includes w64devkit — then the GPL-licensed functions embedded in
my binary are legally siloed and the GPL doesn’t infect the rest of the
program. These bits are still GPL in isolation, and if someone were to
copy them out of the program then they’d be normal GPL code again. In
other words, it’s not a 100% public domain binary if libgcc was linked!</p>

<p>(If some FSF lawyer says I’m wrong, then this is an escape hatch through
which anyone can scrub the GPL from GCC runtime code, and then ignore the
runtime exception entirely.)</p>

<p>MSVC is worse. Hardly anyone follows its license, but fortunately for most
the license is practically unenforced. Its chkstk, which currently resides
in a loose <code class="language-plaintext highlighter-rouge">chkstk.obj</code>, falls into what Microsoft calls “Distributable
Code.” Its license requires “external end users to agree to terms that
protect the Distributable Code.” In other words, if you compile a program
with MSVC, you’re required to have a EULA including the relevant terms
from the Visual Studio license. You’re not legally permitted to distribute
software in the manner of w64devkit — no installer, just a portable zip
distribution — if that software has been built with MSVC.  At least not
without special care, which nobody takes. (Don’t worry, I won’t tell.)</p>

<h3 id="how-to-use-libchkstk">How to use libchkstk</h3>

<p>To avoid libgcc entirely you need <code class="language-plaintext highlighter-rouge">-nostdlib</code>. Otherwise it’s implicitly
offered to the linker, and you’d need to manually check if it picked up
code from libgcc. If <code class="language-plaintext highlighter-rouge">ld</code> complains about a missing chkstk, use <code class="language-plaintext highlighter-rouge">-lchkstk</code>
to get a definition. If you use <code class="language-plaintext highlighter-rouge">-lchkstk</code> when it’s not needed, nothing
happens, so it’s safe to always include.</p>

<p>I also recently added <a href="https://github.com/skeeto/w64devkit/blob/master/src/libmemory.c">a libmemory</a> to w64devkit, providing tiny,
public domain definitions of <code class="language-plaintext highlighter-rouge">memset</code>, <code class="language-plaintext highlighter-rouge">memcpy</code>, <code class="language-plaintext highlighter-rouge">memmove</code>, <code class="language-plaintext highlighter-rouge">memcmp</code>, and
<code class="language-plaintext highlighter-rouge">strlen</code>. All compilers fabricate calls to these five functions even if
you don’t call them yourself, which is how they were selected. (Not
because I like them. <a href="/blog/2023/02/11/">I really don’t.</a>) If a <code class="language-plaintext highlighter-rouge">-nostdlib</code> build
complains about these, too, then add <code class="language-plaintext highlighter-rouge">-lmemory</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -nostdlib ... -lchkstk -lmemory
</code></pre></div></div>

<p>In MSVC the equivalent option is <code class="language-plaintext highlighter-rouge">/nodefaultlib</code>, after which you may see
missing chkstk errors, and perhaps more. <code class="language-plaintext highlighter-rouge">libchkstk.a</code> is compatible with
MSVC, and <code class="language-plaintext highlighter-rouge">link.exe</code> doesn’t care that the extension is <code class="language-plaintext highlighter-rouge">.a</code> rather than
<code class="language-plaintext highlighter-rouge">.lib</code>, so supply it at link time. Same goes for <code class="language-plaintext highlighter-rouge">libmemory.a</code> if you need
any of those, too.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl ... /link /nodefaultlib libchkstk.a libmemory.a
</code></pre></div></div>

<p>While I despise licenses, I still take them seriously in the software I
distribute. With libchkstk I have another tool to get it under control.</p>

<hr />

<p>Big thanks to Felipe Garcia for reviewing and correcting mistakes in this
article before it was published!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>How to link identical function names from different DLLs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/08/27/"/>
    <id>urn:uuid:265f121d-9418-4eb6-929f-a125264d0f2a</id>
    <updated>2023-08-27T01:46:31Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>For the typical DLL function call you declare the function prototype (via
header file), you inform the link editor (<code class="language-plaintext highlighter-rouge">ld</code>, <code class="language-plaintext highlighter-rouge">link</code>) that the DLL
exports a symbol with that name (import library), it matches the declared
name with this export, and it becomes an import in your program’s import
table. What happens when two different DLLs export the same symbol? The
link editor will pick the first found. But what if you want to use <em>both</em>
exports? If they have the same name, how could the program or link editor
distinguish them? In this article I’ll demonstrate a technique to resolve
this by creating a program which links with and directly uses two
different C runtimes (CRTs) simultaneously.</p>

<p>In <a href="https://learn.microsoft.com/en-us/windows/win32/debug/pe-format">PE executable images</a>, an import isn’t just a symbol, but a tuple
of DLL name and symbol. For human display, a tuple is typically formatted
with an exclamation point delimiter, as in <code class="language-plaintext highlighter-rouge">msvcrt.dll!malloc</code>, though
sometimes without the <code class="language-plaintext highlighter-rouge">.dll</code> suffix. You’ve likely seen this in stack
traces. Because it’s a tuple and not just a symbol, it’s possible to refer
to, and import, the same symbol from different DLLs. Contrast that with
ELF, which has a list of shared objects, and a separate list of symbols,
with the dynamic linker pairing them up at load time. That permits cool
tricks like <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code>, but for the same reason loading is less
predictable.</p>

<p>Windows comes with several CRTs, and various libraries and applications
use one or another (<a href="/blog/2023/02/15/">or none</a>) depending on how they were built. As
C standard library implementations they export mostly the same symbols,
<code class="language-plaintext highlighter-rouge">malloc</code>, <code class="language-plaintext highlighter-rouge">printf</code>, etc. With imports as tuples, it’s not so unusual for
an application to load multiple CRTs at once. Typically coexistence is
transitive. That is, a module does not directly access both CRTs but
depends on modules that use different CRTs. One module calls, say,
<code class="language-plaintext highlighter-rouge">msvcrt.dll!malloc</code>, and another module calls <code class="language-plaintext highlighter-rouge">ucrtbase.dll!malloc</code>. With
DLL-qualified symbols, this is sound so long as modules don’t cross the
streams, e.g. an allocation in one module must not be freed in the other.
Libraries in this ecosystem must avoid exposing their CRT through their
interfaces, such as expecting the library’s caller to <code class="language-plaintext highlighter-rouge">free()</code> objects:
The caller might not have access to the right <code class="language-plaintext highlighter-rouge">free</code>!</p>
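
<p>One way to follow that rule is to pair every allocating function with a matching release function exported by the same library. A minimal sketch, where <code class="language-plaintext highlighter-rouge">lib_strdup</code> and <code class="language-plaintext highlighter-rouge">lib_free</code> are hypothetical names chosen for illustration:</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical library interface: the library both allocates and frees, so
// the caller never needs to know which CRT's malloc sits behind it.
char *lib_strdup(const char *s)
{
    assert(s);
    size_t size = strlen(s) + 1;  // include the null terminator
    char *copy = malloc(size);
    if (copy) {
        memcpy(copy, s, size);
    }
    return copy;
}

// Releases a string returned by lib_strdup. Because this is compiled into
// the library, free resolves against the same CRT as the malloc above.
void lib_free(char *s)
{
    free(s);
}
```

<p>The caller hands the pointer back through <code class="language-plaintext highlighter-rouge">lib_free</code> rather than calling its own <code class="language-plaintext highlighter-rouge">free</code>, so it does not matter which CRT the caller links.</p>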

<p>Contrast again with the unix ecosystem generally, where a process can only
load one libc and everyone is expected to share. Libraries commonly expect
callers to <code class="language-plaintext highlighter-rouge">free()</code> their objects (e.g. <a href="https://tiswww.case.edu/php/chet/readline/readline.html#Basic-Behavior">libreadline</a>, <a href="https://man.archlinux.org/man/xcb-requests.3.en">xcb</a>),
blending their interface with libc.</p>

<p>Suppose you’re in such a situation where, due to unix-oriented libraries,
your application must use functions from two different CRTs at once. One
might have been compiled with Mingw-w64 and linked with MSVCRT, and the
other compiled with MSVC and linked with UCRT. We need to call <code class="language-plaintext highlighter-rouge">malloc</code>
and <code class="language-plaintext highlighter-rouge">free</code> in each, but they have the same name. What a pickle!</p>

<p>There’s an obvious, and probably most common, solution: <a href="https://learn.microsoft.com/en-us/windows/win32/dlls/run-time-dynamic-linking">run-time dynamic
linking</a>. Use load-time linking on one CRT, and LoadLibrary on the
other CRT with GetProcAddress to obtain function pointers. However, it’s
possible to do this entirely with load-time linking!</p>

<h3 id="a-malloc-by-any-other-name-would-allocate-as-well">A malloc by any other name would allocate as well</h3>

<p>Think about it a moment and you might wonder: If the names are the same,
how can I pick which I’m calling? The tuple representation won’t work
because <code class="language-plaintext highlighter-rouge">!</code> cannot appear in an identifier, which is, after all, why it
was chosen. The trick is that we’re going to <em>rename</em> one of them! To
demonstrate, I’ll use <a href="/blog/2020/09/25/">my Windows development kit</a>, <a href="https://github.com/skeeto/w64devkit">w64devkit</a>, a
Mingw-w64 distribution that links MSVCRT. I’m going to use UCRT as the
second CRT to access <code class="language-plaintext highlighter-rouge">ucrtbase.dll!malloc</code>.</p>

<p>I can choose whatever valid identifier I’d like, so I’m going to pick
<code class="language-plaintext highlighter-rouge">ucrt_malloc</code>. This will <a href="/blog/2021/05/31/">require a declaration</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">ucrt_malloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>If I stop here and try to use it, of course it won’t work:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ld: undefined reference to `__imp_ucrt_malloc'
</code></pre></div></div>

<p>The linker hasn’t yet been informed of the change in management. For that
we’ll need an import library. I’ll define one using a <a href="https://sourceware.org/binutils/docs/binutils/def-file-format.html">.def file</a>,
which I’ll name <code class="language-plaintext highlighter-rouge">ucrtbase.def</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LIBRARY ucrtbase.dll
EXPORTS
ucrt_malloc == malloc
</code></pre></div></div>

<p>The last line says that this library has the symbol <code class="language-plaintext highlighter-rouge">ucrt_malloc</code>, but
that it should be imported as <code class="language-plaintext highlighter-rouge">malloc</code>. This line is the lynchpin to the
whole scheme. Note: The double equals is important, as a single equals
sign means something different.  Next, use <code class="language-plaintext highlighter-rouge">dlltool</code> to build the import
library:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dlltool -d ucrtbase.def -l ucrtbase.lib
</code></pre></div></div>

<p>The equivalent MSVC tool is <a href="https://learn.microsoft.com/en-us/cpp/build/reference/overview-of-lib"><code class="language-plaintext highlighter-rouge">lib</code></a>, but as far as I know it cannot
quite do this sort of renaming. However, MSVC <code class="language-plaintext highlighter-rouge">link</code> will work just fine
with this <code class="language-plaintext highlighter-rouge">dlltool</code>-created import library. The name <code class="language-plaintext highlighter-rouge">ucrtbase.lib</code>, while
obvious, is irrelevant. It’s that <code class="language-plaintext highlighter-rouge">LIBRARY</code> line that ties it to the DLL.
My test source file looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">ucrt_malloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">msvcrt</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">)};</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">ucrt</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="n">ucrt_malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">ucrt_malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">ucrt_malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">)};</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It compiles successfully:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -g3 -o main.exe main.c ucrtbase.lib
</code></pre></div></div>

<p>I can see the two <code class="language-plaintext highlighter-rouge">malloc</code> imports with <code class="language-plaintext highlighter-rouge">objdump</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p main.exe
...
DLL Name: msvcrt.dll
...
844a	 1021  malloc
...
DLL Name: ucrtbase.dll
847e	    1  malloc
</code></pre></div></div>

<p>It loads and runs successfully, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gdb main.exe
Reading symbols from main.exe...
(gdb) break 9
Breakpoint 1 at 0x1400013cd: file main.c, line 9.
(gdb) run
Thread 1 hit Breakpoint 1, main () at main.c:9
9           return 0;
(gdb) p msvcrt
$1 = {0xd06a30, 0xd06a70, 0xd06ab0}
(gdb) p ucrt
$2 = {0x6e9490, 0x6eb7c0, 0x6eb800}
</code></pre></div></div>

<p>The pointer addresses confirm that these are two distinct allocators.
Perhaps you’re wondering what happens if I cross the streams?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">ucrt_malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The MSVCRT allocator justifiably panics over the bad pointer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -g3 -o chaos.exe chaos.c ucrtbase.lib
$ gdb -ex run chaos.exe
Starting program: chaos.exe
warning: HEAP[chaos.exe]:
warning: Invalid address specified to RtlFreeHeap
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffc42c369af in ntdll!RtlRegisterSecureMemoryCacheCallback ()
(gdb)
</code></pre></div></div>

<p>While you’re probably not supposed to meddle with <code class="language-plaintext highlighter-rouge">ucrtbase.dll</code> like
this, the general principle of export renames is reasonable. I don’t
expect I’ll ever need to do it, but I like that I have the option.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Everything you never wanted to know about Win32 environment blocks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/08/23/"/>
    <id>urn:uuid:3e73a0bb-fc27-4da2-9ae9-fab773a759d0</id>
    <updated>2023-08-23T21:51:10Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>In an effort to avoid <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/ProgrammingViaSuperstition">programming by superstition</a>, I did a deep dive
into the Win32 “environment block,” the data structure holding a process’s
environment variables, in order to better understand it. Along the way I
discovered implied and undocumented behaviors. (The <em>environment block</em>
must not be confused with the <a href="https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/pebteb/peb/index.htm">Process Environment Block</a> (PEB)
which is different.) Because I cannot possibly retain all the quirky
details in my head for long, I’m writing them down for future reference. I
ran my tests on different Windows versions as far back as Windows XP SP3
in order to fill in gaps where documentation is ambiguous, incomplete, or
wrong. Overall conclusion: Correct, direct manipulation of an environment
block is impossible <em>in the general case</em> due to under-specified and
incorrect documentation. This has important consequences mainly for
programming language runtimes.</p>

<p>Win32 has two interfaces for interacting with environment variables:</p>

<ol>
  <li><a href="https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-getenvironmentvariable">GetEnvironmentVariable</a> and <a href="https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setenvironmentvariable">SetEnvironmentVariable</a></li>
  <li><a href="https://learn.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getenvironmentstringsw">GetEnvironmentStrings</a> and <a href="https://learn.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-freeenvironmentstringsw">FreeEnvironmentStrings</a></li>
</ol>

<p>The first, which I’ll call get/set, is the easy interface, with Windows
doing all the searching and sorting on your behalf. It’s also the only
supported interface through which a process can manipulate its own
variables. It has no function for enumerating variables.</p>

<p>The second, which I’ll call get/free, allocates a <em>copy of</em> the
environment block. Calls to get/set do not modify existing copies.
Similarly, manipulating this block has no effect on the environment as
viewed through get/set. In other words, it’s <em>read only</em>. We can enumerate
our environment variables by walking the environment block. As I will
discuss below, enumeration is its only consistently useful purpose!</p>

<p>Technically it’s possible to access the actual environment block through
undocumented fields in the PEB. It’s the same content as returned by
get/free except that it’s not a copy. It cannot be accessed safely, so I’m
ignoring this route.</p>

<p>The environment block format is a null-terminated block of null-terminated
strings:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>keyA=a\0keyBB=bb\0keyCCC=ccc\0\0
</code></pre></div></div>

<p>Each string <del>begins with a character other than <code class="language-plaintext highlighter-rouge">=</code> and</del> contains at
least one <code class="language-plaintext highlighter-rouge">=</code>. In my tests this rule was strictly enforced by Windows, and
I could not construct an environment block that broke this rule. This list
is usually, but not always, sorted. It may contain repeated variables, but
they’re always assigned the same value, which is also strictly enforced by
Windows.</p>
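
<p>Walking this format takes only a few lines. The sketch below uses portable <code class="language-plaintext highlighter-rouge">wchar_t</code> functions so it is not tied to Win32. <code class="language-plaintext highlighter-rouge">env_count</code> and <code class="language-plaintext highlighter-rouge">env_lookup</code> are hypothetical helpers, and the lookup is an exact, case-sensitive match, which is a simplification: Windows compares variable names case-insensitively.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

// Count the strings in an environment block. The walk stops at the first
// empty string, which is the block terminator.
size_t env_count(const wchar_t *block)
{
    assert(block);
    size_t n = 0;
    for (const wchar_t *p = block; *p; p += wcslen(p) + 1) {
        n++;
    }
    return n;
}

// Return a pointer to the value of name inside the block, or NULL if the
// name is absent. Case-sensitive for simplicity (unlike Windows itself).
const wchar_t *env_lookup(const wchar_t *block, const wchar_t *name)
{
    assert(block && name);
    size_t len = wcslen(name);
    if (!len) {
        return NULL;
    }
    for (const wchar_t *p = block; *p; p += wcslen(p) + 1) {
        if (!wcsncmp(p, name, len) && p[len] == L'=') {
            return p + len + 1;  // skip past "name="
        }
    }
    return NULL;
}
```

<p>Applied to the example block above, <code class="language-plaintext highlighter-rouge">env_count</code> finds three strings and <code class="language-plaintext highlighter-rouge">env_lookup</code> of <code class="language-plaintext highlighter-rouge">keyBB</code> points at <code class="language-plaintext highlighter-rouge">bb</code>.</p>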

<p><del>The get/free interface has no “set” function, and a process cannot set
its own environment block to a custom buffer.</del> (Update: Stefan Kanthak
points out <a href="https://learn.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-setenvironmentstringsw">SetEnvironmentStringsW</a>. I missed it because it was only
officially documented a few months before this article was written.) There
<em>is</em> one interface where a process gets to provide a raw environment
block: <a href="https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessw">CreateProcess</a>. That is, a parent can construct one for its
children.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">wchar_t</span> <span class="n">env</span><span class="p">[]</span> <span class="o">=</span> <span class="s">L"HOME=C:</span><span class="se">\\</span><span class="s">Users</span><span class="se">\\</span><span class="s">me</span><span class="se">\0</span><span class="s">PATH=C:</span><span class="se">\\</span><span class="s">bin;C:</span><span class="se">\\</span><span class="s">Windows</span><span class="se">\0</span><span class="s">"</span><span class="p">;</span>
    <span class="n">CreateProcessW</span><span class="p">(</span><span class="s">L"example.exe"</span><span class="p">,</span> <span class="p">...,</span> <span class="n">env</span><span class="p">,</span> <span class="p">...);</span>
</code></pre></div></div>

<p>Windows imposes some rules upon this environment block:</p>

<ul>
  <li>
    <p><del>If an element begins with <code class="language-plaintext highlighter-rouge">=</code> or does not contain <code class="language-plaintext highlighter-rouge">=</code>, CreateProcess
fails.</del></p>
  </li>
  <li>
    <p>Repeated variables are modified to match the first instance. If you’re
potentially overriding using a duplicate, put the override first.</p>
  </li>
  <li>
    <p>Some cases of bad formatting become memory access violations.</p>
  </li>
</ul>
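<p>To illustrate the override-first rule, here’s a minimal sketch of packing a block in portable C. The name <code class="language-plaintext highlighter-rouge">envblock_pack</code> and its interface are hypothetical, invented for this example and not part of Win32. It does no sorting or validation, and it writes an empty environment as two terminators.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

// Hypothetical helper, not part of Win32: pack "NAME=value" strings into
// dst as a double-null-terminated block, overrides first so that they win
// over any duplicates in base. Returns the number of wchar_t written,
// including the final terminator, or 0 if dst is too small.
size_t envblock_pack(wchar_t *dst, size_t cap,
                     const wchar_t **overrides, size_t noverrides,
                     const wchar_t **base, size_t nbase)
{
    assert(dst);
    const wchar_t **lists[] = {overrides, base};
    size_t counts[] = {noverrides, nbase};
    size_t len = 0;

    for (int i = 0; i < 2; i++) {
        for (size_t j = 0; j < counts[i]; j++) {
            const wchar_t *var = lists[i][j];
            size_t n = wcslen(var) + 1;  // include the terminator
            if (n > cap - len) {
                return 0;
            }
            wmemcpy(dst + len, var, n);
            len += n;
        }
    }

    if (len == 0) {
        // Empty environment: one empty element, then the block terminator
        if (cap < 2) {
            return 0;
        }
        dst[len++] = 0;
    }
    if (len == cap) {
        return 0;
    }
    dst[len++] = 0;  // terminate the block itself
    return len;
}
```

<p>The resulting buffer is the sort of thing that would be passed as the environment argument to CreateProcessW, along with <code class="language-plaintext highlighter-rouge">CREATE_UNICODE_ENVIRONMENT</code>.</p>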

<p>As usual for Win32, there are <a href="https://simonsapin.github.io/wtf-8/">no rules against ill-formed UTF-16</a>,
and I could <a href="/blog/2022/02/18/">always pass</a> such “UTF-16” through into the child
environment block. Keep that in mind even when using the get/set
interface.</p>

<p>The SetEnvironmentVariable documentation gives a maximum variable size:</p>

<blockquote>
  <p>The maximum size of a user-defined environment variable is 32,767
characters. There is no technical limitation on the size of the
environment block.</p>
</blockquote>

<p>At least on more recent versions of Windows, my experiments proved exactly
the opposite. There is no limit on the size of a user-defined environment
variable, but environment blocks are limited to 2GiB, for both 32-bit and
64-bit processes. I could even create such huge environments in <a href="https://learn.microsoft.com/en-us/cpp/build/reference/largeaddressaware-handle-large-addresses">large address
aware</a> 32-bit processes, though the interfaces are prone to error due
to allocation problems.</p>

<p>There’s one special case where CreateProcess is illogical, and it’s
certainly a case of confusion within its implementation. <strong>An environment
block is not allowed to be empty.</strong> An empty environment is represented as
a block containing one empty (zero length) element. That is, two null
terminators in a row. It’s the one case where an environment block may
contain an element without a <code class="language-plaintext highlighter-rouge">=</code>. The <em>logical</em> empty environment block
would be just one null terminator, to terminate the block itself, because
it contains no variables. You can safely pretend that’s the case when
parsing an environment block, as this special case is superfluous.</p>

<p>However, CreateProcess partially enforces this silly, unnecessary special
case! If an environment block begins with a null terminator, the next
character <em>must be in a mapped memory region</em> because it will read this
character. If it’s not mapped, the result is a memory access violation.
Its actual value doesn’t matter, and CreateProcess will treat it as though
it was another null terminator. Surely someone at Microsoft would have
noticed by now that this behavior makes no sense, but I guess it’s kept
for backwards compatibility?</p>

<p>The CreateProcess documentation says that “the system uses a sorted
environment” but this made no difference in my tests. The word “must”
appears in this sentence, but it’s unclear if it applies to sorting, or
even outside the special case being discussed. GetEnvironmentVariable
works fine on an unsorted environment block. SetEnvironmentVariable
maintains sorting, but given an unsorted block it goes somewhere in the
middle, probably wherever a bisection happens to land. Perhaps look-ups in
sorted blocks are faster, but environment blocks are so small — <del>a
maximum of 32K characters</del> (Update: only true for ANSI) — that, in
practice, it really does not matter.</p>

<p>Suppose you’re meticulous and want to sort your environment block before
spawning a process. How do you go about it? There’s the rub: The official
documentation is incomplete! The <a href="https://learn.microsoft.com/en-us/windows/win32/procthread/changing-environment-variables">Changing Environment Variables</a>
page says:</p>

<blockquote>
  <p>All strings in the environment block must be sorted alphabetically by
name. The sort is case-insensitive, Unicode order, without regard to
locale.</p>
</blockquote>

<p>What do they mean by “case-insensitive” sort? Does “Unicode order” mean
<a href="https://www.unicode.org/Public/15.0.0/ucd/CaseFolding.txt">case folding</a>? A reasonable guess, but no, that’s not how get/set
works. Besides, how does “Unicode order” apply to ill-formed UTF-16?
Worse, get/set sorting is certainly not “Unicode order” even outside of
case-insensitivity! For example, <code class="language-plaintext highlighter-rouge">U+1F31E</code> (SUN WITH FACE) sorts ahead of
<code class="language-plaintext highlighter-rouge">U+FF01</code> (FULLWIDTH EXCLAMATION MARK) because the former encodes in UTF-16
as <code class="language-plaintext highlighter-rouge">U+D83C U+DF1E</code>. Maybe it’s case-insensitive only in ASCII? Nope, π
(<code class="language-plaintext highlighter-rouge">U+03C0</code>) and Π (<code class="language-plaintext highlighter-rouge">U+03A0</code>) are considered identical. Windows uses some
kind of case-insensitive, but not case-<em>folded</em>, undocumented early 1990s
UCS-2 sorting logic for environment variables.</p>

<p><strong>Update</strong>: John Doty <a href="https://lists.sr.ht/~skeeto/public-inbox/%3Cc2a4c4d7-95cc-48a4-8047-c79b55eba261%40app.fastmail.com%3E">suspects</a> the <a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-rtlcompareunicodestring">RtlCompareUnicodeString</a>
function for sorting. It <a href="https://github.com/skeeto/scratch/blob/master/misc/envsort.c">lines up perfectly with get/set</a> for
all possible inputs.</p>
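<p>To see why <code class="language-plaintext highlighter-rouge">U+1F31E</code> lands ahead of <code class="language-plaintext highlighter-rouge">U+FF01</code>, here’s a small sketch that compares code points by their UTF-16 code units. The function names are my own for this example, and it deliberately ignores case-insensitivity, so it models only the code unit ordering, not the full get/set comparison.</p>

```c
#include <assert.h>
#include <stdint.h>

// Encode a code point as UTF-16, returning the number of code units
// written (1 or 2).
int utf16_encode(uint16_t out[2], uint32_t cp)
{
    assert(cp <= 0x10ffff);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xd800 | (cp >> 10));    // high surrogate
    out[1] = (uint16_t)(0xdc00 | (cp & 0x3ff));  // low surrogate
    return 2;
}

// Compare two code points as a code-unit-wise comparison sees them:
// negative if a sorts first, positive if b sorts first, zero if equal.
int utf16_order(uint32_t a, uint32_t b)
{
    uint16_t ua[2] = {0}, ub[2] = {0};
    int na = utf16_encode(ua, a);
    int nb = utf16_encode(ub, b);
    for (int i = 0; i < na && i < nb; i++) {
        if (ua[i] != ub[i]) {
            return ua[i] < ub[i] ? -1 : +1;
        }
    }
    return na - nb;
}
```

<p>Every surrogate falls in the range <code class="language-plaintext highlighter-rouge">0xD800</code> to <code class="language-plaintext highlighter-rouge">0xDFFF</code>, so under this ordering anything outside the Basic Multilingual Plane sorts ahead of <code class="language-plaintext highlighter-rouge">U+E000</code> through <code class="language-plaintext highlighter-rouge">U+FFFF</code>.</p>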

<p>Without better guidance, the only reliable way to “correctly” sort an
environment block is to build it with get/set, then retrieve the result
with get/free. The algorithm looks like:</p>

<ol>
  <li>Get a copy of the environment with GetEnvironmentStrings.</li>
  <li>Walk the environment and call SetEnvironmentVariable on each name with
a null pointer as the value. This clears out the environment.</li>
  <li>Call SetEnvironmentVariable for each variable in the new environment.</li>
  <li>Get a sorted copy of the new environment with GetEnvironmentStrings.</li>
</ol>

<p>Unfortunately that’s all global state, so you can only construct one new
environment block at a time.</p>

<p>If you know all your variable names ahead of time, then none of this is a
problem. Determine what Windows thinks the order should be, then use that
in your program when constructing the environment block. It’s the <em>general
case</em> where this is a challenge, such as a language runtime designed to
operate on arbitrary environment variables with behavior congruent to the
rest of the system.</p>

<p>There are similar issues with looking up variables in an environment
block. How does case-insensitivity work? Sorting is “without regard to
locale” but what about when comparing variable names? The documentation
doesn’t say. When enumerating variables using get/free, you might read
what get/set considers to be duplicates, though at least values will
always agree with get/set, i.e. they’re aliases of one variable. Windows
maintains that invariant in my tests. The above algorithm would also
delete these duplicates.</p>

<p>For example, if someone passed you a “dirty” environment with duplicates,
or that was unsorted, this would clean it up in a way that allows get/free
to be traversed in order without duplicates.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">GetEnvironmentStringsW</span><span class="p">();</span>

    <span class="c1">// Clear out the environment</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="o">*</span><span class="n">var</span><span class="p">;)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">wcslen</span><span class="p">(</span><span class="n">var</span><span class="p">);</span>
        <span class="kt">size_t</span> <span class="n">split</span> <span class="o">=</span> <span class="n">wcscspn</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="s">L"="</span><span class="p">);</span>
        <span class="n">var</span><span class="p">[</span><span class="n">split</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">SetEnvironmentVariableW</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">var</span><span class="p">[</span><span class="n">split</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'='</span><span class="p">;</span>
        <span class="n">var</span> <span class="o">+=</span> <span class="n">len</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Restore the original variables</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="o">*</span><span class="n">var</span><span class="p">;)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">wcslen</span><span class="p">(</span><span class="n">var</span><span class="p">);</span>
        <span class="kt">size_t</span> <span class="n">split</span> <span class="o">=</span> <span class="n">wcscspn</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="s">L"="</span><span class="p">);</span>
        <span class="n">var</span><span class="p">[</span><span class="n">split</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">SetEnvironmentVariableW</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="n">var</span><span class="o">+</span><span class="n">split</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="n">var</span> <span class="o">+=</span> <span class="n">len</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">FreeEnvironmentStringsW</span><span class="p">(</span><span class="n">env</span><span class="p">);</span>
</code></pre></div></div>

<p>On the second pass, SetEnvironmentVariableW will gobble up all the
duplicates.</p>

<p>As a final note, the CreateProcess page had said this <a href="https://web.archive.org/web/20180110151515/http://msdn.microsoft.com/en-us/library/ms682425(VS.85).aspx">up until February
2023</a> about the environment block parameter:</p>

<blockquote>
  <p>If this parameter is <code class="language-plaintext highlighter-rouge">NULL</code> and the environment block of the parent
process contains Unicode characters, you must also ensure that
<code class="language-plaintext highlighter-rouge">dwCreationFlags</code> includes <code class="language-plaintext highlighter-rouge">CREATE_UNICODE_ENVIRONMENT</code>.</p>
</blockquote>

<p>That seems to indicate it’s virtually always wrong to call CreateProcess
without that flag — that is, Windows will trash the child’s environment
unless this flag is passed — which is a bonkers default. Fortunately this
appears to be wrong, which is probably why the documentation was finally
corrected (after several decades). Omitting this flag was fine under all
my tests, and I was unable to produce surprising behavior on any system.</p>

<p>In summary:</p>

<ul>
  <li>Prefer get/set for all operations except enumeration</li>
  <li>Environment blocks are not necessarily sorted</li>
  <li>Repeat variables are forced to the value of the first instance</li>
  <li>Variables may contain ill-formed UTF-16</li>
  <li>Empty environment blocks have a superfluous special case</li>
  <li><del>Entries cannot begin with <code class="language-plaintext highlighter-rouge">=</code></del></li>
  <li>Entries must contain at least one <code class="language-plaintext highlighter-rouge">=</code></li>
  <li>Sort order is ambiguous, so you cannot reliably do it yourself</li>
  <li>Case-insensitivity of names is ambiguous, so rely on get/set</li>
  <li><code class="language-plaintext highlighter-rouge">CREATE_UNICODE_ENVIRONMENT</code> necessary only for non-null environment</li>
</ul>

<p><strong>Update September 2024</strong>: Correction from Kasper Brandt <a href="https://lists.sr.ht/~skeeto/public-inbox/%3C098b0421-af0e-46fb-8921-2a4e76f5a361@app.fastmail.com%3E">regarding
variables beginning with <code class="language-plaintext highlighter-rouge">=</code></a>. I misunderstood how it was parsed and
came to the wrong conclusion.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Hand-written Windows API prototypes: fast, flexible, and tedious</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/05/31/"/>
    <id>urn:uuid:35b44114-7ad2-422b-9eaf-dc37e7eaaf97</id>
    <updated>2023-05-31T01:38:31Z</updated>
    <category term="win32"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>I love fast builds, and for years I’ve been bothered by the build penalty
for translation units including <code class="language-plaintext highlighter-rouge">windows.h</code>. This header has an enormous
number of definitions and declarations and so, for C programs, it tends to
dominate the build time of those translation units. Most programs,
especially systems software, need only a tiny portion of it. For example,
when compiling <a href="/blog/2023/01/18/">u-config</a> with GCC, two thirds of the debug build was
spent processing <code class="language-plaintext highlighter-rouge">windows.h</code> just for <a href="https://github.com/skeeto/u-config/blob/e6ebb9b/miniwin32.h">4 types, 16 definitions, and 16
prototypes</a>.</p>

<p>To give a sense of the numbers, here’s <code class="language-plaintext highlighter-rouge">empty.c</code>, which does nothing but
include <code class="language-plaintext highlighter-rouge">windows.h</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -E empty.c | grep -vc '^$'
82041
</code></pre></div></div>

<p>With <a href="https://github.com/skeeto/w64devkit">w64devkit</a> this takes my system ~450ms to compile with GCC:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time gcc -c empty.c
real    0m 0.45s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>Compiling an actually empty source file takes ~10ms, so it really is
spending practically all that time processing headers. MSVC is a faster
compiler, and this extends to processing an even larger <code class="language-plaintext highlighter-rouge">windows.h</code> that
crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real    0m 0.12s
user    0m 0.09s
sys     0m 0.01s
</code></pre></div></div>

<p>That’s just low enough to be tolerable, but I’d like the situation with
GCC to be better. Defining <code class="language-plaintext highlighter-rouge">WIN32_LEAN_AND_MEAN</code> reduces the number of
included headers, which has a significant effect:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real    0m 0.30s
user    0m 0.00s
sys     0m 0.00s

$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real    0m 0.07s
user    0m 0.01s
sys     0m 0.01s
</code></pre></div></div>

<h3 id="precompiled-headers">Precompiled headers</h3>

<p>The official solution is precompiled headers. Put all the system header
includes, <a href="/blog/2023/01/08/">or similar</a>, into a dedicated header, then compile that
header into a special format. For example, <code class="language-plaintext highlighter-rouge">headers.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define WIN32_LEAN_AND_MEAN
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">main.c</code> includes <code class="language-plaintext highlighter-rouge">windows.h</code> through this header:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"headers.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I ask <a href="https://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html">GCC to compile <code class="language-plaintext highlighter-rouge">headers.h</code></a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc headers.h
</code></pre></div></div>

<p>It produces <code class="language-plaintext highlighter-rouge">headers.h.gch</code>. When a source includes <code class="language-plaintext highlighter-rouge">headers.h</code>, GCC first
searches for an appropriate <code class="language-plaintext highlighter-rouge">.gch</code>. Not only must the name match, but so
must all the definitions at the moment of inclusion: <code class="language-plaintext highlighter-rouge">headers.h</code> should
always be the first included header, otherwise it may not work. Now when I
compile <code class="language-plaintext highlighter-rouge">main.c</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time gcc -c main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>Much better! MSVC has a conventional name for this header recognizable to
every Visual Studio user: <code class="language-plaintext highlighter-rouge">stdafx.h</code>. It works a bit differently, and I’ve
never used it myself, but I trust it has similar results.</p>

<p>Precompiled headers require some extra steps that vary by toolchain. Can
we do better? That depends on your definition of “better!”</p>

<h3 id="artisan-handcrafted-prototypes">Artisan, handcrafted prototypes</h3>

<p>As mentioned, systems software tends to need only a few declarations:
open, read, write, stat, etc. What if I wrote these out manually? A bit
tedious, but it doesn’t require special precompiled header handling. It
also creates some new possibilities. To illustrate, a <a href="/blog/2023/02/15/">CRT-free</a>
“hello world” program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">DWORD</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This takes my system half a second to compile — quite long to produce just
26 assembly instructions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.50s
user    0m 0.00s
sys     0m 0.00s
$ ./hello.exe
Hello, world!
</code></pre></div></div>

<p>The program requires prototypes only for GetStdHandle and WriteFile, a
definition for <code class="language-plaintext highlighter-rouge">STD_OUTPUT_HANDLE</code>, and some typedefs. Starting with the
easy stuff, the definition and <a href="https://learn.microsoft.com/en-us/windows/win32/winprog/windows-data-types">types look like this</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define STD_OUTPUT_HANDLE ((DWORD)-11)
</span>
<span class="k">typedef</span> <span class="kt">int</span> <span class="n">BOOL</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="n">HANDLE</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">DWORD</span><span class="p">;</span>
</code></pre></div></div>

<p>By the way, here’s a cheat code for quickly finding preprocessor
definitions, faster than looking them up elsewhere:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo '#include &lt;windows.h&gt;' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)
</code></pre></div></div>

<p>Did you catch the pattern? It’s <code class="language-plaintext highlighter-rouge">-10 - fd</code>, where <code class="language-plaintext highlighter-rouge">fd</code> is the conventional
unix file descriptor number: a kind of mnemonic.</p>
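<p>Written out as code, the mnemonic is a one-liner. The name <code class="language-plaintext highlighter-rouge">std_handle_id</code> is hypothetical, made up for this sketch:</p>

```c
#include <assert.h>

// Map a conventional unix file descriptor (0, 1, 2) to the identifier
// passed to GetStdHandle. Hypothetical helper, for illustration only.
int std_handle_id(int fd)
{
    assert(fd >= 0 && fd <= 2);
    return -10 - fd;
}
```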

<p>Prototypes are a little trickier, especially if you care about 32-bit. The
Windows API uses the “stdcall” calling convention, which is distinct from
the “cdecl” calling convention on x86, though the same on x64. Of course,
you must already be aware of this merely using the API, as your own
callbacks must usually be stdcall themselves. Further, API functions are
<a href="/blog/2021/05/31/">DLL imports</a> and should be declared as such. Putting it together,
here’s GetStdHandle:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="n">HANDLE</span> <span class="kr">__stdcall</span> <span class="nf">GetStdHandle</span><span class="p">(</span><span class="n">DWORD</span><span class="p">);</span>
</code></pre></div></div>

<p>This works with both Mingw-w64 and MSVC. MSVC requires <code class="language-plaintext highlighter-rouge">__stdcall</code> between
the return type and function name, so don’t get clever about it. If you
only care about GCC then you can declare both at once using attributes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span> <span class="nf">GetStdHandle</span><span class="p">(</span><span class="n">DWORD</span><span class="p">)</span>
    <span class="n">__attribute__</span><span class="p">((</span><span class="n">dllimport</span><span class="p">,</span><span class="n">stdcall</span><span class="p">));</span>
</code></pre></div></div>

<p>I like to hide all this behind a macro, with a “table” of all my imports
listed just below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define W32(r) __declspec(dllimport) r __stdcall
</span><span class="n">W32</span><span class="p">(</span><span class="n">HANDLE</span><span class="p">)</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">DWORD</span><span class="p">);</span>
<span class="n">W32</span><span class="p">(</span><span class="n">BOOL</span><span class="p">)</span>   <span class="n">WriteFile</span><span class="p">(</span><span class="n">HANDLE</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="n">DWORD</span><span class="p">,</span> <span class="n">DWORD</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>In WriteFile you may have noticed I’m taking shortcuts. The “official”
definition uses an ugly pointer typedef, <code class="language-plaintext highlighter-rouge">LPCVOID</code>, instead of pointer
syntax, but I skipped that type definition. I also replaced the last
argument, an <code class="language-plaintext highlighter-rouge">OVERLAPPED</code> pointer, with a generic pointer. I only need to
pass null. I can keep sanding it down to something more ergonomic:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">W32</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span>    <span class="n">WriteFile</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>That’s how I typically write these prototypes. I dropped the <code class="language-plaintext highlighter-rouge">const</code>
because it doesn’t help me. I used signed sizes because I like them better
and it’s <a href="/blog/2023/02/13/">what I’m usually holding</a> at the call site. But doesn’t
changing the signedness potentially break compatibility? It makes no
difference to any practical ABI: It’s passed the same way. In general,
signedness is a matter for <em>operators</em>, and only some of them — mainly
comparisons (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, etc.) and division. It’s a similar story for
pointers starting with the 32-bit era, so I can choose whatever pointer
types are convenient.</p>
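<p>A quick sketch of that claim in portable C, assuming the usual two’s complement <code class="language-plaintext highlighter-rouge">int</code> with no padding bits. The helper names are hypothetical. The signed and unsigned values have identical bits, which is all that matters when passing an argument, while the comparison operator gives different answers depending on the type.</p>

```c
#include <assert.h>
#include <string.h>

// True when the signed and unsigned objects have the same representation,
// which is what determines how the argument is passed.
int same_bits(int s, unsigned u)
{
    assert(sizeof(s) == sizeof(u));
    return memcmp(&s, &u, sizeof(s)) == 0;
}

// Signedness shows up in the operator: same bits, different answers.
int less_signed(int a, int b)             { return a < b; }
int less_unsigned(unsigned a, unsigned b) { return a < b; }
```

<p>For instance, <code class="language-plaintext highlighter-rouge">-11</code> and <code class="language-plaintext highlighter-rouge">(unsigned)-11</code> have the same bits, yet only the signed value compares less than 1.</p>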

<p>In general, I can do anything I want so long as I know my compiler will
produce an appropriate function call. These are not standard functions,
like <code class="language-plaintext highlighter-rouge">printf</code> or <code class="language-plaintext highlighter-rouge">memcpy</code>, which are implemented in part by the compiler
itself, but foreign functions. It’s no different than teaching <a href="/blog/2018/05/27/">an
FFI</a> how to make a call. This is also, in essence, how OpenGL and
Vulkan work, with applications <a href="https://www.khronos.org/opengl/wiki/OpenGL_Loading_Library">defining the API for themselves</a>.</p>

<p>Considering all this, my new hello world:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define W32(r) __declspec(dllimport) r __stdcall
</span><span class="n">W32</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>
<span class="n">W32</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span>    <span class="n">WriteFile</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You know, there’s a kind of beauty to a program that requires no external
definitions. It builds quickly and produces a binary bit-for-bit identical
to the original:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time cc -nostartfiles -o hello.exe main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real    0m 0.03s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>I’ve also been using this to patch over API rough edges. For example,
<a href="https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsarecvfrom">WSARecvFrom</a> takes <a href="https://learn.microsoft.com/en-us/windows/win32/api/winsock2/ns-winsock2-wsaoverlapped">WSAOVERLAPPED</a>, but <a href="https://learn.microsoft.com/en-us/windows/win32/api/ioapiset/nf-ioapiset-getqueuedcompletionstatus">GetQueuedCompletionStatus</a>
takes <a href="https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped">OVERLAPPED</a>. These types are explicitly compatible, and only
defined separately for annoying technical reasons. I must use the same
overlapped object with both APIs at once, meaning I would normally need
ugly pointer casts on my Winsock calls, or vice versa with I/O completion
ports. But because I’m writing all these definitions myself, I can define
a common overlapped structure for both!</p>
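
<p>A minimal sketch of that idea, assuming hand-written fixed-width typedefs in
place of <code class="language-plaintext highlighter-rouge">windows.h</code>. The names <code class="language-plaintext highlighter-rouge">Overlapped</code>, <code class="language-plaintext highlighter-rouge">WsaBuf</code>, <code class="language-plaintext highlighter-rouge">W32</code>, and the
typedefs are hypothetical, chosen for illustration. Only the field layout
and the two function names come from the Win32 documentation. Both
prototypes accept the same structure, so neither call site needs a cast.</p>

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t   i32;
typedef uint32_t  u32;
typedef uintptr_t uptr;

// One structure standing in for both OVERLAPPED and WSAOVERLAPPED.
// Field order and sizes follow the documented OVERLAPPED layout.
typedef struct {
    uptr internal;
    uptr internal_high;
    union {
        struct { u32 offset, offset_high; } pos;
        void *pointer;
    } u;
    uptr event;  // HANDLE hEvent
} Overlapped;

typedef struct {
    u32   len;
    char *buf;
} WsaBuf;  // stands in for WSABUF

#ifdef _WIN32
#  define W32(r) __declspec(dllimport) r __stdcall
// Hand-written prototypes: both take the same Overlapped type.
W32(i32) WSARecvFrom(uptr, WsaBuf *, u32, u32 *, u32 *, void *, i32 *,
                     Overlapped *, void *);
W32(i32) GetQueuedCompletionStatus(uptr, u32 *, uptr *, Overlapped **, u32);
#endif

// Catch layout mistakes at compile time: three pointer-sized fields
// plus the 8-byte offset union (32 bytes on x64, 20 on x86).
_Static_assert(sizeof(Overlapped) == 3*sizeof(uptr) + 8, "bad layout");
```

<p>The static assertion is cheap insurance: if a field is mistyped, the build
fails instead of the program corrupting memory at run time.</p>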

<p>Perhaps you’re worried that this would be too fragile. Well, as a legacy
software aficionado, I enjoy <a href="/blog/2018/04/13/">building and running my programs on old
platforms</a>. So far these programs still work properly <a href="https://winworldpc.com/library/">going back
30 years</a> to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag,
it’s always been a bug (now long fixed) in the old operating system, not
in my programs or these prototypes. So, in effect, this technique has
worked well for the past 30 years!</p>

<p>Writing out these definitions is a bit of a chore, but after paying that
price I’ve been quite happy with the results. I will likely continue doing
it in the future, at least for non-graphical applications.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>CRT-free in 2023: tips and tricks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/02/15/"/>
    <id>urn:uuid:025441bf-084e-4c3e-9a37-269e2ac1a4d6</id>
    <updated>2023-02-15T02:12:00Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Seven years ago I wrote about <a href="/blog/2016/01/31/">“freestanding” Windows executables</a>.
After an additional seven years of practical experience both writing and
distributing such programs, half using <a href="https://github.com/skeeto/w64devkit">a custom-built toolchain</a>,
it’s time to revisit these cabalistic incantations and otherwise scant
details. I’ve tweaked my older article over the years as I’ve learned, but
this is a full replacement and does not assume you’ve read it. The <a href="/blog/2023/02/11/">“why”
has been covered</a> and the focus will be on the “how”. Both the GNU
and MSVC toolchains will be considered.</p>

<p>I no longer call these “freestanding” programs since that term is, at
best, <a href="https://github.com/ipxe/ipxe/commit/e8393c372">inaccurate</a>. In fact, we will be actively avoiding GCC
features associated with that label. Instead I call these <em>CRT-free</em>
programs, where CRT stands for the <em>C runtime</em>, the Windows-oriented term
for <em>libc</em>. This term communicates both intent and scope.</p>

<h3 id="entry-point">Entry point</h3>

<p>You should already know that <code class="language-plaintext highlighter-rouge">main</code> is not the program’s entry point, but
a C application’s entry point. The CRT provides the entry point, where it
initializes the CRT, including <a href="/blog/2022/02/18/">parsing command line options</a>, then
calls the application’s <code class="language-plaintext highlighter-rouge">main</code>. The real entry point doesn’t have a name.
It’s just the address of the function to be called by the loader without
arguments.</p>

<p>You might naively assume you could continue using the name <code class="language-plaintext highlighter-rouge">main</code> and tell
the linker to use it as the entry point. You would be wrong. <strong>Avoid the
name <code class="language-plaintext highlighter-rouge">main</code>!</strong> It has a special meaning in C and gets special treatment. Using
it without a conventional CRT will confuse your tools and may cause build
issues.</p>

<p>While you can use almost any other name you like, the conventional names
are <code class="language-plaintext highlighter-rouge">mainCRTStartup</code> (console subsystem) and <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code> (windows
subsystem). It’s easy to remember: Append <code class="language-plaintext highlighter-rouge">CRTStartup</code> to the name you’d
use in a normal CRT-linking application. I strongly recommend using these
names because it reduces friction. Your tools are already familiar with
them, so you won’t need to do anything special.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>     <span class="c1">// console subsystem</span>
<span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>  <span class="c1">// windows subsystem</span>
</code></pre></div></div>

<p>The MSVC linker documentation says the entry point uses the <code class="language-plaintext highlighter-rouge">__stdcall</code>
calling convention. <del>Ignore this and <strong>do not use <code class="language-plaintext highlighter-rouge">__stdcall</code> for your
entry point!</strong></del> Since entry points take no arguments, there is no
practical difference from the <code class="language-plaintext highlighter-rouge">__cdecl</code> calling convention, so it matters
little. <del>Rather, the goal is to avoid <code class="language-plaintext highlighter-rouge">__stdcall</code> <em>function decorations</em>.
In particular, the GNU linker <code class="language-plaintext highlighter-rouge">--entry</code> option does not understand them,
nor can it find decorated entry points on its own. If you use <code class="language-plaintext highlighter-rouge">__stdcall</code>,
then the 32-bit GNU linker will silently (!) choose the beginning of your
<code class="language-plaintext highlighter-rouge">.text</code> section as the entry point.</del> (This bug was fixed in Binutils
2.42, released January 2024. <code class="language-plaintext highlighter-rouge">__stdcall</code> entry points now link correctly.)</p>

<p>If you’re using C++, then of course you will also need to use <code class="language-plaintext highlighter-rouge">extern "C"</code>
so that it’s not name-mangled. Otherwise the results are similarly bad.</p>
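
<p>As a small sketch, a source file meant to build as either C or C++ can guard
the linkage specifier with the standard <code class="language-plaintext highlighter-rouge">__cplusplus</code> macro. The function
body here is just a placeholder.</p>

```c
// Compiles as both C and C++: the guard gives the entry point C linkage
// (an unmangled name) when a C++ compiler is used.
#ifdef __cplusplus
extern "C"
#endif
int mainCRTStartup(void)
{
    return 0;
}
```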

<p>If using <code class="language-plaintext highlighter-rouge">-fwhole-program</code>, you will need to mark your entry point as
externally visible for GCC so that it knows it’s an entry point. While
linkers are familiar with conventional entry point names, GCC the
<em>compiler</em> is not. Normally you do not need to worry about this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">externally_visible</span><span class="p">))</span>  <span class="c1">// for -fwhole-program</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The entry point returns <code class="language-plaintext highlighter-rouge">int</code>. <em>If there are no other threads</em> then the
process will exit with the returned value as its exit status. In practice
this is only useful for console programs. Windows subsystem programs have
threads started automatically, without warning, and it’s almost certain
your main thread is not the last thread. You probably want to use
<code class="language-plaintext highlighter-rouge">ExitProcess</code> or even <code class="language-plaintext highlighter-rouge">TerminateProcess</code> instead of returning. The latter
exits more abruptly and can avoid issues with certain subsystems, like
DirectSound, not shutting down gracefully: It doesn’t even let them try.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">TerminateProcess</span><span class="p">(</span><span class="n">GetCurrentProcess</span><span class="p">(),</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="compilation">Compilation</h3>

<p>Starting with the GNU toolchain, you have two ways to get into “CRT-free
mode”: <code class="language-plaintext highlighter-rouge">-nostartfiles</code> and <code class="language-plaintext highlighter-rouge">-nostdlib</code>. The former is more dummy-proof,
and it’s what I use in build documentation. The latter can be more
complicated, but when it succeeds you get guarantees about the result. I
use it in build scripts I intend to run myself, which I want to fail if
they don’t do exactly what I expect. To illustrate, consider this trivial
program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">ExitProcess</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This program uses <code class="language-plaintext highlighter-rouge">ExitProcess</code> from <code class="language-plaintext highlighter-rouge">kernel32.dll</code>. Compiling is easy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles example.c
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-nostartfiles</code> prevents it from linking the CRT entry point, but it
still implicitly passes other “standard” linker flags, including libraries
<code class="language-plaintext highlighter-rouge">-lmingw32</code> and <code class="language-plaintext highlighter-rouge">-lkernel32</code>. Programs can use <code class="language-plaintext highlighter-rouge">kernel32.dll</code> functions
without explicitly linking that DLL. But, hey, isn’t <code class="language-plaintext highlighter-rouge">-lmingw32</code> the CRT,
the thing we’re avoiding? It is, but it wasn’t actually linked because the
program didn’t reference it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p a.exe | grep -Fi .dll
        DLL Name: KERNEL32.dll
</code></pre></div></div>

<p>However, <code class="language-plaintext highlighter-rouge">-nostdlib</code> does not pass any of these libraries, so you need to
do so explicitly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c -lkernel32
</code></pre></div></div>

<p>The MSVC toolchain behaves a little like <code class="language-plaintext highlighter-rouge">-nostartfiles</code>, not linking a
CRT unless you need it, semi-automatically. However, you’ll need to list
<code class="language-plaintext highlighter-rouge">kernel32.lib</code> and tell it which subsystem you’re using.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c /link /subsystem:console kernel32.lib
</code></pre></div></div>

<p>However, MSVC has a handy little feature to list these arguments in the
source file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _MSC_VER
</span>  <span class="cp">#pragma comment(linker, "/subsystem:console")
</span>  <span class="cp">#pragma comment(lib, "kernel32.lib")
#endif
</span></code></pre></div></div>

<p>This information must go somewhere, and I prefer the source file rather
than a build script. Then anyone can point MSVC at the source without
worrying about options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c
</code></pre></div></div>

<p>I try to make all my Windows programs so simply built.</p>

<h3 id="stack-probes">Stack probes</h3>

<p>On Windows, it’s expected that stacks will commit dynamically. That is,
the stack is merely <em>reserved</em> address space, and it’s only committed when
the stack actually grows into it. This made sense 30 years ago as a memory
saving technique, but today it no longer makes sense. However, programs
are still built to use this mechanism.</p>

<p>To function properly, programs must touch each stack page for the first
time in order. Normally that’s not an issue, but if your stack frame
exceeds the page size, there’s a chance it might step over a page. When a
function has a large stack frame, GCC inserts a call to a “stack probe” in
<code class="language-plaintext highlighter-rouge">libgcc</code> that touches its pages in the prologue. It’s not unlike <a href="/blog/2017/06/21/">stack
clash protection</a>.</p>
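
<p>To make “touch each page in order” concrete, here is a rough model of a
probe in portable C. It is only an illustration: the real probe is a small
assembly routine working on the stack pointer, and <code class="language-plaintext highlighter-rouge">probe_frame</code> and
<code class="language-plaintext highlighter-rouge">PROBE_STRIDE</code> are hypothetical names, with a 4096-byte page size assumed.</p>

```c
#include <stddef.h>

enum { PROBE_STRIDE = 4096 };  // assumed page size

// Touch one byte per page of the frame [low, low+size), starting nearest
// the already-committed stack (high addresses) and working downward, so
// each guard page is hit in order. Returns the number of touches.
static size_t probe_frame(volatile char *low, size_t size)
{
    size_t touches = 0;
    size_t off = size;
    while (off > PROBE_STRIDE) {
        off -= PROBE_STRIDE;
        low[off] |= 0;  // read-modify-write that changes nothing
        touches++;
    }
    if (size) {
        low[0] |= 0;
        touches++;
    }
    return touches;
}
```

<p>Consecutive touches are never more than one stride apart, so no page in the
frame is stepped over.</p>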

<p>For example, if I have a 4KiB local variable:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">12</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When I compile with <code class="language-plaintext highlighter-rouge">-nostdlib</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c
ld: ... undefined reference to `___chkstk_ms'
</code></pre></div></div>

<p>It’s trying to link the <code class="language-plaintext highlighter-rouge">libgcc</code> stack probe. You can disable this behavior
with <code class="language-plaintext highlighter-rouge">-mno-stack-arg-probe</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -nostdlib example.c
</code></pre></div></div>

<p>Or you can just link <code class="language-plaintext highlighter-rouge">-lgcc</code> to provide a definition:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c -lgcc
</code></pre></div></div>

<p>Had you used <code class="language-plaintext highlighter-rouge">-nostartfiles</code>, you wouldn’t have noticed because it passes
<code class="language-plaintext highlighter-rouge">-lgcc</code> automatically. It’s “dummy-proof” because this sort of issue goes
away before it comes up, though for the same reason it’s harder to tell
exactly what went into a program.</p>

<p>If you disable the probe altogether — my preference — you’ve only solved
the linker problem, but the underlying stack commit problem remains and
your program may crash. You can solve that by telling the linker to ask
the loader to commit a larger stack up front rather than grow it at run
time. Say, 2MiB:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000 example.c
</code></pre></div></div>

<p>Of course, I wish that this was simply the default behavior because it’s
far more sensible! A much better option is to avoid large stack frames in
the first place. Allocate locals larger than, say, 1KiB in a scratch arena
instead of on the stack.</p>
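
<p>One possible shape for such a scratch arena, as a sketch: a bump allocator
over a caller-supplied region. <code class="language-plaintext highlighter-rouge">Arena</code> and <code class="language-plaintext highlighter-rouge">arena_alloc</code> are names assumed
for illustration, and the backing memory could be a static array or a region
obtained from <code class="language-plaintext highlighter-rouge">VirtualAlloc</code>.</p>

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;  // next free byte
    char *end;  // one past the last usable byte
} Arena;

// Bump-allocate count objects of the given size and alignment (a power
// of two). Returns uninitialized memory, or a null pointer when the
// request does not fit.
static void *arena_alloc(Arena *a, ptrdiff_t size, ptrdiff_t align,
                         ptrdiff_t count)
{
    ptrdiff_t pad   = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t avail = a->end - a->beg - pad;
    if (count < 0 || size <= 0 || avail < 0 || count > avail/size) {
        return 0;
    }
    char *r = a->beg + pad;
    a->beg = r + size*count;
    return r;
}
```

<p>With this, a local like <code class="language-plaintext highlighter-rouge">char buf[1&lt;&lt;12]</code> becomes
<code class="language-plaintext highlighter-rouge">arena_alloc(&amp;scratch, 1, 1, 1&lt;&lt;12)</code>, and the stack frame stays small enough
that no probe is needed.</p>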

<p>MSVC doesn’t have <code class="language-plaintext highlighter-rouge">libgcc</code> of course, but it still generates stack probes
both for growing the stack and for security checks. The latter requires
<code class="language-plaintext highlighter-rouge">kernel32.dll</code>, so if I compile the same program with MSVC, I get a bunch
of linker failures:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c /link /subsystem:console
... unresolved external symbol __imp_RtlCaptureContext ...
... and 7 more ...
</code></pre></div></div>

<p>Using <code class="language-plaintext highlighter-rouge">/Gs1000000000</code> turns off the stack probes, <code class="language-plaintext highlighter-rouge">/GS-</code> turns off the
checks, <code class="language-plaintext highlighter-rouge">/stack</code> commits a larger stack:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /GS- /Gs1000000000 example.c /link
     /subsystem:console /stack:0x200000,0x200000
</code></pre></div></div>

<p>Though, as before, better to avoid large stack frames in the first place.</p>

<h3 id="built-in-functions-ugh">Built-in functions… ugh</h3>

<p>The three major C and C++ compilers — GCC, MSVC, Clang — share a common,
evil weakness: “built-in” functions. <em>No matter what</em>, they each assume
you will supply definitions for standard string functions at link time,
particularly <code class="language-plaintext highlighter-rouge">memset</code> and <code class="language-plaintext highlighter-rouge">memcpy</code>. They do this no matter how many
“seriously now, do not use standard C functions” options you pass. When
you don’t link a CRT, you may need to define them yourself.</p>

<p>With GCC there’s a catch: it will transform your <code class="language-plaintext highlighter-rouge">memset</code> definition —
that is, <em>in a function named <code class="language-plaintext highlighter-rouge">memset</code></em> — into a call to itself. After
all, it looks an awful lot like <code class="language-plaintext highlighter-rouge">memset</code>! This typically manifests as an
infinite loop. <strong>Use <code class="language-plaintext highlighter-rouge">-fno-builtin</code> to prevent GCC from mis-compiling
built-in functions.</strong></p>

<p>Even with <code class="language-plaintext highlighter-rouge">-fno-builtin</code>, both GCC and Clang will continue inserting calls
to built-in functions elsewhere. For example, making an especially large
local variable (and using <code class="language-plaintext highlighter-rouge">volatile</code> to prevent it from being optimized
out):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As of this writing, the latest GCC and Clang will generate a <code class="language-plaintext highlighter-rouge">memset</code> call
despite <code class="language-plaintext highlighter-rouge">-fno-builtin</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -fno-builtin -nostdlib example.c
ld: ... undefined reference to `memset' ...
</code></pre></div></div>

<p>To be absolutely pure, you will need to address this in just about any
non-trivial program. On the other hand, <code class="language-plaintext highlighter-rouge">-nostartfiles</code> will grab a
definition from <code class="language-plaintext highlighter-rouge">msvcrt.dll</code> for you:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles example.c
$ objdump -p a.exe | grep -Fi .dll
        DLL Name: msvcrt.dll
</code></pre></div></div>

<p>To be clear, <em>this is a completely legitimate and pragmatic route!</em> You
get the benefits of both worlds: the CRT is still out of the way, but
there’s also no hassle from misbehaving compilers. If this sounds like a
good deal, then do it! (For on-lookers feeling smug: there is no such
easy, general solution for this problem on Linux.)</p>

<p>When you write your own definitions, I suggest putting each definition in
its own section so that they can be discarded via <code class="language-plaintext highlighter-rouge">-Wl,--gc-sections</code> when
unused:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">section</span><span class="p">(</span><span class="s">".text.memset"</span><span class="p">)))</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">memset</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far, for all three compilers, I’ve only needed to provide definitions
for <code class="language-plaintext highlighter-rouge">memset</code> and <code class="language-plaintext highlighter-rouge">memcpy</code>.</p>
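
<p>A minimal sketch of such definitions, favoring simplicity over speed. The
<code class="language-plaintext highlighter-rouge">volatile</code>-qualified destination pointers are an assumption of this sketch
rather than something prescribed above: they stop the optimizer from
recognizing the loops and replacing them with calls to the very functions
being defined, which complements <code class="language-plaintext highlighter-rouge">-fno-builtin</code>.</p>

```c
#include <stddef.h>

// Byte-at-a-time definitions. Stores through a volatile pointer cannot
// be merged into a library call, so these loops stay loops.
void *memset(void *d, int c, size_t n)
{
    volatile unsigned char *p = d;
    while (n--) {
        *p++ = (unsigned char)c;
    }
    return d;
}

void *memcpy(void *restrict d, const void *restrict s, size_t n)
{
    volatile unsigned char *p = d;
    const unsigned char *q = s;
    while (n--) {
        *p++ = *q++;
    }
    return d;
}
```

<p>These are slow compared to a tuned implementation, which rarely matters for
the occasional compiler-generated call.</p>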

<h3 id="stack-alignment-on-32-bit-x86">Stack alignment on 32-bit x86</h3>

<p>GCC expects a 16-byte aligned stack and generates code accordingly. Such
is dictated by the x64 ABI, so that’s a given on 64-bit Windows. However,
the x86 ABIs only guarantee 4-byte alignment. If no care is taken to deal
with it, there will likely be unaligned loads. Some may not be valid (e.g.
SIMD) leading to a crash. UBSan disapproves, too. Fortunately there’s a
function attribute for this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>GCC will now align the stack in this function’s prologue. Adjustment is
only necessary at entry points, as GCC will maintain alignment through its
own frames. This includes <em>all</em> entry points, not just the program entry
point, particularly thread start functions. Rule of thumb for i686 GCC:
<strong>If <code class="language-plaintext highlighter-rouge">WINAPI</code> or <code class="language-plaintext highlighter-rouge">__stdcall</code> appears in a definition, the stack likely
requires alignment</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="n">DWORD</span> <span class="n">WINAPI</span> <span class="nf">mythread</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s harmless to use this attribute on x64. The prologue will just be a
smidge larger. If you’re worried about it, use <code class="language-plaintext highlighter-rouge">#ifdef __i686__</code> to limit
it to 32-bit builds.</p>

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>If I’ve written a graphical application with <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code>, used
large stack frames, marked my entry point as externally visible, plan to
support 32-bit builds, and defined a couple of needed string functions, my
optimal entry point may look something like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef __GNUC__
</span><span class="n">__attribute</span><span class="p">((</span><span class="n">externally_visible</span><span class="p">))</span>
<span class="cp">#endif
#ifdef __i686__
</span><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="cp">#endif
</span><span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then my “optimize all the things” release build may look something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -O3 -fno-builtin -Wl,--gc-sections -s -nostdlib -mwindows
     -fno-asynchronous-unwind-tables -o app.exe app.c -lkernel32
</code></pre></div></div>

<p>Or with MSVC:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /O2 /GS- app.c /link kernel32.lib /subsystem:windows
</code></pre></div></div>

<p>Or if I’m taking it easy maybe just:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -O3 -fno-builtin -s -nostartfiles -mwindows -o app.exe app.c
</code></pre></div></div>

<p>Or with MSVC (linker flags in source):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /O2 app.c
</code></pre></div></div>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>SDL2 common mistakes and how to avoid them</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/01/08/"/>
    <id>urn:uuid:5b345c81-80d1-4459-981f-b5826a2bb8e7</id>
    <updated>2023-01-08T02:09:26Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://old.reddit.com/r/C_Programming/comments/106djd0/sdl2_common_mistakes_and_how_to_avoid_them/">on reddit</a>.</em></p>

<p><a href="https://www.libsdl.org/">SDL</a> has grown on me over the past year. I didn’t understand its value
until viewing it in the right lens: as a complete platform and runtime
replacing the host’s runtime, possibly including libc. Ideally an SDL
application links exclusively against SDL and otherwise not directly
against host libraries, though in practice it’s somewhat porous. With care
— particularly in avoiding mistakes covered in this article — that ideal
is quite achievable for C applications that fit within SDL’s feature set.</p>

<!--more-->

<p>SDL applications are always interesting one way or another, so I like to
dig in when I come across them. The items in this article are mistakes
I’ve either made myself or observed across many such passion projects in
the wild.</p>

<h3 id="mistake-1-not-using-sdl2-config">Mistake 1: Not using <code class="language-plaintext highlighter-rouge">sdl2-config</code></h3>

<p>This shell script comes with SDL2 and smooths over differences between
platforms, even when cross compiling. It informs your compiler where to
find and how to link SDL2. The script even works on Windows if you have a
unix shell, such as via <a href="https://github.com/skeeto/w64devkit">w64devkit</a>. Use it as a command substitution at
the end of the build command, particularly when using <code class="language-plaintext highlighter-rouge">--libs</code>. A one-shot
or <a href="https://en.wikipedia.org/wiki/Unity_build">unity build</a> (my preference) looks like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc app.c $(sdl2-config --cflags --libs)
</code></pre></div></div>

<p>Or under separate compilation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -c app.c $(sdl2-config --cflags)
$ cc app.o $(sdl2-config --libs)
</code></pre></div></div>

<p>Alternatively, static link by replacing <code class="language-plaintext highlighter-rouge">--libs</code> with <code class="language-plaintext highlighter-rouge">--static-libs</code>,
though this is discouraged by the SDL project. When dynamically linked,
users can, and do, trivially substitute a different SDL2 binary, such as
one patched for their system. In my experience, static linking works
reliably on Windows but poorly on Linux.</p>

<p>Alternatively, use the general purpose <code class="language-plaintext highlighter-rouge">pkg-config</code>. Don’t forget <code class="language-plaintext highlighter-rouge">eval</code>!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ eval cc app.c $(pkg-config sdl2 --cflags --libs)
</code></pre></div></div>

<p>I wrote <a href="/blog/2023/01/18/">a pkg-config for Windows</a> specifically for this case.</p>

<p>Caveats:</p>

<ul>
  <li>
    <p>Some circumstances require special treatment, and <code class="language-plaintext highlighter-rouge">sdl2-config</code> may be
too blunt a tool. That’s fine, but generally prefer <code class="language-plaintext highlighter-rouge">sdl2-config</code> as the
default approach.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">sdl2-config</code> does not support extensions such as <code class="language-plaintext highlighter-rouge">SDL2_image</code>, so you
will need to use <code class="language-plaintext highlighter-rouge">pkg-config</code>. Personally I don’t think they’re worth
the trouble when there’s <a href="https://github.com/nothings/stb">stb</a>, or <a href="/blog/2022/12/18/">QOI instead of PNG</a>.</p>
  </li>
  <li>
    <p>There’s an alternative build option using CMake, without any use of
<code class="language-plaintext highlighter-rouge">sdl2-config</code>, but I won’t discuss it here.</p>
  </li>
</ul>

<h3 id="mistake-2-including-sdl2sdlh">Mistake 2: Including <code class="language-plaintext highlighter-rouge">SDL2/SDL.h</code></h3>

<p>A lot of examples, including tutorials linked from the official SDL
website, have <code class="language-plaintext highlighter-rouge">SDL2/</code> in their include paths. That’s because they’re
making mistake 1, not using <code class="language-plaintext highlighter-rouge">sdl2-config</code>, and are instead relying on
Linux distributions having installed SDL2 in a place <em>coincidentally</em>
accessible through that include path.</p>

<p>This is annoying when SDL2 is <em>not</em> installed there, or if I don’t want it
using the system’s SDL2. Worse, it can result in subtly broken builds as
it mixes and matches different SDL installations. The correct SDL2 include
is the following:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"SDL.h"</span><span class="cp">
</span></code></pre></div></div>

<p>Note the quotes, which help prevent picking up an arbitrary system header
by accident. When carefully and narrowly targeting SDL-the-platform, this
will be the only “system” include anywhere in your application.</p>

<h3 id="mistake-3-not-surrendering-main">Mistake 3: Not surrendering <code class="language-plaintext highlighter-rouge">main</code></h3>

<p>A conventional SDL application has a <code class="language-plaintext highlighter-rouge">main</code> function defined in its
source, but despite the name, this is distinct from C <code class="language-plaintext highlighter-rouge">main</code>. To smooth
over <a href="/blog/2022/02/18/">platform differences</a>, SDL may rename the application’s <code class="language-plaintext highlighter-rouge">main</code>
to <code class="language-plaintext highlighter-rouge">SDL_main</code> and substitute its own C <code class="language-plaintext highlighter-rouge">main</code>. Because of this, <code class="language-plaintext highlighter-rouge">main</code>
must have the conventional <code class="language-plaintext highlighter-rouge">argc</code>/<code class="language-plaintext highlighter-rouge">argv</code> prototype and must return a
value. (As a special case, C permits <code class="language-plaintext highlighter-rouge">main</code> to implicitly <code class="language-plaintext highlighter-rouge">return 0</code>, so
it’s an easy mistake to make.)</p>

<p>With this in mind, the bare minimum SDL2 application:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"SDL.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Caveat: Like with <code class="language-plaintext highlighter-rouge">sdl2-config</code>, some special circumstances require
control over the application entry point — see <code class="language-plaintext highlighter-rouge">SDL_MAIN_HANDLED</code> and
<code class="language-plaintext highlighter-rouge">SDL_SetMainReady</code> — but that should be reserved until there’s a need.</p>

<p>One such special case is avoiding linking a CRT on Windows. In principle
it’s this simple:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"SDL.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SDL_SetMainReady</span><span class="p">();</span>
    <span class="c1">// ...</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then it’s <a href="/blog/2016/01/31/">the usual compiler and linker flags</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib -o app.exe app.c $(sdl2-config --cflags --libs)
</code></pre></div></div>

<p>This will create a tiny <code class="language-plaintext highlighter-rouge">.exe</code> that doesn’t link any system DLL, just
<code class="language-plaintext highlighter-rouge">SDL2.dll</code>. Quite platform agnostic indeed!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p app.exe | grep -Fi .dll
        DLL Name: SDL2.dll
</code></pre></div></div>

<p>Alas, as of this writing, this does not work reliably. SDL2’s accelerated
renderers on Windows do not clean up properly in <code class="language-plaintext highlighter-rouge">SDL_QuitSubSystem</code> nor
<code class="language-plaintext highlighter-rouge">SDL_Quit</code>, so the process cannot exit without calling ExitProcess in
<code class="language-plaintext highlighter-rouge">kernel32.dll</code> (or similar). This is still an open experiment.</p>

<h3 id="mistake-4-using-the-sdl-wiki-for-api-documentation">Mistake 4: Using the SDL wiki for API documentation</h3>

<p>The <a href="https://wiki.libsdl.org/SDL2/FrontPage">SDL wiki</a> is not authoritative documentation, merely a <em>convenient</em>
web-linkable — and downloadable (see “offline html”) — information source.
However, anyone who’s spent time on it can tell you it’s incomplete. The
authoritative API documentation is <em>the SDL headers</em>, which fortunately
are already on hand for building SDL applications. The SDL maintainers
<a href="https://www.youtube.com/playlist?list=PL6m6sxLnXksbqdsAcpTh4znV9j70WkmqG">themselves use the headers, not the wiki</a>.</p>

<p>If, like me, you’re using <a href="https://github.com/universal-ctags/ctags">ctags</a>, this is actually good news! With a
bit of configuration, you can jump to any bit of SDL documentation at any
time in your editor, treating the SDL headers like a hyperlinked wiki
built into your editor. Just like building, <code class="language-plaintext highlighter-rouge">sdl2-config</code> can tell ctags
where to find those headers:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ctags -a -R --kinds-c=dept $(sdl2-config --prefix)/include/SDL2
</code></pre></div></div>

<p>I’m using <code class="language-plaintext highlighter-rouge">-a</code> (<code class="language-plaintext highlighter-rouge">--append</code>) to append to the tags file I’ve already
generated for my own program, <code class="language-plaintext highlighter-rouge">-R</code> (<code class="language-plaintext highlighter-rouge">--recurse</code>) to automatically find all
the headers, and <code class="language-plaintext highlighter-rouge">--kinds-c=dept</code> to capture exactly the kinds of symbols I
care about — <code class="language-plaintext highlighter-rouge">#define</code>, <code class="language-plaintext highlighter-rouge">enum</code>, prototypes, <code class="language-plaintext highlighter-rouge">typedef</code> — no more no less.</p>

<p>In Vim I <code class="language-plaintext highlighter-rouge">CTRL-]</code> over any SDL symbol to jump to its documentation, and
then I can use it again within its documentation comment to jump further
still to any symbols it mentions, then finally use the jump or tag stack
to return. As long as I have <code class="language-plaintext highlighter-rouge">t</code> in <a href="https://vimdoc.sourceforge.net/htmldoc/options.html#'cpt'"><code class="language-plaintext highlighter-rouge">'complete'</code></a> (<code class="language-plaintext highlighter-rouge">'cpt'</code>), which
is the default, I can also “tab”-complete any SDL symbol using the tags
table. There are a few rough edges here and there, but overall it’s a
solid editing paradigm.</p>

<p>By the way, with <code class="language-plaintext highlighter-rouge">sdl2-config</code> in your <code class="language-plaintext highlighter-rouge">$PATH</code>, all the above works out of
the box in w64devkit! That’s where I’ve mostly been working with SDL.</p>

<h3 id="mistake-5-using-stdio-streams">Mistake 5: Using stdio streams</h3>

<p>A common bit of code in real SDL programs and virtually every tutorial:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">SDL_Init</span><span class="p">(...))</span> <span class="p">{</span>
    <span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"SDL_Init(): %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">());</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is not ideal:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">fprintf</code> is not part of the SDL platform. This is going behind SDL’s
back, reaching around the abstraction to a different platform. Strictly
speaking, this API may not even be available to an SDL application.</p>
  </li>
  <li>
    <p>SDL applications are graphical, so <code class="language-plaintext highlighter-rouge">stderr</code> is likely disconnected from
anything useful. Few would ever see this message.</p>
  </li>
</ul>

<p>Fortunately SDL provides two alternatives:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">SDL_Log</code>: like C <code class="language-plaintext highlighter-rouge">printf</code>, but SDL will strive to connect it to
somewhere useful. If the application was launched from a terminal or
console, SDL will find it and hook it up to the logger. On Windows, if
there’s a debugger attached, SDL will use <a href="https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw">OutputDebugString</a> to
send logs to the debugger.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">SDL_ShowSimpleMessageBox</code>: using any means possible, attempt to display
a message to the user. Like <code class="language-plaintext highlighter-rouge">SDL_Log</code>, it’s safe to use before/without
initializing SDL subsystems.</p>
  </li>
</ul>

<p>If you’re paranoid, you could even use both:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">SDL_Init</span><span class="p">(...))</span> <span class="p">{</span>
    <span class="n">SDL_ShowSimpleMessageBox</span><span class="p">(</span>
        <span class="n">SDL_MESSAGEBOX_ERROR</span><span class="p">,</span> <span class="s">"SDL_Init()"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">(),</span> <span class="mi">0</span>
    <span class="p">);</span>
    <span class="n">SDL_Log</span><span class="p">(</span><span class="s">"SDL_Init(): %s"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">());</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Though note that <code class="language-plaintext highlighter-rouge">SDL_ShowSimpleMessageBox</code> can fail, which will set a
new, different error message for <code class="language-plaintext highlighter-rouge">SDL_Log</code>!</p>

<p>There’s a similar story again with <code class="language-plaintext highlighter-rouge">fopen</code> and loading assets. SDL has an
I/O API, <code class="language-plaintext highlighter-rouge">SDL_RWops</code>. It’s probably better than the host’s C equivalent,
particularly with regard to paths. If you’re not already embedding your
assets, use the SDL API instead.</p>

<h3 id="mistake-6-using-sdl_renderer_accelerated">Mistake 6: Using <code class="language-plaintext highlighter-rouge">SDL_RENDERER_ACCELERATED</code></h3>

<p>This flag — and its surrounding bit set, <code class="language-plaintext highlighter-rouge">SDL_RendererFlags</code> — are a
subtle design flaw in the SDL2 API. Its existence is misleading, leading
to widespread misuse. It does not help that the documentation, both header
and wiki, is incomplete and unclear. The <code class="language-plaintext highlighter-rouge">SDL_CreateRenderer</code> function
accepts a bit set as its third argument, and it serves two simultaneous
purposes:</p>

<ul>
  <li>
    <p>Indicates <em>mandatory</em> properties of the renderer. Examples: “must use
accelerated rendering,” “must use software rendering,” “must support
vertical synchronization (vsync).” Drivers without the chosen properties
are skipped.</p>
  </li>
  <li>
    <p>If <code class="language-plaintext highlighter-rouge">SDL_RENDERER_PRESENTVSYNC</code> is set, also enables vsync in the created
renderer.</p>
  </li>
</ul>

<p>The common mistake is thinking that this bit indicates preference: “prefer
an accelerated renderer if possible”. But it really means “accelerated
renderer or bust.”</p>

<p>Given a zero for renderer flags, SDL will first attempt to create an
accelerated renderer. Failing that, it will then attempt to create a
software renderer. A software renderer fallback is exactly the behavior
you want! After all, this fallback is one of the primary features of the
SDL renderer API. This is so straightforward there are no caveats.</p>

<h3 id="mistake-7-not-accounting-for-vsync">Mistake 7: Not accounting for vsync</h3>

<p>For a game, you probably ought to enable vsync in your renderer. The hint:
You’re using <code class="language-plaintext highlighter-rouge">SDL_PollEvent</code> in your main event loop. Otherwise you will
waste lots of resources rendering thousands of frames per second. If my
laptop fan spins up running your SDL application, it’s probably because
you didn’t do this. The following should be the most conventional SDL
renderer configuration:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">r</span> <span class="o">=</span> <span class="n">SDL_CreateRenderer</span><span class="p">(</span><span class="n">window</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">SDL_RENDERER_PRESENTVSYNC</span><span class="p">);</span>
</code></pre></div></div>

<p>The software renderer supports vsync, so it will not be excluded from the
driver search when vsync is requested.</p>

<p>That’s only for SDL renderers. If you’re using OpenGL, set a non-zero swap
interval with <code class="language-plaintext highlighter-rouge">SDL_GL_SetSwapInterval</code> so that <code class="language-plaintext highlighter-rouge">SDL_GL_SwapWindow</code> synchronizes. For the
other rendering APIs, consult their documentation. (I can only speak to
SDL and OpenGL from experience.)</p>

<p>Caveat: Beware accidentally relying on vsync for timing in your game. You
don’t want your game’s physics to depend on the host’s display speed. Even
the pros make this mistake from time to time.</p>
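
<p>A common remedy, sketched here as an illustration and not taken from any
SDL API, is a fixed timestep: accumulate the real time that has passed and
advance the simulation in constant-size steps, however often frames are
presented. The names <code class="language-plaintext highlighter-rouge">sim</code>, <code class="language-plaintext highlighter-rouge">sim_advance</code>, and <code class="language-plaintext highlighter-rouge">SIM_STEP_MS</code> are
hypothetical, as is the 10 ms step, and the elapsed time would come from
whatever clock the program already uses.</p>

```c
// Hypothetical fixed-timestep helper; names and step size are illustrative.
#define SIM_STEP_MS 10  // assumed simulation step (100 steps per second)

typedef struct {
    unsigned accum_ms;  // real time not yet consumed by the simulation
    long     steps;     // total simulation steps taken so far
} sim;

// Feed in the real time elapsed since the previous frame. Returns the
// number of fixed steps to run for this frame. The display rate affects
// only how often this is called, not how fast the simulation advances.
static int sim_advance(sim *s, unsigned elapsed_ms)
{
    int n = 0;
    s->accum_ms += elapsed_ms;
    while (s->accum_ms >= SIM_STEP_MS) {
        s->accum_ms -= SIM_STEP_MS;
        s->steps++;
        n++;
    }
    return n;
}
```

<p>Fed 16 ms frames or 8 ms frames, the simulation takes the same number of
steps per second. Only the number of steps per presented frame changes.</p>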

<p>However, if you’re <em>not</em> making a game – perhaps instead an <a href="https://caseymuratori.com/blog_0001">IMGUI</a>
application <em>without active animations</em> — there’s a good chance you don’t
need or want vsync. The hint: You’re using <code class="language-plaintext highlighter-rouge">SDL_WaitEvent</code> in your main
event loop.</p>

<p>In summary, graphical SDL applications fall into one of two cases:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">SDL_PollEvent</code> with vsync</li>
  <li><code class="language-plaintext highlighter-rouge">SDL_WaitEvent</code> without vsync</li>
</ul>

<h3 id="mistake-8-using-asserth-instead-of-sdl_assert">Mistake 8: Using <code class="language-plaintext highlighter-rouge">assert.h</code> instead of <code class="language-plaintext highlighter-rouge">SDL_assert</code></h3>

<p>Alright, this one isn’t so common, but I’d like to highlight it. <strong>The
<code class="language-plaintext highlighter-rouge">SDL_assert</code> macro is fantastic</strong>, easily beating <code class="language-plaintext highlighter-rouge">assert.h</code> which
<a href="/blog/2022/06/26/">doesn’t even break in the right place</a>. It uses SDL to present a
user interface to the assertion, with support for retrying and ignoring.
It also works great under debuggers, breaking exactly as it should. I have
nothing but praise for it, so don’t pass up the chance to use it when you
can.</p>

<p>While I’m at it: during development and testing, <em>always always always</em> run
your application under a debugger. Don’t close the debugger, just launch
through it again after rebuilding. Also, enable UBSan and ASan when
available for the extra assertions.</p>

<h3 id="sdl-wishlist">SDL wishlist</h3>

<p>For months I had wondered why SDL provides no memory allocation API. I’m
fine if it doesn’t have a general purpose allocator since I just want to
grab a chunk of host memory <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">for an arena</a>. However, SDL <em>does</em>
have allocation functions (<code class="language-plaintext highlighter-rouge">SDL_malloc</code>, etc.). I didn’t know about them
until I stopped making mistake 4.</p>

<p>It was the same story again with math functions: I’d like not to stray
from SDL as a platform, but what if I need transcendental functions? I
could whip up crude implementations myself, but I’d prefer not. SDL has
those too: <code class="language-plaintext highlighter-rouge">SDL_sin</code>, etc. Caveat: The <code class="language-plaintext highlighter-rouge">math.h</code> functions are built-ins,
and compilers use that information to better optimize programs, e.g. cool
stuff like <code class="language-plaintext highlighter-rouge">-mrecip</code>, or SIMD vectorization. That cannot be done with
SDL’s equivalents.</p>

<p>I’m surprised SDL has no random number generator considering how important
it is to games. Since I <a href="/blog/2017/09/21/">prefer to handle this myself</a>, I don’t mind
that so much, but it does leave a lot of toy programs out there calling C
<code class="language-plaintext highlighter-rouge">rand</code>. I <em>would</em> like it if SDL provided <a href="/blog/2019/04/30/">a single, good seed early during
startup</a>. There isn’t even a wall clock function for the classic
<code class="language-plaintext highlighter-rouge">srand(time(0))</code> seeding event! My solution has been to mix event
timestamps into the random state:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">Uint32</span> <span class="nf">rand32</span><span class="p">(</span><span class="n">Uint64</span> <span class="o">*</span><span class="p">);</span>

<span class="n">Uint64</span> <span class="n">rng</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">SDL_Event</span> <span class="n">e</span><span class="p">;</span> <span class="n">SDL_PollEvent</span><span class="p">(</span><span class="o">&amp;</span><span class="n">e</span><span class="p">);)</span> <span class="p">{</span>
    <span class="n">rng</span> <span class="o">^=</span> <span class="n">e</span><span class="p">.</span><span class="n">common</span><span class="p">.</span><span class="n">timestamp</span><span class="p">;</span>
    <span class="n">rand32</span><span class="p">(</span><span class="o">&amp;</span><span class="n">rng</span><span class="p">);</span>  <span class="c1">// stir</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
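
<p>The snippet above only declares <code class="language-plaintext highlighter-rouge">rand32</code>. As a minimal sketch, assuming
nothing about the generator actually used, a truncated 64-bit LCG with
Knuth’s MMIX constants would fit that prototype. It is written against
<code class="language-plaintext highlighter-rouge">stdint.h</code> types so that it stands alone; SDL’s <code class="language-plaintext highlighter-rouge">Uint64</code> and <code class="language-plaintext highlighter-rouge">Uint32</code> are
typedefs for the same types. The helper <code class="language-plaintext highlighter-rouge">rng_mix</code> is a hypothetical name
for the two mixing lines in the loop.</p>

```c
#include <stdint.h>

// Assumed definition, not necessarily the author's: a truncated 64-bit
// LCG using Knuth's MMIX constants. It returns the high 32 bits, which
// are the better-distributed bits of an LCG state.
static uint32_t rand32(uint64_t *s)
{
    *s = *s*6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(*s >> 32);
}

// Hypothetical helper: fold an event timestamp into the state, then stir
// so that the new bits spread through the whole state.
static void rng_mix(uint64_t *s, uint32_t timestamp)
{
    *s ^= timestamp;
    rand32(s);
}
```

<p>Identical timestamp sequences produce identical states, so this only helps
because input timing differs from run to run.</p>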

<p>As I learn more in the future, I may come back and add to this list. At
the very least I expect to use SDL increasingly in my own projects.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>How to build a WaitGroup from a 32-bit integer</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/05/"/>
    <id>urn:uuid:cc83b101-2d77-42b8-b409-d4ed36831479</id>
    <updated>2022-10-05T03:19:07Z</updated>
    <category term="c"/><category term="linux"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Go has a nifty synchronization utility called a <a href="https://godocs.io/sync#WaitGroup">WaitGroup</a>, on which
one or more goroutines can wait for concurrent task completion. In other
languages, the usual task completion convention is <em>joining</em> threads doing
the work. In Go, goroutines aren’t values and lack handles, so a WaitGroup
replaces joins. Building a WaitGroup using typical, portable primitives is
a messy affair involving constructors and destructors, managing lifetimes.
However, on at least Linux and Windows, we can build a WaitGroup out of a
zero-initialized integer, much like my <a href="/blog/2022/05/14/">32-bit queue</a> and <a href="/blog/2022/03/13/">32-bit
barrier</a>.</p>

<p>In case you’re not familiar with it, a typical WaitGroup use case in Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">var</span> <span class="n">wg</span> <span class="n">sync</span><span class="o">.</span><span class="n">WaitGroup</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">task</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">tasks</span> <span class="p">{</span>
    <span class="n">wg</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
    <span class="k">go</span> <span class="k">func</span><span class="p">(</span><span class="n">t</span> <span class="n">Task</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// ... do task ...</span>
        <span class="n">wg</span><span class="o">.</span><span class="n">Done</span><span class="p">()</span>
    <span class="p">}(</span><span class="n">task</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">wg</span><span class="o">.</span><span class="n">Wait</span><span class="p">()</span>
</code></pre></div></div>

<p>I zero-initialize the WaitGroup, the main goroutine increments the counter
before starting each task goroutine, each goroutine decrements the counter
when done, and the main goroutine waits until the counter reaches zero. My
goal is to build the same mechanism in C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">workfunc</span><span class="p">(</span><span class="n">task</span> <span class="n">t</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do task ...</span>
    <span class="n">waitgroup_done</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">int</span> <span class="n">wg</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ntasks</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">waitgroup_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">go</span><span class="p">(</span><span class="n">workfunc</span><span class="p">,</span> <span class="n">tasks</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">waitgroup_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When it’s done, the WaitGroup is back to zero, and no cleanup is required.</p>

<p>I’m going to take it a little further than that: Since its meaning and
contents are explicit, you may initialize a WaitGroup to any non-negative
task count! In other words, <code class="language-plaintext highlighter-rouge">waitgroup_add</code> is optional if the total
number of tasks is known up front.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">wg</span> <span class="o">=</span> <span class="n">ntasks</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ntasks</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">go</span><span class="p">(</span><span class="n">workfunc</span><span class="p">,</span> <span class="n">tasks</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">waitgroup_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
</code></pre></div></div>

<p>A sneak peek at the full source: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/waitgroup.c"><code class="language-plaintext highlighter-rouge">waitgroup.c</code></a></strong></p>

<h3 id="the-four-elements-of-synchronization">The four elements (of synchronization)</h3>

<p>To build this WaitGroup, we’re going to need four primitives from the host
platform, each operating on an <code class="language-plaintext highlighter-rouge">int</code>. The first two are atomic operations,
and the second two interact with the system scheduler. To port the
WaitGroup to a platform you need only implement these four functions,
typically as one-liners.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span>  <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>           <span class="c1">// atomic load</span>
<span class="k">static</span> <span class="kt">int</span>  <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>  <span class="c1">// atomic add-then-fetch</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>      <span class="c1">// wait on change at address</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>           <span class="c1">// wake all waiters by address</span>
</code></pre></div></div>

<p>The first two should be self-explanatory. The <code class="language-plaintext highlighter-rouge">wait</code> function waits for
the pointed-at integer to change its value, and the second argument is its
expected current value. The scheduler will double-check the integer before
putting the thread to sleep in case it changes at the last moment — in
other words, an atomic check-then-maybe-sleep. The <code class="language-plaintext highlighter-rouge">wake</code> function is the
other half. After changing the integer, a thread uses it to wake all
threads waiting for the pointed-at integer to change. Together, this
mechanism is known as a <em>futex</em>.</p>
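
<p>To make the contract concrete, here is a deliberately naive stand-in for
all four, assuming only the GCC/Clang <code class="language-plaintext highlighter-rouge">__atomic</code> built-ins and POSIX
<code class="language-plaintext highlighter-rouge">sched_yield</code>. It is an illustration and not a port from the linked
source: <code class="language-plaintext highlighter-rouge">wait</code> polls instead of sleeping, so <code class="language-plaintext highlighter-rouge">wake</code> has nothing to do. A
real port would replace the last two with the platform’s futex.</p>

```c
#include <sched.h>  // sched_yield (POSIX), assumed available on the target

// Atomic load of the pointed-at integer.
static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

// Atomically add, then return the new value.
static int addfetch(int *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

// Polling stand-in for a futex wait: return once the integer no longer
// holds the expected value. If it already differs, return immediately.
static void wait(int *p, int expected)
{
    while (load(p) == expected) {
        sched_yield();  // give up the time slice, then check again
    }
}

// Polling waiters notice the change on their own, so there is nobody
// to wake explicitly.
static void wake(int *p)
{
    (void)p;
}
```

<p>Since <code class="language-plaintext highlighter-rouge">wait</code> may return any time the value differs from the expected one,
polling satisfies the same interface, at the cost of burning CPU time
while waiting.</p>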

<p>I’m going to simplify the WaitGroup semantics a bit in order to make my
implementation even simpler. Go’s WaitGroup allows adding negatives, and
the <code class="language-plaintext highlighter-rouge">Add</code> method essentially does double-duty. My version forbids adding
negatives. That means the “add” operation is just an atomic increment:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_add</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">,</span> <span class="kt">int</span> <span class="n">delta</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">addfetch</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="n">delta</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since it cannot bring the counter to zero, there’s nothing else to do. The
“done” operation <em>can</em> decrement to zero:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_done</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">addfetch</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">wake</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the atomic decrement brought the count to zero, we finished the last
task, so we need to wake the waiters. We don’t know if anyone is actually
waiting, but that’s fine. Some futex use cases will avoid making the
relatively expensive system call if nobody’s waiting — i.e. don’t waste
time on a system call for each unlock of an uncontended mutex — but in the
typical WaitGroup case we <em>expect</em> a waiter when the count finally goes to
zero. That’s the common case.</p>

<p>The most complicated of the three is waiting:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="n">load</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">wait</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="n">c</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First check if the count is already zero and return if it is. Otherwise
use the futex to <em>wait for it to change</em>. Unfortunately that’s not exactly
the semantics we want, which would be to wait for a certain target. This
doesn’t break the wait, but it’s a potential source of inefficiency. If a
thread finishes a task between our load and wait, we don’t go to sleep,
and instead try again. However, in practice, I ran thousands of threads
through this thing concurrently and I couldn’t observe such a “miss.” As
far as I can tell, it’s so rare it doesn’t matter.</p>

<p>If this were a concern, the WaitGroup could instead be a pair of integers:
the counter and a “latch” that is either 0 or 1. Waiters wait on the
latch, and the latch is modified (atomically) when the counter transitions
to or from zero. That gives waiters a stable value on which to wait,
proxying the counter. However, since this doesn’t seem to matter in
practice, I prefer the elegance and simplicity of the single-integer
WaitGroup.</p>

<h3 id="four-elements-linux">Four elements: Linux</h3>

<p>With the WaitGroup done at a high level, we now need the per-platform
parts. Both GCC and Clang support <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/_005f_005fatomic-Builtins.html">GNU-style atomics</a>, so I’ll just
assume these are available on Linux without worrying about the compiler.
The first two functions wrap these built-ins:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">__atomic_load_n</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">addend</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">__atomic_add_fetch</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">addend</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">wait</code> and <code class="language-plaintext highlighter-rouge">wake</code> we need the <a href="https://man7.org/linux/man-pages/man2/futex.2.html"><code class="language-plaintext highlighter-rouge">futex(2)</code> system call</a>. In an
attempt to discourage its direct use, glibc doesn’t wrap this system call
in a function, so we must make the system call ourselves.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">current</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">current</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="n">INT_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">INT_MAX</code> means “wake as many as possible.” The other common value is
1 for waking a single waiter. Also, these system calls can’t meaningfully
fail, so there’s no need to check the return value. If <code class="language-plaintext highlighter-rouge">wait</code> wakes up
early (e.g. <code class="language-plaintext highlighter-rouge">EINTR</code>), it’s going to check the counter again anyway. In
fact, if your kernel is more than 20 years old, predating futexes, and
returns <code class="language-plaintext highlighter-rouge">ENOSYS</code> (“Function not implemented”), it will <em>still</em> work
correctly, though it will be incredibly inefficient.</p>
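
<p>Putting the Linux pieces together, here’s a minimal sketch of a complete exercise, not a definitive implementation. The <code class="language-plaintext highlighter-rouge">waitgroup_add</code> and <code class="language-plaintext highlighter-rouge">waitgroup_done</code> bodies are assumed from the description above (add to the counter, wake waiters when it reaches zero). The futex wrappers are renamed <code class="language-plaintext highlighter-rouge">futex_wait</code> and <code class="language-plaintext highlighter-rouge">futex_wake</code> so the sketch cannot collide with POSIX <code class="language-plaintext highlighter-rouge">wait</code>. The names <code class="language-plaintext highlighter-rouge">struct task</code>, <code class="language-plaintext highlighter-rouge">worker</code>, and <code class="language-plaintext highlighter-rouge">demo</code> are hypothetical, and pthreads is assumed for the threads themselves.</p>

```c
#ifndef _GNU_SOURCE
#  define _GNU_SOURCE  // for the syscall() declaration
#endif
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static int addfetch(int *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

// Sleep only if *p still equals current; return values are ignored.
static void futex_wait(int *p, int current)
{
    syscall(SYS_futex, p, FUTEX_WAIT, current, NULL, NULL, 0);
}

// Wake every thread waiting on p.
static void futex_wake(int *p)
{
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

// Assumed shape of the two simpler operations.
static void waitgroup_add(int *wg, int delta)
{
    addfetch(wg, delta);
}

static void waitgroup_done(int *wg)
{
    if (!addfetch(wg, -1)) {
        futex_wake(wg);  // count reached zero: release the waiters
    }
}

static void waitgroup_wait(int *wg)
{
    for (;;) {
        int c = load(wg);
        if (!c) {
            break;
        }
        futex_wait(wg, c);
    }
}

struct task { int *wg; int *sum; int value; };  // hypothetical payload

static void *worker(void *arg)
{
    struct task *t = arg;
    addfetch(t->sum, t->value);
    waitgroup_done(t->wg);
    return NULL;
}

// Start n (clamped to 0..64) detached workers adding 1..n to a shared
// sum, wait on the WaitGroup, and return the sum.
static int demo(int n)
{
    static int wg, sum;  // static: outlives a late futex_wake
    static struct task tasks[64];
    if (n < 0) n = 0;
    if (n > 64) n = 64;
    __atomic_store_n(&sum, 0, __ATOMIC_SEQ_CST);
    waitgroup_add(&wg, n);
    for (int i = 0; i < n; i++) {
        tasks[i] = (struct task){&wg, &sum, i + 1};
        pthread_t thr;
        if (pthread_create(&thr, NULL, worker, &tasks[i])) {
            worker(&tasks[i]);  // could not start a thread: run inline
        } else {
            pthread_detach(thr);
        }
    }
    waitgroup_wait(&wg);
    return load(&sum);
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">demo(8)</code> starts eight detached threads and returns once every one of them has called <code class="language-plaintext highlighter-rouge">waitgroup_done</code>, with no joins involved.</p>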

<h3 id="four-elements-windows">Four elements: Windows</h3>

<p>Windows didn’t support futexes until Windows 8 in 2012, and Microsoft was
still supporting versions of Windows without them into 2020, so they’re
still relatively “new” for this platform. Nonetheless, they’re now mature
enough that we can count on them being available.</p>

<p>I’d like to support both GCC-ish (<a href="https://github.com/skeeto/w64devkit">via Mingw-w64</a>) and MSVC-ish
compilers. Mingw-w64 provides a compatible <code class="language-plaintext highlighter-rouge">intrin.h</code>, so I can stick to
MSVC-style atomics and cover both at once. On the other hand, MSVC doesn’t
define atomics for <code class="language-plaintext highlighter-rouge">int</code> (or even <code class="language-plaintext highlighter-rouge">int32_t</code>), strictly <code class="language-plaintext highlighter-rouge">long</code>, so I have
to sneak in a little cast. (Recall: <code class="language-plaintext highlighter-rouge">sizeof(long) == sizeof(int)</code> on every
version of Windows supporting futexes.) The other option is to <code class="language-plaintext highlighter-rouge">typedef</code>
the WaitGroup so that it’s <code class="language-plaintext highlighter-rouge">int</code> on Linux (for the futex) and <code class="language-plaintext highlighter-rouge">long</code> on
Windows (for atomics).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">_InterlockedOr</span><span class="p">((</span><span class="kt">long</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">addend</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">addend</span> <span class="o">+</span> <span class="n">_InterlockedExchangeAdd</span><span class="p">((</span><span class="kt">long</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="p">,</span> <span class="n">addend</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
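
<p>For comparison, the <code class="language-plaintext highlighter-rouge">typedef</code> alternative mentioned above might look roughly like the following. This is only a sketch: the <code class="language-plaintext highlighter-rouge">waitgroup</code> type name and the size check are assumptions for illustration, and the casts disappear at the cost of a platform-dependent type in the interface.</p>

```c
#include <assert.h>

#ifdef _WIN32
#  include <intrin.h>
typedef long waitgroup;  // the type the MSVC-style intrinsics expect

static int load(waitgroup *p)
{
    return _InterlockedOr(p, 0);
}

static int addfetch(waitgroup *p, int addend)
{
    return addend + _InterlockedExchangeAdd(p, addend);
}
#else
typedef int waitgroup;   // the 32-bit word a Linux futex operates on

static int load(waitgroup *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static int addfetch(waitgroup *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}
#endif

// Compile-time check: either way the counter must be a 32-bit word.
typedef char waitgroup_is_32_bits[sizeof(waitgroup) == 4 ? 1 : -1];
```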

<p>The official, sanctioned futex functions are <a href="https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitonaddress">WaitOnAddress</a> and
<a href="https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-wakebyaddressall">WakeByAddressAll</a>. They <a href="https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/CALK-3m%2B6tX_ubMVGV7NarAm6VH0AoOp5THyXfEUA%3DTjyu5L%3Dxw%40mail.gmail.com/">used to be in <code class="language-plaintext highlighter-rouge">kernel32.dll</code></a>, but as of
this writing they live in <code class="language-plaintext highlighter-rouge">API-MS-Win-Core-Synch-l1-2-0.dll</code>, linked via
<code class="language-plaintext highlighter-rouge">-lsynchronization</code>. Gross. Since I can’t stomach this, I instead call the
low-level RTL functions where they’re actually implemented: RtlWaitOnAddress
and RtlWakeAddressAll. These live in the nice neighborhood of <code class="language-plaintext highlighter-rouge">ntdll.dll</code>.
They’re undocumented as far as I can tell, but thankfully <a href="https://github.com/wine-mirror/wine/blob/master/dlls/ntdll/sync.c">Wine comes to
the rescue</a>, providing both documentation and several different
implementations. Reading through it is educational, and hints at ways to
construct futexes on systems lacking them.</p>

<p>These functions aren’t declared in any headers, so I have to do it myself.
On the plus side, so far I haven’t paid the substantial compile-time costs
of <a href="https://web.archive.org/web/20090912002357/http://www.tilander.org/aurora/2008/01/include-windowsh.html">including <code class="language-plaintext highlighter-rouge">windows.h</code></a>, and so I can continue avoiding it. These
functions <em>are</em> listed in the <code class="language-plaintext highlighter-rouge">ntdll.dll</code> import library, so I don’t need
to <a href="/blog/2021/05/31/">invent the import library entries</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="kr">__stdcall</span> <span class="nf">RtlWaitOnAddress</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="kr">__stdcall</span> <span class="nf">RtlWakeAddressAll</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Rather conveniently, the semantics perfectly line up with Linux futexes!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">current</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlWaitOnAddress</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">current</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">),</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlWakeAddressAll</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like with Linux, there’s no meaningful failure, so the return values don’t
matter.</p>

<p>That’s the whole implementation. Considering just a single platform, a
flexible, lightweight, and easy-to-use synchronization facility in ~50
lines of relatively simple code is a pretty good deal if you ask me!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>My new debugbreak command</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/07/31/"/>
    <id>urn:uuid:c333d1ab-86b5-4389-b2b7-325d0eb90987</id>
    <updated>2022-07-31T12:59:59Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>I <a href="/blog/2022/06/26/">previously mentioned</a> the Windows feature where <a href="https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-registerhotkey">pressing
F12</a> in a debuggee window causes it to break in the debugger. It
works with any debugger — GDB, RemedyBG, Visual Studio, etc. — since the
hotkey simply raises a breakpoint <a href="https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp">structured exception</a>. It’s been
surprisingly useful, and I’ve wanted it available in more contexts, such
as console programs or even on Linux. The result is a new <a href="https://github.com/skeeto/w64devkit/blob/4282797/src/debugbreak.c"><code class="language-plaintext highlighter-rouge">debugbreak</code>
command</a>, now included in <a href="/blog/2020/05/15/">w64devkit</a>. Though, of course, you
already have <a href="/blog/2020/09/25/">everything you need</a> to build it and try it out right
now. I’ve also worked out a Linux implementation.</p>

<p>It’s named after an <a href="https://docs.microsoft.com/en-us/visualstudio/debugger/debugbreak-and-debugbreak">MSVC intrinsic and Win32 function</a>. It takes no
arguments, and its operation is indiscriminate: It raises a breakpoint
exception in <em>all</em> debuggee processes system-wide. Reckless? Perhaps, but
certainly convenient. You don’t need to tell it which process you want to
pause. It just works, and a good debugging experience is one of ease and
convenience.</p>

<p>The linchpin is <a href="https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-debugbreakprocess">DebugBreakProcess</a>. The command walks the process
list and fires this function at each process. Nothing happens for programs
without a debugger attached, so it doesn’t even bother checking if it’s a
debuggee. It couldn’t be simpler. I’ve used it on everything from Windows
XP to Windows 11, and it’s worked flawlessly.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span> <span class="n">s</span> <span class="o">=</span> <span class="n">CreateToolhelp32Snapshot</span><span class="p">(</span><span class="n">TH32CS_SNAPPROCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">PROCESSENTRY32W</span> <span class="n">p</span> <span class="o">=</span> <span class="p">{</span><span class="k">sizeof</span><span class="p">(</span><span class="n">p</span><span class="p">)};</span>
<span class="k">for</span> <span class="p">(</span><span class="n">BOOL</span> <span class="n">r</span> <span class="o">=</span> <span class="n">Process32FirstW</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">);</span> <span class="n">r</span><span class="p">;</span> <span class="n">r</span> <span class="o">=</span> <span class="n">Process32NextW</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">))</span> <span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">OpenProcess</span><span class="p">(</span><span class="n">PROCESS_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">th32ProcessID</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">DebugBreakProcess</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
        <span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I use it almost exclusively from Vim, where I’ve given it a <a href="https://learnvimscriptthehardway.stevelosh.com/chapters/06.html">leader
mapping</a>. With the editor focused, I can type backslash then
<kbd>d</kbd> to pause the debuggee.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">map</span> <span class="p">&lt;</span>leader<span class="p">&gt;</span><span class="k">d</span> <span class="p">:</span><span class="k">call</span> <span class="nb">system</span><span class="p">(</span><span class="s2">"debugbreak"</span><span class="p">)&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
</code></pre></div></div>

<p>With the debuggee paused, I’m free to add new breakpoints or watchpoints,
or print the call stack to see what the heck it’s busy doing. The
mechanism behind DebugBreakProcess is to create a new thread in the
target, with that thread raising the breakpoint exception. The debugger
will be stopped in this new thread. In GDB you can use the <code class="language-plaintext highlighter-rouge">thread</code>
command to switch over to the thread that actually matters, usually <code class="language-plaintext highlighter-rouge">thr
1</code>.</p>

<h3 id="debugbreak-on-linux">debugbreak on Linux</h3>

<p>On unix-like systems the equivalent of a breakpoint exception is a
<code class="language-plaintext highlighter-rouge">SIGTRAP</code>. There’s already a standard command for sending signals,
<a href="https://man7.org/linux/man-pages/man1/kill.1.html"><code class="language-plaintext highlighter-rouge">kill</code></a>, so a <code class="language-plaintext highlighter-rouge">debugbreak</code> command can be built using nothing more
than a few lines of shell script. However, unlike DebugBreakProcess,
signaling every process with <code class="language-plaintext highlighter-rouge">SIGTRAP</code> will only end in tears. The script
will need a way to determine which processes are debuggees.</p>

<p>Linux exposes processes in the file system as virtual files under <code class="language-plaintext highlighter-rouge">/proc</code>,
where each process appears as a directory. Its <code class="language-plaintext highlighter-rouge">status</code> file includes a
<code class="language-plaintext highlighter-rouge">TracerPid</code> field, which will be non-zero for debuggees. The script
inspects this field, and if non-zero sends a <code class="language-plaintext highlighter-rouge">SIGTRAP</code>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="k">for </span>pid <span class="k">in</span> <span class="si">$(</span>find /proc <span class="nt">-maxdepth</span> 1 <span class="nt">-printf</span> <span class="s1">'%f\n'</span> | <span class="nb">grep</span> <span class="s1">'^[0-9]\+$'</span><span class="si">)</span><span class="p">;</span> <span class="k">do
    </span><span class="nb">grep</span> <span class="nt">-q</span> <span class="s1">'^TracerPid:\s[^0]'</span> /proc/<span class="nv">$pid</span>/status 2&gt;/dev/null <span class="o">&amp;&amp;</span>
        <span class="nb">kill</span> <span class="nt">-TRAP</span> <span class="nv">$pid</span>
<span class="k">done</span>
</code></pre></div></div>

<p>This script, now part of <a href="/blog/2012/06/23/">my dotfiles</a>, has worked very well so
far, and effectively smooths over some debugging differences between
Windows and Linux, reducing my context-switching mental load. There’s
probably a better way to express this script, but that’s the best I could
do so far. On the BSDs you’d need to parse the output of <code class="language-plaintext highlighter-rouge">ps</code>, though each
system seems to do its own thing for distinguishing debuggees.</p>
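
<p>The same <code class="language-plaintext highlighter-rouge">TracerPid</code> check is straightforward in C, too. Here’s a minimal sketch, assuming the contents of a <code class="language-plaintext highlighter-rouge">status</code> file have already been read into a null-terminated buffer. The function name <code class="language-plaintext highlighter-rouge">tracer_pid</code> is hypothetical.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Return the TracerPid value found in the text of a /proc/PID/status
// file, or -1 if the field is missing. Non-zero means "being traced."
static long tracer_pid(const char *status)
{
    static const char key[] = "TracerPid:";
    for (const char *line = status; line;) {
        if (!strncmp(line, key, sizeof(key) - 1)) {
            // strtol skips the tab separating the key from the value
            return strtol(line + sizeof(key) - 1, 0, 10);
        }
        line = strchr(line, '\n');  // advance to the next line, if any
        if (line) {
            line++;
        }
    }
    return -1;
}
```

<p>A C version of the command would read each <code class="language-plaintext highlighter-rouge">/proc/PID/status</code>, pass the text to this function, and call <code class="language-plaintext highlighter-rouge">kill(pid, SIGTRAP)</code> when the result is greater than zero.</p>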

<h3 id="a-missing-feature">A missing feature</h3>

<p>I had originally planned for one flag, <code class="language-plaintext highlighter-rouge">-k</code>. Rather than breakpoint
debuggees, it would terminate all debuggee processes. This is especially
important on Windows where debuggee processes block builds due to file
locking shenanigans. I’d just run <code class="language-plaintext highlighter-rouge">debugbreak -k</code> as part of the build.
However, it’s not possible to terminate debuggees paused in the debugger —
the common situation. I’ve given up on this for now.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>The wild west of Windows command line parsing</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/02/18/"/>
    <id>urn:uuid:04c886e0-3434-4292-b7de-e8213461838c</id>
    <updated>2022-02-18T03:52:12Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>I’ve been experimenting again lately with <a href="/blog/2016/01/31/">writing software without a
runtime</a> aside from the operating system itself, both on Linux and
Windows. Another way to look at it: I write and embed a bespoke, minimal
runtime within the application. One of the runtime’s core jobs is
retrieving command line arguments from the operating system. On Windows
this is a deeper rabbit hole than I expected, and far more complex than I
realized. There is no standard, and every runtime does it a little
differently. Five different applications may see five different sets of
arguments — even different argument counts — from the same input, and this
is <em>before</em> any sort of option parsing. It’s truly a modern day Tower of
Babel: “Confound their command line parsing, that they may not understand
one another’s arguments.”</p>

<p>Unix-like systems pass the <code class="language-plaintext highlighter-rouge">argv</code> array directly from parent to child. On
Linux it’s literally copied onto the child’s stack just above the stack
pointer on entry. The runtime just bumps the stack pointer address a few
bytes and calls it <code class="language-plaintext highlighter-rouge">argv</code>. Here’s a minimalist x86-64 Linux runtime in
just 6 instructions (22 bytes):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">_start:</span> <span class="nf">mov</span>   <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="p">]</span>     <span class="c1">; argc</span>
        <span class="nf">lea</span>   <span class="nb">rsi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>   <span class="c1">; argv</span>
        <span class="nf">call</span>  <span class="nv">main</span>
        <span class="nf">mov</span>   <span class="nb">edi</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="mi">60</span>        <span class="c1">; SYS_exit</span>
        <span class="nf">syscall</span>
</code></pre></div></div>

<p>It’s 5 instructions (20 bytes) on ARM64:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">_start:</span> <span class="nf">ldr</span>  <span class="nv">w0</span><span class="p">,</span> <span class="p">[</span><span class="nb">sp</span><span class="p">]</span>        <span class="c1">; argc</span>
        <span class="nf">add</span>  <span class="nv">x1</span><span class="p">,</span> <span class="nb">sp</span><span class="p">,</span> <span class="mi">8</span>       <span class="c1">; argv</span>
        <span class="nf">bl</span>   <span class="nv">main</span>
        <span class="nf">mov</span>  <span class="nv">w8</span><span class="p">,</span> <span class="mi">93</span>          <span class="c1">; SYS_exit</span>
        <span class="nf">svc</span>  <span class="mi">0</span>
</code></pre></div></div>

<p>On Windows, <code class="language-plaintext highlighter-rouge">argv</code> is passed in serialized form as a string. That’s how
MS-DOS did it (via the <a href="https://en.wikipedia.org/wiki/Program_Segment_Prefix">Program Segment Prefix</a>), because <a href="http://www.gaby.de/cpm/manuals/archive/cpm22htm/ch5.htm">that’s how
CP/M did it</a>. It made more sense when processes were mostly launched
directly by humans: The string was literally typed by a human operator,
and <em>somebody</em> has to parse it after all. Today, processes are nearly
always launched by other programs, but despite this, must still serialize
the argument array into a string as though a human had typed it out.</p>

<p>Windows itself provides an operating system routine for parsing command
line strings: <a href="https://docs.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-commandlinetoargvw">CommandLineToArgvW</a>. Fetch the command line string
with <a href="https://docs.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getcommandlinew">GetCommandLineW</a>, pass it to this function, and you have your
<code class="language-plaintext highlighter-rouge">argc</code> and <code class="language-plaintext highlighter-rouge">argv</code>. Plus maybe LocalFree to clean up. It’s only available
in “wide” form, so <a href="/blog/2021/12/30/">if you want to work in UTF-8</a> you’ll also need
<code class="language-plaintext highlighter-rouge">WideCharToMultiByte</code>. It’s around 20 lines of C rather than 6 lines of
assembly, but it’s not too bad.</p>

<h3 id="my-getcommandlinew">My GetCommandLineW</h3>

<p>GetCommandLineW returns a pointer into static storage, which is why it
doesn’t need to be freed. More specifically, it comes from the <a href="https://docs.microsoft.com/en-us/windows/win32/api/winternl/ns-winternl-peb">Process
Environment Block</a>. This got me thinking: Could I locate this address
myself without the API call? First I needed to find the PEB. After some
research I found a PEB pointer in the <a href="https://en.wikipedia.org/wiki/Win32_Thread_Information_Block">Thread Information Block</a>,
itself found via the <code class="language-plaintext highlighter-rouge">gs</code> register (x64, <code class="language-plaintext highlighter-rouge">fs</code> on x86), an <a href="https://en.wikipedia.org/wiki/X86_memory_segmentation">old 386 segment
register</a>. Buried in the PEB is a <a href="https://docs.microsoft.com/en-us/windows/win32/api/subauth/ns-subauth-unicode_string"><code class="language-plaintext highlighter-rouge">UNICODE_STRING</code></a>, with the
command line string address. I worked out all the offsets for both x86 and
x64, and the whole thing is just three instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">wchar_t</span> <span class="o">*</span><span class="nf">cmdline_fetch</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="cp">#if __amd64
</span>    <span class="kr">__asm</span> <span class="p">(</span><span class="s">"mov %%gs:(0x60), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x20(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x78(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">cmd</span><span class="p">));</span>
    <span class="cp">#elif __i386
</span>    <span class="kr">__asm</span> <span class="p">(</span><span class="s">"mov %%fs:(0x30), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x10(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x44(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">cmd</span><span class="p">));</span>
    <span class="cp">#endif
</span>    <span class="k">return</span> <span class="n">cmd</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>From Windows XP through Windows 11, this returns exactly the same address
as GetCommandLineW. There’s little reason to do it this way other than to
annoy Raymond Chen, but it’s still neat and maybe has some super niche
use. Technically some of these offsets are undocumented and/or subject to
change, except Microsoft’s own static link CRT also hardcodes all these
offsets. It’s easy to find: disassemble any statically linked program,
look for the <code class="language-plaintext highlighter-rouge">gs</code> register, and you’ll find it using these offsets, too.</p>

<p>If you look carefully at the <code class="language-plaintext highlighter-rouge">UNICODE_STRING</code> you’ll see the length is
given by a <code class="language-plaintext highlighter-rouge">USHORT</code> in units of bytes, despite being a 16-bit <code class="language-plaintext highlighter-rouge">wchar_t</code>
string. This is <a href="https://devblogs.microsoft.com/oldnewthing/20031210-00/?p=41553">the source</a> of Windows’ maximum command line length
of <a href="https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessw">32,767 characters</a> (including terminator).</p>
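
<p>The arithmetic behind that limit can be written out as a small sketch. The struct below is a stand-in that mirrors the documented <code class="language-plaintext highlighter-rouge">UNICODE_STRING</code> layout using fixed-width types, and the names <code class="language-plaintext highlighter-rouge">struct unicode_string</code> and <code class="language-plaintext highlighter-rouge">max_cmdline_units</code> are made up for illustration.</p>

```c
#include <assert.h>
#include <stdint.h>

// Stand-in for UNICODE_STRING; the real definition comes from the
// Windows headers.
struct unicode_string {
    uint16_t  length;          // USHORT: bytes in use, no terminator
    uint16_t  maximum_length;  // USHORT: bytes of storage
    uint16_t *buffer;          // PWSTR: 16-bit code units
};

// Largest number of 16-bit units, terminator included, whose size in
// bytes still fits in the 16-bit maximum_length field: 65535 / 2.
static int max_cmdline_units(void)
{
    return UINT16_MAX / (int)sizeof(uint16_t);
}
```

<p>Here <code class="language-plaintext highlighter-rouge">max_cmdline_units()</code> evaluates to 32767, matching the documented limit.</p>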

<p>GetCommandLineW is from <code class="language-plaintext highlighter-rouge">kernel32.dll</code>, but CommandLineToArgvW is a bit
more off the beaten path in <code class="language-plaintext highlighter-rouge">shell32.dll</code>. If you wanted to avoid linking
to <code class="language-plaintext highlighter-rouge">shell32.dll</code> for <a href="https://randomascii.wordpress.com/2018/12/03/a-not-called-function-can-cause-a-5x-slowdown/">important reasons</a>, you’d need to do the
command line parsing yourself. Many runtimes, including Microsoft’s own
CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s
messier than I expected, and when I started digging into it I wasn’t
expecting it to involve a few days of research.</p>

<p>The CommandLineToArgvW documentation has a rough explanation: split
arguments on whitespace (without defining whitespace), quoting is involved,
and there’s something about counting backslashes, but only if they stop on
a quote. It’s not quite enough to implement your own, and if you test
against it, it’s quickly apparent that this documentation is at best
incomplete. It links to a deprecated page
about <a href="https://docs.microsoft.com/en-us/previous-versions/17w5ykft(v=vs.85)">parsing C++ command line arguments</a> with a few more details.
Unfortunately the algorithm described on this page is not the algorithm
used by CommandLineToArgvW, nor is it used by any runtime I could find. The
behavior even varies between Microsoft’s own CRTs. There is no canonical
command line parsing result, not even a <em>de facto</em> standard.</p>
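
<p>To make the rough rule concrete, here’s a minimal sketch of just the documented part: 2n backslashes followed by a quote yield n backslashes and toggle quoting, 2n+1 backslashes followed by a quote yield n backslashes and a literal quote, and backslashes not followed by a quote are literal. It is deliberately <em>not</em> a faithful CommandLineToArgvW: it assumes only space and tab are whitespace, works on <code class="language-plaintext highlighter-rouge">char</code> rather than <code class="language-plaintext highlighter-rouge">wchar_t</code>, and ignores both the special <code class="language-plaintext highlighter-rouge">argv[0]</code> rules and the various undocumented quirks. The name <code class="language-plaintext highlighter-rouge">next_arg</code> is hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Parse one argument from *src into dst, which must hold at least
// strlen(*src)+1 bytes. Advances *src past the argument and returns
// the number of bytes written, not counting the terminator.
static size_t next_arg(const char **src, char *dst)
{
    const char *s = *src;
    size_t len = 0;
    int quoted = 0;

    while (*s == ' ' || *s == '\t') {
        s++;  // skip leading whitespace
    }

    while (*s) {
        if (!quoted && (*s == ' ' || *s == '\t')) {
            break;  // unquoted whitespace ends the argument
        }
        if (*s == '\\') {
            size_t n = 0;
            while (s[n] == '\\') {
                n++;  // count the run of backslashes
            }
            if (s[n] == '"') {
                // Run stops on a quote: emit half of it
                for (size_t i = 0; i < n/2; i++) {
                    dst[len++] = '\\';
                }
                if (n % 2) {
                    dst[len++] = '"';  // odd: quote is literal
                } else {
                    quoted = !quoted;  // even: quote toggles quoting
                }
                s += n + 1;
            } else {
                // Run does not stop on a quote: all literal
                for (size_t i = 0; i < n; i++) {
                    dst[len++] = '\\';
                }
                s += n;
            }
        } else if (*s == '"') {
            quoted = !quoted;
            s++;
        } else {
            dst[len++] = *s++;
        }
    }

    dst[len] = 0;
    *src = s;
    return len;
}
```

<p>Calling it repeatedly on the same cursor walks the command line one argument at a time. Comparing its results against a real runtime is a quick way to see where the documented rule and actual behavior part ways.</p>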

<p>I eventually came across David Deley’s <a href="https://daviddeley.com/autohotkey/parameters/parameters.htm">How Command Line Parameters Are
Parsed</a>, which is the closest there is to an authoritative document on
the matter (<a href="https://web.archive.org/web/20210615061518/http://www.windowsinspired.com/how-a-windows-programs-splits-its-command-line-into-individual-arguments/">also</a>). Unfortunately it focuses on runtimes rather
than CommandLineToArgvW, and so some of those details aren’t captured. In
particular, the first argument (i.e. <code class="language-plaintext highlighter-rouge">argv[0]</code>) follows entirely different
rules, which really confused me for a while. The <a href="https://source.winehq.org/git/wine.git/blob/5a66eab72:/dlls/shcore/main.c#l264">Wine documentation</a>
was helpful particularly for CommandLineToArgvW. As far as I can tell,
they’ve re-implemented it perfectly, matching it bug-for-bug as they do.</p>

<h3 id="my-commandlinetoargvw">My CommandLineToArgvW</h3>

<p>Before finding any of this, I started building my own implementation,
which I now believe matches CommandLineToArgvW. These other documents
helped me figure out what I was missing. In my usual fashion, it’s <a href="/blog/2020/12/31/">a
little state machine</a>: <strong><a href="https://github.com/skeeto/scratch/blob/master/parsers/cmdline.c#L27"><code class="language-plaintext highlighter-rouge">cmdline.c</code></a></strong>. The interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">cmdline_to_argv8</span><span class="p">(</span><span class="k">const</span> <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">cmdline</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike the others, mine encodes straight into <a href="https://simonsapin.github.io/wtf-8/">WTF-8</a>, a superset of
UTF-8 that can round-trip ill-formed UTF-16. The WTF-8 part is negative
lines of code: invisible since it involves <em>not</em> reacting to ill-formed
input. If you use the new-ish UTF-8 manifest Win32 feature then your
program cannot handle command line strings with ill-formed UTF-16, a
problem solved by WTF-8.</p>
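
<p>As a rough illustration of why the WTF-8 part costs nothing, here is a minimal UTF-16 to WTF-8 encoder sketch. The name <code class="language-plaintext highlighter-rouge">wtf8_encode</code> and its signature are hypothetical, and this is not the code from <code class="language-plaintext highlighter-rouge">cmdline.c</code>. The point is the missing branch: an unpaired surrogate simply falls through to the ordinary 3-byte case.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Encode n UTF-16 code units as WTF-8. dst must have room for 3*n bytes.
 * Returns the number of bytes written. */
static size_t wtf8_encode(unsigned char *dst, const uint16_t *src, size_t n)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        if (c >= 0xd800 && c <= 0xdbff && i+1 < n &&
            src[i+1] >= 0xdc00 && src[i+1] <= 0xdfff) {
            /* Well-formed surrogate pair: combine into one code point. */
            uint32_t lo = src[++i];
            c = 0x10000 + ((c - 0xd800) << 10) + (lo - 0xdc00);
        }
        /* An unpaired surrogate arrives here unchanged and is encoded as
         * 3 bytes like any other 16-bit value. Strict UTF-8 would need an
         * error branch at this point; WTF-8 needs nothing. */
        if (c < 0x80) {
            dst[len++] = (unsigned char)c;
        } else if (c < 0x800) {
            dst[len++] = (unsigned char)(0xc0 | (c >> 6));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        } else if (c < 0x10000) {
            dst[len++] = (unsigned char)(0xe0 | (c >> 12));
            dst[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        } else {
            dst[len++] = (unsigned char)(0xf0 | (c >> 18));
            dst[len++] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        }
    }
    return len;
}
```

<p>A lone <code class="language-plaintext highlighter-rouge">0xD800</code> comes out as the bytes <code class="language-plaintext highlighter-rouge">ED A0 80</code>, which strict UTF-8 forbids but WTF-8 permits so that the original UTF-16 can be recovered.</p>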

<p>As documented, that <code class="language-plaintext highlighter-rouge">argv</code> must be a particular size — a pointer-aligned,
224kB (x64) or 160kB (x86) buffer — which covers the absolute worst case.
That’s not too bad when the command line is limited to 32,766 UTF-16
characters. The worst case argument is a single long sequence of 3-byte
UTF-8. 4-byte UTF-8 requires 2 UTF-16 code units, so there would only be
half as many. The worst case <code class="language-plaintext highlighter-rouge">argc</code> is 16,383 (plus one more <code class="language-plaintext highlighter-rouge">argv</code> slot
for the null pointer terminator), which is one argument for each pair of
command line characters. The second half (roughly) of the <code class="language-plaintext highlighter-rouge">argv</code> is
actually used as a <code class="language-plaintext highlighter-rouge">char</code> buffer for the arguments, so it’s all a single,
fixed allocation. There is no error case since it cannot fail.</p>
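
<p>The sizes above can be reproduced with a little arithmetic. The sketch below is a reconstruction from the numbers in this section, not the actual macro from the source, and the names <code class="language-plaintext highlighter-rouge">MAX_UNITS</code> and <code class="language-plaintext highlighter-rouge">argv_buffer_bytes</code> are hypothetical. It assumes one pointer slot per possible argument plus the null terminator, followed by enough bytes for the all-3-byte worst case, rounded up to the pointer size.</p>

```c
#include <stddef.h>

#define MAX_UNITS 32766  /* UTF-16 code units, excluding the terminator */

/* Bytes needed for a combined argv-plus-strings buffer when pointers are
 * ptrsize bytes wide. */
static size_t argv_buffer_bytes(size_t ptrsize)
{
    size_t slots  = MAX_UNITS/2 + 1;  /* 16,383 arguments + null pointer */
    size_t chars  = MAX_UNITS*3 + 1;  /* every unit 3 bytes, plus a NUL  */
    size_t padded = (chars + ptrsize - 1) / ptrsize * ptrsize;
    return slots*ptrsize + padded;
}
```

<p>For 8-byte pointers this gives 229,376 bytes, exactly 224kB, and for 4-byte pointers 163,836 bytes, just under 160kB.</p>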

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[</span><span class="n">CMDLINE_ARGV_MAX</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">argc</span> <span class="o">=</span> <span class="n">cmdline_to_argv8</span><span class="p">(</span><span class="n">cmdline_fetch</span><span class="p">(),</span> <span class="n">argv</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">main</span><span class="p">(</span><span class="n">argc</span><span class="p">,</span> <span class="n">argv</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Also: Note the <code class="language-plaintext highlighter-rouge">FUZZ</code> option in my source. It has been pretty thoroughly
<a href="/blog/2019/01/25/">fuzz tested</a>. It didn’t find anything, but it does make me more
confident in the result.</p>

<p>I also peeked at some language runtimes to see how others handle it. Just
as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft
CRT. Also expected, CPython implicitly does whatever the underlying C
runtime does, so its exact command line behavior depends on which version
of Visual Studio was used to build the Python binary. OpenJDK
<a href="https://github.com/openjdk/jdk/blob/jdk-17+35/src/jdk.jpackage/windows/native/common/WinSysInfo.cpp#L141">pragmatically calls CommandLineToArgvW</a>. Go (gc) <a href="https://go.googlesource.com/go/+/refs/tags/go1.17.7/src/os/exec_windows.go#115">does its own
parsing</a>, with behavior mixed between CommandLineToArgvW and some of
Microsoft’s CRTs, but not quite matching either.</p>

<h3 id="building-a-command-line-string">Building a command line string</h3>

<p>I’ve always been boggled as to why there’s no complementary inverse to
CommandLineToArgvW. When spawning processes with arbitrary arguments,
everyone is left to implement the inverse of this under-specified and
non-trivial command line format to serialize an <code class="language-plaintext highlighter-rouge">argv</code>. Hopefully the
receiver parses it compatibly! There’s no falling back on a system routine
to help out. This has led to a lot of repeated effort: it’s not limited
to high level runtimes, but almost any extensible application (itself a
kind of runtime). Fortunately serializing is not quite as complex as
parsing since many of the edge cases simply don’t come up if done in a
straightforward way.</p>

<p>Naturally, I also wrote my own implementation (same source):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">cmdline_from_argv8</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">cmdline</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">);</span>
</code></pre></div></div>

<p>Like before, it accepts a WTF-8 <code class="language-plaintext highlighter-rouge">argv</code>, meaning it can correctly pass
through ill-formed UTF-16 arguments. It returns the actual command line
length. Since this one <em>can</em> fail when <code class="language-plaintext highlighter-rouge">argv</code> is too large, it returns
zero for an error.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="s">"python.exe"</span><span class="p">,</span> <span class="s">"-c"</span><span class="p">,</span> <span class="n">code</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>
<span class="kt">wchar_t</span> <span class="n">cmd</span><span class="p">[</span><span class="n">CMDLINE_CMD_MAX</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cmdline_from_argv8</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">CMDLINE_CMD_MAX</span><span class="p">,</span> <span class="n">argv</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="s">"argv too large"</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">CreateProcessW</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="cm">/*...*/</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="s">"CreateProcessW failed"</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
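
<p>For a sense of what serializing involves, here is a minimal sketch of quoting a single argument using the widely used approach: wrap the argument in quotes when needed, escape embedded quotes, and double any backslashes that end up in front of a quote. This is an illustration, not necessarily what <code class="language-plaintext highlighter-rouge">cmdline_from_argv8</code> does. The name <code class="language-plaintext highlighter-rouge">quote_arg</code> is hypothetical, and it works on plain <code class="language-plaintext highlighter-rouge">char</code> strings rather than converting WTF-8 to UTF-16.</p>

```c
#include <stddef.h>
#include <string.h>

/* Write the quoted form of arg to dst and return a pointer to the
 * terminating NUL. dst needs room for 2*strlen(arg)+3 bytes. */
static char *quote_arg(char *dst, const char *arg)
{
    if (*arg && !strpbrk(arg, " \t\"")) {
        /* Nothing special in it: copy verbatim. */
        size_t n = strlen(arg);
        memcpy(dst, arg, n);
        dst[n] = 0;
        return dst + n;
    }
    *dst++ = '"';
    for (;;) {
        size_t slashes = 0;
        while (*arg == '\\') { arg++; slashes++; }
        if (!*arg) {
            slashes *= 2;             /* they precede the closing quote */
        } else if (*arg == '"') {
            slashes = slashes*2 + 1;  /* double, plus one to escape it  */
        }
        for (size_t i = 0; i < slashes; i++) *dst++ = '\\';
        if (!*arg) break;
        *dst++ = *arg++;
    }
    *dst++ = '"';
    *dst = 0;
    return dst;
}
```

<p>So <code class="language-plaintext highlighter-rouge">C:\dir name\</code> becomes <code class="language-plaintext highlighter-rouge">"C:\dir name\\"</code>, while <code class="language-plaintext highlighter-rouge">a\b</code> passes through untouched. A full command line would join the quoted arguments with single spaces.</p>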

<p>How do others handle this?</p>

<ul>
  <li>
    <p>The <a href="https://git.savannah.gnu.org/cgit/emacs.git/tree/src/w32proc.c?h=emacs-27.2#n2009">aged Emacs implementation</a> is written in C rather than Lisp,
steeped in history with vestigial wrong turns. Emacs still only calls
the “narrow” CreateProcessA despite having every affordance to do
otherwise, and <a href="https://github.com/skeeto/emacsql/issues/77#issuecomment-887125675">uses the wrong encoding at that</a>. A personal
source of headaches.</p>
  </li>
  <li>
    <p>CPython uses Python rather than C via <a href="https://github.com/python/cpython/blob/3.10/Lib/subprocess.py#L529"><code class="language-plaintext highlighter-rouge">subprocess.list2cmdline</code></a>.
While <a href="https://bugs.python.org/issue10838">undocumented</a>, it’s accessible on any platform and easy to
test against various inputs. Try it out!</p>
  </li>
  <li>
    <p>Go (gc) is <a href="https://go.googlesource.com/go/+/refs/tags/go1.17.7/src/syscall/exec_windows.go#101">just as delightfully boring as I’d expect</a>.</p>
  </li>
  <li>
    <p>OpenJDK <a href="https://github.com/openjdk/jdk/blob/jdk-17%2B35/src/java.base/windows/classes/java/lang/ProcessImpl.java#L229">optimistically optimizes</a> for command line strings under
80 bytes, and like Emacs, displays the weathering of long use.</p>
  </li>
</ul>

<p>I don’t plan to write a language implementation anytime soon, where this
might be needed, but it’s nice to know I’ve already solved this problem
for myself!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Some sanity for C and C++ development on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/12/30/"/>
    <id>urn:uuid:2e417030-915f-4897-99ff-2a0dafd0ac89</id>
    <updated>2021-12-30T23:25:53Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>A hard reality of C and C++ software development on Windows is that there
has never been a good, native C or C++ standard library implementation for
the platform. A standard library should abstract over the underlying host
facilities in order to ease portable software development. On Windows, C
and C++ are so poorly hooked up to operating system interfaces that most
portable or mostly-portable software (programs which work perfectly
elsewhere) is subtly broken on Windows, particularly outside of the
English-speaking world. The reasons are almost certainly political rather
than technical, originally motivated by vendor lock-in, which adds insult
to injury. This article is about what’s wrong, how it’s wrong, and some
easy techniques to deal with it in portable software.</p>

<p>There are <a href="/blog/2016/06/13/">multiple C implementations</a>, so how could they all be
bad, even the <a href="/blog/2018/04/13/">early ones</a>? Microsoft’s C runtime has defined how
the standard library should work on the platform, and everyone else
followed along for the sake of compatibility. I’m excluding <a href="https://www.cygwin.com/">Cygwin</a> and
its major fork, <a href="https://www.msys2.org/">MSYS2</a>, even though they inherit none of these flaws. They
change so much that they’re effectively whole new platforms, not truly
“native” to Windows.</p>

<p>In practice, C++ standard libraries are implemented on top of a C standard
library, which is why C++ shares the same problems. CPython dodges these
issues: Though written in C, on Windows it bypasses the broken C standard
library and directly calls the proprietary interfaces. Other language
implementations, such as “gc” Go, simply aren’t built on C at all, and
instead do things correctly in the first place — the behaviors the C
runtimes should have had all along.</p>

<p>If you’re just working on one large project, bypassing the C runtime isn’t
such a big deal, and you’re likely already doing so to access important
platform functionality. You don’t really even need a C runtime. However,
if you write many small programs, <a href="https://github.com/skeeto/scratch">as I do</a>, writing the same
special Windows support for each one ends up being most of the work, and
honestly makes properly supporting Windows not worth the trouble. I end up
just accepting the broken defaults most of the time.</p>

<p>Before diving into the details, if you’re looking for a quick-and-easy
solution for the Mingw-w64 toolchain, <a href="/blog/2020/05/15/">including w64devkit</a>, which
magically makes your C and C++ console programs behave well on Windows,
I’ve put together a “library” named <strong><a href="https://github.com/skeeto/scratch/tree/master/libwinsane">libwinsane</a></strong>. It solves all
problems discussed in this article, except for one. No source changes
required, simply link it into your program.</p>

<h3 id="what-exactly-is-broken">What exactly is broken?</h3>

<p>The Windows API comes in two flavors: narrow with an “A” (“ANSI”) suffix,
and wide (Unicode, UTF-16) with a “W” suffix. The former is the legacy
API, where an active <em>code page</em> maps 256 bytes onto (up to) 256 specific
characters. On typical machines configured for European languages, this
means <a href="https://en.wikipedia.org/wiki/Windows-1252">code page 1252</a>. <a href="http://simonsapin.github.io/wtf-8/">Roughly speaking</a>, Windows
internally uses UTF-16, and calls through the narrow interface use the
active code page to translate the narrow strings to wide strings. The
result is that calls through the narrow API have limited access to the
system.</p>

<p>The UTF-8 encoding was invented in 1992 and standardized by January 1993.
UTF-8 was adopted by the unix world over the following years due to <a href="/blog/2017/10/06/#what-is-utf-8">its
backwards-compatibility</a> with its existing interfaces. Programs
could read and write Unicode data, access Unicode paths, pass Unicode
arguments, and get and set Unicode environment variables without needing
to change anything. Today UTF-8 has become the dominant text encoding
format in the world, in large part due to the world wide web.</p>

<p>In July 1993, Microsoft introduced the wide Windows API with the release
of Windows NT 3.1, placing all their bets on UCS-2 (later UTF-16) rather
than UTF-8. This turned out to be a mistake, since <a href="http://utf8everywhere.org/">UTF-16 is inferior to
UTF-8 in practically every way</a>, though admittedly some problems
weren’t so obvious at the time.</p>

<p>The major problem: <strong>The C and C++ standard libraries only hook up to the
narrow Windows interfaces</strong>. The standard library, and therefore typical
portable software on Windows, cannot handle anything but ASCII. The
effective result is that these programs:</p>

<ul>
  <li>Cannot accept non-ASCII arguments</li>
  <li>Cannot get/set non-ASCII environment variables</li>
  <li>Cannot access non-ASCII paths</li>
  <li>Cannot read and write non-ASCII on a console</li>
</ul>

<p>Doing any of these requires calling proprietary functions, treating
Windows as a special target. It’s part of what makes correctly porting
software to Windows so painful.</p>
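
<p>As an example of that special treatment, here is a minimal sketch of opening a file by UTF-8 path. The wrapper name <code class="language-plaintext highlighter-rouge">fopen_utf8</code> is hypothetical. On Windows it converts to UTF-16 with <code class="language-plaintext highlighter-rouge">MultiByteToWideChar</code> and calls the non-standard <code class="language-plaintext highlighter-rouge">_wfopen</code>. Elsewhere it assumes narrow paths are already UTF-8 and calls plain <code class="language-plaintext highlighter-rouge">fopen</code>. For brevity it limits paths to <code class="language-plaintext highlighter-rouge">MAX_PATH</code>.</p>

```c
#include <stdio.h>
#ifdef _WIN32
#  include <windows.h>
#endif

/* Open a file whose path is UTF-8, regardless of the active code page.
 * Returns a null pointer on failure, like fopen. */
static FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    wchar_t wpath[MAX_PATH];
    wchar_t wmode[16];
    /* A length of -1 converts up to and including the NUL terminator.
     * The call returns zero on failure, e.g. when the buffer is too small. */
    if (!MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, MAX_PATH) ||
        !MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16)) {
        return 0;
    }
    return _wfopen(wpath, wmode);
#else
    return fopen(path, mode);  /* assumption: narrow paths are UTF-8 here */
#endif
}
```

<p>Every standard function that touches a path, an argument, or an environment variable needs a similar wrapper, which is how the per-program effort adds up.</p>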

<p>The sensible solution would have been for the C runtime to speak UTF-8 and
connect to the wide API. Alternatively, the narrow API could have been
changed over to UTF-8, phasing out the old code page concept. In theory
this is what the UTF-8 “code page” is about, though it doesn’t always
work. There would have been compatibility problems with abruptly making
such a change, but until very recently, <em>this wasn’t even an option</em>. Why
couldn’t there be a switch I could flip to get sane behavior that works
like every other platform?</p>

<h3 id="how-to-mostly-fix-unicode-support">How to mostly fix Unicode support</h3>

<p>In 2019, Microsoft introduced a feature to allow programs to <a href="https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page">request
UTF-8 as their active code page on start</a>, along with supporting
UTF-8 on more narrow API functions. This is like the magic switch I
wanted, except that it involves embedding some ugly XML into your binary
in a particular way. At least it’s now an option.</p>

<p>For Mingw-w64, that means writing a resource file like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;winuser.h&gt;
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "utf8.xml"
</code></pre></div></div>

<p>Compiling it with <code class="language-plaintext highlighter-rouge">windres</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ windres -o manifest.o manifest.rc
</code></pre></div></div>

<p>Then linking that into your program. Amazingly it mostly works! Programs
can access Unicode arguments, Unicode environment variables, and Unicode
paths, including with <code class="language-plaintext highlighter-rouge">fopen</code>, just as it’s worked on other platforms for
decades. Since the active code page is set at load time, it happens before
<code class="language-plaintext highlighter-rouge">argv</code> is constructed (from <code class="language-plaintext highlighter-rouge">GetCommandLineA</code>), which is why that works
out.</p>

<p>Alternatively you could create a “side-by-side assembly” placing that XML
in a file with the same name as your EXE but with a <code class="language-plaintext highlighter-rouge">.manifest</code> suffix
(after the <code class="language-plaintext highlighter-rouge">.exe</code> suffix), then placing that next to your EXE. Just be
mindful that there’s a “side-by-side” cache (WinSxS), and so it might not
immediately pick up your changes.</p>
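
<p>Either way, it can be worth confirming at startup that the manifest actually took effect, since to my knowledge Windows releases that predate the feature simply ignore the setting. A minimal sketch, where the function name <code class="language-plaintext highlighter-rouge">narrow_api_is_utf8</code> is hypothetical and the non-Windows branch just assumes UTF-8:</p>

```c
#ifdef _WIN32
#  include <windows.h>
#endif

/* Returns nonzero if narrow API strings are UTF-8 in this process. */
static int narrow_api_is_utf8(void)
{
#ifdef _WIN32
    return GetACP() == CP_UTF8;  /* CP_UTF8 is code page 65001 */
#else
    return 1;  /* assumption: other platforms already use UTF-8 */
#endif
}
```

<p>A program could print a warning, or fall back to the wide API, when this returns zero.</p>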

<p>What <em>doesn’t</em> work is console input and output since the console is
external to the process, and so isn’t covered by the process’s active code
page. It must be configured separately using a proprietary call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SetConsoleOutputCP</span><span class="p">(</span><span class="n">CP_UTF8</span><span class="p">);</span>
</code></pre></div></div>

<p>Annoying, but at least it’s not <em>that</em> painful. This only covers output,
though, meaning programs can only print UTF-8. Unfortunately <a href="https://github.com/microsoft/terminal/issues/4551#issuecomment-585487802">UTF-8 input
still doesn’t work</a>, and setting the input code page doesn’t do
anything despite reporting success:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SetConsoleCP</span><span class="p">(</span><span class="n">CP_UTF8</span><span class="p">);</span>  <span class="c1">// doesn't work</span>
</code></pre></div></div>

<p>If you care about reading interactive Unicode input, you’re <a href="/blog/2020/05/04/">stuck
bypassing the C runtime</a> since it’s still broken.</p>

<h3 id="text-stream-translation">Text stream translation</h3>

<p>Another long-standing issue is that C and C++ on Windows have distinct
“text” and “binary” streams, which it inherited from DOS. Mainly this
means automatic newline conversion between CRLF and LF. The C standard
explicitly allows for this, though unix-like platforms have never actually
distinguished between text and binary streams.</p>

<p>The standard also specifies that standard input, output, and error are all
opened as text streams, and there’s no portable method to change the stream
mode to binary — a serious deficiency with the standard. On unix-likes
this doesn’t matter, but on Windows it means programs can’t read or write
binary data on standard streams without calling a non-standard function.
It also means reading and writing standard streams is slow, <a href="/blog/2021/12/04/">frequently a
bottleneck</a> unless I route around it.</p>

<p>Personally, I like <a href="/blog/2020/06/29/">writing binary data to standard output</a>,
<a href="/blog/2020/11/24/">including video</a>, and sometimes <a href="/blog/2017/07/02/">binary filters</a> that also read
binary input. I do it so often that in probably half my C programs I have
this snippet in <code class="language-plaintext highlighter-rouge">main</code> just so they work correctly on Windows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="cp">#ifdef _WIN32
</span>    <span class="kt">int</span> <span class="nf">_setmode</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
    <span class="n">_setmode</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mh">0x8000</span><span class="p">);</span>
    <span class="n">_setmode</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mh">0x8000</span><span class="p">);</span>
    <span class="cp">#endif
</span></code></pre></div></div>

<p>That incantation sets standard input and output in the C runtime to binary
mode without the need to include a header, making it compact, simple, and
self-contained.</p>

<p>This built-in newline translation, along with the Windows standard text
editor, Notepad, <a href="https://devblogs.microsoft.com/commandline/extended-eol-in-notepad/">lagging decades behind</a>, meant that many other
programs, including Git, grew their own, annoying, newline conversion
<a href="https://github.com/skeeto/w64devkit/issues/10">misfeatures</a> that cause <a href="https://github.com/skeeto/binitools/commit/2efd690c3983856c9633b0be66d57483491d1e10">other problems</a>.</p>

<h3 id="libwinsane">libwinsane</h3>

<p>I introduced libwinsane at the beginning of the article, which fixes all
this simply by being linked into a program. It includes the magic XML
manifest <code class="language-plaintext highlighter-rouge">.rsrc</code> section, configures the console for UTF-8 output, and
sets standard streams to binary before <code class="language-plaintext highlighter-rouge">main</code> (via a GCC constructor). I
called it a “library”, but it’s actually a single object file. It can’t be
a static library since it must be linked into the program despite not
actually being referenced by the program.</p>
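
<p>The constructor part can be sketched roughly as follows. This is not the libwinsane source, only an illustration of the technique, and the names <code class="language-plaintext highlighter-rouge">startup</code> and <code class="language-plaintext highlighter-rouge">startup_ran</code> are hypothetical. GCC and Clang run functions marked with the <code class="language-plaintext highlighter-rouge">constructor</code> attribute before <code class="language-plaintext highlighter-rouge">main</code>, so nothing in the program needs to call it.</p>

```c
#ifdef _WIN32
#  include <windows.h>
#  include <fcntl.h>
#  include <io.h>
#endif

static int startup_ran;  /* hypothetical flag, only for demonstration */

/* Runs automatically before main() when built with GCC or Clang. */
__attribute__((constructor))
static void startup(void)
{
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);  /* console output as UTF-8       */
    _setmode(0, _O_BINARY);       /* standard input in binary mode  */
    _setmode(1, _O_BINARY);       /* standard output in binary mode */
#endif
    startup_ran = 1;
}
```

<p>This also shows why a static library won’t do: a linker only extracts archive members that resolve an undefined symbol, and nothing ever refers to <code class="language-plaintext highlighter-rouge">startup</code>. Naming the object file directly on the command line forces it in.</p>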

<p>So normally this program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="n">argv</span><span class="p">[</span><span class="n">argc</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
    <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">arg</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%zu %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">arg</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled and run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:\&gt;cc -o example example.c
C:\&gt;example π
1 p
</code></pre></div></div>

<p>As usual, the Unicode argument is silently mangled into one byte. Linked
with libwinsane, it just works like everywhere else:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:\&gt;gcc -o example example.c libwinsane.o
C:\&gt;example π
2 π
</code></pre></div></div>

<p>If you’re maintaining a substantial program, you probably want to copy and
integrate the necessary parts of libwinsane into your project and build,
rather than always link against this loose object file. This is more for
convenience and for succinctly capturing the concept. You may even want to
<a href="https://github.com/skeeto/hastyhex/blob/f03b6e0f/hastyhex.c#L298-L309">enable ANSI escape processing</a> in your version.</p>

<p><strong>Update December 2024</strong>: Pavel Galkin <a href="https://lists.sr.ht/~skeeto/public-inbox/%3Cdf749edc-0413-4735-9cf2-c77db202cc6e@app.fastmail.com%3E">demonstrates how <code class="language-plaintext highlighter-rouge">libwinsane.o</code>
changes the console state</a>, which affects all processes associated
with the terminal. This is mostly unavoidable, and it’s one reason I’ve
since concluded that UTF-8 manifests are a poor solution. Better to <a href="/blog/2023/01/18/">solve
the problem using a platform layer</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>More DLL fun with w64devkit: Go, assembly, and Python</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/06/29/"/>
    <id>urn:uuid:b2c53451-b12a-4f1a-a475-6c81096c9b5a</id>
    <updated>2021-06-29T21:50:30Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>My previous article explained <a href="/blog/2021/05/31/">how to work with dynamic-link libraries
(DLLs) using w64devkit</a>. These techniques also apply to other
circumstances, including with languages and ecosystems outside of C and
C++. In particular, <a href="/blog/2020/05/15/">w64devkit</a> is a great complement to Go and reliably
fulfills all the needs of <a href="https://golang.org/cmd/cgo/">cgo</a> (Go’s C interop) and can even
bootstrap Go itself. As before, this article is in large part an exercise
in capturing practical information I’ve picked up over time.</p>

<h3 id="go-bootstrap-and-cgo">Go: bootstrap and cgo</h3>

<p>The primary Go implementation, confusingly <a href="https://golang.org/doc/faq#What_compiler_technology_is_used_to_build_the_compilers">named “gc”</a>, is an
<a href="/blog/2020/01/21/">incredible piece of software engineering</a>. This is apparent when
building the Go toolchain itself, a process that is fast, reliable, easy,
and simple. It was originally written in C, but was re-written in Go
starting with Go 1.5. The C compiler in w64devkit can build the original C
implementation which then can be used to bootstrap any more recent
version. It’s so easy that I personally never use official binary releases
and always bootstrap from source.</p>

<p>You will need the Go 1.4 source, <a href="https://dl.google.com/go/go1.4-bootstrap-20171003.tar.gz">go1.4-bootstrap-20171003.tar.gz</a>.
This “bootstrap” tarball is the last Go 1.4 release plus a few additional
bugfixes. You will also need the source of the actual version of Go you
want to use, such as Go 1.16.5 (latest version as of this writing).</p>

<p>Start by building Go 1.4 using w64devkit. On Windows, Go is built using a
batch script and no special build system is needed. Since it shouldn’t be
invoked with the BusyBox ash shell, I use <a href="/blog/2021/02/08/"><code class="language-plaintext highlighter-rouge">cmd.exe</code></a> explicitly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xf go1.4-bootstrap-20171003.tar.gz
$ mv go/ bootstrap
$ (cd bootstrap/src/ &amp;&amp; cmd /c make)
</code></pre></div></div>

<p>In about 30 seconds you’ll have a fully-working Go 1.4 toolchain. Next use
it to build the desired toolchain. You can move this new toolchain after
it’s built if necessary.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export GOROOT_BOOTSTRAP="$PWD/bootstrap"
$ tar xf go1.16.5.src.tar.gz
$ (cd go/src/ &amp;&amp; cmd /c make)
</code></pre></div></div>

<p>At this point you can delete the bootstrap toolchain. You probably also
want to put Go on your PATH.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ rm -rf bootstrap/
$ printf 'PATH="$PATH;%s/go/bin"\n' "$PWD" &gt;&gt;~/.profile
$ source ~/.profile
</code></pre></div></div>

<p>Not only is Go now available, so is the full power of cgo. (Including <a href="https://dave.cheney.net/2016/01/18/cgo-is-not-go">its
costs</a> if used.)</p>

<h3 id="vim-suggestions">Vim suggestions</h3>

<p>Since w64devkit is oriented so much around Vim, here’s my personal Vim
configuration for Go. I don’t need or want fancy plugins, just access to
<code class="language-plaintext highlighter-rouge">goimports</code> and a couple of corrections to Vim’s built-in Go support (<code class="language-plaintext highlighter-rouge">[[</code>
and <code class="language-plaintext highlighter-rouge">]]</code> navigation). The included <code class="language-plaintext highlighter-rouge">ctags</code> understands Go, so tags
navigation works the same as it does with C. <code class="language-plaintext highlighter-rouge">\i</code> saves the current
buffer, runs <code class="language-plaintext highlighter-rouge">goimports</code>, and populates the quickfix list with any errors.
Similarly <code class="language-plaintext highlighter-rouge">:make</code> invokes <code class="language-plaintext highlighter-rouge">go build</code> and, as expected, populates the
quickfix list.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code>autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="k">setlocal</span> <span class="nb">makeprg</span><span class="p">=</span><span class="k">go</span>\ build
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">silent</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">&lt;</span>leader<span class="p">&gt;</span><span class="k">i</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">update</span> \<span class="p">|</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">cexpr</span> <span class="nb">system</span><span class="p">(</span><span class="s2">"goimports -w "</span> <span class="p">.</span> <span class="nb">expand</span><span class="p">(</span><span class="s2">"%"</span><span class="p">))</span> \<span class="p">|</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">silent</span> <span class="k">edit</span><span class="p">&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">[[</span>
<span class="se">    \</span> ?^\<span class="p">(</span>func\\<span class="p">|</span>var\\<span class="p">|</span><span class="nb">type</span>\\<span class="p">|</span><span class="k">import</span>\\<span class="p">|</span>package\<span class="p">)</span>\<span class="p">&gt;&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">]]</span>
<span class="se">    \</span> /^\<span class="p">(</span>func\\<span class="p">|</span>var\\<span class="p">|</span><span class="nb">type</span>\\<span class="p">|</span><span class="k">import</span>\\<span class="p">|</span>package\<span class="p">)</span>\<span class="p">&gt;&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
</code></pre></div></div>

<p>Go only comes with <code class="language-plaintext highlighter-rouge">gofmt</code> but <code class="language-plaintext highlighter-rouge">goimports</code> is just one command away, so
there’s little excuse not to have it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ go install golang.org/x/tools/cmd/goimports@latest
</code></pre></div></div>

<p>Thanks to GOPROXY, all Go dependencies are accessible without (or before)
installing Git, so this tool installation works with nothing more than
w64devkit and a bootstrapped Go toolchain.</p>

<h3 id="cgo-dlls">cgo DLLs</h3>

<p>The intricacies of cgo are beyond the scope of this article, but the gist
is that a Go source file contains C source in a comment followed by
<code class="language-plaintext highlighter-rouge">import "C"</code>. The imported <code class="language-plaintext highlighter-rouge">C</code> object provides access to C types and
functions. Go functions marked with an <code class="language-plaintext highlighter-rouge">//export</code> comment, as well as the
commented C code, are accessible to C. Those exports mean we can use Go to
implement a C interface in a DLL, and the caller will have no idea they’re
actually talking to Go.</p>

<p>To illustrate, here’s a little C interface. To keep it simple, I’ve
specifically sidestepped some more complicated issues, particularly
involving memory management.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Which DLL am I running?</span>
<span class="kt">int</span> <span class="nf">version</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>

<span class="c1">// Generate 64 bits from a CSPRNG.</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="nf">rand64</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>

<span class="c1">// Compute the Euclidean norm.</span>
<span class="kt">float</span> <span class="nf">dist</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s a C implementation which I’m calling “version 1”.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;math.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;ntsecapi.h&gt;</span><span class="cp">
</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">int</span>
<span class="nf">version</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>

<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span>
<span class="nf">rand64</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">RtlGenRandom</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">x</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">float</span>
<span class="nf">dist</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As discussed in the previous article, each function is exported using
<code class="language-plaintext highlighter-rouge">__declspec</code> so that they’re available for import. As before:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -s -o hello1.dll hello1.c
</code></pre></div></div>

<p>Side note: This could be trivially converted into a C++ implementation
just by adding <code class="language-plaintext highlighter-rouge">extern "C"</code> to each declaration. It disables C++ features
like name mangling, and follows the C ABI so that the C++ functions appear
as C functions. Compiling the C++ DLL is exactly the same.</p>

<p>Suppose we wanted to implement this in Go instead of C. We already have
all the tools needed to do so. Here’s a Go implementation, “version 2”:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>

<span class="k">import</span> <span class="s">"C"</span>
<span class="k">import</span> <span class="p">(</span>
	<span class="s">"crypto/rand"</span>
	<span class="s">"encoding/binary"</span>
	<span class="s">"math"</span>
<span class="p">)</span>

<span class="c">//export version</span>
<span class="k">func</span> <span class="n">version</span><span class="p">()</span> <span class="n">C</span><span class="o">.</span><span class="kt">int</span> <span class="p">{</span>
	<span class="k">return</span> <span class="m">2</span>
<span class="p">}</span>

<span class="c">//export rand64</span>
<span class="k">func</span> <span class="n">rand64</span><span class="p">()</span> <span class="n">C</span><span class="o">.</span><span class="n">ulonglong</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">buf</span> <span class="p">[</span><span class="m">8</span><span class="p">]</span><span class="kt">byte</span>
	<span class="n">rand</span><span class="o">.</span><span class="n">Read</span><span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="o">:</span><span class="p">])</span>
	<span class="n">r</span> <span class="o">:=</span> <span class="n">binary</span><span class="o">.</span><span class="n">LittleEndian</span><span class="o">.</span><span class="n">Uint64</span><span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="o">:</span><span class="p">])</span>
	<span class="k">return</span> <span class="n">C</span><span class="o">.</span><span class="n">ulonglong</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="p">}</span>

<span class="c">//export dist</span>
<span class="k">func</span> <span class="n">dist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span><span class="p">)</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span> <span class="p">{</span>
	<span class="k">return</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="kt">float64</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">)))</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note the use of C types for all arguments and return values. The <code class="language-plaintext highlighter-rouge">main</code>
function is required since this is the main package, but it will never be
called. The DLL is built like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ go build -buildmode=c-shared -o hello2.dll hello2.go
</code></pre></div></div>

<p>Without the <code class="language-plaintext highlighter-rouge">-o</code> option, the DLL will lack an extension. This works fine
since the extension is mostly just a convention on Windows, but a DLL
without one may be confusing.</p>

<p>What if we need an import library? This will be required when linking with
the MSVC toolchain. In the previous article we asked Binutils to generate
one using <code class="language-plaintext highlighter-rouge">--out-implib</code>. For Go we have to handle this ourselves via
<code class="language-plaintext highlighter-rouge">gendef</code> and <code class="language-plaintext highlighter-rouge">dlltool</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gendef hello2.dll
$ dlltool -l hello2.lib -d hello2.def
</code></pre></div></div>

<p>The only way anyone upgrading would know version 2 was implemented in Go
is that the DLL is a lot bigger (a few MB vs. a few kB) since it now
contains an entire Go runtime.</p>

<h3 id="nasm-assembly-dll">NASM assembly DLL</h3>

<p>We could also go the other direction and implement the DLL using plain
assembly. It won’t even require linking against a C runtime.</p>

<p>w64devkit includes two assemblers: GAS (Binutils) which is used by GCC,
and NASM which has <a href="https://elronnd.net/writ/2021-02-13_att-asm.html">friendlier syntax</a>. I prefer the latter whenever
possible — exactly why I included NASM in the distribution. So here’s how
I implemented “version 3” in NASM assembly.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">bits</span> <span class="mi">64</span>

<span class="nf">section</span> <span class="nv">.text</span>

<span class="nf">global</span> <span class="nb">Dl</span><span class="nv">lMainCRTStartup</span>
<span class="nf">export</span> <span class="nb">Dl</span><span class="nv">lMainCRTStartup</span>
<span class="nl">DllMainCRTStartup:</span>
	<span class="nf">mov</span> <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
	<span class="nf">ret</span>

<span class="nf">global</span> <span class="nv">version</span>
<span class="nf">export</span> <span class="nv">version</span>
<span class="nl">version:</span>
	<span class="nf">mov</span> <span class="nb">eax</span><span class="p">,</span> <span class="mi">3</span>
	<span class="nf">ret</span>

<span class="nf">global</span> <span class="nv">rand64</span>
<span class="nf">export</span> <span class="nv">rand64</span>
<span class="nl">rand64:</span>
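	<span class="nf">rdrand</span> <span class="nb">rax</span>
	<span class="nf">jnc</span> <span class="nv">rand64</span> <span class="c1">; CF=0 means no value was ready, so retry</span>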
	<span class="nf">ret</span>

<span class="nf">global</span> <span class="nb">di</span><span class="nv">st</span>
<span class="nf">export</span> <span class="nb">di</span><span class="nv">st</span>
<span class="nl">dist:</span>
	<span class="nf">mulss</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
	<span class="nf">mulss</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
	<span class="nf">addss</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
	<span class="nf">sqrtss</span> <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
	<span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">global</code> directive is common in NASM assembly and causes the named
symbol to have the external linkage needed when linking the DLL. The
<code class="language-plaintext highlighter-rouge">export</code> directive is Windows-specific and is equivalent to <code class="language-plaintext highlighter-rouge">dllexport</code> in
C.</p>

<p>Every DLL must have an entrypoint, usually named <code class="language-plaintext highlighter-rouge">DllMainCRTStartup</code>. The
return value indicates if the DLL successfully loaded. So far this has
been handled automatically by the C implementation, but at this low level
we must define it explicitly.</p>

<p>Here’s how to assemble and link the DLL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nasm -fwin64 -o hello3.o hello3.s
$ ld -shared -s -o hello3.dll hello3.o
</code></pre></div></div>

<h3 id="call-the-dlls-from-python">Call the DLLs from Python</h3>

<p>Python has a nice, built-in C interop, <code class="language-plaintext highlighter-rouge">ctypes</code>, that allows Python to
call arbitrary C functions in shared libraries, including DLLs, without
writing C to glue it together. To tie this all off, here’s a Python
program that loads all of the DLLs above and invokes each of the
functions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ctypes</span>

<span class="k">def</span> <span class="nf">load</span><span class="p">(</span><span class="n">version</span><span class="p">):</span>
    <span class="n">hello</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">CDLL</span><span class="p">(</span><span class="sa">f</span><span class="s">"./hello</span><span class="si">{</span><span class="n">version</span><span class="si">}</span><span class="s">.dll"</span><span class="p">)</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_int</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">()</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">(</span><span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span><span class="p">,</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span><span class="p">)</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_ulonglong</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">()</span>
    <span class="k">return</span> <span class="n">hello</span>

<span class="k">for</span> <span class="n">hello</span> <span class="ow">in</span> <span class="n">load</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">load</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">load</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"version"</span><span class="p">,</span> <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">())</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"rand   "</span><span class="p">,</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">()</span><span class="si">:</span><span class="mi">016</span><span class="n">x</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"dist   "</span><span class="p">,</span> <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
</code></pre></div></div>

<p>After loading the DLL with <code class="language-plaintext highlighter-rouge">CDLL</code> the program defines each function
prototype so that Python knows how to call it. Unfortunately it’s not
possible to build Python with w64devkit, so you’ll also need to install
the standard CPython distribution in order to run it. Here’s the output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python finale.py
version 1
rand    b011ea9bdbde4bdf
dist    5.0
version 2
rand    f7c86ff06ae3d1a2
dist    5.0
version 3
rand    2a35a05b0482c898
dist    5.0
</code></pre></div></div>

<p>That output is the result of four different languages interfacing in one
process: C, Go, x86-64 assembly, and Python. Pretty neat if you ask me!</p>
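
<p>A footnote on the prototypes in the Python program: without them,
<code class="language-plaintext highlighter-rouge">ctypes</code> assumes every function returns <code class="language-plaintext highlighter-rouge">int</code> and passes Python
integers as C integers, so <code class="language-plaintext highlighter-rouge">dist(3, 4)</code> would not receive floats. Here’s
a minimal sketch of the same idea that runs without any of the DLLs above.
It borrows <code class="language-plaintext highlighter-rouge">sqrtf</code> from the platform’s C math library as a stand-in for
<code class="language-plaintext highlighter-rouge">dist</code>. The library lookup is my assumption (<code class="language-plaintext highlighter-rouge">msvcrt</code> on Windows,
<code class="language-plaintext highlighter-rouge">find_library("m")</code> elsewhere) and may need adjusting on your system.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ctypes
import ctypes.util
import sys

# Assumption: the C math library can be located like this. Adjust as needed.
name = "msvcrt" if sys.platform == "win32" else ctypes.util.find_library("m")
libm = ctypes.CDLL(name)

sqrtf = libm.sqrtf
sqrtf.restype = ctypes.c_float      # otherwise the result is read as an int
sqrtf.argtypes = (ctypes.c_float,)  # otherwise 25 is passed as a C int

print(sqrtf(25))  # the Python int 25 is converted to a C float first
</code></pre></div></div>

<p>With the prototype in place this prints <code class="language-plaintext highlighter-rouge">5.0</code>, just like the <code class="language-plaintext highlighter-rouge">dist</code> calls
above.</p>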

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to build and use DLLs on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/05/31/"/>
    <id>urn:uuid:6b64024a-6945-4bff-8226-33b9357babda</id>
    <updated>2021-05-31T02:13:40Z</updated>
    <category term="win32"/><category term="c"/><category term="cpp"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>I’ve recently been involved with a couple of discussions about Windows’
dynamic linking. One was with <a href="https://begriffs.com/">Joe Nelson</a> about how to make
<a href="https://github.com/begriffs/libderp">libderp</a> accessible on Windows, and the other was about <a href="/blog/2020/05/15/">w64devkit</a>,
my Mingw-w64 distribution. I use these techniques so infrequently that I
need to figure it all out again each time I need it. Unfortunately there’s
a whole lot of outdated and incorrect information online which gets in the
way every time this happens. While it’s all fresh in my head, I will now
document what I know works.</p>

<p>In this article, all commands and examples are run in the context of
w64devkit (1.8.0).</p>

<h3 id="mingw-w64">Mingw-w64</h3>

<p>If all you care about is the GNU toolchain then DLLs are straightforward,
working mostly like shared objects on other platforms. To illustrate,
let’s build a “square” library with one “exported” function, <code class="language-plaintext highlighter-rouge">square</code>,
that returns the square of its input (<code class="language-plaintext highlighter-rouge">square.c</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The header file (<code class="language-plaintext highlighter-rouge">square.h</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifndef SQUARE_H
#define SQUARE_H
</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>

<span class="cp">#endif
</span></code></pre></div></div>

<p>To build a stripped, size-optimized DLL, <code class="language-plaintext highlighter-rouge">square.dll</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -s -o square.dll square.c
</code></pre></div></div>

<p>Now a test program to link against it (<code class="language-plaintext highlighter-rouge">main.c</code>), which “imports” <code class="language-plaintext highlighter-rouge">square</code>
from <code class="language-plaintext highlighter-rouge">square.dll</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">"square.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">square</span><span class="p">(</span><span class="mi">2</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Linking and testing it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -Os -s main.c square.dll
$ ./a
4
</code></pre></div></div>

<p>It’s that simple. Or more traditionally, using the <code class="language-plaintext highlighter-rouge">-l</code> flag:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -Os -s -L. main.c -lsquare
</code></pre></div></div>

<p>Given <code class="language-plaintext highlighter-rouge">-lxyz</code> GCC will look for <code class="language-plaintext highlighter-rouge">xyz.dll</code> in the library path.</p>

<h4 id="viewing-exported-symbols">Viewing exported symbols</h4>

<p>Printing a list of the functions exported by a DLL is not so
straightforward. For ELF shared objects there’s <code class="language-plaintext highlighter-rouge">nm -D</code>, but despite what
the internet will tell you, this tool does not support DLLs. <code class="language-plaintext highlighter-rouge">objdump</code>
will print the exports as part of the “private” headers (<code class="language-plaintext highlighter-rouge">-p</code>). A bit of
<code class="language-plaintext highlighter-rouge">awk</code> can cut this down to just a list of exports. Since we’ll need this a
few times, here’s a script, <code class="language-plaintext highlighter-rouge">exports.sh</code>, that composes <code class="language-plaintext highlighter-rouge">objdump</code> and
<code class="language-plaintext highlighter-rouge">awk</code> into the tool I want:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="nb">printf</span> <span class="s1">'LIBRARY %s\nEXPORTS\n'</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>
objdump <span class="nt">-p</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> | <span class="nb">awk</span> <span class="s1">'/^$/{t=0} {if(t)print$NF} /^\[O/{t=1}'</span>
</code></pre></div></div>

<p>Running this on <code class="language-plaintext highlighter-rouge">square.dll</code> above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
square
</code></pre></div></div>

<p>This can be helpful when debugging. It also works outside of Windows, such
as on Linux. By the way, the output format is no accident: This is the
<a href="https://sourceware.org/binutils/docs/binutils/def-file-format.html"><code class="language-plaintext highlighter-rouge">.def</code> file format</a> (<a href="https://www.willus.com/mingw/yongweiwu_stdcall.html">also</a>), which will be particularly
useful in a moment.</p>
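
<p>If the <code class="language-plaintext highlighter-rouge">awk</code> one-liner in <code class="language-plaintext highlighter-rouge">exports.sh</code> is hard to read, here’s roughly
the same filter sketched in Python: start collecting after the line that
begins with <code class="language-plaintext highlighter-rouge">[O</code>, take the last field of each line, and stop at the first
blank line. The function name <code class="language-plaintext highlighter-rouge">exports</code> is made up for illustration, and
the sample text is a hand-written, abridged stand-in for the relevant part
of <code class="language-plaintext highlighter-rouge">objdump -p</code> output, not a captured run.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def exports(objdump_output):
    """Collect names listed under "[Ordinal/Name Pointer] Table"."""
    names = []
    in_table = False
    for line in objdump_output.splitlines():
        if line == "":
            in_table = False              # a blank line ends the table
        elif in_table:
            fields = line.split()
            if fields:
                names.append(fields[-1])  # last field is the export name
        if line.startswith("[O"):
            in_table = True               # entries start on the next line
    return names

# Hand-written, abridged sample (an assumption about the shape of the
# real output, not captured from an actual run).
SAMPLE = """\
[Ordinal/Name Pointer] Table
\t[   0] internal_func
\t[   1] square

(other headers follow)
"""

print("LIBRARY square.dll")
print("EXPORTS")
for name in exports(SAMPLE):
    print(name)
</code></pre></div></div>

<p>The printed result has the same <code class="language-plaintext highlighter-rouge">LIBRARY</code>/<code class="language-plaintext highlighter-rouge">EXPORTS</code> layout as the script’s
output, with one export name per line.</p>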

<p>Mingw-w64 has a <code class="language-plaintext highlighter-rouge">gendef</code> tool to produce the above output, and this tool
is now included in w64devkit. To print the exports to standard output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gendef - square.dll
LIBRARY "square.dll"
EXPORTS
square
</code></pre></div></div>

<p>Alternatively Visual Studio provides <code class="language-plaintext highlighter-rouge">dumpbin</code>. It’s not as concise as
<code class="language-plaintext highlighter-rouge">exports.sh</code> but it’s a lot less verbose than <code class="language-plaintext highlighter-rouge">objdump -p</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dumpbin /nologo /exports square.dll
...
          1    0 000012B0 square
...
</code></pre></div></div>

<h4 id="mingw-w64-improved">Mingw-w64 (improved)</h4>

<p>You can get by without knowing anything more, which is usually enough for
those looking to support Windows as a secondary platform, even just as a
cross-compilation target. However, with a bit more work we can do better.
Imagine doing the above with a non-trivial program. GCC doesn’t know which
functions are part of the API and which are not. Obviously static
functions should not be exported, but what about non-static functions
visible between translation units (i.e. object files)?</p>

<p>For instance, suppose <code class="language-plaintext highlighter-rouge">square.c</code> also has this function which is not part
of its API but may be called by another translation unit.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">internal_func</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{}</span>
</code></pre></div></div>

<p>Now when I build and list the exports:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
internal_func
square
</code></pre></div></div>

<p>On the other side, when I build <code class="language-plaintext highlighter-rouge">main.c</code> how does it know which functions
are imported from a DLL and which will be found in another translation
unit? GCC makes it work regardless, but it can generate more efficient
code if it knows at compile time (vs. link time).</p>

<p>On Windows both are solved by adding <code class="language-plaintext highlighter-rouge">__declspec</code> notation on both sides.
In <code class="language-plaintext highlighter-rouge">square.c</code> the exports are marked as <code class="language-plaintext highlighter-rouge">dllexport</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">internal_func</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{}</span>
</code></pre></div></div>

<p>In the header, it’s marked as an import:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>
</code></pre></div></div>

<p>The mere presence of <code class="language-plaintext highlighter-rouge">dllexport</code> tells the linker to only export those
functions marked as exports, and so <code class="language-plaintext highlighter-rouge">internal_func</code> disappears from the
exports list. Convenient!</p>

<p>On the import side, during compilation of the original program, GCC
assumed <code class="language-plaintext highlighter-rouge">square</code> wasn’t an import and generated a local function call.
When the linker later resolved the symbol to the DLL, it generated a
trampoline to fill in as that local function (like a <a href="https://www.airs.com/blog/archives/41">PLT</a>). With
<code class="language-plaintext highlighter-rouge">dllimport</code>, GCC knows it’s an imported function and so doesn’t go through
a trampoline.</p>

<p>While generally unnecessary for the GNU toolchain, it’s good hygiene to
use <code class="language-plaintext highlighter-rouge">__declspec</code>. It’s also mandatory when using <a href="https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B">MSVC</a>, in case you
care about that as well.</p>

<h3 id="msvc">MSVC</h3>

<p>Mingw-w64-compiled DLLs will work with <code class="language-plaintext highlighter-rouge">LoadLibrary</code> out of the box, which
is sufficient in many cases, such as for dynamically-loaded plugins. For
example (<code class="language-plaintext highlighter-rouge">loadlib.c</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">LoadLibrary</span><span class="p">(</span><span class="s">"square.dll"</span><span class="p">);</span>
    <span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">square</span><span class="p">)(</span><span class="kt">long</span><span class="p">)</span> <span class="o">=</span> <span class="n">GetProcAddress</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"square"</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">square</span><span class="p">(</span><span class="mi">2</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled with MSVC <code class="language-plaintext highlighter-rouge">cl</code> (via <a href="/blog/2016/06/13/#visual-c"><code class="language-plaintext highlighter-rouge">vcvars.bat</code></a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo loadlib.c
$ ./loadlib
4
</code></pre></div></div>

<p>However, the MSVC linker, unlike Binutils <code class="language-plaintext highlighter-rouge">ld</code>, cannot link directly with
DLLs. It requires an <em>import library</em>. Conventionally this matches the DLL
name but has a <code class="language-plaintext highlighter-rouge">.lib</code> extension — <code class="language-plaintext highlighter-rouge">square.lib</code> in this case. The Mingw-w64
ecosystem conventionally uses <code class="language-plaintext highlighter-rouge">.dll.a</code>, as in <code class="language-plaintext highlighter-rouge">square.dll.a</code>, in order to
distinguish it from a static library, but it’s the same format. The most
convenient way to get an import library is to ask GCC to generate one at
link-time via <code class="language-plaintext highlighter-rouge">--out-implib</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Wl,--out-implib,square.lib -o square.dll square.c
</code></pre></div></div>

<p>Back to <code class="language-plaintext highlighter-rouge">cl</code>, just add <code class="language-plaintext highlighter-rouge">square.lib</code> as another input. You don’t actually
need <code class="language-plaintext highlighter-rouge">square.dll</code> present at link time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /Os main.c square.lib
$ ./main
4
</code></pre></div></div>

<p>What if you already have the DLL and you just need an import library? GNU
Binutils’ <code class="language-plaintext highlighter-rouge">dlltool</code> can do this, though not without help. It cannot
generate an import library from a DLL alone since it requires a <code class="language-plaintext highlighter-rouge">.def</code>
file enumerating the exports. (Why?) What luck that we have a tool for
this!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll &gt;square.def
$ dlltool --input-def square.def --output-lib square.lib
</code></pre></div></div>

<h3 id="reversing-directions">Reversing directions</h3>

<p>Going the other way, building a DLL with MSVC and linking it with
Mingw-w64, is nearly as easy as the pure Mingw-w64 case, though it
requires that all exports are tagged with <code class="language-plaintext highlighter-rouge">dllexport</code>. The <code class="language-plaintext highlighter-rouge">/LD</code> switch
(case-sensitive) plays the same role as GCC’s <code class="language-plaintext highlighter-rouge">-shared</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /LD /Os square.c
$ cc -Os -s main.c square.dll
$ ./a
4
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">cl</code> outputs three files: <code class="language-plaintext highlighter-rouge">square.dll</code>, <code class="language-plaintext highlighter-rouge">square.lib</code>, and <code class="language-plaintext highlighter-rouge">square.exp</code>.
The last can be discarded, and the second will be needed if linking with
MSVC, but as before, Mingw-w64 requires only the first.</p>

<p>This all demonstrates that Mingw-w64 and MSVC are quite interoperable — at
least for C interfaces that <a href="/blog/2023/08/27/">don’t share CRT objects</a>.</p>

<h3 id="tying-it-all-together">Tying it all together</h3>

<p>If your program is designed to be portable, those <code class="language-plaintext highlighter-rouge">__declspec</code> will get in
the way. That can be tidied up with some macros, but even better, those
macros can be used to control ELF symbol visibility so that the library
has good hygiene on, say, Linux as well.</p>

<p>The strategy will be to mark all API functions with <code class="language-plaintext highlighter-rouge">SQUARE_API</code> and
expand that to whatever is necessary at the time. When building a library,
it will expand to <code class="language-plaintext highlighter-rouge">dllexport</code>, or default visibility on unix-likes. When
consuming a library it will expand to <code class="language-plaintext highlighter-rouge">dllimport</code>, or nothing outside of
Windows. The new <code class="language-plaintext highlighter-rouge">square.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifndef SQUARE_H
#define SQUARE_H
</span>
<span class="cp">#if defined(SQUARE_BUILD)
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllexport)
#  elif defined(__ELF__)
#    define SQUARE_API __attribute__ ((visibility ("default")))
#  else
#    define SQUARE_API
#  endif
#else
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllimport)
#  else
#    define SQUARE_API
#  endif
#endif
</span>
<span class="n">SQUARE_API</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>

<span class="cp">#endif
</span></code></pre></div></div>

<p>The new <code class="language-plaintext highlighter-rouge">square.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SQUARE_BUILD
#include</span> <span class="cpf">"square.h"</span><span class="cp">
</span>
<span class="n">SQUARE_API</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">main.c</code> remains the same. When compiling on unix-like systems, add
<code class="language-plaintext highlighter-rouge">-fvisibility=hidden</code> to hide all symbols by default so that only the
functions marked with this macro remain visible.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -fvisibility=hidden -s -o libsquare.so square.c
$ cc -Os -s main.c ./libsquare.so
$ ./a.out
4
</code></pre></div></div>
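
<p>To check which branch of <code class="language-plaintext highlighter-rouge">SQUARE_API</code> a given toolchain selects, one
option is to stringify the macro and inspect the result. This is only a
sketch: it repeats the selection logic from <code class="language-plaintext highlighter-rouge">square.h</code> so that it stands
alone, and <code class="language-plaintext highlighter-rouge">STR</code>, <code class="language-plaintext highlighter-rouge">STR_</code>, and <code class="language-plaintext highlighter-rouge">square_api_expansion</code> are hypothetical
helper names that are not part of the library.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;string.h&gt;

#define SQUARE_BUILD  /* pretend we are building the library */

/* Same selection logic as square.h, repeated to keep this standalone */
#if defined(SQUARE_BUILD)
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllexport)
#  elif defined(__ELF__)
#    define SQUARE_API __attribute__ ((visibility ("default")))
#  else
#    define SQUARE_API
#  endif
#else
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllimport)
#  else
#    define SQUARE_API
#  endif
#endif

/* Two levels so that SQUARE_API is expanded before it is quoted */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* Returns the text SQUARE_API expanded to, possibly an empty string */
const char *square_api_expansion(void)
{
    return STR(SQUARE_API);
}

SQUARE_API
long square(long x)
{
    return x * x;
}
</code></pre></div></div>

<p>Passing the result of <code class="language-plaintext highlighter-rouge">square_api_expansion()</code> to <code class="language-plaintext highlighter-rouge">puts</code> shows the
visibility attribute on an ELF target. Remove the <code class="language-plaintext highlighter-rouge">SQUARE_BUILD</code>
definition and the same call yields an empty string there, matching the
“nothing outside of Windows” case for consumers.</p>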

<h3 id="makefile-ideas">Makefile ideas</h3>

<p>While Mingw-w64 hides a lot of the differences between Windows and
unix-like systems, when it comes to dynamic libraries it can only do so
much, especially if you care about import libraries. If I were maintaining
a dynamic library — unlikely since I strongly prefer embedding or static
linking — I’d probably just use different <a href="/blog/2017/08/20/">Makefiles</a> per toolchain
and target. Aside from the <code class="language-plaintext highlighter-rouge">SQUARE_API</code> type of macros, the source code
can fortunately remain fairly agnostic about it.</p>

<p>Here’s what I might use as <code class="language-plaintext highlighter-rouge">NMakefile</code> for MSVC <code class="language-plaintext highlighter-rouge">nmake</code>:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>     <span class="o">=</span> cl /nologo
<span class="nv">CFLAGS</span> <span class="o">=</span> /Os

<span class="nl">all</span><span class="o">:</span> <span class="nf">main.exe square.dll square.lib</span>

<span class="nl">main.exe</span><span class="o">:</span> <span class="nf">main.c square.h square.lib</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> main.c square.lib

<span class="nl">square.dll</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> /LD <span class="nv">$(CFLAGS)</span> square.c

<span class="nl">square.lib</span><span class="o">:</span> <span class="nf">square.dll</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="p">-</span>del /f main.exe square.dll square.lib square.exp
</code></pre></div></div>

<p>Usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nmake /nologo /f NMakefile
</code></pre></div></div>

<p>For w64devkit and cross-compiling, <code class="language-plaintext highlighter-rouge">Makefile.w64</code>, which includes
import library generation for the sake of MSVC consumers:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>      <span class="o">=</span> cc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Os</span>
<span class="nv">LDFLAGS</span> <span class="o">=</span> <span class="nt">-s</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">main.exe square.dll square.lib</span>

<span class="nl">main.exe</span><span class="o">:</span> <span class="nf">main.c square.dll square.h</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> main.c square.dll <span class="nv">$(LDLIBS)</span>

<span class="nl">square.dll</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> <span class="nt">-shared</span> <span class="nt">-Wl</span>,--out-implib,<span class="err">$</span><span class="o">(</span>@:dll<span class="o">=</span>lib<span class="o">)</span> <span class="se">\</span>
	    <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> square.c <span class="nv">$(LDLIBS)</span>

<span class="nl">square.lib</span><span class="o">:</span> <span class="nf">square.dll</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="nb">rm</span> <span class="nt">-f</span> main.exe square.dll square.lib
</code></pre></div></div>

<p>Usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make -f Makefile.w64
</code></pre></div></div>

<p>And a <code class="language-plaintext highlighter-rouge">Makefile</code> for everyone else:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>      <span class="o">=</span> cc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Os</span> <span class="nt">-fvisibility</span><span class="o">=</span>hidden
<span class="nv">LDFLAGS</span> <span class="o">=</span> <span class="nt">-s</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">main libsquare.so</span>

<span class="nl">main</span><span class="o">:</span> <span class="nf">main.c libsquare.so square.h</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> main.c ./libsquare.so <span class="nv">$(LDLIBS)</span>

<span class="nl">libsquare.so</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> <span class="nt">-shared</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> square.c <span class="nv">$(LDLIBS)</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="nb">rm</span> <span class="nt">-f</span> main libsquare.so
</code></pre></div></div>

<p>Now that I have this article, I’m glad I won’t have to figure this all out
again next time I need it!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>A guide to Windows application development using w64devkit</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/03/11/"/>
    <id>urn:uuid:b04dbe3d-2e79-4afd-ad20-6ce0b232242e</id>
    <updated>2021-03-11T01:40:31Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>There’s a trend of building services where a monolithic application is
better suited, or using JavaScript and Python then being stumped by their
troublesome deployment story. This leads to solutions like <a href="https://deftly.net/posts/2017-06-01-measuring-the-weight-of-an-electron.html">bundling an
entire web browser</a> with an application, or using containers to
circumscribe <a href="https://research.swtch.com/deps">a sprawling dependency tree made of mystery meat</a>.</p>

<p>My <a href="/blog/2020/05/15/">small development distribution</a> for Windows, <a href="https://github.com/skeeto/w64devkit">w64devkit</a>,
is my own little way of pushing back against this trend where it affects
me most. Following in the footsteps of projects like <a href="https://handmadehero.org/">Handmade Hero</a>
and <a href="https://www.youtube.com/playlist?list=PLlaINRtydtNWuRfd4Ra3KeD6L9FP_tDE7">Making a Video Game from Scratch</a>, this is my guide to
no-nonsense software development using my development kit. It’s an
overview of the tooling and development workflow, and I’ve tried not to
assume too much knowledge of the reader. Being a guide rather than a manual,
it is incomplete on its own, and I link to substantial external resources
to fill in the gaps. The guide is capped with a small game I wrote
entirely using my development kit, serving as a demonstration of what
sorts of things are not only possible, but quite reasonably attainable.</p>

<!--more-->

<video src="https://nullprogram.s3.amazonaws.com/asteroids/asteroids.mp4" width="600" height="600" controls="">
</video>

<p>Game repository: <a href="https://github.com/skeeto/asteroids-demo">https://github.com/skeeto/asteroids-demo</a><br />
Guide to source: <a href="https://idle.nprescott.com/2021/understanding-asteroids.html">Understanding Asteroids</a></p>

<h3 id="initial-setup">Initial setup</h3>

<p>Of course you cannot use the development kit if you don’t have it yet. Go
to the <a href="https://github.com/skeeto/w64devkit/releases">releases section</a> and download the latest release. It will be
a .zip file named <code class="language-plaintext highlighter-rouge">w64devkit-x.y.z.zip</code> where <code class="language-plaintext highlighter-rouge">x.y.z</code> is the version.</p>

<p>You will need to unzip the development kit before using it. Windows has
built-in support for .zip files, so you can either right-click to access
“Extract All…” or navigate into it as a folder then drag-and-drop the
<code class="language-plaintext highlighter-rouge">w64devkit</code> directory somewhere outside the .zip file. It doesn’t care
where it’s unzipped (aka it’s “portable”), so put it wherever is
convenient: your desktop, user profile directory, a thumb drive, etc. You
can move it later if you change your mind just so long as you’re not
actively running it. If you decide you don’t need it anymore then delete
it.</p>

<h3 id="entering-the-development-environment">Entering the development environment</h3>

<p>There is a <code class="language-plaintext highlighter-rouge">w64devkit.exe</code> in the unzipped <code class="language-plaintext highlighter-rouge">w64devkit</code> directory. This is
the easiest way to enter the development environment, and will not require
system configuration changes. This program puts the kit’s programs in the
<code class="language-plaintext highlighter-rouge">PATH</code> environment variable then runs a Bourne shell — the standard unix
shell. Aside from the text editor, this is the primary interface for
developing software. In time you may even extend this environment with
your own tools.</p>

<p>If you want an additional “terminal” window, run <code class="language-plaintext highlighter-rouge">w64devkit.exe</code> again. If
you use it a lot, you may want to create a shortcut and even pin it to
your task bar.</p>

<p>Whether on Windows or unix-like systems, when you type a command into the
system shell it uses the <code class="language-plaintext highlighter-rouge">PATH</code> environment variable to locate the actual
program to run for that command. In practice, the <code class="language-plaintext highlighter-rouge">PATH</code> variable is a
concatenation of multiple directories, and the shell searches these
directories in order. On unix-like systems, <code class="language-plaintext highlighter-rouge">PATH</code> elements are separated
by colons. However, Windows uses colons to delimit drive letters, so its
<code class="language-plaintext highlighter-rouge">PATH</code> elements are separated by semicolons.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Prepending to PATH on unix</span>
<span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>

<span class="c"># Prepending to PATH on Windows (w64devkit)</span>
<span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/bin;</span><span class="nv">$PATH</span><span class="s2">"</span>
</code></pre></div></div>
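
<p>The same difference matters to any program that walks <code class="language-plaintext highlighter-rouge">PATH</code> itself.
As a rough sketch using only the standard C library, here’s a helper
that extracts the nth element for a given separator. The names
<code class="language-plaintext highlighter-rouge">PATH_SEP</code> and <code class="language-plaintext highlighter-rouge">path_element</code> are made up for this example.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* PATH separator: semicolon on Windows, colon elsewhere */
#ifdef _WIN32
#  define PATH_SEP ';'
#else
#  define PATH_SEP ':'
#endif

/* Copy the nth (zero-based) element of a PATH-style string into buf.
 * Returns 1 on success, or 0 if there is no such element or it does
 * not fit in len bytes including the null terminator. */
int path_element(const char *path, char sep, int n, char *buf, size_t len)
{
    const char *beg = path;
    for (;;) {
        const char *end = strchr(beg, sep);
        size_t elen = end ? (size_t)(end - beg) : strlen(beg);
        if (n-- == 0) {
            if (elen &gt;= len) {
                return 0;  /* too long for the buffer */
            }
            memcpy(buf, beg, elen);
            buf[elen] = 0;
            return 1;
        }
        if (!end) {
            return 0;  /* ran out of elements */
        }
        beg = end + 1;
    }
}
</code></pre></div></div>

<p>Calling it with <code class="language-plaintext highlighter-rouge">getenv("PATH")</code> and <code class="language-plaintext highlighter-rouge">PATH_SEP</code> walks the directories in
search order. Splitting a Windows-style <code class="language-plaintext highlighter-rouge">PATH</code> on a colon would instead
cut every element at its drive letter, which is why the separators
differ.</p>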

<p>For more advanced users: Rather than use <code class="language-plaintext highlighter-rouge">w64devkit.exe</code>, you could “Edit
environment variables for your account” and manually add w64devkit’s <code class="language-plaintext highlighter-rouge">bin</code>
directory to your <code class="language-plaintext highlighter-rouge">PATH</code>, making the tools generally available everywhere
on your system. If you’ve gone this route, you can start a Bourne shell at
any time with <code class="language-plaintext highlighter-rouge">sh -l</code>. (The <code class="language-plaintext highlighter-rouge">-l</code> option requests a login shell.)</p>

<p>Also borrowed from the unix world is the concept of a <em>home directory</em>,
specified by the <code class="language-plaintext highlighter-rouge">HOME</code> environment variable. By default this will be your
user profile directory, typically <code class="language-plaintext highlighter-rouge">C:/Users/$USER</code>. Login shells always
start in the home directory. This directory is often indicated by tilde
(<code class="language-plaintext highlighter-rouge">~</code>), and many programs automatically expand a leading tilde to the home
directory.</p>
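
<p>Programs that expand a leading tilde do so themselves, typically by
consulting <code class="language-plaintext highlighter-rouge">HOME</code>. Here’s a minimal sketch of the idea, assuming the home
directory is passed in by the caller. The function name <code class="language-plaintext highlighter-rouge">expand_tilde</code>
is hypothetical, and the <code class="language-plaintext highlighter-rouge">~user</code> form is deliberately left alone.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Expand a leading "~" or "~/" in path using the given home directory.
 * Writes the result to buf and returns 1, or returns 0 if it does not
 * fit. Any other path, including "~user", is copied unchanged. */
int expand_tilde(const char *path, const char *home, char *buf, size_t len)
{
    int n;
    if (path[0] == '~' &amp;&amp; (path[1] == '/' || path[1] == 0)) {
        n = snprintf(buf, len, "%s%s", home, path + 1);
    } else {
        n = snprintf(buf, len, "%s", path);
    }
    return n &gt;= 0 &amp;&amp; (size_t)n &lt; len;
}
</code></pre></div></div>

<p>A caller would typically pass <code class="language-plaintext highlighter-rouge">getenv("HOME")</code> as the second argument,
after checking that it is not null.</p>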

<h3 id="shell-basics">Shell basics</h3>

<p>The shell is a command interpreter. It’s named such because <a href="https://www.youtube.com/watch?v=tc4ROCJYbm0&amp;t=4m57s">it was
originally a <em>shell</em> around the operating system kernel</a> — the user
interface to the kernel. Your system’s graphical interface — Windows
Explorer, or <code class="language-plaintext highlighter-rouge">Explorer.exe</code> — is really just a kind of shell, too. That
shell is oriented around the mouse and graphics. This is fine for some
tasks, but a keyboard-oriented command shell is far better suited for
development tasks. It’s more efficient, but more importantly its features
are composable: Complex operations and processes can be <a href="https://www.youtube.com/watch?v=bKzonnwoR2I">constructed
from</a> simple, easy-to-understand tools. Embrace it!</p>

<p>In the shell you can navigate between directories with <code class="language-plaintext highlighter-rouge">cd</code>, make
directories with <code class="language-plaintext highlighter-rouge">mkdir</code>, remove files with <code class="language-plaintext highlighter-rouge">rm</code>, run regular expression
text searches with <code class="language-plaintext highlighter-rouge">grep</code>, etc. Run <code class="language-plaintext highlighter-rouge">busybox</code> to see a listing of the available
standard commands. Unfortunately there are no manual pages, but you can
access basic usage information for any command with <code class="language-plaintext highlighter-rouge">busybox CMD --help</code>.</p>

<p>Windows’ standard command shell is <code class="language-plaintext highlighter-rouge">cmd.exe</code>. Unfortunately this shell is
terrible and exists mostly for legacy compatibility. The intended
replacement is PowerShell for users who regularly use a shell. However,
PowerShell is fundamentally broken, does virtually everything incorrectly,
and manages to be even worse than <code class="language-plaintext highlighter-rouge">cmd.exe</code>. Besides, sticking to POSIX
shell conventions significantly improves build portability, and unix tool
knowledge is transferable to basically every other operating system.</p>

<p>Unix’s standard shell was the Bourne shell, <code class="language-plaintext highlighter-rouge">sh</code>. The shells in use today
are Bourne shell clones with a superset of its features. The most popular
interactive shells are Bash and Zsh. On Linux, dash (Debian Almquist
shell) has become popular for non-interactive use (scripting). The shell
included with w64devkit is the BusyBox fork of the Almquist shell (<code class="language-plaintext highlighter-rouge">ash</code>),
closely related to dash. The Almquist shell has almost no non-interactive
features beyond the standard Bourne shell, and so as far as scripts are
concerned can be regarded as a plain Bourne shell clone. That’s why I
typically refer to it by the name <code class="language-plaintext highlighter-rouge">sh</code>.</p>

<p>However, BusyBox’s Almquist shell has interactive features much like Bash,
and Bash users should be quite comfortable. It’s not just tab-completion
but a slew of Emacs-like keybindings:</p>

<ul>
  <li><kbd>Ctrl-r</kbd>: search backwards in history</li>
  <li><kbd>Ctrl-s</kbd>: search forwards in history</li>
  <li><kbd>Ctrl-p</kbd>: previous command (Up)</li>
  <li><kbd>Ctrl-n</kbd>: next command (Down)</li>
  <li><kbd>Ctrl-a</kbd>: cursor to the beginning of line (Home)</li>
  <li><kbd>Ctrl-e</kbd>: cursor to the end of line (End)</li>
  <li><kbd>Alt-b</kbd>: cursor back one word</li>
  <li><kbd>Alt-f</kbd>: cursor forward one word</li>
  <li><kbd>Ctrl-l</kbd>: clear the screen</li>
  <li><kbd>Alt-d</kbd>: delete word after the cursor</li>
  <li><kbd>Ctrl-w</kbd>: delete the word before the cursor</li>
  <li><kbd>Ctrl-k</kbd>: delete to the end of the line</li>
  <li><kbd>Ctrl-u</kbd>: delete to the beginning of the line</li>
  <li><kbd>Ctrl-f</kbd>: cursor forward one character (Right)</li>
  <li><kbd>Ctrl-b</kbd>: cursor backward one character (Left)</li>
  <li><kbd>Ctrl-d</kbd>: delete character under the cursor (Delete)</li>
  <li><kbd>Ctrl-h</kbd>: delete character before the cursor (Backspace)</li>
</ul>

<p>Take special note of Ctrl-r, which is the most important and powerful
shortcut of the bunch. Frequent use is a good habit. Don’t mash the up
arrow to search through the command history.</p>

<p>Special note for Cygwin and MSYS2 users: the shell is aware of Windows
paths and does not present a virtual unix file system scheme. This has
important consequences for scripting, both good and bad. The shell even
supports backslash as a directory separator, though you should of course
prefer forward slashes.</p>

<h4 id="shell-customization">Shell customization</h4>

<p>Login shells (<code class="language-plaintext highlighter-rouge">-l</code>) evaluate the contents of <code class="language-plaintext highlighter-rouge">~/.profile</code> on startup. This
is your chance to customize the shell configuration, such as setting
environment variables or defining aliases and functions. For instance, if
you wanted the prompt to show the working directory in bold yellow you’d set
<code class="language-plaintext highlighter-rouge">PS1</code> in your <code class="language-plaintext highlighter-rouge">~/.profile</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">PS1</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">printf</span> <span class="s1">'\x1b[33;1m\\w\x1b[0m$ '</span><span class="si">)</span><span class="s2">"</span>
</code></pre></div></div>

<p>If you find yourself using the same command sequences or set of options
again and again, you might consider putting those commands into a script,
and then installing that script somewhere on your <code class="language-plaintext highlighter-rouge">PATH</code> so that you can
run it as a new command. First make a directory to hold your scripts, say
in <code class="language-plaintext highlighter-rouge">~/bin</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/bin
</code></pre></div></div>

<p>In <code class="language-plaintext highlighter-rouge">~/.profile</code> prepend it to your <code class="language-plaintext highlighter-rouge">PATH</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/bin;</span><span class="nv">$PATH</span><span class="s2">"</span>
</code></pre></div></div>

<p>If you don’t want to start a fresh shell to try it out, then load the new
configuration in your current shell:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">source</span> ~/.profile
</code></pre></div></div>

<p>Suppose you keep getting the <code class="language-plaintext highlighter-rouge">tar</code> switches mixed up and you’d like to
just have an <code class="language-plaintext highlighter-rouge">untar</code> command that does the right thing. Create a file
named <code class="language-plaintext highlighter-rouge">untar</code> or <code class="language-plaintext highlighter-rouge">untar.sh</code> in <code class="language-plaintext highlighter-rouge">~/bin</code> with these contents:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="nb">tar</span> <span class="nt">-xaf</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
</code></pre></div></div>

<p>Now a command like <code class="language-plaintext highlighter-rouge">untar something.tar.gz</code> will extract the archive
contents.</p>

<p>To learn more about Bourne shell scripting, the POSIX <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html">shell command
language specification</a> is a good reference. All of the features
listed in that document are available to your shell scripts.</p>

<h3 id="text-editing">Text editing</h3>

<p>The development kit includes the powerful and popular text editor
<a href="https://www.vim.org/">Vim</a>. It takes effort to learn, but is well worth the investment.
It’s packed with features, but since you only need a small number of them
on a regular basis it’s not as daunting as it might appear. Using Vim
effectively, you will write and edit text so much more quickly than
before. That includes not just code, but prose: READMEs, documentation,
etc.</p>

<p>(The catch: Non-modal editing will forever feel frustratingly inefficient.
That’s not because you will become unpracticed at it, or even have trouble
code switching between input styles, but because you’ll now be aware how
bad it is. Ignorance is bliss.)</p>

<p>Vim includes its own tutorial for absolute beginners which you can access
with the <code class="language-plaintext highlighter-rouge">vimtutor</code> command. It will run in the console window and guide
you through the basics in about half an hour. Do not be afraid to return
to the tutorial at any time since this is the stuff you need to know by
heart.</p>

<p>When it comes time to actually use Vim to write code, you can continue
writing code via the terminal interface (<code class="language-plaintext highlighter-rouge">vim</code>), or you can run the
graphical interface (<code class="language-plaintext highlighter-rouge">gvim</code>). The latter is recommended since it has some
nice quality-of-life features, but it’s not strictly necessary. When
starting the GUI, put an ampersand (<code class="language-plaintext highlighter-rouge">&amp;</code>) on the command so that it runs in
the background. For instance this brings up the editor with two files open
but leaves the shell running in the foreground so you can continue using
it while you edit:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gvim main.c Makefile &amp;
</code></pre></div></div>

<p>Vim’s defaults are good but imperfect. Before getting started with
actually editing code you should establish at least the following minimal
configuration in <code class="language-plaintext highlighter-rouge">~/_vimrc</code>. (To understand these better, use <code class="language-plaintext highlighter-rouge">:help</code> to
jump to the built-in documentation.)</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">set</span> <span class="nb">hidden</span> <span class="nb">encoding</span><span class="p">=</span>utf<span class="m">-8</span> <span class="nb">shellslash</span>
<span class="k">filetype</span> plugin <span class="nb">indent</span> <span class="k">on</span>
<span class="nb">syntax</span> <span class="k">on</span>
</code></pre></div></div>

<p>The graphical interface defaults to a white background. Many people prefer
“dark mode” when editing code, so inverting this is simply a matter of
choosing a dark color scheme. Vim comes with a handful of color schemes,
around half of which have dark backgrounds. Use <code class="language-plaintext highlighter-rouge">:colorscheme</code> to change
it, and put it in your <code class="language-plaintext highlighter-rouge">~/_vimrc</code> to persist it.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">colorscheme</span> slate
</code></pre></div></div>

<p>The default graphical interface includes a menu bar and tool bar. There
are better ways to accomplish all these operations, none of which require
touching the mouse, so consider removing all that junk:</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">set</span> <span class="nb">guioptions</span><span class="p">=</span>ac
</code></pre></div></div>

<p>Finally, since the development kit is oriented around C and C++, here’s my
own entire Vim configuration for C which makes it obey my own style:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set cinoptions+=t0,l1,:0 cinkeys-=0#
</code></pre></div></div>

<p>Once you’re comfortable with the basics, the best next step is to read
<a href="https://pragprog.com/titles/dnvim2/practical-vim-second-edition/"><em>Practical Vim: Edit Text at the Speed of Thought</em></a> by Drew Neil.
It’s an opinionated guide to Vim that instills good habits. If you want
something cost-free to whet your appetite, check out <a href="https://www.moolenaar.net/habits.html"><em>Seven habits of
effective text editing</em></a>.</p>

<h3 id="writing-an-application">Writing an application</h3>

<p>We’ve established a shell and text editor. Next is the development
workflow for writing an actual application. Ultimately you will invoke a
compiler from within Vim, which will parse compiler messages and take you
directly to the parts of your source code that need attention. Before we
get that far, let’s start with the basics.</p>

<p>The classic example is the “hello world” program, which we’ll suppose is
in a file called <code class="language-plaintext highlighter-rouge">hello.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"Hello, world!"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>While this development kit provides a version of the GNU compiler, <code class="language-plaintext highlighter-rouge">gcc</code>,
this guide mostly speaks of it in terms of the generic unix C compiler
name, <code class="language-plaintext highlighter-rouge">cc</code>. Unix-like systems install <code class="language-plaintext highlighter-rouge">cc</code> as an alias for the system’s
default C compiler, and w64devkit is no exception.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cc <span class="nt">-o</span> hello.exe hello.c
</code></pre></div></div>

<p>This command creates <code class="language-plaintext highlighter-rouge">hello.exe</code> from <code class="language-plaintext highlighter-rouge">hello.c</code>. The current directory
is not (yet?) on your <code class="language-plaintext highlighter-rouge">PATH</code>, so you must invoke the program via a path name
(i.e. the command must include a slash). Otherwise the shell searches for it
via the <code class="language-plaintext highlighter-rouge">PATH</code> variable and will not find it. Typically this means putting
<code class="language-plaintext highlighter-rouge">./</code> in front of the program name, meaning “run the program in the current
directory”. As a convenience you do not need to include the <code class="language-plaintext highlighter-rouge">.exe</code> extension:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./hello
</code></pre></div></div>

<p>Unlike the <code class="language-plaintext highlighter-rouge">untar</code> shell script from before, this <code class="language-plaintext highlighter-rouge">hello.exe</code> is entirely
independent of w64devkit. You can share it with anyone running Windows and
they’ll be able to execute it. There’s a little bit of runtime embedded in
the executable, but the bulk of the runtime is in the operating system
itself. I want to highlight this point because <em>most programming languages
don’t work like this</em>, or at least doing so is unnatural with lots of
compromises. The users of your software do not need to install a runtime
or other supporting software. They just run the executable you give them!</p>

<p>That executable is probably pretty small, less than 50kB — basically a
miracle by today’s standards. Sure, it’s hardly doing anything right now,
but you can add a whole lot more functionality without that executable
getting much bigger. In fact, it’s entirely unoptimized right now and
could be even smaller. Passing the <code class="language-plaintext highlighter-rouge">-Os</code> flag tells the compiler to
optimize for size and the <code class="language-plaintext highlighter-rouge">-s</code> flag tells the linker to strip out unneeded
information.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cc <span class="nt">-Os</span> <span class="nt">-s</span> <span class="nt">-o</span> hello.exe hello.c
</code></pre></div></div>

<p>That cuts the program down to around a third of its previous size. If
necessary you can still do even better than this, but that’s outside the
scope of this guide.</p>

<p>A program can be valid enough to compile and still contain obvious
mistakes. The compiler can warn about many of these mistakes, and
so it’s always worth enabling these warnings. This requires two flags:
<code class="language-plaintext highlighter-rouge">-Wall</code> (“all” warnings) and <code class="language-plaintext highlighter-rouge">-Wextra</code> (extra warnings).</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cc <span class="nt">-Wall</span> <span class="nt">-Wextra</span> <span class="nt">-o</span> hello.exe hello.c
</code></pre></div></div>

<p>When you’re working on a program, you often don’t want optimization
enabled since it makes it more difficult to debug. However, some warnings
aren’t fired unless optimization is enabled. Fortunately there’s an
optimization level to resolve this, <code class="language-plaintext highlighter-rouge">-Og</code> (optimize for debugging).
Combine this with <code class="language-plaintext highlighter-rouge">-g3</code> to embed debug information in the program. This
will be handy later.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cc <span class="nt">-Wall</span> <span class="nt">-Wextra</span> <span class="nt">-Og</span> <span class="nt">-g3</span> <span class="nt">-o</span> hello.exe hello.c
</code></pre></div></div>

<p>These are the compiler flags you typically want to enable while developing
your software. When you distribute it, you’d use either <code class="language-plaintext highlighter-rouge">-Os -s</code> (optimize
for size) or <code class="language-plaintext highlighter-rouge">-O3 -s</code> (optimize for speed).</p>

<h4 id="makefiles">Makefiles</h4>

<p>I mentioned running the compiler from Vim. This isn’t done directly but
via a special build script called a Makefile. You invoke the <code class="language-plaintext highlighter-rouge">make</code> program
from Vim, which invokes the compiler as above. The simplest Makefile would
look like this, in a file literally named <code class="language-plaintext highlighter-rouge">Makefile</code>:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">hello.exe</span><span class="o">:</span> <span class="nf">hello.c</span>
	<span class="err">cc</span> <span class="err">-Wall</span> <span class="err">-Wextra</span> <span class="err">-Og</span> <span class="err">-g3</span> <span class="err">-o</span> <span class="err">hello.exe</span> <span class="err">hello.c</span>
</code></pre></div></div>

<p>This tells <code class="language-plaintext highlighter-rouge">make</code> that the file named <code class="language-plaintext highlighter-rouge">hello.exe</code> is derived from another
file called <code class="language-plaintext highlighter-rouge">hello.c</code>, and the tab-indented line is the recipe for doing
so. Running the <code class="language-plaintext highlighter-rouge">make</code> command will run the compiler command only if
<code class="language-plaintext highlighter-rouge">hello.exe</code> is missing or <code class="language-plaintext highlighter-rouge">hello.c</code> is newer than <code class="language-plaintext highlighter-rouge">hello.exe</code>.</p>

<p>To run <code class="language-plaintext highlighter-rouge">make</code> from Vim, use the <code class="language-plaintext highlighter-rouge">:make</code> command inside Vim. It will not
only run <code class="language-plaintext highlighter-rouge">make</code> but also capture its output in an internal buffer called
the <em>quickfix list</em>. If there is any warning or error, Vim will jump to
it. Use <code class="language-plaintext highlighter-rouge">:cn</code> (next) and <code class="language-plaintext highlighter-rouge">:cp</code> (prev) to move between issues and correct
them, or <code class="language-plaintext highlighter-rouge">:cc</code> to re-display the current issue. When you’re done fixing
the issues, run <code class="language-plaintext highlighter-rouge">:make</code> again to start the cycle over.</p>

<p>Try that now by changing the printed message and recompiling from within
Vim. Intentionally create an error (bad syntax, too many arguments, etc.)
and see what happens.</p>

<p>Makefiles are a powerful and conventional way to build C and C++ software.
Since the development kit includes the standard set of unix utilities,
it’s very easy to write portable Makefiles that work across a variety of
operating systems and environments. Your software isn’t necessarily tied
to Windows just because you’re using a Windows-based development
environment. If you want to learn how Makefiles work and how to use them
effectively, read <a href="/blog/2017/08/20/"><em>A Tutorial on Portable Makefiles</em></a>. From here on
I’ll assume you’ve read that tutorial.</p>

<p>Ultimately I’d probably write my “hello world” Makefile like so:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nv">CC</span>      <span class="o">=</span> cc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Wall</span> <span class="nt">-Wextra</span> <span class="nt">-Og</span> <span class="nt">-g3</span>
<span class="nv">LDFLAGS</span> <span class="o">=</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>
<span class="nv">EXE</span>     <span class="o">=</span> .exe

<span class="nl">hello$(EXE)</span><span class="o">:</span> <span class="nf">hello.c</span>
	<span class="err">$(CC)</span> <span class="err">$(CFLAGS)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">$@</span> <span class="err">hello.c</span> <span class="err">$(LDLIBS)</span>
</code></pre></div></div>

<p>When building a release, optimize for size or speed:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nv">CFLAGS</span><span class="o">=</span><span class="nt">-Os</span> <span class="nv">LDFLAGS</span><span class="o">=</span><span class="nt">-s</span>
</code></pre></div></div>

<p>This is very much a Windows-first style of Makefile, but it can still be
used comfortably on other systems. On Linux this <code class="language-plaintext highlighter-rouge">make</code> invocation
strips away the <code class="language-plaintext highlighter-rouge">.exe</code> extension:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nv">EXE</span><span class="o">=</span>
</code></pre></div></div>

<p>For a Windows-second Makefile, remove the line with <code class="language-plaintext highlighter-rouge">EXE = .exe</code>. This
allows <code class="language-plaintext highlighter-rouge">EXE</code> to come from the environment. So, for instance, I already
define the <code class="language-plaintext highlighter-rouge">EXE</code> environment variable in my w64devkit <code class="language-plaintext highlighter-rouge">~/.profile</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">EXE</span><span class="o">=</span>.exe
</code></pre></div></div>

<p>On Linux running <code class="language-plaintext highlighter-rouge">make</code> does the right thing, as does running <code class="language-plaintext highlighter-rouge">make</code> on
Windows. No special configuration required.</p>

<p>If my software is truly limited to Windows, I’m likely still interested in
supporting cross-compilation. A common convention for GNU toolchains is a
<code class="language-plaintext highlighter-rouge">CROSS</code> Makefile macro. For example:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nv">CROSS</span>   <span class="o">=</span>
<span class="nv">CC</span>      <span class="o">=</span> <span class="nv">$(CROSS)</span>gcc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Wall</span> <span class="nt">-Wextra</span> <span class="nt">-Og</span> <span class="nt">-g3</span>
<span class="nv">LDFLAGS</span> <span class="o">=</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>

<span class="nl">hello.exe</span><span class="o">:</span> <span class="nf">hello.c</span>
	<span class="err">$(CC)</span> <span class="err">$(CFLAGS)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">$@</span> <span class="err">hello.c</span> <span class="err">$(LDLIBS)</span>
</code></pre></div></div>

<p>On Windows I just run <code class="language-plaintext highlighter-rouge">make</code>, but on Linux I’d set <code class="language-plaintext highlighter-rouge">CROSS</code> appropriately.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nv">CROSS</span><span class="o">=</span>x86_64-w64-mingw32-
</code></pre></div></div>

<h4 id="navigating">Navigating</h4>

<p>What happens if you’re working on a larger program and you need to jump to
the definition of a function, macro, or variable? It would be tedious to
use <code class="language-plaintext highlighter-rouge">grep</code> all the time to find definitions. The development kit includes
a solid implementation of <code class="language-plaintext highlighter-rouge">ctags</code> for building a <em>tags database</em> that lists the
locations for various kinds of definitions, and Vim knows how to read this
database. Most often you’ll want to run it recursively like so:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctags <span class="nt">-R</span>
</code></pre></div></div>

<p>You can of course do this from Vim, too: <code class="language-plaintext highlighter-rouge">:!ctags -R</code></p>

<p>With the cursor over an identifier, press <code class="language-plaintext highlighter-rouge">CTRL-]</code> to jump to a definition
for that name. Use <code class="language-plaintext highlighter-rouge">:tn</code> and <code class="language-plaintext highlighter-rouge">:tp</code> to move between different definitions
(e.g. when the name is overloaded). Or if you have a tag in mind rather
than a name listed in the buffer, use the <code class="language-plaintext highlighter-rouge">:tag</code> command to jump by name.
Vim maintains a tag stack and jump list for going back and forth, like the
backward and forward buttons in a browser.</p>

<h4 id="debugging">Debugging</h4>

<p>I had mentioned that the <code class="language-plaintext highlighter-rouge">-g3</code> option embeds extra information in the
executable. This is for debuggers, and the development kit includes the
GNU Debugger, <code class="language-plaintext highlighter-rouge">gdb</code>, to help you debug your programs. To use it, invoke
GDB on your executable:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb hello.exe
</code></pre></div></div>

<p>From here you can set breakpoints and such, then run the program with
<code class="language-plaintext highlighter-rouge">start</code> or <code class="language-plaintext highlighter-rouge">run</code>, then <code class="language-plaintext highlighter-rouge">step</code> through it line by line. See <a href="https://beej.us/guide/bggdb/"><em>Beej’s Quick
Guide to GDB</em></a> for a guide. During development, always run your
program through GDB, and never exit GDB. See also: <a href="/blog/2022/06/26/"><em>Assertions should be
more debugger-oriented</em></a>.</p>

<h4 id="learning-c-and-c">Learning C and C++</h4>

<p>So far this guide hasn’t actually assumed any C knowledge. One of the best
ways to learn C is by reading the highly-regarded <a href="https://en.wikipedia.org/wiki/The_C_Programming_Language"><em>The C Programming
Language</em></a> and doing the exercises. Alternatively, cost-free options
are <a href="http://beej.us/guide/bgc/"><em>Beej’s Guide to C Programming</em></a> and <a href="https://modernc.gforge.inria.fr/"><em>Modern C</em></a> (more
advanced). You can use the development kit to go through any of these.</p>

<p>I’ve focused on C, but everything above also applies to C++. To learn C++
<a href="https://www.stroustrup.com/tour2.html"><em>A Tour of C++</em></a> is a safe bet.</p>

<h3 id="demonstration">Demonstration</h3>

<p>To illustrate how much you can do with nothing more than this 76MB
development kit, here’s a taste in the form of a weekend project: an
<a href="https://github.com/skeeto/asteroids-demo">Asteroids Clone for Windows</a>. That’s the game in the video at the
top of this guide.</p>

<p>The development kit doesn’t include Git so you’d need to install it
separately in order to clone the repository, but you could at least skip
that and download a .zip snapshot of the source. It has no third-party
dependencies yet it includes hardware-accelerated graphics, real-time
sound mixing, and gamepad input. Building a larger and more complex game
is much less about tooling and more about time and skill. That’s what I
mean about w64devkit being <a href="/blog/2020/09/25/">(almost) everything you need</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Well-behaved alias commands on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/02/08/"/>
    <id>urn:uuid:d1c90d96-3696-4183-a52b-b10598a630c7</id>
    <updated>2021-02-08T20:32:45Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="trick"/>
    <content type="html">
      <![CDATA[<p>Since its inception I’ve faced a dilemma with <a href="https://github.com/skeeto/w64devkit">w64devkit</a>, my
<a href="/blog/2020/09/25/">all-in-one</a> Mingw-w64 toolchain and <a href="/blog/2020/05/15/">development environment
distribution for Windows</a>. A major goal of the project is no
installation: unzip anywhere and it’s ready to go as-is. However, full
functionality requires alias commands, particularly for BusyBox applets,
and the usual solutions are neither available nor viable. It seemed that
an installer was needed to assemble this last puzzle piece. This past
weekend I finally discovered a tidy and complete solution that solves this
problem for good.</p>

<p>That solution is a small C source file, <a href="https://github.com/skeeto/w64devkit/blob/master/src/alias.c"><code class="language-plaintext highlighter-rouge">alias.c</code></a>. This article is
about why it’s necessary and how it works.</p>

<h3 id="hard-and-symbolic-links">Hard and symbolic links</h3>

<p>Some alias commands are for convenience, such as a <code class="language-plaintext highlighter-rouge">cc</code> alias for <code class="language-plaintext highlighter-rouge">gcc</code> so
that build systems need not assume any particular C compiler. Others are
essential, such as an <code class="language-plaintext highlighter-rouge">sh</code> alias for “<code class="language-plaintext highlighter-rouge">busybox sh</code>” so that it’s available
as a shell for <code class="language-plaintext highlighter-rouge">make</code>. These aliases are usually created with links, hard
or symbolic. A GCC installation might include (roughly) a symbolic link
created like so:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">ln</span> <span class="nt">-s</span> gcc cc
</code></pre></div></div>

<p>BusyBox looks at its <code class="language-plaintext highlighter-rouge">argv[0]</code> on startup, and if it names an applet
(<code class="language-plaintext highlighter-rouge">ls</code>, <code class="language-plaintext highlighter-rouge">sh</code>, <code class="language-plaintext highlighter-rouge">awk</code>, etc.), it behaves like that applet. Typically BusyBox
aliases are installed as hard links to the original binary, and there’s
even a <code class="language-plaintext highlighter-rouge">busybox --install</code> to set these up. Both kinds of aliases are
cheap and effective.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">ln </span>busybox sh
<span class="nb">ln </span>busybox <span class="nb">ls
ln </span>busybox <span class="nb">awk</span>
</code></pre></div></div>
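
<p>As a rough illustration of that <code class="language-plaintext highlighter-rouge">argv[0]</code> dispatch, here’s a minimal
sketch of a multi-call binary. It is not BusyBox’s code: the applet table,
<code class="language-plaintext highlighter-rouge">base_name</code>, and <code class="language-plaintext highlighter-rouge">find_applet</code> are hypothetical names made up for this
example, and the applets themselves are stubs.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stub applets, for illustration only. */
static int applet_ls(void)  { puts("ls applet");  return 0; }
static int applet_sh(void)  { puts("sh applet");  return 0; }
static int applet_awk(void) { puts("awk applet"); return 0; }

struct applet {
    const char *name;
    int (*run)(void);
};

static const struct applet applets[] = {
    {"ls", applet_ls}, {"sh", applet_sh}, {"awk", applet_awk},
};

/* Strip any directory prefix, accepting both kinds of slashes. */
static const char *base_name(const char *path)
{
    const char *base = path;
    for (const char *p = path; *p; p++) {
        if (*p == '/' || *p == '\\') {
            base = p + 1;
        }
    }
    return base;
}

/* Return the applet named by argv0, or NULL if it names none. A
 * trailing ".exe" (lowercase only, in this sketch) is ignored. */
const struct applet *find_applet(const char *argv0)
{
    assert(argv0 != NULL);
    const char *base = base_name(argv0);
    size_t len = strlen(base);
    if (len > 4 && !strcmp(base + len - 4, ".exe")) {
        len -= 4;
    }
    for (size_t i = 0; i < sizeof(applets)/sizeof(*applets); i++) {
        if (strlen(applets[i].name) == len &&
            !memcmp(applets[i].name, base, len)) {
            return &applets[i];
        }
    }
    return NULL;
}
```

<p>A <code class="language-plaintext highlighter-rouge">main</code> built on this would call <code class="language-plaintext highlighter-rouge">find_applet(argv[0])</code> and run the
result, which is why a link named <code class="language-plaintext highlighter-rouge">ls</code> and a link named <code class="language-plaintext highlighter-rouge">awk</code> can share
one binary yet behave differently.</p>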

<p>Unfortunately links are not supported by .zip files on Windows. They’d
need to be created by a dedicated installer. As a result, I’ve strongly
recommended that users run “<code class="language-plaintext highlighter-rouge">busybox --install</code>” at some point to
establish the BusyBox alias commands. While w64devkit works without them,
it works better with them. Still, that’s an installation step!</p>

<p>An alternative option is to simply include a full copy of the BusyBox
binary for each applet — all 150 of them — simulating hard links. BusyBox
is small, around 4kB per applet on average, but it’s not quite <em>that</em>
small. Since the .zip format doesn’t use block compression — files are
compressed individually — this duplication will appear in the .zip itself.
My 573kB BusyBox build duplicated 150 times would double the distribution
size and increase the installation footprint by 25%. It’s not worth the
cost.</p>

<p>Since .zip is so limited, perhaps I should use a different distribution
format that supports links. However, another w64devkit goal is making no
assumptions about what other tools are installed. Windows natively
supports .zip, even if that support isn’t so great (poor performance, low
composability, missing features, etc.). With nothing more than the
w64devkit .zip on a fresh, offline Windows installation, you can begin
efficiently developing professional, native applications in under a
minute.</p>

<h3 id="scripts-as-aliases">Scripts as aliases</h3>

<p>With links off the table, the next best option is a shell script. On
unix-like systems shell scripts are an effective tool for creating complex
alias commands. Unlike links, they can manipulate the argument list. For
instance, w64devkit includes a <code class="language-plaintext highlighter-rouge">c99</code> alias to invoke the C compiler
configured to use the C99 standard. To do this with a shell script:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">exec </span>cc <span class="nt">-std</span><span class="o">=</span>c99 <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
</code></pre></div></div>

<p>This prepends <code class="language-plaintext highlighter-rouge">-std=c99</code> to the argument list and passes through the rest
untouched via the Bourne shell’s special case <code class="language-plaintext highlighter-rouge">"$@"</code>. Because I used
<code class="language-plaintext highlighter-rouge">exec</code>, the shell process <em>becomes</em> the compiler in place. The shell
doesn’t hang around in the background. It’s just gone. This is really quite
elegant and powerful.</p>

<p>The closest available on Windows is a .bat batch file. However, like some
other parts of DOS and Windows, the Batch language was designed as though
its designer once glimpsed someone using a unix shell, perhaps looking
over their shoulder, then copied some of the ideas without understanding
them. As a result, it’s not nearly as useful or powerful. Here’s the Batch
equivalent:</p>

<div class="language-bat highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@cc <span class="na">-std</span><span class="o">=</span><span class="kd">c99</span> <span class="err">%</span><span class="o">*</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">@</code> is necessary because Batch prints its commands by default (Bourne
shell’s <code class="language-plaintext highlighter-rouge">-x</code> option), and <code class="language-plaintext highlighter-rouge">@</code> disables it. Windows lacks the concept of
<code class="language-plaintext highlighter-rouge">exec(3)</code>, so the Batch file interpreter <code class="language-plaintext highlighter-rouge">cmd.exe</code> continues running alongside
the compiler. A little wasteful but that hardly matters. What does matter
though is that <code class="language-plaintext highlighter-rouge">cmd.exe</code> doesn’t behave itself! If you, say, Ctrl+C to
cancel compilation, you will get the infamous “Terminate batch job (Y/N)?”
prompt which interferes with other programs running in the same console.
The so-called “batch” script isn’t a batch job at all: It’s interactive.</p>

<p>I tried to use Batch files for BusyBox applets, but this issue came up
constantly and made this approach impractical. Nearly all BusyBox applets
are non-interactive, and lots of things break when they aren’t. Worst of
all, you can easily end up with layers of <code class="language-plaintext highlighter-rouge">cmd.exe</code> clobbering each other
to ask if they should terminate. It was frustrating.</p>

<p>The prompt is hardcoded in <code class="language-plaintext highlighter-rouge">cmd.exe</code> and cannot be disabled. Since so much
depends on <code class="language-plaintext highlighter-rouge">cmd.exe</code> remaining exactly the way it is, Microsoft will never
alter this behavior either. After all, that’s why they made PowerShell a
new, separate tool.</p>

<p>Speaking of PowerShell, could we use that instead? Unfortunately not:</p>

<ol>
  <li>
    <p>It’s installed by default on Windows, but is not necessarily enabled.
One of my own use cases for w64devkit involves systems where PowerShell
is disabled by policy. A common policy is it can be used interactively
but not run scripts (“Running scripts is disabled on this system”).</p>
  </li>
  <li>
    <p>PowerShell is not a first class citizen on Windows, and will likely
never be. Even under the friendliest policy it’s not normally possible
to put a PowerShell script on the <code class="language-plaintext highlighter-rouge">PATH</code> and run it by name. (I’m sure
there are ways to make this work via system-wide configuration, but
that’s off the table.)</p>
  </li>
  <li>
    <p>Everything in PowerShell is broken. For example, it does not support
input redirection with files, and instead you must use the <code class="language-plaintext highlighter-rouge">cat</code>-like
command, <code class="language-plaintext highlighter-rouge">Get-Content</code>, to pipe file contents. However, <code class="language-plaintext highlighter-rouge">Get-Content</code>
translates its input and quietly damages your data. There is no way to
disable this “feature” in the version of PowerShell that ships with
Windows, meaning it cannot accomplish the simplest of tasks. This is
just one of many ways that PowerShell is broken beyond usefulness.</p>
  </li>
</ol>

<p>Item (2) also affects w64devkit. It has a Bourne shell, but shell scripts
are still not first class citizens since Windows doesn’t know what to do
with them. Fixing that would require system-wide configuration, antithetical to
the philosophy of the project.</p>

<h3 id="solution-compiled-shell-scripts">Solution: compiled shell “scripts”</h3>

<p>My working solution is inspired by an insanely clever hack used by my
favorite media player, <a href="https://mpv.io/">mpv</a>. The Windows build is strange at first
glance, containing two binaries, <code class="language-plaintext highlighter-rouge">mpv.exe</code> (large) and <code class="language-plaintext highlighter-rouge">mpv.com</code> (tiny).
Is that COM as in <a href="/blog/2014/12/09/">an old-school 16-bit DOS binary</a>? No, that’s just
a trick that works around a Windows limitation.</p>

<p>The Windows technology is broken up into subsystems. Console programs run
in the Console subsystem. Graphical programs run in the Windows subsystem.
<a href="/blog/2017/11/30/">The original WSL</a> was a subsystem. Unfortunately this design means
that a program must statically pick a subsystem, hardcoded into the binary
image. The program cannot select a subsystem dynamically. For example,
this is why Java installations have both <code class="language-plaintext highlighter-rouge">java.exe</code> and <code class="language-plaintext highlighter-rouge">javaw.exe</code>, and
Emacs has <code class="language-plaintext highlighter-rouge">emacs.exe</code> and <code class="language-plaintext highlighter-rouge">runemacs.exe</code>. Different binaries for different
subsystems.</p>
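
<p>To make “hardcoded into the binary image” concrete, here’s a minimal
sketch that reads the subsystem out of a PE image already loaded into
memory. It assumes the standard PE layout, where the 16-bit Subsystem field
sits 68 bytes into the optional header for both PE32 and PE32+. The function
and enum names are my own for this example, and validation is deliberately
thin.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    SUBSYSTEM_WINDOWS_GUI = 2,  /* the "Windows" subsystem */
    SUBSYSTEM_WINDOWS_CUI = 3   /* the "Console" subsystem */
};

/* Read a little-endian unsigned integer of n bytes (n <= 4). */
static unsigned long read_le(const unsigned char *p, int n)
{
    unsigned long v = 0;
    for (int i = n - 1; i >= 0; i--) {
        v = v<<8 | p[i];
    }
    return v;
}

/* Return the Subsystem field of the PE image in buf, or -1 if the
 * buffer is too short or does not look like a PE image. */
int pe_subsystem(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 0x40 || memcmp(buf, "MZ", 2)) {
        return -1;
    }
    /* The 32-bit value at offset 0x3c locates the PE signature. */
    unsigned long pe = read_le(buf + 0x3c, 4);
    /* Need: signature (4) + COFF header (20) + optional header up to
     * and including the Subsystem field (68 + 2). */
    if (pe > len || len - pe < 4 + 20 + 70) {
        return -1;
    }
    if (memcmp(buf + pe, "PE\0\0", 4)) {
        return -1;
    }
    return (int)read_le(buf + pe + 4 + 20 + 68, 2);
}
```

<p>Fed the first few hundred bytes of an <code class="language-plaintext highlighter-rouge">.exe</code>, this should return 3 for
a Console program and 2 for a Windows program. With a Mingw-w64 GCC that
choice is made at link time via <code class="language-plaintext highlighter-rouge">-mconsole</code> or <code class="language-plaintext highlighter-rouge">-mwindows</code>.</p>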

<p>On Linux, a program that wants to do graphics just talks to the Xorg
server or Wayland compositor. It can dynamically choose to be a terminal
application or a graphical application. Or even both at once. This is
exactly the behavior of <code class="language-plaintext highlighter-rouge">mpv</code>, and it faces a dilemma on Windows: With
subsystems, how can it be both?</p>

<p>The trick is based on the environment variable <code class="language-plaintext highlighter-rouge">PATHEXT</code> which tells
Windows how to prioritize executables with the same base name but
different file extensions. If I type <code class="language-plaintext highlighter-rouge">mpv</code> and it finds both <code class="language-plaintext highlighter-rouge">mpv.exe</code> and
<code class="language-plaintext highlighter-rouge">mpv.com</code>, which binary will run? It will be the first listed in
<code class="language-plaintext highlighter-rouge">PATHEXT</code>, and by default that starts with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PATHEXT=.COM;.EXE;.BAT;...
</code></pre></div></div>

<p>So it will run <code class="language-plaintext highlighter-rouge">mpv.com</code>, which is actually a plain old <a href="https://wiki.osdev.org/PE">PE+</a> <code class="language-plaintext highlighter-rouge">.exe</code>
in disguise. The Windows subsystem <code class="language-plaintext highlighter-rouge">mpv.exe</code> gets the shortcut and file
associations while Console subsystem <code class="language-plaintext highlighter-rouge">mpv.com</code> catches command line
invocations and serves as console liaison as it invokes the real
<code class="language-plaintext highlighter-rouge">mpv.exe</code>. Ingenious!</p>
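
<p>The priority rule can be modeled in a few lines of C. This is only a
sketch of the ordering described above, not how Windows implements it: real
lookup also walks directories, and <code class="language-plaintext highlighter-rouge">pathext_pick</code> and <code class="language-plaintext highlighter-rouge">ext_equal</code> are
hypothetical names for this example.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Compare the n-byte PATHEXT entry at a against the string b, ignoring
 * ASCII case. Returns 1 on a match, otherwise 0. */
static int ext_equal(const char *a, size_t n, const char *b)
{
    for (size_t i = 0; i < n; i++) {
        if (!b[i] || tolower((unsigned char)a[i]) !=
                     tolower((unsigned char)b[i])) {
            return 0;
        }
    }
    return b[n] == 0;
}

/* Given a PATHEXT-style list and the extensions found for some base
 * name, return the index into found[] of the extension that takes
 * priority, or -1 if none of them is listed. */
int pathext_pick(const char *pathext, const char *const *found, int nfound)
{
    assert(pathext != NULL);
    const char *p = pathext;
    while (*p) {
        /* Measure the next semicolon-delimited entry. */
        size_t n = 0;
        while (p[n] && p[n] != ';') {
            n++;
        }
        for (int i = 0; n && i < nfound; i++) {
            if (ext_equal(p, n, found[i])) {
                return i;
            }
        }
        p += n;
        if (*p == ';') {
            p++;
        }
    }
    return -1;
}
```

<p>With <code class="language-plaintext highlighter-rouge">.exe</code> and <code class="language-plaintext highlighter-rouge">.com</code> both present, the default list selects <code class="language-plaintext highlighter-rouge">.com</code>
because it’s listed first, regardless of the order in which the files were
found.</p>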

<p>I realized I can pull a similar trick to create command aliases — not the
<code class="language-plaintext highlighter-rouge">.com</code> trick, but the miniature flagger program. If only I could compile
each of those Batch files to tiny, well-behaved <code class="language-plaintext highlighter-rouge">.exe</code> files so that they
wouldn’t rely on the badly-behaved <code class="language-plaintext highlighter-rouge">cmd.exe</code>…</p>

<h4 id="tiny-c-programs">Tiny C programs</h4>

<p>Years ago <a href="/blog/2016/01/31/">I wrote about tiny, freestanding Windows executables</a>.
That research paid off here since that’s exactly what I want. The alias
command program need only manipulate its command line, invoke another
program, then wait for it to finish. This doesn’t require the C library,
just a handful of <code class="language-plaintext highlighter-rouge">kernel32.dll</code> calls. My alias command programs can be
so small that it would no longer matter that I have 150 of them, and I get
complete control over their behavior.</p>

<p>To compile, I use <code class="language-plaintext highlighter-rouge">-nostdlib</code> and <code class="language-plaintext highlighter-rouge">-ffreestanding</code> to disable all system
libraries, <code class="language-plaintext highlighter-rouge">-lkernel32</code> to pull that one back in, <code class="language-plaintext highlighter-rouge">-Os</code> (optimize for
size), and <code class="language-plaintext highlighter-rouge">-s</code> (strip) all to make the result as small as possible.</p>

<p>I don’t want to write a little program for each alias command. Instead
I’ll use a couple of C defines, <code class="language-plaintext highlighter-rouge">EXE</code> and <code class="language-plaintext highlighter-rouge">CMD</code>, to inject the target
command at compile time. So this Batch file:</p>

<div class="language-bat highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@target <span class="kd">arg1</span> <span class="kd">arg2</span> <span class="err">%</span><span class="o">*</span>
</code></pre></div></div>

<p>Is equivalent to this alias compilation:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc <span class="nt">-DEXE</span><span class="o">=</span><span class="s2">"target.exe"</span> <span class="nt">-DCMD</span><span class="o">=</span><span class="s2">"target arg1 arg2"</span> <span class="se">\</span>
    <span class="nt">-s</span> <span class="nt">-Os</span> <span class="nt">-nostdlib</span> <span class="nt">-ffreestanding</span> <span class="nt">-o</span> alias.exe alias.c <span class="nt">-lkernel32</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">EXE</code> string is the actual <em>module</em> name, so the <code class="language-plaintext highlighter-rouge">.exe</code> extension is
required. The <code class="language-plaintext highlighter-rouge">CMD</code> string replaces the first complete token of the
command line string (think <code class="language-plaintext highlighter-rouge">argv[0]</code>) and may contain arbitrary additional
arguments (e.g. <code class="language-plaintext highlighter-rouge">-std=c99</code>). Both are handled as wide strings (<code class="language-plaintext highlighter-rouge">L"..."</code>)
since the alias program uses the wide Win32 API in order to be fully
transparent. Though unfortunately at this time it makes no difference: All
currently aliased programs use the “ANSI” API since the underlying C and
C++ standard libraries only use the ANSI API. (As far as I know, nobody
has ever written fully-functional C and C++ standard libraries for
Windows, not even Microsoft.)</p>
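
<p>As a minimal sketch of how the two defines could become wide strings:
the shell strips the quotes in the command above, so the values arrive as
bare tokens, and the source has to stringify and widen them itself. The
macro and variable names here are hypothetical and chosen for illustration;
they are not necessarily what <code class="language-plaintext highlighter-rouge">alias.c</code> uses.</p>

```c
#include <wchar.h>

// Defaults for illustration only. Normally these are supplied on the
// compiler command line with -DEXE=... and -DCMD=...
#ifndef EXE
#  define EXE target.exe
#endif
#ifndef CMD
#  define CMD target arg1 arg2
#endif

// Two macro levels so that EXE and CMD are expanded first, then
// stringified (#), then given the wide-string prefix (L ## "...").
#define STR_(s)   #s
#define STR(s)    STR_(s)
#define WIDE_(s)  L ## s
#define WIDE(s)   WIDE_(s)

// Hypothetical names for the two compile-time strings.
static const wchar_t alias_exe[] = WIDE(STR(EXE));  // L"target.exe"
static const wchar_t alias_cmd[] = WIDE(STR(CMD));  // L"target arg1 arg2"
```

<p>Stringification collapses any run of whitespace between tokens into a
single space, which is harmless for a command line.</p>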

<p>You might wonder why the heck I’m gluing strings together for the
arguments. These will need to be parsed (word split, etc.) by someone
else, so shouldn’t I construct an argv array instead? That’s not how it
works on Windows: Programs receive a flat command string and are expected
to parse it themselves following <a href="https://docs.microsoft.com/en-us/previous-versions/17w5ykft(v=vs.85)">the format specification</a>. When
you write a C program, the C runtime does this for you to provide the
usual argv array.</p>

<p>This is upside down. The caller creating the process already has arguments
split into an argv array — or something like it — but Win32 requires the
caller to encode the argv array as a string following a special format so
that the recipient can immediately decode it. Why marshal rather than
pass structured data in the first place? Why does Win32 only supply a
decoder (<a href="https://docs.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-commandlinetoargvw"><code class="language-plaintext highlighter-rouge">CommandLineToArgvW</code></a>) and not an encoder (e.g. the missing
<code class="language-plaintext highlighter-rouge">ArgvToCommandLine</code>)? Hey, I don’t make the rules; I just have to live
with them.</p>
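
<p>For a sense of what such an encoder involves, here is a rough sketch
using the commonly described quoting rules: backslashes are literal unless
they precede a double quote, in which case they must be doubled. It is
written with narrow strings and the C library for brevity, treats every
argument the same way (the real rules for the program name differ
slightly), and all names in it are hypothetical. It is an illustration
under those assumptions, not a vetted implementation.</p>

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    char  *buf;
    size_t len, cap;
    int    overflow;
} cmdline;

// Append count copies of ch, always leaving room for a terminator.
static void put(cmdline *c, char ch, size_t count)
{
    while (count--) {
        if (c->len + 1 < c->cap) {
            c->buf[c->len++] = ch;
        } else {
            c->overflow = 1;
        }
    }
}

// Append one argument, quoting it only when necessary.
static void append_arg(cmdline *c, const char *arg)
{
    if (c->len) {
        put(c, ' ', 1);
    }
    if (*arg && !strpbrk(arg, " \t\n\v\"")) {
        for (; *arg; arg++) {
            put(c, *arg, 1);
        }
        return;
    }
    put(c, '"', 1);
    for (;;) {
        size_t slashes = 0;
        while (*arg == '\\') {
            slashes++;
            arg++;
        }
        if (!*arg) {
            put(c, '\\', slashes*2);      // protect the closing quote
            break;
        } else if (*arg == '"') {
            put(c, '\\', slashes*2 + 1);  // escape the slashes and the quote
            put(c, '"', 1);
        } else {
            put(c, '\\', slashes);        // backslashes are literal here
            put(c, *arg, 1);
        }
        arg++;
    }
    put(c, '"', 1);
}

// Hypothetical stand-in for the missing ArgvToCommandLine.
// Returns 1 on success, 0 if the result did not fit in buf.
static int argv_to_command_line(char *buf, size_t cap, int argc, char **argv)
{
    cmdline c = {buf, 0, cap, 0};
    if (!cap) {
        return 0;
    }
    for (int i = 0; i < argc; i++) {
        append_arg(&c, argv[i]);
    }
    buf[c.len] = 0;
    return !c.overflow;
}
```

<p>The subtle part is the trailing backslash: an argument ending in a
backslash that needs quoting must have that backslash doubled, otherwise
it would escape the closing quote.</p>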

<p>You can look at the original source for the details, but the summary is
that I supply my own <code class="language-plaintext highlighter-rouge">xstrlen()</code>, <code class="language-plaintext highlighter-rouge">xmemcpy()</code>, and partial Win32 command
line parser — just enough to identify the first token, even if that token
is quoted. It glues the strings together, calls <code class="language-plaintext highlighter-rouge">CreateProcessW</code>, waits
for it to exit (<code class="language-plaintext highlighter-rouge">WaitForSingleObject</code>), retrieves the exit code
(<code class="language-plaintext highlighter-rouge">GetExitCodeProcess</code>), and exits with the same status. (The stuff that
comes for free with <code class="language-plaintext highlighter-rouge">exec(3)</code>.)</p>
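
<p>The string gluing described above can be sketched roughly as follows.
This is a simplified, portable illustration: it uses narrow strings and
<code class="language-plaintext highlighter-rouge">string.h</code> where the freestanding original needs its own helpers and
wide strings, it assumes the first token ends at the first unquoted space
or tab, and the function names are hypothetical.</p>

```c
#include <stddef.h>
#include <string.h>

// Return a pointer just past the first token (think argv[0]) of a
// command line. Double quotes toggle a quoted region in which
// whitespace does not end the token.
static const char *skip_first_token(const char *s)
{
    int quoted = 0;
    for (; *s; s++) {
        if (*s == '"') {
            quoted = !quoted;
        } else if (!quoted && (*s == ' ' || *s == '\t')) {
            break;
        }
    }
    return s;
}

// Replace the first token of cmdline with cmd, writing the result to
// dst. Returns 1 on success, 0 if dst is too small.
static int build_command(char *dst, size_t cap,
                         const char *cmd, const char *cmdline)
{
    const char *rest = skip_first_token(cmdline);
    size_t n = strlen(cmd);
    size_t m = strlen(rest);
    if (n + m + 1 > cap) {
        return 0;
    }
    memcpy(dst, cmd, n);
    memcpy(dst + n, rest, m + 1);  // includes the terminator
    return 1;
}
```

<p>Everything after the first token is copied verbatim, whitespace and
all, so the target program sees exactly the arguments the user typed.</p>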

<p>This all compiles to a 4kB executable, mostly padding, which is small
enough for my purposes. These compress to an acceptable 1kB each in the
.zip file. Smaller would be nicer, but this would require at minimum a
custom linker script, and even smaller would require hand-crafted
assembly.</p>

<p>This lingering issue solved, w64devkit now works better than ever. The
<code class="language-plaintext highlighter-rouge">alias.c</code> source is included in the kit in case you need to make any of
your own well-behaved alias commands.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>w64devkit: (Almost) Everything You Need</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/09/25/"/>
    <id>urn:uuid:e594c82d-a2e1-4035-8527-1b998045ceeb</id>
    <updated>2020-09-25T00:04:11Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="rant"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=24586556">on Hacker News</a>.</em></p>

<p><a href="/blog/2020/05/15/">This past May</a> I put together my own C and C++ development
distribution for Windows called <a href="https://github.com/skeeto/w64devkit"><strong>w64devkit</strong></a>. The <em>entire</em>
release weighs under 80MB and requires no installation. Unzip and run it
in-place anywhere. It’s also entirely offline. It will never
automatically update, or even touch the network. In mere seconds any
Windows system can become a reliable development machine. (To further
increase reliability, <a href="https://jacquesmattheij.com/why-johnny-wont-upgrade/">disconnect it from the internet</a>.) Despite
its simple nature and small packaging, w64devkit is <em>almost</em> everything
you need to develop <em>any</em> professional desktop application, from a
command line utility to a AAA game.</p>

<!--more-->

<p>I don’t mean this in some <a href="/blog/2016/04/30/">useless Turing-complete sense</a>, but in
a practical, <em>get-stuff-done</em> sense. It’s much more a matter of
<em>know-how</em> than of tools or libraries. So then what is this “almost”
about?</p>

<ul>
  <li>
    <p>The distribution does not have WinAPI documentation. It’s notoriously
<a href="http://laurencejackson.com/win32/">difficult to obtain</a> and, besides, unfriendly to redistribution.
It’s essential for interfacing with the operating system and difficult
to work without. Even a dead tree reference book would suffice.</p>
  </li>
  <li>
    <p>Depending on what you’re building, you may still need specialized
tools. For instance, game development requires <a href="https://www.blender.org/">tools for editing art
assets</a>.</p>
  </li>
  <li>
    <p>There is no formal source control system. Git is excluded per the
issues noted in the announcement, and my next option, <a href="https://wiki.debian.org/UsingQuilt">Quilt</a>,
has similar limitations. However, <code class="language-plaintext highlighter-rouge">diff</code> and <code class="language-plaintext highlighter-rouge">patch</code> <em>are</em> included,
and are sufficient for a kind of old-school, patch-based source
control. I’ve used it successfully when dogfooding w64devkit in a
fresh Windows installation.</p>
  </li>
</ul>

<h3 id="everything-else">Everything else</h3>

<p>As I said in my announcement, w64devkit includes a powerful text editor
that fulfills all text editing needs, from code to documentation. The
editor includes a tutorial (<code class="language-plaintext highlighter-rouge">vimtutor</code>) and complete, built-in manual
(<code class="language-plaintext highlighter-rouge">:help</code>) in case you’re not yet familiar with it.</p>

<p>What about navigation? Use the included <a href="https://github.com/universal-ctags/ctags">ctags</a> to generate a
tags database (<code class="language-plaintext highlighter-rouge">ctags -R</code>), then <a href="http://vimdoc.sourceforge.net/htmldoc/tagsrch.html#tagsrch.txt">jump instantly</a> to any
definition at any time. No need for <a href="https://old.reddit.com/r/vim/comments/b3yzq4/a_lsp_client_maintainers_view_of_the_lsp_protocol/">that Language Server Protocol
rubbish</a>. This does not mean you must laboriously type identifiers
as you work. Use <a href="https://georgebrock.github.io/talks/vim-completion/">built-in completion</a>!</p>

<p>Build system? That’s also covered, via a Windows-aware unix-like
environment that includes <code class="language-plaintext highlighter-rouge">make</code>. <a href="/blog/2017/08/20/">Learning how to use it</a> is a
breeze. Software is by its nature unavoidably complicated, so <a href="/blog/2017/03/30/">don’t
make it more complicated than necessary</a>.</p>

<p>What about debugging? Use the debugger, GDB. Performance problems? Use
the profiler, gprof. Inspect compiler output either by asking for it
(<code class="language-plaintext highlighter-rouge">-S</code>) or via the disassembler (<code class="language-plaintext highlighter-rouge">objdump -d</code>). No need to go online for
the <a href="https://godbolt.org/">Godbolt Compiler Explorer</a>, as slick as it is. If the compiler
output is insufficient, use <a href="/blog/2015/07/10/">SIMD intrinsics</a>. In the worst case
there are two different assemblers available. Real time graphics? Use an
operating system API like OpenGL, DirectX, or Vulkan.</p>

<p>w64devkit <em>really is</em> nearly everything you need in a <a href="https://www.youtube.com/watch?v=W3ml7cO96F0&amp;t=1h25m50s">single, no
nonsense, fully-<em>offline</em> package</a>! It’s difficult to emphasize this
point as much as I’d like. When interacting with the broader software
ecosystem, I often despair that <a href="https://www.youtube.com/watch?v=ZSRHeXYDLko">software development has lost its
way</a>. This distribution is my way of carving out an escape from some
of the insanity. As a C and C++ toolchain, w64devkit by default produces
lean, sane, trivially-distributable, offline-friendly artifacts. All
runtime components in the distribution are <a href="https://drewdevault.com/dynlib">static link only</a>,
so no need to distribute DLLs with your application either.</p>

<h3 id="customize-the-distribution-own-the-toolchain">Customize the distribution, own the toolchain</h3>

<p>While most users would likely stick to my published releases, building
w64devkit is a two-step process with a single build dependency, Docker.
Anyone can easily customize it for their own needs. Don’t care about
C++? Toss it to shave 20% off the distribution. Need to tune the runtime
for a specific microarchitecture? Tweak the compiler flags.</p>

<p>One of the intended strengths of open source is that users can modify
software to suit their needs. With w64devkit, you <em>own the toolchain</em>
itself. It is <a href="https://research.swtch.com/deps">one of your dependencies</a> after all. Unfortunately
the build initially requires an internet connection even when working
from source tarballs, but at least it’s a one-time event.</p>

<p>If you choose to <a href="https://github.com/nothings/stb">take on dependencies</a>, and you build those
dependencies using w64devkit, all the better! You can tweak them to your
needs and choose precisely how they’re built. You won’t be relying on
the goodwill of internet randos nor the generosity of a free package
registry.</p>

<h3 id="customization-examples">Customization examples</h3>

<p>Building existing software using w64devkit is probably easier than
expected, particularly since much of it has already been “ported” to
MinGW and Mingw-w64. Just don’t bother with GNU Autoconf configure
scripts. They never work in w64devkit despite having everything they
technically need. With that caveat, here’s a demonstration of building
some popular software.</p>

<p>One of <a href="/blog/2016/09/02/">my coworkers</a> uses his own version of <a href="https://www.chiark.greenend.org.uk/~sgtatham/putty/">PuTTY</a>
patched to play more nicely with Emacs. If you wanted to do the same,
grab the source tarball, unpack it using the provided tools, then in the
unpacked source:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make -C windows -f Makefile.mgw
</code></pre></div></div>

<p>You’ll have a custom-built putty.exe, as well as the other tools. If you
have any patches, apply those first!</p>

<p>Would you like to embed an extension language in your application? Lua
is a solid choice, in part because it’s such a well-behaved dependency.
After unpacking the source tarball:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make PLAT=mingw
</code></pre></div></div>

<p>This produces a complete Lua compiler, runtime, and library. It’s not
even necessary to use the Makefile, as it’s nearly as simple as “<code class="language-plaintext highlighter-rouge">cc
*.c</code>” — painless to integrate or embed into any project.</p>

<p>Do you enjoy NetHack? Perhaps you’d like to <a href="https://bilious.alt.org/">try a few of the custom
patches</a>. This one is a little more complicated, but I was able to
build NetHack 3.6.6 like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sys/winnt/nhsetup.bat
$ make -C src -f Makefile.gcc cc="cc -fcommon" link="cc"
</code></pre></div></div>

<p>NetHack has <a href="https://wiki.gentoo.org/wiki/Gcc_10_porting_notes/fno_common">a bug necessitating <code class="language-plaintext highlighter-rouge">-fcommon</code></a>. If you have any
patches, apply them with <code class="language-plaintext highlighter-rouge">patch</code> before the last step. I won’t belabor it
here, but with just a little more effort I was also able to produce a
NetHack binary with curses support via <a href="https://pdcurses.org/">PDCurses</a> — statically-linked
of course.</p>

<p>How about my archive encryption tool, <a href="https://github.com/skeeto/enchive">Enchive</a>? The one that
<a href="/blog/2018/04/13/">even works with 16-bit DOS compilers</a>. It requires nothing special
at all!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make
</code></pre></div></div>

<p>w64devkit can also host parts of itself: Universal Ctags, Vim, and NASM.
This means you can modify and recompile these tools without going
through the Docker build. Sadly <a href="https://frippery.org/busybox/">busybox-w32</a> cannot host itself,
though it’s close. I’d <em>love</em> it if w64devkit could fully host itself, and
so Docker — and therefore an internet connection and such — would only
be needed to bootstrap, but unfortunately that’s not realistic given the
state of the GNU components.</p>

<h3 id="offline-and-reliable">Offline and reliable</h3>

<p>Software development has increasingly become <a href="https://deftly.net/posts/2017-06-01-measuring-the-weight-of-an-electron.html">dependent on a constant
internet connection</a>. Robust, offline tooling and development is
undervalued.</p>

<p>Consider: Does your current project depend on an external service? Do
you pay for this service to ensure that it remains up? If you pull your
dependencies from a repository, how much do you trust those who maintain
the packages? <a href="https://drewdevault.com/2020/02/06/Dependencies-and-maintainers.html">Do you even know their names?</a> What would be your
project’s fate if that service went down permanently? It will someday,
though hopefully only after your project is dead and forgotten. If you
have the ability to work permanently offline, then you already have
happy answers to all these questions.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>w64devkit: a Portable C and C++ Development Kit for Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/05/15/"/>
    <id>urn:uuid:d600d846-3692-474f-adbf-45db63079581</id>
    <updated>2020-05-15T03:43:04Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=23292161">on Hacker News</a>.</em></p>

<p>As a computer engineer, my job is to use computers to solve important
problems. Ideally my solutions will be efficient, and typically that
means making the best use of the resources at hand. Quite often these
resources are machines running Windows and, despite my misgivings about
the platform, there is much to be gained by properly and effectively
leveraging it.</p>

<p>Sometimes <a href="/blog/2018/11/15/">targeting Windows while working from another platform</a>
is sufficient, but other times I must work on the platform itself. There
<a href="/blog/2016/06/13/">are various options available</a> for C development, and I’ve
finally formalized my own development kit: <a href="https://github.com/skeeto/w64devkit"><strong>w64devkit</strong></a>.</p>

<!--more-->

<p>For most users, the value is in the <strong>78MiB .zip</strong> available in the
“Releases” on GitHub. This (relatively) small package includes a
state-of-the-art C and C++ compiler (<a href="http://mingw-w64.org/">latest GCC</a>), a <a href="https://www.vim.org/">powerful
text editor</a>, <a href="https://www.gnu.org/software/gdb/">debugger</a>, a <a href="https://www.nasm.us/">complete x86 assembler</a>,
and <a href="https://frippery.org/busybox/">miniature unix environment</a>. It’s “portable” in that there’s no
installation. Just unzip it and start using it in place. With w64devkit,
it literally takes a few seconds on any Windows system to get up and running
with a fully-featured, fully-equipped, first-class <a href="https://sanctum.geek.nz/arabesque/unix-as-ide-introduction/">development
environment</a>.</p>

<p>The development kit is cross-compiled entirely from source using Docker,
though Docker is not needed to actually use it. The repository is just a
Dockerfile and some documentation. The only build dependency is Docker
itself. It’s also easy to customize it for your own personal use, or to
audit and build your own if, for whatever reason, you didn’t trust my
distribution. This is in stark contrast to Windows builds of most open
source software where the build process is typically undocumented,
under-documented, obtuse, or very complicated.</p>

<h3 id="from-script-to-docker">From script to Docker</h3>

<p>Publishing this is not necessarily a commitment to always keep w64devkit
up to date, but this Dockerfile <em>is</em> derived from (and replaces) a shell
script I’ve been using continuously <a href="/blog/2018/04/13/#a-better-alternative">for over two years now</a>. In
this period, every time GCC has made a release, I’ve built myself a new
development kit, so I’m already in the habit.</p>

<p>I’ve been using Docker on and off for about 18 months now. It’s an
oddball in that it’s something I learned on the job rather than on my own
time. I formed an early impression that still basically holds: <strong>The
main purpose of Docker is to contain and isolate misbehaved software to
improve its reliability</strong>. Well-behaved, well-designed software benefits
little from containers.</p>

<p>My unusual application of Docker here is no exception. <a href="/blog/2017/03/30/">Most software
builds are needlessly complicated and fragile</a>, especially
Autoconf-based builds. Ironically, the worst configure scripts I’ve
dealt with come from GNU projects. They waste time on superfluous checks
(“Does your compiler define <code class="language-plaintext highlighter-rouge">size_t</code>?”) then produce a build that
doesn’t work anyway because you’re doing something slightly unusual.
Worst of all, despite my best efforts, the build will be contaminated by
the state of the system doing the build.</p>

<p>My original build script was fragile by extension. It would work on one
system, but not another due to some subtle environment change — a
slightly different system header that reveals a build system bug
(<a href="https://gcc.gnu.org/legacy-ml/gcc/2017-05/msg00219.html">example in GCC</a>), or the system doesn’t have a file at a certain
hard-coded absolute path that shouldn’t be hard-coded. Converting my
script to a Dockerfile locks these problems in place and makes builds
much more reliable and repeatable. The misbehavior is contained and
isolated by Docker.</p>

<p>Unfortunately it’s not <em>completely</em> contained. In each case I use make’s
<code class="language-plaintext highlighter-rouge">-j</code> option to parallelize the build since otherwise it would take
hours. Some of the builds have subtle race conditions, and some bad luck
in timing can cause a build to fail. Docker is good about picking up
where it left off, so it’s just a matter of trying again.</p>

<p>In one case a build failed because Bison and flex were not installed
even though they’re not normally needed. Some dependency isn’t expressed
correctly, and unlucky ordering leads to an unused <code class="language-plaintext highlighter-rouge">.y</code> file having the
wrong timestamp. Ugh. I’ve had this happen a lot more in Docker than
out, probably because file system operations are slow inside Docker and
it creates greater timing variance.</p>

<h3 id="other-tools">Other tools</h3>

<p>The README explains some of my decisions, but I’ll summarize a few here:</p>

<ul>
  <li>
    <p>Git. Important and useful, so I’d love to have it. But it has a weird
installation (many <a href="https://github.com/skeeto/w64devkit/issues/1">.zip-unfriendly symlinks</a>) tightly-coupled
with msys2, and its build system does not support cross-compilation.
I’d love to see a clean, straightforward rewrite of Git in a single,
appropriate implementation language. Imagine installing the latest Git
with <code class="language-plaintext highlighter-rouge">go get git-scm.com/git</code>. (<em>Update</em>: <a href="https://github.com/libgit2/libgit2/pull/5507">libgit2 is working on
it</a>!)</p>
  </li>
  <li>
    <p>Bash. It’s a much nicer interactive shell than BusyBox-w32 <code class="language-plaintext highlighter-rouge">ash</code>. But
the build system doesn’t support cross-compilation, and I’m not sure
it supports Windows without some sort of compatibility layer anyway.</p>
  </li>
  <li>
    <p>Emacs. Another powerful editor. But the build system doesn’t support
cross-compilation. It’s also <em>way</em> too big.</p>
  </li>
  <li>
    <p>Go. Tempting to toss it in, but <a href="/blog/2020/01/21/">Go already does this all correctly
and effectively</a>. It simply doesn’t require a specialized
distribution. It’s trivial to manage a complete Go toolchain with
nothing but Go itself on any system. People may say its language
design comes from the 1970s, but the tooling is decades ahead of
everyone else.</p>
  </li>
</ul>

<h3 id="alternatives">Alternatives</h3>

<p>For a long, long time Cygwin filled this role for me. However, I never
liked its bulky nature, the complete opposite of portable. Cygwin
processes always felt second-class on Windows, particularly in that it
has its own view of the file system compared to other Windows processes.
They could never fully cooperate. I also don’t like that there’s no
toolchain for cross-compiling with Cygwin as a target — e.g. compile
Cygwin binaries from Linux. Finally <a href="/blog/2017/11/30/">it’s been essentially obsoleted by
WSL</a> which matches or surpasses it on every front.</p>

<p>There’s msys and <a href="https://www.msys2.org/">msys2</a>, which are a bit lighter. However, I’m
still in an isolated, second-class environment with weird path
translation issues. These tools <em>do</em> have important uses, and it’s the
only way to compile most open source software natively on Windows. For
those builds that don’t support cross-compilation, it’s <em>the</em> only path
for producing Windows builds. It’s just not what I’m looking for when
developing my own software.</p>

<p><em>Update</em>: <a href="https://github.com/mstorsjo/llvm-mingw">llvm-mingw</a> is an eerily similar project using Docker
the same way, but instead builds LLVM.</p>

<h3 id="using-docker-for-other-builds">Using Docker for other builds</h3>

<p>I also <a href="https://github.com/skeeto/gnupg-windows-build">converted my GnuPG build script</a> to a Dockerfile. Of
course I don’t plan to actually <em>use</em> GnuPG on Windows. I just need it
<a href="/blog/2019/07/10/">for passphrase2pgp</a>, which I test against GnuPG. This tests the
Windows build.</p>

<p>In the future I may extend this idea to a few other tools I don’t intend
to include with w64devkit. If you have something in mind, you could use
my Dockerfiles as a kind of starter template.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to Read UTF-8 Passwords on the Windows Console</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/05/04/"/>
    <id>urn:uuid:338ca754-e19e-4ae0-add8-639d69967c22</id>
    <updated>2020-05-04T02:14:34Z</updated>
    <category term="win32"/><category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=23064864">on Hacker News</a>.</em></p>

<p>Suppose you’re writing a command line program that <a href="/blog/2017/03/12/">prompts the user for
a password or passphrase</a>, and Windows is one of the supported
platforms (<a href="/blog/2018/04/13/">even very old versions</a>). This program uses <a href="/blog/2019/05/29/">UTF-8
for its string representation</a>, <a href="http://utf8everywhere.org/">as it should</a>, and so
ideally it receives the password from the user encoded as UTF-8. On most
platforms this is, for the most part, automatic. However, on Windows
finding the correct answer to this problem is a maze where all the signs
lead towards dead ends. I recently navigated this maze and found the way
out.</p>

<!--more-->

<p>I knew it was possible because <a href="/blog/2019/07/10/">my passphrase2pgp tool</a> has been
using the <a href="https://pkg.go.dev/golang.org/x/crypto/ssh/terminal">golang.org/x/crypto/ssh/terminal</a> package, which gets it
very nearly perfect. Though they were still fixing subtle bugs <a href="https://github.com/golang/crypto/commit/6d4e4cb37c7d6416dfea8472e751c7b6615267a6">as
recently as 6 months ago</a>.</p>

<p>The first step is to ignore just about everything you find online, because
it’s either wrong or it’s solving a slightly different problem. I’ll
discuss the dead ends later and focus on the solution first. Ultimately
I want to implement this on Windows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Display prompt then read zero-terminated, UTF-8 password.</span>
<span class="c1">// Return password length with terminator, or zero on error.</span>
<span class="kt">int</span> <span class="nf">read_password</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prompt</span><span class="p">);</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">int</code> for the length rather than <code class="language-plaintext highlighter-rouge">size_t</code> because it’s a
password and should not even approach <code class="language-plaintext highlighter-rouge">INT_MAX</code>.</p>

<h3 id="the-correct-way">The correct way</h3>

<p>For the impatient:
<a href="https://github.com/skeeto/scratch/blob/master/misc/read-password-w32.c" class="download"><strong>complete, working, ready-to-use example</strong></a></p>

<p>On a unix-like system, the program would:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">open(2)</code> the special <code class="language-plaintext highlighter-rouge">/dev/tty</code> file for reading and writing</li>
  <li><code class="language-plaintext highlighter-rouge">write(2)</code> the prompt</li>
  <li><code class="language-plaintext highlighter-rouge">tcgetattr(3)</code> and <code class="language-plaintext highlighter-rouge">tcsetattr(3)</code> to disable <code class="language-plaintext highlighter-rouge">ECHO</code></li>
  <li><code class="language-plaintext highlighter-rouge">read(2)</code> a line of input</li>
  <li>Restore the old terminal attributes with <code class="language-plaintext highlighter-rouge">tcsetattr(3)</code></li>
  <li><code class="language-plaintext highlighter-rouge">close(2)</code> the file</li>
</ol>

<p>A great advantage of this approach is that it doesn’t depend on standard
input and standard output. Either or both can be redirected elsewhere,
and this function still interacts with the user’s terminal. The Windows
version will have the same advantage.</p>
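
<p>The unix-like steps listed above might be sketched like this. It is a
simplified illustration with minimal error handling: it reads at most one
line with a single <code class="language-plaintext highlighter-rouge">read(2)</code>, so an over-long password is truncated, and the
<code class="language-plaintext highlighter-rouge">finish_line()</code> helper is a hypothetical name introduced for this example.</p>

```c
#include <fcntl.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

// Hypothetical helper: drop the trailing line terminator from the n
// bytes in buf, zero-terminate, and return the length including the
// terminator.
static int finish_line(char *buf, int n)
{
    while (n > 0 && (buf[n-1] == '\n' || buf[n-1] == '\r')) {
        n--;
    }
    buf[n] = 0;
    return n + 1;
}

int read_password(char *buf, int len, const char *prompt)
{
    if (len < 2) {
        return 0;
    }
    int fd = open("/dev/tty", O_RDWR);              // 1. open the terminal
    if (fd == -1) {
        return 0;
    }

    struct termios old;
    ssize_t n = -1;
    if (write(fd, prompt, strlen(prompt)) != -1 &&  // 2. write the prompt
        tcgetattr(fd, &old) == 0) {                 // 3. disable ECHO
        struct termios noecho = old;
        noecho.c_lflag &= ~(tcflag_t)ECHO;
        if (tcsetattr(fd, TCSAFLUSH, &noecho) == 0) {
            n = read(fd, buf, len - 1);             // 4. read one line
            tcsetattr(fd, TCSAFLUSH, &old);         // 5. restore attributes
            if (write(fd, "\n", 1) != 1) {          // newline was not echoed
                n = -1;
            }
        }
    }
    close(fd);                                      // 6. close the file
    return n > 0 ? finish_line(buf, (int)n) : 0;
}
```

<p>Because everything goes through the descriptor for <code class="language-plaintext highlighter-rouge">/dev/tty</code>, the
function never touches standard input or standard output.</p>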

<p>Despite some tempting shortcuts that don’t work, the steps on Windows
are basically the same but with different names. There are a couple
subtleties and extra steps. I’ll be ignoring errors in my code snippets
below, but the complete example has full error handling.</p>

<h4 id="create-console-handles">Create console handles</h4>

<p>Instead of <code class="language-plaintext highlighter-rouge">/dev/tty</code>, the program opens two files: <code class="language-plaintext highlighter-rouge">CONIN$</code> and
<code class="language-plaintext highlighter-rouge">CONOUT$</code> using <a href="https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilea"><code class="language-plaintext highlighter-rouge">CreateFileA()</code></a>. Note: The “A” stands for ANSI,
as opposed to “W” for wide (Unicode). This refers to the encoding of the
file name, not to how the file contents are encoded. <code class="language-plaintext highlighter-rouge">CONIN$</code> is opened
for both reading and writing because write permissions are needed to
change the console’s mode.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span> <span class="n">hi</span> <span class="o">=</span> <span class="n">CreateFileA</span><span class="p">(</span>
    <span class="s">"CONIN$"</span><span class="p">,</span>
    <span class="n">GENERIC_READ</span> <span class="o">|</span> <span class="n">GENERIC_WRITE</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="n">OPEN_EXISTING</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="mi">0</span>
<span class="p">);</span>
<span class="n">HANDLE</span> <span class="n">ho</span> <span class="o">=</span> <span class="n">CreateFileA</span><span class="p">(</span>
    <span class="s">"CONOUT$"</span><span class="p">,</span>
    <span class="n">GENERIC_WRITE</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="n">OPEN_EXISTING</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="mi">0</span>
<span class="p">);</span>
</code></pre></div></div>

<h4 id="print-the-prompt">Print the prompt</h4>

<p>To write the prompt, call <a href="https://docs.microsoft.com/en-us/windows/console/writeconsole"><code class="language-plaintext highlighter-rouge">WriteConsoleA()</code></a> on the output handle.
On its own, this assumes the prompt is plain ASCII (e.g. <code class="language-plaintext highlighter-rouge">"password:
"</code>), not UTF-8 (e.g. <code class="language-plaintext highlighter-rouge">"contraseña: "</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WriteConsoleA</span><span class="p">(</span><span class="n">ho</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">prompt</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>If the prompt may contain UTF-8 data, perhaps because it displays a
username or isn’t in English, you have two options:</p>

<ul>
  <li>Convert the prompt to UTF-16 and call <code class="language-plaintext highlighter-rouge">WriteConsoleW()</code> instead.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">SetConsoleOutputCP()</code> with <code class="language-plaintext highlighter-rouge">CP_UTF8</code> (65001). This is a global
(to the console) setting and should be restored when done.</li>
</ul>

<h4 id="disable-echo">Disable echo</h4>

<p>Next use <a href="https://docs.microsoft.com/en-us/windows/console/getconsolemode"><code class="language-plaintext highlighter-rouge">GetConsoleMode()</code></a> and <a href="https://docs.microsoft.com/en-us/windows/console/setconsolemode"><code class="language-plaintext highlighter-rouge">SetConsoleMode()</code></a> to
disable echo. The console usually has <code class="language-plaintext highlighter-rouge">ENABLE_PROCESSED_INPUT</code> already
set, which tells the console to handle CTRL-C and such, but I set it
explicitly just in case. I also set <code class="language-plaintext highlighter-rouge">ENABLE_LINE_INPUT</code> so that the user
can use backspace and so that the entire line is delivered at once.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DWORD</span> <span class="n">orig</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">GetConsoleMode</span><span class="p">(</span><span class="n">hi</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">orig</span><span class="p">);</span>

<span class="n">DWORD</span> <span class="n">mode</span> <span class="o">=</span> <span class="n">orig</span><span class="p">;</span>
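<span class="n">mode</span> <span class="o">|=</span> <span class="n">ENABLE_PROCESSED_INPUT</span><span class="p">;</span>
<span class="n">mode</span> <span class="o">|=</span> <span class="n">ENABLE_LINE_INPUT</span><span class="p">;</span>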
<span class="n">mode</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="n">ENABLE_ECHO_INPUT</span><span class="p">;</span>
<span class="n">SetConsoleMode</span><span class="p">(</span><span class="n">hi</span><span class="p">,</span> <span class="n">mode</span><span class="p">);</span>
</code></pre></div></div>

<p>There are reports that <code class="language-plaintext highlighter-rouge">ENABLE_LINE_INPUT</code> limits reads to 254 bytes,
but I was unable to reproduce it. My full example can read huge
passwords without trouble.</p>

<p>The old mode is saved in <code class="language-plaintext highlighter-rouge">orig</code> so that it can be restored later.</p>
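
<p>As a rough illustration of the flag arithmetic described above, here’s a
portable sketch. The <code class="language-plaintext highlighter-rouge">DEMO_*</code> names and <code class="language-plaintext highlighter-rouge">password_mode()</code> are
hypothetical. The numeric values mirror the SDK’s <code class="language-plaintext highlighter-rouge">ENABLE_PROCESSED_INPUT</code>,
<code class="language-plaintext highlighter-rouge">ENABLE_LINE_INPUT</code>, and <code class="language-plaintext highlighter-rouge">ENABLE_ECHO_INPUT</code> so that it compiles without
<code class="language-plaintext highlighter-rouge">windows.h</code>:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the console input mode flags. */
#define DEMO_PROCESSED_INPUT 0x0001u  /* ENABLE_PROCESSED_INPUT */
#define DEMO_LINE_INPUT      0x0002u  /* ENABLE_LINE_INPUT */
#define DEMO_ECHO_INPUT      0x0004u  /* ENABLE_ECHO_INPUT */

/* Compute the input mode to use while reading a password. All other
 * bits of the original mode are left untouched. */
uint32_t
password_mode(uint32_t orig)
{
    uint32_t mode = orig;
    mode |= DEMO_PROCESSED_INPUT;  /* console handles CTRL-C and such */
    mode |= DEMO_LINE_INPUT;       /* backspace works, whole line at once */
    mode &= ~DEMO_ECHO_INPUT;      /* typed characters are not displayed */
    assert(!(mode & DEMO_ECHO_INPUT));
    return mode;
}
```

<p>Given an original mode of <code class="language-plaintext highlighter-rouge">0x0007</code> (all three flags set), this yields
<code class="language-plaintext highlighter-rouge">0x0003</code>: only the echo bit changes.</p>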

<h4 id="read-the-password">Read the password</h4>

<p>Here’s where you have to pay the piper. As of the date of this article,
<strong>the Windows API offers no method for reading UTF-8 input from the
console</strong>. Give up on that hope now. If you use the “ANSI” functions to
read input under any configuration, they will do the usual Windows thing
of <em>silently mangling your input</em>.</p>

<p>So you <em>must</em> use the UTF-16 API, <a href="https://docs.microsoft.com/en-us/windows/console/readconsole"><code class="language-plaintext highlighter-rouge">ReadConsoleW()</code></a>, and then
<a href="/blog/2017/10/06/">encode it</a> yourself. Fortunately Win32 provides a UTF-8 encoder,
<a href="https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte"><code class="language-plaintext highlighter-rouge">WideCharToMultiByte()</code></a>, which will even handle surrogate pairs
for all those people who like putting <code class="language-plaintext highlighter-rouge">PILE OF POO</code> (<code class="language-plaintext highlighter-rouge">U+1F4A9</code>) in their
passwords:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SIZE_T</span> <span class="n">wbuf_len</span> <span class="o">=</span> <span class="p">(</span><span class="n">len</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="n">WCHAR</span><span class="p">);</span>
<span class="n">WCHAR</span> <span class="o">*</span><span class="n">wbuf</span> <span class="o">=</span> <span class="n">HeapAlloc</span><span class="p">(</span><span class="n">GetProcessHeap</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">wbuf_len</span><span class="p">);</span>
<span class="n">DWORD</span> <span class="n">nread</span><span class="p">;</span>
<span class="n">ReadConsoleW</span><span class="p">(</span><span class="n">hi</span><span class="p">,</span> <span class="n">wbuf</span><span class="p">,</span> <span class="n">len</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">nread</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">wbuf</span><span class="p">[</span><span class="n">nread</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// truncate "\r\n"</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">WideCharToMultiByte</span><span class="p">(</span><span class="n">CP_UTF8</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">wbuf</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">SecureZeroMemory</span><span class="p">(</span><span class="n">wbuf</span><span class="p">,</span> <span class="n">wbuf_len</span><span class="p">);</span>
<span class="n">HeapFree</span><span class="p">(</span><span class="n">GetProcessHeap</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">wbuf</span><span class="p">);</span>
</code></pre></div></div>
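
<p>For intuition about what the conversion step involves, here’s a minimal,
portable sketch of a UTF-16 to UTF-8 encoder. It is not how
<code class="language-plaintext highlighter-rouge">WideCharToMultiByte()</code> is implemented. The function name
<code class="language-plaintext highlighter-rouge">utf16_to_utf8()</code>, its signature, and the choice to replace unpaired
surrogates with <code class="language-plaintext highlighter-rouge">U+FFFD</code> are all assumptions made for illustration:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode n UTF-16 code units from src as UTF-8 into dst (capacity cap).
 * Returns the number of bytes written, or -1 if dst is too small.
 * Unpaired surrogates become U+FFFD. No null terminator is written. */
long
utf16_to_utf8(unsigned char *dst, size_t cap, const uint16_t *src, size_t n)
{
    assert(dst || !cap);
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        if (c >= 0xd800 && c <= 0xdbff && i + 1 < n &&
            src[i+1] >= 0xdc00 && src[i+1] <= 0xdfff) {
            /* Combine a high and low surrogate into one code point. */
            c = 0x10000 + ((c - 0xd800) << 10) + (src[++i] - 0xdc00u);
        } else if (c >= 0xd800 && c <= 0xdfff) {
            c = 0xfffd;  /* unpaired surrogate */
        }

        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (cap - len < need) {
            return -1;
        }
        switch (need) {
        case 1:
            dst[len++] = (unsigned char)c;
            break;
        case 2:
            dst[len++] = (unsigned char)(0xc0 | (c >> 6));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
            break;
        case 3:
            dst[len++] = (unsigned char)(0xe0 | (c >> 12));
            dst[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
            break;
        default:
            dst[len++] = (unsigned char)(0xf0 | (c >> 18));
            dst[len++] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
            break;
        }
    }
    return (long)len;
}
```

<p>The surrogate pair <code class="language-plaintext highlighter-rouge">D83D DCA9</code> (that is, <code class="language-plaintext highlighter-rouge">U+1F4A9</code>) comes out as the
four bytes <code class="language-plaintext highlighter-rouge">F0 9F 92 A9</code>, and <code class="language-plaintext highlighter-rouge">ñ</code> (<code class="language-plaintext highlighter-rouge">U+00F1</code>) as <code class="language-plaintext highlighter-rouge">C3 B1</code>.</p>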

<p>I use <a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-rtlsecurezeromemory"><code class="language-plaintext highlighter-rouge">SecureZeroMemory()</code></a> to erase the UTF-16 version of the
password before freeing the buffer. The <code class="language-plaintext highlighter-rouge">+ 2</code> in the allocation is for
the CRLF line ending that will later be chopped off. The error handling
version checks that the input did indeed end with CRLF. Otherwise it was
truncated (too long).</p>
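
<p>The CRLF check might look something like the following sketch. It’s
written against a plain <code class="language-plaintext highlighter-rouge">uint16_t</code> buffer so that it’s portable, and the
name <code class="language-plaintext highlighter-rouge">chop_crlf()</code> is hypothetical, not taken from the full example:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given nread UTF-16 code units from a line-mode console read, replace
 * the trailing "\r\n" with a null terminator. Returns the password
 * length in code units, or -1 if the line ending is missing, which
 * means the input was truncated. */
long
chop_crlf(uint16_t *wbuf, size_t nread)
{
    assert(wbuf);
    if (nread < 2 || wbuf[nread-2] != '\r' || wbuf[nread-1] != '\n') {
        return -1;
    }
    wbuf[nread-2] = 0;
    return (long)(nread - 2);
}
```

<p>Checking <code class="language-plaintext highlighter-rouge">nread &lt; 2</code> first also guards the <code class="language-plaintext highlighter-rouge">nread-2</code> index against
underflow on a short read.</p>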

<h4 id="clean-up">Clean up</h4>

<p>Finally print a newline since the user-typed one wasn’t echoed, restore
the old console mode, close the console handles, and return the final
encoded length:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WriteConsoleA(ho, "\n", 1, 0, 0);
SetConsoleMode(hi, orig);
CloseHandle(ho);
CloseHandle(hi);
return r;
</code></pre></div></div>

<p>The error checking version doesn’t check for errors from any of these
functions since either they cannot fail, or there’s nothing reasonable
to do in the event of an error.</p>

<h3 id="dead-ends">Dead ends</h3>

<p>If you look around the Win32 API you might notice <code class="language-plaintext highlighter-rouge">SetConsoleCP()</code>. A
reasonable person might think that setting the “code page” to UTF-8
(<code class="language-plaintext highlighter-rouge">CP_UTF8</code>) might configure the console to encode input in UTF-8. The
good news is Windows will no longer mangle your input as before. The bad
news is that it will be mangled differently.</p>

<p>You might think you can use the CRT function <code class="language-plaintext highlighter-rouge">_setmode()</code> with
<code class="language-plaintext highlighter-rouge">_O_U8TEXT</code> on the <code class="language-plaintext highlighter-rouge">FILE *</code> connected to the console. This does nothing
useful. (The only use for <code class="language-plaintext highlighter-rouge">_setmode()</code> is with <code class="language-plaintext highlighter-rouge">_O_BINARY</code>, to disable
braindead character translation on standard input and output.) The best
you’ll be able to do with the CRT is the same sort of wide character
read using non-standard functions, followed by conversion to UTF-8.</p>

<p><a href="https://docs.microsoft.com/en-us/windows/win32/api/wincred/nf-wincred-creduicmdlinepromptforcredentialsa"><code class="language-plaintext highlighter-rouge">CredUICmdLinePromptForCredentials()</code></a> promises to be both a
mouthful of a function name, and a prepacked solution to this problem.
It only delivers on the first. This function seems to have broken some
time ago and nobody at Microsoft noticed — probably because <em>nobody has
ever used this function</em>. I couldn’t find a working example, nor a use
in any real application. When I tried to use it, I got a nonsense error
code, and it never worked. There’s a GUI version of this function that <em>does</em>
work, and it’s a viable alternative for certain situations, though not
mine.</p>

<p>At my most desperate, I hoped <code class="language-plaintext highlighter-rouge">ENABLE_VIRTUAL_TERMINAL_PROCESSING</code> would
be a magical switch. On Windows 10 it magically enables some ANSI escape
sequences. The documentation in no way suggests it <em>would</em> work, and I
confirmed by experimentation that it does not. Pity.</p>

<p>I spent a lot of time searching down these dead ends until finally
settling with <code class="language-plaintext highlighter-rouge">ReadConsoleW()</code> above. I hoped it would be more
automatic, but I’m glad I have at least <em>some</em> solution figured out.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Fibers: the Most Elegant Windows API</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/03/28/"/>
    <id>urn:uuid:abad2340-99e5-4d72-857c-848e37b4af73</id>
    <updated>2019-03-28T22:26:05Z</updated>
    <category term="win32"/><category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=19520078">on Hacker News</a>.</em></p>

<p>The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly,
and lacking good taste. Microsoft has done a pretty commendable job with
backwards compatibility, but the trade-off is that the API is filled to
the brim with historical cruft. Every hasty, poor design over the
decades is carried forward forever, and, in many cases, even built upon,
which essentially doubles down on past mistakes. POSIX certainly has its
own ugly corners, but those are the exceptions. In the Windows API,
elegance is the exception.</p>

<!--more-->

<p>That’s why, when I recently revisited the <a href="https://docs.microsoft.com/en-us/windows/desktop/procthread/fibers">Fibers API</a>, I was
pleasantly surprised. It’s one of the exceptions — much cleaner than the
optional, deprecated, and now obsolete <a href="/blog/2017/06/21/#coroutines">POSIX equivalent</a>. It’s
not quite an apples-to-apples comparison since the POSIX version is
slightly more powerful, and more complicated as a result. I’ll cover the
difference in this article.</p>

<p>For the last part of this article, I’ll walk through an async/await
framework built on top of fibers. The framework allows coroutines in C
programs to await on arbitrary kernel objects.</p>

<p><a href="https://github.com/skeeto/fiber-await"><strong>Fiber Async/await Demo</strong></a></p>

<h3 id="fibers">Fibers</h3>

<p>Windows fibers are really just <a href="https://blog.varunramesh.net/posts/stackless-vs-stackful-coroutines/">stackful</a>, symmetric coroutines.
From a different point of view, they’re cooperatively scheduled threads,
which is the source of the analogous name, <em>fibers</em>. They’re symmetric
because all fibers are equal, and no fiber is the “main” fiber. If <em>any</em>
fiber returns from its start routine, the program exits. (Older versions
of Wine will crash when this happens, but it was recently fixed.) It’s
equivalent to the process’ main thread returning from <code class="language-plaintext highlighter-rouge">main()</code>. The
initial fiber is free to create a second fiber, yield to it, then the
second fiber destroys the first.</p>

<p>For now I’m going to focus on the core set of fiber functions. There are
some additional capabilities I’m going to ignore, including support for
<em>fiber local storage</em>. The important functions are just these five:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">CreateFiber</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">stack_size</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">proc</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">SwitchToFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span><span class="p">);</span>
<span class="n">bool</span>  <span class="nf">ConvertFiberToThread</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">ConvertThreadToFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">DeleteFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span><span class="p">);</span>
</code></pre></div></div>

<p>To emphasize its simplicity, I’ve shown them here with more standard
prototypes than seen in their formal documentation. That documentation
uses the clunky Windows API typedefs still burdened with its 16-bit
heritage — e.g. <code class="language-plaintext highlighter-rouge">LPVOID</code> being a “long pointer” from the segmented memory
of the 8086:</p>

<ul>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-createfiber">CreateFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-switchtofiber">SwitchToFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-convertfibertothread">ConvertFiberToThread</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-convertthreadtofiber">ConvertThreadToFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-deletefiber">DeleteFiber</a></li>
</ul>

<p>Fibers are represented using opaque, void pointers. Maybe that’s a little
<em>too</em> simple since it’s easy to misuse in C, but I like it. The return
values for <code class="language-plaintext highlighter-rouge">CreateFiber()</code> and <code class="language-plaintext highlighter-rouge">ConvertThreadToFiber()</code> are void pointers
since these both create fibers.</p>

<p>The fiber start routine returns nothing and takes a void “user pointer”.
That’s nearly what I’d expect, except that it would probably make more
sense for a fiber to return <code class="language-plaintext highlighter-rouge">int</code>, which is <a href="/blog/2016/01/31/">more in line with</a>
<code class="language-plaintext highlighter-rouge">main</code> / <code class="language-plaintext highlighter-rouge">WinMain</code> / <code class="language-plaintext highlighter-rouge">mainCRTStartup</code> / <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code>. As I said,
when any fiber returns from its start routine, it’s like returning from
the main function, so it should probably have returned an integer.</p>

<p>A fiber may delete itself, which is the same as exiting the thread.
However, a fiber cannot yield (e.g. <code class="language-plaintext highlighter-rouge">SwitchToFiber()</code>) to itself. That’s
undefined behavior.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">coup</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">king</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"Long live the king!"</span><span class="p">);</span>
    <span class="n">DeleteFiber</span><span class="p">(</span><span class="n">king</span><span class="p">);</span>
    <span class="n">ConvertFiberToThread</span><span class="p">();</span> <span class="cm">/* seize the main thread */</span>
    <span class="cm">/* ... */</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">king</span> <span class="o">=</span> <span class="n">ConvertThreadToFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">pretender</span> <span class="o">=</span> <span class="n">CreateFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">coup</span><span class="p">,</span> <span class="n">king</span><span class="p">);</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">pretender</span><span class="p">);</span>
    <span class="n">abort</span><span class="p">();</span> <span class="cm">/* unreachable */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Only fibers can yield to fibers, but when the program starts up, there
are no fibers. At least one thread must first convert itself into a
fiber using <code class="language-plaintext highlighter-rouge">ConvertThreadToFiber()</code>, which returns the fiber object
that represents itself. It takes one argument analogous to the last
argument of <code class="language-plaintext highlighter-rouge">CreateFiber()</code>, except that there’s no start routine to
accept it. The process is reversed with <code class="language-plaintext highlighter-rouge">ConvertFiberToThread()</code>.</p>

<p>Fibers don’t belong to any particular thread and can be scheduled on any
thread <em>if</em> properly synchronized. Obviously one should never yield to the
same fiber in two different threads at the same time.</p>

<h3 id="contrast-with-posix">Contrast with POSIX</h3>

<p>The equivalent POSIX system was context switching. It’s also stackful
and symmetric, but it has just three important functions:
<a href="http://man7.org/linux/man-pages/man3/setcontext.3.html"><code class="language-plaintext highlighter-rouge">getcontext(3)</code></a>, <a href="http://man7.org/linux/man-pages/man3/makecontext.3.html"><code class="language-plaintext highlighter-rouge">makecontext(3)</code></a>, and
<a href="http://man7.org/linux/man-pages/man3/makecontext.3.html"><code class="language-plaintext highlighter-rouge">swapcontext(3)</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">getcontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">makecontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(),</span> <span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="p">...);</span>
<span class="kt">int</span>  <span class="nf">swapcontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">oucp</span><span class="p">,</span> <span class="k">const</span> <span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">);</span>
</code></pre></div></div>

<p>These are roughly equivalent to <a href="https://docs.microsoft.com/en-us/windows/desktop/api/winnt/nf-winnt-getcurrentfiber"><code class="language-plaintext highlighter-rouge">GetCurrentFiber()</code></a>,
<code class="language-plaintext highlighter-rouge">CreateFiber()</code>, and <code class="language-plaintext highlighter-rouge">SwitchToFiber()</code>. There is no need for
<code class="language-plaintext highlighter-rouge">ConvertFiberToThread()</code> since threads can context switch without
preparation. There’s also no <code class="language-plaintext highlighter-rouge">DeleteFiber()</code> because the resources are
managed by the program itself. That’s where POSIX contexts are a little
bit more powerful.</p>

<p>The first argument to <code class="language-plaintext highlighter-rouge">CreateFiber()</code> is the desired stack size, with
zero indicating the default stack size. The stack is allocated and freed
by the operating system. The downside is that the caller doesn’t have a
choice in managing the lifetime of this stack and how it’s allocated. If
you’re frequently creating and destroying coroutines, those stacks are
constantly being allocated and freed.</p>

<p>In <code class="language-plaintext highlighter-rouge">makecontext(3)</code>, the caller allocates and supplies the stack. Freeing
that stack is equivalent to destroying the context. A program that
frequently creates and destroys contexts can maintain a stack pool or
otherwise more efficiently manage their allocation. This makes it more
powerful, but it also makes it a little more complicated. It would be hard
to remember how to do all this without a careful reading of the
documentation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Create a context */</span>
<span class="n">ucontext_t</span> <span class="n">ctx</span><span class="p">;</span>
<span class="n">getcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">);</span>  <span class="cm">/* initialize first, before filling in the fields */</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">SIGSTKSZ</span><span class="p">);</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_size</span> <span class="o">=</span> <span class="n">SIGSTKSZ</span><span class="p">;</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_link</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">makecontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">proc</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="cm">/* Destroy a context */</span>
<span class="n">free</span><span class="p">(</span><span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span><span class="p">);</span>
</code></pre></div></div>
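
<p>To make the stack pool idea concrete, here’s one possible sketch: a
free list that threads its links through the unused stacks themselves.
The names <code class="language-plaintext highlighter-rouge">stack_pool</code>, <code class="language-plaintext highlighter-rouge">pool_acquire()</code>, and <code class="language-plaintext highlighter-rouge">pool_release()</code>, as well as
the fixed 64KiB stack size, are assumptions for illustration:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_STACK_SIZE (64 * 1024)  /* assumed size for this sketch */

/* A free stack stores the link to the next free stack in its own
 * first bytes, so the pool needs no extra bookkeeping memory. */
struct stack_pool {
    void *free_list;
};

/* Take a stack from the pool, or allocate a new one if it is empty.
 * Returns a null pointer only if malloc() fails. */
void *
pool_acquire(struct stack_pool *pool)
{
    void *stack = pool->free_list;
    if (stack) {
        pool->free_list = *(void **)stack;
        return stack;
    }
    return malloc(POOL_STACK_SIZE);
}

/* Return a stack to the pool once its context will never run again. */
void
pool_release(struct stack_pool *pool, void *stack)
{
    assert(stack);
    *(void **)stack = pool->free_list;
    pool->free_list = stack;
}
```

<p>The pointer from <code class="language-plaintext highlighter-rouge">pool_acquire()</code> would go in <code class="language-plaintext highlighter-rouge">uc_stack.ss_sp</code> with
<code class="language-plaintext highlighter-rouge">POOL_STACK_SIZE</code> as <code class="language-plaintext highlighter-rouge">uc_stack.ss_size</code>, and “destroying” a context becomes
a call to <code class="language-plaintext highlighter-rouge">pool_release()</code> rather than <code class="language-plaintext highlighter-rouge">free()</code>.</p>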

<p>Note how <code class="language-plaintext highlighter-rouge">makecontext(3)</code> is variadic (<code class="language-plaintext highlighter-rouge">...</code>), passing its arguments on
to the start routine of the context. This seems like it might be better
than a user pointer. Unfortunately it’s not, since those arguments are
strictly limited to <em>integers</em>.</p>
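
<p>Here’s a small sketch of passing integer arguments through
<code class="language-plaintext highlighter-rouge">makecontext(3)</code>. The names <code class="language-plaintext highlighter-rouge">adder()</code> and <code class="language-plaintext highlighter-rouge">run_adder()</code> and the 64KiB stack
size are hypothetical, and I’ve only considered glibc on Linux. It sets
<code class="language-plaintext highlighter-rouge">uc_link</code> so that returning from the start routine resumes the caller:</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t caller_ctx;
static int result;

/* Start routine: receives its arguments as plain ints. */
static void
adder(int a, int b)
{
    result = a + b;
    /* Returning resumes the context in uc_link, i.e. caller_ctx. */
}

/* Run adder(a, b) on its own stack and return what it computed. */
int
run_adder(int a, int b)
{
    size_t size = 64 * 1024;  /* assumed stack size */
    void *stack = malloc(size);
    assert(stack);

    ucontext_t ctx;
    getcontext(&ctx);
    ctx.uc_stack.ss_sp = stack;
    ctx.uc_stack.ss_size = size;
    ctx.uc_link = &caller_ctx;
    makecontext(&ctx, (void (*)(void))adder, 2, a, b);

    /* Save the current context and switch. Control comes back here
     * once adder() returns. */
    swapcontext(&caller_ctx, &ctx);
    free(stack);
    return result;
}
```

<p>Anything that isn’t an <code class="language-plaintext highlighter-rouge">int</code>, such as a pointer on a 64-bit host, can’t be
passed portably this way, which is why a user pointer would have been
the more practical interface.</p>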

<p>Ultimately I like the fiber API better. The first time I tried it out, I
could guess my way through it without looking closely at the
documentation.</p>

<h3 id="async--await-with-fibers">Async / await with fibers</h3>

<p>Why was I looking at the Fiber API? I’ve known about coroutines for
years but I didn’t understand how they could be useful. Sure, the
function can yield, but what other coroutine should it yield to? It
wasn’t until I was <a href="/blog/2019/03/10/">recently bit by the async/await bug</a> that I
finally saw a “killer feature” that justified their use. Generators come
pretty close, though.</p>

<p>Windows fibers are a coroutine primitive suitable for async/await in C
programs, where <a href="/blog/2019/03/22/">it can also be useful</a>. To prove that it’s
possible, I built async/await on top of fibers in <a href="https://github.com/skeeto/fiber-await/blob/master/async.c">95 lines of code</a>.</p>

<p>The alternatives are to use a <a href="https://www.gnu.org/software/pth/">third-party coroutine library</a> or to
do it myself <a href="/blog/2015/05/15/">with some assembly programming</a>. However, having it
built into the operating system is quite convenient! It’s unfortunate
that it’s limited to Windows. Ironically, though, everything I wrote for
this article, including the async/await demonstration, was originally
written on Linux using Mingw-w64 and tested using <a href="https://www.winehq.org/">Wine</a>. Only
after I was done did I even try it on Windows.</p>

<p>Before diving into how it works, there’s a general concept about the
Windows API that must be understood: <strong>All kernel objects can be in
either a signaled or unsignaled state.</strong> The API provides functions that
block on a kernel object until it is signaled. The two important ones
are <a href="https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitforsingleobject"><code class="language-plaintext highlighter-rouge">WaitForSingleObject()</code></a> and <a href="https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitformultipleobjects"><code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code></a>.
The latter behaves very much like <code class="language-plaintext highlighter-rouge">poll(2)</code> in POSIX.</p>

<p>Usually the signal is tied to some useful event, like a process or
thread exiting, the completion of an I/O operation (i.e. asynchronous
overlapped I/O), a semaphore being incremented, etc. It’s a generic way
to wait for some event. <strong>However, instead of blocking the thread,
wouldn’t it be nice to <em>await</em> on the kernel object?</strong> In my <code class="language-plaintext highlighter-rouge">aio</code>
library for Emacs, the fundamental “wait” object was a promise. For this
API it’s a kernel object handle.</p>

<p>So, the await function will take a kernel object, register it with the
scheduler, then yield to the scheduler. The scheduler — which is a
global variable, so there’s only one scheduler per process — looks like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">main_fiber</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">handles</span><span class="p">[</span><span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">];</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">fibers</span><span class="p">[</span><span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">];</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">dead_fiber</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span> <span class="n">async_loop</span><span class="p">;</span>
</code></pre></div></div>

<p>While fibers are symmetric, coroutines in my async/await implementation
are not. One fiber is the scheduler, <code class="language-plaintext highlighter-rouge">main_fiber</code>, and the other fibers
always yield to it.</p>

<p>There is an array of kernel object handles, <code class="language-plaintext highlighter-rouge">handles</code>, and an array of
<code class="language-plaintext highlighter-rouge">fibers</code>. The elements in these arrays are paired with each other, but
it’s convenient to store them separately, as I’ll show soon. <code class="language-plaintext highlighter-rouge">fibers[0]</code>
is waiting on <code class="language-plaintext highlighter-rouge">handles[0]</code>, and so on.</p>

<p>The array is a fixed size, <code class="language-plaintext highlighter-rouge">MAXIMUM_WAIT_OBJECTS</code> (64), because there’s
a hard limit on the number of fibers that can wait at once. This
pathetically small limitation is an unfortunate, hard-coded restriction
of the Windows API. It kills most practical uses of my little library.
Fortunately there’s no limit on the number of handles we might want to
wait on, just the number of co-existing fibers.</p>

<p>When a fiber is about to return from its start routine, it yields one
last time and registers itself on the <code class="language-plaintext highlighter-rouge">dead_fiber</code> member. The scheduler
will delete this fiber as soon as it’s given control. Fibers never
<em>truly</em> return since that would terminate the program.</p>

<p>With this, the await function, <code class="language-plaintext highlighter-rouge">async_await()</code>, is pretty simple. It
registers the handle with the scheduler, then yields to the scheduler
fiber.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_await</span><span class="p">(</span><span class="n">HANDLE</span> <span class="n">h</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">h</span><span class="p">;</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">GetCurrentFiber</span><span class="p">();</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="o">++</span><span class="p">;</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">main_fiber</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Caveat: The scheduler destroys this handle with <code class="language-plaintext highlighter-rouge">CloseHandle()</code> after it
signals, so don’t try to reuse it. This made my demonstration simpler,
but it might be better to not do this.</p>

<p>A fiber can exit at any time. Such an exit is inserted implicitly before
a fiber actually returns:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span> <span class="o">=</span> <span class="n">GetCurrentFiber</span><span class="p">();</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">main_fiber</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The start routine given to <code class="language-plaintext highlighter-rouge">async_start()</code> is actually wrapped in the
real start routine. This is how <code class="language-plaintext highlighter-rouge">async_exit()</code> is injected:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">fiber_wrapper</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="o">*</span><span class="n">fw</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="n">fw</span><span class="o">-&gt;</span><span class="n">func</span><span class="p">(</span><span class="n">fw</span><span class="o">-&gt;</span><span class="n">arg</span><span class="p">);</span>
    <span class="n">async_exit</span><span class="p">();</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">async_start</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span> <span class="o">==</span> <span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="n">fw</span> <span class="o">=</span> <span class="p">{</span><span class="n">func</span><span class="p">,</span> <span class="n">arg</span><span class="p">};</span>
        <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">CreateFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">fiber_wrapper</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fw</span><span class="p">));</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The library provides a single awaitable function, <code class="language-plaintext highlighter-rouge">async_sleep()</code>. It
creates a “waitable timer” object, starts the countdown, and returns it.
(Notice how <code class="language-plaintext highlighter-rouge">SetWaitableTimer()</code> is a typically-ugly Win32 function with
excessive parameters.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span>
<span class="nf">async_sleep</span><span class="p">(</span><span class="kt">double</span> <span class="n">seconds</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">promise</span> <span class="o">=</span> <span class="n">CreateWaitableTimer</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">LARGE_INTEGER</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">t</span><span class="p">.</span><span class="n">QuadPart</span> <span class="o">=</span> <span class="p">(</span><span class="kt">long</span> <span class="kt">long</span><span class="p">)(</span><span class="n">seconds</span> <span class="o">*</span> <span class="o">-</span><span class="mi">10000000</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>
    <span class="n">SetWaitableTimer</span><span class="p">(</span><span class="n">promise</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">t</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">promise</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
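
<p>The due time passed to <code class="language-plaintext highlighter-rouge">SetWaitableTimer()</code> counts 100-nanosecond
intervals, and a negative value means “relative to now,” which is why
the code above multiplies by a negative constant. Here’s a minimal,
portable sketch of just that conversion. The helper name is hypothetical
and not part of the library:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical helper, not part of the library: convert a delay in
 * seconds into a relative due time for SetWaitableTimer(). Due times
 * count 100-nanosecond intervals, and negative values are relative. */
static long long
relative_due_time(double seconds)
{
    return (long long)(seconds * -10000000.0);
}
</code></pre></div></div>

<p>So a one second sleep becomes a due time of negative ten million
intervals.</p>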

<p>A more realistic example would be overlapped I/O. For example, you’d
open a file (<code class="language-plaintext highlighter-rouge">CreateFile()</code>) in overlapped mode, then when you, say,
read from that file (<code class="language-plaintext highlighter-rouge">ReadFile()</code>) you create an event object
(<code class="language-plaintext highlighter-rouge">CreateEvent()</code>), populate an overlapped I/O structure with the event,
offset, and length, then finally await on the event object. The fiber
will be resumed when the operation is complete.</p>

<p>Side note: Unfortunately <a href="https://blog.libtorrent.org/2012/10/asynchronous-disk-io/">overlapped I/O doesn’t work correctly for
files</a>, and many operations can’t be done asynchronously, like
opening files. When it comes to files, you’re <a href="https://blog.libtorrent.org/2012/10/asynchronous-disk-io/">better off using
dedicated threads</a> as <a href="http://docs.libuv.org/en/v1.x/design.html#file-i-o">libuv does</a> instead of overlapped I/O.
You can still await on these operations. You’d just await on the signal
from the thread doing synchronous I/O, not from overlapped I/O.</p>

<p>The most complex part is the scheduler, and it’s really not complex at
all:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_run</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Wait for next event */</span>
        <span class="n">DWORD</span> <span class="n">nhandles</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">;</span>
        <span class="n">HANDLE</span> <span class="o">*</span><span class="n">handles</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">;</span>
        <span class="n">DWORD</span> <span class="n">r</span> <span class="o">=</span> <span class="n">WaitForMultipleObjects</span><span class="p">(</span><span class="n">nhandles</span><span class="p">,</span> <span class="n">handles</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">INFINITE</span><span class="p">);</span>

        <span class="cm">/* Remove event and fiber from waiting array */</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">r</span><span class="p">];</span>
        <span class="n">CloseHandle</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">r</span><span class="p">]);</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">nhandles</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">nhandles</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="o">--</span><span class="p">;</span>

        <span class="cm">/* Run the fiber */</span>
        <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">fiber</span><span class="p">);</span>

        <span class="cm">/* Destroy the fiber if it exited */</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">DeleteFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span><span class="p">);</span>
            <span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is why the handles are in their own array. The array can be passed
directly to <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code>. The return value indicates which
handle was signaled. The handle is closed, the entry removed from the
scheduler, and then the fiber is resumed.</p>
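
<p>The removal step is a swap-with-last on two parallel arrays. As a
portable sketch of only that step, here it is with plain integers
standing in for handles and fibers. The struct and function names are
hypothetical, chosen for illustration:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical stand-in for the scheduler's parallel arrays, using
 * ints in place of handles and fibers so it compiles anywhere. */
struct waiting {
    int handles[4];
    int fibers[4];
    int count;
};

/* Remove entry i by moving the last entry into its slot. Order is
 * not preserved, but nothing is shifted and the arrays stay packed. */
static void
remove_entry(struct waiting *w, int i)
{
    w-&gt;handles[i] = w-&gt;handles[w-&gt;count - 1];
    w-&gt;fibers[i] = w-&gt;fibers[w-&gt;count - 1];
    w-&gt;count--;
}
</code></pre></div></div>

<p>Keeping the arrays packed matters because the handle array is handed
to the wait function as-is on the next iteration.</p>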

<p>That <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> is what limits the number of fibers.
It’s not possible to wait on more than 64 handles at once! This is
hard-coded into the API. How? The return value doubles as the index of
the signaled handle, and the values beyond that range are already
assigned other meanings (<code class="language-plaintext highlighter-rouge">WAIT_ABANDONED_0</code> is 128, for instance), so
changing this would break the API. Remember what I said about being
locked into bad design decisions of the past?</p>
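
<p>The scheduler above uses the return value directly as an array index.
A slightly more defensive sketch would first confirm the result is in
the “signaled” range. The constants below are local stand-ins that
mirror the values in the Windows headers so that this compiles anywhere,
and <code class="language-plaintext highlighter-rouge">signaled_index()</code> is a hypothetical helper:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Local stand-ins mirroring values from the Windows headers. */
#define SKETCH_WAIT_OBJECT_0     0x00000000UL
#define SKETCH_WAIT_ABANDONED_0  0x00000080UL
#define SKETCH_WAIT_TIMEOUT      0x00000102UL
#define SKETCH_WAIT_FAILED       0xFFFFFFFFUL

/* Hypothetical helper: return the index of the signaled handle, or
 * -1 if the result is anything other than a plain "signaled" result
 * for one of the nhandles entries. */
static int
signaled_index(unsigned long r, unsigned long nhandles)
{
    if (r - SKETCH_WAIT_OBJECT_0 &lt; nhandles) {
        return (int)(r - SKETCH_WAIT_OBJECT_0);
    }
    return -1;
}
</code></pre></div></div>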

<p>To be fair, <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> was a doomed API anyway, just
like <code class="language-plaintext highlighter-rouge">select(2)</code> and <code class="language-plaintext highlighter-rouge">poll(2)</code> in POSIX. It scales very poorly since the
entire array of objects being waited on must be traversed on each call.
That’s terribly inefficient when waiting on large numbers of objects.
This sort of problem is solved by interfaces like kqueue (BSD), epoll
(Linux), and IOCP (Windows). Unfortunately <a href="https://news.ycombinator.com/item?id=11866562">IOCP doesn’t really fit this
particular problem well</a> — awaiting on kernel objects — so I
couldn’t use it.</p>

<p>When the awaiting fiber count is zero and the scheduler has control, all
fibers must have completed and there’s nothing left to do. However, the
caller can schedule more fibers and then restart the scheduler if
desired.</p>

<p>That’s all there is to it. Have a look at <a href="https://github.com/skeeto/fiber-await/blob/master/demo.c"><code class="language-plaintext highlighter-rouge">demo.c</code></a> to see how
the API looks in some trivial examples. On Linux you can see it in
action with <code class="language-plaintext highlighter-rouge">make check</code>. On Windows, you just <a href="/blog/2016/06/13/">need to compile
it</a>, then run it like a normal program. If there was a better
function than <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> in the Windows API, I would
have considered turning this demonstration into a real library.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Blast from the Past: Borland C++ on Windows 98</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/04/13/"/>
    <id>urn:uuid:298d2dbe-31eb-30c4-21c2-e019dc5449f6</id>
    <updated>2018-04-13T20:01:31Z</updated>
    <category term="vim"/><category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>My first exposure to C and C++ was a little over 20 years ago. I
remember it being some version of <a href="https://en.wikipedia.org/wiki/Borland_C%2B%2B">Borland C++</a>, either 4.x
or 5.x, running on Windows 95. I didn’t have <a href="/blog/2016/09/02/">a mentor</a>, so I
did the best I could slowly working through what was probably a poorly
written beginner C++ book, typing out the examples and exercises with
little understanding. Since I didn’t learn much from the experience,
there was a 7 or 8 year gap before I’d revisit C and C++ in college.</p>

<p><a href="/img/win98/enchive.png"><img src="/img/win98/enchive-thumb.png" alt="" /></a></p>

<p>I thought it would be interesting to revisit this software, to
reevaluate it from a far more experienced perspective. Keep in mind
that C++ wasn’t even standardized yet, and the most recent C standard
was from 1989. Given this, what was it like to be a professional
software developer using a Borland toolchain on Windows 20 years ago?
Was it miserable, made bearable only by ignorance of how much better
the tooling could be? Or maybe it actually wasn’t so bad, and these
tools are better than I expect?</p>

<p>Ultimately my conclusion is that it’s a little bit of both. There are
some significant capability gaps compared to today, but the core
toolchain itself is actually quite reasonable, especially for the mid
1990s.</p>

<h3 id="the-setup">The setup</h3>

<p>Before getting into the evaluation, let’s discuss how I got it all up
and running. While it’s <em>technically</em> possible to run Windows 95 on a
modern x86-64 machine thanks to <a href="/blog/2014/12/09/">the architecture’s extreme backwards
compatibility</a>, it’s more compatible, simpler, and safer to
virtualize it. Most importantly, I can emulate older hardware that
will have better driver support.</p>

<p>Despite that early start in Windows all those years ago, I’m primarily
a Linux user. The premier virtualization solution on Linux these days
is KVM, a kernel module <a href="https://www.redhat.com/en/topics/virtualization/what-is-KVM">that turns Linux into a hypervisor</a> and
makes efficient use of hardware virtualization extensions.
Unfortunately pre-XP Windows doesn’t work well on KVM, so instead I’m
using <a href="https://www.qemu.org/">QEmu</a> (with KVM disabled), a hardware emulator closely
associated with KVM. Since it doesn’t take advantage of hardware
virtualization extensions, it will be slower. This is fine since my
goal is to emulate slow, 20+ year old hardware anyway.</p>

<p>There’s very little practical difference between Windows 95 and
Windows 98. Since Windows 98 runs a lot smoother virtualized, I
decided to go with that instead. This will be perfectly sufficient for
my toolchain evaluation.</p>

<h4 id="software">Software</h4>

<p>To get started, I’ll need an installer for Windows 98. I thought this
would be difficult to find, but there’s a copy available on the
Internet Archive. I don’t know how “legitimate” this is, but it works.
Since it’s running in a virtual machine without network access, I also
don’t really care if this copy is somehow infected with malware.</p>

<p>Internet Archive: <a href="https://archive.org/details/win98se_201607">Windows 98 Second Edition</a></p>

<p>Also on the Internet Archive is a complete copy of Borland C++ 5.02,
with the same caveats of legitimacy. It works, which is good enough for
my purposes.</p>

<p>Internet Archive: <a href="https://archive.org/details/BorlandC5.02">Borland C++ 5.02</a></p>

<p>Thank you Internet Archive!</p>

<h4 id="hardware">Hardware</h4>

<p>I’ve got my software, now to set up the virtualized hardware. First I
create a drive image:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qemu-img create -f qcow2 win98.img 8G
</code></pre></div></div>

<p>I gave it 8GB, which is actually a bit overkill. Giving Windows 98 a
virtual hard drive with modern sizes would probably break the
installer. This sort of issue is a common theme among old software,
where there may be complaints about negative available disk space due
to signed integer overflow.</p>
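
<p>To illustrate the kind of overflow I mean, here’s a minimal sketch.
It’s not the installer’s actual logic, which I haven’t seen, and
<code class="language-plaintext highlighter-rouge">as_signed32()</code> is a hypothetical helper. It shows what a byte count
looks like after being squeezed into a signed 32-bit integer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical illustration: how a byte count appears to software
 * that stores it in a signed 32-bit integer. The wraparound is done
 * by hand so the result doesn't rely on implementation-defined
 * conversions. */
static long long
as_signed32(unsigned long long bytes)
{
    unsigned long long low = bytes &amp; 0xFFFFFFFFULL;  /* keep 32 bits */
    if (low &gt;= 0x80000000ULL) {
        return (long long)low - 0x100000000LL;       /* sign bit set */
    }
    return (long long)low;
}
</code></pre></div></div>

<p>Counting in powers of two, 3GB of free space reads as negative 1GB,
and exactly 8GB reads as zero.</p>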

<p>I decided to give the machine 256MB of memory (<code class="language-plaintext highlighter-rouge">-m 256</code>). This is also a
little excessive, but I wanted to be sure memory didn’t limit Borland’s
capabilities. This amount of memory is close to the upper bound, and
going much beyond will likely cause problems with Windows 98.</p>

<p>For the CPU I settled on a Pentium (<code class="language-plaintext highlighter-rouge">-cpu pentium</code>). My original goal
was to go a little simpler with a 486 (<code class="language-plaintext highlighter-rouge">-cpu 486</code>), but the Windows 98
installer kept crashing when I tried this.</p>

<p>I experimented with different configurations for the network card, but
I couldn’t get anything to work. So I’ve disabled networking (<code class="language-plaintext highlighter-rouge">-net
none</code>). The only reason I’d want this is that it would be easier to
move files in and out of the virtual machine.</p>

<p>Finally, here’s how I ran QEmu. The last two lines are only needed when
installing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qemu-system-x86_64 \
    -localtime \
    -cpu pentium \
    -no-acpi \
    -no-hpet \
    -m 256 \
    -hda win98.img \
    -soundhw sb16 \
    -vga cirrus \
    -net none \
    -cdrom "Windows 98 Second Edition.iso" \
    -boot d
</code></pre></div></div>

<p><a href="/img/win98/install.png"><img src="/img/win98/install-thumb.png" alt="" /></a></p>

<h4 id="installation">Installation</h4>

<p>Installation is just a matter of following the instructions. You’ll
need that product key listed on the Internet Archive site.</p>

<p><a href="/img/win98/base.png"><img src="/img/win98/base-thumb.png" alt="" /></a></p>

<p>That copy of Borland is just a big .zip file. This presents two
problems.</p>

<ol>
  <li>
    <p>Without network access, I’ll need to figure out how to get this
inside the virtual machine.</p>
  </li>
  <li>
    <p>This version of Windows doesn’t come with software to unzip this
file. I’d need to find and install an unzip tool first.</p>
  </li>
</ol>

<p>Fortunately I can kill two birds with one stone by converting that .zip
archive into a .iso and mounting it in the virtual machine.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip "BORLAND C++.zip"
genisoimage -R -J -o borland.iso "BORLAND C++"
</code></pre></div></div>

<p>Then in the QEmu console (<kbd>C-A-2</kbd>) I attach it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>change ide1-cd0 borland.iso
</code></pre></div></div>

<p>This little trick of generating .iso files and mounting them is how I
will be moving all the other files into the virtual machine.</p>

<h3 id="borland-c">Borland C++</h3>

<p>The first thing I did was play around with the Borland IDE. This is
what I would have been using 20 years ago.</p>

<p><a href="/img/win98/ide.png"><img src="/img/win98/ide-thumb.png" alt="" /></a></p>

<p>Despite being Borland <em>C++</em>, I’m personally most interested in its ANSI
C compiler. As I already pointed out, this software pre-dates C++’s
standardization, and a lot has changed over the past two decades. On the
other hand, C <em>hasn’t really changed all that much</em>. The 1999 update to
the C standard (i.e. “C99”) was big and important, but otherwise little
has changed. The biggest drawback is the lack of “declare anywhere”
variables, including in for-loop initializers. Otherwise it’s the same
as writing C today.</p>
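
<p>For anyone who hasn’t written C this way, here’s a small sketch of
what that restriction means in practice. The function is just an
illustration:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* C89 style: declarations come first in the block, and the loop
 * counter can't be declared in the for-loop initializer. */
static long
sum(const int *values, int n)
{
    int i;
    long total = 0;
    for (i = 0; i &lt; n; i++) {
        total += values[i];
    }
    return total;
}
</code></pre></div></div>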

<p>To test drive the IDE, I made a couple of test projects, built and ran
them with different options, and poked around with the debugger. The
debugger is actually pretty decent, especially for the 1990s. It can be
operated via the IDE or standalone, so I could use it without firing up
the IDE and making a project.</p>

<p>The toolchain includes an assembler, and I can inspect the compiler’s
assembly output. To nobody’s surprise this is Intel-flavored assembly,
which <a href="http://x86asm.net/articles/what-i-dislike-about-gas/">is very welcome</a>. Imagining myself as a software developer
in the mid 1990s, this means I can see exactly what the compiler’s doing
as well as write some of the performance sensitive parts in assembly if
necessary.</p>

<p>The built-in editor is the worst part of the IDE, which is unfortunate
since it really spoils the whole experience. It’s easy to jump between
warnings and errors, it has incremental search, and it has good syntax
highlighting. But these are the only positive things I can say about it.
If I had to work with this editor full-time, I’d spend my days pretty
irritated.</p>

<h3 id="switch-to-command-line-tools">Switch to command line tools</h3>

<p>Like with the debugger, the Borland people did a good job modularizing
their development tools. As part of the installation process, all of the
Borland command line tools are added to the system <code class="language-plaintext highlighter-rouge">PATH</code> (reminder:
this is a single-user system). This includes the compiler, linker,
assembler, debugger, and even an <a href="/blog/2017/08/20/">incomplete implementation</a> of
<code class="language-plaintext highlighter-rouge">make</code>.</p>

<p>With this, I can essentially pretend the IDE doesn’t exist and replace
that crummy editor with something better: Vim.</p>

<p>The last version of Vim to support MS-DOS and Windows 95/98 is Vim 7.3,
released in 2010. I download those binaries, trim a few things from my
<a href="https://github.com/skeeto/dotfiles/blob/master/_vimrc">.vimrc</a>, and smuggle it all into my virtual machine via a
virtual CD. I’ve now got a powerful text editor in Windows 98 and my
situation has drastically improved.</p>

<p><a href="/img/win98/vim.png"><img src="/img/win98/vim-thumb.png" alt="" /></a></p>

<p>Since I hardly use features added since Vim 7.3, this feels <a href="/blog/2017/04/01/">right at
home</a> to me. I can <a href="/blog/2017/08/22/">invoke the build</a> from Vim, and it
can populate the quickfix list from Borland’s output, so I could
actually be fairly productive in these circumstances! I’m honestly
really impressed with how well this all works together.</p>

<p>At this point I only have two significant annoyances:</p>

<ol>
  <li>
    <p>Borland’s command line tools belong to that category of irritating
programs that print their version banner on every invocation.
There’s not even a command line switch to turn this off. All this
noise is quickly tiresome. The <a href="/blog/2016/06/13/">Visual Studio toolchain</a> does
the same thing by default, though it can be turned off (<code class="language-plaintext highlighter-rouge">-nologo</code>).
I dislike that some GNU tools also commit this sin, but at least
GNU limits this to interactive programs.</p>
  </li>
  <li>
    <p>The Windows/DOS command shell and console is <em>even worse</em> <a href="/blog/2017/11/30/">than it
is today</a>. I didn’t think that was possible. This is back when
it was still genuinely DOS and not just pretending to be (e.g. in
NT). The worst part by far is the lack of command history. There’s
no using the up-arrow to get previous commands. There’s no tab
completion. Forward slash is not a substitute for backslash in
paths. If I wanted to improve my productivity, replacing this
console and shell would be the first priority.</p>
  </li>
</ol>

<p><strong>Update</strong>: In an email, Aristotle Pagaltzis informed me that Windows 98
comes with <a href="https://en.wikipedia.org/wiki/DOSKEY">DOSKEY.COM</a>, which provides command history for
COMMAND.COM. Alternatively there’s <a href="http://paulhoule.com/doskey/">Enhanced DOSKEY.com</a>, an
open source, alternative implementation that also provides tab
completion for commands and filenames. This makes the console a lot
more usable (and, honestly, in some ways better than the modern
defaults).</p>

<h3 id="building-enchive-with-borland">Building Enchive with Borland</h3>

<p>Last year I wrote <a href="/blog/2017/03/12/">a backup encryption tool called Enchive</a>,
and I still use it regularly. One of my design goals was high
portability since it may be needed to decrypt something important in
the distant future. It should be as <a href="https://en.wikipedia.org/wiki/Software_rot">bit-rot</a>-proof as
possible. <strong>In software, the best way to <em>future</em>-proof is to
<em>past</em>-proof.</strong></p>

<p>If I had a time machine that could send source code back in time, and
I sent Enchive to a competent developer 20 years ago, would they be
able to compile it and run it? If the answer is yes, then that means
Enchive already has 20 years of future-proofing built into it.</p>

<p>To accomplish this, Enchive is 3,300 lines of strict ANSI C,
1989-style, with no dependencies other than the C standard library and
a handful of operating system functions for functionality not in
the C standard library. In practice, any ANSI C compiler targeting
either POSIX, or Windows 95 or later, should be able to compile it.</p>

<p>My Windows 98 virtual machine includes an ANSI C compiler, and can be
used to simulate this time machine. I generated an “amalgamation” build
(<code class="language-plaintext highlighter-rouge">make amalgamation</code>) — essentially a concatenation of all the source
files — and sent this into the virtual machine. Before Borland was able
to compile it, I needed to make three small changes.</p>

<p>First, Enchive includes <code class="language-plaintext highlighter-rouge">stdint.h</code> to get fixed-width integers needed
for the encryption routines. This header comes from C99, and C89 has
no equivalent. I anticipated this problem from the beginning and made
it easy for the person performing the build to correct it. This header
is included exactly once, in <code class="language-plaintext highlighter-rouge">config.h</code>, and this is placed at the top
of the amalgamation build. The include only needs to be replaced with
a handful of manual typedefs. For Borland that looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">char</span>    <span class="kt">uint8_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">short</span>   <span class="kt">uint16_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">long</span>    <span class="kt">uint32_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="n">__int64</span> <span class="kt">uint64_t</span><span class="p">;</span>

<span class="k">typedef</span> <span class="kt">long</span>             <span class="kt">int32_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="n">__int64</span>          <span class="kt">int64_t</span><span class="p">;</span>

<span class="cp">#define INT8_C(n)   (n)
#define INT16_C(n)  (n)
#define INT32_C(n)  (n##L)
</span></code></pre></div></div>
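
<p>Typedefs like these are hand-picked for one compiler’s type sizes, so
it’s worth checking them at compile time. The following is a sketch of
one way to do that in strict C89, with hypothetical names so that it
can’t clash with a real <code class="language-plaintext highlighter-rouge">stdint.h</code>. It’s not taken from Enchive:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;limits.h&gt;

/* Pick a 32-bit unsigned type using only what C89 provides. */
#if UINT_MAX == 0xFFFFFFFF
typedef unsigned int  sketch_u32;
#elif ULONG_MAX == 0xFFFFFFFF
typedef unsigned long sketch_u32;
#else
#error "no 32-bit unsigned type found"
#endif

/* C89 has no static assertion, but an array typedef with a negative
 * size fails to compile, which serves the same purpose. */
typedef char sketch_u32_is_32_bits[
    sizeof(sketch_u32) * CHAR_BIT == 32 ? 1 : -1
];
</code></pre></div></div>

<p>If the chosen type ever had the wrong width, the build would stop at
the array typedef rather than produce a program that silently computes
the wrong thing.</p>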

<p>Second, in more recent versions of Windows, <code class="language-plaintext highlighter-rouge">GetFileAttributes()</code> can
return the value <code class="language-plaintext highlighter-rouge">INVALID_FILE_ATTRIBUTES</code>. Checking for an error that
cannot happen is harmless, but this value isn’t defined in Borland’s
SDK. I only had to eliminate that check.</p>

<p>Third, the <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa379942(v=vs.85).aspx"><code class="language-plaintext highlighter-rouge">CryptGenRandom()</code></a> interface isn’t defined in
Borland’s SDK. This is used by Enchive to generate keys. MSDN reports
this function wasn’t available until Windows XP, but it’s definitely
there in Windows 98, exported by ADVAPI32.dll. I’m able to call it,
though it always reports an error. Perhaps it’s been disabled in this
version due to <a href="https://en.wikipedia.org/wiki/Export_of_cryptography_from_the_United_States">cryptographic export restrictions</a>?</p>

<p>Regardless of what’s wrong, I ripped this out and replaced it with a
fatal error. This version of Enchive can’t generate new keys — unless
derived from a passphrase — nor encrypt files, including the use of a
protection key to encrypt the secret key. However, it <em>can</em> decrypt
files, which is the important part that needs to be future-proofed.</p>

<p>With these three changes, which took me about 10 minutes to sort out,
Enchive builds and runs, and it correctly decrypts files I encrypted on
Linux. So Enchive has at least 20 years of past-proofing! The
screenshot at the top of this article shows it running successfully in
an MS-DOS console window.</p>

<h3 id="whats-wrong-whats-missing">What’s wrong? What’s missing?</h3>

<p>I mentioned that there were some gaps. The most obvious is the lack of
the standard POSIX utilities, especially a decent shell. I don’t know if
any had been ported to Windows in the mid 1990s. But that could be
solved one way or another without too much trouble, even if it meant
doing some of that myself.</p>

<p>No, the biggest capability I’d miss, and which wouldn’t be easily
obtained, is Git, or at least a decent source control system. I really
don’t want to work without proper source control. Git’s support for
Windows is second tier, and the port to modern Windows is already a
bit of a hack. Getting it to run in Windows 98 would probably be a
challenge, especially if I had to compile it with Borland.</p>

<p>The other major issue is the lack of stability. In this experiment, I’ve
been seeing this screen <em>a lot</em>:</p>

<p><a href="/img/win98/bsod.png"><img src="/img/win98/bsod-thumb.png" alt="" /></a></p>

<p>I remember Windows crashing a lot back in those days, and it certainly
had a bad reputation for being unstable, but this is far worse than I
remembered. While the hardware emulator may be <em>somewhat</em> at fault here,
keep in mind that I never installed third party drivers. Most of these
crashes are Windows’ fault. I found I can reliably bring the whole
system down with a single <code class="language-plaintext highlighter-rouge">GetProcAddress()</code> call on a system DLL. The
only way I can imagine this instability was so tolerated back then was
general ignorance that computing could be so much better.</p>

<p>I was tempted to write this article in Vim on Windows 98, but all this
crashing made me too nervous. I didn’t want some stupid filesystem
corruption to wipe out my work. Too risky.</p>

<h3 id="a-better-alternative">A better alternative</h3>

<p>If I was stuck working in Windows 98 — or was at least targeting it as a
platform — but had access to a modern tooling ecosystem, could I do
better than Borland? Yes! Programs built by <a href="https://mingw-w64.org/doku.php">Mingw-w64</a> can be
run even as far back as Windows 95.</p>

<p>Now, there’s a catch. I thought it would be this simple:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ i686-w64-mingw32-gcc -Os hello.c
</code></pre></div></div>

<p>But when I brought the resulting binary into the virtual machine it
crashed when I ran it: illegal instruction. Turns out it contained a
conditional move (<code class="language-plaintext highlighter-rouge">cmov</code>) which is an instruction not available until
the Pentium Pro (686). The “pentium” emulation is just a 586.</p>

<p>I tried to disable <code class="language-plaintext highlighter-rouge">cmov</code> by picking the specific architecture:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ i686-w64-mingw32-gcc -march=pentium -Os hello.c
</code></pre></div></div>

<p>This still didn’t work because the statically-linked part of the CRT
contained the <code class="language-plaintext highlighter-rouge">cmov</code>. I’d have to recompile that as well.</p>

<p>I could have switched the QEMU options to “upgrade” to a Pentium Pro,
but remember that my goal was really the 486. Fortunately this was easy
to fix: compile my own Mingw-w64 cross-compiler. I’ve done this a number
of times before, so I knew it wouldn’t be difficult.</p>

<p>I could go step by step, but it’s all fairly well documented in the
Mingw-w64 “howto-build” document. I used GCC 7.3 (the latest version),
and for the target I picked “i486-w64-mingw32”. When it was done I could
compile binaries on Linux to run in my Windows 98 virtual machine:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ i486-w64-mingw32-gcc -Os hello.c
</code></pre></div></div>

<p>This should enable quite a bit of modern software to run inside my
virtual machine if I so wanted. I didn’t actually try this (yet?),
but, to take this concept all the way, I could use this cross-compiler
to cross-compile Mingw-w64 itself to run inside the virtual machine,
directly replacing Borland C++.</p>

<p>And the only thing I’d miss about Borland is its debugger.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Initial Evaluation of the Windows Subsystem for Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/11/30/"/>
    <id>urn:uuid:3edd1b7d-74c3-3dab-83b6-aa07ee54460f</id>
    <updated>2017-11-30T21:03:53Z</updated>
    <category term="linux"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Recently I had my first experiences with the <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/wsl/2016/04/22/windows-subsystem-for-linux-overview/"><em>Windows Subsystem for
Linux</em></a> (WSL), evaluating its potential as an environment for
getting work done. This subsystem, introduced to Windows 10 in August
2016, allows Windows to natively run x86 and x86-64 Linux binaries.
It’s essentially the counterpart to Wine, which allows Linux to
natively run Windows binaries.</p>

<p>WSL interfaces with Linux programs only at the kernel level, servicing
system calls the same way <a href="/blog/2015/05/15/">the Linux kernel would</a>. The
subsystem’s main job is translating Linux system calls into NT
requests. There’s a <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/wsl/">series of articles about its internals</a> if
you’re interested in learning more.</p>

<p>I was honestly impressed by how well this all works, especially since
Microsoft has long had an affinity for producing flimsy imitations
(Windows console, PowerShell, Arial, etc.). WSL’s design allows
Microsoft to dump an Ubuntu system wholesale inside Windows — and,
more recently, other Linux distributions — bypassing a bunch of
annoying issues, particularly in regards to glibc.</p>

<p>WSL processes can <code class="language-plaintext highlighter-rouge">exec(2)</code> Windows binaries, which then run under
their appropriate subsystem, similar to <a href="https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/binfmt-misc.rst">binfmt</a> on Linux. In
theory this nice interop should allow for <em>some</em> automation
Linux-style even for Windows’ services and programs. More on that
later.</p>

<p>There are some notable issues, though.</p>

<h3 id="lack-of-device-emulation">Lack of device emulation</h3>

<p>No soundcard devices are exposed to the subsystem, so Linux programs
can’t play sound. There’s <a href="https://trzeci.eu/configure-graphic-and-sound-on-wsl/">a hack to talk PulseAudio</a> with a
Windows process that can access the soundcard, but that’s about it. Generally
there’s not much reason to be playing media or games under WSL, but
this can be an annoyance if you’re, say, <a href="/blog/2017/11/03/">writing software that
synthesizes audio</a>.</p>

<p>Really, there’s almost no device emulation at all and <code class="language-plaintext highlighter-rouge">/proc</code> is
pretty empty. You won’t see hard drives or removable media under
<code class="language-plaintext highlighter-rouge">/dev</code>, nor will you see USB devices like webcams and
<a href="/blog/2016/11/05/">joysticks</a>. A lot of the useful things you might do on a Linux
system aren’t available under WSL.</p>

<h3 id="no-filesystem-in-userspace-fuse">No Filesystem in Userspace (FUSE)</h3>

<p>Microsoft hasn’t implemented any of the system calls for FUSE, so don’t
expect to use your favorite userspace filesystems. The biggest loss for
me is <a href="https://github.com/libfuse/sshfs">sshfs</a>, which I use frequently.</p>

<p>If FUSE <em>was</em> supported, it would be interesting to see how the rest of
Windows interacts with these mounted filesystems, if at all.</p>

<h3 id="fragile-services">Fragile services</h3>

<p>Services running under WSL are flaky. The big issue is that when the
initial WSL shell process exits, all WSL processes are killed and the
entire subsystem is torn down. This includes any services that are
running. That’s certainly surprising to anyone with experience running
services on any kind of unix system. This is probably the worst part
of WSL.</p>

<p>While systemd is the standard for Linux these days and may even be
“installed” in the WSL virtual filesystem, it’s not actually running
and you can’t use <code class="language-plaintext highlighter-rouge">systemctl</code> to interact with services. Services can
only be controlled the old fashioned way, and, per above, that initial
WSL console window has to remain open while services are running.</p>

<p>That’s a bit of a damper if you’re intending to spend a lot of time
remotely SSHing into your Windows 10 system. So yes, it’s trivial to run
an OpenSSH server under WSL, but it won’t feel like a proper system
service.</p>

<h3 id="limited-graphics-support">Limited graphics support</h3>

<p>WSL doesn’t come with an X server, so you have to supply one
separately (<a href="https://sourceforge.net/projects/xming/">Xming</a>, etc.) that runs outside WSL, as a normal
Windows process. WSL processes can connect to that server (<code class="language-plaintext highlighter-rouge">DISPLAY</code>)
allowing you to run most Linux graphical software.</p>

<p>However, this means there’s no hardware acceleration. There will be no
<a href="https://en.wikipedia.org/wiki/GLX">GLX extensions</a> available. If your goal is to run the Emacs or
Vim GUIs, that’s not a big deal, but it might matter if you were
interested in running a browser under WSL. It also means it’s not a
suitable environment for <a href="/blog/2015/06/06/">developing software using OpenGL</a>.</p>

<h3 id="filesystem-woes">Filesystem woes</h3>

<p>The filesystem manages to be both one of the smallest issues as well
as one of the biggest.</p>

<h4 id="filename-translation">Filename translation</h4>

<p>On the small issue side is filename translation. Under most Linux
filesystems — and even more broadly for unix — <a href="http://yarchive.net/comp/linux/case_insensitive_filenames.html">a filename is just a
bytestring</a>. They’re not necessarily UTF-8 or any other
particular encoding, and that’s partly why filenames are
case-sensitive — the meaning of case depends on the encoding.</p>

<p>However, Windows uses a <a href="/blog/2016/06/13/">pseudo-UTF-16 scheme</a> for filenames,
incompatible with bytestrings. Since WSL lives <em>within</em> a Windows
filesystem, there must be some bijection between bytestring filenames
and pseudo-UTF-16 filenames. It will also have to reject filenames
that can’t be mapped. WSL does both.</p>

<p>I couldn’t find any formal documentation about how filename
translation works, but most of it can be reverse engineered through
experimentation. In practice, Linux filenames are <a href="/blog/2017/10/06/">UTF-8 encoded
strings</a>, and WSL’s translation takes advantage of this.
Filenames are decoded as UTF-8 and re-encoded as UTF-16 for Windows.
Any byte that doesn’t decode as valid UTF-8 is silently converted to
REPLACEMENT CHARACTER (U+FFFD), and decoding continues from the next
byte.</p>
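<p>The decoding rule described above can be sketched in C. This is only an
illustration reconstructed from the observed behavior, not WSL’s actual
code: the names <code class="language-plaintext highlighter-rouge">utf8_decode</code> and <code class="language-plaintext highlighter-rouge">name_to_utf16</code> are hypothetical, and
treating overlong and surrogate encodings as invalid is an assumption.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s[0..len), len >= 1. Returns the number
 * of bytes consumed and stores the code point in *cp, or returns 0 if the
 * bytes at s do not start a valid sequence. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    static const uint32_t min[] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n;
    if (s[0] < 0x80)                { *cp = s[0];        n = 1; }
    else if ((s[0] & 0xe0) == 0xc0) { *cp = s[0] & 0x1f; n = 2; }
    else if ((s[0] & 0xf0) == 0xe0) { *cp = s[0] & 0x0f; n = 3; }
    else if ((s[0] & 0xf8) == 0xf0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;                       /* stray continuation, 0xfe, 0xff... */
    if (n > len) return 0;               /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;
        *cp = *cp << 6 | (s[i] & 0x3f);
    }
    if (*cp < min[n] || *cp > 0x10ffff) return 0;  /* overlong or too large */
    if (*cp >= 0xd800 && *cp <= 0xdfff) return 0;  /* encoded surrogate */
    return n;
}

/* Translate a bytestring filename into UTF-16 code units. Each byte that
 * fails to decode becomes U+FFFD, and decoding resumes at the next byte.
 * dst needs room for len units (worst case is one unit per input byte).
 * Returns the number of units written. */
size_t name_to_utf16(const unsigned char *src, size_t len, uint16_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len;) {
        uint32_t cp;
        size_t n = utf8_decode(src + i, len - i, &cp);
        if (!n) { cp = 0xfffd; n = 1; }
        if (cp >= 0x10000) {             /* needs a surrogate pair */
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xd800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xdc00 | (cp & 0x3ff));
        } else {
            dst[out++] = (uint16_t)cp;
        }
        i += n;
    }
    return out;
}
```

<p>Under this rule the distinct bytestrings <code class="language-plaintext highlighter-rouge">a\xffb</code> and <code class="language-plaintext highlighter-rouge">a\xfeb</code> both
translate to the same three code units: <code class="language-plaintext highlighter-rouge">a</code>, U+FFFD, <code class="language-plaintext highlighter-rouge">b</code>.</p>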

<p>I wonder if there are security consequences for different filenames
silently mapping to the same underlying file.</p>

<p>Exercise for the reader: How is an unmatched surrogate half from
Windows translated to WSL, where it doesn’t have a UTF-8 equivalent? I
haven’t tried this yet.</p>

<p>Even for valid UTF-8, there are many bytes that most Linux filesystems
allow in filenames that Windows does not. This ranges from simple things
like ASCII backslash and colon — special components of Windows’ paths —
to unusual characters like newlines, escape, and other ASCII control
characters. There are two different ways these are handled:</p>

<ol>
  <li>
    <p>The C drive is available under <code class="language-plaintext highlighter-rouge">/mnt/c</code>, and WSL processes can access
regular Windows files under this “mountpoint.” Attempting to access
filenames with invalid characters under this mountpoint always
results in ENOENT: “No such file or directory.”</p>
  </li>
  <li>
    <p>Outside of <code class="language-plaintext highlighter-rouge">/mnt/c</code> is WSL territory, and Windows processes aren’t
supposed to touch these files. This allows for more freedom when
translating filenames. REPLACEMENT CHARACTER is still used for
invalid UTF-8 sequences, but the forbidden characters, including
backslashes, are all permitted. They’re translated to <code class="language-plaintext highlighter-rouge">#XXXX</code> where X
is hexadecimal for the normally invalid character. For example, <code class="language-plaintext highlighter-rouge">a:b</code>
becomes <code class="language-plaintext highlighter-rouge">a#003Ab</code>.</p>
  </li>
</ol>
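<p>The second scheme might look something like the following sketch. The
function name <code class="language-plaintext highlighter-rouge">escape_name</code> and the exact set of forbidden characters are
assumptions for illustration, and how WSL treats a literal <code class="language-plaintext highlighter-rouge">#</code> in a
filename isn’t covered here.</p>

```c
#include <stdio.h>
#include <string.h>

/* Assumed set of characters that Windows rejects in filenames: ASCII
 * control characters plus a handful of punctuation. */
static int is_forbidden(unsigned char c)
{
    return c < 0x20 || strchr("\\:*?\"<>|", c) != NULL;
}

/* Write the escaped form of name into dst, which has room for cap bytes
 * including the terminator. Forbidden characters become "#XXXX" with
 * four uppercase hexadecimal digits. Returns 0 on success, or -1 if dst
 * is too small. */
int escape_name(const char *name, char *dst, size_t cap)
{
    size_t out = 0;
    if (cap == 0) return -1;
    for (; *name; name++) {
        unsigned char c = (unsigned char)*name;
        if (is_forbidden(c)) {
            if (cap - out < 6) return -1;    /* "#XXXX" plus terminator */
            out += (size_t)sprintf(dst + out, "#%04X", (unsigned)c);
        } else {
            if (cap - out < 2) return -1;    /* one byte plus terminator */
            dst[out++] = (char)c;
        }
    }
    dst[out] = 0;
    return 0;
}
```

<p>Given <code class="language-plaintext highlighter-rouge">a:b</code>, this produces <code class="language-plaintext highlighter-rouge">a#003Ab</code>, matching the example above.</p>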

<p>While WSL doesn’t let you get away with all the crazy, ill-advised
filenames that Linux allows, it’s still quite reasonable. Since Windows
and Linux filenames aren’t entirely compatible, there’s going to be some
trade-off no matter how this translation is done.</p>

<h4 id="filesystem-performance">Filesystem performance</h4>

<p>On the other hand, filesystem performance is abysmal, and I doubt the
subsystem is to blame. This isn’t a surprise to anyone who’s used
moderately-sized Git repositories on Windows, where the large number
of loose files brings things to a crawl. This has been a Windows issue
for years, and that’s even <em>before</em> you start plugging in the
typical “security” services (virus scanners, whitelists, etc.)
that are usually present on a Windows system and make this even
worse.</p>

<p>To test out WSL, I went around my normal business <a href="/blog/2017/06/19/">compiling
tools</a> and making myself at home, just as I would on Linux.
Doing nearly anything in WSL was noticeably slower than doing the same
on Linux on the exact same hardware. I didn’t run any benchmarks, but
I’d expect to see around an order of magnitude difference on average
for filesystem operations. Building LLVM and Clang took a couple
hours rather than the typical 20 minutes.</p>

<p>I don’t expect this issue to get fixed anytime soon, and it’s probably
always going to be a notable limitation of WSL.</p>

<h3 id="so-is-wsl-useful">So is WSL useful?</h3>

<p>One of my hopes for WSL appears to be unfeasible. I thought it might
be a way to avoid <a href="/blog/2017/03/30/">porting software from POSIX to Win32</a>. I
could just supply Windows users with the same Linux binary and they’d
be fine. <del>However, WSL requires switching Windows into a special
“developer mode,” putting it well out of reach of the vast majority of
users, especially considering the typical corporate computing
environment that will lock this down. In practice, WSL is only useful
to developers. I’m sure this is no accident.</del> (Developer mode is <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/commandline/2017/10/11/whats-new-in-wsl-in-windows-10-fall-creators-update/">no
longer required</a> as of October 2017.)</p>

<p>Mostly I see WSL as a Cygwin killer. <a href="https://sanctum.geek.nz/arabesque/series/unix-as-ide/">Unix is my IDE</a> and, on
Windows, Cygwin has been my preferred go-to for getting a solid unix
environment for software development. Unlike WSL, Cygwin processes can
make direct Win32 calls, which is occasionally useful. But, in exchange,
WSL will overall be better equipped. It has native Linux tools,
including a better suite of debugging tools than even Windows itself
offers: Valgrind, strace, and a properly working GDB (which has always
been flaky in Cygwin). WSL is not nearly as good as actual Linux, but
it’s better than Cygwin <em>if</em> you can get access to it.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>How to Write Portable C Without Complicating Your Build</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/03/30/"/>
    <id>urn:uuid:e1651834-8033-3bfa-6eaf-00bc38a0584a</id>
    <updated>2017-03-30T04:06:58Z</updated>
    <category term="c"/><category term="posix"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re writing a non-GUI C application intended to run on a
number of operating systems: Linux, the various BSDs, macOS, <a href="https://en.wikipedia.org/wiki/Illumos">classical
unix</a>, and perhaps even something as exotic as Windows. It might
sound like a rather complicated problem. These operating systems have
slightly different interfaces (or <em>very</em> different in one case), and they
run different variants of the standard unix tools — a problem for
portable builds.</p>

<p>With some up-front attention to detail, this is actually not terribly
difficult. <strong>Unix-like systems are probably the least diverse and least
buggy they’ve ever been.</strong> Writing portable code is really just a matter
of <strong>coding to the standards</strong> and ignoring extensions unless
<em>absolutely</em> necessary. Knowing what’s standard and what’s extension is
the tricky part, but I’ll explain how to find this information.</p>

<p>You might be tempted to reach for an <a href="https://undeadly.org/cgi?action=article;sid=20170930133438">overly complicated</a> solution
such as GNU Autoconf. Sure, it creates a configure script with the
familiar, conventional interface. This has real value. But do you
<em>really</em> need to run a single-threaded gauntlet of hundreds of
feature/bug tests for things that sometimes worked incorrectly in some
weird unix variant back in the 1990s? On a machine with many cores
(parallel build, <code class="language-plaintext highlighter-rouge">-j</code>), this may very well be the slowest part of the
whole build process.</p>

<p>For example, the configure script for Emacs checks that the compiler
supplies <code class="language-plaintext highlighter-rouge">stdlib.h</code>, <code class="language-plaintext highlighter-rouge">string.h</code>, and <code class="language-plaintext highlighter-rouge">getenv</code> — things that were
standardized nearly 30 years ago. It also checks for a slew of POSIX
functions that have been standard since 2001.</p>

<p>There’s a much easier solution: Document that the application requires,
say, C99 and POSIX.1-2001. It’s the responsibility of the person
building the application to supply these implementations, so there’s no
reason to waste time testing for it.</p>

<h3 id="how-to-code-to-the-standards">How to code to the standards</h3>

<p>Suppose there’s some function you want to use, but you’re not sure if
it’s standard or an extension. Or maybe you don’t know what standard it
comes from. Luckily the man pages document this stuff very well,
especially on Linux. Check the friendly “CONFORMING TO” section. For
example, look at <a href="https://manpages.debian.org/jessie/manpages-dev/getenv.3.en.html">getenv(3)</a>. Here’s what that section has to
say:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFORMING TO
    getenv(): SVr4, POSIX.1-2001, 4.3BSD, C89, C99.

    secure_getenv() is a GNU extension.
</code></pre></div></div>

<p>This says this function comes from the original C standard. It’s <em>always</em>
available on anything that claims to be a C implementation. The man page
also documents <code class="language-plaintext highlighter-rouge">secure_getenv()</code>, which is a GNU extension: to be avoided
in anything intended to be portable.</p>

<p>What about <a href="https://manpages.debian.org/jessie/manpages-dev/sleep.3.en.html">sleep(3)</a>?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFORMING TO
    POSIX.1-2001.
</code></pre></div></div>

<p>This function isn’t part of standard C, but it’s available on any system
claiming to implement POSIX.1-2001 (the POSIX standard from 2001). If the
program needs to run on an operating system not implementing this POSIX
standard (i.e. Windows), you’ll need to call an alternative function,
probably inside a different <code class="language-plaintext highlighter-rouge">#if .. #endif</code> branch. More on this in a
moment.</p>

<p>If you’re coding to POSIX, you <a href="http://pubs.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_02.html"><em>must</em> define the <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code>
feature test macro</a> to the standard you intend to use prior to
any system header includes:</p>

<blockquote>
  <p>A POSIX-conforming application should ensure that the feature test
macro <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code> is defined before inclusion of any header.</p>
</blockquote>

<p>For example, to properly access POSIX.1-2001 functions in your
application, define <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code> to <code class="language-plaintext highlighter-rouge">200112L</code>. With this defined,
it’s safe to assume access to all of C and everything from that standard
of POSIX. You can do this at the top of your sources, but I personally
like the tidiness of a global <code class="language-plaintext highlighter-rouge">config.h</code> that gets included before
everything.</p>
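<p>As a minimal sketch of that arrangement, here is what such a source file
might look like with the contents of <code class="language-plaintext highlighter-rouge">config.h</code> inlined at the top. The helper
<code class="language-plaintext highlighter-rouge">have_posix_2001</code> is a hypothetical name used only to show that the system
headers respond to the macro.</p>

```c
/* Would normally live in config.h, included before any system header. */
#define _POSIX_C_SOURCE 200112L

#include <unistd.h>  /* defines _POSIX_VERSION on POSIX systems */

/* Returns 1 when the system headers claim at least POSIX.1-2001. */
int have_posix_2001(void)
{
#if defined(_POSIX_VERSION) && _POSIX_VERSION >= 200112L
    return 1;
#else
    return 0;
#endif
}
```

<p>The important detail is the ordering: the definition comes before the
first <code class="language-plaintext highlighter-rouge">#include</code> of a system header, as the standard asks.</p>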

<h3 id="how-to-create-a-portable-build">How to create a portable build</h3>

<p>So you’ve written clean, portable C to the standards. How do you build
this application? The natural choice is <code class="language-plaintext highlighter-rouge">make</code>. It’s available
everywhere and it’s part of POSIX.</p>

<p>Again, the tricky part is teasing apart the standard from the extension.
I’m a long-time sinner in this regard, having far too often written
Makefiles that depend on GNU Make extensions. This is a real pain when
building programs on systems without the GNU utilities. I’ve been making
amends (and <a href="https://marc.info/?l=openbsd-bugs&amp;m=148815538325392&amp;w=2">finding</a> some <a href="https://marc.info/?l=openbsd-bugs&amp;m=148734102504016&amp;w=2">bugs</a> as a result).</p>

<p>No implementation makes the division clear in its documentation, and
especially don’t bother looking at the GNU Make manual. Your best
resource is <a href="http://pubs.opengroup.org/onlinepubs/009695399/utilities/make.html">the standard itself</a>. If you’re already familiar with
<code class="language-plaintext highlighter-rouge">make</code>, coding to the standard is largely a matter of <em>unlearning</em> the
various extensions you know.</p>

<p>Outside of <a href="/blog/2016/04/30/">some hacks</a>, this means you don’t get conditionals
(<code class="language-plaintext highlighter-rouge">if</code>, <code class="language-plaintext highlighter-rouge">else</code>, etc.). With some practice, both with sticking to portable
code and writing portable Makefiles, you’ll find that you <em>don’t really
need them</em>. Following the macro conventions will cover most situations.
For example:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CC</code>: the C compiler program</li>
  <li><code class="language-plaintext highlighter-rouge">CFLAGS</code>: flags to pass to the C compiler</li>
  <li><code class="language-plaintext highlighter-rouge">LDFLAGS</code>: flags to pass to the linker (via the C compiler)</li>
  <li><code class="language-plaintext highlighter-rouge">LDLIBS</code>: libraries to pass to the linker</li>
</ul>

<p>You don’t need to do anything weird with the assignments. The user
invoking <code class="language-plaintext highlighter-rouge">make</code> can override them easily. For example, here’s part of a
Makefile:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CC     = c99
CFLAGS = -Wall -Wextra -Os
</code></pre></div></div>

<p>But the user wants to use <code class="language-plaintext highlighter-rouge">clang</code>, and their system needs to explicitly
link <code class="language-plaintext highlighter-rouge">-lsocket</code> (e.g. Solaris). The user can override the macro
definitions on the command line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make CC=clang LDLIBS=-lsocket
</code></pre></div></div>

<p>The same rules apply to the programs you invoke from the Makefile. Read
the standards documents and ignore your system’s man pages so as to avoid
accidentally using an extension. It’s especially valuable to learn <a href="http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html">the
Bourne shell language</a> and avoid any accidental bashisms in your
Makefiles and scripts. The <code class="language-plaintext highlighter-rouge">dash</code> shell is good for testing your scripts.</p>

<p>Makefiles conforming to the standard will, unfortunately, be more verbose
than those taking advantage of a particular implementation. If you know
how to code Bourne shell — which is not terribly difficult to learn —
then you might even consider hand-writing a <code class="language-plaintext highlighter-rouge">configure</code> script to
generate the Makefile (a la metaprogramming). This gives you a more
flexible language with conditionals, and, being generated, redundancy in
the Makefile no longer matters.</p>

<p>As someone who frequently dabbles with BSD systems, my life has gotten a
lot easier since learning to write portable Makefiles and scripts.</p>

<h3 id="but-what-about-windows">But what about Windows?</h3>

<p>It’s the elephant in the room and I’ve avoided talking about it so far.
If you want to <a href="/blog/2016/06/13/">build with Visual Studio’s command line tools</a> —
something I do on occasion — build portability goes out the window.
Visual Studio has <code class="language-plaintext highlighter-rouge">nmake.exe</code>, which nearly conforms to POSIX <code class="language-plaintext highlighter-rouge">make</code>.
However, without the standard unix utilities and with the completely
foreign compiler interface for <code class="language-plaintext highlighter-rouge">cl.exe</code>, there’s absolutely no hope of
writing a Makefile portable to this situation.</p>

<p>The nice alternative is MinGW(-w64) with MSYS or Cygwin supplying the
unix utilities, though it has <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">the problem</a> of linking against
<code class="language-plaintext highlighter-rouge">msvcrt.dll</code>. Another option is a separate Makefile dedicated to
<code class="language-plaintext highlighter-rouge">nmake.exe</code> and the Visual Studio toolchain. Good luck defining a
correctly working “clean” target with <code class="language-plaintext highlighter-rouge">del.exe</code>.</p>

<p>My preferred approach lately is an amalgamation build (as seen in
<a href="https://github.com/skeeto/enchive">Enchive</a>): Carefully concatenate all the application’s sources
into one giant source file. First concatenate all the headers in the
right order, followed by all the C files. Use <code class="language-plaintext highlighter-rouge">sed</code> to remove any local
includes. You can do this all on a unix system with the nice utilities,
then point <code class="language-plaintext highlighter-rouge">cl.exe</code> at the amalgamation for the Visual Studio build.
It’s not very useful for actual development (i.e. you don’t want to edit
the amalgamation), but that’s what MinGW-w64 resolves.</p>

<p>What about all those POSIX functions? You’ll need to find Win32
replacements on MSDN. I prefer to do this by abstracting those
operating system calls. For example, compare POSIX <code class="language-plaintext highlighter-rouge">sleep(3)</code> and <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx">Win32
<code class="language-plaintext highlighter-rouge">Sleep()</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if defined(_WIN32)
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">my_sleep</span><span class="p">(</span><span class="kt">int</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Sleep</span><span class="p">(</span><span class="n">s</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">);</span>  <span class="c1">// TODO: handle overflow, maybe</span>
<span class="p">}</span>

<span class="cp">#else </span><span class="cm">/* __unix__ */</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">my_sleep</span><span class="p">(</span><span class="kt">int</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sleep</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>  <span class="c1">// TODO: fix signal interruption</span>
<span class="p">}</span>
<span class="cp">#endif
</span></code></pre></div></div>

<p>Then the rest of the program calls <code class="language-plaintext highlighter-rouge">my_sleep()</code>. There’s another example
in <a href="/blog/2017/03/01/">the OpenMP article</a> with <code class="language-plaintext highlighter-rouge">pwrite(2)</code> and <code class="language-plaintext highlighter-rouge">WriteFile()</code>. This
demonstrates that supporting a bunch of different unix-like systems is
really easy, but introducing Windows portability adds a disproportionate
amount of complexity.</p>
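<p>As for the signal-interruption TODO on the POSIX side, one possible fix
is to switch to <code class="language-plaintext highlighter-rouge">nanosleep(2)</code>, which the Linux man page lists under
POSIX.1-2001, and retry with the remaining time whenever the call is
interrupted. This is a sketch under that assumption, and the name
<code class="language-plaintext highlighter-rouge">my_sleep_retry</code> is hypothetical.</p>

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>
#include <time.h>

/* Sleep for s whole seconds, resuming after signal interruptions.
 * Returns 0 once the full interval has elapsed, or -1 on any other
 * error (for example, a negative s). */
int my_sleep_retry(int s)
{
    struct timespec ts;
    ts.tv_sec = s;
    ts.tv_nsec = 0;
    /* On EINTR, nanosleep() stores the unslept time back into ts. */
    while (nanosleep(&ts, &ts) == -1) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return 0;
}
```

<p>The Win32 branch would stay as it was, since <code class="language-plaintext highlighter-rouge">Sleep()</code> has no equivalent
of signal interruption to work around.</p>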

<h4 id="caveat-paths-and-filenames">Caveat: paths and filenames</h4>

<p>There’s one major complication with filenames for applications portable
to Windows. In the unix world, filenames are null-terminated bytestrings.
Typically these are Unicode strings encoded as UTF-8, but it’s not
necessarily so. The kernel just sees bytestrings. A bytestring doesn’t
necessarily have a formal Unicode representation, which can be a problem
for <a href="https://www.python.org/dev/peps/pep-0383/">languages that want filenames to be Unicode strings</a>
(<a href="http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html">also</a>).</p>

<p>On Windows, filenames are somewhere between UCS-2 and UTF-16, but end up
being neither. They’re really null-terminated unsigned 16-bit integer
arrays. It’s <em>almost</em> UTF-16 except that Windows allows unpaired
surrogates. This means Windows filenames <em>also</em> don’t have a formal
Unicode representation, but in a completely different way than unix. Some
<a href="https://simonsapin.github.io/wtf-8/">heroic efforts have gone into working around this issue</a>.</p>
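<p>To make the “almost UTF-16” distinction concrete, here is a small sketch
that checks whether one of these 16-bit arrays is well-formed UTF-16. The
function name <code class="language-plaintext highlighter-rouge">is_wellformed_utf16</code> is made up for this example.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the null-terminated 16-bit array is well-formed UTF-16,
 * or 0 if it contains an unpaired surrogate. */
int is_wellformed_utf16(const uint16_t *s)
{
    for (size_t i = 0; s[i]; i++) {
        if (s[i] >= 0xd800 && s[i] <= 0xdbff) {
            /* High surrogate: must be followed by a low surrogate. A
             * terminator here also fails the check. */
            if (s[i + 1] < 0xdc00 || s[i + 1] > 0xdfff) {
                return 0;
            }
            i++;  /* skip the low half of the pair */
        } else if (s[i] >= 0xdc00 && s[i] <= 0xdfff) {
            return 0;  /* low surrogate with no preceding high surrogate */
        }
    }
    return 1;
}
```

<p>A filename for which this returns 0 is still a perfectly legal Windows
filename, which is the whole problem.</p>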

<p>As a result, it’s highly non-trivial to correctly support all possible
filenames on both systems in the same program, <em>especially</em> when they’re
passed as command line arguments.</p>

<h3 id="summary">Summary</h3>

<p>The key points are:</p>

<ol>
  <li>Document the standards your application requires and strictly stick
to them.</li>
  <li>Ignore the vendor documentation if it doesn’t clearly delineate
extensions.</li>
</ol>

<p>This was all a discussion of non-GUI applications, and I didn’t really
touch on libraries. Many libraries are simple to access in the build
(just add it to <code class="language-plaintext highlighter-rouge">LDLIBS</code>), but some libraries — GUIs in particular — are
particularly complicated to manage portably and will require a more
complex solution (pkg-config, CMake, Autoconf, etc.).</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>OpenMP and pwrite()</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/03/01/"/>
    <id>urn:uuid:dfdf8ca6-51aa-3a15-6bf0-98b39f20652a</id>
    <updated>2017-03-01T21:22:24Z</updated>
    <category term="c"/><category term="posix"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>The most common way I introduce multi-threading to <a href="/blog/2015/07/10/">small C
programs</a> is with OpenMP (Open Multi-Processing). It’s typically
used as compiler pragmas to parallelize computationally expensive
loops — iterations are processed by different threads in some
arbitrary order.</p>

<p>Here’s an example that computes the <a href="/blog/2011/11/28/">frames of a video</a> in
parallel. Despite being computed out of order, each frame is written
in order to a large buffer, then written to standard output all at
once at the end.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_frames</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">output</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">beta</span> <span class="o">=</span> <span class="n">DEFAULT_BETA</span><span class="p">;</span>

<span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="o">&amp;</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">output</span><span class="p">);</span>
</code></pre></div></div>

<p>Adding OpenMP to this program is much simpler than introducing
low-level threading semantics with, say, Pthreads. With care, there’s
often no need for explicit thread synchronization. It’s also fairly
well supported by many vendors, even Microsoft (up to OpenMP 2.0), so
a multi-threaded OpenMP program is quite portable without <code class="language-plaintext highlighter-rouge">#ifdef</code>.</p>

<p>There’s real value in this pragma API: <strong>The above example would still
compile and run correctly even when OpenMP isn’t available.</strong> The
pragma is ignored and the program just uses a single core like it
normally would. It’s a slick fallback.</p>
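
<p>As a minimal, self-contained sketch of that fallback (the function name
<code class="language-plaintext highlighter-rouge">sum_squares</code> is made up for illustration), the following builds
unchanged with or without OpenMP enabled, for example with or without GCC’s
<code class="language-plaintext highlighter-rouge">-fopenmp</code>. Without it the pragma is ignored and the loop runs serially.
With it, the <code class="language-plaintext highlighter-rouge">reduction</code> clause gives each thread a private partial
sum, so no explicit synchronization is needed.</p>

```c
/* Hypothetical example: sum of i*i for i in [0, n). */
long
sum_squares(int n)
{
    long sum = 0;
    /* Ignored by compilers without OpenMP; the loop is then serial. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (long)i * i;
    }
    return sum;
}
```

<p>Either way, <code class="language-plaintext highlighter-rouge">sum_squares(4)</code> evaluates to 14 (0 + 1 + 4 + 9).</p>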

<p>When a program really <em>does</em> require synchronization there’s
<code class="language-plaintext highlighter-rouge">omp_lock_t</code> (mutex lock) and the expected set of functions to operate
on them. This doesn’t have the nice fallback, so I don’t like to use
it. Instead, I prefer <code class="language-plaintext highlighter-rouge">#pragma omp critical</code>. It nicely maintains the
OpenMP-unsupported fallback.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="cp">#pragma omp critical
</span>    <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="p">}</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This would append the output to some output file in an arbitrary
order. The critical section <a href="/blog/2016/08/03/">prevents interleaving of
outputs</a>.</p>

<p>There are a couple of problems with this example:</p>

<ol>
  <li>
    <p>Only one thread can write at a time. If the write takes too long,
other threads will queue up behind the critical section and wait.</p>
  </li>
  <li>
    <p>The output frames will be out of order, which is probably
inconvenient for consumers. If the output is seekable this can be
solved with <code class="language-plaintext highlighter-rouge">lseek()</code>, but that only makes the critical section
even more important.</p>
  </li>
</ol>

<p>There’s an easy fix for both that also eliminates the need for a critical
section: POSIX <code class="language-plaintext highlighter-rouge">pwrite()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">pwrite</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">off_t</span> <span class="n">offset</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s like <code class="language-plaintext highlighter-rouge">write()</code> but has an offset parameter. Unlike <code class="language-plaintext highlighter-rouge">lseek()</code>
followed by a <code class="language-plaintext highlighter-rouge">write()</code>, multiple threads and processes can, in
parallel, safely write to the same file descriptor at different file
offsets. The catch is that <strong>the output must be a file, not a pipe</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span> <span class="o">*</span> <span class="n">i</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s no critical section, the writes can interleave, and the output
is in order.</p>
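
<p>To try the idea in isolation, here is a small sketch that swaps the frames
for plain integers. The names <code class="language-plaintext highlighter-rouge">write_squares</code> and
<code class="language-plaintext highlighter-rouge">read_record</code> are hypothetical, and the descriptor is assumed to refer to
a seekable file. Each iteration writes its record at an offset computed from
the loop index, so the order in which iterations run doesn’t matter.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a frame: record i holds the value i*i.
 * Returns 0 if every record was written in full, otherwise -1. */
int
write_squares(int fd, int n)
{
    int failed = 0;
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; i++) {
        int value = i * i;
        off_t offset = (off_t)sizeof(value) * i;
        /* The offset is an argument, so no shared file position is used. */
        if (pwrite(fd, &value, sizeof(value), offset) != (ssize_t)sizeof(value)) {
            #pragma omp critical
            failed = 1;
        }
    }
    return failed ? -1 : 0;
}

/* Read record i back with pread(). Returns -1 on failure. */
int
read_record(int fd, int i)
{
    int value;
    off_t offset = (off_t)sizeof(value) * i;
    if (pread(fd, &value, sizeof(value), offset) != (ssize_t)sizeof(value))
        return -1;
    return value;
}
```

<p>After <code class="language-plaintext highlighter-rouge">write_squares(fd, 8)</code> succeeds on a regular file,
<code class="language-plaintext highlighter-rouge">read_record(fd, 3)</code> yields 9 regardless of which thread wrote that
record or when.</p>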

<p>If you’re concerned about standard output not being seekable (it often
isn’t), keep in mind that it will work just fine when invoked like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./compute_frames &gt; frames.ppm
</code></pre></div></div>

<h3 id="windows-portability">Windows Portability</h3>

<p>I talked about OpenMP being really portable, then used POSIX
functions. Fortunately the Win32 <code class="language-plaintext highlighter-rouge">WriteFile()</code> function has an
“overlapped” parameter that works just like <code class="language-plaintext highlighter-rouge">pwrite()</code>. Typically
rather than call either directly, I’d wrap the write like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">out</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">written</span><span class="p">;</span>
    <span class="n">OVERLAPPED</span> <span class="n">offset</span> <span class="o">=</span> <span class="p">{.</span><span class="n">Offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">};</span>
    <span class="k">return</span> <span class="n">WriteFile</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">offset</span><span class="p">);</span>
<span class="p">}</span>

<span class="cp">#else </span><span class="cm">/* POSIX */</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="n">offset</span><span class="p">)</span> <span class="o">==</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#endif
</span></code></pre></div></div>

<p>Except for switching to <code class="language-plaintext highlighter-rouge">write_frame()</code>, the OpenMP part remains
untouched.</p>

<h3 id="real-world-example">Real World Example</h3>

<p>Here’s an example in a real program:</p>

<p><a href="https://gist.github.com/skeeto/d7e17bb2aa40907a3405c3933cb1f936" class="download">julia.c</a></p>

<p>Notice because of <code class="language-plaintext highlighter-rouge">pwrite()</code> there’s no piping directly into
<code class="language-plaintext highlighter-rouge">ppmtoy4m</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./julia &gt; output.ppm
$ ppmtoy4m -F 60:1 &lt; output.ppm &gt; output.y4m
$ x264 -o output.mp4 output.y4m
</code></pre></div></div>

<p><a href="/video/?v=julia-256" class="download">output.mp4</a></p>

<video src="https://skeeto.s3.amazonaws.com/share/julia-256.mp4" controls="" loop="" crossorigin="anonymous">
</video>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Asynchronous Requests from Emacs Dynamic Modules</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/02/14/"/>
    <id>urn:uuid:00a59e4f-268c-343f-e6c6-bb23cde265de</id>
    <updated>2017-02-14T02:30:00Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="c"/><category term="linux"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>A few months ago I had a discussion with Vladimir Kazanov about his
<a href="https://github.com/vkazanov/toy-orgfuse">Orgfuse</a> project: a Python script that exposes an Emacs
Org-mode document as a <a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace">FUSE filesystem</a>. It permits other
programs to navigate the structure of an Org-mode document through the
standard filesystem APIs. I suggested that, with the new dynamic
modules in Emacs 25, Emacs <em>itself</em> could serve a FUSE filesystem. In
fact, support for FUSE services in general could be a package of his
own.</p>

<p>So that’s what he did: <a href="https://github.com/vkazanov/elfuse"><strong>Elfuse</strong></a>. It’s an old joke that
Emacs is an operating system, and here it is handling system calls.</p>

<p>However, there’s a tricky problem to solve, an issue also present in <a href="/blog/2016/11/05/">my
joystick module</a>. Both modules handle asynchronous events —
filesystem requests or joystick events — but Emacs runs the event loop
and owns the main thread. The external events somehow need to feed
into the main event loop. It’s even more difficult with FUSE because
FUSE <em>also</em> wants control of its own thread for its own event loop.
This requires Elfuse to spawn a dedicated FUSE thread and negotiate a
request/response hand-off.</p>

<p>When a filesystem request or joystick event arrives, how does Emacs
know to handle it? The simple and obvious solution is to poll the
module from a timer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">queue</span> <span class="n">requests</span><span class="p">;</span>

<span class="n">emacs_value</span>
<span class="nf">Frequest_next</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">emacs_value</span> <span class="n">next</span> <span class="o">=</span> <span class="n">Qnil</span><span class="p">;</span>
    <span class="n">queue_lock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">queue_length</span><span class="p">(</span><span class="n">requests</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">request</span> <span class="o">=</span> <span class="n">queue_pop</span><span class="p">(</span><span class="n">requests</span><span class="p">,</span> <span class="n">env</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fin_empty</span><span class="p">,</span> <span class="n">request</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">queue_unlock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And then ask Emacs to check the module every, say, 10ms:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">request--poll</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">next</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">when</span> <span class="nv">next</span>
      <span class="p">(</span><span class="nv">request-handle</span> <span class="nv">next</span><span class="p">))))</span>

<span class="p">(</span><span class="nv">run-at-time</span> <span class="mi">0</span> <span class="mf">0.01</span> <span class="nf">#'</span><span class="nv">request--poll</span><span class="p">)</span>
</code></pre></div></div>

<p>Blocking directly on the module’s event pump with Emacs’ thread would
prevent Emacs from doing important things like, you know, <em>being a
text editor</em>. The timer allows it to handle its own events
uninterrupted. It gets the job done, but it’s far from perfect:</p>

<ol>
  <li>
    <p>It imposes an arbitrary latency on handling requests. Up to a full
poll period could pass before a request is handled.</p>
  </li>
  <li>
    <p>Polling the module 100 times per second is inefficient. Unless you
really enjoy recharging your laptop, that’s no good.</p>
  </li>
</ol>

<p>The poll period is a sliding trade-off between latency and battery
life. If only there was some mechanism to, ahem, <em>signal</em> the Emacs
thread, informing it that a request is waiting…</p>

<h3 id="sigusr1">SIGUSR1</h3>

<p>Emacs Lisp programs can handle the POSIX SIGUSR1 and SIGUSR2 signals,
which is exactly the mechanism we need. The interface is a “key”
binding on <code class="language-plaintext highlighter-rouge">special-event-map</code>, the keymap that handles these kinds of
events. When the signal arrives, Emacs queues it up for the main event
loop.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">define-key</span> <span class="nv">special-event-map</span> <span class="nv">[sigusr1]</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
    <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">request-handle</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">))))</span>
</code></pre></div></div>

<p>The module blocks on its own thread on its own event pump. When a
request arrives, it queues the request, rings the bell for Emacs to
come handle it (<code class="language-plaintext highlighter-rouge">raise()</code>), and waits on a semaphore. For illustration
purposes, assume the module reads requests from and writes responses
to a file descriptor, like a socket.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">event_fd</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">request</span> <span class="n">request</span><span class="p">;</span>
<span class="n">sem_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">sem</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="cm">/* Blocking read for request event */</span>
    <span class="n">read</span><span class="p">(</span><span class="n">event_fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">event</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">event</span><span class="p">));</span>

    <span class="cm">/* Put request on the queue */</span>
    <span class="n">queue_lock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="n">queue_push</span><span class="p">(</span><span class="n">requests</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">);</span>
    <span class="n">queue_unlock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="n">raise</span><span class="p">(</span><span class="n">SIGUSR1</span><span class="p">);</span>  <span class="c1">// TODO: Should raise() go inside the lock?</span>

    <span class="cm">/* Wait for Emacs */</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">sem</span><span class="p">))</span>
        <span class="p">;</span>

    <span class="cm">/* Reply with Emacs' response */</span>
    <span class="n">write</span><span class="p">(</span><span class="n">event_fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">response</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">response</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">sem_wait()</code> is in a loop because signals will wake it up
prematurely. In fact, it may even wake up due to its own signal on the
line before. This is the only way this particular use of <code class="language-plaintext highlighter-rouge">sem_wait()</code>
might fail, so there’s no need to check <code class="language-plaintext highlighter-rouge">errno</code>.</p>

<p>If there are multiple module threads making requests to the same
global queue, the lock is necessary to protect the queue. The
semaphore is only for blocking the thread until Emacs has finished
writing its particular response. Each thread has its own semaphore.</p>

<p>When Emacs is done writing the response, it releases the module thread
by incrementing the semaphore. It might look something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">emacs_value</span>
<span class="nf">Frequest_complete</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">request</span> <span class="o">*</span><span class="n">request</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">request</span><span class="p">)</span>
        <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="o">-&gt;</span><span class="n">sem</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">Qnil</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The top-level handler dispatches to the specific request handler,
calling <code class="language-plaintext highlighter-rouge">request-complete</code> above when it’s done.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">request-handle</span> <span class="p">(</span><span class="nv">next</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">condition-case</span> <span class="nv">e</span>
      <span class="p">(</span><span class="nv">cl-ecase</span> <span class="p">(</span><span class="nv">request-type</span> <span class="nv">next</span><span class="p">)</span>
        <span class="p">(</span><span class="ss">:open</span>  <span class="p">(</span><span class="nv">request-handle-open</span>  <span class="nv">next</span><span class="p">))</span>
        <span class="p">(</span><span class="ss">:close</span> <span class="p">(</span><span class="nv">request-handle-close</span> <span class="nv">next</span><span class="p">))</span>
        <span class="p">(</span><span class="ss">:read</span>  <span class="p">(</span><span class="nv">request-handle-read</span>  <span class="nv">next</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">error</span> <span class="p">(</span><span class="nv">request-respond-as-error</span> <span class="nv">next</span> <span class="nv">e</span><span class="p">)))</span>
  <span class="p">(</span><span class="nv">request-complete</span> <span class="nv">next</span><span class="p">))</span>
</code></pre></div></div>

<p>This SIGUSR1+semaphore mechanism is roughly how Elfuse currently
processes requests.</p>
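
<p>Stripped of Emacs and FUSE, the hand-off can be sketched with plain
Pthreads. This is an illustration under assumptions, not Elfuse’s code: the
names <code class="language-plaintext highlighter-rouge">round_trip</code>, <code class="language-plaintext highlighter-rouge">module_thread</code>, <code class="language-plaintext highlighter-rouge">pending</code>, and
<code class="language-plaintext highlighter-rouge">bell</code> are made up, a second semaphore stands in for
<code class="language-plaintext highlighter-rouge">raise(SIGUSR1)</code>, a one-slot pointer stands in for the queue, and
“handling” a request just doubles its value.</p>

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical request record. */
struct request {
    int   event;     /* filled in by the module thread's caller */
    int   response;  /* filled in by the "Emacs" side */
    sem_t sem;       /* posted once the response is ready */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct request *pending;  /* one-slot stand-in for the queue */
static sem_t bell;               /* stand-in for raise(SIGUSR1) */

static void *
module_thread(void *arg)
{
    struct request *r = arg;

    /* Put the request on the "queue", then ring the bell. */
    pthread_mutex_lock(&lock);
    pending = r;
    pthread_mutex_unlock(&lock);
    sem_post(&bell);

    /* Wait for the response, retrying if a signal interrupts the wait. */
    while (sem_wait(&r->sem) == -1 && errno == EINTR)
        ;
    return 0;
}

/* Spawns the module thread, then plays the role of the main event loop.
 * Returns the response, or -1 if the thread could not be created. */
int
round_trip(int event)
{
    struct request r = {.event = event};
    pthread_t thread;
    int response = -1;

    sem_init(&bell, 0, 0);
    sem_init(&r.sem, 0, 0);
    if (pthread_create(&thread, 0, module_thread, &r) == 0) {
        /* "Signal handler": wake up and take the pending request. */
        while (sem_wait(&bell) == -1 && errno == EINTR)
            ;
        pthread_mutex_lock(&lock);
        struct request *next = pending;
        pending = 0;
        pthread_mutex_unlock(&lock);

        next->response = next->event * 2;  /* "handle" the request */
        sem_post(&next->sem);              /* release the module thread */

        pthread_join(thread, 0);
        response = r.response;
    }
    sem_destroy(&r.sem);
    sem_destroy(&bell);
    return response;
}
```

<p>Here <code class="language-plaintext highlighter-rouge">round_trip(21)</code> returns 42. Older toolchains may need
<code class="language-plaintext highlighter-rouge">-pthread</code> when compiling and linking.</p>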

<h3 id="making-it-work-on-windows">Making it work on Windows</h3>

<p>Windows doesn’t have signals. This isn’t a problem for Elfuse since
Windows doesn’t have FUSE either. Nor does it matter for Joymacs since
XInput isn’t event-driven and always requires polling. But someday
someone will need this mechanism for a dynamic module on Windows.</p>

<p>Fortunately there’s a solution: <em>input language change</em> events,
<code class="language-plaintext highlighter-rouge">WM_INPUTLANGCHANGE</code>. It’s also on <code class="language-plaintext highlighter-rouge">special-event-map</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">define-key</span> <span class="nv">special-event-map</span> <span class="nv">[language-change]</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
    <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">request-handle</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">))))</span>
</code></pre></div></div>

<p>Instead of <code class="language-plaintext highlighter-rouge">raise()</code> (or <code class="language-plaintext highlighter-rouge">pthread_kill()</code>), broadcast the window event
with <code class="language-plaintext highlighter-rouge">PostMessage()</code>. Outside of invoking the <code class="language-plaintext highlighter-rouge">language-change</code> key
binding, Emacs will ignore the event because WPARAM is 0 — it doesn’t
belong to any particular window. We don’t <em>really</em> want to change the
input language, after all.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PostMessageA</span><span class="p">(</span><span class="n">HWND_BROADCAST</span><span class="p">,</span> <span class="n">WM_INPUTLANGCHANGE</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Naturally you’ll also need to replace the POSIX threading primitives
with the Windows versions (<code class="language-plaintext highlighter-rouge">CreateThread()</code>, <code class="language-plaintext highlighter-rouge">CreateSemaphore()</code>,
etc.). With a bit of abstraction in the right places, it should be
pretty easy to support both POSIX and Windows in these asynchronous
dynamic module events.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>How to Read and Write Other Process Memory</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/03/"/>
    <id>urn:uuid:205f20eb-a47e-3506-fd8f-4b416fc08133</id>
    <updated>2016-09-03T21:53:26Z</updated>
    <category term="win32"/><category term="linux"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>I recently put together a little game memory cheat tool called
<a href="https://github.com/skeeto/memdig">MemDig</a>. It can find the address of a particular game value
(score, lives, gold, etc.) after being given that value at different
points in time. With the address, it can then modify that value to
whatever is desired.</p>

<p>I’ve been using tools like this going back 20 years, but I never tried
to write one myself until now. There are many memory cheat tools to
pick from these days, the most prominent being <a href="http://www.cheatengine.org/">Cheat Engine</a>.
These tools use the platform’s debugging API, so of course any good
debugger could do the same thing, though a debugger won’t be
specialized appropriately (e.g. locating the particular address and
locking its value).</p>

<p>My motivation was bypassing an in-app purchase in a single player
Windows game. I wanted to convince the game I had made the purchase
when, in fact, I hadn’t. Once I had it working successfully, I ported
MemDig to Linux since I thought it would be interesting to compare.
I’ll start with Windows for this article.</p>

<h3 id="windows">Windows</h3>

<p>Only three Win32 functions are needed, and you could almost guess at
how it works.</p>

<ul>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms684320">OpenProcess()</a></li>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms680553">ReadProcessMemory()</a></li>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms681674">WriteProcessMemory()</a></li>
</ul>

<p>It’s very straightforward <s>and, for this purpose, is probably the
simplest API for any platform</s> (see update).</p>

<p>As you probably guessed, you first need to open the process, given its
process ID (integer). You’ll also need to select the <em>desired access</em>
as a bit set. To read memory, you need the <code class="language-plaintext highlighter-rouge">PROCESS_VM_READ</code> and
<code class="language-plaintext highlighter-rouge">PROCESS_QUERY_INFORMATION</code> rights. To write memory, you need the
<code class="language-plaintext highlighter-rouge">PROCESS_VM_WRITE</code> and <code class="language-plaintext highlighter-rouge">PROCESS_VM_OPERATION</code> rights. Alternatively
you could just ask for all rights with <code class="language-plaintext highlighter-rouge">PROCESS_ALL_ACCESS</code>, but I
prefer to be precise.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">PROCESS_VM_READ</span> <span class="o">|</span>
               <span class="n">PROCESS_QUERY_INFORMATION</span> <span class="o">|</span>
               <span class="n">PROCESS_VM_WRITE</span> <span class="o">|</span>
               <span class="n">PROCESS_VM_OPERATION</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">proc</span> <span class="o">=</span> <span class="n">OpenProcess</span><span class="p">(</span><span class="n">access</span><span class="p">,</span> <span class="n">FALSE</span><span class="p">,</span> <span class="n">pid</span><span class="p">);</span>
</code></pre></div></div>

<p>And then to read or write:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">;</span> <span class="c1">// target process address</span>
<span class="n">SIZE_T</span> <span class="n">written</span><span class="p">;</span>
<span class="n">ReadProcessMemory</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">);</span>
<span class="c1">// or</span>
<span class="n">WriteProcessMemory</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">);</span>
</code></pre></div></div>

<p>Don’t forget to check the return value and verify that <code class="language-plaintext highlighter-rouge">written</code>
matches the requested size. Finally, don’t forget to <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms724211">close the handle</a> when you’re done.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CloseHandle</span><span class="p">(</span><span class="n">proc</span><span class="p">);</span>
</code></pre></div></div>

<p>That’s all there is to it. For the full cheat tool you’d need to find
the mapped regions of memory, via <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa366907">VirtualQueryEx</a>. It’s not
as simple, but I’ll leave that for another article.</p>

<h3 id="linux">Linux</h3>

<p>Unfortunately there’s no standard, cross-platform debugging API for
unix-like systems. Most have a ptrace() system call, though each works
a little differently. Note that ptrace() is not part of POSIX. It
appeared in System V Release 4 (SVr4) and BSD, and was then copied elsewhere.
The following will all be specific to Linux, though the procedure is
similar on other unix-likes.</p>

<p>In typical Linux fashion, if it involves other processes, you use the
standard file API on the /proc filesystem. Each process has a
directory under /proc named as its process ID. In this directory is a
virtual file called “mem”, which is a file view of that process’
entire address space, including unmapped regions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">file</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s">"/proc/%ld/mem"</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">pid</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">O_RDWR</span><span class="p">);</span>
</code></pre></div></div>

<p>The catch is that while you can open this file, you can’t actually
read or write on that file without attaching to the process as a
debugger. You’ll just get EIO errors. To attach, use ptrace() with
<code class="language-plaintext highlighter-rouge">PTRACE_ATTACH</code>. This asynchronously delivers a <code class="language-plaintext highlighter-rouge">SIGSTOP</code> signal to
the target, which has to be waited on with waitpid().</p>

<p>You could select the target address with <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/lseek.html">lseek()</a>, but it’s
cleaner and more efficient just to do it all in one system call with
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html">pread()</a> and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html">pwrite()</a>. I’ve left out the error
checking, but the return value of each function should be checked:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_ATTACH</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="kt">off_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">...;</span> <span class="c1">// target process address</span>
<span class="n">pread</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">addr</span><span class="p">);</span>
<span class="c1">// or</span>
<span class="n">pwrite</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">addr</span><span class="p">);</span>

<span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_DETACH</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>The process will (and must) be stopped during this procedure, so do
your reads/writes quickly and get out. The kernel will deliver the
writes to the other process’ virtual memory.</p>

<p>Like before, don’t forget to close.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>To find the mapped regions in the real cheat tool, you would read and
parse the virtual text file /proc/<em>pid</em>/maps. I don’t know if I’d call
this stringly-typed method elegant — the kernel converts the data into
string form and the caller immediately converts it right back — but
that’s the official API.</p>
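
<p>As a rough sketch of that parsing step, the leading fields of each line
can be pulled out with sscanf(3). The <code class="language-plaintext highlighter-rouge">struct region</code> type and both
function names here are made up for illustration and are not taken from
MemDig. Only the address range and permission fields are read, and a line
longer than the buffer is simply assumed not to occur.</p>

```c
#include <stdio.h>

/* One mapped region, as described by a single line of /proc/<pid>/maps. */
struct region {
    unsigned long start;  /* first address in the region */
    unsigned long end;    /* one past the last address */
    char perms[5];        /* e.g. "rw-p", NUL-terminated */
};

/* Parse the leading "start-end perms" fields of one maps line.
 * Returns 1 on success, 0 if the line doesn't match. */
int parse_maps_line(const char *line, struct region *r)
{
    return sscanf(line, "%lx-%lx %4s", &r->start, &r->end, r->perms) == 3;
}

/* Count the regions of a process that are both readable and writable,
 * which is where mutable game values would be found. Returns -1 if the
 * maps file can't be opened. */
long count_writable_regions(long pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/maps", pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;

    long count = 0;
    char line[4096];  /* assumption: no maps line is longer than this */
    while (fgets(line, sizeof(line), f)) {
        struct region r;
        if (parse_maps_line(line, &r) && r.perms[0] == 'r' && r.perms[1] == 'w')
            count++;
    }
    fclose(f);
    return count;
}
```

<p>A real tool would keep the list of regions rather than count them, then
scan each one for the value it’s looking for.</p>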

<p>Update: Konstantin Khlebnikov has pointed out the
<a href="http://man7.org/linux/man-pages/man2/process_vm_readv.2.html">process_vm_readv()</a> and <a href="http://man7.org/linux/man-pages/man2/process_vm_writev.2.html">process_vm_writev()</a>
system calls, available since Linux 3.2 (January 2012) and glibc 2.15
(March 2012). These system calls do not require ptrace(), nor does the
remote process need to be stopped. They’re equivalent to
ReadProcessMemory() and WriteProcessMemory(), except there’s no
requirement to first “open” the process.</p>
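
<p>Here’s a minimal sketch of those two calls, assuming Linux with a glibc
new enough to declare them. The wrapper names <code class="language-plaintext highlighter-rouge">read_remote()</code> and
<code class="language-plaintext highlighter-rouge">write_remote()</code> are hypothetical. Even though nothing is attached, the
kernel still applies the same permission check it would for a ptrace()
attach, so expect failures for processes you couldn’t debug.</p>

```c
#define _GNU_SOURCE  /* for process_vm_readv() and process_vm_writev() */
#include <sys/types.h>
#include <sys/uio.h>

/* Copy len bytes from address addr in process pid into buf.
 * Returns the number of bytes transferred, or -1 with errno set. */
ssize_t read_remote(pid_t pid, void *addr, void *buf, size_t len)
{
    struct iovec local  = { .iov_base = buf,  .iov_len = len };
    struct iovec remote = { .iov_base = addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

/* Copy len bytes from buf to address addr in process pid.
 * Returns the number of bytes transferred, or -1 with errno set. */
ssize_t write_remote(pid_t pid, void *addr, const void *buf, size_t len)
{
    struct iovec local  = { .iov_base = (void *)buf, .iov_len = len };
    struct iovec remote = { .iov_base = addr,        .iov_len = len };
    return process_vm_writev(pid, &local, 1, &remote, 1, 0);
}
```

<p>As with the Win32 functions, a short count is possible, so compare the
return value against the requested length.</p>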

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Automatic Deletion of Incomplete Output Files</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/08/07/"/>
    <id>urn:uuid:431fafe9-6630-363e-4596-85eb3a289ec2</id>
    <updated>2016-08-07T02:00:37Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Conventionally, a program that creates an output file will delete its
incomplete output should an error occur while writing the file. It’s
risky to leave behind a file that the user may rightfully confuse for
a valid file. They might not have noticed the error.</p>

<p>For example, compression programs such as gzip, bzip2, and xz when
given a compressed file as an argument will create a new file with the
compression extension removed. They write to this file as the
compressed input is being processed. If the compressed stream contains
an error in the middle, the partially-completed output is removed.</p>

<p>There are exceptions of course, such as programs that download files
over a network. The partial result has value, especially if the
transfer can be <a href="https://tools.ietf.org/html/rfc7233">continued from where it left off</a>. The
convention is to append another extension, such as “.part”, to
indicate a partial output.</p>

<p>The straightforward solution is to always delete the file as part of
error handling. A non-interactive program would report the error on
standard error, delete the file, and exit with an error code. However,
there are at least two situations where error handling would be unable
to operate: unhandled signals (usually including a segmentation fault)
and power failures. A partial or corrupted output file will be left
behind, possibly looking like a valid file.</p>

<p>A common, more complex approach is to name the file differently from
its final name while being written. If written successfully, the
completed file is renamed into place. This is already <a href="http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/">required for
durable replacement</a>, so it’s basically free for many
applications. In the worst case, where the program is unable to clean
up, the obviously incomplete file is left behind only wasting space.</p>
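
<p>A minimal POSIX sketch of this rename approach follows. The function
name and the “.tmp” suffix are arbitrary choices for illustration, and
the error handling is deliberately blunt: any short or failed write is
treated as fatal rather than retried.</p>

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write len bytes to path by way of a temporary "<path>.tmp" file.
 * Returns 0 on success. On failure returns -1 and removes the
 * temporary file, leaving any existing file at path untouched. */
int write_then_rename(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd == -1) return -1;

    const char *p = data;
    int ok = 1;
    while (ok && len > 0) {
        ssize_t w = write(fd, p, len);
        if (w <= 0) {
            ok = 0;  /* no retry on EINTR in this sketch */
        } else {
            p += w;
            len -= (size_t)w;
        }
    }
    if (ok && fsync(fd) == -1) ok = 0;  /* flush data before renaming */
    if (close(fd) == -1) ok = 0;
    if (ok && rename(tmp, path) == 0) return 0;

    unlink(tmp);
    return -1;
}
```

<p>If the process is killed partway through, only the “.tmp” file is left
behind, which is exactly the worst case described above.</p>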

<p>Looking to be more robust, I had the following misguided idea: <strong>Rely
completely on the operating system to perform cleanup in the case of a
failure.</strong> Initially the file would be configured to be automatically
deleted when the final handle is closed. This takes care of all
abnormal exits, and possibly even power failures. The program can just
exit on error without deleting the file. Once written successfully,
the automatic-delete indicator is cleared so that the file survives.</p>

<p>The target application for this technique supports both Linux and
Windows, so I would need to figure it out for both systems. On
Windows, there’s the flag <code class="language-plaintext highlighter-rouge">FILE_FLAG_DELETE_ON_CLOSE</code>. I’d just need
to find a way to clear it. On POSIX, the file would be unlinked while
being written, and linked into the filesystem on success. The latter
turns out to be a lot harder than I expected.</p>

<h3 id="solution-for-windows">Solution for Windows</h3>

<p>I’ll start with Windows since the technique actually works fairly well
here — ignoring the usual, dumb Win32 filesystem caveats. This is a
little surprising, since it’s usually Win32 that makes these things
far more difficult than they should be.</p>

<p>The primary Win32 function for opening and creating files is
<a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx">CreateFile</a>. There are many options, but the key is
<code class="language-plaintext highlighter-rouge">FILE_FLAG_DELETE_ON_CLOSE</code>. Here’s how an application might
open a file for output with this flag set.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">GENERIC_WRITE</span><span class="p">;</span>
<span class="n">DWORD</span> <span class="n">create</span> <span class="o">=</span> <span class="n">CREATE_ALWAYS</span><span class="p">;</span>
<span class="n">DWORD</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">FILE_FLAG_DELETE_ON_CLOSE</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">f</span> <span class="o">=</span> <span class="n">CreateFile</span><span class="p">(</span><span class="s">"out.tmp"</span><span class="p">,</span> <span class="n">access</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">create</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>This special flag asks Windows to delete the file as soon as the last
handle to the <em>file object</em> is closed. Notice I said file object, not
file, since <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20160108-00/?p=92821">these are different things</a>. The catch: This flag
is a property of the file object, not the file, and cannot be removed.</p>

<p>However, the solution is simple. Create a new link to the file so that
it survives deletion. This even works for files residing on network
shares.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CreateHardLink</span><span class="p">(</span><span class="s">"out"</span><span class="p">,</span> <span class="s">"out.tmp"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">CloseHandle</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>  <span class="c1">// deletes out.tmp file</span>
</code></pre></div></div>

<p>The gotcha is that the underlying filesystem must be NTFS. FAT32
doesn’t support hard links. Unfortunately, since FAT32 remains the
least common denominator and is still widely used for removable media,
depending on the application, your users may expect support for saving
files to FAT32. A workaround is probably required.</p>

<h3 id="solution-for-linux">Solution for Linux</h3>

<p>This is where things really fall apart. It’s just <em>barely</em> possible on
Linux, it’s messy, and it’s not portable anywhere else. There’s no way
to do this for POSIX in general.</p>

<p>My initial thought was to create a file then unlink it. Unlike the
situation on Windows, files can be unlinked while they’re currently
open by a process. These files are finally deleted when the last file
descriptor (the last reference) is closed. Unfortunately, using
unlink(2) to remove the last link to a file prevents that file from
being linked again.</p>

<p>Instead, the solution is to use the relatively new (since Linux 3.11),
Linux-specific <code class="language-plaintext highlighter-rouge">O_TMPFILE</code> flag when creating the file. Instead of a
filename, this variation of open(2) takes a directory and creates an
unnamed, temporary file in it. These files are special in that they’re
permitted to be given a name in the filesystem at some future point.</p>

<p>For this example, I’ll assume the output is relative to the current
working directory. If it’s not, you’ll need to open an additional file
descriptor for the parent directory, and also use openat(2) to avoid
possible race conditions (since paths can change from under you). The
number of ways this can fail is already rapidly multiplying.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="s">"."</span><span class="p">,</span> <span class="n">O_TMPFILE</span><span class="o">|</span><span class="n">O_WRONLY</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
</code></pre></div></div>

<p>The catch is that only a handful of filesystems support <code class="language-plaintext highlighter-rouge">O_TMPFILE</code>.
It’s like the FAT32 problem above, but worse. You could easily end up
in a situation where it’s not supported, and will almost certainly
require a workaround.</p>

<p>Linking a file from a file descriptor is where things get messier. The
file descriptor must be linked with linkat(2) from its name on the
/proc virtual filesystem, constructed as a string. The following
snippet comes straight from the Linux open(2) manpage.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"/proc/self/fd/%d"</span><span class="p">,</span> <span class="n">fd</span><span class="p">);</span>
<span class="n">linkat</span><span class="p">(</span><span class="n">AT_FDCWD</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">AT_FDCWD</span><span class="p">,</span> <span class="s">"out"</span><span class="p">,</span> <span class="n">AT_SYMLINK_FOLLOW</span><span class="p">);</span>
</code></pre></div></div>

<p>Even on Linux, /proc isn’t always available, such as within a chroot
or a container, so this part can fail as well. In theory there’s a way
to do this with the Linux-specific <code class="language-plaintext highlighter-rouge">AT_EMPTY_PATH</code> and avoid /proc,
but I couldn’t get it to work.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Note: this doesn't actually work for me.</span>
<span class="n">linkat</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">AT_FDCWD</span><span class="p">,</span> <span class="s">"out"</span><span class="p">,</span> <span class="n">AT_EMPTY_PATH</span><span class="p">);</span>
</code></pre></div></div>

<p>Given the poor portability (even within Linux), the number of ways
this can go wrong, and that a workaround is definitely needed anyway,
I’d say this technique is worthless. I’m going to stick with the
tried-and-true approach for this one.</p>
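
<p>For completeness, here’s a sketch of what the pieces look like when
combined with the workaround they require. It tries <code class="language-plaintext highlighter-rouge">O_TMPFILE</code> and the
/proc link first, and if either step fails it falls back to a temporary
name and rename(2). The function names and the “.tmp” suffix are
invented for this example, and the output is assumed to live in the
current working directory. Note one behavioral difference between the
two paths: linkat(2) refuses to replace an existing file, while
rename(2) replaces it.</p>

```c
#define _GNU_SOURCE  /* for O_TMPFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all len bytes to fd. Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w <= 0) return -1;
        p += w;
        len -= (size_t)w;
    }
    return 0;
}

/* Create "name" in the current directory with the given contents.
 * Returns 0 on success, -1 on failure. */
int save_output(const char *name, const void *data, size_t len)
{
    /* First attempt: an unnamed file that the kernel discards on its
     * own if the process dies before it is linked into place. */
    int fd = open(".", O_TMPFILE | O_WRONLY, 0600);
    if (fd != -1) {
        char procpath[64];
        snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
        int linked = write_all(fd, data, len) == 0 &&
                     linkat(AT_FDCWD, procpath, AT_FDCWD, name,
                            AT_SYMLINK_FOLLOW) == 0;
        close(fd);
        if (linked) return 0;
        /* Otherwise fall through: no /proc, name already exists, etc. */
    }

    /* Fallback: write under a temporary name, then rename into place. */
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", name);
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1) return -1;
    int ok = write_all(fd, data, len) == 0;
    if (close(fd) == -1) ok = 0;
    if (ok && rename(tmp, name) == 0) return 0;
    unlink(tmp);
    return -1;
}
```

<p>Since the fallback has to exist and be tested anyway, the first half of
the function buys very little, which is the point made above.</p>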

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Appending to a File from Multiple Processes</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/08/03/"/>
    <id>urn:uuid:93473b6d-3be3-3d0c-d7d5-6ad485c1e9a0</id>
    <updated>2016-08-03T16:17:44Z</updated>
    <category term="c"/><category term="linux"/><category term="posix"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you have multiple processes appending output to the same file
without explicit synchronization. These processes might be working in
parallel on different parts of the same problem, or these might be
threads blocked individually reading different external inputs. There
are two concerns that come into play:</p>

<p>1) <strong>The append must be atomic</strong> such that it doesn’t clobber previous
    appends by other threads and processes. For example, suppose a
    write requires two separate operations: first moving the file
    pointer to the end of the file, then performing the write. There
    would be a race condition should another process or thread
    intervene in between with its own write.</p>

<p>2) <strong>The output will be interleaved.</strong> The primary solution is to
   design the data format as atomic records, where the ordering of
   records is unimportant — like rows in a relational database. This
   could be as simple as a text file with each line as a record. The
   concern is then ensuring records are written atomically.</p>

<p>This article discusses processes, but the same applies to threads when
directly dealing with file descriptors.</p>

<h3 id="appending">Appending</h3>

<p>The first concern is solved by the operating system, with one caveat.
On POSIX systems, opening a file with the <code class="language-plaintext highlighter-rouge">O_APPEND</code> flag will
guarantee that <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html">writes always safely append</a>.</p>

<blockquote>
  <p>If the <code class="language-plaintext highlighter-rouge">O_APPEND</code> flag of the file status flags is set, the file
offset shall be set to the end of the file prior to each write and
no intervening file modification operation shall occur between
changing the file offset and the write operation.</p>
</blockquote>

<p>However, this says nothing about interleaving. <strong>Two processes
successfully appending to the same file will result in all their bytes
in the file in order, but not necessarily contiguously.</strong></p>
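
<p>At the system call level, the append itself might look like the
following sketch. The name <code class="language-plaintext highlighter-rouge">append_record()</code> is made up for this
example. Each record goes out in a single write(2), so there is no
separate seek for another process to race against. Whether the bytes of
that one write stay contiguous is a separate question.</p>

```c
#include <fcntl.h>
#include <unistd.h>

/* Append one complete record to path using a single write(2).
 * Returns 0 if the whole record was written, -1 otherwise. */
int append_record(const char *path, const void *rec, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
    if (fd == -1) return -1;

    /* With O_APPEND the kernel moves the offset to the end of the file
     * and performs the write as one step. */
    ssize_t w = write(fd, rec, len);
    int result = (w == (ssize_t)len) ? 0 : -1;

    if (close(fd) == -1) result = -1;
    return result;
}
```

<p>A long-running program would open the file once and keep the descriptor
rather than reopening it for every record.</p>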

<p>The caveat is that not all filesystems are POSIX-compatible. Two
famous examples are NFS and the Hadoop Distributed File System (HDFS).
On these networked filesystems, appends are simulated and subject to
race conditions.</p>

<p>On POSIX systems, fopen(3) with the <code class="language-plaintext highlighter-rouge">a</code> flag <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/fopen.html">will use
<code class="language-plaintext highlighter-rouge">O_APPEND</code></a>, so you don’t necessarily need to use open(2). On
Linux this can be verified for any language’s standard library with
strace.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">fopen</span><span class="p">(</span><span class="s">"/dev/null"</span><span class="p">,</span> <span class="s">"a"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And the result of the trace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace -e open ./a.out
open("/dev/null", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
</code></pre></div></div>

<p>For Win32, the equivalent is the <code class="language-plaintext highlighter-rouge">FILE_APPEND_DATA</code> access right, and
similarly <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/gg258116(v=vs.85).aspx">only applies to “local files.”</a></p>

<h3 id="interleaving-and-pipes">Interleaving and Pipes</h3>

<p>The interleaving problem has two layers, and gets more complicated the
more correct you want to be. Let’s start with pipes.</p>

<p>On POSIX, a pipe is unseekable and doesn’t have a file position, so
appends are the only kind of write possible. When writing to a pipe
(or FIFO), writes of at most <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes are
guaranteed to be atomic and non-interleaving.</p>

<blockquote>
  <p>Write requests of <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes or less shall not be interleaved
with data from other processes doing writes on the same pipe. Writes
of greater than <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes may have data interleaved, on
arbitrary boundaries, with writes by other processes, […]</p>
</blockquote>

<p>The minimum value for <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> for POSIX systems is 512 bytes. On
Linux it’s 4kB, and on other systems <a href="http://ar.to/notes/posix">it’s as high as 32kB</a>.
As long as each record is no larger than 512 bytes, a simple write(2) will
do. None of this depends on a filesystem since no files are involved.</p>

<p>If <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes isn’t enough, the POSIX writev(2) can be
used to <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/writev.html">atomically write up to <code class="language-plaintext highlighter-rouge">IOV_MAX</code> buffers</a> of
<code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes. The minimum value for <code class="language-plaintext highlighter-rouge">IOV_MAX</code> is 16, but is
typically 1024. This means the maximum safe atomic write size for
pipes — and therefore the largest record size — for a perfectly
portable program is 8kB (16✕512). On Linux it’s 4MB.</p>

<p>That’s all at the system call level. There’s another layer to contend
with: buffered I/O in your language’s standard library. Your program
may pass data in appropriately-sized pieces for atomic writes to the
I/O library, but it may be undoing your hard work, concatenating all
these writes into a buffer, splitting apart your records. For this
part of the article, I’ll focus on single-threaded C programs.</p>

<p>Suppose you’re writing a simple space-separated format with one line
per record.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">baz</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">condition</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%d %d %f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">,</span> <span class="n">baz</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Whether or not this works depends on how <code class="language-plaintext highlighter-rouge">stdout</code> is buffered. C
standard library streams (<code class="language-plaintext highlighter-rouge">FILE *</code>) have three buffering modes:
unbuffered, line buffered, and fully buffered. Buffering is configured
through setbuf(3) and setvbuf(3), and the initial buffering state of a
stream depends on various factors. For buffered streams, the default
buffer is at least <code class="language-plaintext highlighter-rouge">BUFSIZ</code> bytes, itself at least 256 (C99
§7.19.2¶7). Note: threads share this buffer.</p>

<p>Since each record in the above program easily fits inside 256 bytes,
if stdout is a line buffered pipe then this program will interleave
correctly on any POSIX system without further changes.</p>

<p>If instead your output is comma-separated values (CSV) and <a href="https://tools.ietf.org/html/rfc4180">your
records may contain new line characters</a>, there are two
approaches. In each, the record must still be no larger than
<code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes.</p>

<ul>
  <li>
    <p>Unbuffered pipe: construct the record in a buffer (e.g. with sprintf(3))
and output the entire buffer in a single fwrite(3). While I believe
this will always work in practice, it’s not guaranteed by the C
specification, which defines fwrite(3) as a series of fputc(3) calls
(C99 §7.19.8.2¶2).</p>
  </li>
  <li>
    <p>Fully buffered pipe: set a sufficiently large stream buffer and
follow each record with a fflush(3). Unlike fwrite(3) on an
unbuffered stream, the specification says the buffer will be
“transmitted to the host environment as a block” (C99 §7.19.3¶3),
so this should be perfectly correct on any POSIX system.</p>
  </li>
</ul>

<p>If your situation is more complicated than this, you’ll probably have
to bypass your standard library buffered I/O and call write(2) or
writev(2) yourself.</p>

<h4 id="practical-application">Practical Application</h4>

<p>If interleaving writes to a pipe on stdout sounds contrived, here’s the
real life scenario: GNU xargs with its <code class="language-plaintext highlighter-rouge">--max-procs</code> (<code class="language-plaintext highlighter-rouge">-P</code>) option to
process inputs in parallel.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xargs -n1 -P$(nproc) myprogram &lt; inputs.txt | cat &gt; outputs.csv
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">| cat</code> ensures the output of each <code class="language-plaintext highlighter-rouge">myprogram</code> process is
connected to the same pipe rather than to the same file.</p>

<p>A non-portable alternative to <code class="language-plaintext highlighter-rouge">| cat</code>, especially if you’re
dispatching processes and threads yourself, is the splice(2) system
call on Linux. It efficiently moves the output from the pipe to the
output file without an intermediate copy to userspace. GNU Coreutils’
cat doesn’t use this.</p>

<h4 id="win32-pipes">Win32 Pipes</h4>

<p>On Win32, <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365152(v=vs.85).aspx">anonymous pipes</a> have no semantics regarding
interleaving. <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365150(v=vs.85).aspx">Named pipes</a> have per-client buffers that
prevent interleaving. However, the pipe buffer size is unspecified,
and requesting a particular size is only advisory, so it comes down to
trial and error, though the unstated limits should be comparatively
generous.</p>

<h3 id="interleaving-and-files">Interleaving and Files</h3>

<p>Suppose instead of a pipe we have an <code class="language-plaintext highlighter-rouge">O_APPEND</code> file on POSIX. Common
wisdom states that the same <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> atomic write rule applies.
While this often works, especially on Linux, this is not correct. The
POSIX specification doesn’t require it and <a href="http://www.notthewizard.com/2014/06/17/are-files-appends-really-atomic/">there are systems where it
doesn’t work</a>.</p>

<p>If you know the particular limits of your operating system <em>and</em>
filesystem, and you don’t care much about portability, then maybe you
can get away with interleaving appends. For full portability, pipes
are required.</p>
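
<p>For reference, the non-portable append approach amounts to something like
this sketch. The helper name <code class="language-plaintext highlighter-rouge">append_record</code> is hypothetical. Each record goes out
in a single write(2) on an <code class="language-plaintext highlighter-rouge">O_APPEND</code> descriptor, and whether concurrent
appends stay intact depends on the operating system and filesystem, as
described above.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one complete record to a file using a
 * single write(2). Atomicity with respect to other appenders is NOT
 * guaranteed by POSIX. Returns 0 on success, -1 on error or short
 * write. */
static int append_record(const char *path, const char *record)
{
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd == -1)
        return -1;
    size_t len = strlen(record);
    ssize_t r = write(fd, record, len);
    int ok = (r == (ssize_t)len);
    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```

<p>A short write is treated as a failure here because a partially written
record cannot be repaired by a second call without risking interleaving.</p>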

<p>On Win32, writes on local files up to the underlying drive’s sector
size (typically 512 bytes to 4kB) are atomic. Otherwise the only
options are deprecated Transactional NTFS (TxF), or manually
synchronizing your writes. All in all, it’s going to take more work to
get correct.</p>

<h3 id="conclusion">Conclusion</h3>

<p>My true use case for mucking around with clean, atomic appends is to
compute giant CSV tables in parallel, with the intention of later
loading into a SQL database (e.g. SQLite) for analysis. A more robust
and traditional approach would be to write results directly into the
database as they’re computed. But I like the platform-neutral
intermediate CSV files — good for archival and sharing — and the
simplicity of programs generating the data — concerned only with
atomic write semantics rather than calling into a particular SQL
database API.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Four Ways to Compile C for Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/06/13/"/>
    <id>urn:uuid:1e99288c-0500-36f5-9fe7-262e6c6287c4</id>
    <updated>2016-06-13T04:13:25Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p><em>Update 2020: If you’re on Windows, just use <a href="https://github.com/skeeto/w64devkit"><strong>w64devkit</strong></a>.
It’s <a href="/blog/2020/05/15/">my own toolchain distribution</a>, and it’s the best option
available. <a href="/blog/2020/09/25/">Everything you need</a> is in one package.</em></p>

<p>I primarily work on and develop for unix-like operating systems —
Linux in particular. However, when it comes to desktop applications,
most potential users are on Windows. Rather than develop on Windows,
which I’d rather avoid, I’ll continue developing, testing, and
debugging on Linux while keeping portability in mind. Unfortunately
every option I’ve found for building Windows C programs has some
significant limitations. These limitations inform my approach to
portability and restrict the C language features used by the program
for all platforms.</p>

<p>As of this writing I’ve identified four different practical ways to
build C applications for Windows. This information will definitely
become further and further out of date as this article ages, so if
you’re visiting from the future take a moment to look at the date.
Except for LLVM shaking things up recently, development tooling on
unix-like systems has had the same basic form for the past 15 years
(i.e. dominated by GCC). While Visual C++ has been around for more
than two decades, the tooling on Windows has seen more churn by
comparison.</p>

<p>Before I get into the specifics, let me point out a glaring problem
common to all four: Unicode arguments and filenames. Microsoft jumped
the gun and adopted UTF-16 early. UTF-16 is a kludge, a worst of all
worlds, being a variable length encoding (surrogate pairs), backwards
incompatible (<a href="http://utf8everywhere.org/">unlike UTF-8</a>), and having byte-order issues (BOM).
Most Win32 functions that accept strings come in two flavors,
ANSI and UTF-16. The standard, portable C library functions wrap the
ANSI-flavored functions. This means <strong>portable C programs can’t interact
with Unicode filenames</strong>. (Update 2021: <a href="/blog/2021/12/30/">Now they can</a>.) They must
call the non-portable, Windows-specific versions. This includes <code class="language-plaintext highlighter-rouge">main</code>
itself, which is only handed ANSI-truncated arguments.</p>

<p>Compare this to unix-like systems, which generally adopted UTF-8, but
as a convention rather than as a hard rule. The operating system
doesn’t know or care about Unicode. Program arguments and filenames
are just zero-terminated bytestrings. Implicitly decoding these as
UTF-8 <a href="https://utcc.utoronto.ca/~cks/space/blog/python/Python3UnicodeIssue">would be a mistake anyway</a>. What happens when the
encoding isn’t valid?</p>

<p>This doesn’t <em>have</em> to be a problem on Windows. A Windows standard C
library could connect to Windows’ Unicode-flavored functions and
encode to/from UTF-8 as needed, allowing portable programs to maintain
the bytestring illusion. It’s only that none of the existing standard
C libraries do it this way.</p>

<h3 id="mingw-w64">Mingw-w64</h3>

<p>Of course my first natural choice is MinGW, specifically the
<a href="http://mingw-w64.org/doku.php">Mingw-w64</a> fork. It’s GCC ported to Windows. You can
continue relying on GCC-specific features when you need them. It’s got
all the core language features up through C11, plus the common
extensions. It’s probably packaged by your Linux distribution of
choice, making it trivial to cross-compile programs and libraries from
Linux — and with Wine you can even execute them on x86. Like regular
GCC, it outputs GDB-friendly DWARF debugging information, so you can
debug applications with GDB.</p>

<p>If I’m using Mingw-w64 on Windows, <del>I prefer to do so from inside
Cygwin</del>. Since it provides a complete POSIX environment, it maximizes
portability for the whole tool chain. This isn’t strictly required.</p>

<p>However, it has one big flaw. Unlike unix-like systems, Windows doesn’t
supply a system standard C library. That’s the compiler’s job. But
Mingw-w64 doesn’t have one. Instead it links against <code class="language-plaintext highlighter-rouge">msvcrt.dll</code>,
<del>which <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">isn’t officially supported by Microsoft</a>. It just
happens to exist on modern Windows installations. Since it’s not
supported,</del> it’s way out of date and doesn’t support much of C99. A lot
of these problems are patched over by the compiler, <del>but if you’re
relying on Mingw-w64, you still have to stick to some C89 library
features, such as limiting yourself to the C89 printf specifiers</del>.</p>

<p><del>Update: Mārtiņš Možeiko has pointed out <code class="language-plaintext highlighter-rouge">__USE_MINGW_ANSI_STDIO</code>, an
undocumented feature that fixes the printf family. I now use this by
default in all of my Mingw-w64 builds. It fixes most of the formatted
output issues, except that it’s incompatible with the <a href="https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-g_t_0040code_007bformat_007d-function-attribute-3318"><code class="language-plaintext highlighter-rouge">format</code> function
attribute</a>.</del> (Update 2021: Mingw-w64 now does the right thing
out of the box.)</p>

<p><del>Another problem is that <a href="http://thelinuxjedi.blogspot.com/2014/07/tripping-up-using-mingw.html">position-independent code generation is
broken</a>, and so ASLR is not an option. This means binaries produced
by Mingw-w64 are less secure than they should be. There are also a
number of <a href="https://gcc.gnu.org/ml/gcc-bugs/2015-05/msg02025.html">subtle code generation bugs</a> that might arise if you’re
doing something unusual.</del> (Update 2021: Mingw-w64 makes PIE mandatory.)</p>

<h3 id="visual-c">Visual C++</h3>

<p>The behemoth usually considered in this situation is Visual Studio and
the Visual C++ build tools. I strongly prefer open source development
tools, and Visual Studio is obviously the <em>least</em> open source option, but
at least it’s cost-free these days. Now, I have absolutely no interest
in Visual Studio, but fortunately the Visual C++ compiler and
associated build tools can be used standalone, supporting both C and
C++.</p>

<p>Included is a “vcvars” batch file — vcvars64.bat for x64. Execute that
batch file in a cmd.exe console and the Visual C++ command line build
tools will be made available in that console and in any programs
executed from it (your editor). It includes the compiler (cl.exe),
linker (link.exe), assembler (ml64.exe), disassembler (dumpbin.exe),
and more. It also includes a <a href="/blog/2016/04/30/">mostly POSIX-complete</a> make called
nmake.exe. All these tools are noisy and print a copyright banner on
every invocation, so get used to passing <code class="language-plaintext highlighter-rouge">-nologo</code> every time, which
suppresses some of it.</p>

<p>When I said behemoth, I meant it. In my experience it literally takes
<em>hours</em> (unattended) to install Visual Studio 2015. <del>The good news is you
don’t actually need it all anymore. The build tools <a href="http://landinghub.visualstudio.com/visual-cpp-build-tools">are available
standalone</a>. While it’s still a larger and slower installation
process than it really should be, it’s much more reasonable to
install. It’s good enough that I’d even say I’m comfortable relying on
it for Windows builds.</del> (Update: The build tools are unfortunately no
longer standalone.)</p>

<p>That being said, it’s not without its flaws. Microsoft has never
announced any plans to support C99. They only care about C++, with C as
a second class citizen. Since C++11 incorporated most of C99 and
Microsoft supports C++11, Visual Studio 2015 supports most of C99. The
only things missing as far as I can tell are variable length arrays
(VLAs), complex numbers, and C99’s array parameter declarators, since
none of these were adopted by C++. Some C99 features are considered
extensions (as they would be for C89), so you’ll also get warnings about
them, which can be disabled.</p>

<p>The command line interface (option flags, intermediates, etc.) isn’t
quite reconcilable with the unix-like ecosystem (i.e. GCC, Clang), so
<strong>you’ll need separate Makefiles</strong>, or you’ll need to use a build
system that generates Visual C++ Makefiles.</p>

<p><del>Debugging is a major problem.</del> (Update 2022: It’s actually quite good
once <a href="/blog/2022/06/26/">you know how to do it</a>.) Visual C++ outputs separate .pdb
<a href="https://en.wikipedia.org/wiki/Program_database">program database</a> files, which aren’t usable from GDB. Visual
Studio has a built-in debugger, though it’s not included in the
standalone Visual C++ build tools. <del>I’m still searching for a decent
debugging solution for this scenario. I tried WinDbg, but I can’t stand
it.</del> (Update 2022: <a href="https://www.youtube.com/watch?v=r9eQth4Q5jg">RemedyBG is amazing</a>.)</p>

<p>In general the output code performance is on par with GCC and Clang,
so you’re not really gaining or losing performance with Visual C++.</p>

<h3 id="clang">Clang</h3>

<p>Unsurprisingly, <a href="http://clang.llvm.org/">Clang</a> has been ported to Windows. It’s like
Mingw-w64 in that you get the same features and interface across
platforms.</p>

<p>Unlike Mingw-w64, it doesn’t link against msvcrt.dll. Instead <strong>it
relies directly on the official Windows SDK</strong>. You’ll basically need
to install the Visual C++ build tools as if you were going to build with
Visual C++. This means no practical cross-platform builds and you’re
still relying on the proprietary Microsoft toolchain. In the past you
even had to use Microsoft’s linker, but LLVM now provides its own.</p>

<p>It generates GDB-friendly DWARF debug information (in addition to
CodeView) so in theory <strong>you can debug with GDB</strong> again. I haven’t
given this a thorough evaluation yet.</p>

<h3 id="pelles-c">Pelles C</h3>

<p>Finally there’s <a href="http://www.smorgasbordet.com/pellesc/">Pelles C</a>. It’s cost-free but not open
source. It’s a reasonable, small install that includes a full IDE with
an integrated debugger and command line tools. It has its own C
library and Win32 SDK with the most complete C11 support around. It
also supports OpenMP 3.1. All in all it’s pretty nice and is something
I wouldn’t be afraid to rely upon for Windows builds.</p>

<p>Like Visual C++, it has a couple of “povars” batch files to set up the
right environment, which includes a C compiler, linker, assembler,
etc. The compiler interface mostly mimics cl.exe, though there are far
fewer code generation options. The make program, pomake.exe, mimics
nmake.exe, but is even less POSIX-complete. The compiler’s <strong>output
code performance is also noticeably poorer than GCC, Clang, and Visual
C++</strong>. It’s definitely a less mature compiler.</p>

<p>It outputs CodeView debugging information, so <strong>GDB is of no use</strong>.
The best solution is to simply use the debugger built into the IDE,
which can be invoked directly from the command line. You don’t
normally need to code from within the IDE just to use the debugger.</p>

<p>Like Visual C++, it’s Windows only, so cross-compilation isn’t really
in the picture.</p>

<p>If performance isn’t of high importance, and you don’t require
specific code generation options, then Pelles C is a nice choice for
Windows builds.</p>

<h3 id="other-options">Other Options</h3>

<p>I’m sure there are a few other options out there, and I’d like to hear
about them so I can try them out. I focused on these since they’re all
cost free and easy to download. If I have to register or pay, then
it’s not going to beat these options.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Mapping Multiple Memory Views in User Space</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/04/10/"/>
    <id>urn:uuid:373e602e-0d43-3e03-f02c-2d169eb14df5</id>
    <updated>2016-04-10T21:59:16Z</updated>
    <category term="c"/><category term="linux"/><category term="win32"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>Modern operating systems run processes within <em>virtual memory</em> using a
piece of hardware called a <em>memory management unit</em> (MMU). The MMU
contains a <em>page table</em> that defines how virtual memory maps onto
<em>physical memory</em>. The operating system is responsible for maintaining
this page table, mapping and unmapping virtual memory to physical
memory as needed by the processes it’s running. If a process accesses
a page that is not currently mapped, it will trigger a <em>page fault</em>
and the execution of the offending thread will be paused until the
operating system maps that page.</p>

<p>This functionality allows for a neat hack: A physical memory address
can be mapped to multiple virtual memory addresses at the same time. A
process running with such a mapping will see these regions of memory
as aliased — views of the same physical memory. A store to one of
these addresses will simultaneously appear across all of them.</p>

<p>Some useful applications of this feature include:</p>

<ul>
  <li>An extremely fast, large memory “copy” by mapping the source memory
over top of the destination memory.</li>
  <li>Trivial interoperability between code instrumented with <a href="https://www.usenix.org/legacy/event/sec09/tech/full_papers/akritidis.pdf">baggy
bounds checking</a> [PDF] and non-instrumented code. A few bits
of each pointer are reserved to tag the pointer with the size of its
memory allocation. For compactness, the stored size is rounded up to
a power of two, making it “baggy.” Instrumented code checks this tag
before making a possibly-unsafe dereference. Normally, instrumented
code would need to clear (or set) these bits before dereferencing or
before passing it to non-instrumented code. Instead, the allocation
could be mapped simultaneously at each location for every possible
tag, making the pointer valid no matter its tag bits.</li>
  <li>Two responses to <a href="/blog/2016/03/31/">my last post on hotpatching</a> suggested
that, instead of modifying the instruction directly, memory
containing the modification could be mapped over top of the code. I
would copy the code to another place in memory, safely modify it in
private, switch the page protections from write to execute (both for
W^X and for <a href="https://web.archive.org/web/20190323050330/http://stackoverflow.com/a/18905927">other hardware limitations</a>), then map it over
the target. Restoring the original behavior would be as simple as
unmapping the change.</li>
</ul>

<p>Both POSIX and Win32 allow user space applications to create these
aliased mappings. The original purpose for these APIs is for shared
memory between processes, where the same physical memory is mapped
into two different processes’ virtual memory. But the OS doesn’t stop
us from mapping the shared memory to a different address within the
same process.</p>

<h3 id="posix-memory-mapping">POSIX Memory Mapping</h3>

<p>On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions
are <code class="language-plaintext highlighter-rouge">shm_open(3)</code>, <code class="language-plaintext highlighter-rouge">ftruncate(2)</code>, and <code class="language-plaintext highlighter-rouge">mmap(2)</code>.</p>

<p>First, create a file descriptor to shared memory using <code class="language-plaintext highlighter-rouge">shm_open</code>. It
has very similar semantics to <code class="language-plaintext highlighter-rouge">open(2)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">shm_open</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">int</span> <span class="n">oflag</span><span class="p">,</span> <span class="n">mode_t</span> <span class="n">mode</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">name</code> works much like a filesystem path, but is actually a
different namespace (though on Linux it <em>is</em> a tmpfs mounted at
<code class="language-plaintext highlighter-rouge">/dev/shm</code>). Resources created here (<code class="language-plaintext highlighter-rouge">O_CREAT</code>) will persist until
explicitly deleted (<code class="language-plaintext highlighter-rouge">shm_unlink(3)</code>) or until the system reboots. It’s
an oversight in POSIX that a name is required even if we never intend
to access it by name. File descriptors can be shared with other
processes via <code class="language-plaintext highlighter-rouge">fork(2)</code> or through UNIX domain sockets, so a name
isn’t strictly required.</p>

<p>OpenBSD introduced <a href="http://man.openbsd.org/OpenBSD-current/man3/shm_mkstemp.3"><code class="language-plaintext highlighter-rouge">shm_mkstemp(3)</code></a> to solve this problem,
but it’s not widely available. On Linux, as of this writing, the
<code class="language-plaintext highlighter-rouge">O_TMPFILE</code> flag may or may not provide a fix (<a href="http://comments.gmane.org/gmane.linux.man/9815">it’s
undocumented</a>).</p>

<p>The portable workaround is to attempt to choose a unique name, open
the file with <code class="language-plaintext highlighter-rouge">O_CREAT | O_EXCL</code> (either atomically create the file or
fail), <code class="language-plaintext highlighter-rouge">shm_unlink</code> the shared memory object as soon as possible, then
cross our fingers. The shared memory object will still exist (the file
descriptor keeps it alive) but will no longer be accessible by name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">shm_open</span><span class="p">(</span><span class="s">"/example"</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span> <span class="o">|</span> <span class="n">O_EXCL</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">handle_error</span><span class="p">();</span> <span class="c1">// non-local exit</span>
<span class="n">shm_unlink</span><span class="p">(</span><span class="s">"/example"</span><span class="p">);</span>
</code></pre></div></div>

<p>The shared memory object is brand new (<code class="language-plaintext highlighter-rouge">O_EXCL</code>) and is therefore of
zero size. <code class="language-plaintext highlighter-rouge">ftruncate</code> sets it to the desired size. This does <em>not</em>
need to be a multiple of the page size. Failing to allocate memory
will result in a bus error on access.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">);</span>
<span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally <code class="language-plaintext highlighter-rouge">mmap</code> the shared memory into place just as if it were a file.
We can choose an address (aligned to a page) or let the operating
system choose one for us (NULL). If we don’t plan on making any more
mappings, we can also close the file descriptor. The shared memory
object will be freed as soon as it is completely unmapped (<code class="language-plaintext highlighter-rouge">munmap(2)</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>At this point both <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> have different addresses but point (via
the page table) to the same physical memory. Changes to one are
reflected in the other. So this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mh">0xdeafbeef</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%p %p 0x%x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="o">*</span><span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>Will print out something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x6ffffff0000 0x6fffffe0000 0xdeafbeef
</code></pre></div></div>

<p>It’s also possible to do all this only with <code class="language-plaintext highlighter-rouge">open(2)</code> and <code class="language-plaintext highlighter-rouge">mmap(2)</code> by
mapping the same file twice, but you’d need to worry about where to
put the file, where it’s going to be backed, and the operating system
will have certain obligations about syncing it to storage somewhere.
Using POSIX shared memory is simpler and faster.</p>
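
<p>For comparison, the file-backed variant could be sketched as below. The
helper name <code class="language-plaintext highlighter-rouge">alias_pair</code> and the <code class="language-plaintext highlighter-rouge">/tmp</code> location are assumptions for illustration.
It uses <code class="language-plaintext highlighter-rouge">mkstemp(3)</code> in place of a bare <code class="language-plaintext highlighter-rouge">open(2)</code> to get a unique name, and it
checks the errors that the snippets above skip.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one temporary file at two different
 * addresses chosen by the operating system. Assumes /tmp is writable.
 * Returns 0 on success, -1 on failure. */
static int alias_pair(size_t size, void **a, void **b)
{
    char path[] = "/tmp/alias-XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);  /* the descriptor keeps the file alive */
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }

    int prot = PROT_READ | PROT_WRITE;
    *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    close(fd);  /* the mappings keep the file alive */

    if (*a == MAP_FAILED || *b == MAP_FAILED) {
        if (*a != MAP_FAILED)
            munmap(*a, size);
        if (*b != MAP_FAILED)
            munmap(*b, size);
        return -1;
    }
    return 0;
}
```

<p>After a successful call, a store through one pointer is visible through
the other, just as with the shared memory version.</p>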

<h3 id="windows-memory-mapping">Windows Memory Mapping</h3>

<p>Windows is very similar, but directly supports anonymous shared
memory. The key functions are <code class="language-plaintext highlighter-rouge">CreateFileMapping</code> and
<code class="language-plaintext highlighter-rouge">MapViewOfFileEx</code>.</p>

<p>First create a file mapping object from an invalid handle value. Like
POSIX, the word “file” is used without actually involving files.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">);</span>
<span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">CreateFileMapping</span><span class="p">(</span><span class="n">INVALID_HANDLE_VALUE</span><span class="p">,</span>
                             <span class="nb">NULL</span><span class="p">,</span>
                             <span class="n">PAGE_READWRITE</span><span class="p">,</span>
                             <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
                             <span class="nb">NULL</span><span class="p">);</span>
</code></pre></div></div>

<p>There’s no truncate step because the space is allocated at creation
time via the two-part size argument.</p>

<p>Then, just like <code class="language-plaintext highlighter-rouge">mmap</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">MapViewOfFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">MapViewOfFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>If I wanted to choose the target address myself, I’d call
<code class="language-plaintext highlighter-rouge">MapViewOfFileEx</code> instead, which takes the address as an additional
argument.</p>

<p>From here on it’s the same as above.</p>

<h3 id="generalizing-the-api">Generalizing the API</h3>

<p>Having some fun with this, I came up with a general API to allocate an
aliased mapping at an arbitrary number of addresses.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">);</span>
</code></pre></div></div>

<p>Values in the address array must either be page-aligned or NULL to
allow the operating system to choose, in which case the map address is
written to the array.</p>

<p>It returns 0 on success. It may fail if the size is too small (0), too
large, too many file descriptors, etc.</p>

<p>Pass the same pointers back to <code class="language-plaintext highlighter-rouge">memory_alias_unmap</code> to free the
mappings. When called correctly it cannot fail, so there’s no return
value.</p>

<p>The full source is here: <a href="/download/memalias.c">memalias.c</a></p>

<h4 id="posix">POSIX</h4>

<p>Starting with the simpler of the two functions, the POSIX
implementation looks like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">munmap</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The complex part is creating the mapping:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">path</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">path</span><span class="p">),</span> <span class="s">"/%s(%ld,%p)"</span><span class="p">,</span>
             <span class="n">__FUNCTION__</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">getpid</span><span class="p">(),</span> <span class="n">addrs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">shm_open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span> <span class="o">|</span> <span class="n">O_EXCL</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">shm_unlink</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
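    <span class="k">if</span> <span class="p">(</span><span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">size</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>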
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">,</span>
                        <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span>
                        <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">MAP_FAILED</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memory_alias_unmap</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">addrs</span><span class="p">);</span>
            <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The shared object name includes the process ID and pointer array
address, so there really shouldn’t be any non-malicious name
collisions, even if called from multiple threads in the same process.</p>

<p>Otherwise it just walks the array setting up the mappings.</p>
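
<p>To see the aliasing in action, here is a minimal, self-contained
sketch for Linux. The function name <code class="language-plaintext highlighter-rouge">alias_demo()</code> and the shared
object name are hypothetical, chosen only for this illustration. It maps
one page-sized shared object at two addresses picked by the kernel, then
checks that a write through the first view can be read through the
second. Older glibc versions may need <code class="language-plaintext highlighter-rouge">-lrt</code> for <code class="language-plaintext highlighter-rouge">shm_open()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define _POSIX_C_SOURCE 200809L
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

/* Hypothetical demo: returns 1 if two views alias the same memory,
 * 0 if they do not, and -1 if setting up the mappings failed. */
int
alias_demo(void)
{
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    char path[64];
    snprintf(path, sizeof(path), "/alias_demo(%ld)", (long)getpid());

    int fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    shm_unlink(path);  /* the name is no longer needed */
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }

    /* Two independent views of the same underlying pages. */
    int prot = PROT_READ | PROT_WRITE;
    char *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    close(fd);
    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED)
            munmap(a, size);
        if (b != MAP_FAILED)
            munmap(b, size);
        return -1;
    }

    /* Write through one view, read through the other. */
    strcpy(a, "hello");
    int aliased = a != b &amp;&amp; strcmp(b, "hello") == 0;

    munmap(a, size);
    munmap(b, size);
    return aliased;
}
</code></pre></div></div>

<p>The two pointers differ, yet both name the same bytes, which is the
whole point of the exercise.</p>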

<h4 id="windows">Windows</h4>

<p>The Windows version is very similar.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">size</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">UnmapViewOfFile</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since Windows tracks the size internally, the <code class="language-plaintext highlighter-rouge">size</code> argument is unneeded and ignored.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">m</span> <span class="o">=</span> <span class="n">CreateFileMapping</span><span class="p">(</span><span class="n">INVALID_HANDLE_VALUE</span><span class="p">,</span>
                                 <span class="nb">NULL</span><span class="p">,</span>
                                 <span class="n">PAGE_READWRITE</span><span class="p">,</span>
                                 <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
                                 <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">m</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">MapViewOfFileEx</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">access</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memory_alias_unmap</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">addrs</span><span class="p">);</span>
            <span class="n">CloseHandle</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">CloseHandle</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the future I’d like to find some unique applications of these
multiple memory views.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Calling the Native API While Freestanding</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/02/28/"/>
    <id>urn:uuid:3649a761-d3dc-391b-7f24-a28398100102</id>
    <updated>2016-02-28T23:47:22Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>When developing <a href="/blog/2016/01/31/">minimal, freestanding Windows programs</a>, it’s
obviously beneficial to take full advantage of dynamic libraries that
are already linked rather than duplicate that functionality in the
application itself. Every Windows process automatically, and
involuntarily, has kernel32.dll and ntdll.dll loaded into its process
space before it starts. As discussed previously, kernel32.dll provides
the Windows API (Win32). The other, ntdll.dll, provides the <em>Native
API</em> for user space applications, and is the focus of this article.</p>

<p>The Native API is a low-level API, a foundation for the implementation
of the Windows API and various components that don’t use the Windows
API (drivers, etc.). It includes a runtime library (RTL) suitable for
replacing important parts of the C standard library, unavailable to
freestanding programs. Very useful for a minimal program.</p>

<p>Unfortunately, <em>using</em> the Native API is a bit of a minefield. Not all
of the documented Native API functions are actually exported by
ntdll.dll, making them inaccessible both for linking and
GetProcAddress(). Some are exported, but not documented as such.
Others are documented as exported but are not documented <em>when</em> (which
release of Windows). If a particular function wasn’t exported until
Windows 8, I don’t want to use it when supporting Windows 7.</p>

<p>This is further complicated by the Microsoft Windows SDK, where many
of these functions are just macros that alias C runtime functions.
Naturally, MinGW closely follows suit. For example, in both cases,
here is how the Native API function <code class="language-plaintext highlighter-rouge">RtlCopyMemory</code> is “declared.”</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define RtlCopyMemory(dest,src,n) memcpy((dest),(src),(n))
</span></code></pre></div></div>

<p>This is certainly not useful for freestanding programs, though it has
a significant benefit for <em>hosted</em> programs: The C compiler knows the
semantics of <code class="language-plaintext highlighter-rouge">memcpy()</code> and can properly optimize around it. Any C
compiler worth its salt will replace a small or aligned, fixed-sized
<code class="language-plaintext highlighter-rouge">memcpy()</code> or <code class="language-plaintext highlighter-rouge">memmove()</code> with the equivalent inlined code. For
example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="n">buffer0</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="kt">char</span> <span class="n">buffer1</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="c1">// ...</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">buffer0</span><span class="p">,</span> <span class="n">buffer1</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="c1">// ...</span>
</code></pre></div></div>

<p>On x86_64 (GCC 4.9.3, -Os), this <code class="language-plaintext highlighter-rouge">memcpy()</code> call is replaced with
two instructions. This isn’t possible when calling an opaque function
in a non-standard dynamic library. The side effects could be anything.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">movaps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span> <span class="o">+</span> <span class="mi">48</span><span class="p">]</span>
    <span class="nf">movaps</span>  <span class="p">[</span><span class="nb">rsp</span> <span class="o">+</span> <span class="mi">32</span><span class="p">],</span> <span class="nv">xmm0</span>
</code></pre></div></div>

<p>These Native API macro aliases are what have allowed certain Wine
issues <a href="https://bugs.winehq.org/show_bug.cgi?id=38783">to slip by unnoticed for years</a>. Very few user space
applications actually call Native API functions, even when addressed
directly by name in the source. The development suite is pulling a
bait and switch.</p>

<p>Like <a href="/blog/2014/12/09/">last time I danced at the edge of the compiler</a>, this has
caused headaches in my recent experimentation with freestanding
executables. The MinGW headers assume that the programs including them
will link against a C runtime. Dirty hack warning: To work around it,
I have to undo the definition in the MinGW headers and make my own.
For example, to use the real <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="cp">#undef RtlMoveMemory
</span><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">void</span> <span class="nf">RtlMoveMemory</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>Anywhere I might have previously used <code class="language-plaintext highlighter-rouge">memmove()</code> I can instead
use <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code>. Or I could trivially supply my own wrapper:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">memmove</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlMoveMemory</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">d</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As of this writing, the same approach is not reliable with
<code class="language-plaintext highlighter-rouge">RtlCopyMemory()</code>, the cousin to <code class="language-plaintext highlighter-rouge">memcpy()</code>. As far as I can tell, it
was only exported starting in Windows 7 SP1 and Wine 1.7.46 (June
2015). Use <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code> instead. The overlap-handling overhead is
negligible compared to the function call overhead anyway.</p>

<p>As a side note: one reason besides minimalism for not implementing
your own <code class="language-plaintext highlighter-rouge">memmove()</code> is that it can’t be implemented efficiently in a
conforming C program. According to the language specification, your
implementation of <code class="language-plaintext highlighter-rouge">memmove()</code> would not be permitted to compare its
pointer arguments with <code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, <code class="language-plaintext highlighter-rouge">&lt;=</code>, or <code class="language-plaintext highlighter-rouge">&gt;=</code>. That would lead to
undefined behavior when pointing to unrelated objects (ISO/IEC
9899:2011 §6.5.8¶5). The simplest legal approach is to allocate a
temporary buffer, copy the source buffer into it, then copy it into
the destination buffer. However, buffer allocation may fail — i.e.
NULL return from <code class="language-plaintext highlighter-rouge">malloc()</code> — introducing a failure case to
<code class="language-plaintext highlighter-rouge">memmove()</code>, which isn’t supposed to fail.</p>

<p>Update July 2016: Alex Elsayed pointed out a solution to the
<code class="language-plaintext highlighter-rouge">memmove()</code> problem in the comments. In short: iterate over the
buffers bytewise (<code class="language-plaintext highlighter-rouge">char *</code>) using equality (<code class="language-plaintext highlighter-rouge">==</code>) tests to check for
an overlap. In theory, a compiler could optimize away the loop and
make it efficient.</p>
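
<p>A minimal sketch of that idea follows. The name
<code class="language-plaintext highlighter-rouge">memmove_portable()</code> is hypothetical and this is my reading of the
approach, not code from the comment. It only ever compares pointers with
<code class="language-plaintext highlighter-rouge">==</code>, which is defined even for unrelated objects, and it copies
backward only when the destination begins inside the source.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/* Hypothetical name: memmove() built on equality tests only. */
void *
memmove_portable(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Does dst point somewhere in src[1..n-1]? Every s + i here
     * stays within the source buffer, so the arithmetic is valid. */
    int backward = 0;
    for (size_t i = 1; i &lt; n; i++) {
        if (s + i == d) {
            backward = 1;
            break;
        }
    }

    if (backward) {
        /* Destination starts inside the source: copy from the end
         * so that unread source bytes are not clobbered. */
        for (size_t i = n; i &gt; 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        /* No overlap, or destination precedes source: forward copy. */
        for (size_t i = 0; i &lt; n; i++)
            d[i] = s[i];
    }
    return dst;
}
</code></pre></div></div>

<p>The overlap search is linear in the buffer size, so unless the
compiler sees through the loop, this costs an extra pass over the
buffer compared to a single relational comparison.</p>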

<p>I keep mentioning Wine because I’ve been careful to ensure my
applications run correctly with it. So far it’s worked <em>perfectly</em>
with both Windows API and Native API functions. Thanks to the hard
work behind the Wine project, despite being written sharply against
the Windows API, these tiny programs remain relatively portable (x86
and ARM). It’s a good fit for graphical applications (games), but I
would <em>never</em> write a command line application like this. The command
line has always been a second class citizen on Windows.</p>

<p>Now that I’ve got these Native API issues sorted out, I’ve
significantly expanded the capabilities of my tiny, freestanding
programs without adding anything to their size. Functions like
<code class="language-plaintext highlighter-rouge">RtlUnicodeToUTF8N()</code> and <code class="language-plaintext highlighter-rouge">RtlUTF8ToUnicodeN()</code> will surely be handy.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Small, Freestanding Windows Executables</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/01/31/"/>
    <id>urn:uuid:8eddc701-52d3-3b0c-a8a8-dd13da6ead2c</id>
    <updated>2016-01-31T22:53:03Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>Update</strong>: This is old and <a href="/blog/2023/02/15/">was <strong>updated in 2023</strong></a>!</p>

<p>Recently I’ve been experimenting with freestanding C programs on
Windows. <em>Freestanding</em> refers to programs that don’t link, either
statically or dynamically, against a standard library (i.e. libc).
This is typical for operating systems and <a href="/blog/2014/12/09/">similar, bare metal
situations</a>. Normally a C compiler can make assumptions about the
semantics of functions provided by the C standard library. For
example, the compiler will likely replace a call to a small,
fixed-size <code class="language-plaintext highlighter-rouge">memmove()</code> with move instructions. Since a freestanding
program would supply its own, it may have different semantics.</p>

<p>My usual go-to for C/C++ on Windows is <a href="http://mingw-w64.org/">Mingw-w64</a>, which has
served my needs well for the past couple of years. It’s <a href="https://packages.debian.org/search?keywords=mingw-w64">packaged on
Debian</a>, and, when combined with Wine, allows me to fully develop
Windows applications on Linux. Being GCC, it’s also great for
cross-platform development since it’s essentially the same compiler as
the other platforms. The primary difference is the interface to the
operating system (POSIX vs. Win32).</p>

<p>However, it has one glaring flaw inherited from MinGW: it links
against msvcrt.dll, an ancient version of the Microsoft C runtime
library that currently ships with Windows. Besides being dated and
quirky, <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">it’s not an official part of Windows</a> and never has
been, despite its inclusion with every release since Windows 95.
Mingw-w64 doesn’t have a C library of its own, instead patching over
some of the flaws of msvcrt.dll and linking against it.</p>

<p>Since so much depends on msvcrt.dll despite its unofficial nature,
it’s unlikely Microsoft will ever drop it from future releases of
Windows. However, if strict correctness is a concern, we must ask
Mingw-w64 not to link against it. An alternative would be
<a href="http://plibc.sourceforge.net/">PlibC</a>, though the LGPL licensing is unfortunate. Another is
Cygwin, which is a very complete POSIX environment, but is heavy and
GPL-encumbered.</p>

<p>Sometimes I’d prefer to be more direct: <a href="https://hero.handmade.network/forums/code-discussion/t/94-guide_-_how_to_avoid_c_c++_runtime_on_windows">skip the C standard library
altogether</a> and talk directly to the operating system. On Windows
that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only
links against system DLLs.</p>

<h3 id="linux-vs-windows">Linux vs. Windows</h3>

<p>The most important benefit of a standard library like libc is a
portable, uniform interface to the host system. So long as the
standard library suits its needs, the same program can run anywhere.
Without it, the program needs an implementation of each
host-specific interface.</p>

<p>On Linux, operating system requests at the lowest level are made
directly via system calls. This requires a bit of assembly language
for each supported architecture (<code class="language-plaintext highlighter-rouge">int 0x80</code> on x86, <code class="language-plaintext highlighter-rouge">syscall</code> on
x86-64, <code class="language-plaintext highlighter-rouge">swi</code> on ARM, etc.). The POSIX functions of the various Linux
libc implementations are built on top of this mechanism.</p>

<p>For example, here’s a function for a 1-argument system call on x86-64.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">syscall1</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">result</span><span class="p">;</span>
    <span class="n">__asm__</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
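        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">arg</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>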
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">exit()</code> is implemented on top. Note: A <em>real</em> libc would do
cleanup before exiting, like calling registered <code class="language-plaintext highlighter-rouge">atexit()</code> functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;syscall.h&gt;</span><span class="c1">  // defines SYS_exit</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">code</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">code</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
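
<p>System calls with more arguments follow the same pattern. As a
sketch, assuming x86-64 Linux, here is a hypothetical <code class="language-plaintext highlighter-rouge">syscall3()</code>
with a <code class="language-plaintext highlighter-rouge">sys_write()</code> wrapper on top (both names are mine, for
illustration). The second and third arguments go in <code class="language-plaintext highlighter-rouge">rsi</code> and
<code class="language-plaintext highlighter-rouge">rdx</code>, and the clobber list names <code class="language-plaintext highlighter-rouge">rcx</code> and <code class="language-plaintext highlighter-rouge">r11</code>, which the
<code class="language-plaintext highlighter-rouge">syscall</code> instruction overwrites.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;sys/syscall.h&gt;  // defines SYS_write

// Hypothetical 3-argument system call helper (x86-64 Linux only).
static long
syscall3(long n, long a, long b, long c)
{
    long result;
    __asm__ volatile (
        "syscall"
        : "=a"(result)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return result;
}

// Returns the number of bytes written, or a negated errno value.
long
sys_write(int fd, const void *buf, size_t len)
{
    return syscall3(SYS_write, fd, (long)buf, (long)len);
}
</code></pre></div></div>

<p>Unlike the libc <code class="language-plaintext highlighter-rouge">write()</code>, the raw system call reports errors as a
negative return value rather than through <code class="language-plaintext highlighter-rouge">errno</code>.</p>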

<p>The situation is simpler on Windows. Its low-level system calls are
undocumented and unstable, changing across even minor updates. The
formal, stable interface is through the exported functions in
kernel32.dll. In fact, kernel32.dll is essentially a standard library
on its own (making the term “freestanding” in this case dubious). It
includes functions usually found only in user-space, like string
manipulation, formatted output, font handling, and heap management
(similar to <code class="language-plaintext highlighter-rouge">malloc()</code>). It’s not POSIX, but it has analogs to much of
the same functionality.</p>

<h3 id="program-entry">Program Entry</h3>

<p>The standard entry for a C program is <code class="language-plaintext highlighter-rouge">main()</code>. However, this is not
the application’s <em>true</em> entry. The entry is in the C library, which
does some initialization before calling your <code class="language-plaintext highlighter-rouge">main()</code>. When <code class="language-plaintext highlighter-rouge">main()</code>
returns, it performs cleanup and exits. Without a C library, programs
don’t start at <code class="language-plaintext highlighter-rouge">main()</code>.</p>

<p>On Linux the default entry is the symbol <code class="language-plaintext highlighter-rouge">_start</code>. Its prototype
would look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Returning from this function leads to a segmentation fault, so it’s up
to your application to perform the exit system call rather than
return.</p>

<p>On Windows, the entry depends on the type of application. The two
relevant subsystems today are the <em>console</em> and <em>windows</em> subsystems.
The former is for console applications (duh). These programs may still
create windows and such, but must always have a controlling console.
The latter is primarily for programs that don’t run in a console,
though they can still create an associated console if they like. In
Mingw-w64, give <code class="language-plaintext highlighter-rouge">-mconsole</code> (default) or <code class="language-plaintext highlighter-rouge">-mwindows</code> to the compiler driver to
choose the subsystem.</p>

<p>The default <a href="https://msdn.microsoft.com/en-us/library/f9t8842e.aspx">entry for each is slightly different</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike Linux’s <code class="language-plaintext highlighter-rouge">_start</code>, Windows programs can safely return from these
functions, similar to <code class="language-plaintext highlighter-rouge">main()</code>, hence the <code class="language-plaintext highlighter-rouge">int</code> return. The <code class="language-plaintext highlighter-rouge">WINAPI</code>
macro means the function may have a special calling convention,
depending on the platform.</p>

<p>On any system, you can choose a different entry symbol or address
using the <code class="language-plaintext highlighter-rouge">--entry</code> option to the GNU linker.</p>

<h3 id="disabling-libgcc">Disabling libgcc</h3>

<p>One problem I’ve run into is Mingw-w64 generating code that calls
<code class="language-plaintext highlighter-rouge">__chkstk_ms()</code> from libgcc. I believe this is a long-standing bug,
since <code class="language-plaintext highlighter-rouge">-ffreestanding</code> should prevent these sorts of helper functions
from being used. The workaround I’ve found is to disable <a href="https://metricpanda.com/rival-fortress-update-45-dealing-with-__chkstk-__chkstk_ms-when-cross-compiling-for-windows/">the stack
probe</a> and pre-commit the whole stack.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000
</code></pre></div></div>

<p>Alternatively you could link against libgcc (statically) with <code class="language-plaintext highlighter-rouge">-lgcc</code>,
but, again, I’m going for a tiny executable.</p>

<h3 id="a-freestanding-example">A freestanding example</h3>

<p>Here’s an example of a Windows “Hello, World” that doesn’t use a C
library.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="n">WINAPI</span>
<span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">msg</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="n">DWORD</span><span class="p">[]){</span><span class="mi">0</span><span class="p">},</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To build it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
    -o example.exe example.c \
    -lkernel32
</code></pre></div></div>

<p>Notice I manually linked against kernel32.dll. The stripped final
result is only 4kB, mostly PE padding. There are <a href="http://www.phreedom.org/research/tinype/">techniques to trim
this down even further</a>, but for a substantial program it
wouldn’t make a significant difference.</p>

<p>From here you could create a GUI by linking against <code class="language-plaintext highlighter-rouge">user32.dll</code> and
<code class="language-plaintext highlighter-rouge">gdi32.dll</code> (both also part of Win32) and calling the appropriate
functions. I already <a href="/blog/2015/06/06/">ported my OpenGL demo</a> to a freestanding
.exe, dropping GLFW and directly using Win32 and WGL. It’s much less
portable, but the final .exe is only 4kB, down from the original 104kB
(static linking against GLFW).</p>

<p>I may go this route for <a href="http://7drl.org/2016/01/13/7drl-2016-announced-for-5-13-march/">the upcoming 7DRL 2016</a> in March.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Goblin-COM 7DRL 2015</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/03/15/"/>
    <id>urn:uuid:362ccedf-9538-358f-9474-5befd8bce4de</id>
    <updated>2015-03-15T21:56:12Z</updated>
    <category term="game"/><category term="media"/><category term="win32"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Yesterday I completed my third entry to the annual Seven Day Roguelike
(7DRL) challenge (previously: <a href="/blog/2013/03/17/">2013</a> and <a href="/blog/2014/03/31/">2014</a>). This
year’s entry is called <strong>Goblin-COM</strong>.</p>

<p><a href="/img/screenshot/gcom.png"><img src="/img/screenshot/gcom-thumb.png" alt="" /></a></p>

<ul>
  <li>Download/Source: <a href="https://github.com/skeeto/goblin-com">Goblin-COM</a></li>
  <li>Telnet play (no saves): <code class="language-plaintext highlighter-rouge">telnet gcom.nullprogram.com</code></li>
  <li><a href="https://www.youtube.com/watch?v=QW3Uul7-Iss">Video review</a> by Akhier Dragonheart</li>
</ul>

<p>As with previous years, the ideas behind the game are not all that
original. The goal was to be a fantasy version of <a href="http://en.wikipedia.org/wiki/UFO:_Enemy_Unknown">classic
X-COM</a> with an ANSI terminal interface. You are the ruler of a
fledgling human nation that is under attack by invading goblins. You
hire heroes, operate squads, construct buildings, and manage resource
income.</p>

<p>The inspiration this year came from watching <a href="https://www.youtube.com/playlist?list=PL2xITSnTC0YkB2-B8fs-02YVT81AE0WtP">BattleBunny</a> play
<a href="http://openxcom.org/">OpenXCOM</a>, an open source clone of the original X-COM. It
had its major 1.0 release last year. Like the early days of
<a href="https://www.openttd.org/en/">OpenTTD</a>, it currently depends on the original game assets.
But also like OpenTTD, it surpasses the original game in every way, so
there’s no reason to bother running the original anymore. I’ve also
recently been watching <a href="https://youtu.be/bwPLKud0rP4">One F Jef play Silent Storm</a>, which is
another turn-based squad game with a similar combat simulation.</p>

<p>As in X-COM, the game is broken into two modes of play: the geoscape
(strategic) and the battlescape (tactical). Unfortunately I ran out of
time and didn’t get to the battlescape part, though I’d like to add it
in the future. What’s left is a sort-of city-builder with some squad
management. You can hire heroes and send them out in squads to
eliminate goblins, but rather than dropping to the battlescape,
battles always auto-resolve in your favor. Despite this, the game
still has a story, a win state, and a lose state. I won’t say what
they are, so you have to play it for yourself!</p>

<h3 id="terminal-emulator-layer">Terminal Emulator Layer</h3>

<p>My previous entries were HTML5 games, but this entry is a plain old
standalone application. C has been my preferred language for the past
few months, so that’s what I used. Both UTF-8-capable ANSI terminals
and the Windows console are supported, so it should be perfectly
playable on any modern machine. Note, though, that some of the
poorer-quality terminal emulators that you’ll find in your Linux
distribution’s repositories (rxvt and its derivatives) are not
Unicode-capable, which means they won’t work with G-COM.</p>

<p>I <strong>didn’t make use of ncurses</strong>, instead opting to write my own
terminal graphics engine. That’s because I wanted a <a href="/blog/2014/12/09/">single, small
binary</a> that was easy to build, and I didn’t want to mess around
with <a href="http://pdcurses.sourceforge.net/">PDCurses</a>. I’ve also been studying the Win32 API lately, so
writing my own terminal platform layer would be rather easy to do anyway.</p>

<p>I experimented with a number of terminal emulators — LXTerminal,
Konsole, GNOME/MATE terminal, PuTTY, xterm, mintty, Terminator — but
the least capable “terminal” <em>by far</em> is the Windows console, so it
was the one to dictate the capabilities of the graphics engine. Some
ANSI terminals are capable of 256 colors, bold, underline, and
strikethrough fonts, but a highly portable API is basically <strong>limited
to 16 colors</strong> (RGBCMYKW with two levels of intensity) for each of the
foreground and background, and no other special text properties.</p>
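
<p>To make the 16-color limit concrete, here is a minimal sketch of how one of those colors might be turned into an ANSI escape sequence. This is not G-COM’s actual code: the names <code class="language-plaintext highlighter-rouge">sgr_code</code> and <code class="language-plaintext highlighter-rouge">sgr_format</code> are hypothetical, and it assumes the terminal accepts the common 90 to 97 and 100 to 107 codes for the bright variants, which are widely supported but not universal.</p>

```c
#include <assert.h>
#include <stdio.h>

/* Color indices in the conventional ANSI order. */
enum { BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE };

/* Return the SGR parameter for one of the 16 colors.
 * Normal colors are 30-37 (foreground) and 40-47 (background).
 * Assumption: bright colors use the 90-97 / 100-107 extension. */
int
sgr_code(int color, int bright, int background)
{
    int base = background ? 40 : 30;
    assert(color >= 0 && color < 8);
    if (bright) {
        base += 60;
    }
    return base + color;
}

/* Format an escape sequence that sets both foreground and background.
 * Returns the number of characters written, as snprintf() does. */
int
sgr_format(char *buf, size_t len, int fg, int fg_bright, int bg, int bg_bright)
{
    return snprintf(buf, len, "\x1b[%d;%dm",
                    sgr_code(fg, fg_bright, 0),
                    sgr_code(bg, bg_bright, 1));
}
```

<p>Always emitting both parameters matches the idea of never relying on the terminal’s default colors. The Windows console has no escape sequences to speak of, so a back end for it would map the same color and intensity pair onto console attributes instead.</p>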

<p>ANSI terminals also have a concept of a default foreground color and a
default background color. Most applications that output color (git,
grep, ls) leave the background color alone and are careful to choose
neutral foreground colors. G-COM always sets the background color, so
that the game looks the same no matter what the default colors are.
Also, the Windows console doesn’t really have default colors anyway,
even if I wanted to use them.</p>

<p>I put in partial support for Unicode because I wanted to use
interesting characters in the game (≈, ♣, ∩, ▲). Windows has supported
Unicode for a long time now, but since they added it <em>too</em> early,
they’re locked into the <a href="http://utf8everywhere.org/">outdated UTF-16</a>. For me this wasn’t
too bad, because few computers, Linux included, are equipped to render
characters outside of the <a href="http://en.wikipedia.org/wiki/Plane_(Unicode)">Basic Multilingual Plane</a> anyway, so
there’s no need to deal with surrogate pairs. This is especially true
for the Windows console, which can only render a <em>very</em> small set of
characters: another limit on my graphics engine. Internally individual
codepoints are handled as <code class="language-plaintext highlighter-rouge">uint16_t</code> and strings are handled as UTF-8.</p>
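
<p>Restricting codepoints to the Basic Multilingual Plane keeps the decoder short, since only one-, two-, and three-byte UTF-8 sequences can occur. The following is a sketch under that assumption, not the decoder G-COM actually uses; the function name <code class="language-plaintext highlighter-rouge">utf8_decode_bmp</code> is made up for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (len bytes available) into *out.
 * Returns the number of bytes consumed, or 0 if the input is invalid,
 * truncated, or encodes a codepoint outside the BMP. */
size_t
utf8_decode_bmp(const unsigned char *s, size_t len, uint16_t *out)
{
    assert(s && out);
    if (len >= 1 && s[0] < 0x80) {
        *out = s[0];                       /* plain ASCII */
        return 1;
    }
    if (len >= 2 && (s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
        uint16_t c = (uint16_t)((s[0] & 0x1f) << 6 | (s[1] & 0x3f));
        if (c < 0x80) {
            return 0;                      /* overlong encoding */
        }
        *out = c;
        return 2;
    }
    if (len >= 3 && (s[0] & 0xf0) == 0xe0 &&
        (s[1] & 0xc0) == 0x80 && (s[2] & 0xc0) == 0x80) {
        uint16_t c = (uint16_t)((s[0] & 0x0f) << 12 |
                                (s[1] & 0x3f) << 6 |
                                (s[2] & 0x3f));
        if (c < 0x800) {
            return 0;                      /* overlong encoding */
        }
        if (c >= 0xd800 && c <= 0xdfff) {
            return 0;                      /* surrogates are not characters */
        }
        *out = c;
        return 3;
    }
    return 0;  /* four-byte sequences (beyond the BMP) land here too */
}
```

<p>A caller walks a UTF-8 string by advancing the pointer by the return value, treating 0 as an error. Each decoded value then fits directly in a UTF-16 code unit with no surrogate pairs.</p>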

<p>I said <em>partial</em> support because, in addition to the above, it has no
support for combining characters, or any other situation where a
codepoint takes up something other than one space in the terminal.
Supporting those requires lookup tables and dealing with <a href="/blog/2014/06/13/">pitfalls</a>, but
since I controlled exactly which characters would be used, I didn’t
need any of that.</p>

<p>In spite of the limitations, I’m <em>really</em> happy with the graphical
results. The waves are animated continuously, even while the game is
paused, and it looks great. Here’s GNOME Terminal’s rendering, which I
think looked the best by default.</p>

<video width="480" height="400" controls="" loop="" autoplay="">
  <source src="/vid/gcom.webm" type="video/webm" />
  <source src="/vid/gcom.mp4" type="video/mp4" />
</video>

<p>I’ll talk about how G-COM actually communicates with the terminal in
another article. The interface between the game and the graphics
engine is really clean (<code class="language-plaintext highlighter-rouge">device.h</code>), so it would be an interesting
project to write a back end that renders the game to a regular window,
no terminal needed.</p>

<h4 id="color-directive">Color Directive</h4>

<p>I came up with a format directive to help me colorize everything. It
runs in addition to the standard <code class="language-plaintext highlighter-rouge">printf</code> directives. Here’s an example,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">panel_printf</span><span class="p">(</span><span class="o">&amp;</span><span class="n">panel</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"Really save and quit? (Rk{y}/Rk{n})"</span><span class="p">);</span>
</code></pre></div></div>

<p>The color is specified by two characters, foreground then background,
and the text it applies to is wrapped in curly brackets. There are eight
colors to pick from: RGBCMYKW. That covers all the binary values for red,
green, and blue. To specify an “intense” (bright) color, capitalize it.
That means the <code class="language-plaintext highlighter-rouge">Rk{...}</code> above makes the wrapped text bright red on black.</p>

<p><img src="/img/screenshot/gcom-yn.png" alt="" /></p>

<p>Nested directives are also supported. (And, yes, that <code class="language-plaintext highlighter-rouge">K</code> means “high
intensity black,” a.k.a. dark gray. A <code class="language-plaintext highlighter-rouge">w</code> means “low intensity white,”
a.k.a. light gray.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">panel_printf</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="o">++</span><span class="p">,</span> <span class="s">"Kk{♦}    wk{Rk{B}uild}     Kk{♦}"</span><span class="p">);</span>
</code></pre></div></div>

<p>And it mixes with the normal <code class="language-plaintext highlighter-rouge">printf</code> directives:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">panel_printf</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="o">++</span><span class="p">,</span> <span class="s">"(Rk{m}) Yk{Mine} [%s]"</span><span class="p">,</span> <span class="n">cost</span><span class="p">);</span>
</code></pre></div></div>
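
<p>For the curious, here is a rough sketch of how such a directive could be parsed. It is an illustration, not the real <code class="language-plaintext highlighter-rouge">panel_printf</code>: the names <code class="language-plaintext highlighter-rouge">color_parse</code> and <code class="language-plaintext highlighter-rouge">color_expand</code> are hypothetical, the nesting limit is arbitrary, the background character is validated but then discarded for brevity, and <code class="language-plaintext highlighter-rouge">printf</code> directives are not handled at all.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A color index 0-7 (ANSI order) plus an intensity flag. */
struct color { int index; int bright; };

/* Decode a color letter: lowercase is normal, uppercase is bright.
 * Returns 1 on success, 0 if c is not one of RGBCMYKW in either case. */
int
color_parse(char c, struct color *out)
{
    static const char lower[] = "krgybmcw";
    static const char upper[] = "KRGYBMCW";
    const char *p;
    if (c && (p = strchr(lower, c))) {
        out->index = (int)(p - lower);
        out->bright = 0;
        return 1;
    }
    if (c && (p = strchr(upper, c))) {
        out->index = (int)(p - upper);
        out->bright = 1;
        return 1;
    }
    return 0;
}

#define MAX_DEPTH 8  /* arbitrary nesting limit for this sketch */

/* Walk fmt, copying each visible character to text and recording the
 * foreground color in effect for it in fg. At most max characters are
 * stored. Returns the number of visible characters. */
size_t
color_expand(const char *fmt, char *text, struct color *fg, size_t max)
{
    struct color stack[MAX_DEPTH + 1];
    struct color f, b;
    int depth = 0;
    size_t n = 0;
    assert(fmt && text && fg);
    stack[0].index = 7;   /* outermost default: light gray */
    stack[0].bright = 0;
    while (*fmt && n < max) {
        if (depth < MAX_DEPTH && color_parse(fmt[0], &f) &&
            color_parse(fmt[1], &b) && fmt[2] == '{') {
            stack[++depth] = f;      /* open a directive */
            fmt += 3;
        } else if (depth > 0 && *fmt == '}') {
            depth--;                 /* close the innermost directive */
            fmt++;
        } else {
            text[n] = *fmt++;
            fg[n++] = stack[depth];
        }
    }
    return n;
}
```

<p>A stack is what makes nesting work: closing an inner directive restores whatever color the enclosing one had set. Note that the sketch, like any scheme this terse, will misread ordinary text that happens to look like two color letters followed by a curly bracket.</p>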

<h3 id="single-binary">Single Binary</h3>

<p>The GNU linker has a really nice feature for linking arbitrary binary
data into your application. I used this to embed my assets into a
single binary so that the user doesn’t need to worry about any sort of
data directory or anything like that. Here’s what the <code class="language-plaintext highlighter-rouge">make</code> rule
would look like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$(LD) -r -b binary -o $@ $^
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-r</code> specifies that output should be relocatable — i.e. it can be
fed back into the linker later when linking the final binary. The <code class="language-plaintext highlighter-rouge">-b
binary</code> says that the input is just an opaque binary file (“plain”
text included). The linker will create three symbols for each input
file:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">_binary_filename_start</code></li>
  <li><code class="language-plaintext highlighter-rouge">_binary_filename_end</code></li>
  <li><code class="language-plaintext highlighter-rouge">_binary_filename_size</code></li>
</ul>

<p>You can then access these from your C program like so (here for a file named <code class="language-plaintext highlighter-rouge">filename.txt</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="k">const</span> <span class="kt">char</span> <span class="n">_binary_filename_txt_start</span><span class="p">[];</span>
</code></pre></div></div>

<p>I used this to embed the story texts, and I’ve used it in the past to
embed images and textures. If you were to link zlib, you could easily
compress these assets, too. I’m surprised this sort of thing isn’t
done more often!</p>
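
<p>One detail worth spelling out: the embedded data is not null-terminated, so its length has to come from the distance between the start and end symbols. Below is a small sketch of a wrapper around that. The <code class="language-plaintext highlighter-rouge">struct asset</code> type, the function names, and the file name <code class="language-plaintext highlighter-rouge">story.txt</code> are all hypothetical, and the <code class="language-plaintext highlighter-rouge">extern</code> declarations are left in a comment so the block compiles without the linker step.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* In a real build, an object made with "ld -r -b binary story.txt"
 * would supply these symbols:
 *
 *     extern const char _binary_story_txt_start[];
 *     extern const char _binary_story_txt_end[];
 *
 * and the asset would be created with:
 *
 *     asset_from_range(_binary_story_txt_start, _binary_story_txt_end);
 */

/* A view of an embedded file: pointer plus explicit length. */
struct asset {
    const char *data;
    size_t len;
};

/* Build a view from the start/end symbol pair. */
struct asset
asset_from_range(const char *start, const char *end)
{
    struct asset a;
    assert(start && end && end >= start);
    a.data = start;
    a.len = (size_t)(end - start);
    return a;
}

/* Write the whole asset to out. Returns 1 on success, 0 on failure. */
int
asset_write(struct asset a, FILE *out)
{
    return fwrite(a.data, 1, a.len, out) == a.len;
}
```

<p>Carrying the length around explicitly avoids accidentally handing the data to a function that expects a C string.</p>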

<h3 id="dumb-game-saves">Dumb Game Saves</h3>

<p>To save time, and because it doesn’t really matter, saves are just
memory dumps. I took another page from <a href="http://handmadehero.org/">Handmade Hero</a> and
allocated everything in a single, contiguous block of memory. With one
exception, there are no pointers, so the entire block is relocatable.
When references are needed, they’re stored as integer indices into the
embedded arrays. This allows the block to be cleanly reloaded in another
process later. As a side effect, it also means there are no dynamic
allocations (<code class="language-plaintext highlighter-rouge">malloc()</code>) while the game is running. Here’s roughly
what it looks like.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">game</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">map_seed</span><span class="p">;</span>
    <span class="n">map_t</span> <span class="o">*</span><span class="n">map</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">time</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">wood</span><span class="p">,</span> <span class="n">gold</span><span class="p">,</span> <span class="n">food</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">population</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">goblin_spawn_rate</span><span class="p">;</span>
    <span class="n">invader_t</span> <span class="n">invaders</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="n">squad_t</span> <span class="n">squads</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="n">hero_t</span> <span class="n">heroes</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span>
    <span class="n">game_event_t</span> <span class="n">events</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
<span class="p">}</span> <span class="n">game_t</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">map</code> pointer is that one exception, but that’s because it’s
generated fresh after loading from the <code class="language-plaintext highlighter-rouge">map_seed</code>. Saving and loading
is trivial (error checking omitted) and <em>very</em> fast.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">game_save</span><span class="p">(</span><span class="n">game_t</span> <span class="o">*</span><span class="n">game</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">game</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">game</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">out</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">game_t</span> <span class="o">*</span>
<span class="nf">game_load</span><span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">in</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">game_t</span> <span class="o">*</span><span class="n">game</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">game</span><span class="p">));</span>
    <span class="n">fread</span><span class="p">(</span><span class="n">game</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">game</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span>
    <span class="n">game</span><span class="o">-&gt;</span><span class="n">map</span> <span class="o">=</span> <span class="n">map_generate</span><span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">map_seed</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">game</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
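
<p>For completeness, here is a sketch of the same pair with the error checking put back in. The <code class="language-plaintext highlighter-rouge">game_t</code> here is a cut-down stand-in whose fields are invented for illustration, chosen to show a squad referring to heroes by index rather than by pointer. Map regeneration is left out because it depends on the real <code class="language-plaintext highlighter-rouge">map_generate()</code>.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, cut-down game state. Squads name their members by
 * index into heroes[], so the raw bytes stay valid after a reload. */
typedef struct game {
    uint64_t map_seed;
    long time;
    struct { int hero_index[4]; } squads[2];
    struct { char name[16]; int hp; } heroes[8];
} game_t;

/* Dump the whole struct. Returns 1 on success, 0 on write failure. */
int
game_save(const game_t *game, FILE *out)
{
    assert(game && out);
    return fwrite(game, sizeof(*game), 1, out) == 1 && fflush(out) == 0;
}

/* Read a dump back. Returns a malloc()ed game that the caller must
 * free(), or NULL on allocation failure or a short read. */
game_t *
game_load(FILE *in)
{
    game_t *game;
    assert(in);
    game = malloc(sizeof(*game));
    if (game && fread(game, sizeof(*game), 1, in) != 1) {
        free(game);
        game = NULL;
    }
    return game;
}
```

<p>A short read, such as a truncated save file, now comes back as <code class="language-plaintext highlighter-rouge">NULL</code> instead of a half-initialized game. It still does nothing to detect a dump written by a different build with a different struct layout.</p>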

<p>The data isn’t important enough to bother with <a href="http://lwn.net/Articles/322823/">rename+fsync</a>
durability. I’ll risk the data if it makes savescumming that much
harder!</p>

<p>The downside to this technique is that saves are generally not
portable across architectures (particularly where endianness differs),
and may not even be portable between different platforms on the same
architecture. I only needed to persist a single game state on the same
machine, so this wouldn’t be a problem.</p>

<h3 id="final-results">Final Results</h3>

<p>I’m definitely going to be reusing some of this code in future
projects. The G-COM terminal graphics layer is nifty, and I already
like it better than ncurses, whose API I’ve always thought was kind of
ugly and old-fashioned. I like writing terminal applications.</p>

<p>Just like the last couple of years, the final game is a lot simpler
than I had planned at the beginning of the week. Most things take
longer to code than I initially expect. I’m still enjoying playing it,
which is a really good sign. When I play, I’m having enough fun to
deliberately delay the end of the game so that I can sprawl my nation
out over the island and generate crazy income.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  

</feed>
