Command line interfaces have varied throughout their brief history but have largely converged to some common, sound conventions. The core originates from unix, and the Linux ecosystem extended it, particularly via the GNU project. Unfortunately some tools initially appear to follow the conventions, but subtly get them wrong, usually for no practical benefit. I believe in many cases the authors simply didn’t know any better, so I’d like to review the conventions.
The simplest case is the short option flag. An option is a hyphen — specifically HYPHEN-MINUS U+002D — followed by one alphanumeric character. Capital letters are acceptable. The letters themselves have conventional meanings and are worth following if possible.
program -a -b -c
Flags can be grouped together into one program argument. This is both convenient and unambiguous. It’s also one of those often missed details when programs use hand-coded argument parsers, and the lack of support irritates me.
program -abc
program -acb
The next simplest case is short options that take arguments. The argument follows the option.
program -i input.txt -o output.txt
The space is optional, so the option and argument can be packed together into one program argument. Since the argument is required, this is still unambiguous. This is another often-missed feature in hand-coded parsers.
program -iinput.txt -ooutput.txt
This does not prohibit grouping. When grouped, the option accepting an argument must be last.
program -abco output.txt
program -abcooutput.txt
This technique is used to create another category, optional option arguments. The option’s argument can be optional but still unambiguous so long as the space is always omitted when the argument is present.
program -c # omitted
program -cblue # provided
program -c blue # omitted (blue is a new argument)
program -c -x # two separate flags
program -c-x # -c with argument "-x"
Optional option arguments should be used judiciously since they can be surprising, but they have their uses.
Options can typically appear in any order — something parsers often achieve via permutation — but non-options typically follow options.
program -a -b foo bar
program -b -a foo bar
GNU-style programs usually allow options and non-options to be mixed, though I don’t consider this to be essential.
program -a foo -b bar
program foo -a -b bar
program foo bar -a -b
If a non-option looks like an option because it starts with a hyphen, use -- to demarcate options from non-options.
program -a -b -- -x foo bar
An advantage of requiring that non-options follow options is that the first non-option demarcates the two groups, so -- is less often needed.
# note: without argument permutation
program -a -b foo -x bar # 2 options, 3 non-options
Since short options can be cryptic, and there are such a limited number of them, more complex programs support long options. A long option starts with two hyphens followed by one or more alphanumeric, lowercase words. Hyphens separate words. Using two hyphens prevents long options from being confused for grouped short options.
program --reverse --ignore-backups
Occasionally flags are paired with a mutually exclusive inverse flag that begins with --no-. This avoids a future flag day where the default is changed in the release that also adds the flag implementing the original behavior.
program --sort
program --no-sort
Long options can similarly accept arguments.
program --output output.txt --block-size 1024
These may optionally be connected to the argument with an equals sign (=), much like omitting the space for a short option argument.
program --output=output.txt --block-size=1024
Like before, this opens the door for optional option arguments. Due to the required =, this is still unambiguous.
program --color --reverse
program --color=never --reverse
The -- retains its original behavior of disambiguating option-like non-option arguments:
program --reverse -- --foo bar
Some programs, such as Git, have subcommands each with their own options. The main program itself may still have its own options distinct from subcommand options. The program’s options come before the subcommand and subcommand options follow the subcommand. Options are never permuted around the subcommand.
program -a -b -c subcommand -x -y -z
program -abc subcommand -xyz
Above, the -a, -b, and -c options are for program, and the others are for subcommand. So, really, the subcommand is another command line of its own.
There’s little excuse for getting these conventions wrong if you intend to follow them. Short options can be parsed correctly in just ~60 lines of C code. Long options are only slightly more complex.
GNU’s getopt_long() supports long option abbreviation — with no way to disable it (!) — but this should be avoided.
Go’s flag package intentionally deviates from the conventions.
It only supports long option semantics, via a single hyphen. This makes
it impossible to support grouping even if all options are only one
letter. Also, the only way to combine option and argument into a single
command line argument is with =. It’s sound, but I miss both features
every time I write programs in Go. That’s why I wrote my own argument
parser. Not only does it have a nicer feature set, I like the API a
lot more, too.
Python’s primary option parsing library is argparse, and I just can’t
stand it. Despite appearing to follow convention, it actually breaks
convention and its behavior is unsound. For instance, the following
program has two options, --foo and --bar. The --foo option accepts an optional argument, and the --bar option is a simple flag.
import argparse
import sys
parser = argparse.ArgumentParser()
parser.add_argument('--foo', type=str, nargs='?', default='X')
parser.add_argument('--bar', action='store_true')
print(parser.parse_args(sys.argv[1:]))
Here are some example runs:
$ python parse.py
Namespace(bar=False, foo='X')
$ python parse.py --foo
Namespace(bar=False, foo=None)
$ python parse.py --foo=arg
Namespace(bar=False, foo='arg')
$ python parse.py --bar --foo
Namespace(bar=True, foo=None)
$ python parse.py --foo arg
Namespace(bar=False, foo='arg')
Everything looks good except the last. If the --foo argument is optional then why did it consume arg? What happens if I follow it with --bar? Will it consume it as the argument?
$ python parse.py --foo --bar
Namespace(bar=True, foo=None)
Nope! Unlike arg, it left --bar alone, so instead of following the unambiguous conventions, it has its own ambiguous semantics and attempts to remedy them with a “smart” heuristic: “If an optional argument looks like an option, then it must be an option!” Non-option arguments can never follow an option with an optional argument, which makes that feature pretty useless. Since argparse does not properly support --, that does not help.
$ python parse.py --foo -- arg
usage: parse.py [-h] [--foo [FOO]] [--bar]
parse.py: error: unrecognized arguments: -- arg
Please, stick to the conventions unless you have really good reasons to break them!
Suppose you’re writing a command line program that prompts the user for a password or passphrase, and Windows is one of the supported platforms (even very old versions). This program uses UTF-8 for its string representation, as it should, and so ideally it receives the password from the user encoded as UTF-8. On most platforms this is, for the most part, automatic. However, on Windows finding the correct answer to this problem is a maze where all the signs lead towards dead ends. I recently navigated this maze and found the way out.
I knew it was possible because my passphrase2pgp tool has been using the golang.org/x/crypto/ssh/terminal package, which gets it very nearly perfect. Though they were still fixing subtle bugs as recently as 6 months ago.
The first step is to ignore just about everything you find online, because it’s either wrong or it’s solving a slightly different problem. I’ll discuss the dead ends later and focus on the solution first. Ultimately I want to implement this on Windows:
// Display prompt then read zero-terminated, UTF-8 password.
// Return password length with terminator, or zero on error.
int read_password(char *buf, int len, const char *prompt);
I chose int for the length rather than size_t because it’s a password and should not even approach INT_MAX.
For the impatient: complete, working, ready-to-use example
On a unix-like system, the program would:

1. open(2) the special /dev/tty file for reading and writing
2. write(2) the prompt
3. tcgetattr(3) and tcsetattr(3) to disable ECHO
4. read(2) a line of input
5. tcsetattr(3) to restore the original settings
6. close(2) the file

A great advantage of this approach is that it doesn’t depend on standard input and standard output. Either or both can be redirected elsewhere, and this function still interacts with the user’s terminal. The Windows version will have the same advantage.
Despite some tempting shortcuts that don’t work, the steps on Windows are basically the same but with different names. There are a couple subtleties and extra steps. I’ll be ignoring errors in my code snippets below, but the complete example has full error handling.
Instead of /dev/tty, the program opens two files: CONIN$ and CONOUT$, using CreateFileA(). Note: The “A” stands for ANSI, as opposed to “W” for wide (Unicode). This refers to the encoding of the file name, not to how the file contents are encoded. CONIN$ is opened for both reading and writing because write permissions are needed to change the console’s mode.
HANDLE hi = CreateFileA(
"CONIN$",
GENERIC_READ | GENERIC_WRITE,
0,
0,
OPEN_EXISTING,
0,
0
);
HANDLE ho = CreateFileA(
"CONOUT$",
GENERIC_WRITE,
0,
0,
OPEN_EXISTING,
0,
0
);
To write the prompt, call WriteConsoleA() on the output handle. On its own, this assumes the prompt is plain ASCII (i.e. “password: ”), not UTF-8 (i.e. “contraseña: ”):
WriteConsoleA(ho, prompt, strlen(prompt), 0, 0);
If the prompt may contain UTF-8 data, perhaps because it displays a username or isn’t in English, you have two options:
1. Use WriteConsoleW() instead.
2. Use SetConsoleOutputCP() with CP_UTF8 (65001). This is a global (to the console) setting and should be restored when done.

Next use GetConsoleMode() and SetConsoleMode() to disable echo. The console usually has ENABLE_PROCESSED_INPUT already set, which tells the console to handle CTRL-C and such, but I set it explicitly just in case. I also set ENABLE_LINE_INPUT so that the user can use backspace and so that the entire line is delivered at once.
DWORD orig = 0;
GetConsoleMode(hi, &orig);
DWORD mode = orig;
mode |= ENABLE_PROCESSED_INPUT;
mode |= ENABLE_LINE_INPUT;
mode &= ~ENABLE_ECHO_INPUT;
SetConsoleMode(hi, mode);
There are reports that ENABLE_LINE_INPUT
limits reads to 254 bytes,
but I was unable to reproduce it. My full example can read huge
passwords without trouble.
The old mode is saved in orig
so that it can be restored later.
Here’s where you have to pay the piper. As of the date of this article, the Windows API offers no method for reading UTF-8 input from the console. Give up on that hope now. If you use the “ANSI” functions to read input under any configuration, they will do the usual Windows thing of silently mangling your input.
So you must use the UTF-16 API, ReadConsoleW()
, and then
encode it yourself. Fortunately Win32 provides a UTF-8 encoder,
WideCharToMultiByte()
, which will even handle surrogate pairs
for all those people who like putting PILE OF POO (U+1F4A9) in their passwords:
SIZE_T wbuf_len = (len - 1 + 2)*sizeof(WCHAR);
WCHAR *wbuf = HeapAlloc(GetProcessHeap(), 0, wbuf_len);
DWORD nread;
ReadConsoleW(hi, wbuf, len - 1 + 2, &nread, 0);
wbuf[nread-2] = 0; // truncate "\r\n"
int r = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, buf, len, 0, 0);
SecureZeroMemory(wbuf, wbuf_len);
HeapFree(GetProcessHeap(), 0, wbuf);
I use SecureZeroMemory()
to erase the UTF-16 version of the
password before freeing the buffer. The + 2
in the allocation is for
the CRLF line ending that will later be chopped off. The error handling
version checks that the input did indeed end with CRLF. Otherwise it was
truncated (too long).
Finally print a newline since the user-typed one wasn’t echoed, restore the old console mode, close the console handles, and return the final encoded length:
WriteConsoleA(ho, "\n", 1, 0, 0);
SetConsoleMode(hi, orig);
CloseHandle(ho);
CloseHandle(hi);
return r;
The error checking version doesn’t check for errors from any of these functions since either they cannot fail, or there’s nothing reasonable to do in the event of an error.
If you look around the Win32 API you might notice SetConsoleCP(). A reasonable person might think that setting the “code page” to UTF-8 (CP_UTF8) would configure the console to encode input in UTF-8. The good news is Windows will no longer mangle your input as before. The bad news is that it will be mangled differently.
You might think you can use the CRT function _setmode() with _O_U8TEXT on the FILE * connected to the console. This does nothing useful. (The only use for _setmode() is with _O_BINARY, to disable braindead character translation on standard input and output.) The best you’ll be able to do with the CRT is the same sort of wide character read using non-standard functions, followed by conversion to UTF-8.
CredUICmdLinePromptForCredentials() promises to be both a mouthful of a function name and a prepackaged solution to this problem. It only delivers on the first. This function seems to have broken some time ago and nobody at Microsoft noticed — probably because nobody has ever used this function. I couldn’t find a working example, nor a use in any real application. When I tried to use it, I got a nonsense error code; it never worked. There’s a GUI version of this function that does work, and it’s a viable alternative for certain situations, though not mine.
At my most desperate, I hoped ENABLE_VIRTUAL_TERMINAL_PROCESSING
would
be a magical switch. On Windows 10 it magically enables some ANSI escape
sequences. The documentation in no way suggests it would work, and I
confirmed by experimentation that it does not. Pity.
I spent a lot of time searching down these dead ends until finally settling on ReadConsoleW() above. I hoped it would be more automatic, but I’m glad I have at least some solution figured out.
In a previous article I demonstrated video filtering with C and a
unix pipeline. Thanks to the ubiquitous support for the
ridiculously simple Netpbm formats — specifically the “Portable PixMap” (.ppm, P6) binary format — it’s trivial to parse and
produce image data in any language without image libraries. Video
decoders and encoders at the ends of the pipeline do the heavy lifting
of processing the complicated video formats actually used to store and
transmit video.
Naturally this same technique can be used to produce new video in a simple program. All that’s needed are a few functions to render artifacts — lines, shapes, etc. — to an RGB buffer. With a bit of basic sound synthesis, the same concept can be applied to create audio in a separate audio stream — in this case using the simple (but not as simple as Netpbm) WAV format. Put them together and a small, standalone program can create multimedia.
Here’s the demonstration video I’ll be going through in this article. It animates and visualizes various in-place sorting algorithms (see also). The elements are rendered as colored dots, ordered by hue, with red at 12 o’clock. A dot’s distance from the center is proportional to its corresponding element’s distance from its correct position. Each dot emits a sinusoidal tone with a unique frequency when it swaps places in a particular frame.
Original credit for this visualization concept goes to w0rthy.
All of the source code (less than 600 lines of C), ready to run, can be found here:
On any modern computer, rendering is real-time, even at 60 FPS, so you may be able to pipe the program’s output directly into your media player of choice. (If not, consider getting a better media player!)
$ ./sort | mpv --no-correct-pts --fps=60 -
VLC requires some help from ppmtoy4m:
$ ./sort | ppmtoy4m -F60:1 | vlc -
Or you can just encode it to another format. Recent versions of
libavformat can input PPM images directly, which means x264
can read
the program’s output directly:
$ ./sort | x264 --fps 60 -o video.mp4 /dev/stdin
By default there is no audio output. I wish there was a nice way to
embed audio with the video stream, but this requires a container and
that would destroy all the simplicity of this project. So instead, the
-a
option captures the audio in a separate file. Use ffmpeg
to
combine the audio and video into a single media file:
$ ./sort -a audio.wav | x264 --fps 60 -o video.mp4 /dev/stdin
$ ffmpeg -i video.mp4 -i audio.wav -vcodec copy -acodec mp3 \
combined.mp4
You might think you’ll be clever by using mkfifo
(i.e. a named pipe)
to pipe both audio and video into ffmpeg at the same time. This will
only result in a deadlock since neither program is prepared for this.
One will be blocked writing one stream while the other is blocked
reading on the other stream.
Several years ago my intern and I used the exact same pure C rendering technique to produce these raytracer videos:
I also used this technique to illustrate gap buffers.
This program really only has one purpose: rendering a sorting video with a fixed, square resolution. So rather than write generic image rendering functions, some assumptions will be hard coded. For example, the video size will just be hard coded and assumed square, making it simpler and faster. I chose 800x800 as the default:
#define S 800
Rather than define some sort of color struct with red, green, and blue fields, color will be represented by a 24-bit integer (long). I arbitrarily chose red to be the most significant 8 bits. This has nothing to do with the order of the individual channels in Netpbm since these integers are never dumped out. (This would have stupid byte-order issues anyway.) “Color literals” are particularly convenient and familiar in this format. For example, the constant for pink: 0xff7f7fUL.
In practice the color channels will be operated upon separately, so here are a couple of helper functions to convert the channels between this format and normalized floats (0.0–1.0).
static void
rgb_split(unsigned long c, float *r, float *g, float *b)
{
*r = ((c >> 16) / 255.0f);
*g = (((c >> 8) & 0xff) / 255.0f);
*b = ((c & 0xff) / 255.0f);
}
static unsigned long
rgb_join(float r, float g, float b)
{
unsigned long ir = roundf(r * 255.0f);
unsigned long ig = roundf(g * 255.0f);
unsigned long ib = roundf(b * 255.0f);
return (ir << 16) | (ig << 8) | ib;
}
Originally I decided the integer form would be sRGB, and these functions handled the conversion to and from sRGB. Since it had no noticeable effect on the output video, I discarded it. In more sophisticated rendering you may want to take this into account.
The RGB buffer where images are rendered is just a plain old byte buffer with the same pixel format as PPM. The ppm_set() function writes a color to a particular pixel in the buffer, assumed to be S by S pixels. The complement to this function is ppm_get(), which will be needed for blending.
static void
ppm_set(unsigned char *buf, int x, int y, unsigned long color)
{
buf[y * S * 3 + x * 3 + 0] = color >> 16;
buf[y * S * 3 + x * 3 + 1] = color >> 8;
buf[y * S * 3 + x * 3 + 2] = color >> 0;
}
static unsigned long
ppm_get(unsigned char *buf, int x, int y)
{
unsigned long r = buf[y * S * 3 + x * 3 + 0];
unsigned long g = buf[y * S * 3 + x * 3 + 1];
unsigned long b = buf[y * S * 3 + x * 3 + 2];
return (r << 16) | (g << 8) | b;
}
Since the buffer is already in the right format, writing an image is dead simple. I like to flush after each frame so that observers generally see clean, complete frames. It helps in debugging.
static void
ppm_write(const unsigned char *buf, FILE *f)
{
fprintf(f, "P6\n%d %d\n255\n", S, S);
fwrite(buf, S * 3, S, f);
fflush(f);
}
If you zoom into one of those dots, you may notice it has a nice smooth edge. Here’s one rendered at 30x the normal resolution. I did not render, then scale this image in another piece of software. This is straight out of the C program.
In an early version of this program I used a dumb dot rendering routine. It took a color and a hard, integer pixel coordinate. All the pixels within a certain distance of this coordinate were set to the color, everything else was left alone. This had two bad effects:
Dots jittered as they moved around since their positions were rounded to the nearest pixel for rendering. A dot would be centered on one pixel, then suddenly centered on another pixel. This looked bad even when those pixels were adjacent.
There’s no blending between dots when they overlap, making the lack of anti-aliasing even more pronounced.
Instead the dot’s position is computed in floating point and is actually rendered as if it were between pixels. This is done with a shader-like routine that uses smoothstep — just as found in shader languages — to give the dot a smooth edge. That edge is blended into the image, whether that’s the background or a previously-rendered dot. The input to the smoothstep is the distance from the floating point coordinate to the center (or corner?) of the pixel being rendered, maintaining that between-pixel smoothness.
Rather than dump the whole function here, let’s look at it piece by piece. I have two new constants to define the inner dot radius and the outer dot radius. It’s smooth between these radii.
#define R0 (S / 400.0f) // dot inner radius
#define R1 (S / 200.0f) // dot outer radius
The dot-drawing function takes the image buffer, the dot’s coordinates, and its foreground color.
static void
ppm_dot(unsigned char *buf, float x, float y, unsigned long fgc);
The first thing to do is extract the color components.
float fr, fg, fb;
rgb_split(fgc, &fr, &fg, &fb);
Next determine the range of pixels over which the dot will be drawn. These are based on the two radii and will be used for looping.
int miny = floorf(y - R1 - 1);
int maxy = ceilf(y + R1 + 1);
int minx = floorf(x - R1 - 1);
int maxx = ceilf(x + R1 + 1);
Here’s the loop structure. Everything else will be inside the innermost
loop. The dx
and dy
are the floating point distances from the center
of the dot.
for (int py = miny; py <= maxy; py++) {
float dy = py - y;
for (int px = minx; px <= maxx; px++) {
float dx = px - x;
/* ... */
}
}
Use the x and y distances to compute the distance and smoothstep value, which will be the alpha. Within the inner radius the color is on 100%. Outside the outer radius it’s 0%. Elsewhere it’s something in between.
float d = sqrtf(dy * dy + dx * dx);
float a = smoothstep(R1, R0, d);
Get the background color, extract its components, and blend the foreground and background according to the computed alpha value. Finally write the pixel back into the buffer.
unsigned long bgc = ppm_get(buf, px, py);
float br, bg, bb;
rgb_split(bgc, &br, &bg, &bb);
float r = a * fr + (1 - a) * br;
float g = a * fg + (1 - a) * bg;
float b = a * fb + (1 - a) * bb;
ppm_set(buf, px, py, rgb_join(r, g, b));
That’s all it takes to render a smooth dot anywhere in the image.
The array being sorted is just a global variable. This simplifies some of the sorting functions since a few are implemented recursively. They can call for a frame to be rendered without needing to pass the full array. With the dot-drawing routine done, rendering a frame is easy:
#define N 360 // number of dots
static int array[N];
static void
frame(void)
{
static unsigned char buf[S * S * 3];
memset(buf, 0, sizeof(buf));
for (int i = 0; i < N; i++) {
float delta = abs(i - array[i]) / (N / 2.0f);
float x = -sinf(i * 2.0f * PI / N);
float y = -cosf(i * 2.0f * PI / N);
float r = S * 15.0f / 32.0f * (1.0f - delta);
float px = r * x + S / 2.0f;
float py = r * y + S / 2.0f;
ppm_dot(buf, px, py, hue(array[i]));
}
ppm_write(buf, stdout);
}
The buffer is static
since it will be rather large, especially if S
is cranked up. Otherwise it’s likely to overflow the stack. The
memset()
fills it with black. If you wanted a different background
color, here’s where you change it.
For each element, compute its delta from the proper array position,
which becomes its distance from the center of the image. The angle is
based on its actual position. The hue()
function (not shown in this
article) returns the color for the given element.
With the frame()
function complete, all I need is a sorting function
that calls frame()
at appropriate times. Here are a couple of
examples:
static void
shuffle(int array[N], uint64_t *rng)
{
for (int i = N - 1; i > 0; i--) {
uint32_t r = pcg32(rng) % (i + 1);
swap(array, i, r);
frame();
}
}
static void
sort_bubble(int array[N])
{
int c;
do {
c = 0;
for (int i = 1; i < N; i++) {
if (array[i - 1] > array[i]) {
swap(array, i - 1, i);
c = 1;
}
}
frame();
} while (c);
}
To add audio I need to keep track of which elements were swapped in this frame. When producing a frame I need to generate and mix tones for each element that was swapped.
Notice the swap()
function above? That’s not just for convenience.
That’s also how things are tracked for the audio.
static int swaps[N];
static void
swap(int a[N], int i, int j)
{
int tmp = a[i];
a[i] = a[j];
a[j] = tmp;
swaps[(a - array) + i]++;
swaps[(a - array) + j]++;
}
Before we get ahead of ourselves I need to write a WAV header. Without getting into the purpose of each field, just note that the header has 13 fields, followed immediately by 16-bit little endian PCM samples. There will be only one channel (monotone).
#define HZ 44100 // audio sample rate
static void
wav_init(FILE *f)
{
emit_u32be(0x52494646UL, f); // "RIFF"
emit_u32le(0xffffffffUL, f); // file length
emit_u32be(0x57415645UL, f); // "WAVE"
emit_u32be(0x666d7420UL, f); // "fmt "
emit_u32le(16, f); // struct size
emit_u16le(1, f); // PCM
emit_u16le(1, f); // mono
emit_u32le(HZ, f); // sample rate (i.e. 44.1 kHz)
emit_u32le(HZ * 2, f); // byte rate
emit_u16le(2, f); // block size
emit_u16le(16, f); // bits per sample
emit_u32be(0x64617461UL, f); // "data"
emit_u32le(0xffffffffUL, f); // byte length
}
Rather than tackle the annoying problem of figuring out the total
length of the audio ahead of time, I just wave my hands and write the
maximum possible number of bytes (0xffffffff
). Most software that
can read WAV files will understand this to mean the entire rest of the
file contains samples.
With the header out of the way all I have to do is write 1/60th of a second worth of samples to this file each time a frame is produced. That’s 735 samples (1,470 bytes) at 44.1kHz.
The simplest place to do audio synthesis is in frame()
right after
rendering the image.
#define FPS 60 // output framerate
#define MINHZ 20 // lowest tone
#define MAXHZ 1000 // highest tone
static void
frame(void)
{
/* ... rendering ... */
/* ... synthesis ... */
}
With the largest tone frequency at 1kHz, Nyquist says we only need to sample at 2kHz. 8kHz is a very common sample rate and gives some overhead space, making it a good choice. However, I found that audio encoding software was a lot happier to accept the standard CD sample rate of 44.1kHz, so I stuck with that.
The first thing to do is to allocate and zero a buffer for this frame’s samples.
int nsamples = HZ / FPS;
static float samples[HZ / FPS];
memset(samples, 0, sizeof(samples));
Next determine how many “voices” there are in this frame. This is used to mix the samples by averaging them. If an element was swapped more than once this frame, it’s a little louder than the others — i.e. it’s played twice at the same time, in phase.
int voices = 0;
for (int i = 0; i < N; i++)
voices += swaps[i];
Here’s the most complicated part. I use sinf()
to produce the
sinusoidal wave based on the element’s frequency. I also use a parabola
as an envelope to shape the beginning and ending of this tone so that
it fades in and fades out. Otherwise you get the nasty, high-frequency
“pop” sound as the wave is given a hard cut off.
for (int i = 0; i < N; i++) {
if (swaps[i]) {
float hz = i * (MAXHZ - MINHZ) / (float)N + MINHZ;
for (int j = 0; j < nsamples; j++) {
float u = 1.0f - j / (float)(nsamples - 1);
float parabola = 1.0f - (u * 2 - 1) * (u * 2 - 1);
float envelope = parabola * parabola * parabola;
float v = sinf(j * 2.0f * PI / HZ * hz) * envelope;
samples[j] += swaps[i] * v / voices;
}
}
}
Finally I write out each sample as a signed 16-bit value. I flush the frame audio just like I flushed the frame image, keeping them somewhat in sync from an outsider’s perspective.
for (int i = 0; i < nsamples; i++) {
int s = samples[i] * 0x7fff;
emit_u16le(s, wav);
}
fflush(wav);
Before returning, reset the swap counter for the next frame.
memset(swaps, 0, sizeof(swaps));
You may have noticed there was text rendered in the corner of the video
announcing the sort function. There’s font bitmap data in font.h
which
gets sampled to render that text. It’s not terribly complicated, but
you’ll have to study the code on your own to see how that works.
This simple video rendering technique has served me well for some years now. All it takes is a bit of knowledge about rendering. I learned quite a bit just from watching Handmade Hero, where Casey writes a software renderer from scratch, then implements a nearly identical renderer with OpenGL. The more I learn about rendering, the better this technique works.
Before writing this post I spent some time experimenting with using a media player as an interface to a game. For example, rather than render the game using OpenGL or similar, render it as PPM frames and send it to the media player to be displayed, just as game consoles drive television sets. Unfortunately the latency is horrible — multiple seconds — so that idea just doesn’t work. So while this technique is fast enough for real time rendering, it’s no good for interaction.
gmake) instead of the system’s make.
I’ve since become familiar and comfortable with make’s official specification, and I’ve spent the last year writing strictly portable Makefiles. Not only are my builds now portable across all unix-like systems, my Makefiles are cleaner and more robust. Many of the common make extensions — conditionals in particular — lead to fragile, complicated Makefiles and are best avoided anyway. It’s important to be able to trust your build system to do its job correctly.
This tutorial should be suitable for make beginners who have never written their own Makefiles before, as well as experienced developers who want to learn how to write portable Makefiles. Regardless, in order to understand the examples you must be familiar with the usual steps for building programs on the command line (compiler, linker, object files, etc.). I’m not going to suggest any fancy tricks nor provide any sort of standard starting template. Makefiles should be dead simple when the project is small, and grow in a predictable, clean fashion alongside the project.
I’m not going to cover every feature. You’ll need to read the specification for yourself to learn it all. This tutorial will go over the important features as well as the common conventions. It’s important to follow established conventions so that people using your Makefiles will know what to expect and how to accomplish the basic tasks.
If you’re running Debian, or a Debian derivative such as Ubuntu, the bmake and freebsd-buildutils packages will provide the bmake and fmake programs respectively. These alternative make implementations
are very useful for testing your Makefiles’ portability, should you
accidentally make use of a GNU Make feature. It’s not perfect since each
implements some of the same extensions as GNU Make, but it will catch
some common mistakes.
I am free, no matter what rules surround me. If I find them tolerable, I tolerate them; if I find them too obnoxious, I break them. I am free because I know that I alone am morally responsible for everything I do. ―Robert A. Heinlein
At make’s core are one or more dependency trees, constructed from rules. Each vertex in the tree is called a target. The final products of the build (executable, document, etc.) are the tree roots. A Makefile specifies the dependency trees and supplies the shell commands to produce a target from its prerequisites.
In this illustration, the “.c” files are source files that are written by hand, not generated by commands, so they have no prerequisites. The syntax for specifying one or more edges in this dependency tree is simple:
target [target...]: [prerequisite...]
While technically multiple targets can be specified in a single rule, this is unusual. Typically each target is specified in its own rule. To specify the tree in the illustration above:
game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c
The order of these rules doesn’t matter. The entire Makefile is parsed before any actions are taken, so the tree’s vertices and edges can be specified in any order. There’s one exception: the first non-special target in a Makefile is the default target. This target is selected implicitly when make is invoked without choosing a target. It should be something sensible, so that a user can blindly run make and get a useful result.
A target can be specified more than once. Any new prerequisites are appended to the previously-given prerequisites. For example, this Makefile is identical to the previous, though it’s typically not written this way:
game: graphics.o
game: physics.o
game: input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c
There are six special targets that are used to change the behavior of make itself. All have uppercase names and start with a period. Names fitting this pattern are reserved for use by make. According to the standard, in order to get reliable POSIX behavior, the first non-comment line of the Makefile must be .POSIX. Since this is a special target, it’s not a candidate for the default target, so game will remain the default target:
.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c
In practice, even a simple program will have header files, and sources that include a header file should also have an edge on the dependency tree for it. If the header file changes, targets that include it should also be rebuilt.
.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
We’ve constructed a dependency tree, but we still haven’t told make how to actually build any targets from their prerequisites. The rules also need to specify the shell commands that produce a target from its prerequisites.
If you were to create the source files in the example and invoke make, you will find that it actually does know how to build the object files. This is because make is initially configured with certain inference rules, a topic which will be covered later. For now, we’ll add the .SUFFIXES special target to the top, erasing all the built-in inference rules.
Commands immediately follow the target/prerequisite line in a rule. Each command line must start with a tab character. This can be awkward if your text editor isn’t configured for it, and it will be awkward if you try to copy the examples from this page.
Each line is run in its own shell, so be mindful of using commands like cd, which won’t affect later lines.
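For example, here’s the pitfall and the usual workaround side by side (the subdir directory and file names are made up for illustration):

```make
broken:
	cd subdir
	cc -c file.c

works:
	cd subdir && cc -c file.c
```

In the "broken" rule, the cd runs in one shell and the cc in a fresh shell that starts back in the original directory. Joining dependent commands with && into a single shell invocation keeps them together.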
The simplest thing to do is literally specify the same commands you’d type at the shell:
.POSIX:
.SUFFIXES:
game: graphics.o physics.o input.o
	cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
	cc -c graphics.c
physics.o: physics.c physics.h
	cc -c physics.c
input.o: input.c input.h graphics.h physics.h
	cc -c input.c
I tried to walk into Target, but I missed. ―Mitch Hedberg
When invoking make, it accepts zero or more targets from the dependency tree, and it will build these targets — i.e. run the commands in the target’s rule — if the target is out-of-date. A target is out-of-date if it is older than any of its prerequisites.
# build the "game" binary (default target)
$ make
# build just the object files
$ make graphics.o physics.o input.o
This effect cascades up the dependency tree and causes further targets to be rebuilt until all of the requested targets are up-to-date. There’s a lot of room for parallelism since different branches of the tree can be updated independently. It’s common for make implementations to support parallel builds with the -j option. This is non-standard, but it’s a fantastic feature that doesn’t require anything special in the Makefile to work correctly.
Similar to parallel builds is make’s -k (“keep going”) option, which is standard. This tells make not to stop on the first error, and to continue updating targets that are unaffected by the error. This is nice for fully populating Vim’s quickfix list or Emacs’ compilation buffer.
It’s common to have multiple targets that should be built by default. If the first rule selects the default target, how do we solve the problem of needing multiple default targets? The convention is to use phony targets. These are called “phony” because there is no corresponding file, and so phony targets are never up-to-date. It’s convention for a phony “all” target to be the default target.
I’ll make game a prerequisite of a new “all” target. More real targets could be added as necessary to turn them into defaults. Users of this Makefile will also expect make all to build the entire project.
Another common phony target is “clean”, which removes all of the built files. Users will expect make clean to delete all generated files.
.POSIX:
.SUFFIXES:
all: game
game: graphics.o physics.o input.o
	cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
	cc -c graphics.c
physics.o: physics.c physics.h
	cc -c physics.c
input.o: input.c input.h graphics.h physics.h
	cc -c input.c
clean:
	rm -f game graphics.o physics.o input.o
So far the Makefile hardcodes cc as the compiler, and doesn’t use any compiler flags (warnings, optimization, hardening, etc.). The user should be able to easily control all these things, but right now they’d have to edit the entire Makefile to do so. Perhaps the user has both gcc and clang installed, and wants to choose one or the other without changing which is installed as cc.
To solve this, make has macros that expand into strings when referenced. The convention is to use the macro named CC when talking about the C compiler, CFLAGS when talking about flags passed to the C compiler, LDFLAGS for flags passed to the C compiler when linking, and LDLIBS for flags about libraries when linking. The Makefile should supply defaults as needed.
A macro is expanded with $(...). It’s valid (and normal) to reference a macro that hasn’t been defined, which will be an empty string. This will be the case with LDFLAGS below.
Macro values can contain other macros, which will be expanded recursively each time the macro is expanded. Some make implementations allow the name of the macro being expanded to itself be a macro, which is Turing complete, but this behavior is non-standard.
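Because expansion happens at each reference, a macro can even be defined after another macro that uses it. A small sketch, separate from the game Makefile:

```make
FLAGS = $(WARN) -O
WARN = -W

show:
	@echo $(FLAGS)
```

Running make show prints “-W -O”: $(WARN) isn’t expanded when FLAGS is defined, only when $(FLAGS) is finally used in the command.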
.POSIX:
.SUFFIXES:
CC = cc
CFLAGS = -W -O
LDLIBS = -lm
all: game
game: graphics.o physics.o input.o
	$(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
	$(CC) -c $(CFLAGS) graphics.c
physics.o: physics.c physics.h
	$(CC) -c $(CFLAGS) physics.c
input.o: input.c input.h graphics.h physics.h
	$(CC) -c $(CFLAGS) input.c
clean:
	rm -f game graphics.o physics.o input.o
Macros are overridden by macro definitions given as command line arguments in the form name=value. This allows the user to select their own build configuration. This is one of make’s most powerful and under-appreciated features.
$ make CC=clang CFLAGS='-O3 -march=native'
If the user doesn’t want to specify these macros on every invocation, they can (cautiously) use make’s -e flag to set overriding macro definitions from the environment.
$ export CC=clang
$ export CFLAGS=-O3
$ make -e all
Some make implementations have other special kinds of macro assignment operators beyond simple assignment (=). These are unnecessary, so don’t worry about them.
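For reference, these are the sorts of non-standard assignment operators to recognize — and avoid — in portable Makefiles (the syntax shown is GNU Make’s):

```make
CFLAGS := -O2       # immediate ("simply expanded") assignment
CFLAGS += -Wall     # append to an existing macro
CFLAGS ?= -O2       # assign only if not already defined
```

Everything these do can be accomplished with plain = assignments plus command-line or environment overrides.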
The road itself tells us far more than signs do. ―Tom Vanderbilt, Traffic: Why We Drive the Way We Do
There’s repetition across the three different object files. Wouldn’t it be nice if there was a way to communicate this pattern? Fortunately there is, in the form of inference rules: a target with a certain extension, built from a prerequisite with another certain extension, is produced a certain way. This will make more sense with an example.
In an inference rule, the target indicates the extensions. The $< macro expands to the prerequisite, which is essential to making inference rules work generically. Unfortunately this macro is not available in target rules, as much as that would be useful.
For example, here’s an inference rule that teaches make how to build an object file from a C source file. This particular rule is one that is pre-defined by make, so you’ll never need to write this one yourself. I’ll include it for completeness.
.c.o:
	$(CC) $(CFLAGS) -c $<
These extensions must be added to .SUFFIXES before they will work. With that, the commands for the rules about object files can be omitted.
.POSIX:
.SUFFIXES:
CC = cc
CFLAGS = -W -O
LDLIBS = -lm
all: game
game: graphics.o physics.o input.o
	$(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
	rm -f game graphics.o physics.o input.o
.SUFFIXES: .c .o
.c.o:
	$(CC) $(CFLAGS) -c $<
The first, empty .SUFFIXES clears the suffix list. The second one adds .c and .o to the now-empty suffix list.
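Inference rules aren’t limited to compiling C. As an illustration, here’s a hypothetical rule that derives a .out file from a .in file by filtering it through sed (the .in/.out extensions and the substitution are made up for this example):

```make
.SUFFIXES: .in .out
.in.out:
	sed 's/@VERSION@/1.0/g' $< >$@
```

With this in place, a rule like “config.out: config.in” needs no commands of its own; $< expands to the .in prerequisite and $@ to the .out target.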
Conventions are, indeed, all that shield us from the shivering void, though often they do so but poorly and desperately. ―Robert Aickman
Users usually expect an “install” target that installs the built program, libraries, man pages, etc. By convention this target should use the PREFIX and DESTDIR macros.
The PREFIX macro should default to /usr/local, and since it’s a macro the user can override it to install elsewhere, such as in their home directory. The user should override it for both building and installing, since the prefix may need to be built into the binary (e.g. -DPREFIX=$(PREFIX)).
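Baking the prefix into the binary might look like this in the Makefile — a sketch, since the example project in this article doesn’t actually need it:

```make
CFLAGS = -W -O -DPREFIX=\"$(PREFIX)\"
```

The escaped quotes survive the shell, so the C source sees PREFIX as a string literal (e.g. "/usr/local") and can use it to locate installed data files at run time.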
The DESTDIR macro is used for staged builds, so that the program gets installed under a fake root directory for the sake of packaging. Unlike PREFIX, the program will not actually be run from this directory.
.POSIX:
CC = cc
CFLAGS = -W -O
LDLIBS = -lm
PREFIX = /usr/local
all: game
install: game
	mkdir -p $(DESTDIR)$(PREFIX)/bin
	mkdir -p $(DESTDIR)$(PREFIX)/share/man/man1
	cp -f game $(DESTDIR)$(PREFIX)/bin
	gzip < game.1 > $(DESTDIR)$(PREFIX)/share/man/man1/game.1.gz
game: graphics.o physics.o input.o
	$(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
	rm -f game graphics.o physics.o input.o
You may also want to provide an “uninstall” phony target that does the opposite.
$ make PREFIX=$HOME/.local install
Other common targets are “mostlyclean” (like “clean” but don’t delete some slow-to-build targets), “distclean” (delete even more than “clean”), “test” or “check” (run the test suite), and “dist” (create a package).
One of make’s big weak points is scaling up as a project grows in size.
As your growing project is broken into subdirectories, you may be tempted to put a Makefile in each subdirectory and invoke them recursively.
Don’t use recursive Makefiles. It breaks the dependency tree across separate instances of make and typically results in a fragile build. There’s nothing good about it. Have one Makefile at the root of your project and invoke make there. You may have to teach your text editor how to do this.
When talking about files in subdirectories, just include the subdirectory in the name. Everything will work the same as far as make is concerned, including inference rules.
src/graphics.o: src/graphics.c
src/physics.o: src/physics.c
src/input.o: src/input.c
Keeping your object files separate from your source files is a nice idea. When it comes to make, there’s good news and bad news.
The good news is that make can do this. You can pick whatever file names you like for targets and prerequisites.
obj/input.o: src/input.c
The bad news is that inference rules are not compatible with out-of-source builds. You’ll need to repeat the same commands for each rule as if inference rules didn’t exist. This is tedious for large projects, so you may want to have some sort of “configure” script, even if hand-written, to generate all this for you. This is essentially what CMake is all about. That, plus dependency management.
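In other words, every object file in an out-of-source build needs its own explicit commands, along these lines (illustrative file names):

```make
obj/input.o: src/input.c src/input.h
	$(CC) $(CFLAGS) -c -o obj/input.o src/input.c
```

Multiply that by every object file in the project and the appeal of a generator script becomes clear.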
Another problem with scaling up is tracking the project’s ever-changing dependencies across all the source files. Missing a dependency means the build may not be correct unless you make clean first.
If you go the route of using a script to generate the tedious parts of the Makefile, both GCC and Clang have a nice feature for generating all the Makefile dependencies for you (-MM, -MT), at least for C and C++. There are lots of tutorials for doing this dependency generation on the fly as part of the build, but it’s fragile and slow. Much better to do it all up front and “bake” the dependencies into the Makefile so that make can do its job properly. If the dependencies change, rebuild your Makefile.
For example, here’s what it looks like invoking gcc’s dependency generator against the imaginary input.c for an out-of-source build:
$ gcc $CFLAGS -MM -MT '$(BUILD)/input.o' input.c
$(BUILD)/input.o: input.c input.h graphics.h physics.h
Notice the output is in Makefile rule format. Unfortunately this feature strips the leading paths from the target, so, in practice, using it is always more complicated than it should be (e.g. it requires the use of -MT).
Microsoft has an implementation of make called Nmake, which comes with Visual Studio. It’s nearly a POSIX-compatible make, but necessarily breaks from the standard in some places. Their cl.exe compiler uses .obj as the object file extension and .exe for binaries, both of which differ from the unix world, so it has different built-in inference rules. Windows also lacks a Bourne shell and the standard unix tools, so all of the commands will necessarily be different.
There’s no equivalent of rm -f on Windows, so good luck writing a proper “clean” target. No, del /f isn’t the same.
So while it’s close to POSIX make, it’s not practical to write a Makefile that will simultaneously work properly with both POSIX make and Nmake. These need to be separate Makefiles.
It’s nice to have reliable, portable Makefiles that just work anywhere. Code to the standards and you don’t need feature tests or other sorts of special treatment.
In the Smarter Every Day video, Destin illustrates the effect by simulating rolling shutter using a short video clip. In each frame of the video, a few additional rows are locked in place, showing the effect in slow motion, making it easier to understand.
At the end of the video he thanks a friend for figuring out how to get After Effects to simulate rolling shutter. After thinking about this for a moment, I figured I could easily accomplish this myself with just a bit of C, without any libraries. The video above this paragraph is the result.
I previously described a technique to edit and manipulate video without any formal video editing tools. A unix pipeline is sufficient for doing minor video editing, especially without sound. The program at the front of the pipe decodes the video into a raw, uncompressed format, such as YUV4MPEG or PPM. The tools in the middle losslessly manipulate this data to achieve the desired effect (watermark, scaling, etc.). Finally, the tool at the end encodes the video into a standard format.
$ decode video.mp4 | xform-a | xform-b | encode out.mp4
For the “decode” program I’ll be using ffmpeg now that it’s back in the Debian repositories. You can throw a video in virtually any format at it and it will write PPM frames to standard output. For the encoder I’ll be using the x264 command line program, though ffmpeg could handle this part as well. Without any filters in the middle, this example will just re-encode a video:
$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
      x264 -o output.mp4 /dev/stdin
The filter tools in the middle only need to read and write in the raw image format. They’re a little bit like shaders, and they’re easy to write. In this case, I’ll write a C program that simulates rolling shutter. The filter could be written in any language that can read and write binary data from standard input to standard output.
Update: It appears that input PPM streams are a rather recent feature of libavformat (a.k.a. lavf, used by x264). Support for PPM input first appeared in libavformat 3.1 (released June 26th, 2016). If you’re using an older version of libavformat, you’ll need to stick ppmtoy4m in front of x264 in the processing pipeline.
$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
      ppmtoy4m | \
      x264 -o output.mp4 /dev/stdin
In the past, my go-to for raw video data has been loose PPM frames and YUV4MPEG streams (via ppmtoy4m). Fortunately, over the years a lot of tools have gained the ability to manipulate streams of PPM images, which is a much more convenient format. Despite being raw video data, YUV4MPEG is still a fairly complex format with lots of options and annoying colorspace concerns. PPM is simple RGB without complications. The header is just text:
P6
<width> <height>
<maxdepth>
<width * height * 3 binary RGB data>
The maximum depth is virtually always 255. A smaller value reduces the image’s dynamic range without reducing the size. A larger value involves byte-order issues (endian). For video frame data, the file will typically look like:
P6
1920 1080
255
<frame RGB>
Unfortunately the format is actually a little more flexible than this. Except for the newline (LF, 0x0A) after the maximum depth, the whitespace is arbitrary, and comments starting with # are permitted. Since the tools I’m using won’t produce comments, I’m going to ignore that detail. I’ll also assume the maximum depth is always 255.
Here’s the structure I used to represent a PPM image, just one frame of video. I’m using a flexible array member to pack the data at the end of the structure.
struct frame {
    size_t width;
    size_t height;
    unsigned char data[];
};
Next a function to allocate a frame:
static struct frame *
frame_create(size_t width, size_t height)
{
    struct frame *f = malloc(sizeof(*f) + width * height * 3);
    f->width = width;
    f->height = height;
    return f;
}
We’ll need a way to write the frames we’ve created.
static void
frame_write(struct frame *f)
{
    printf("P6\n%zu %zu\n255\n", f->width, f->height);
    fwrite(f->data, f->width * f->height, 3, stdout);
}
Finally, a function to read a frame, reusing an existing buffer if possible. The most complex part of the whole program is just parsing the PPM header. The %*c in the scanf() format specifically consumes the line feed immediately following the maximum depth.
static struct frame *
frame_read(struct frame *f)
{
    size_t width, height;
    if (scanf("P6 %zu%zu%*d%*c", &width, &height) < 2) {
        free(f);
        return 0;
    }
    if (!f || f->width != width || f->height != height) {
        free(f);
        f = frame_create(width, height);
    }
    fread(f->data, width * height, 3, stdin);
    return f;
}
Since this program will only be part of a pipeline, I’m not worried about checking the results of fwrite() and fread(). The process will be killed by the shell if something goes wrong with the pipes. However, if we’re out of video data and get an EOF, scanf() will fail, indicating the EOF, which is normal and can be handled cleanly.
That’s all the infrastructure we need to build an identity filter that passes frames through unchanged:
int
main(void)
{
    struct frame *frame = 0;
    while ((frame = frame_read(frame)))
        frame_write(frame);
}
Processing a frame is just a matter of adding some stuff to the body of the while loop.
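For instance, a filter that inverts every frame’s colors needs only a small helper called from the loop body (frame_invert is my name for it, not part of the original program):

```c
#include <stddef.h>

/* Invert 8-bit RGB data in place: each byte becomes 255 - byte.
 * Call as frame_invert(frame->data, frame->width, frame->height)
 * inside the identity filter's while loop, before frame_write(). */
static void
frame_invert(unsigned char *data, size_t width, size_t height)
{
    size_t n = width * height * 3;
    for (size_t i = 0; i < n; i++)
        data[i] = 255 - data[i];
}
```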
For the rolling shutter filter, in addition to the input frame we need an image to hold the result of the rolling shutter. Each input frame will be copied into the rolling shutter frame, but a little less will be copied from each frame, locking a little bit more of the image in place.
int
main(void)
{
    int shutter_step = 3;
    size_t shutter = 0;
    struct frame *f = frame_read(0);
    struct frame *out = frame_create(f->width, f->height);
    while (shutter < f->height && (f = frame_read(f))) {
        size_t offset = shutter * f->width * 3;
        size_t length = f->height * f->width * 3 - offset;
        memcpy(out->data + offset, f->data + offset, length);
        frame_write(out);
        shutter += shutter_step;
    }
    free(out);
    free(f);
}
The shutter_step controls how many rows are captured per frame of video. Generally capturing one row per frame is too slow for the simulation. For a 1080p video, that’s 1,080 frames for the entire simulation: 18 seconds at 60 FPS or 36 seconds at 30 FPS. If this program were to accept command line arguments, controlling the shutter rate would be one of the options.
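If it did, the argument handling could be as simple as this sketch (parse_step is hypothetical; the program as written hardcodes the step):

```c
#include <stdlib.h>

/* Hypothetical: read the shutter step from argv[1], keeping the
 * default of 3 rows per frame when absent or invalid. */
static int
parse_step(int argc, char **argv)
{
    if (argc > 1) {
        int step = atoi(argv[1]);
        if (step > 0)
            return step;
    }
    return 3;
}
```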
Putting it all together:
$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
      ./rolling-shutter | \
      x264 -o output.mp4 /dev/stdin
Here are some of the results for different shutter rates: 1, 3, 5, 8, 10, and 15 rows per frame. Feel free to right-click and “View Video” to see the full resolution video.
This post contains the full source in parts, but here it is all together:
Here’s the original video, filmed by my wife using her Nikon D5500, in case you want to try it for yourself:
It took much longer to figure out the string-pulling contraption to slowly spin the fan at a constant rate than it took to write the C filter program.
On Hacker News, morecoffee shared a video of the second order effect (direct link), where the rolling shutter speed changes over time.
A deeper analysis of rolling shutter: Playing detective with rolling shutter photos.
I keep a private installation prefix in my home directory under $HOME/.local/. Within are the standard /usr directories, such as bin/, include/, lib/, etc., containing my own software, libraries, and man pages. These are first-class citizens, indistinguishable from the system-installed programs and libraries. With one exception (setuid programs), none of this requires root privileges.
Installing software in $HOME serves two important purposes, both of which are indispensable to me on a regular basis.
Sometimes I don’t have root access, which prevents me from installing packaged software myself through the system’s package manager. Building and installing the software myself in my home directory, without involvement from the system administrator, neatly works around this issue. As a software developer, it’s already perfectly normal for me to build and run custom software, and this is just an extension of that behavior.
In the most desperate situation, all I need from the sysadmin is a decent C compiler and at least a minimal POSIX environment. I can bootstrap anything I might need, both libraries and programs, including a better C compiler along the way. This is one major strength of open source software.
I have noticed one alarming trend: Both GCC (since 4.8) and Clang are written in C++, so it’s becoming less and less reasonable to bootstrap a C++ compiler from a C compiler, or even from a C++ compiler that’s more than a few years old. So you may also need your sysadmin to supply a fairly recent C++ compiler if you want to bootstrap an environment that includes C++. I’ve had to avoid some C++ software (such as CMake) for this reason.
In theory this is what /usr/local is all about. It’s typically the location for software not managed by the system’s package manager. However, I think it’s cleaner to put this in $HOME/.local, so long as other system users don’t need it.
For example, I have an installation of each version of Emacs from 24.3 (the oldest version worth supporting) through the latest stable release, each suffixed with its version number, under $HOME/.local. This is useful for quickly running a test suite under different releases.
$ git clone https://github.com/skeeto/elfeed
$ cd elfeed/
$ make EMACS=emacs24.3 clean test
...
$ make EMACS=emacs25.2 clean test
...
Another example is NetHack, which I prefer to play with a couple of custom patches (Menucolors, wchar). The install to $HOME/.local is also captured as a patch.
$ tar xzf nethack-343-src.tar.gz
$ cd nethack-3.4.3/
$ patch -p1 < ~/nh343-menucolor.diff
$ patch -p1 < ~/nh343-wchar.diff
$ patch -p1 < ~/nh343-home-install.diff
$ sh sys/unix/setup.sh
$ make -j$(nproc) install
Normally NetHack wants to be setuid (e.g. run as the “games” user) in order to restrict access to high scores, saves, and bones — saved levels where a player died, to be inserted randomly into other players’ games. This prevents cheating, but requires root to set up. Fortunately, when I install NetHack in my home directory, this isn’t a feature I actually care about, so I can ignore it.
Mutt is in a similar situation, since it wants to install a special setgid program (mutt_dotlock) that synchronizes mailbox access. All MUAs need something like this.
Everything described below is relevant to basically any modern unix-like system: Linux, BSD, etc. I personally install software in $HOME across a variety of systems and, fortunately, it mostly works the same way everywhere. This is probably in large part due to everyone standardizing around the GCC and GNU binutils interfaces, even if the system compiler is actually LLVM/Clang.
Out of the box, installing things in $HOME/.local won’t do anything useful. You need to set up some environment variables in your shell configuration (i.e. .profile, .bashrc, etc.) to tell various programs, such as your shell, about it. The most obvious variable is $PATH:
export PATH=$HOME/.local/bin:$PATH
Notice I put it in the front of the list. This is because I want my home directory programs to override system programs with the same name. For what other reason would I install a program with the same name if not to override the system program?
In the simplest situation this is good enough, but in practice you’ll probably need to set a few more things. If you install libraries in your home directory and expect to use them just as if they were installed on the system, you’ll need to tell the compiler where else to look for those headers and libraries, both for C and C++.
export C_INCLUDE_PATH=$HOME/.local/include
export CPLUS_INCLUDE_PATH=$HOME/.local/include
export LIBRARY_PATH=$HOME/.local/lib
The first two are like the -I compiler option and the third is like the -L linker option, except you usually won’t need to use them explicitly. Unfortunately LIBRARY_PATH doesn’t override the system library paths, so in some cases you will need to explicitly set -L. Otherwise you will still end up linking against the system library rather than the custom packaged version. I really wish GCC and Clang didn’t behave this way.
Some software uses pkg-config to determine its compiler and linker flags, and your home directory will contain some of the needed information. So set that up too:
export PKG_CONFIG_PATH=$HOME/.local/lib/pkgconfig
Finally, when you install libraries in your home directory, the run-time dynamic linker will need to know where to find them. There are a couple of ways to deal with this: the crude LD_LIBRARY_PATH environment variable, or the more precise ELF runpath. For the crude way, point the run-time linker at your lib/ and you’re done:
export LD_LIBRARY_PATH=$HOME/.local/lib
However, this is like using a shotgun to kill a fly. If you install a library in your home directory that is also installed on the system, and then run a system program, it may be linked against your library rather than the library installed on the system as was originally intended. This could have detrimental effects.
The precision method is to set the ELF “runpath” value. It’s like a per-binary LD_LIBRARY_PATH. The run-time linker uses this path first in its search for libraries, and it will only have an effect on that particular program/library. This also applies to dlopen().
Some software will configure the runpath by default in its build system, but often you need to configure this yourself. The simplest way is to set the LD_RUN_PATH environment variable when building the software. Another option is to manually pass -rpath options to the linker via LDFLAGS. It’s used directly like this:
$ gcc -Wl,-rpath=$HOME/.local/lib -o foo bar.o baz.o -lquux
Verify with readelf:
$ readelf -d foo | grep runpath
Library runpath: [/home/username/.local/lib]
ELF supports a special $ORIGIN “variable” set to the binary’s location. This allows the program and associated libraries to be installed anywhere without changes, so long as they have the same relative position to each other. (Note the quotes to prevent shell interpolation.)
$ gcc -Wl,-rpath='$ORIGIN/../lib' -o foo bar.o baz.o -lquux
There is one situation where runpath won’t work: when you want a system-installed program to find a home directory library with dlopen() — e.g. as an extension to that program. You either need to ensure it uses a relative or absolute path (i.e. the argument to dlopen() contains a slash) or you must use LD_LIBRARY_PATH.
Personally, I always use the Worse-is-Better LD_LIBRARY_PATH shotgun. Occasionally it’s caused some annoying issues, but the vast majority of the time it gets the job done with little fuss. This is just my personal development environment, after all, not a production server.
Another potentially tricky issue is man pages. When a program or library installs a man page in your home directory, it would certainly be nice to access it with man <topic> just as if it were installed on the system. Fortunately, Debian and Debian-derived systems, using a mechanism I haven’t yet figured out, discover home directory man pages automatically without any assistance. No configuration needed.
It’s more complicated on other systems, such as the BSDs. You’ll need to set the MANPATH variable to include $HOME/.local/share/man. It’s unset by default, and it overrides the system settings, which means you need to manually include the system paths. The manpath program can help with this … if it’s available.
export MANPATH=$HOME/.local/share/man:$(manpath)
I haven’t figured out a portable way to deal with this issue, so I mostly ignore it.
While I’ve poo-pooed autoconf in the past, the standard configure script usually makes it trivial to build and install software in $HOME. The key ingredient is the --prefix option:
$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
Most of the time it’s that simple! If you’re linking against your own libraries and want to use runpath, it’s a little more complicated:
$ ./configure --prefix=$HOME/.local \
      LDFLAGS="-Wl,-rpath=$HOME/.local/lib"
For CMake, there’s CMAKE_INSTALL_PREFIX:
$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..
The CMake builds I’ve seen use ELF runpath by default, and no further configuration may be required to make that work. I’m sure that’s not always the case, though.
Some software is just a single, static, standalone binary with everything baked in. It doesn’t need to be given a prefix, and installation is as simple as copying the binary into place. For example, Enchive works like this:
$ git clone https://github.com/skeeto/enchive
$ cd enchive/
$ make
$ cp enchive ~/.local/bin
Some software uses its own unique configuration interface. I can respect that, but it does add some friction for users who now have something additional and non-transferable to learn. I demonstrated a NetHack build above, which has a configuration much more involved than it really should be. Another example is LuaJIT, which uses make variables that must be provided consistently on every invocation:
$ tar xzf LuaJIT-2.0.5.tar.gz
$ cd LuaJIT-2.0.5/
$ make -j$(nproc) PREFIX=$HOME/.local
$ make PREFIX=$HOME/.local install
(You can use the “install” target to both build and install, but I wanted to illustrate the repetition of PREFIX.)
Some libraries aren’t so smart about pkg-config and need some handholding — for example, ncurses. I mention it because it’s required for both Vim and Emacs, among many others, so I’m often building it myself. It ignores --prefix and needs to be told a second time where to install things:
$ ./configure --prefix=$HOME/.local \
--enable-pc-files \
--with-pkg-config-libdir=$PKG_CONFIG_PATH
Another issue is that a whole lot of software has been hardcoded for
ncurses 5.x (i.e. ncurses5-config
), and it requires hacks/patching
to make it behave properly with ncurses 6.x. I’ve avoided ncurses 6.x
for this reason.
I could go on and on like this, discussing the quirks for the various libraries and programs that I use. Over the years I’ve gotten used to many of these issues, committing the solutions to memory. Unfortunately, even within the same version of a piece of software, the quirks can change between major operating system releases, so I’m continuously learning my way around new issues. It’s really given me an appreciation for all the hard work that package maintainers put into customizing and maintaining software builds to fit properly into a larger ecosystem.
This article has a followup.
Linux has an elegant and beautiful design when it comes to threads: threads are nothing more than processes that share a virtual address space and file descriptor table. Threads spawned by a process are additional child processes of the main “thread’s” parent process. They’re manipulated through the same process management system calls, eliminating the need for a separate set of thread-related system calls. It’s elegant in the same way file descriptors are elegant.
Normally on Unix-like systems, processes are created with fork(). The new process gets its own address space and file descriptor table that starts as a copy of the original. (Linux uses copy-on-write to do this part efficiently.) However, this is too high level for creating threads, so Linux has a separate clone() system call. It works just like fork() except that it accepts a number of flags to adjust its behavior, primarily to share parts of the parent’s execution context with the child.
It’s so simple that it takes less than 15 instructions to spawn a thread with its own stack, no libraries needed, and no need to call Pthreads! In this article I’ll demonstrate how to do this on x86-64. All of the code will be written in NASM syntax since, IMHO, it’s by far the best (see: nasm-mode).
I’ve put the complete demo here if you want to see it all at once:
I want you to be able to follow along even if you aren’t familiar with x86-64 assembly, so here’s a short primer of the relevant pieces. If you already know x86-64 assembly, feel free to skip to the next section.
x86-64 has 16 64-bit general purpose registers, primarily used to manipulate integers, including memory addresses. There are many more registers than this with more specific purposes, but we won’t need them for threading.
- rsp: stack pointer
- rbp: “base” pointer (still used in debugging and profiling)
- rax, rbx, rcx, rdx: general purpose (notice: a, b, c, d)
- rdi, rsi: “destination” and “source”, now meaningless names
- r8, r9, r10, r11, r12, r13, r14, r15: added for x86-64

The “r” prefix indicates that they’re 64-bit registers. It won’t be relevant in this article, but the same name prefixed with “e” indicates the lower 32 bits of these same registers, and no prefix indicates the lowest 16 bits. This is because x86 was originally a 16-bit architecture, extended to 32 bits, then to 64 bits. Historically each of these registers had a specific, unique purpose, but on x86-64 they’re almost completely interchangeable.
There’s also a “rip” instruction pointer register that conceptually walks along the machine instructions as they’re being executed, but, unlike the other registers, it can only be manipulated indirectly. Remember that data and code live in the same address space, so rip is not much different than any other data pointer.
The rsp register points to the “top” of the call stack. The stack keeps track of who called the current function, in addition to local variables and other function state (a stack frame). I put “top” in quotes because the stack actually grows downward on x86 towards lower addresses, so the stack pointer points to the lowest address on the stack. This piece of information is critical when talking about threads, since we’ll be allocating our own stacks.
The stack is also sometimes used to pass arguments to another function. This happens much less frequently on x86-64, especially with the System V ABI used by Linux, where the first 6 arguments are passed via registers. The return value is passed back via rax. When calling another function, integer/pointer arguments are passed in these registers in this order: rdi, rsi, rdx, rcx, r8, r9.
So, for example, to perform a function call like foo(1, 2, 3)
, store
1, 2 and 3 in rdi, rsi, and rdx, then call
the function. The mov
instruction stores the source (second) operand in its destination
(first) operand. The call
instruction pushes the current value of
rip onto the stack, then sets rip (jumps) to the address of the
target function. When the callee is ready to return, it uses the ret
instruction to pop the original rip value off the stack and back
into rip, returning control to the caller.
mov rdi, 1
mov rsi, 2
mov rdx, 3
call foo
Called functions must preserve the contents of these registers (the same value must be stored when the function returns): rbx, rsp, rbp, r12, r13, r14, r15.
When making a system call, the argument registers are slightly different: rdi, rsi, rdx, r10, r8, r9. Notice rcx has been changed to r10.
Each system call has an integer identifying it. This number is
different on each platform, but, in Linux’s case, it will never
change. Instead of call
, rax is set to the number of the
desired system call and the syscall
instruction makes the request to
the OS kernel. Prior to x86-64, this was done with an old-fashioned
interrupt. Because interrupts are slow, a special,
statically-positioned “vsyscall” page (now deprecated as a security
hazard), later vDSO, is provided to allow certain system
calls to be made as function calls. We’ll only need the syscall
instruction in this article.
So, for example, the write() system call has this C prototype.
ssize_t write(int fd, const void *buf, size_t count);
On x86-64, the write() system call is at the top of the system call
table as system call 1 (read() is 0). Standard output is file
descriptor 1 by default (standard input is 0). The following bit of
code will write 10 bytes of data from the memory address buffer
(a
symbol defined elsewhere in the assembly program) to standard output.
The number of bytes written, or -1 for error, will be returned in rax.
mov rdi, 1 ; fd
mov rsi, buffer
mov rdx, 10 ; 10 bytes
mov rax, 1 ; SYS_write
syscall
There’s one last thing you need to know: registers often hold a memory
address (i.e. a pointer), and you need a way to read the data behind
that address. In NASM syntax, wrap the register in brackets (e.g.
[rax]
), which, if you’re familiar with C, would be the same as
dereferencing the pointer.
These bracket expressions, called an effective address, may be
limited mathematical expressions to offset that base address
entirely within a single instruction. This expression can include
another register (index), a power-of-two scalar (bit shift), and
an immediate signed offset. For example, [rax + rdx*8 + 12]
. If
rax is a pointer to a struct, and rdx is an array index to an element
in array on that struct, only a single instruction is needed to read
that element. NASM is smart enough to allow the assembly programmer to
break this mold a little bit with more complex expressions, so long as
it can reduce it to the [base + index*2^exp + offset]
form.
The details of addressing aren’t important this for this article, so don’t worry too much about it if that didn’t make sense.
Threads share everything except for registers, a stack, and thread-local storage (TLS). The OS and underlying hardware will automatically ensure that registers are per-thread. Since it’s not essential, I won’t cover thread-local storage in this article. In practice, the stack is often used for thread-local data anyway. That leaves the stack: before we can spawn a new thread, we need to allocate a stack, which is nothing more than a memory buffer.
The trivial way to do this would be to reserve some fixed .bss (zero-initialized) storage for threads in the executable itself, but I want to do it the Right Way and allocate the stack dynamically, just as Pthreads, or any other threading library, would. Otherwise the application would be limited to a compile-time fixed number of threads.
You can’t just read from and write to arbitrary addresses in virtual memory; you first have to ask the kernel to allocate pages. There are two system calls on Linux to do this:
brk(): Extends (or shrinks) the heap of a running process, typically located somewhere shortly after the .bss segment. Many allocators will do this for small or initial allocations. This is a less optimal choice for thread stacks because the stacks will be very near other important data, near other stacks, and lack a guard page (by default). It would be somewhat easier for an attacker to exploit a buffer overflow. A guard page is a locked-down page just past the absolute end of the stack that will trigger a segmentation fault on a stack overflow, rather than allow a stack overflow to trash other memory undetected. A guard page could still be created manually with mprotect(). Also, there’s no room for these stacks to grow.
mmap(): Use an anonymous mapping to allocate a contiguous set of pages at some randomized memory location. As we’ll see, you can even tell the kernel specifically that you’re going to use this memory as a stack. Also, this is simpler than using brk() anyway.
On x86-64, mmap() is system call 9. I’ll define a function to allocate a stack with this C prototype.
void *stack_create(void);
The mmap() system call takes 6 arguments, but when creating an anonymous memory map the last two arguments are ignored. For our purposes, it looks like this C prototype.
void *mmap(void *addr, size_t length, int prot, int flags);
For flags
, we’ll choose a private, anonymous mapping that, being a
stack, grows downward. Even with that last flag, the system call will
still return the bottom address of the mapping, which will be
important to remember later. It’s just a simple matter of setting the
arguments in the registers and making the system call.
%define SYS_mmap 9
%define STACK_SIZE (4096 * 1024) ; 4 MB
%define PROT_READ 0x01           ; values from the Linux userspace headers
%define PROT_WRITE 0x02
%define MAP_PRIVATE 0x02
%define MAP_ANONYMOUS 0x20
%define MAP_GROWSDOWN 0x0100
stack_create:
mov rdi, 0
mov rsi, STACK_SIZE
mov rdx, PROT_WRITE | PROT_READ
mov r10, MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN
mov rax, SYS_mmap
syscall
ret
Now we can allocate new stacks (or stack-sized buffers) as needed.
Spawning a thread is so simple that it doesn’t even require a branch instruction! It’s a call to clone() with two arguments: clone flags and a pointer to the new thread’s stack. It’s important to note that, as in many cases, the glibc wrapper function has the arguments in a different order than the system call. With the set of flags we’re using, it takes two arguments.
long sys_clone(unsigned long flags, void *child_stack);
Our thread spawning function will have this C prototype. It takes a function as its argument and starts the thread running that function.
long thread_create(void (*)(void));
The function pointer argument is passed via rdi, per the ABI. Store
this for safekeeping on the stack (push
) in preparation for calling
stack_create(). When it returns, the address of the low end of stack
will be in rax.
%define SYS_clone 56
%define CLONE_VM 0x00000100      ; values from linux/sched.h
%define CLONE_FS 0x00000200
%define CLONE_FILES 0x00000400
%define CLONE_SIGHAND 0x00000800
%define CLONE_PARENT 0x00008000
%define CLONE_THREAD 0x00010000
%define CLONE_IO 0x80000000
thread_create:
push rdi
call stack_create
lea rsi, [rax + STACK_SIZE - 8]
pop qword [rsi]
mov rdi, CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
CLONE_PARENT | CLONE_THREAD | CLONE_IO
mov rax, SYS_clone
syscall
ret
The second argument to clone() is a pointer to the high address of
the stack (specifically, just above the stack). So we need to add
STACK_SIZE
to rax to get the high end. This is done with the lea
instruction: load effective address. Despite the brackets,
it doesn’t actually read memory at that address, but instead stores
the address in the destination register (rsi). I’ve moved it back by 8
bytes because I’m going to place the thread function pointer at the
“top” of the new stack in the next instruction. You’ll see why in a
moment.
Remember that the function pointer was pushed onto the stack for safekeeping. This is popped off the current stack and written to that reserved space on the new stack.
As you can see, it takes a lot of flags to create a thread with clone(). Most things aren’t shared with the callee by default, so lots of options need to be enabled. See the clone(2) man page for full details on these flags.
- CLONE_THREAD: Put the new process in the same thread group.
- CLONE_VM: Runs in the same virtual memory space.
- CLONE_PARENT: Share a parent with the callee.
- CLONE_SIGHAND: Share signal handlers.
- CLONE_FS, CLONE_FILES, CLONE_IO: Share filesystem information.

A new thread will be created and the syscall will return in each of the two threads at the same instruction, exactly like fork(). All registers will be identical between the threads, except for rax, which will be 0 in the new thread, and rsp which has the same value as rsi in the new thread (the pointer to the new stack).
Now here’s the really cool part, and the reason branching isn’t
needed. There’s no reason to check rax to determine if we are the
original thread (in which case we return to the caller) or if we’re
the new thread (in which case we jump to the thread function).
Remember how we seeded the new stack with the thread function? When
the new thread returns (ret
), it will jump to the thread function
with a completely empty stack. The original thread, using the original
stack, will return to the caller.
The value returned by thread_create() is the process ID of the new
thread, which is essentially the thread object (e.g. Pthread’s
pthread_t
).
The thread function has to be careful not to return (ret
) since
there’s nowhere to return. It will fall off the stack and terminate
the program with a segmentation fault. Remember that threads are just
processes? It must use the exit() syscall to terminate. This won’t
terminate the other threads.
%define SYS_exit 60
exit:
mov rax, SYS_exit
syscall
Before exiting, it should free its stack with the munmap() system call, so that no resources are leaked by the terminated thread. The equivalent of pthread_join() by the main parent would be to use the wait4() system call on the thread process.
If you found this interesting, be sure to check out the full demo link
at the top of this article. Now with the ability to spawn threads,
it’s a great opportunity to explore and experiment with x86’s
synchronization primitives, such as the lock
instruction prefix,
xadd
, and compare-and-exchange (cmpxchg
). I’ll discuss
these in a future article.
Monday’s /r/dailyprogrammer challenge was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term
(u(0)
) and a sequence of operations, f
, to apply to the previous
term (u(n + 1) = f(u(n))
) to compute the next term. Since it’s an
easy challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.
For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5
would be
input as +2 *3 -5
. If u(0) = 0
then,
u(1) = 1
u(2) = 4
u(3) = 13
u(4) = 40
u(5) = 121
Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86-64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I’ll go through how it works and how I did it.
Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).
Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren’t allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.
The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).
Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, NULL
points to a special page with read, write, and execute disabled.
Memory returned by malloc()
and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through malloc()
, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an asmbuf
struct.
#define PAGE_SIZE 4096
struct asmbuf {
uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
uint64_t count;
};
To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use sysconf(_SC_PAGESIZE)
to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.
Instead of malloc()
, the compiler allocates memory as an anonymous
memory map (mmap()
). It’s anonymous because it’s not backed by a
file.
struct asmbuf *
asmbuf_create(void)
{
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}
Windows doesn’t have POSIX mmap()
, so on that platform we use
VirtualAlloc()
instead. Here’s the equivalent in Win32.
struct asmbuf *
asmbuf_create(void)
{
DWORD type = MEM_RESERVE | MEM_COMMIT;
return VirtualAlloc(NULL, PAGE_SIZE, type, PAGE_READWRITE);
}
Anyone reading closely should notice that I haven’t actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: “write xor execute.” That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.
The POSIX mprotect()
function is used to change memory protections.
void
asmbuf_finalize(struct asmbuf *buf)
{
mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}
Or on Win32 (that last parameter is not allowed to be NULL
),
void
asmbuf_finalize(struct asmbuf *buf)
{
DWORD old;
VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}
Finally, instead of free()
it gets unmapped.
void
asmbuf_free(struct asmbuf *buf)
{
munmap(buf, PAGE_SIZE);
}
And on Win32,
void
asmbuf_free(struct asmbuf *buf)
{
VirtualFree(buf, 0, MEM_RELEASE);
}
I won’t list the definitions here, but there are two “methods” for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.
asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
asmbuf_immediate(struct asmbuf *, int size, const void *value);
We’re only going to be concerned with three of x86-64’s many
registers: rdi
, rax
, and rdx
. These are 64-bit (r
) extensions
of the original 16-bit 8086 registers. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here’s what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.
long recurrence(long);
The System V AMD64 ABI calling convention says that the first
integer/pointer function argument is passed in the rdi
register.
When our JIT compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in rax
when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy rdi
to rax
.
mov rax, rdi
There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in asmbuf
. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in rcx
rather than rdi
. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.
The very last thing it will do, assuming the result is in rax
, is
return to the caller.
ret
So we know the assembly, but what do we pass to asmbuf_ins()
? This
is where we get our hands dirty.
If you want to do this the Right Way, you go download the x86-64 documentation, look up the instructions we’re using, and manually work out the bytes we need and how the operands fit into it. You know, like they used to do out of necessity back in the 60’s.
Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file peek.s
and hand it to nasm
. It will produce a raw binary
with the machine code, which we’ll disassemble with ndisasm
(the
NASM disassembler).
$ nasm peek.s
$ ndisasm -b64 peek
00000000 4889F8 mov rax,rdi
00000003 C3 ret
That’s straightforward. The first instruction is 3 bytes and the return is 1 byte.
asmbuf_ins(buf, 3, 0x4889f8); // mov rax, rdi
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3); // ret
For each operation, we’ll set it up so the operand will already be
loaded into rdi
regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32-bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use 0x0123456789abcdef
as the
operand.
mov rdi, 0x0123456789abcdef
Which disassembled with ndisasm
is,
00000000 48BFEFCDAB896745 mov rdi,0x123456789abcdef
-2301
Notice the operand listed little endian immediately after the instruction. That’s also easy!
long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf); // mov rdi, operand
asmbuf_immediate(buf, 8, &operand);
Apply the same discovery process individually for each operator you
want to support, accumulating the result in rax
for each.
switch (operator) {
case '+':
asmbuf_ins(buf, 3, 0x4801f8); // add rax, rdi
break;
case '-':
asmbuf_ins(buf, 3, 0x4829f8); // sub rax, rdi
break;
case '*':
asmbuf_ins(buf, 4, 0x480fafc7); // imul rax, rdi
break;
case '/':
asmbuf_ins(buf, 3, 0x4831d2); // xor rdx, rdx
asmbuf_ins(buf, 3, 0x48f7ff); // idiv rdi
break;
}
As an exercise, try adding support for the modulus operator (%), XOR
(^), and bit shifts (<<, >>). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the closed form solution to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.
Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a void *
just to avoid repeating myself, since that will implicitly
cast to the correct function pointer prototype.)
asmbuf_finalize(buf);
long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);
That’s pretty cool if you ask me! Now this was an extremely simplified situation. There’s no branching, no intermediate values, no function calls, and I didn’t even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it’s a 1:1 translation.
I’d like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that’s more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.
Last week in Handmade Hero (days 21-25), Casey Muratori added interactive programming to the game engine. This is especially useful in game development, where the developer might want to tweak, say, a boss fight without having to restart the entire game after each tweak. Now that I’ve seen it done, it seems so obvious. The secret is to build almost the entire application as a shared library.
This puts a serious constraint on the design of the program: it
cannot keep any state in global or static variables, though this
should be avoided anyway. Global state will be lost each
time the shared library is reloaded. In some situations, this can also
restrict use of the C standard library, including functions like
malloc()
, depending on how these functions are implemented or
linked. For example, if the C standard library is statically linked,
functions with global state may introduce global state into the shared
library. It’s difficult to know what’s safe to use. This works fine in
Handmade Hero because the core game, the part loaded as a shared
library, makes no use of external libraries, including the standard
library.
Additionally, the shared library must be careful with its use of function pointers. The functions being pointed at will no longer exist after a reload. This is a real issue when combining interactive programming with object oriented C.
To demonstrate how this works, let’s go through an example. I wrote a simple ncurses Game of Life demo that’s easy to modify. You can get the entire source here if you’d like to play around with it yourself on a Unix-like system.
Quick start:
- Run make, then ./main. Press r to randomize and q to quit.
- Edit game.c to change the Game of Life rules, add colors, etc.
- Run make again. Your changes will be reflected immediately in the original program!

As of this writing, Handmade Hero is being written on Windows, so Casey is using a DLL and the Win32 API, but the same technique can be applied on Linux, or any other Unix-like system, using libdl. That’s what I’ll be using here.
The program will be broken into two parts: the Game of Life shared library (“game”) and a wrapper (“main”) whose job is only to load the shared library, reload it when it updates, and call it at a regular interval. The wrapper is agnostic about the operation of the “game” portion, so it could be re-used almost untouched in another project.
To avoid maintaining a whole bunch of function pointer assignments in
several places, the API to the “game” is enclosed in a struct. This
also eliminates warnings from the C compiler about mixing data and
function pointers. The layout and contents of the game_state
struct is private to the game itself. The wrapper will only handle a
pointer to this struct.
struct game_state;
struct game_api {
struct game_state *(*init)();
void (*finalize)(struct game_state *state);
void (*reload)(struct game_state *state);
void (*unload)(struct game_state *state);
bool (*step)(struct game_state *state);
};
In the demo the API is made of 5 functions. The first 4 are primarily concerned with loading and unloading.
- init(): Allocate and return a state to be passed to every other API call. This will be called once when the program starts and never again, even after reloading. If we were concerned about using malloc() in the shared library, the wrapper would be responsible for performing the actual memory allocation.
- finalize(): The opposite of init(), to free all resources held by the game state.
- reload(): Called immediately after the library is reloaded. This is the chance to sneak in some additional initialization in the running program. Normally this function will be empty. It’s only used temporarily during development.
- unload(): Called just before the library is unloaded, before a new version is loaded. This is a chance to prepare the state for use by the next version of the library. This can be used to update structs and such, if you wanted to be really careful. This would also normally be empty.
- step(): Called at a regular interval to run the game. A real game will likely have a few more functions like this.
The library will provide a filled out API struct as a global variable,
GAME_API
. This is the only exported symbol in the entire shared
library! All functions will be declared static, including the ones
referenced by the structure.
const struct game_api GAME_API = {
.init = game_init,
.finalize = game_finalize,
.reload = game_reload,
.unload = game_unload,
.step = game_step
};
The wrapper is focused on calling dlopen()
, dlsym()
, and
dlclose()
in the right order at the right time. The game will be
compiled to the file libgame.so
, so that’s what will be loaded. It’s
written in the source with a ./
to force the name to be used as a
filename. The wrapper keeps track of everything in a game
struct.
const char *GAME_LIBRARY = "./libgame.so";
struct game {
void *handle;
ino_t id;
struct game_api api;
struct game_state *state;
};
The handle
is the value returned by dlopen()
. The id
is the
inode of the shared library, as returned by stat()
. The rest is
defined above. Why the inode? We could use a timestamp instead, but
that’s indirect. What we really care about is if the shared object
file is actually a different file than the one that was loaded. The
file will never be updated in place, it will be replaced by the
compiler/linker, so the timestamp isn’t what’s important.
Using the inode is a much simpler situation than in Handmade Hero. Due to Windows’ broken file locking behavior, the game DLL can’t be replaced while it’s being used. To work around this limitation, the build system and the loader have to rely on randomly-generated filenames.
void game_load(struct game *game)
The purpose of the game_load()
function is to load the game API into
a game
struct, but only if either it hasn’t been loaded yet or if
it’s been updated. Since it has several independent failure
conditions, let’s examine it in parts.
struct stat attr;
if ((stat(GAME_LIBRARY, &attr) == 0) && (game->id != attr.st_ino)) {
First, use stat()
to determine if the library’s inode is different
than the one that’s already loaded. The id
field will be 0
initially, so as long as stat()
succeeds, this will load the library
the first time.
if (game->handle) {
game->api.unload(game->state);
dlclose(game->handle);
}
If a library is already loaded, unload it first, being sure to call
unload()
to inform the library that it’s being updated. It’s
critically important that dlclose()
happens before dlopen()
. On
my system, dlopen()
looks only at the string it’s given, not the
file behind it. Even though the file has been replaced on the
filesystem, dlopen()
will see that the string matches a library
already opened and return a pointer to the old library. (Is this a
bug?) The handles are reference counted internally by libdl.
void *handle = dlopen(GAME_LIBRARY, RTLD_NOW);
Finally load the game library. There's a race condition here that cannot be helped due to limitations of dlopen(). The library may have been updated again since the call to stat(). Since we can't ask dlopen() about the inode of the library it opened, we can't know. But as this is only used during development, not in production, it's not a big deal.
if (handle) {
game->handle = handle;
game->id = attr.st_ino;
/* ... more below ... */
} else {
game->handle = NULL;
game->id = 0;
}
If dlopen() fails, it will return NULL. In the case of ELF, this will happen if the compiler/linker is still in the process of writing out the shared library. Since the unload was already done, this means no game will be loaded when game_load() returns. The user of the struct needs to be prepared for this eventuality and will need to try loading again later (i.e. in a few milliseconds). It may be worth filling the API with stub functions when no library is loaded.
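For instance, a minimal set of do-nothing stubs might look like this sketch. (The game_api field names here are assumptions for illustration, not the article's exact definitions.)

```c
#include <stddef.h>

/* Illustrative mirror of the game API struct; the field names
 * are assumed, not taken from the article's header. */
struct game_api {
    void *(*init)(void);
    void  (*reload)(void *state);
    void  (*unload)(void *state);
    int   (*step)(void *state);
};

/* Do-nothing stubs: safe to call while no library is loaded. */
static void *stub_init(void)      { return NULL; }
static void  stub_reload(void *s) { (void)s; }
static void  stub_unload(void *s) { (void)s; }
static int   stub_step(void *s)   { (void)s; return 1; /* keep running */ }

static const struct game_api stub_api = {
    stub_init, stub_reload, stub_unload, stub_step,
};
```

After a failed dlopen(), the loader could assign game->api = stub_api, so the main loop never has to check the handle before calling step().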
const struct game_api *api = dlsym(game->handle, "GAME_API");
if (api != NULL) {
game->api = *api;
if (game->state == NULL)
game->state = game->api.init();
game->api.reload(game->state);
} else {
dlclose(game->handle);
game->handle = NULL;
game->id = 0;
}
When the library loads without error, look up the GAME_API struct that was mentioned before and copy it into the local struct. Copying rather than using the pointer avoids one more layer of indirection when making function calls. The game state is initialized if it hasn't been already, and the reload() function is called to inform the game it's just been reloaded.
If looking up the GAME_API fails, close the handle and consider it a failure.
The main loop calls game_load() each time around. And that's it!
int main(void)
{
struct game game = {0};
for (;;) {
game_load(&game);
if (game.handle)
if (!game.api.step(game.state))
break;
usleep(100000);
}
game_unload(&game);
return 0;
}
Now that I have this technique in my toolbelt, it has me itching to develop a proper, full game in C with OpenGL and all, perhaps in another Ludum Dare. The ability to develop interactively is very appealing.
Update 2020: DOS Defender was featured on GET OFF MY LAWN.
This past weekend I participated in Ludum Dare #31. Before the theme was even announced, due to recent fascination I wanted to make an old school DOS game. DOSBox would be the target platform since it’s the most practical way to run DOS applications anymore, despite modern x86 CPUs still being fully backwards compatible all the way back to the 16-bit 8086.
I successfully created and submitted a DOS game called DOS Defender. It’s a 32-bit 80386 real mode DOS COM program. All assets are embedded in the executable and there are no external dependencies, so the entire game is packed into that 10kB binary.
You’ll need a joystick/gamepad in order to play. I included mouse support in the Ludum Dare release in order to make it easier to review, but this was removed because it doesn’t work well.
The most technically interesting part is that I didn't need any DOS development tools to create this! I only used my everyday Linux C compiler (gcc). It's not actually possible to build DOS Defender in DOS. Instead, I'm treating DOS as an embedded platform, which is the only form in which DOS still exists today. Along with DOSBox and DOSEMU, this is a pretty comfortable toolchain.
If all you care about is how to do this yourself, skip to the “Tricking GCC” section, where we’ll write a “Hello, World” DOS COM program with Linux’s GCC.
I didn’t have GCC in mind when I started this project. What really triggered all of this was that I had noticed Debian’s bcc package, Bruce’s C Compiler, that builds 16-bit 8086 binaries. It’s kept around for compiling x86 bootloaders and such, but it can also be used to compile DOS COM files, which was the part that interested me.
For some background: the Intel 8086 was a 16-bit microprocessor released in 1978. It had none of the fancy features of today’s CPU: no memory protection, no floating point instructions, and only up to 1MB of RAM addressable. All modern x86 desktops and laptops can still pretend to be a 40-year-old 16-bit 8086 microprocessor, with the same limited addressing and all. That’s some serious backwards compatibility. This feature is called real mode. It’s the mode in which all x86 computers boot. Modern operating systems switch to protected mode as soon as possible, which provides virtual addressing and safe multi-tasking. DOS is not one of these operating systems.
Unfortunately, bcc is not an ANSI C compiler. It supports a subset of K&R C, along with inline x86 assembly. Unlike other 8086 C compilers, it has no notion of "far" or "long" pointers, so inline assembly is required to access other memory segments (VGA, clock, etc.).
Side note: the remnants of these 8086 "long pointers" still exist today in the Win32 API: LPSTR, LPWORD, LPDWORD, etc. The inline assembly isn't anywhere near as nice as GCC's inline assembly. The assembly code has to manually load variables from the stack, and since bcc supports two different calling conventions, the assembly ends up being hard-coded to one calling convention or the other.
Given all its limitations, I went looking for alternatives.
DJGPP is the DOS port of GCC. It’s a very impressive project, bringing almost all of POSIX to DOS. The DOS ports of many programs are built with DJGPP. In order to achieve this, it only produces 32-bit protected mode programs. If a protected mode program needs to manipulate hardware (i.e. VGA), it must make requests to a DOS Protected Mode Interface (DPMI) service. If I used DJGPP, I couldn’t make a single, standalone binary as I had wanted, since I’d need to include a DPMI server. There’s also a performance penalty for making DPMI requests.
Getting a DJGPP toolchain working can be difficult, to put it kindly. Fortunately I found a useful project, build-djgpp, that makes it easy, at least on Linux.
Either there’s a serious bug or the official DJGPP binaries have become infected again, because in my testing I kept getting the “Not COFF: check for viruses” error message when running my programs in DOSBox. To double check that it’s not an infection on my own machine, I set up a DJGPP toolchain on my Raspberry Pi, to act as a clean room. It’s impossible for this ARM-based device to get infected with an x86 virus. It still had the same problem, and all the binary hashes matched up between the machines, so it’s not my fault.
So given the DPMI issue and the above, I moved on.
What I finally settled on is a neat hack that involves “tricking” GCC into producing real mode DOS COM files, so long as it can target 80386 (as is usually the case). The 80386 was released in 1985 and was the first 32-bit x86 microprocessor. GCC still targets this instruction set today, even in the x86-64 toolchain. Unfortunately, GCC cannot actually produce 16-bit code, so my main goal of targeting 8086 would not be achievable. This doesn’t matter, though, since DOSBox, my intended platform, is an 80386 emulator.
In theory this should even work unchanged with MinGW, but there's a long-standing MinGW bug that prevents it from working right ("cannot perform PE operations on non PE output file"). It's still doable, and I did it myself, but you'll need to drop the OUTPUT_FORMAT directive and add an extra objcopy step (objcopy -O binary).
To demonstrate how to do all this, let’s make a DOS “Hello, World” COM program using GCC on Linux.
There's a significant burden with this technique: there will be no standard library. It's basically like writing an operating system from scratch, except for the few services DOS provides. This means no printf() or anything of the sort. Instead we'll ask DOS to print a string to the terminal. Making a request to DOS means firing an interrupt, which means inline assembly!
DOS has nine interrupts: 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x2F. The big one, and the one we're interested in, is 0x21, function 0x09 (print string). Between DOS and BIOS, there are thousands of functions called this way. I'm not going to try to explain x86 assembly, but in short the function number is stuffed into register ah and interrupt 0x21 is fired. Function 0x09 also takes an argument, the pointer to the string to be printed, which is passed in registers dx and ds.
Here's the GCC inline assembly print() function. Strings passed to this function must be terminated with a $. Why? Because DOS.
static void print(char *string)
{
asm volatile ("mov $0x09, %%ah\n"
"int $0x21\n"
: /* no output */
: "d"(string)
: "ah");
}
The assembly is declared volatile because it has a side effect (printing the string). To GCC, the assembly is an opaque hunk, and the optimizer relies on the output/input/clobber constraints (the last three lines). For DOS programs like this, all inline assembly will have side effects. This is because it's not being written for optimization but to access hardware and DOS, things not accessible to plain C.
Care must also be taken by the caller, because GCC doesn't know that the memory pointed to by string is ever read. It's likely the array that backs the string needs to be declared volatile too. This is all foreshadowing of what's to come: doing anything in this environment is an endless struggle against the optimizer. Not all of these battles can be won.
Now for the main function. The name of this function shouldn't matter, but I'm avoiding calling it main() since MinGW has funny ideas about mangling this particular symbol, even when it's asked not to.
int dosmain(void)
{
print("Hello, World!\n$");
return 0;
}
COM files are limited to 65,279 bytes in size. This is because an x86 memory segment is 64kB and COM files are simply loaded by DOS to 0x0100 in the segment and executed. There are no headers, it’s just a raw binary. Since a COM program can never be of any significant size, and no real linking needs to occur (freestanding), the entire thing will be compiled as one translation unit. It will be one call to GCC with a bunch of options.
Here are the essential compiler options.
-std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding
Since no standard libraries are in use, the only difference between gnu99 and c99 is that trigraphs are disabled (as they should be) and inline assembly can be written as asm instead of __asm__. It's a no-brainer. This project will be so closely tied to GCC that I don't care about using GCC extensions anyway.
I'm using -Os to keep the compiled output as small as possible. It will also make the program run faster. This is important when targeting DOSBox because, by default, it will deliberately run as slow as a machine from the 1980s. I want to be able to fit in that constraint. If the optimizer is causing problems, you may need to temporarily switch to -O0 to determine if the problem is your fault or the optimizer's.
You see, the optimizer doesn't understand that the program will be running in real mode and under its addressing constraints. It will perform all sorts of invalid optimizations that break your perfectly valid programs. It's not a GCC bug since we're doing crazy stuff here. I had to rework my code a number of times to stop the optimizer from breaking my program. For example, I had to avoid returning complex structs from functions because they'd sometimes be filled with garbage. The real danger here is that a future version of GCC will be more clever and will break more stuff. In this battle, volatile is your friend.
The next option is -nostdlib, since there are no valid libraries for us to link against, even statically.
The options -m32 -march=i386 set the compiler to produce 80386 code. If I were writing a bootloader for a modern computer, targeting 80686 would be fine too, but DOSBox is 80386.
The -ffreestanding argument requires that GCC not emit code that calls built-in standard library helper functions. Sometimes, instead of emitting code to do something, it emits code that calls a built-in function to do it, especially with math operators. This was one of the main problems I had with bcc, where this behavior couldn't be disabled. This option is most commonly used in writing bootloaders and kernels. And now DOS COM files.
The -Wl option is used to pass arguments to the linker (ld). We need it since we're doing all this in one call to GCC.
-Wl,--nmagic,--script=com.ld
The --nmagic option turns off page alignment of sections. One, we don't need it. Two, it would waste precious space. In my tests it doesn't appear to be necessary, but I'm including it just in case.
The --script option tells the linker that we want to use a custom linker script. This allows us to precisely lay out the sections (text, data, bss, rodata) of our program. Here's the com.ld script.
OUTPUT_FORMAT(binary)
SECTIONS
{
. = 0x0100;
.text :
{
*(.text);
}
.data :
{
*(.data);
*(.bss);
*(.rodata);
}
_heap = ALIGN(4);
}
The OUTPUT_FORMAT(binary) directive says not to put this into an ELF (or PE, etc.) file. The linker should just dump the raw code. A COM file is just raw code, so this means the linker will produce a COM file!
I had said that COM files are loaded to 0x0100. The fourth line offsets the binary to this location. The first byte of the COM file will still be the first byte of code, but it will be designed to run from that offset in memory.
What follows is all the sections: text (program), data (static data), bss (zero-initialized data), rodata (strings). Finally I mark the end of the binary with the symbol _heap. This will come in handy later for writing sbrk(), after we're done with "Hello, World." I've asked for the _heap position to be 4-byte aligned.
We’re almost there.
The linker is usually aware of our entry point (main) and sets that up for us. But since we asked for "binary" output, we're on our own. If the print() function is emitted first, our program's execution will begin with executing that function, which is invalid. Our program needs a little header stanza to get things started.
The linker script has a STARTUP option for handling this, but to keep it simple we'll put it right in the program. This is usually called crt0.o or Boot.o, in case those names ever come up in your own reading. This inline assembly must be the very first thing in our code, before any includes and such. DOS will do most of the setup for us; we really just have to jump to the entry point.
asm (".code16gcc\n"
"call dosmain\n"
"mov $0x4C, %ah\n"
"int $0x21\n");
The .code16gcc directive tells the assembler that we're going to be running in real mode, so that it makes the proper adjustments. Despite the name, this will not make it produce 16-bit code! First it calls dosmain, the function we wrote above. Then it informs DOS, using function 0x4C (terminate with return code), that we're done, passing the exit code along in the 1-byte register al (already set by dosmain). This inline assembly is automatically volatile because it has no inputs or outputs.
Here’s the entire C program.
asm (".code16gcc\n"
"call dosmain\n"
"mov $0x4C,%ah\n"
"int $0x21\n");
static void print(char *string)
{
asm volatile ("mov $0x09, %%ah\n"
"int $0x21\n"
: /* no output */
: "d"(string)
: "ah");
}
int dosmain(void)
{
print("Hello, World!\n$");
return 0;
}
I won't repeat com.ld. Here's the call to GCC.
gcc -std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding \
-o hello.com -Wl,--nmagic,--script=com.ld hello.c
And testing it in DOSBox:
From here if you want fancy graphics, it’s just a matter of making an interrupt and writing to VGA memory. If you want sound you can perform an interrupt for the PC speaker. I haven’t sorted out how to call Sound Blaster yet. It was from this point that I grew DOS Defender.
To cover one more thing, remember that _heap symbol? We can use it to implement sbrk() for dynamic memory allocation within the main program segment. This is real mode, and there's no virtual memory, so we're free to write to any memory we can address at any time. Some of this is reserved (i.e. low and high memory) for hardware. So using sbrk() specifically isn't really necessary, but it's interesting to implement ourselves.
As is normal on x86, your text and data segments are at a low address (0x0100 in this case) and the stack is at a high address (around 0xffff in this case). On Unix-like systems, the memory returned by malloc() comes from two places: sbrk() and mmap(). What sbrk() does is allocate memory just above the text/data segments, growing "up" towards the stack. Each call to sbrk() will grow this space (or leave it exactly the same). That memory would then be managed by malloc() and friends.
Here's how we can get sbrk() in a COM program. Notice I have to define my own size_t, since we don't have a standard library.
typedef unsigned short size_t;
extern char _heap;
static char *hbreak = &_heap;
static void *sbrk(size_t size)
{
char *ptr = hbreak;
hbreak += size;
return ptr;
}
It just sets a pointer to _heap and grows it as needed. A slightly smarter sbrk() would be careful about alignment as well.
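For example, a version that rounds each request up to a 4-byte boundary might look like this sketch. (It uses an ordinary array in place of the linker-provided _heap symbol so it can run outside DOS.)

```c
#include <stddef.h>

/* Stand-in for the linker-provided _heap symbol, so the sketch
 * is testable on a normal hosted system. */
static char heap_start[4096];
static char *hbreak = heap_start;

/* Like the sbrk() above, but rounds the size up to a multiple
 * of 4, so every returned pointer stays 4-byte aligned
 * (assuming the heap itself starts aligned). */
static void *sbrk_aligned(size_t size)
{
    char *ptr = hbreak;
    hbreak += (size + 3) & ~(size_t)3;
    return ptr;
}
```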
In the making of DOS Defender an interesting thing happened. I was (incorrectly) counting on the memory returned by my sbrk() being zeroed. This was the case the first time the game ran. However, DOS doesn't zero this memory between programs. When I would run my game again, it would pick right up where it left off, because the same data structures with the same contents were loaded back into place. A pretty cool accident! It's part of what makes this a fun embedded platform.
Suppose you're writing a function pass_match() that takes an input stream, an output stream, and a pattern. It works sort of like grep. It passes to the output each line of input that matches the pattern. The pattern string contains a shell glob pattern to be handled by POSIX fnmatch(). Here's what the interface looks like.
void pass_match(FILE *in, FILE *out, const char *pattern);
Glob patterns are simple enough that pre-compilation, as would be done for a regular expression, is unnecessary. The bare string is enough.
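For reference, a glob match is a single call on the bare pattern string (fnmatch() returns 0 on a match):

```c
#include <fnmatch.h>
#include <stdbool.h>

/* No compilation step: hand the raw pattern to fnmatch() each
 * time a string needs to be tested. */
static bool glob_match(const char *pattern, const char *string)
{
    return fnmatch(pattern, string, 0) == 0;
}
```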
Some time later the customer wants the program to support regular expressions in addition to shell-style glob patterns. For efficiency's sake, regular expressions need to be pre-compiled and so will not be passed to the function as a string. It will instead be a POSIX regex_t object. A quick-and-dirty approach might be to accept both and match whichever one isn't NULL.
void pass_match(FILE *in, FILE *out, const char *pattern, regex_t *re);
Bleh. This is ugly and won’t scale well. What happens when more kinds of filters are needed? It would be much better to accept a single object that covers both cases, and possibly even another kind of filter in the future.
One of the most common ways to customize the behavior of a function in C is to pass a function pointer. For example, the final argument to qsort() is a comparator that determines how objects get sorted.
For pass_match(), this function would accept a string and return a boolean value deciding if the string should be passed to the output stream. It gets called once on each line of input.
void pass_match(FILE *in, FILE *out, bool (*match)(const char *));
However, this has one of the same problems as qsort(): the passed function lacks context. It needs a pattern string or regex_t object to operate on. In other languages these would be attached to the function as a closure, but C doesn't have closures. It would need to be smuggled in via a global variable, which is not good.
static regex_t regex; // BAD!!!
bool regex_match(const char *string)
{
return regexec(&regex, string, 0, NULL, 0) == 0;
}
Because of the global variable, in practice pass_match() would be neither reentrant nor thread-safe. We could take a lesson from GNU's qsort_r() and accept a context to be passed to the filter function. This simulates a closure.
void pass_match(FILE *in, FILE *out,
bool (*match)(const char *, void *), void *context);
The provided context pointer would be passed to the filter function as the second argument, and no global variables are needed. This would probably be good enough for most purposes and it's about as simple as possible. The interface to pass_match() would cover any kind of filter.
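The body of this version might be sketched as below, assuming line-oriented input that fits in a fixed buffer. (The contains() callback is an illustrative example filter, not part of the article's API.)

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Copies each line of `in` to `out` when the filter accepts it.
 * The context pointer is passed through to the filter untouched. */
void pass_match(FILE *in, FILE *out,
                bool (*match)(const char *, void *), void *context)
{
    char line[4096];   /* assumption: lines fit in this buffer */
    while (fgets(line, sizeof(line), in))
        if (match(line, context))
            fputs(line, out);
}

/* Example filter: accepts lines containing the substring given
 * as the context (hypothetical, for illustration only). */
static bool contains(const char *line, void *context)
{
    return strstr(line, (const char *)context) != NULL;
}
```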
But wouldn’t it be nice to package the function and context together as one object?
How about putting the context on a struct and making an interface out of that? Here’s a tagged union that behaves as one or the other.
enum filter_type { GLOB, REGEX };
struct filter {
enum filter_type type;
union {
const char *pattern;
regex_t regex;
} context;
};
There's one function for interacting with this struct: filter_match(). It checks the type member and calls the correct function with the correct context.
bool filter_match(struct filter *filter, const char *string)
{
switch (filter->type) {
case GLOB:
return fnmatch(filter->context.pattern, string, 0) == 0;
case REGEX:
return regexec(&filter->context.regex, string, 0, NULL, 0) == 0;
}
abort(); // programmer error
}
And the pass_match() API now looks like this. This will be the final change to pass_match(), both in implementation and interface.
void pass_match(FILE *input, FILE *output, struct filter *filter);
It still doesn't care how the filter works, so it's good enough to cover all future cases. It just calls filter_match() on the pointer it was given. However, the switch and tagged union aren't friendly to extension. Really, it's outright hostile. We finally have some degree of polymorphism, but it's crude. It's like building duct tape into a design. Adding new behavior means adding another switch case. This is a step backwards. We can do better.
With the switch we're no longer taking advantage of function pointers. So what about putting a function pointer on the struct?
struct filter {
bool (*match)(struct filter *, const char *);
};
The filter itself is passed as the first argument, providing context. In object-oriented languages, that's the implicit this argument. To avoid requiring the caller to worry about this detail, we'll hide it in a new switch-free version of filter_match().
bool filter_match(struct filter *filter, const char *string)
{
return filter->match(filter, string);
}
Notice we’re still lacking the actual context, the pattern string or the regex object. Those will be different structs that embed the filter struct.
struct filter_regex {
struct filter filter;
regex_t regex;
};
struct filter_glob {
struct filter filter;
const char *pattern;
};
For both, the original filter struct is the first member. This is critical. We're going to be using a trick called type punning. The first member is guaranteed to be positioned at the beginning of the struct, so a pointer to a struct filter_glob is also a pointer to a struct filter. Notice any resemblance to inheritance?
Each type, glob and regex, needs its own match method.
static bool
method_match_regex(struct filter *filter, const char *string)
{
struct filter_regex *regex = (struct filter_regex *) filter;
return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}
static bool
method_match_glob(struct filter *filter, const char *string)
{
struct filter_glob *glob = (struct filter_glob *) filter;
return fnmatch(glob->pattern, string, 0) == 0;
}
I've prefixed them with method_ to indicate their intended usage. I declared these static because they're completely private. Other parts of the program will only be accessing them through a function pointer on the struct. This means we need some constructors in order to set up those function pointers. (For simplicity, I'm not error checking.)
struct filter *filter_regex_create(const char *pattern)
{
struct filter_regex *regex = malloc(sizeof(*regex));
regcomp(&regex->regex, pattern, REG_EXTENDED);
regex->filter.match = method_match_regex;
return &regex->filter;
}
struct filter *filter_glob_create(const char *pattern)
{
struct filter_glob *glob = malloc(sizeof(*glob));
glob->pattern = pattern;
glob->filter.match = method_match_glob;
return &glob->filter;
}
Now this is real polymorphism. It's really simple from the user's perspective. They call the correct constructor and get a filter object that has the desired behavior. This object can be passed around trivially, and no other part of the program worries about how it's implemented. Best of all, since each method is a separate function rather than a switch case, new kinds of filter subtypes can be defined independently. Users can create their own filter types that work just as well as the two "built-in" filters.
Oops, the regex filter needs to be cleaned up when it's done, but the user, by design, won't know how to do it. Let's add a free() method.
struct filter {
bool (*match)(struct filter *, const char *);
void (*free)(struct filter *);
};
void filter_free(struct filter *filter)
{
return filter->free(filter);
}
And the methods for each. These would also be assigned in the constructor.
static void
method_free_regex(struct filter *f)
{
struct filter_regex *regex = (struct filter_regex *) f;
regfree(&regex->regex);
free(f);
}
static void
method_free_glob(struct filter *f)
{
free(f);
}
The glob constructor should perhaps strdup() its pattern as a private copy, in which case it would be freed here.
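That variant might look like the sketch below. It repeats the struct definitions so it stands alone, and changes the pattern field to an owned (non-const) copy.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() */
#include <fnmatch.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

struct filter_glob {
    struct filter filter;
    char *pattern;              /* now owned by the filter */
};

static bool method_match_glob(struct filter *f, const char *s)
{
    struct filter_glob *glob = (struct filter_glob *) f;
    return fnmatch(glob->pattern, s, 0) == 0;
}

static void method_free_glob(struct filter *f)
{
    struct filter_glob *glob = (struct filter_glob *) f;
    free(glob->pattern);        /* release the private copy */
    free(f);
}

struct filter *filter_glob_create(const char *pattern)
{
    struct filter_glob *glob = malloc(sizeof(*glob));
    glob->pattern = strdup(pattern);   /* private copy */
    glob->filter.match = method_match_glob;
    glob->filter.free = method_free_glob;
    return &glob->filter;
}
```

With the copy owned by the filter, the caller no longer has to keep the pattern string alive for the filter's lifetime.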
A good rule of thumb is to prefer composition over inheritance. Having tidy filter objects opens up some interesting possibilities for composition. Here’s an AND filter that composes two arbitrary filter objects. It only matches when both its subfilters match. It supports short circuiting, so put the faster, or most discriminating, filter first in the constructor (user’s responsibility).
struct filter_and {
struct filter filter;
struct filter *sub[2];
};
static bool
method_match_and(struct filter *f, const char *s)
{
struct filter_and *and = (struct filter_and *) f;
return filter_match(and->sub[0], s) && filter_match(and->sub[1], s);
}
static void
method_free_and(struct filter *f)
{
struct filter_and *and = (struct filter_and *) f;
filter_free(and->sub[0]);
filter_free(and->sub[1]);
free(f);
}
struct filter *filter_and(struct filter *a, struct filter *b)
{
struct filter_and *and = malloc(sizeof(*and));
and->sub[0] = a;
and->sub[1] = b;
and->filter.match = method_match_and;
and->filter.free = method_free_and;
return &and->filter;
}
It can combine a regex filter and a glob filter, or two regex filters, or two glob filters, or even other AND filters. It doesn't care what the subfilters are. Also, the free() method here frees its subfilters. This means that the user doesn't need to keep hold of every filter created, just the "top" one in the composition.
To make composition filters easier to use, here are two “constant” filters. These are statically allocated, shared, and are never actually freed.
static bool
method_match_any(struct filter *f, const char *string)
{
return true;
}
static bool
method_match_none(struct filter *f, const char *string)
{
return false;
}
static void
method_free_noop(struct filter *f)
{
}
struct filter FILTER_ANY = { method_match_any, method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };
The FILTER_NONE filter will generally be used with a (theoretical) filter_or() and FILTER_ANY will generally be used with the previously defined filter_and().
Here’s a simple program that composes multiple glob filters into a single filter, one for each program argument.
int main(int argc, char **argv)
{
struct filter *filter = &FILTER_ANY;
for (char **p = argv + 1; *p; p++)
filter = filter_and(filter_glob_create(*p), filter);
pass_match(stdin, stdout, filter);
filter_free(filter);
return 0;
}
Notice only one call to filter_free() is needed to clean up the entire filter.
As I mentioned before, the filter struct must be the first member of filter subtype structs in order for type punning to work. If we want to “inherit” from two different types like this, they would both need to be in this position: a contradiction.
Fortunately type punning can be generalized so that the first-member constraint isn't necessary. This is commonly done through a container_of() macro. Here's a C99-conforming definition.
#include <stddef.h>
#define container_of(ptr, type, member) \
((type *)((char *)(ptr) - offsetof(type, member)))
Given a pointer to a member of a struct, the container_of() macro allows us to back out to the containing struct. Suppose the regex struct was defined differently, so that the regex_t member came first.
struct filter_regex {
regex_t regex;
struct filter filter;
};
The constructor remains unchanged. The casts in the methods change to the macro.
static bool
method_match_regex(struct filter *f, const char *string)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}
static void
method_free_regex(struct filter *f)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
regfree(&regex->regex);
free(f);
}
It’s a constant, compile-time computed offset, so there should be no practical performance impact. The filter can now participate freely in other intrusive data structures, like linked lists and such. It’s analogous to multiple inheritance.
Say we want to add a third method, clone(), to the filter API, to make an independent copy of a filter, one that will need to be separately freed. It will be like the copy assignment operator in C++. Each kind of filter will need to define an appropriate "method" for it. As long as new methods like this are added at the end, this doesn't break the API, but it does break the ABI regardless.
struct filter {
bool (*match)(struct filter *, const char *);
void (*free)(struct filter *);
struct filter *(*clone)(struct filter *);
};
The filter object is starting to get big. It’s got three pointers — 24 bytes on modern systems — and these pointers are the same between all instances of the same type. That’s a lot of redundancy. Instead, these pointers could be shared between instances in a common table called a virtual method table, commonly known as a vtable.
Here’s a vtable version of the filter API. The overhead is now only one pointer regardless of the number of methods in the interface.
struct filter {
struct filter_vtable *vtable;
};
struct filter_vtable {
bool (*match)(struct filter *, const char *);
void (*free)(struct filter *);
struct filter *(*clone)(struct filter *);
};
Each type creates its own vtable and links to it in the constructor. Here’s the regex filter re-written for the new vtable API and clone method. This is all the tricks in one basket for a big object oriented C finale!
struct filter *filter_regex_create(const char *pattern);
struct filter_regex {
regex_t regex;
const char *pattern;
struct filter filter;
};
static bool
method_match_regex(struct filter *f, const char *string)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}
static void
method_free_regex(struct filter *f)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
regfree(&regex->regex);
free(f);
}
static struct filter *
method_clone_regex(struct filter *f)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
return filter_regex_create(regex->pattern);
}
/* vtable */
struct filter_vtable filter_regex_vtable = {
method_match_regex, method_free_regex, method_clone_regex
};
/* constructor */
struct filter *filter_regex_create(const char *pattern)
{
struct filter_regex *regex = malloc(sizeof(*regex));
regex->pattern = pattern;
regcomp(&regex->regex, pattern, REG_EXTENDED);
regex->filter.vtable = &filter_regex_vtable;
return &regex->filter;
}
This is almost exactly what's going on behind the scenes in C++. When a method/function is declared virtual, and therefore dispatches based on the run-time type of its left-most argument, it's listed in the vtables for classes that implement it. Otherwise it's just a normal function. This is why functions need to be declared virtual ahead of time in C++.
In conclusion, it’s relatively easy to get the core benefits of object oriented programming in plain old C. It doesn’t require heavy use of macros, nor do users of these systems need to know that underneath it’s an object system, unless they want to extend it for themselves.
Here's the whole example program if you're interested in poking at it:
_Atomic type specifier and not paying enough attention to memory ordering constraints.
Still, this is a good opportunity to break new ground with a demonstration of C11. I'm going to use the new stdatomic.h portion of C11 to build a lock-free data structure. To compile this code you'll need a C compiler and C library with support for both C11 and the optional stdatomic.h features. As of this writing, as far as I know only GCC 4.9, released April 2014, supports this. It's in Debian unstable but not in Wheezy.
If you want to take a look before going further, here’s the source. The test code in the repository uses plain old pthreads because C11 threads haven’t been implemented by anyone yet.
I was originally going to write this article a couple weeks ago, but I was having trouble getting it right. Lock-free data structures are trickier and nastier than I expected, more so than traditional mutex locks. Getting it right requires very specific help from the hardware, too, so it won’t run just anywhere. I’ll discuss all this below. So sorry for the long article. It’s just a lot more complex a topic than I had anticipated!
A lock-free data structure doesn't require the use of mutex locks. More generally, it's a data structure that can be accessed from multiple threads without blocking. This is accomplished through the use of atomic operations — transformations that cannot be interrupted. Lock-free data structures generally provide better throughput than mutex locks, and they're usually safer: there's no risk of getting stuck on a lock that will never be released, as in a deadlock. On the other hand, there's an additional risk of starvation (livelock), where a thread is unable to make progress.
As a demonstration, I’ll build up a lock-free stack, a sequence with last-in, first-out (LIFO) behavior. Internally it’s going to be implemented as a linked-list, so pushing and popping is O(1) time, just a matter of consing a new element on the head of the list. It also means there’s only one value to be updated when pushing and popping: the pointer to the head of the list.
Here's what the API will look like. I'll define lstack_t shortly. I'm making it an opaque type because its fields should never be accessed directly. The goal is to completely hide the atomic semantics from the users of the stack.
int lstack_init(lstack_t *lstack, size_t max_size);
void lstack_free(lstack_t *lstack);
size_t lstack_size(lstack_t *lstack);
int lstack_push(lstack_t *lstack, void *value);
void *lstack_pop(lstack_t *lstack);
Users can push void pointers onto the stack, check the size of the stack, and pop void pointers back off the stack. Except for initialization and destruction, these operations are all safe to use from multiple threads. Two different threads will never receive the same item when popping. No elements will ever be lost if two threads attempt to push at the same time. Most importantly a thread will never block on a lock when accessing the stack.
Notice there's a maximum size declared at initialization time. While lock-free allocation is possible [PDF], C makes no guarantees that malloc() is lock-free, so being truly lock-free means not calling malloc(). An important secondary benefit to pre-allocating the stack's memory is that this implementation doesn't require the use of hazard pointers, which would be far more complicated than the stack itself.
The declared maximum size should actually be the desired maximum size plus the number of threads accessing the stack. This is because a thread might remove a node from the stack, and before the node can be freed for reuse, another thread attempts a push. This other thread might not find any free nodes, causing it to give up without the stack actually being "full."
The int return value of lstack_init() and lstack_push() is for error codes, returning 0 for success. The only way these can fail is by running out of memory. This is an issue regardless of being lock-free: systems can simply run out of memory. In the push case it means the stack is full.
Here’s the definition for a node in the stack. Neither field needs to be accessed atomically, so they’re not special in any way. In fact, the fields are never updated while on the stack and visible to multiple threads, so it’s effectively immutable (outside of reuse). Users never need to touch this structure.
struct lstack_node {
    void *value;
    struct lstack_node *next;
};
Internally an lstack_t is composed of two stacks: the value stack (head) and the free node stack (free). These will be handled identically by the atomic functions, so it's really a matter of convention which stack is which. All nodes are initially placed on the free stack and the value stack starts empty. Here's what an internal stack looks like.
struct lstack_head {
    uintptr_t aba;
    struct lstack_node *node;
};
There's still no atomic declaration here because the struct is going to be handled as an entire unit. The aba field is critically important for correctness and I'll go over it shortly. It's declared as a uintptr_t because it needs to be the same size as a pointer. Now, this is not guaranteed by C11 — it's only guaranteed to be large enough to hold any valid void * pointer, so it could be even larger — but this will be the case on any system that has the required hardware support for this lock-free stack. This struct is therefore the size of two pointers. If that's not true for any reason, this code will not link. Users will never directly access or handle this struct either.
Finally, here’s the actual stack structure.
typedef struct {
    struct lstack_node *node_buffer;
    _Atomic struct lstack_head head, free;
    _Atomic size_t size;
} lstack_t;
Notice the use of the new _Atomic qualifier. Atomic values may have different size, representation, and alignment requirements in order to satisfy atomic access. These values should never be accessed directly, even just for reading (use atomic_load()).
The size field is for convenience to check the number of elements on the stack. It's accessed separately from the stack nodes themselves, so it's not safe to read size and use the information to make assumptions about future accesses (e.g. checking if the stack is empty before popping off an element). Since there's no way to lock the lock-free stack, there's otherwise no way to estimate the size of the stack during concurrent access without completely disassembling it via lstack_pop().
There's no reason to use volatile here. That's a separate issue from atomic operations. The C11 stdatomic.h macros and functions will ensure atomic values are accessed appropriately.
As stated before, all nodes are initially placed on the internal free stack. During initialization they're allocated in one solid chunk, chained together, and pinned on the free pointer. The initial assignments to atomic values are done through ATOMIC_VAR_INIT, which deals with memory access ordering concerns. The aba counters don't actually need to be initialized. Garbage, indeterminate values are just fine, but not initializing them would probably look like a mistake.
int
lstack_init(lstack_t *lstack, size_t max_size)
{
    struct lstack_head head_init = {0, NULL};
    lstack->head = ATOMIC_VAR_INIT(head_init);
    lstack->size = ATOMIC_VAR_INIT(0);

    /* Pre-allocate all nodes. */
    lstack->node_buffer = malloc(max_size * sizeof(struct lstack_node));
    if (lstack->node_buffer == NULL)
        return ENOMEM;
    for (size_t i = 0; i < max_size - 1; i++)
        lstack->node_buffer[i].next = lstack->node_buffer + i + 1;
    lstack->node_buffer[max_size - 1].next = NULL;
    struct lstack_head free_init = {0, lstack->node_buffer};
    lstack->free = ATOMIC_VAR_INIT(free_init);
    return 0;
}
The free nodes will not necessarily be used in the same order that they're placed on the free stack. Several threads may pop off nodes from the free stack and, as a separate operation, push them onto the value stack in a different order. Over time, with multiple threads pushing and popping, the nodes are likely to get shuffled around quite a bit. This is why a linked list is still necessary even though allocation is contiguous.
The reverse of lstack_init() is simple, and it's assumed concurrent access has terminated. The stack is no longer valid, at least not until lstack_init() is used again. This one is declared inline and put in the header.
static inline void
lstack_free(lstack_t *lstack)
{
    free(lstack->node_buffer);
}
To read an atomic value we need to use atomic_load(). Given a pointer to an atomic value, it dereferences the pointer and returns the value atomically. This is used in another inline function for reading the size of the stack.
static inline size_t
lstack_size(lstack_t *lstack)
{
    return atomic_load(&lstack->size);
}
For operating on the two stacks there will be two internal, static functions, push and pop. These deal directly in nodes, accepting and returning them, so they're not suitable to expose in the API (users aren't meant to be aware of nodes). This is the most complex part of lock-free stacks. Here's pop().
static struct lstack_node *
pop(_Atomic struct lstack_head *head)
{
    struct lstack_head next, orig = atomic_load(head);
    do {
        if (orig.node == NULL)
            return NULL;  // empty stack
        next.aba = orig.aba + 1;
        next.node = orig.node->next;
    } while (!atomic_compare_exchange_weak(head, &orig, next));
    return orig.node;
}
It's centered around the new C11 stdatomic.h function atomic_compare_exchange_weak(). This is an atomic operation more generally called compare-and-swap (CAS). On x86 there's an instruction specifically for this, cmpxchg. Give it a pointer to the atomic value to be updated (head), a pointer to the value it's expected to be (orig), and a desired new value (next). If the expected and actual values match, it's updated to the new value. If not, it reports a failure and updates the expected value to the latest value. In the event of a failure we start all over again, which requires the while loop. This is an optimistic strategy.
The "weak" part means it will sometimes spuriously fail where the "strong" version would otherwise succeed. In exchange for more failures, calling the weak version is faster. Use the weak version when the body of your do ... while loop is fast, and the strong version when it's slow (when trying again is expensive) or when you don't need a loop at all. You usually want to use weak.
The alternative to CAS is load-link/store-conditional. It's a stronger primitive that doesn't suffer from the ABA problem described next, but it's also not available on x86-64. On other platforms, one or both of atomic_compare_exchange_*() will be implemented using LL/SC, but we still have to code for the worst case (CAS).
The aba field is here to solve the ABA problem by counting the number of changes that have been made to the stack. It will be updated atomically alongside the pointer. Reasoning about the ABA problem is where I got stuck last time writing this article.
Suppose aba didn't exist and it was just a pointer being swapped. Say we have two threads, A and B.
Thread A copies the current head into orig, enters the loop body to update next.node to orig.node->next, then gets preempted before the CAS. The scheduler pauses the thread.
Thread B comes along and performs a pop(), changing the value pointed to by head. At this point A's CAS will fail, which is fine: it would reconstruct a new updated value and try again. While A is still asleep, B puts the popped node back on the free node stack.
Some time passes with A still paused. The freed node gets re-used and pushed back on top of the stack, which is likely given that free nodes are allocated LIFO. Now head has its original value again, but the head->node->next pointer is pointing somewhere completely new! This is very bad because A's CAS will now succeed despite next.node having the wrong value.
A wakes up and its CAS succeeds. At least one stack value has been lost and at least one node struct has been leaked (it's on neither stack, nor held by any thread). This is the ABA problem.
The core problem is that, unlike integral values, pointers have meaning beyond their intrinsic numeric value. The meaning of a particular pointer changes when the pointer is reused, making it suspect when used in CAS. The unfortunate effect is that, by itself, atomic pointer manipulation is nearly useless. It will work with append-only data structures, where pointers are never recycled, but that's it.
The aba field solves the problem because it's incremented every time the pointer is updated. Remember that this internal stack struct is two pointers wide? That's 16 bytes on a 64-bit system. The entire 16 bytes is compared by CAS and all of it has to match for the operation to succeed. Since B, or other threads, will increment aba at least twice (once to remove the node, and once to put it back in place), A will never mistake the recycled pointer for the old one. There's a special double-width CAS instruction specifically for this purpose, cmpxchg16b, generally called DWCAS. It's available on most x86-64 processors. On Linux you can check /proc/cpuinfo for support. It will be listed as cx16.
If it's not available at compile time this program won't link — the function that wraps cmpxchg16b won't be there. You can tell GCC to assume it's there with the -mcx16 flag. The same rule applies to C++11's new std::atomic.
There's still a tiny, tiny possibility of the ABA problem cropping up. On 32-bit systems A may get preempted for over 4 billion (2^32) stack operations, such that the ABA counter wraps around to the same value. There's nothing we can do about this, but if you witness it in the wild you need to immediately stop what you're doing and go buy a lottery ticket. Also avoid any lightning storms on the way to the store.
Another problem in pop() is dereferencing orig.node to access its next field. By the time we get to it, the node pointed to by orig.node may have already been removed from the stack and freed. If the stack was using malloc() and free() for allocations, it may even have had free() called on it. If so, the dereference would be undefined behavior — a segmentation fault, or worse.
There are three ways to deal with this.
Garbage collection. If memory is automatically managed, the node will never be freed as long as we can access it, so this won’t be a problem. However, if we’re interacting with a garbage collector we’re not really lock-free.
Hazard pointers. Each thread keeps track of what nodes it’s currently accessing and other threads aren’t allowed to free nodes on this list. This is messy and complicated.
Never free nodes. This implementation recycles nodes, but they're never truly freed until lstack_free(). It's always safe to dereference a node pointer because there's always a node behind it. It may point to a node that's on the free list or one that was even recycled since we got the pointer, but the aba field deals with any of those issues.
Reference counting on the node won't work here because we can't get to the counter fast enough (atomically). It too would require dereferencing in order to increment. The reference counter could potentially be packed alongside the pointer and accessed by a DWCAS, but we're already using those bytes for aba.
Push is a lot like pop.
static void
push(_Atomic struct lstack_head *head, struct lstack_node *node)
{
    struct lstack_head next, orig = atomic_load(head);
    do {
        node->next = orig.node;
        next.aba = orig.aba + 1;
        next.node = node;
    } while (!atomic_compare_exchange_weak(head, &orig, next));
}
It’s counter-intuitive, but adding a few microseconds of sleep after CAS failures would probably increase throughput. Under high contention, threads wouldn’t take turns clobbering each other as fast as possible. It would be a bit like exponential backoff.
The API push and pop functions are built on these internal atomic functions.
int
lstack_push(lstack_t *lstack, void *value)
{
    struct lstack_node *node = pop(&lstack->free);
    if (node == NULL)
        return ENOMEM;
    node->value = value;
    push(&lstack->head, node);
    atomic_fetch_add(&lstack->size, 1);
    return 0;
}
Push removes a node from the free stack. If the free stack is empty it reports an out-of-memory error. It assigns the value and pushes it onto the value stack where it will be visible to other threads. Finally, the stack size is incremented atomically. This means there’s an instant where the stack size is listed as one shorter than it actually is. However, since there’s no way to access both the stack size and the stack itself at the same instant, this is fine. The stack size is really only an estimate.
Popping is the same thing in reverse.
void *
lstack_pop(lstack_t *lstack)
{
    struct lstack_node *node = pop(&lstack->head);
    if (node == NULL)
        return NULL;
    atomic_fetch_sub(&lstack->size, 1);
    void *value = node->value;
    push(&lstack->free, node);
    return value;
}
Remove the top node, decrement the size estimate atomically, put the node on the free list, and return the pointer. It's really simple with the primitive push and pop.
The lstack repository linked at the top of the article includes a demo that searches for patterns in SHA-1 hashes (sort of like Bitcoin mining). It fires off one worker thread for each core and the results are all collected into the same lock-free stack. It’s not really exercising the library thoroughly because there are no contended pops, but I couldn’t think of a better example at the time.
The next thing to try would be implementing a C11, bounded, lock-free queue. It would also be more generally useful than a stack, particularly for common consumer-producer scenarios.
Unfortunately, support for the Digispark on Linux is spotty. Just as with any hardware project, the details are irreversibly messy. It can't make use of the standard Arduino software for programming the board, so you have to download a customized toolchain. This download includes files that have the incorrect vendor ID, requiring a manual fix. Worse, the fix listed in their documentation is incomplete, at least for Debian and Debian-derived systems.
The main problem is that Linux will not automatically create a /dev/ttyACM0 device like it normally does for Arduino devices. Instead it gets a long, hidden, unpredictable device name. The fix is to ask udev to give it a predictable name by appending the following to the first line in the provided udev rules file (49-micronucleus.rules),
SYMLINK+="ttyACM%n"
The whole uncommented portion of the rules file should look like this:
The == is a conditional operator, indicating that the rule only applies when the condition is met. The := and += are assignment operators, evaluated when all of the conditions are met. The SYMLINK part tells udev to put a softlink to the device in /dev under a predictable name.
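Putting those operators together, a complete rule has roughly this shape. The vendor and product IDs below are placeholders for illustration — they are not the Digispark's actual values, which come from the provided rules file:

```
SUBSYSTEM=="usb", ATTR{idVendor}=="1234", ATTR{idProduct}=="abcd", MODE:="0666", SYMLINK+="ttyACM%n"
```

The == terms select the matching device, MODE:= forces the permissions, and SYMLINK+= adds the predictable name under /dev.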
Update August 2019: I’ve got a PGP key again, but I’m using my own tool, passphrase2pgp, to manage it. This tool allows for a particular workflow that GnuPG has never and will never provide. It doesn’t rely on S2K as described below.
One of the items in my dotfiles repository is my PGP keys, both private and public. I believe this is a unique approach that hasn’t been done before — a public experiment. It may seem dangerous, but I’ve given it careful thought and I’m only using the tools already available from GnuPG. It ensures my keys are well backed-up (via the Torvalds method) and available wherever I should need them.
In your GnuPG directory there are two core files: secring.gpg and pubring.gpg. The first contains your secret keys and the second contains public keys. secring.gpg is not itself encrypted. You can (should) have different passphrases for each key, after all. These files (or any PGP file) can be inspected with --list-packets. Notice it won't prompt for a passphrase in order to get this data,
$ gpg --list-packets ~/.gnupg/secring.gpg
:secret key packet:
version 4, algo 1, created 1298734547, expires 0
skey[0]: [2048 bits]
skey[1]: [17 bits]
iter+salt S2K, algo: 9, SHA1 protection, hash: 10, salt: ...
protect count: 10485760 (212)
protect IV: a6 61 4a 95 44 1e 7e 90 88 c3 01 70 8d 56 2e 11
encrypted stuff follows
:user ID packet: "Christopher Wellons <...>"
:signature packet: algo 1, keyid 613382C548B2B841
... and so on ...
Each key is encrypted individually within this file with a passphrase. If you try to use the key, GPG will attempt to decrypt it by asking for the passphrase. If someone were to somehow gain access to your secring.gpg, they'd still need to get your passphrase, so pick a strong one. The official documentation advises you to keep your secring.gpg well-guarded and only rely on the passphrase as a cautionary measure. I'm ignoring that part.
If you're using GPG's defaults, your secret key is encrypted with CAST5, a symmetric block cipher. The encryption key is your passphrase salted (mixed with a non-secret random number) and hashed with SHA-1 65,536 times. Using the hash function over and over is called key stretching. It greatly increases the amount of required work for a brute-force attack, making your passphrase more effective. All of these settings can be adjusted to better protect the secret key at the cost of less portability. Since I've chosen to publish my secring.gpg in my dotfiles repository, I cranked up the settings as far as I could.
I changed the cipher to AES256, which is more modern, more trusted, and more widely used than CAST5. For the passphrase digest, I selected SHA-512. There are better passphrase digest algorithms out there but this is the longest, slowest one that GPG offers. The PGP spec supports between 1024 and 65,011,712 digest iterations, so I picked one of the largest. 65 million iterations takes my laptop over a second to process — absolutely brutal for someone attempting a brute-force attack. Here’s the command to change to this configuration on an existing key,
gpg --s2k-cipher-algo AES256 --s2k-digest-algo SHA512 --s2k-mode 3 \
--s2k-count 65000000 --edit-key <key id>
When the edit key prompt comes up, enter passwd to change your passphrase. You can enter the same passphrase again and it will re-use it with the new configuration.
I'm feeling quite secure with my secret key, despite publishing my secring.gpg. Before now, I was much more at risk of losing it to disk failure than having it exposed. I challenge anyone who doubts my security to crack my secret key. I'd rather learn that I'm wrong sooner than later!
With this established in my dotfiles repository, I can more easily include private dotfiles. Rather than use a symmetric cipher with an individual passphrase on each file, I encrypt the private dotfiles to myself. All my private dotfiles are managed with one key: my PGP key. This also plays better with Emacs. While it supports transparent encryption, it doesn’t even attempt to manage your passphrase (with good reason). If the file is encrypted with a symmetric cipher, Emacs will prompt for a passphrase on each save. If I encrypt them with my public key, I only need the passphrase when I first open the file.
How it works right now is any dotfile that ends with .priv.pgp will be decrypted into place — not symlinked, unfortunately, since this is impossible. The install script has a -p switch to disable private dotfiles, such as when I'm using an untrusted computer. gpg-agent ensures that I only need to enter my passphrase once during the install process no matter how many private dotfiles there are.
So you want to make your own animated GIFs from a video clip? Well, it’s a pretty easy process that can be done almost entirely from the command line. I’m going to show you how to turn the clip into a GIF and add an image macro overlay. Like this,
The key tool here is going to be Gifsicle, a very excellent command-line tool for creating and manipulating GIF images. So, the full list of tools is,
Here’s the source video for the tutorial. It’s an awkward video my wife took of our confused cats, Calvin and Rocc.
My goal is to cut after Calvin looks at the camera, before he looks away. From roughly 3 seconds to 23 seconds. I’ll have mplayer give me the frames as JPEG images.
mplayer -vo jpeg -ss 3 -endpos 23 -benchmark calvin-dummy.webm
This tells mplayer to output JPEG frames between 3 and 23 seconds, doing it as fast as it can (-benchmark). This output almost 800 images. Next I look through the frames and delete the extra images at the beginning and end that I don't want to keep. I'm also going to throw away the even-numbered frames, since GIFs can't have such a high framerate in practice.
rm *[02468].jpg
There’s also dead space around the cats in the image that I want to crop. Looking at one of the frames in GIMP, I’ve determined this is a 450 by 340 box, with the top-left corner at (136, 70). We’ll need this information for ImageMagick.
Gifsicle only knows how to work with GIFs, so we need to batch convert these frames with ImageMagick's convert. This is where we need the crop dimensions from above, given in ImageMagick's notation.
ls *.jpg | xargs -I{} -P4 \
convert {} -crop 450x340+136+70 +repage -resize 300 {}.gif
This will do four images at a time in parallel. The +repage is necessary because ImageMagick keeps track of the original image "canvas", and it will simply drop the section of the image we don't want rather than completely crop it away. The repage forces it to resize the canvas as well. I'm also scaling it down slightly to save on the final file size.
We have our GIF frames, so we’re almost there! Next, we ask Gifsicle to compile an animated GIF.
gifsicle --loop --delay 5 --dither --colors 32 -O2 *.gif > ../out.gif
I've found that using 32 colors and dithering the image gives very nice results at a reasonable file size. Dithering adds noise to the image to remove the banding that occurs with small color palettes. I've also instructed it to optimize the GIF as fully as it can (-O2). If you're just experimenting and want Gifsicle to go faster, turning off dithering goes a long way, followed by disabling optimization.
The delay of 5 is in hundredths of a second per frame, which plays at about 20 frames per second — close to the 15-ish we're after, since we cut half the frames from a 30 frames-per-second source video. We also want to loop indefinitely.
The result is this 6.7 MB GIF. A little large, but good enough. It’s basically what I was going for. Next we add some macro text.
In GIMP, make a new image with the same dimensions of the GIF frames, with a transparent background.
Add your macro text in white, in the Impact Condensed font.
Right click the text layer and select “Alpha to Selection,” then under Select, grow the selection by a few pixels — 3 in this case.
Select the background layer and fill the selection with black, giving a black border to the text.
Save this image as text.png, for our text overlay.
Time to go back and redo the frames, overlaying the text this time. This is called compositing and ImageMagick can do it without breaking a sweat. To composite two images is simple.
convert base.png top.png -composite out.png
List the image to go on top, then use the -composite flag, and it's placed over top of the base image. In my case, I actually don't want the text to appear until Calvin, the orange cat, faces the camera. This happens quite conveniently at just about frame 500, so I'm only going to redo those frames.
ls 000005*.jpg | xargs -I{} -P4 \
convert {} -crop 450x340+136+70 +repage \
-resize 300 text.png -composite {}.gif
Run Gifsicle again and this 6.2 MB image is the result. The text overlay compresses better, so it’s a tiny bit smaller.
Now it’s time to post it on reddit and reap that tasty, tasty karma. (Over 400,000 views!)
The first three are usually available from your Linux distribution repositories, making them trivial to obtain. The last one is easy to obtain and compile.
If you're using a modern browser, you should have noticed my portrait on the left-hand side changed recently (update: it's been removed). That's an HTML5 WebM video — currently with Ogg Theora fallback due to a GitHub issue. To cut the video down to that portrait size, I used the above four tools on the original video.
WebM seems to be becoming the standard HTML5 video format. Google is pushing it and it’s supported by all the major browsers, except Safari. So, unless something big happens, I plan on going with WebM for web video in the future.
To begin, as I’ve done before, split the video into its individual frames,
mplayer -vo jpeg -ao dummy -benchmark video_file
The -benchmark option hints for mplayer to go as fast as possible, rather than normal playback speed.
Next look through the output frames and delete any unwanted ones, such as the first and last few seconds of video. With the desired frames remaining, use ImageMagick, or any batch image editing software, to crop out the relevant section of the images. This can be done in parallel with xargs' -P option — to take advantage of multiple cores if disk I/O isn't the bottleneck.
ls *.jpg | xargs -I{} -P5 convert {} -crop 312x459+177+22 {}.ppm
That crops out a 312 by 459 section of the image, with the top-left corner at (177, 22). Any other convert filters can be stuck in there too. Notice the output format is the portable pixmap (ppm), which is significant because it won't introduce any additional loss and, most importantly, it is required by the next tool.
If I'm happy with the result, I use ppmtoy4m to pipe the new frames to the encoder,
cat *.ppm | ppmtoy4m | vpxenc --best -o output.webm -
As the name implies, ppmtoy4m converts a series of portable pixmap files into a YUV4MPEG2 (y4m) video stream. YUV4MPEG2 is the bitmap of the video world: gigantic, lossless, uncompressed video. It's exactly the kind of thing you want to hand to a video encoder. If you need to specify any video-specific parameters, ppmtoy4m is the tool that needs to know it. For example, to set the framerate to 10 FPS,
... | ppmtoy4m -F 10:1 | ...
ppmtoy4m is a classically-trained unix tool: stdin to stdout. No need to dump that raw video to disk, just pipe it right into the WebM encoder. If you choose a different encoder, it might not support reading from stdin, especially if you do multiple passes. A possible workaround would be a named pipe,
mkfifo video.y4m
cat *.ppm | ppmtoy4m > video.y4m &
otherencoder video.y4m
For WebM encoding, I like to use the --best option, telling the encoder to take its time to do a good job. To do two passes and get even more quality per byte (--passes=2), a pipe cannot be used and you'll need to write the entire raw video onto the disk. If you try to pipe it anyway, vpxenc will simply crash rather than give an error message (as of this writing). This had me confused for a while.
To produce Ogg Theora instead of WebM, ffmpeg2theora is a great tool. It's well-behaved on the command line and can be dropped in place of vpxenc.
To do audio, encode your audio stream with your favorite audio encoder (Vorbis, LAME, etc.) then merge them together into your preferred container. For example, to add audio to a WebM video (i.e. Matroska), use mkvmerge from MKVToolNix,
mkvmerge --webm -o combined.webm video.webm audio.ogg
Extra notes update: There's a bug in imlib2 where it can't read PPM files that have no initial comment, so some tools, including GIMP and QIV, can't read PPM files produced by ImageMagick. Fortunately ppmtoy4m is unaffected. However, there is a bug in ppmtoy4m where it can't read PPM files with a depth other than 8 bits. Fix this by giving the option -depth 8 to ImageMagick's convert.
$HOME/.ant/lib these days is an up-to-date ivy.jar.
Last month I started managing my entire Emacs configuration in Git, which has already paid for itself by saving me time. I found out a few other people have been using it (including Brian), so I also wrote up a README file describing my specific changes.
With Emacs being a breeze to synchronize between my computers, I noticed a new bottleneck emerged: my .ant directory. Apache Ant puts everything in $ANT_HOME/lib and $HOME/.ant/lib into its classpath. So, for example, if you wanted to use JUnit with Ant, you'd toss junit.jar in either of those directories. $ANT_HOME tends to be a system directory, and I prefer to only modify system directories indirectly through apt, so I put everything in $HOME/.ant/lib. Unfortunately, that's another directory to keep track of on my own. Fortunately, I already know how to deal with that. It's now another Git repository,
https://github.com/skeeto/.ant (README)
With that in place, settling into a new computer for development is
almost as simple as cloning those two repositories. Yesterday I took
the step to eliminate the only significant step that remained:
setting up java-docs
. Before you could really
take advantage of my Java extension, you really needed to have a
Javadoc directory scanned by Emacs. The results of that scan not only
provided an easy way to jump into documentation, but also provided the
lists for class name completion. Now, java-docs
now automatically
loads up the core Java Javadoc, linking to the official website, if
the user never sets it up.
So if you want to see exactly how my Emacs workflow with Java operates, it’s just a few small steps away. This should work for any operating system suitable for Java development.
Let’s start by getting Java set up. First, install a JDK and Apache Ant. This is trivial to do on Debian-based systems,
sudo apt-get install openjdk-6-jdk ant
On Windows, the JDK is easy, but Ant needs some help. You probably
need to set ANT_HOME
to point to the install location, and you
definitely need to add it to your PATH
.
Next install Git. This should be straightforward; just make sure its
in your PATH
(so Emacs can find it).
Clone my .ant
repository in your home directory.
cd
git clone https://github.com/skeeto/.ant.git
Except for Emacs, that’s really all I need to develop with Java. This setup should allow you to compile and hack on just about any of my Java projects. To test it out, anywhere you like clone one of my projects, such as my example project.
git clone https://github.com/skeeto/sample-java-project.git
You should be able to build and run it now,
cd sample-java-project
ant run
If that works, you’re ready to set up Emacs. First, install Emacs. If
you’re not familiar with Emacs, now would be the time to go through
the tutorial to pick up the basics. Fire it up and type CTRL + h
and
then t
(in Emacs’ terms: C-h t
), or select the tutorial from the
menu.
Move any existing configuration out of the way,
mv .emacs .old.emacs
mv .emacs.d .old.emacs.d
Clone my configuration,
git clone https://github.com/skeeto/.emacs.d.git
Then run Emacs. You should be greeted with a plain, gray window: the wombat theme. No menu bar, no toolbar, just a minibuffer, mode line, and wide open window. Anything else is a waste of screen real estate. This initial empty buffer has a great aesthetic, don’t you think?
Now to go for a test drive: open up that Java project you cloned, with
M-x open-java-project
. That will prompt you for the root directory
of the project. The only thing this does is pre-opens all of the
source files for you, exposing their contents to dabbrev-expand
and
makes jumping to other source files as easy as changing buffers — so
it’s not strictly necessary.
Switch to a buffer with a source file, such as
SampleJavaProject.java
if you used my example project. Change
whatever you like, such as the printed string. You can add import
statements at any time with C-x I
(note: capital I
), where
java-docs
will present you with a huge list of classes from which to
pick. The import will be added at the top of the buffer in the correct
position in the import listing.
Without needing to save, hit C-x r
to run the program from Emacs. A
*compilation-1*
buffer will pop up with all of the output from Ant
and the program. If you just want to compile without running it, type
C-x c
instead. If there were any errors, Ant will report them in the
compilation buffer. You can jump directly to these with C-x `
(that’s a backtick).
Now open a new source file in the same package (same directory) as the
source file you just edited. Type cls
and hit tab. The boilerplate,
including package statement, will be filled out for you by
YASnippet. There are a bunch of completion snippets available. Try
jal
for example, which completes with information from java-docs
.
When I’m developing a library, I don’t have a main function, so
there’s nothing to “run”. Instead, I drive things from unit tests,
which can be run with C-x t
, which runs the “test” target if there
is one.
To see your changes, type C-x g
to bring up Magit and type M-s
in
the Magit buffer (to show a full diff). From here you can make
commits, push, pull, merge, switch branches, reset, and so on. To
learn how to do all this, see the
Magit manual. You
can type q
to exit the Magit window, or use S-<arrow key>
to move
to an adjacent buffer in any direction.
And that’s basically my workflow. Developing in C is a very similar
process, but without the java-docs
part.
Here's a little on-going project I put together recently. It's mostly for my own future reference, but perhaps someone else may find it useful.
git clone git://github.com/skeeto/sample-java-project.git
If you couldn't guess already, I'm strongly against tying a project's development to a particular IDE. It happens too much: someone starts the project by firing up their favorite IDE, clicking "Create new project", and checks in whatever it spits out. It usually creates a build system integrated tightly into that particular IDE. At work I've seen it happen on two different large Java projects. There are some ways around it, like maintaining two build systems side-by-side, but it's not very pretty. Sometimes the Java IDE can spit out some Ant build files for the sake of continuous integration, but it remains a second-class citizen for development.
I prefer the other direction: start with a standalone build system, then stick your own development environment on top of that. Each developer picks and is responsible for whatever IDE or editor they want, with the standalone build system providing the canonical build (and, in my experience, if you must use an IDE, NetBeans has the smoothest integration with Ant). So in the case of Java, this means setting up an Ant-based build.
I've said before that I like the Java platform, I just find the primary language disappointing. Similarly, I like Ant, I just find the build script language disappointing (XML). It seems other people like it too, at least for Java development, because I haven't been able to find any serious criticisms of it outside of hating the XML (notice the first result in that search is written by someone who is Doing It All Wrong). I love that it works on filesets and not files. It's like getting atomic commits for my build system. If I add a new source file to my project I don't need to adjust the Ant build script in any way.
One downside of Ant is that, while it's commonly used in a very
standard way, it doesn't guide you in that direction or provide
special shortcuts to make the common cases easier. It's typical to
have a src/
directory containing all your source and
a build/
directory, created by Ant, that contains all the
built and generated files. With Ant you basically say, "Compile these
sources to here, then jar that directory up." Ant alone doesn't make
this very obvious. Give it to someone standed on a desert island and I
bet they won't derive the same best practice as the rest of the world.
Take make
, for example. Because building object files
from source is so common, (depending on the implementation) it has
built-in rules for it. This is all you need to say,
and make
knows how to do the rest.
Same for linking, it's so common you don't have to type anything more than necessary.
It guides you in creating good Makefiles. If you want to learn the best practice for Ant, you have to either buy a book on Ant or look at what lots of other people are doing. And so I provide my sample-java-project for this exact purpose.
You can use that as a skeleton when creating your own project, and you'll barely have to customize the build file. It's a big mass of boilerplate, the kind of stuff that Ant should have built-in by default. I'll be expanding it over time as I learn more about how to effectively use Ant.
So far, I included two things that you normally won't see: a target to run a Java indenter (AStyle) on your code, and a target to run the bureaucratic Checkstyle on your code.
]]>At work I currently spend about a third of my time doing data reduction, and it's become one of my favorite tasks. (I've done it on my own too). Data come in from various organizations and sponsors in all sorts of strange formats. We have a bunch of fancy analysis tools to work on the data, but they aren't any good if they can't read the format. So I'm tasked with writing tools to convert incoming data into a more useful format.
If the source file is a text-based file it's usually just a matter of writing a parser — possibly including a grammar — after carefully studying the textual structure. Binary files are trickier. Fortunately, there are a few tools that come in handy for identifying the format of a strange binary file.
The first is the standard utility found on any unix-like
system: file
. I have no idea if it has an official website because it's a term
that's impossible to search for. It tries to identify a file based on
the magic numbers and other tests, none based on the actual
file name. I've never been to lucky to have file
recognize
a strange format at work. But silence speaks volumes: it means the
data are not packed into something common, like a simple zip archive.
Next, I take a look at the file with ent, a pseudo-random number sequence test program. This will reveal how compressed (or even encrypted) data are. If ent says the data are very dense, say 7 bits per byte or more, the format is employing a good compression algorithm. The next step would be tackling that so I can start over on the uncompressed contents. If it's something like 4 bits per byte there's no compression. If it's in between then it might be employing a weak, custom compression algorithm. I've always seen the latter two.
Next I dive in with a hex editor. I use a combination of
Emacs' hexl-mode
and the standard BSD
tool hexdump (for
something more static). One of the first things I like to identify is
byte order, and in a hex dump it's often obvious.
In general, better designed formats use big endian, also known as network order. That's the standard ordering used in communication, regardless of the native byte ordering of the network clients. The amateur, home-brew formats are generally less thoughtful and dump out whatever the native format is, usually little endian because that's what x86 is. Worse, they'll also generate data on architectures that are big endian, so you can get it both ways without any warning. In that case your conversion tool has to be sensitive to byte order and find some way to identify which ordering a file is using. A time-stamp field is very useful here, because a 64-bit time-stamp read with the wrong byte order will give a very unreasonable date.
For example, here's something I see often.
eb 03 00 00 35 00 00 00 66 1e 00 00
That's most likely 3 4-byte values, in little endian byte order. The zeros make the integers stand out.
eb 03 00 00 35 00 00 00 66 1e 00 00
We can tell it's little endian because the non-zero digits are on the left. This information will be useful in identifying more bytes in the file.
Next I'd look for headers, common strings of bytes, so that I can identify larger structures in the data. I've never had to reverse engineer a format ... yet. I'm not sure if I could. Once I got this far I've always been able to research the format further and find either source code or documentation, revealing everything to me.
If the file contains strings I'll dump them out
with
strings
. I haven't found this too useful at work, but
it's been useful at home.
And there's something still useful beyond these. Something I made
myself at home for a completely different purpose, but I've exploited
its side effects: my PNG
Archiver. The original purpose of the tool is to store a file in
an image, as images are easier to share with others. The side effect
is that by viewing the image I get to see the structure of the
file. For example, here's my laptop's /bin/ls
, very
roughly labeled.
It's easy to spot the different segments of the ELF format. Higher entropy sections are more brightly colored. Strings, being composed of ASCII-like text, have their MSB's unset, which is why they're darker. Any non-compressed format will have an interesting profile like this. Here's a Word doc, an infamously horrible format,
And here's some Emacs bytecode. You can tell the code vectors apart from the constants section below it.
If you find yourself having to inspect strange files, keep these tools around to make the job easier.
]]>Did you know that Emacs comes with a calculator? Woop-dee-doo! Call the presses! Wow, a whole calculator! Sounds a bit lame, right?
Actually, it's much more than just a simple calculator. It's a computer algebra system! It is officially called a calculator, which isn't fair. It's an understatement, and I am sure has caused many people to overlook it. I finally ran into it during a thorough (re)reading of the Emacs manuals and almost skipped over it myself.
Ever see that demonstration by Will Wright for the game Spore several years ago? The player starts as a single-cell organism and evolves into a civilization with interstellar presence. When he started the demo he showed a cell through what looked like a microscope. No one had any idea yet what the game was about, so every time he increased the scope, from bacteria to animal, animal to civilization, civilization to space travel, interplanetary travel to interstellar travel, there was a huge reaction from the audience. It was like those infomercials: "But that's not all!!!"
As I made my way through the Emacs calc manual I was continually amazed by its power, with a similar constant increase in scope. Each new page was almost saying, "But that's not all!!!"
Like an infomercial I'm going to run through some of its features. See the calc manual for a real thorough introduction. It has practice exercises that shows some gotchas and interesting feature interactions.
Fire it up with C-x * c
or M-x calc
. There
will be two new windows (Emacs windows, that is), one with the
calculator and the other with usage history (the "trail").
First of all, the calculator operates on a stack and so its basic use
is done with RPN. The stack builds vertically, downwards. Type in
numbers and hit enter to push them onto the stack. Operators can be
typed right after the number, so no need to hit enter all the
time. Because negative (-
) is reserved for subtraction an
underscore _
is used to type a negative number. An
example stack with 3, 4, and 10,
3: 3 2: 4 1: 10 .
10 is at the "top" of the stack (indicated by the "1:"), so if we type
a *
the top two elements are multiplied. Like so,
2: 3 1: 40 .
The calculator has no limitations on the size of integers, so you work
with large numbers without losing precision. For example, we'll
take 2^200
.
2: 2 1: 200 .
Apply the ^
operator,
1: 1606938044258990275541962092341162602522202993782792835301376 .
But that's not all!!! It has a complex number type, which is entered
in pairs (real, imaginary) with parenthesis. They can be operated on
like any other number. Take -1 + 2i
minus 4 +
2i
,
2: (-1, 2) 1: (4, 2) .
Subtract with -
,
1: -5 .
Then take the square root of that using Q
, the square
root function.
1: (0., 2.2360679775) .
We can set the calculator's precision with p
. The default
is 12 places, showing here 1 / 7
.
1: 0.142857142857 .
If we adjust the precision to 50 and do it again,
2: 0.142857142857 1: 0.14285714285714285714285714285714285714285714285714 .
Numbers can be displayed in various notations, too, like fixed-point, scientific notation, and engineering notation. It will switch between these without losing any information (the stored form is separate from the displayed form).
But that's not all!!! We can represent rational numbers precisely with
ratios. These are entered with a :
. Push
on 1/7
, 3/14
, and 17/29
,
3: 1:7 2: 3:13 1: 17:29 .
And multiply them all together, which displays in the lowest form,
1: 51:2842 .
There is a mode for working in these automatically.
But that's not all!!! We can change the radix. To enter a number with
a different radix, which prefix it with the radix and a
#
. Here is how we enter 29 in base-2,
2#11101
We can change the display radix with d r
. With 29 on the
stack, here's base-4,
1: 4#131 .
Base-16,
1: 16#1D .
Base-36,
1: 36#T .
But that's not all!!! We can enter algebraic expressions onto the
stack with apostrophe, '
. Symbols can be entered as part
of the expression. Note: these expressions are not entered in RPN.
1: a^3 + a^2 b / c d - a / b .
There is a "big" mode (d B
) for easier reading,
2 3 a b a 1: a + ---- - - c d b .
We can assign values to variables to have the expression evaluated. If
we assign a
to 10 and use the "evaluates-to" operator,
2 3 a b a 100 b 10 1: a + ---- - - => 1000 + ----- - -- c d b c d b .
But that's not all!!! There is a vector type for working with vectors
and matrices and doing linear algebra. They are entered with
brackets, []
.
2: [4, 1, 5] 1: [ [ 1, 2, 3 ] [ 4, 5, 6 ] [ 6, 7, 8 ] ] .
And take the dot product, then take cross product of this vector and matrix,
2: [38, 48, 58] 1: [ [ -14, -18, -22 ] [ -19, -18, -17 ] [ 15, 18, 21 ] ] .
Any matrix and vector operator you could probably think of is available, including map and reduce (and you can define your own expression to apply).
We can use this to solve a linear system. Find x
and y
in terms of a
and b
,
x + a y = 6 x + b y = 10
Enter it (note we are using symbols),
2: [6, 10] 1: [ [ 1, a ] [ 1, b ] ] .
And divide,
4 a 4 1: [6 + -----, -----] a - b b - a .
But that's not all!!! We can create graphs if gnuplot is installed. We
can give it two vectors, or an algebraic expression. This plot
of sin(x)
and x cos(x)
was made with just a
few keystrokes,
But that's not all!!! There is an HMS type for handling times and angles. For 2 hours, 30 minutes, and 4 seconds, and some others,
3: 2@ 30' 4" 2: 4@ 22' 13" 1: 1@ 2' 56" .
Of course, the normal operators work as expected. We can add them all up,
1: 7@ 55' 13" .
We can convert between this and radians, and degrees, and so on.
But that's not all!!! The calculator also has a date type, entered
inside angled brackets, <>
(in algebra entry
mode). It is really flexible on input dates. We can insert the current
date with t N
.
1: <6:59:34pm Tue Jun 23, 2009> .
If we add numbers they are treated as days. Add 4,
1: <6:59:34pm Sat Jun 27, 2009> .
It works with the HMS format from before too. Subtract 2@ 3'
15"
.
1: <4:56:32pm Sat Jun 27, 2009> .
But that's not all!!! There is a modulo form for performing modulo arithmetic. For example, 17 mod 24,
1: 17 mod 24 .
Add 10,
1: 3 mod 24 .
This is most useful for forms such as n^p mod M
, which
this will handle efficiently. For example, 3^100000 mod
24
. The naive way would be to find 3^100000
first,
then take the modulus. This involves a computationally expensive
middle step of calculating 3^100000
, a huge number. The
modulo form does it smarter.
But that's not all!!! The calculator can do unit conversions. The version of Emacs (22.3.1) I am typing in right now knows about 159 different units. For example, I push 65 mph onto the stack,
1: 65 mph .
Convert to meters per second with u c
,
1: 29.0576 m / s .
It is flexible about mixing type of units. For example, I enter 3 cubic meters,
3 1: 3 m .
I can convert to gallons,
1: 792.516157074 gal .
I work in a lab without Internet access during the day, so when I need to do various conversions Emacs is indispensable.
The speed of light is also a unit. I can enter 1 c
and
convert to meters per second,
1: 299792458 m / s .
But that's not all!!! As I said, it's a computer algebra system so it understands symbolic math. Remember those algebraic expressions from before? I can operate on those. Let's push some expressions onto the stack,
3: ln(x) 2 a x 2: a x + --- + c b 1: y + c .
Multiply the top two, then add the third,
2 a x 1: ln(x) + (a x + --- + c) (y + c) b .
Expand with a x
, then simplify with a s
,
2 a x y 2 a c x 2 1: ln(x) + a y x + ----- + c y + a c x + ----- + c b b .
Now, one of the coolest features: calculus. Differentiate with respect
to x, with a d
,
1 a y a c 1: - + 2 a y x + --- + 2 a c x + --- x b b .
Or undo that and integrate it,
3 2 3 2 a y x a x y a c x a c x 2 1: x ln(x) - x + ------ + ------ + c x y + ------ + ------ + x c 3 2 b 3 2 b .
That's just awesome! That's a text editor ... doing calculus!
So, that was most of the main features. It was kind of exhausting going through all of that, and I am only scratching the surface of what the calculator can do.
Naturally, it can be extended with some elisp. It provides a
defmath
macro specifically for this.
I bet (hope?) someday it will have a functions for doing Laplace and Fourier transforms.
]]>In my previous article I drew those red dice myself, using GIMP. Since I really enjoyed figuring out how to do it, and actually doing it, here is a little tutorial.
The numbers and sizes are arbitrary, so feel free to adjust things if you think they look better. I am no artist. I am sure someone could take this further to make it look better, perhaps by making the pips look indented, or adding some transparency effect so the dice look clear. I am a GIMP newbie.
In GIMP, create a new 300x300 image and fill it with a dark red. I
used c30808
for this. This will be the base color of the
dice, so if you want differently colored dice, choose whatever color
you like.
Next we use the ellipse selection tool (e) to make pips. In the settings, set the ellipses to a fixed size of 75x75.
Create the ellipse and move it to the upper left-hand corner. Use the arrow keys to nudge it to position (5, 5) — or just type in these values.
Use bucket fill (shift+b) to fill the selected area with white, or whatever color you want your pips to be. Keep doing this to make pips in each corner. The positions should be (5, 220), (220, 5), and (220, 220). This makes the 4 face. Put a fifth pip in the middle, (112, 112), to turn it into a 5 face.
You now have one face of your die. The 1, 2, 3, and 4 faces are the same as the 5 face, but fewer pips. In the layers dialog name the current layer "5". Now, duplicate the layer (shift+ctrl+d) and name this new layer 4. Use either the paintbrush tool (p) to paint your base color over the middle pip, or use the selection tools to remove it.
Keep duplicating layers and removing pips until you have 5 faces: 1, 2, 3, 4, and 5.
Duplicate the "4" layer and create two more pips to make the 6
face. You now have 6 layers, each containing a single face. Here is
my .xcf
when I was done:
dice-faces.xcf.
Now comes the fun part, the real guts of the drawing. You are going to map these layers onto a cube. Go to Filters -> Map -> Map Object. Map to "Box" and select Transparent background and Create new image.
Under the Orientation tab adjust the rotation. For the first die, try something like (20, 40, -5). If you enable Show preview wireframe you can see your adjustments live. Just don't make these values too high or it will make the next step more difficult.
Under the Box tab set the Front, Top, and Left faces to different layers. Note that the opposite sides of a die always add up to 7. That is, 1 is opposite to 6, 2 is opposite 5, and 3 is opposite 4. Here is how a typical die looks.
If you are really picky, you might want to pay attention to the orientation of the 3's, 2's, and 6's and flip those layers accordingly.
Hit Preview! to see your work. If you are happy, click OK. Autocrop the new image with Image -> Autocrop Image.
Do this a few more times with different faces at different orientations. I will make just one more for the example.
Create a new 640x480 image with transparent background. Copy and paste your dice into this image. After each paste, make a new layer (shift+ctrl+n), so each die gets its own layer. Use the Move tool (m) to adjust the dice into a sort-of mid-roll. Whatever looks good.
The last part left is the shadow. First, merge the visible layers (shift+m), then duplicate the remaining layer. Call this new layer "Shadow".
Go to Colors -> Brightness-Contrast. Set contrast to -127. This will be the shadow. If you want a darker or lighter shadow, open the same dialog again and adjust the brightness. Next scale the shadow layer vertically by 50%. You want the width to remain the same.
Select the Sheer tool (shift+s) and sheer the layer in the X direction -100 pixels. Move the shadow layer to the bottom. Now use the Move tool (m) to move the shadow into an appropriate position.
You can add a penumbra by applying a Gaussian blur to the shadow layer: Filters -> Blur -> Gaussian Blur. I blurred mine by 5 pixels.
Finally, you might want to autocrop the layers, then fit the image canvas to the layers, which will get rid of the excess border.
]]>
I have gotten several e-mails lately about using GNU Octave. One specifically was about blurring images in Octave. In response, I am writing this in-depth post to cover spatial filters, and how to use them in GNU Octave (a free implementation of the Matlab programming language). This should be the sort of information you would find near the beginning of an introductory digital image processing textbook, but written out more simply. In the future, I will probably be writing a post covering non-linear spatial and/or frequency domain filters in Octave.
If you want to follow along in Octave, I strongly recommend that you
upgrade to the new Octave 3.0. It is considered stable, but differs
significantly from Octave 2.1, which many people may be used to. You
will also need to install
the image
processing package
from Octave-Forge. To get
help with any Octave function, just type help
<function>
.
The most common linear spatial image filtering involves convolving a filter mask, sometimes called a convolution kernel, over an image, which is a two-dimensional matrix. In the case of an RGB color image, the image is actually composed of three two-dimensional grayscale images, each representing a single color, where each is convolved with the filter mask separately.
Convolution is sliding a mask over an image. The new value at the mask's position is the sum of the value of each element of the mask multiplied by the value of the image at that position. For an example, let's start with 1-dimensional convolution. Define a mask,
5 3 2 4 8
The 2 is the anchor for the mask. Define an image,
0 0 1 2 1 0 0
As we convolve, the mask will extend beyond the image at the edges. One way to handle this is to pad the image with 0's. We start by placing the mask at the left edge. (zero-padding is underlined)
Mask: 5 3 2 4 8 Image: 0 0 0 0 1 2 1 0 0
The first output value is 8, as every other element of the mask is multiplied by zero.
Output: 8 x x x x x x
Now, slide the mask over by one position,
Mask: 5 3 2 4 8 Image: 0 0 0 1 2 1 0 0
The output here is 20, because 8*2 + 4*1 = 20;
Output: 8 20 x x x x x
If we continue sliding the mask along, the output becomes,
Output: 8 20 18 11 13 13 5
Here is the correlation done in Octave interactively,
(filter2()
is the correlation function).
octave> filter2([5 3 2 4 8], [0 0 1 2 1 0 0]) ans = 8 20 18 11 13 13 5
The same thing happens in two-dimensional convolution, with the mask moving in the vertical direction as well, so that each element in the image is covered.
Sometimes you will hear this described as correlation
(Octave's filter2
) or convolution
(Octave's conv2
). The only difference between these
operations is that in convolution the filter masked is rotated 180
degrees. Whoop-dee-doo. Most of the time your filter is probably
symmetrical anyway. So, don't worry much about the difference between
these two. Especially in Octave, where rotating a matrix is easy
(see rot90()
).
Now that we know convolution, let's introduce the sample image we will be using. I carefully put this together in Inkscape, which should give us a nice scalable test image. When converting to a raster format, there is a bit of unwanted anti-aliasing going on (couldn't find a way to turn that off), but it is minimal.
Save that image (the PNG file, not the linked SVG file) where you can
get to it in Octave. Now, let's load the image into Octave
using imread()
.
m = imread("image-test.png");
The image is a grayscale image, so it has only one layer. The size
of m
should be 300x300. You can check this like so (note
the lack of semicolon so we can see the output),
size(m)
You can view the image stored in m
with imshow
. It doesn't care about the image dimensions
or size, so until you resize the plot window, it will probably be
stretched.
imshow(m);
Now, let's make an extremely simple 5x5 filter mask.
f = ones(5) * 1/25
Octave will show us what this matrix looks like.
f = 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000 0.040000
This filter mask is called an averaging filter. It simply averages all the pixels around the image (think about how this works out in the convolution). The effect will be to blur the image. It is important to note here that the sum of the elements is 1 (or 100% if you are thinking of averages). You can check it like so,
sum(f(:))
Now, to convolve the image with the filter mask
using filter2()
.
ave_m = filter2(f, m);
You can view the filtered image again with imshow()
except that we need to first convert the image matrix to a matrix of
8-bit unsigned integers. It is kind of annoying that we need this, but
this is the way it is as of this writing.
ave_m = uint8(ave_m); imshow(ave_m);
Or, we can save this image to a file
using imwrite()
. Just like with imshow()
,
you will first need to convert the image to uint8
.
imwrite("averaged.png", ave_m);
There are a few things to notice about this image. First there is a
black border around the outside of the filtered image. This is due to
the zero-padding (black border) done by filter2()
. The
border of the image had 0's averaged into them. Second, some parts of
the blurred image are "noisy". Here are some selected parts at 4x zoom.
Notice how the circle, and the "a" seem a little bit boxy? This is due to the shape of our filter. Also notice that the blurring isn't as smooth as it could be. This is because the filter itself isn't very smooth. We'll fix both these problems with a new filter later.
First, here is how we can fix the border problem: we pad the image with itself. Octave provides us three easy ways to do this. The first is replicate padding: the padding outside the image is the same as the nearest border pixel in the image. Circular padding: the padding from from the opposite side of the image, as if it was wrapped. This would be a good choice for a periodic image. Last, and probably the most useful is symmetric: the padding is a mirror reflection of the image itself.
To apply symmetric padding, we use the padarray()
function. We only want to pad the image by the amount that the mask
will "hang off". Let's pad the original image for a 9x9 filter, which
will hang off by 4 pixels each way,
mpad = padarray(m, [4 4], "symmetric");
Next, we will replace the averaging filter with a 2D Gaussian
distribution. The Gaussian, or normal, distribution has many wonderful
and useful properties (as a statistics professor I had once said,
anyone who considers themselves to be educated should know about the
normal distribution). One property that makes it useful is that if we
integrate the Gaussian distribution from minus infinity to infinity,
the result is 1. The easiest way to get the curve without having to
type in the equation is using fspecial(): a special
function for creating image filters.
f_gauss = fspecial("gaussian", 9, 2);
This creates a 9x9 Gaussian filter with standard deviation 2. The standard deviation controls the effective size of the filter. Increasing the size of the filter from 9 to 99 will actually have virtually no impact on the final result. It just needs to be large enough to cover the curve. Six times the standard deviation covers over 99% of the curve, so for a standard deviation of 2, a filter of size 13x13 (always make your filters odd in size) is plenty. A larger filter means a longer convolution time. Here is what the 9x9 filter looks like,
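If you are curious what the kernel looks like as a surface, you can plot it directly. A quick sketch:

```octave
f_gauss = fspecial("gaussian", 9, 2);
mesh(f_gauss)     % surface plot of the bell-shaped kernel
sum(f_gauss(:))   % the weights sum to 1, so overall brightness is preserved
```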
And to filter with the Gaussian,
gauss_m = filter2(f_gauss, mpad, "valid"); gauss_m = uint8(gauss_m);
Notice the extra argument "valid"
? Since we padded the
image before filtering, we don't want this padding to be part of the
image result. filter2()
normally returns an image of the
same size as the input image, but we only want the part that didn't
undergo (additional) zero-padding. The result is now the same size as
the original image, but without the messy border,
Also, compare the result to the average filter above. See how much smoother this image is? If you are interested in blurring an image, you will generally want to go with a Gaussian filter like this.
Now I will let you in on a little shortcut. In Matlab, there is a
function called imfilter
which does the padding and
filtering in one step. As of this writing, the Octave-Forge image
package doesn't officially include this function, but it is there in
the source repository now, meaning that it will probably appear in the
next version of that package. I actually wrote my own before I found
this one. You can grab the official one
here:
imfilter.m
With this new function, we can filter with the Gaussian and save like
this. Notice the flipping of the first two arguments
from filter2(), as well as the lack of a conversion
to uint8.
gauss_m = imfilter(m, f_gauss, "symmetric"); imwrite("gauss.png", gauss_m);
imfilter()
will also handle the 3-layer color images
seamlessly. Without it, you would need to run filter2()
on each layer separately.
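For example, a hand-rolled per-layer version might look like this sketch, assuming rgb is an MxNx3 uint8 image and f_gauss is the 9x9 filter from above:

```octave
out = zeros(size(rgb), "uint8");
for k = 1:3
  % pad and filter each color layer on its own
  layer = padarray(double(rgb(:,:,k)), [4 4], "symmetric");
  out(:,:,k) = uint8(filter2(f_gauss, layer, "valid"));
end
imshow(out)
```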
So that is just about all there is. fspecial()
has many
more filters available including motion
blur,
unsharp, and edge detection. For example,
the Sobel edge
detector,
octave:25> fspecial("sobel")
ans =

   1   2   1
   0   0   0
  -1  -2  -1
It is good at detecting edges in one direction. We can rotate this each way to detect edges all over the image.
mf = uint8(zeros(size(m)));
for i = 0:3
  mf += imfilter(m, rot90(fspecial("sobel"), i));
end
imshow(mf)
Happy Hacking with Octave!
While studying for my digital image processing final exam yesterday, I came back across unsharp masking. When I first saw this, I thought it was really neat. This time around, I took the hands-on approach and tried it myself in Octave. The technique has been used by the publishing and printing industry for years.
Unsharp masking is a method of sharpening an image. The idea is this,
Here is an example using a 1-dimensional signal. I blurred the signal
with a 1x5 averaging filter: [1 1 1 1 1] * 1/5
. Then I subtracted
the blurred signal from the original to create a mask. Finally, I
added the unsharp mask to the original signal. For images, we do this
in 2-dimensions, as an image is simply a 2-dimensional signal.
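The three steps can be sketched in Octave like this, where sig is just a made-up step signal:

```octave
sig = double((1:20) >= 10);                % a step edge to sharpen
blurred = conv(sig, ones(1,5)/5, "same");  % step 1: blur the signal
mask = sig - blurred;                      % step 2: subtract to form the unsharp mask
sharp = sig + mask;                        % step 3: add the mask back to the original
```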
When it comes to image processing, we can create the mask in one easy step! This is done by performing a 2-dimensional convolution with a Laplacian kernel. It does steps 1 and 2 at the same time. This is the Laplacian I used in the example at the beginning,
So, to do it in Octave, this is all you need,
octave> i = imread("moon.png");
octave> m = conv2(double(i), [0 -1 0; -1 4 -1; 0 -1 0], "same");
octave> imwrite("moon-sharp.png", i + 2 * uint8(m))
i
is the image and m
is the mask. The mask created in step 2 looks
like this,
You could take the above Octave code and drop it into a little shebang script to create a simple image sharpening program. I leave this as an exercise for the reader.