Articles tagged compsci at null program

Unix "find" expressions compiled to bytecode

2025-12-23T04:20:22Z

In preparation for a future project, I was thinking about the unix find utility. It operates on file system hierarchies, with basic operations selected and filtered using a specialized expression language. Users compose operations using unary and binary operators, grouping with parentheses for precedence. find may apply the expression to a great many files, so compiling it into a bytecode, resolving as much as possible ahead of time, and minimizing the per-element work, seems like a prudent implementation strategy. With some thought, I worked out a technique to do so, which was simpler than I expected, and I’m pleased with the results. I was later surprised that all the real-world find implementations I examined use tree-walk interpreters instead. This article describes how my compiler works, with a runnable example, and lists ideas for improvements.

For a quick overview, the syntax looks like this:

$ find [-H|-L] path... [expression...]

Technically at least one path is required, but most implementations imply . when none are provided. If no expression is supplied, the default is -print, i.e. print everything under each listed path. This prints the whole tree, including directories, under the current directory:

$ find .

To only print files, we could use -type f:

$ find . -type f -a -print

Where -a is the logical AND binary operator. -print always evaluates to true. It’s never necessary to write -a, and adjacent operations are implicitly joined with -a. We can keep chaining them, such as finding all executable files:

$ find . -type f -executable -print

If no -exec, -ok, or -print (or similar side-effect extensions like -print0 or -delete) are present, the whole expression is wrapped in an implicit ( expr ) -print. So we could also write this:

$ find . -type f -executable

Use -o for logical OR. To print all files with the executable bit or with a .exe extension:

$ find . -type f \( -executable -o -name '*.exe' \)

I needed parentheses because -o has lower precedence than -a, and because parentheses are shell metacharacters I also needed to escape them for the shell. It’s a shame find didn’t use [ and ] instead! There’s also a unary logical NOT operator, !. To print all non-executable files:

$ find . -type f ! -executable

Binary operators are short-circuiting, so this:

$ find -type d -a -exec du -sh {} +

Only lists the sizes of directories, because for anything else -type d fails, making the whole expression false without evaluating -exec. Or equivalently with -o:

$ find ! -type d -o -exec du -sh {} +

If it’s not a directory then the left-hand side evaluates to true, and the right-hand side is not evaluated. All three implementations I examined (GNU, BSD, BusyBox) have a -regex extension, and eagerly compile the regular expression even if the operation is never evaluated:

$ find . -print -o -regex [
find: bad regex '[': Invalid regular expression

I was surprised by this because it doesn’t seem to be in the spirit of the original utility (“The second expression shall not be evaluated if the first expression is true.”), and I’m used to the idea of short-circuit validation for the right-hand side of a logical expression. Recompiling for each evaluation would be unwise, but it could happen lazily such that an invalid regular expression only causes an error if it’s actually used. No big deal, just a curiosity.

Bytecode design

A bytecode interpreter needs to track just one result at a time, making it a single register machine, with a 1-bit register at that. I came up with these five opcodes:

halt
not
braf   LABEL
brat   LABEL
action NAME [ARGS...]

Obviously halt stops the program. While I could just let it “run off the end” it’s useful to have an actual instruction so that I can attach a label and jump to it. The not opcode negates the register. braf is “branch if false”, jumping (via relative immediate) to the labeled (in printed form) instruction if the register is false. brat is “branch if true”. Together they implement the -a and -o operators. In practice there are no loops and jumps are always forward: find is not Turing complete.

In a real implementation each possible action (-name, -ok, -print, -type, etc.) would get a dedicated opcode. This requires implementing each operator, at least in part, in order to correctly parse the whole find expression. For now I’m just focused on the bytecode compiler, so this opcode is a stand-in, and it kind of pretends based on looks. Each action sets the register, and actions like -print always set it to true. My compiler is called findc (“find compiler”).

Update: Or try the online demo via Wasm! This version includes a peephole optimizer I wrote after publishing this article.

I assume readers of this program are familiar with push macro and Slice macro. Because of the latter it requires a very recent C compiler, like GCC 15 (e.g. via w64devkit) or Clang 22. Try out some find commands and see how they appear as bytecode. The simplest case is also optimal:

$ findc
// path: .
        action  -print
        halt

Print the path then halt. Simple. Stepping it up:

$ findc -type f -executable
// path: .
        action  -type f
        braf    L1
        action  -executable
L1:     braf    L2
        action  -print
L2:     halt

If the path is not a file, it skips over the rest of the program by way of the second branch instruction. It’s correct, but already we can see room for improvement. This would be better:

        action  -type f
        braf    L1
        action  -executable
        braf    L1
        action  -print
L1:     halt

More complex still:

$ findc -type f \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L1
        action  -name *.exe
L1:     braf    L2
        action  -print
L2:     halt

Inside the parentheses, if -executable succeeds, the right-hand side is skipped. Though the brat jumps straight to a braf. It would be better to jump ahead one more instruction:

        action  -type f
        braf    L2
        action  -executable
        brat    L1
        action  -name *.exe
        braf    L2
L1:     action  -print
L2:     halt

Silly things aren’t optimized either:

$ findc ! ! -executable
// path: .
        action  -executable
        not
        not
        braf    L1
        action  -print
L1:     halt

Two not in a row cancel out, and so these instructions could be eliminated. Overall this compiler could benefit from a peephole optimizer, scanning over the program repeatedly, making small improvements until no more can be made:

Delete not-not.
A brat to a braf re-targets ahead one instruction, and vice versa.
Jumping onto an identical jump adopts its target for itself.
A not-braf might convert to a brat, and vice versa.
Delete side-effect-free instructions before halt (e.g. not-halt).
Exploit always-true actions, e.g. -print-braf can drop the branch.
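The two branch-retargeting rules can be sketched without any fix-up machinery if branches hold absolute instruction indexes, since retargeting never moves an instruction. The following Python sketch assumes that richer representation (tuples with absolute targets), which is hypothetical and not what findc emits.

```python
BRANCHES = ("braf", "brat")

def thread_jumps(program):
    """Retarget branches that land on other branches, until nothing changes.

    Instructions are tuples and branch operands are absolute indexes
    (an assumed representation for this sketch).
    """
    program = list(program)
    changed = True
    while changed:
        changed = False
        for i, op in enumerate(program):
            if op[0] not in BRANCHES:
                continue
            landing = program[op[1]]
            if landing[0] == op[0]:
                # Identical jump: it would be taken too, so adopt its target.
                new_target = landing[1]
            elif landing[0] in BRANCHES:
                # Opposite jump: it cannot be taken, so skip past it.
                new_target = op[1] + 1
            else:
                continue
            program[i] = (op[0], new_target)
            changed = True
    return program

# -type f \( -executable -o -name '*.exe' \) as compiled, absolute targets
before = [
    ("action", "-type", "f"),
    ("braf", 5),
    ("action", "-executable"),
    ("brat", 5),
    ("action", "-name", "*.exe"),
    ("braf", 7),
    ("action", "-print"),
    ("halt",),
]
for op in thread_jumps(before):
    print(op)
```

Because jumps are always forward, every rewrite strictly increases a target, so the loop terminates. Rules that delete instructions, such as not-not, would additionally need to renumber targets.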

Writing a bunch of peephole pattern matchers sounds kind of fun. Though my compiler would first need a slightly richer representation in order to detect and fix up changes to branches. One more for the road:

$ findc -type f ! \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L2
        action  -name *.exe
L2:     not
L1:     braf    L3
        action  -print
L3:     halt

The suboptimal jumps hint at my compiler’s structure. If you’re feeling up for a challenge, pause here to consider how you’d build this compiler, and how it might produce these particular artifacts.

Parsing and compiling

Before I even considered the shape of the bytecode I knew I needed to convert find infix into a compiler-friendly postfix. That is, this:

-type f -a ! ( -executable -o -name *.exe )

Becomes:

-type f -executable -name *.exe -o ! -a

Which, importantly, erases the parentheses. This comes in as an argv array, so it’s already tokenized for us by the shell or runtime. The classic shunting-yard algorithm solves this problem easily enough. We have an output queue that goes into the compiler, and a token stack for tracking -a, -o, !, and (. Then we walk argv in order:

Actions go straight into the output queue.
If we see one of the special stack tokens we push it onto the stack, first popping operators with greater precedence into the queue, stopping at (.
If we see ) we pop the stack into the output queue until we see (.

When we’re out of tokens, pop the remaining stack into the queue. My parser synthesizes -a where it’s implied, so the compiler always sees logical AND. If the expression contains no -exec, -ok, or -print, after processing is complete the parser puts -print then -a into the queue, which effectively wraps the whole expression in ( expr ) -print. By clearing the stack first, the real expression is effectively wrapped in parentheses, so no parenthesis tokens need to be synthesized.
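As a rough illustration of the parsing steps above, here is a shunting-yard sketch in Python. It is not findc's parser: the queue representation (tuples for actions, strings for operators) is hypothetical, and it guesses action arguments purely by looks, treating anything after an action that is not an operator, a parenthesis, or a dash-prefixed token as an argument. That guess would mishandle real actions like -exec.

```python
PRECEDENCE = {"-o": 1, "-a": 2, "!": 3}
SIDE_EFFECTS = {"-exec", "-ok", "-print"}

def to_postfix(argv):
    queue = []             # output: action tuples and operator strings
    stack = []             # pending operators and "("
    after_operand = False  # did the previous token end an operand?

    def push_operator(op):
        # Pop operators of greater precedence first, stopping at "(".
        while (stack and stack[-1] != "("
               and PRECEDENCE[stack[-1]] > PRECEDENCE[op]):
            queue.append(stack.pop())
        stack.append(op)

    i = 0
    while i < len(argv):
        token = argv[i]
        i += 1
        if token in ("-a", "-o"):
            push_operator(token)
            after_operand = False
        elif token == ")":
            while stack and stack[-1] != "(":
                queue.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced )")
            stack.pop()  # discard the "("
            after_operand = True
        else:
            if after_operand:
                push_operator("-a")  # synthesize the implied AND
            if token in ("!", "("):
                stack.append(token)
                after_operand = False
            else:
                args = []
                while (i < len(argv) and argv[i] not in ("!", "(", ")")
                       and not argv[i].startswith("-")):
                    args.append(argv[i])
                    i += 1
                queue.append((token, *args))
                after_operand = True

    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced (")
        queue.append(op)

    # Implicit ( expr ) -print when no side-effect action is present.
    if not any(isinstance(t, tuple) and t[0] in SIDE_EFFECTS for t in queue):
        queue.append(("-print",))
        if len(queue) > 1:
            queue.append("-a")
    return queue

print(to_postfix(["-type", "f", "!", "(", "-executable", "-o",
                  "-name", "*.exe", ")"]))
```

Since the stack is already empty when -print and -a are appended, the original expression behaves as though it were parenthesized.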

I’ve used the shunting-yard algorithm many times before, so this part was easy. The new part was coming up with an algorithm to convert a series of postfix tokens into bytecode. My solution is the compiler maintains a stack of bytecode fragments. That is, each stack element is a sequence of one or more bytecode instructions. Branches use relative addresses, so they’re position-independent, and I can concatenate code fragments without any branch fix-ups. It takes the following actions from queue tokens:

For an action token, create an action instruction, and push it onto the fragment stack as a new fragment.
For a ! token, pop the top fragment, append a not instruction, and push it back onto the stack.
For a -a token, pop the top two fragments, join them with a braf in the middle which jumps just beyond the second fragment. That is, if the first fragment evaluates to false, skip over the second fragment into whatever follows.
For a -o token, just like -a but use brat. If the first fragment is true, we skip over the second fragment.

If the expression is valid, at the end of this process the stack contains exactly one fragment. Append a halt instruction to this fragment, and that’s our program! If the final fragment contained a branch just beyond its end, this halt is that branch target. With a few peephole optimizations it could probably be an optimal program for this instruction set.
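The fragment-stack procedure is short enough to sketch in full. This Python version is an illustration under assumptions, not the findc source: postfix tokens are tuples for actions and strings for operators, and a branch operand is the number of instructions to skip, counted from the instruction after the branch.

```python
def compile_postfix(queue):
    """Turn a postfix token queue into a list of bytecode tuples."""
    fragments = []  # stack of instruction lists
    for token in queue:
        if isinstance(token, tuple):
            # Action: a new single-instruction fragment.
            fragments.append([("action", *token)])
        elif token == "!":
            if not fragments:
                raise ValueError("! without an operand")
            fragments[-1] = fragments[-1] + [("not",)]
        elif token in ("-a", "-o"):
            if len(fragments) < 2:
                raise ValueError(f"{token} needs two operands")
            rhs = fragments.pop()
            lhs = fragments.pop()
            branch = "braf" if token == "-a" else "brat"
            # Relative branch lands just beyond the right-hand fragment.
            fragments.append(lhs + [(branch, len(rhs))] + rhs)
        else:
            raise ValueError(f"unexpected token: {token!r}")
    if len(fragments) != 1:
        raise ValueError("invalid expression")
    return fragments[0] + [("halt",)]

# Postfix for: -type f ! ( -executable -o -name *.exe ), plus implicit -print
queue = [("-type", "f"), ("-executable",), ("-name", "*.exe"),
         "-o", "!", "-a", ("-print",), "-a"]
for op in compile_postfix(queue):
    print(op)
```

Because the offsets are relative, concatenating lhs, the branch, and rhs never disturbs branches already inside either fragment. That same property is why a branch can end up landing on another branch.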

State machines are wonderful tools

2020-12-31T22:48:13Z

This article was discussed on Hacker News.

I love when my current problem can be solved with a state machine. They’re fun to design and implement, and I have high confidence about correctness. They tend to:

Present minimal, tidy interfaces
Require few, fixed resources
Hold no opinions about input and output
Have a compact, concise implementation
Be easy to reason about

State machines are perhaps one of those concepts you heard about in college but never put into practice. Maybe you use them regularly. Regardless, you certainly run into them regularly, from regular expressions to traffic lights.

Morse code decoder state machine

Inspired by a puzzle, I came up with this deterministic state machine for decoding Morse code. It accepts a dot ('.'), dash ('-'), or terminator (0) one at a time, advancing through a state machine step by step:

int morse_decode(int state, int c)
{
    static const unsigned char t[] = {
        0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
        0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
        0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
        0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
        0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
        0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
        0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
        0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a
    };
    int v = t[-state];
    switch (c) {
    case 0x00: return v >> 2 ? t[(v >> 2) + 63] : 0;
    case 0x2e: return v &  2 ? state*2 - 1 : 0;
    case 0x2d: return v &  1 ? state*2 - 2 : 0;
    default:   return 0;
    }
}

It typically compiles to under 200 bytes (table included), requires only a few bytes of memory to operate, and will fit on even the smallest of microcontrollers. The full source listing, documentation, and comprehensive test suite:

https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c

The state machine is trie-shaped, and the 100-byte table t is the static encoding of the Morse code trie:

Dots traverse left, dashes right, terminals emit the character at the current node (terminal state). Stopping on red nodes, or attempting to take an unlisted edge is an error (invalid input).

Each node in the trie is a byte in the table. Dot and dash each have a bit indicating if their edge exists. The remaining bits index into a 1-based character table (at the end of t), and a 0 “index” indicates an empty (red) node. The nodes themselves are laid out as a binary heap in an array: the left and right children of the node at i are found at i*2+1 and i*2+2. No need to waste memory storing edges!

Since C sadly does not have multiple return values, I’m using the sign bit of the return value to create a kind of sum type. A negative return value is a state — which is why the state is negated internally before use. A positive result is a character output. If zero, the input was invalid. Only the initial state is non-negative (zero), which is fine since it’s, by definition, not possible to traverse to the initial state. No c input will produce a bad state.

In the original problem the terminals were missing. Despite being a state machine, morse_decode is a pure function. The caller can save their position in the trie by saving the state integer and trying different inputs from that state.
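To show how a caller might drive such a function, here is a direct Python port of the decoder with the same table, plus a small helper. The helper decode_message and its space-separated input convention are assumptions for this sketch and are not part of the original source.

```python
T = bytes([
    0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
    0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
    0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
    0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
    0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
    0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
    0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
    0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a,
])

def morse_decode(state, c):
    """Port of the C function: negative = next state, positive = character,
    zero = invalid input."""
    v = T[-state]
    if c == 0x00:  # terminator
        return T[(v >> 2) + 63] if v >> 2 else 0
    if c == 0x2e:  # dot
        return state * 2 - 1 if v & 2 else 0
    if c == 0x2d:  # dash
        return state * 2 - 2 if v & 1 else 0
    return 0

def decode_message(text):
    """Decode space-separated Morse letters (hypothetical helper)."""
    out = []
    for letter in text.split():
        state = 0
        for ch in letter:
            state = morse_decode(state, ord(ch))
            if state == 0:
                raise ValueError(f"invalid Morse code: {letter}")
        result = morse_decode(state, 0)
        if result <= 0:
            raise ValueError(f"invalid Morse code: {letter}")
        out.append(chr(result))
    return "".join(out)

print(decode_message("... --- ..."))  # SOS
```

The caller owns the state integer, so nothing stops it from keeping several decoders in flight at once or rewinding to a saved state.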

UTF-8 decoder state machine

The classic UTF-8 decoder state machine is Bjoern Hoehrmann’s Flexible and Economical UTF-8 Decoder. It packs the entire state machine into a relatively small table using clever tricks. It’s easily my favorite UTF-8 decoder.

I wanted to try my own hand at it, so I re-derived the same canonical UTF-8 automaton:

Then I encoded this diagram directly into a much larger (2,064-byte), less elegant table, too large to display inline here:

https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c

However, the trade-off is that the executable code is smaller, faster, and branchless again (by accident, I swear!):

int utf8_decode(int state, long *cp, int byte)
{
    static const signed char table[8][256] = { /* ... */ };
    static const unsigned char masks[2][8] = { /* ... */ };
    int next = table[state][byte];
    *cp = (*cp << 6) | (byte & masks[!state][next&7]);
    return next;
}

Like Bjoern’s decoder, there’s a code point accumulator. The real state machine has 1,109,950 terminal states, and many more edges and nodes. The accumulator is an optimization to track exactly which edge was taken to which node without having to represent such a monstrosity.

Despite the huge table I’m pretty happy with it.

Word count state machine

Here’s another state machine I came up with a while back for counting words one Unicode code point at a time while accounting for Unicode’s various kinds of whitespace. If your input is bytes, then plug this into the above UTF-8 state machine to convert bytes to code points! This one uses a switch instead of a lookup table since the table would be sparse (i.e. let the compiler figure it out).

/* State machine counting words in a sequence of code points.
 *
 * The current word count is the absolute value of the state, so
 * the initial state is zero. Code points are fed into the state
 * machine one at a time, each call returning the next state.
 */
long word_count(long state, long codepoint)
{
    switch (codepoint) {
    case 0x0009: case 0x000a: case 0x000b: case 0x000c: case 0x000d:
    case 0x0020: case 0x0085: case 0x00a0: case 0x1680: case 0x2000:
    case 0x2001: case 0x2002: case 0x2003: case 0x2004: case 0x2005:
    case 0x2006: case 0x2007: case 0x2008: case 0x2009: case 0x200a:
    case 0x2028: case 0x2029: case 0x202f: case 0x205f: case 0x3000:
        return state < 0 ? -state : state;
    default:
        return state < 0 ? state : -1 - state;
    }
}

I’m particularly happy with the edge-triggered state transition mechanism. The sign of the state tracks whether the “signal” is “high” (inside of a word) or “low” (outside of a word), and so it counts rising edges.

The counter is not technically part of the state machine — though it eventually overflows for practical reasons, it isn’t really “finite” — but is rather an external count of the times the state machine transitions from low to high, which is the actual, useful output.
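The sign trick is easy to experiment with in a scripting language. Below is a Python port of the pure function along with a driver; the driver count_words is a hypothetical convenience and only illustrates that the caller threads the state through and takes the absolute value at the end.

```python
WHITESPACE = frozenset([
    0x0009, 0x000a, 0x000b, 0x000c, 0x000d,
    0x0020, 0x0085, 0x00a0, 0x1680, 0x2000,
    0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
    0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
])

def word_count(state, codepoint):
    """Negative state: inside a word. Non-negative: outside a word.
    The absolute value is the number of words seen so far."""
    if codepoint in WHITESPACE:
        return -state if state < 0 else state
    # Rising edge: leaving whitespace bumps the count and flips the sign.
    return state if state < 0 else -1 - state

def count_words(text):
    state = 0
    for ch in text:
        state = word_count(state, ord(ch))
    return abs(state)

print(count_words("hello world"))  # 2
```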

Reader challenge: Find a slick, efficient way to encode all those code points as a table rather than rely on whatever the compiler generates for the switch (chain of branches, jump table?).
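One possible answer, offered as a sketch and not claimed to be the slickest: the 25 code points collapse into 10 sorted ranges, which a binary search can test in a handful of comparisons. The names STARTS, ENDS, and is_whitespace are made up for this example.

```python
import bisect

# The whitespace code points above, merged into sorted inclusive ranges.
STARTS = [0x0009, 0x0020, 0x0085, 0x00a0, 0x1680,
          0x2000, 0x2028, 0x202f, 0x205f, 0x3000]
ENDS   = [0x000d, 0x0020, 0x0085, 0x00a0, 0x1680,
          0x200a, 0x2029, 0x202f, 0x205f, 0x3000]

def is_whitespace(codepoint):
    # Find the last range starting at or before the code point,
    # then check whether the code point falls inside it.
    i = bisect.bisect_right(STARTS, codepoint) - 1
    return i >= 0 and codepoint <= ENDS[i]

print(is_whitespace(0x2003), is_whitespace(ord("A")))  # True False
```

In C the same two arrays would fit in 40 bytes as 16-bit integers, since every code point here is below 0x10000.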

Coroutines and generators as state machines

In languages that support them, state machines can be implemented using coroutines, including generators. I do particularly like the idea of compiler-synthesized coroutines as state machines, though this is a rare treat. The state is implicit in the coroutine at each yield, so the programmer doesn’t have to manage it explicitly. (Though often that explicit control is powerful!)

Unfortunately in practice it always feels clunky. The following implements the word count state machine (albeit in a rather un-Pythonic way). The generator returns the current count and is continued by sending it another code point:

WHITESPACE = {
    0x0009, 0x000a, 0x000b, 0x000c, 0x000d,
    0x0020, 0x0085, 0x00a0, 0x1680, 0x2000,
    0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
    0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
}

def wordcount():
    count = 0
    while True:
        while True:
            # low signal
            codepoint = yield count
            if codepoint not in WHITESPACE:
                count += 1
                break
        while True:
            # high signal
            codepoint = yield count
            if codepoint in WHITESPACE:
                break

However, the generator ceremony dominates the interface, so you’d probably want to wrap it in something nicer — at which point there’s really no reason to use the generator in the first place:

wc = wordcount()
next(wc)  # prime the generator
wc.send(ord('A'))  # => 1
wc.send(ord(' '))  # => 1
wc.send(ord('B'))  # => 2
wc.send(ord(' '))  # => 2

Same idea in Lua, which famously has full coroutines:

local WHITESPACE = {
    [0x0009]=true,[0x000a]=true,[0x000b]=true,[0x000c]=true,
    [0x000d]=true,[0x0020]=true,[0x0085]=true,[0x00a0]=true,
    [0x1680]=true,[0x2000]=true,[0x2001]=true,[0x2002]=true,
    [0x2003]=true,[0x2004]=true,[0x2005]=true,[0x2006]=true,
    [0x2007]=true,[0x2008]=true,[0x2009]=true,[0x200a]=true,
    [0x2028]=true,[0x2029]=true,[0x202f]=true,[0x205f]=true,
    [0x3000]=true
}

function wordcount()
    local count = 0
    while true do
        while true do
            -- low signal
            local codepoint = coroutine.yield(count)
            if not WHITESPACE[codepoint] then
                count = count + 1
                break
            end
        end
        while true do
            -- high signal
            local codepoint = coroutine.yield(count)
            if WHITESPACE[codepoint] then
                break
            end
        end
    end
end

Except for initially priming the coroutine, at least coroutine.wrap() hides the fact that it’s a coroutine.

wc = coroutine.wrap(wordcount)
wc()  -- prime the coroutine
wc(string.byte('A'))  -- => 1
wc(string.byte(' '))  -- => 1
wc(string.byte('B'))  -- => 2
wc(string.byte(' '))  -- => 2

Extra examples

Finally, a couple more examples not worth describing in detail here. First a Unicode case folding state machine:

https://github.com/skeeto/scratch/blob/master/misc/casefold.c

It’s just an interface to do a lookup into the official case folding table. It was an experiment, and I probably wouldn’t use it in a real program.

Second, I’ve mentioned my UTF-7 encoder and decoder before. It’s not obvious from the interface, but internally it’s just a state machine for both encoder and decoder, which is what allows it to “pause” between any pair of input/output bytes.

You might not need machine learning

2020-11-24T04:04:36Z

This article was discussed on Hacker News.

Machine learning is a trendy topic, so naturally it’s often used for inappropriate purposes where a simpler, more efficient, and more reliable solution suffices. The other day I saw an illustrative and fun example of this: Neural Network Cars and Genetic Algorithms. The video demonstrates 2D cars driven by a neural network with weights determined by a genetic algorithm. However, the entire scheme can be replaced by a first-degree polynomial without any loss in capability. The machine learning part is overkill.

Above demonstrates my implementation using a polynomial to drive the cars. My wife drew the background. There’s no path-finding; these cars are just feeling their way along the track, “following the rails” so to speak.

My intention is not to pick on this project in particular. The likely motivation in the first place was a desire to apply a neural network to something. Many of my own projects are little more than a vehicle to try something new, so I can sympathize. Though a professional setting is different, where machine learning should be viewed with a more skeptical eye than it’s usually given. For instance, don’t use active learning to select sample distribution when a quasirandom sequence will do.

In the video, the car has a limited turn radius, and minimum and maximum speeds. (I’ve retained these constraints in my own simulation.) There are five sensors (forward, forward-diagonals, and sides), each sensing the distance to the nearest wall. These are fed into a 3-layer neural network, and the outputs determine throttle and steering. Sounds pretty cool!

A key feature of neural networks is that the outputs are a nonlinear function of the inputs. However, steering a 2D car is simple enough that a linear function is more than sufficient, and neural networks are unnecessary. Here are my equations:

steering = C0*input1 - C0*input3
throttle = C1*input2

I only need three of the original inputs — forward for throttle, and diagonals for steering — and the driver has just two parameters, C0 and C1, the polynomial coefficients. Optimal values depend on the track layout and car configuration, but for my simulation, most values above 0 and below 1 are good enough in most cases. It’s less a matter of crashing and more about navigating the course quickly.
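For a sense of how little code this is, here are the two equations as a Python function. This is only a sketch: the function names are invented, and drawing both coefficients uniformly from between 0 and 1 is an assumption based on the range mentioned above, not necessarily what aidrivers.c does.

```python
import random

def drive(input1, input2, input3, c0, c1):
    """Linear driver: diagonals (input1, input3) steer, forward (input2)
    sets the throttle."""
    steering = c0 * input1 - c0 * input3
    throttle = c1 * input2
    return steering, throttle

def random_driver(rng):
    """Pick coefficients at random to explore the space (assumed range)."""
    return rng.random(), rng.random()

# More room on the input1 side than the input3 side: steer toward it.
print(drive(3.0, 10.0, 1.0, c0=0.5, c1=0.25))  # (1.0, 2.5)

c0, c1 = random_driver(random.Random(1))
print(drive(2.0, 5.0, 2.0, c0, c1)[0])  # 0.0, equal diagonals cancel out
```

With equal diagonal distances the steering term cancels exactly, so the car drives straight down the middle of a symmetric corridor regardless of C0.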

The lengths of the red lines below are the driver’s three inputs:

These polynomials are obviously much faster than a neural network, but they’re also easy to understand and debug. I can confidently reason about the entire range of possible inputs rather than worry about a trained neural network responding strangely to untested inputs.

Instead of doing anything fancy, my program generates the coefficients at random to explore the space. If I wanted to generate a good driver for a course, I’d run a few thousand of these and pick the coefficients that complete the course in the shortest time. For instance, these coefficients make for a fast, capable driver for the course featured at the top of the article:

C0 = 0.896336973, C1 = 0.0354805067

Many constants can complete the track, but some will be faster than others. If I was developing a racing game using this as the AI, I’d not just pick constants that successfully complete the track, but the ones that do it quickly. Here’s what the spread can look like:

If you want to play around with this yourself, here’s my C source code that implements this driving AI and generates the videos and images above:

aidrivers.c

Racetracks are just images drawn in your favorite image editing program using the colors documented in the source header.

Unintuitive JSON Parsing

2019-12-28T17:23:09Z

This article was discussed on Hacker News and on reddit.

Despite the goal of JSON being a subset of JavaScript — which it failed to achieve (update: this was fixed) — parsing JSON is quite unlike parsing a programming language. For invalid inputs, the specific cause of error is often counter-intuitive. Normally this doesn’t matter, but I recently ran into a case where it does.

Consider this invalid input to a JSON parser:

[01]

To a human this might be interpreted as an array containing a number. Either the leading zero is ignored, or it indicates octal, as it does in many languages, including JavaScript. In either case the number in the array would be 1.

However, JSON does not support leading zeros, neither ignoring them nor supporting octal notation. Here’s the railroad diagram for numbers from the JSON specification:

Or in regular expression form:

-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?

If a token starts with 0 then it can only be followed by ., e, or E. It cannot be followed by a digit. So, the natural human response to mentally parsing [01] is: This input is invalid because it contains a number with a leading zero, and leading zeros are not accepted. But this is not actually why parsing fails!

A simple model is a parser consuming tokens from a lexer. The lexer’s job is to read individual code points (characters) from the input and group them into tokens. The possible tokens are string, number, left brace, right brace, left bracket, right bracket, comma, true, false, and null. The lexer skips over insignificant whitespace, and it doesn’t care about structure, like matching braces and brackets. That’s the parser’s job.

In some instances the lexer can fail to parse a token. For example, if while looking for a new token the lexer reads the character %, then the input must be invalid. No token starts with this character. So in some cases invalid input will be detected by the lexer.

The parser consumes tokens from the lexer and, using some state, ensures the sequence of tokens is valid. For example, arrays must be a well-formed sequence of left bracket, value, comma, value, comma, etc., right bracket. One way to reject input with trailing garbage is for the lexer to also produce an EOF (end of file/input) token when there are no more tokens, and the parser could specifically check for that token before accepting the input as valid.

Getting back to the input [01], a JSON parser receives a left bracket token, then updates its bookkeeping to track that it’s parsing an array. When looking for the next token, the lexer sees the character 0 followed by 1. According to the railroad diagram, this is a number token (starts with 0), but 1 cannot be part of this token, so it produces a number token with the contents “0”. Everything is still fine.

Next the lexer sees 1 followed by ]. Since ] cannot be part of a number, it produces another number token with the contents “1”. The parser receives this token but, since it’s parsing an array, it expects either a comma token or a right bracket. Since this is neither, the parser fails with an error about an unexpected number. The parser will not complain about leading zeros because JSON has no concept of leading zeros. Human intuition is right, but for the wrong reasons.
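The lexer half of this model fits in a few lines. The Python sketch below uses the number regular expression from earlier and a simplified string pattern that does not validate escape sequences. It is an illustration of the model, not the lexer of any parser mentioned in this article, and the token names are just labels chosen for the example.

```python
import re

NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')  # simplified: escapes not checked
PUNCTUATION = {
    "{": "left brace", "}": "right brace",
    "[": "left bracket", "]": "right bracket",
    ",": "comma", ":": "colon",
}
KEYWORDS = ("true", "false", "null")

def lex(text):
    """Group characters into (kind, text) tokens, ignoring structure."""
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in " \t\n\r":
            i += 1  # insignificant whitespace
            continue
        if c in PUNCTUATION:
            tokens.append((PUNCTUATION[c], c))
            i += 1
            continue
        keyword = next((k for k in KEYWORDS if text.startswith(k, i)), None)
        if keyword:
            tokens.append((keyword, keyword))
            i += len(keyword)
            continue
        match = STRING.match(text, i) or NUMBER.match(text, i)
        if not match:
            raise ValueError(f"no token starts with {c!r} at position {i}")
        tokens.append(("string" if c == '"' else "number", match.group()))
        i = match.end()
    return tokens

print(lex("[01]"))
# [('left bracket', '['), ('number', '0'), ('number', '1'), ('right bracket', ']')]
```

The lexer happily returns four tokens for [01]. Only a parser tracking array state would notice that a number cannot directly follow another number.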

Try this for yourself in your favorite JSON parser. Or even just pop up the JavaScript console in your browser and try it out:

JSON.parse('[01]');

Firefox reports:

SyntaxError: JSON.parse: expected ‘,’ or ‘]’ after array element

Chromium reports:

SyntaxError: Unexpected number in JSON

Edge reports (note it says “number” not “digit”):

Error: Invalid number at position:3

In all cases the parsers accepted a zero as the first array element, then rejected the input after the second number token for being a bad sequence of tokens. In other words, this is a parser error rather than a lexer error, as a human might intuit.

My JSON parser comes with a testing tool that shows the token stream up until the parser rejects the input, useful for understanding these situations:

$ echo '[01]' | tests/stream
struct expect seq[] = {
    {JSON_ARRAY},
    {JSON_NUMBER, "0"},
    {JSON_ERROR},
};

There’s an argument to be made here that perhaps the human readable error message should mention leading zeros, since that’s likely the cause of the invalid input. That is, a human probably thought JSON allowed leading zeros, and so the clearer message would tell the human that JSON does not allow leading zeros. This is the “more art than science” part of parsing.

It’s the same story with this invalid input:

[truefalse]

From this input, the lexer unambiguously produces left bracket, true, false, right bracket. It’s still up to the parser to reject this input. The only reason we never see truefalse in valid JSON is that the overall structure never allows these tokens to be adjacent, not because they’d be ambiguous. Programming languages have identifiers, and in a programming language this would parse as the identifier truefalse rather than true followed by false. From this point of view, JSON seems quite strange.

Just as before, Firefox reports:

SyntaxError: JSON.parse: expected ‘,’ or ‘]’ after array element

Chromium reports the same error as it does for [true false]:

SyntaxError: Unexpected token f in JSON

Edge’s message is probably a minor bug in their JSON parser:

Error: Expected ‘]’ at position:10

Position 10 is the last character in false. The lexer consumed false from the input, produced a “false” token, then the parser rejected the input. When it reported the error, it chose the end of the invalid token as the error position rather than the start, despite the fact that the only two valid tokens (comma, right bracket) are both a single character. It should also say “Expected ‘]’ or ‘,’” (as Firefox does) rather than just “]”.

Concatenated JSON

That’s all pretty academic. Except for producing nice error messages, nobody really cares so much why the input was rejected. The mismatch between intuition and reality isn’t important.

However, it does come up with concatenated JSON. Some parsers, including mine, will optionally consume multiple JSON values, one after another, from the same input. Here’s an example from one of my favorite command line tools, jq:

$ echo '{"x":0,"y":1}{"x":2,"y":3}{"x":4,"y":5}' | jq '.x + .y'
1
5
9

The input contains three unambiguously-concatenated JSON objects, so the parser produces three distinct objects. Now consider this input, this time outside of the context of an array:

01

Is this invalid, one number, or two numbers? According to the lexer and parser model described above, this is valid and unambiguously two concatenated numbers. Here’s what my parser says:

$ echo '01' | tests/stream
struct expect seq[] = {
    {JSON_NUMBER, "0"},
    {JSON_DONE},
    {JSON_NUMBER, "1"},
    {JSON_DONE},
    {JSON_ERROR},
};

Note: The JSON_DONE “token” indicates acceptance, and the JSON_ERROR token is an EOF indicator, not a hard error. Since jq allows leading zeros in its JSON input, it’s ambiguous and parses this as the number 1, so asking its opinion on this input isn’t so interesting. I surveyed some other JSON parsers that accept concatenated JSON:

Jackson: Reject as leading zero.
Noggit: Reject as leading zero.
yajl: Accept as two numbers.

For my parser it’s the same story for truefalse:

$ echo 'truefalse' | tests/stream
struct expect seq[] = {
    {JSON_TRUE, "true"},
    {JSON_DONE},
    {JSON_FALSE, "false"},
    {JSON_DONE},
    {JSON_ERROR},
};

Neither rejecting nor accepting this input is wrong, per se. Concatenated JSON is outside of the scope of JSON itself, and concatenating arbitrary JSON objects without a whitespace delimiter can lead to weird and ill-formed input. This is all a great argument in favor of Newline Delimited JSON, and its two simple rules:

Line separator is '\n'
Each line is a valid JSON value

This solves the concatenation issue, and, even more, it works well with parsers not supporting concatenation: Split the input on newlines and pass each line to your JSON parser.
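
As a rough sketch of that last step, the splitter below walks a buffer and hands each line to a callback. The names ndjson_split, line_fn, and sum_len are made up for illustration, and skipping empty lines is my own assumption, not one of the two rules. In practice the callback would invoke whatever JSON parser you already have.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical callback type: receives one line (not NUL-terminated).
typedef void (*line_fn)(const char *line, size_t len, void *arg);

// Split buf on '\n' and pass each non-empty line to fn. Returns the
// number of lines delivered. The final line needs no trailing newline.
static size_t
ndjson_split(const char *buf, size_t len, line_fn fn, void *arg)
{
    size_t count = 0;
    const char *end = buf + len;
    while (buf < end) {
        const char *nl = memchr(buf, '\n', (size_t)(end - buf));
        const char *stop = nl ? nl : end;
        if (stop > buf) {               // assumption: skip empty lines
            fn(buf, (size_t)(stop - buf), arg);
            count++;
        }
        buf = nl ? nl + 1 : end;
    }
    return count;
}

// Example callback standing in for a real parser: totals the line lengths.
static void
sum_len(const char *line, size_t len, void *arg)
{
    (void)line;
    *(size_t *)arg += len;
}
```

Because each line is parsed in isolation, a parser with no concatenation support never sees more than one value at a time.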

What's in an Emacs Lambda

2017-12-14T18:18:57Z

There was recently some interesting discussion about correctly using backquotes to express a mixture of data and code. Since lambda expressions seem to evaluate to themselves, what’s the difference? For example, an association list of operations:

'((add . (lambda (a b) (+ a b)))
  (sub . (lambda (a b) (- a b)))
  (mul . (lambda (a b) (* a b)))
  (div . (lambda (a b) (/ a b))))

It looks like it would work, and indeed it does work in this case. However, there are good reasons to actually evaluate those lambda expressions. Eventually invoking the lambda expressions in the quoted form above is equivalent to using eval. So, instead, prefer the backquote form:

`((add . ,(lambda (a b) (+ a b)))
  (sub . ,(lambda (a b) (- a b)))
  (mul . ,(lambda (a b) (* a b)))
  (div . ,(lambda (a b) (/ a b))))

There are a lot of interesting things to say about this, but let’s first reduce it to two very simple cases:

(lambda (x) x)

'(lambda (x) x)

What’s the difference between these two forms? The first is a lambda expression, and it evaluates to a function object. The other is a quoted list that looks like a lambda expression, and it evaluates to a list — a piece of data.

A naive evaluation of these expressions in *scratch* (C-x C-e) suggests they are identical, and so it would seem that quoting a lambda expression doesn’t really matter:

(lambda (x) x)
;; => (lambda (x) x)

'(lambda (x) x)
;; => (lambda (x) x)

However, there are two common situations where this is not the case: byte compilation and lexical scope.

Lambda under byte compilation

It’s a little trickier to evaluate these forms byte compiled in the scratch buffer since that doesn’t happen automatically. But if it did, it would look like this:

;;; -*- lexical-binding: nil; -*-

(lambda (x) x)
;; => #[(x) "\010\207" [x] 1]

'(lambda (x) x)
;; => (lambda (x) x)

The #[...] is the syntax for a byte-code function object. As discussed in detail in my byte-code internals article, it’s a special vector object that contains byte-code, and other metadata, for evaluation by Emacs’ virtual stack machine. Elisp is one of very few languages with readable function objects, and this feature is core to its ahead-of-time byte compilation.

The quote, by definition, prevents evaluation, and so inhibits byte compilation of the lambda expression. It’s vital that the byte compiler does not try to guess the programmer’s intent and compile the expression anyway, since that would interfere with lists that just so happen to look like lambda expressions — i.e. any list containing the lambda symbol.

There are three reasons you want your lambda expressions to get byte compiled:

Byte-compiled functions are significantly faster. That’s the main purpose for byte compilation after all.
The compiler performs static checks, producing warnings and errors ahead of time. This lets you spot certain classes of problems before they occur. The static analysis is even better under lexical scope due to its tighter semantics.
Under lexical scope, byte-compiled closures may use less memory. More specifically, they won’t accidentally keep objects alive longer than necessary. I’ve never seen a name for this implementation issue, but I call it overcapturing. More on this later.

While it’s common for personal configurations to skip byte compilation, Elisp should still generally be written as if it were going to be byte compiled. General rule of thumb: Ensure your lambda expressions are actually evaluated.

Lambda in lexical scope

As I’ve stressed many times, you should always use lexical scope. There’s no practical disadvantage or trade-off involved. Just do it.

Once lexical scope is enabled, the two expressions diverge even without byte compilation:

;;; -*- lexical-binding: t; -*-

(lambda (x) x)
;; => (closure (t) (x) x)

'(lambda (x) x)
;; => (lambda (x) x)

Under lexical scope, lambda expressions evaluate to closures. Closures capture their lexical environment in their closure object — nothing in this particular case. It’s a type of function object, making it a valid first argument to funcall.

Since the quote prevents the second expression from being evaluated, semantically it evaluates to a list that just so happens to look like a (non-closure) function object. Invoking a data object as a function is like using eval — i.e. executing data as code. Everyone already knows eval should not be used lightly.

It’s a little more interesting to look at a closure that actually captures a variable, so here’s a definition for constantly, a higher-order function that returns a closure that accepts any number of arguments and returns a particular constant:

(defun constantly (x)
  (lambda (&rest _) x))

Without byte compiling it, here’s an example of its return value:

(constantly :foo)
;; => (closure ((x . :foo) t) (&rest _) x)

The environment has been captured as an association list (with a trailing t), and we can plainly see that the variable x is bound to the symbol :foo in this closure. Consider that we could manipulate this data structure (e.g. setcdr or setf) to change the binding of x for this closure. This is essentially how closures mutate their own environment. Moreover, closures from the same environment share structure, so such mutations are also shared. More on this later.

Semantically, closures are distinct objects (via eq), even if the variables they close over are bound to the same value. This is because they each have a distinct environment attached to them, even if in some invisible way.

(eq (constantly :foo) (constantly :foo))
;; => nil

Without byte compilation, this is true even when there’s no lexical environment to capture:

(defun dummy ()
  (lambda () t))

(eq (dummy) (dummy))
;; => nil

The byte compiler is smart, though. As an optimization, the same closure object is reused when possible, avoiding unnecessary work, including multiple object allocations. Though this is a bit of an abstraction leak. A function can (ab)use this to introspect whether it’s been byte compiled:

(defun have-i-been-compiled-p ()
  (let ((funcs (vector nil nil)))
    (dotimes (i 2)
      (setf (aref funcs i) (lambda ())))
    (eq (aref funcs 0) (aref funcs 1))))

(have-i-been-compiled-p)
;; => nil

(byte-compile 'have-i-been-compiled-p)

(have-i-been-compiled-p)
;; => t

The trick here is to evaluate the exact same non-capturing lambda expression twice, which requires a loop (or at least some sort of branch). Semantically we should think of these closures as being distinct objects, but, if we squint our eyes a bit, we can see the effects of the behind-the-scenes optimization.

Don’t actually do this in practice, of course. That’s what byte-code-function-p is for, which won’t rely on a subtle implementation detail.

Overcapturing

I mentioned before that one of the potential gotchas of not byte compiling your lambda expressions is overcapturing closure variables in the interpreter.

To evaluate lisp code, Emacs has both an interpreter and a virtual machine. The interpreter evaluates code in list form: cons cells, numbers, symbols, etc. The byte compiler is like the interpreter, but instead of directly executing those forms, it emits byte-code that, when evaluated by the virtual machine, produces identical visible results to the interpreter — in theory.

What this means is that Emacs contains two different implementations of Emacs Lisp, one in the interpreter and one in the byte compiler. The Emacs developers have been maintaining and expanding these implementations side-by-side for decades. A pitfall to this approach is that the implementations can, and do, diverge in their behavior. We saw this above with that introspective function, and it comes up in practice with advice.

Another way they diverge is in closure variable capture. For example:

;;; -*- lexical-binding: t; -*-

(defun overcapture (x y)
  (when y
    (lambda () x)))

(overcapture :x :some-big-value)
;; => (closure ((y . :some-big-value) (x . :x) t) nil x)

Notice that the closure captured y even though it’s unnecessary. This is because the interpreter doesn’t, and shouldn’t, take the time to analyze the body of the lambda to determine which variables should be captured. That would need to happen at run-time each time the lambda is evaluated, which would make the interpreter much slower. Overcapturing can get pretty messy if macros are introducing their own hidden variables.

On the other hand, the byte compiler can do this analysis just once at compile-time. And it’s already doing the analysis as part of its job. It can avoid this problem easily:

(overcapture :x :some-big-value)
;; => #[0 "\300\207" [:x] 1]

It’s clear that :some-big-value isn’t present in the closure.

But… how does this work?

How byte compiled closures are constructed

Recall from the internals article that the four core elements of a byte-code function object are:

Parameter specification
Byte-code string (opcodes)
Constants vector
Maximum stack usage

While a closure seems like compiling a whole new function each time the lambda expression is evaluated, there’s actually not that much to it! Namely, the behavior of the function remains the same. Only the closed-over environment changes.

What this means is that closures produced by a common lambda expression can all share the same byte-code string (second element). Their bodies are identical, so they compile to the same byte-code. Where they differ is in their constants vectors (third element), which get filled out according to the closed-over environment. It’s clear just from examining the outputs:

(constantly :a)
;; => #[128 "\300\207" [:a] 2]

(constantly :b)
;; => #[128 "\300\207" [:b] 2]

constantly has three of the four components of the closure in its own constant pool. Its job is to construct the constants vector, and then assemble the whole thing into a byte-code function object (#[...]). Here it is with M-x disassemble:

     constant  make-byte-code
     constant  128
     constant  "\300\207"
     constant  vector
     stack-ref 4
     call      1
     constant  2
     call      4
     return

(Note: since the byte compiler doesn’t produce perfectly optimal code, I’ve simplified it for this discussion.)

It pushes most of its constants on the stack. Then stack-ref 4 puts x on the stack. Then it calls vector (call 1) to create the constants vector. Finally, it constructs the function object (#[...]) by calling make-byte-code (call 4).

Since this might be clearer, here’s the same thing expressed back in terms of Elisp:

(defun constantly (x)
  (make-byte-code 128 "\300\207" (vector x) 2))

To see the disassembly of the closure’s byte-code:

(disassemble (constantly :x))

The result isn’t very surprising:

0       constant  :x
1       return

Things get a little more interesting when mutation is involved. Consider this adder closure generator, which mutates its environment every time it’s called:

(defun adder ()
  (let ((total 0))
    (lambda () (cl-incf total))))

(let ((count (adder)))
  (funcall count)
  (funcall count)
  (funcall count))
;; => 3

(adder)
;; => #[0 "\300\211\242T\240\207" [(0)] 2]

The adder essentially works like this:

(defun adder ()
  (make-byte-code 0 "\300\211\242T\240\207" (vector (list 0)) 2))

In theory, this closure could operate by mutating its constants vector directly. But that wouldn’t be much of a constants vector, now would it!? Instead, mutated variables are boxed inside a cons cell. Closures don’t share constant vectors, so the main reason for boxing is to share variables between closures from the same environment. That is, they have the same cons in each of their constant vectors.

There’s no equivalent Elisp for the closure in adder, so here’s the disassembly:

     constant  (0)
     dup
     car-safe
     add1
     setcar
     return

It puts two references to the boxed integer on the stack (constant, dup), unboxes the top one (car-safe), increments that unboxed integer, and stores it back in the box (setcar) via the bottom reference, leaving the incremented value behind to be returned.

This all gets a little more interesting when closures interact:

(defun fancy-adder ()
  (let ((total 0))
    `(:add ,(lambda () (cl-incf total))
      :set ,(lambda (v) (setf total v))
      :get ,(lambda () total))))

(let ((counter (fancy-adder)))
  (funcall (plist-get counter :set) 100)
  (funcall (plist-get counter :add))
  (funcall (plist-get counter :add))
  (funcall (plist-get counter :get)))
;; => 102

(fancy-adder)
;; => (:add #[0 "\300\211\242T\240\207" [(0)] 2]
;;     :set #[257 "\300\001\240\207" [(0)] 3]
;;     :get #[0 "\300\242\207" [(0)] 1])

This is starting to resemble object oriented programming, with methods acting upon fields stored in a common, closed-over environment.

All three closures share a common variable, total. Since I didn’t use print-circle, this isn’t obvious from the last result, but all of those (0) conses are the same object. When one closure mutates the box, they all see the change. Here’s essentially how fancy-adder is transformed by the byte compiler:

(defun fancy-adder ()
  (let ((box (list 0)))
    (list :add (make-byte-code 0 "\300\211\242T\240\207" (vector box) 2)
          :set (make-byte-code 257 "\300\001\240\207" (vector box) 3)
          :get (make-byte-code 0 "\300\242\207" (vector box) 1))))

The backquote in the original fancy-adder brings this article full circle. This final example wouldn’t work correctly if those lambdas weren’t evaluated properly.

Finding the Best 64-bit Simulation PRNG

2017-09-21T21:25:00Z

August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256**. It has essentially the same performance, but better statistical properties. xoshiro256** is now my preferred PRNG.

I use pseudo-random number generators (PRNGs) a whole lot. They’re an essential component in lots of algorithms and processes.

Monte Carlo simulations, where PRNGs are used to compute numeric estimates for problems that are difficult or impossible to solve analytically.
Monte Carlo tree search AI, where massive numbers of games are played out randomly in search of an optimal move. This is a specific application of the last item.
Genetic algorithms, where a PRNG creates the initial population, and then later guides in mutation and breeding of selected solutions.
Cryptography, where cryptographically-secure PRNGs (CSPRNGs) produce output that is predictable for recipients who know a particular secret, but not for anyone else. This article is only concerned with plain PRNGs.

For the first three “simulation” uses, there are two primary factors that drive the selection of a PRNG. These factors can be at odds with each other:

The PRNG should be very fast. The application should spend its time running the actual algorithms, not generating random numbers.
PRNG output should have robust statistical qualities. Bits should appear to be independent and the output should closely follow the desired distribution. Poor quality output will negatively affect the algorithms using it. Also just as important is how you use it, but this article will focus only on generating bits.

In other situations, such as in cryptography or online gambling, another important property is that an observer can’t learn anything meaningful about the PRNG’s internal state from its output. For the three simulation cases I care about, this is not a concern. Only speed and quality properties matter.

Depending on the programming language, the PRNGs found in various standard libraries may be of dubious quality. They’re slower than they need to be, or have poorer quality than required. In some cases, such as rand() in C, the algorithm isn’t specified, and you can’t rely on it for anything outside of trivial examples. In other cases the algorithm and behavior are specified, but you could easily do better yourself.

My preference is to BYOPRNG: Bring Your Own Pseudo-random Number Generator. You get reliable, identical output everywhere. Also, in the case of C and C++ — and if you do it right — by embedding the PRNG in your project, it will get inlined and unrolled, making it far more efficient than a slow call into a dynamic library.

A fast PRNG is going to be small, making it a great candidate for embedding as, say, a header library. That leaves just one important question, “Can the PRNG be small and have high quality output?” In the 21st century, the answer to this question is an emphatic “yes!”

For the past few years my main go-to for a drop-in PRNG has been xorshift*. The body of the function is 6 lines of C, and its entire state is a 64-bit integer, directly seeded. However, there are a number of choices here, including other variants of Xorshift. How do I know which one is best? The only way to know is to test it, hence my 64-bit PRNG shootout:

64-bit PRNG Shootout

Sure, there are other such shootouts, but they’re all missing something I want to measure. I also want to test in an environment very close to how I’d use these PRNGs myself.

Shootout results

Before getting into the details of the benchmark and each generator, here are the results. These tests were run on an i7-6700 (Skylake) running Linux 4.9.0.

                               Speed (MB/s)
PRNG           FAIL  WEAK  gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline          X     X      15000       13100
blowfishcbc16     0     1        169         157
blowfishcbc4      0     5        725         676
blowfishctr16     1     3        187         184
blowfishctr4      1     5        890        1000
mt64              1     7       1700        1970
pcg64             0     4       4150        3290
rc4               0     5        366         185
spcg64            0     8       5140        4960
xoroshiro128+     0     6       8100        7720
xorshift128+      0     2       7660        6530
xorshift64*       0     3       4990        5060

The clear winner is xoroshiro128+, with a function body of just 7 lines of C. It’s clearly the fastest, and the output had no observed statistical failures. However, that’s not the whole story. A couple of the other PRNGs have advantages that situationally make them better suited than xoroshiro128+. I’ll go over these in the discussion below.

These two versions of GCC and Clang were chosen because these are the latest available in Debian 9 “Stretch.” It’s easy to build and run the benchmark yourself if you want to try a different version.

Speed benchmark

In the speed benchmark, the PRNG is initialized, a 1-second alarm(1) is set, then the PRNG fills a large volatile buffer of 64-bit unsigned integers again and again as quickly as possible until the alarm fires. The amount of memory written is measured as the PRNG’s speed.

The baseline “PRNG” writes zeros into the buffer. This represents the absolute speed limit that no PRNG can exceed.

The purpose for making the buffer volatile is to force the entire output to actually be “consumed” as far as the compiler is concerned. Otherwise the compiler plays nasty tricks to make the program do as little work as possible. Another way to deal with this would be to write(2) the buffer, but of course I didn’t want to introduce unnecessary I/O into a benchmark.

On Linux, SIGALRM was impressively consistent between runs, meaning it was perfectly suitable for this benchmark. To account for any process scheduling wonkiness, the benchmark was run 8 times and only the fastest time was kept.

The SIGALRM handler sets a volatile global variable that tells the generator to stop. The PRNG call was unrolled 8 times to keep the alarm check from significantly impacting the benchmark. You can see the effect for yourself by changing UNROLL to 1 (i.e. “don’t unroll”) in the code. Unrolling beyond 8 times had no measurable effect on my tests.

Due to the PRNGs being inlined, this unrolling makes the benchmark less realistic, and it shows in the results. Using volatile for the buffer helped to counter this effect and reground the results. This is a fuzzy problem, and there’s not really any way to avoid it, but I will also discuss this below.
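
For readers who want the shape of the harness without opening the repository, here is a simplified sketch built from the description above. It is not the shootout's actual source: the buffer size, the bench function, and the choice of xorshift64* as the generator under test are assumptions for illustration, and the fixed-count inner loop relies on the compiler to unroll it.

```c
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define UNROLL 8
#define BUFLEN (1 << 14)                 // assumed size, multiple of UNROLL

static volatile sig_atomic_t running;    // cleared by the SIGALRM handler
static volatile uint64_t buffer[BUFLEN]; // volatile: every store must happen

static void
on_alarm(int sig)
{
    (void)sig;
    running = 0;
}

static inline uint64_t
xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}

// Fill the buffer over and over for about one second and return the
// number of bytes written, i.e. roughly bytes per second.
static uint64_t
bench(uint64_t seed)
{
    uint64_t s[1] = {seed};
    uint64_t nbytes = 0;
    size_t i = 0;

    running = 1;
    signal(SIGALRM, on_alarm);
    alarm(1);
    while (running) {                    // one alarm check per UNROLL outputs
        for (int k = 0; k < UNROLL; k++)
            buffer[i + k] = xorshift64star(s);
        i = (i + UNROLL) % BUFLEN;
        nbytes += UNROLL * sizeof(buffer[0]);
    }
    return nbytes;
}
```

Dividing the return value of bench by 1,000,000 gives a figure comparable in spirit to the MB/s column, though a single run like this skips the best-of-8 selection.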

Statistical benchmark

To measure the statistical quality of each PRNG — mostly as a sanity check — the raw binary output was run through dieharder 3.31.1:

prng | dieharder -g200 -a -m4

This statistical analysis has no timing characteristics and the results should be the same everywhere. You would only need to re-run it to test with a different version of dieharder, or a different analysis tool.

There’s not much information to glean from this part of the shootout. It mostly confirms that all of these PRNGs would work fine for simulation purposes. The WEAK results are not very significant and are only useful for breaking ties. Even a true RNG will get some WEAK results. For example, the x86 RDRAND instruction (not included in the actual shootout) got 7 WEAK results in my tests.

The FAIL results are more significant, but a single failure doesn’t mean much. A non-failing PRNG should be preferred to an otherwise equal PRNG with a failure.

Individual PRNGs

Admittedly the definition for “64-bit PRNG” is rather vague. My high performance targets are all 64-bit platforms, so the highest PRNG throughput will be built on 64-bit operations (if not wider). The original plan was to focus on PRNGs built from 64-bit operations.

Curiosity got the best of me, so I included some PRNGs that don’t use any 64-bit operations. I just wanted to see how they stacked up.

Blowfish

One of the reasons I wrote a Blowfish implementation was to evaluate its performance and statistical qualities, so naturally I included it in the benchmark. It only uses 32-bit addition and 32-bit XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit integer. There are two different properties that combine to make four variants in the benchmark: number of rounds and block mode.

Blowfish normally uses 16 rounds. This makes it a lot slower than a non-cryptographic PRNG but gives it a security margin. I don’t care about the security margin, so I included a 4-round variant. As expected, it’s about four times faster.

The other feature I tested is the block mode: Cipher Block Chaining (CBC) versus Counter (CTR) mode. In CBC mode it encrypts zeros as plaintext. This just means it’s encrypting its last output. The ciphertext is the PRNG’s output.

In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster than CBC in the 16-round variant and 23% faster in the 4-round variant. The reason is simple, and it’s in part an artifact of unrolling the generation loop in the benchmark.

In CBC mode, each output depends on the previous, but in CTR mode all blocks are independent. Work can begin on the next output before the previous output is complete. The x86 architecture uses out-of-order execution to achieve many of its performance gains: Instructions may be executed in a different order than they appear in the program, though their observable effects must generally be ordered correctly. Breaking dependencies between instructions allows out-of-order execution to be fully exercised. It also gives the compiler more freedom in instruction scheduling, though the volatile accesses cannot be reordered with respect to each other (hence it helping to reground the benchmark).
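
The dependency difference is easier to see in code. In this sketch toy_encrypt is a stand-in mixing function, not Blowfish, and cbc_next and ctr_next are hypothetical names. The point is only where each call gets its input from.

```c
#include <stdint.h>

// Stand-in for a 64-bit block cipher under a fixed key. This is NOT
// Blowfish, just an invertible mixing function with an arbitrary odd
// multiplier, used here so the example is self-contained.
static uint64_t
toy_encrypt(uint64_t block)
{
    block ^= block >> 33;
    block *= UINT64_C(0xff51afd7ed558ccd);
    block ^= block >> 33;
    return block;
}

// CBC with all-zero plaintext: the input to each encryption is the
// previous ciphertext, so output N cannot begin until N-1 is finished.
static uint64_t
cbc_next(uint64_t *prev)
{
    *prev = toy_encrypt(*prev ^ 0);     // zero plaintext XOR previous block
    return *prev;
}

// CTR: the input is just a counter. Apart from the cheap increment,
// consecutive outputs are independent and can overlap in the pipeline.
static uint64_t
ctr_next(uint64_t *counter)
{
    return toy_encrypt((*counter)++);
}
```

In an unrolled loop, eight ctr_next calls expose eight independent encryptions, while eight cbc_next calls form a single serial chain.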

Statistically, the 4-round cipher was not significantly worse than the 16-round cipher. For simulation purposes the 4-round cipher would be perfectly sufficient, though xoroshiro128+ is still more than 9 times faster without sacrificing quality.

On the other hand, CTR mode had a single failure in both the 4-round (dab_filltree2) and 16-round (dab_filltree) variants. At least for Blowfish, is there something that makes CTR mode less suitable than CBC mode as a PRNG?

In the end Blowfish is too slow and too complicated to serve as a simulation PRNG. This was entirely expected, but it’s interesting to see how it stacks up.

Mersenne Twister (MT19937-64)

Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. I tested the 64-bit variant, MT19937-64, and there are four problems:

It’s between 1/4 and 1/5 the speed of xoroshiro128+.
It’s got a large state: 2,500 bytes. Versus xoroshiro128+’s 16 bytes.
Its implementation is three times bigger than xoroshiro128+, and much more complicated.
It had one statistical failure (dab_filltree2).

Curiously my implementation is 16% faster with Clang than GCC. Since Mersenne Twister isn’t seriously in the running, I didn’t take time to dig into why.

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.

Permuted Congruential Generator (PCG)

The Permuted Congruential Generator (PCG) has some really interesting history behind it, particularly with its somewhat unusual paper, controversial for both its excessive length (58 pages) and informal style. It’s in close competition with Xorshift and xoroshiro128+. I was really interested in seeing how it stacked up.

PCG is really just a Linear Congruential Generator (LCG) that doesn’t output the lowest bits (too poor quality), and has an extra permutation step to make up for the LCG’s other weaknesses. I included two variants in my benchmark: the official PCG and a “simplified” PCG (sPCG) with a simple permutation step. sPCG is just the first PCG presented in the paper (34 pages in!).

Here’s essentially what the simplified version looks like:

uint32_t
spcg32(uint64_t s[1])
{
    uint64_t m = 0x9b60933458e17d7d;
    uint64_t a = 0xd737232eeccdf7ed;
    *s = *s * m + a;
    int shift = 29 - (*s >> 61);
    return *s >> shift;
}

The third line with the modular multiplication and addition is the LCG. The bit shift is the permutation. This PCG uses the most significant three bits of the result to determine which 32 bits to output. That’s the novel component of PCG.

The two constants are entirely my own devising. They’re two 64-bit primes generated using Emacs’ M-x calc: 2 64 ^ k r k n k p k p k p.

Heck, that’s so simple that I could easily memorize this and code it from scratch on demand. Key takeaway: This is one way that PCG is situationally better than xoroshiro128+. In a pinch I could use Emacs to generate a couple of primes and code the rest from memory. If you participate in coding competitions, take note.

However, you probably also noticed PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software.

Instead, I doubled up on everything to run two PRNGs in parallel. Despite the doubling in state size, the period doesn’t get any larger since the PRNGs don’t interact with each other. We get something in return, though. Remember what I said about out-of-order execution? Except for the last step combining their results, since the two PRNGs are independent, doubling up shouldn’t quite halve the performance, particularly with the benchmark loop unrolling business.

Here’s my doubled-up version:

uint64_t
spcg64(uint64_t s[2])
{
    uint64_t m  = 0x9b60933458e17d7d;
    uint64_t a0 = 0xd737232eeccdf7ed;
    uint64_t a1 = 0x8b260b70b8e98891;
    uint64_t p0 = s[0];
    uint64_t p1 = s[1];
    s[0] = p0 * m + a0;
    s[1] = p1 * m + a1;
    int r0 = 29 - (p0 >> 61);
    int r1 = 29 - (p1 >> 61);
    uint64_t high = p0 >> r0;
    uint32_t low  = p1 >> r1;
    return (high << 32) | low;
}

The “full” PCG has some extra shifts that make it 25% (GCC) to 50% (Clang) slower than the “simplified” PCG, but it does halve the WEAK results.

In this 64-bit form, both are significantly slower than xoroshiro128+. However, if you find yourself only needing 32 bits at a time (always throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is faster than using xoroshiro128+ and throwing away half its output.

RC4

This is another CSPRNG where I was curious how it would stack up. It only uses 8-bit operations, and it generates a 64-bit integer one byte at a time. It’s the slowest after 16-round Blowfish and generally not useful as a simulation PRNG.

xoroshiro128+

xoroshiro128+ is the obvious winner in this benchmark and it seems to be the best 64-bit simulation PRNG available. If you need a fast, quality PRNG, just drop these 11 lines into your C or C++ program:

uint64_t
xoroshiro128plus(uint64_t s[2])
{
    uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    uint64_t result = s0 + s1;
    s1 ^= s0;
    s[0] = ((s0 << 55) | (s0 >> 9)) ^ s1 ^ (s1 << 14);
    s[1] = (s1 << 36) | (s1 >> 28);
    return result;
}

There’s one important caveat: That 16-byte state must be well-seeded. Having lots of zero bytes will lead to terrible initial output until the generator mixes it all up. Having all zero bytes will completely break the generator. If you’re going to seed from, say, the unix epoch, then XOR it with 16 static random bytes.
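
A minimal sketch of that seeding advice follows. The function name is hypothetical, and the two constants are placeholders borrowed from the sPCG example for convenience. Substitute 16 random bytes of your own.

```c
#include <stdint.h>

// Fill the 16-byte xoroshiro128+ state from a low-entropy 64-bit seed by
// XORing it with two fixed 64-bit constants. The constants here are
// arbitrary placeholders, not values recommended by the PRNG's authors.
static void
xoroshiro128plus_seed(uint64_t s[2], uint64_t seed)
{
    s[0] = seed ^ UINT64_C(0x9b60933458e17d7d);
    s[1] = seed ^ UINT64_C(0xd737232eeccdf7ed);
}
```

Typical use would be something like xoroshiro128plus_seed(s, (uint64_t)time(0)). Because the two constants differ, s[0] and s[1] can never both be zero, so the all-zero state is avoided for every possible seed. The first few outputs are still only as varied as the seed itself.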

xorshift128+ and xorshift64*

These generators are closely related and, like I said, xorshift64* was what I used for years. Looks like it’s time to retire it.

uint64_t
xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}

However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will tolerate weak seeding so long as it’s not literally zero. Zero will also break this generator.
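
To see why zero is fatal, note that every step of the generator is a shift, an XOR, or a multiply, and all of those map zero to zero. This small check, with a hypothetical helper named stuck_at_zero, confirms that the state never leaves zero.

```c
#include <stdint.h>

static uint64_t
xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}

// Returns 1 if a zero-seeded generator produces only zeros, with its
// state remaining zero, across n consecutive calls.
static int
stuck_at_zero(int n)
{
    uint64_t s[1] = {0};
    for (int i = 0; i < n; i++) {
        if (xorshift64star(s) != 0 || s[0] != 0)
            return 0;
    }
    return 1;
}
```

Any nonzero seed escapes this trap, since each xorshift step is invertible and the multiplier is odd.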

If it weren’t for xoroshiro128+, then xorshift128+ would have been the winner of the benchmark and my new favorite choice.

uint64_t
xorshift128plus(uint64_t s[2])
{
    uint64_t x = s[0];
    uint64_t y = s[1];
    s[0] = y;
    x ^= x << 23;
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
    return s[1] + y;
}

It’s a lot like xoroshiro128+, including the need to be well-seeded, but it’s just slow enough to lose out. There’s no reason to use xorshift128+ instead of xoroshiro128+.

Conclusion

My own takeaway (until I re-evaluate some years in the future):

The best 64-bit simulation PRNG is xoroshiro128+.
“Simplified” PCG can be useful in a pinch.
When only 32-bit integers are necessary, use PCG.

Things can change significantly between platforms, though. Here’s the shootout on an ARM Cortex-A53:

                    Speed (MB/s)
PRNG         gcc-5.4.0   clang-3.8.0
------------------------------------
baseline          2560        2400
blowfishcbc16       36.5        45.4
blowfishcbc4       135         173
blowfishctr16       36.4        45.2
blowfishctr4       133         168
mt64               207         254
pcg64              980         712
rc4                 96.6        44.0
spcg64            1021         948
xoroshiro128+     2560        1570
xorshift128+      2560        1520
xorshift64*       1360        1080

LLVM is not as mature on this platform, but, with GCC, both xoroshiro128+ and xorshift128+ matched the baseline! It seems memory is the bottleneck.

So don’t necessarily take my word for it. You can run this shootout in your own environment — perhaps even tossing in more PRNGs — to find what’s appropriate for your own situation.

Some Performance Advantages of Lexical Scope

2016-12-22T02:33:36Z

I recently had a discussion with Xah Lee about lexical scope in Emacs Lisp. The topic was why lexical-binding exists at a file-level when there was already lexical-let (from cl-lib), prompted by my previous article on JIT byte-code compilation. The specific context is Emacs Lisp, but these concepts apply to language design in general.

Until Emacs 24.1 (June 2012), Elisp only had dynamically scoped variables — a feature, mostly by accident, common to old lisp dialects. While dynamic scope has some selective uses, it’s widely regarded as a mistake for local variables, and virtually no other languages have adopted it.

Way back in 1993, Dave Gillespie’s deviously clever lexical-let macro was committed to the cl package, providing a rudimentary form of opt-in lexical scope. The macro walks its body replacing local variable names with guaranteed-unique gensym names: the exact same technique used in macros to create “hygienic” bindings that aren’t visible to the macro body. It essentially “fakes” lexical scope within Elisp’s dynamic scope by preventing variable name collisions.

For example, here’s one of the consequences of dynamic scope.

(defun inner ()
  (setq v :inner))

(defun outer ()
  (let ((v :outer))
    (inner)
    v))

(outer)
;; => :inner

The “local” variable v in outer is visible to its callee, inner, which can access and manipulate it. The meaning of the free variable v in inner depends entirely on the run-time call stack. It might be a global variable, or it might be a local variable for a caller, direct or indirect.

Using lexical-let deconflicts these names, giving the effect of lexical scope.

(defvar v)

(defun lexical-outer ()
  (lexical-let ((v :outer))
    (inner)
    v))

(lexical-outer)
;; => :outer

But there’s more to lexical scope than this. Closures only make sense in the context of lexical scope, and the most useful feature of lexical-let is that lambda expressions evaluate to closures. The macro implements this using a technique called closure conversion. Additional parameters are added to the original lambda function, one for each lexical variable (and not just each closed-over variable), and the whole thing is wrapped in another lambda function that invokes the original lambda function with the additional parameters filled with the closed-over variables. Yes, the variables themselves (i.e. symbols), not just their values (i.e. pass-by-reference). The last point means different closures can properly close over the same variables, and they can bind new values.

To roughly illustrate how this works, the first lambda expression below, which closes over the lexical variables x and y, would be converted into the latter by lexical-let. The #: is Elisp’s syntax for uninterned variables. So #:x is a symbol x, but not the symbol x (see print-gensym).

;; Before conversion:
(lambda ()
  (+ x y))

;; After conversion:
(lambda (&rest args)
  (apply (lambda (x y)
           (+ (symbol-value x)
              (symbol-value y)))
         '#:x '#:y args))

I’ve said on multiple occasions that lexical-binding: t has significant advantages, both in performance and static analysis, and so it should be used for all future Elisp code. The only reason it’s not the default is because it breaks some old (badly written) code. However, lexical-let doesn’t realize any of these advantages! In fact, it has worse performance than straightforward dynamic scope with let.

New symbol objects are allocated and initialized (make-symbol) on each run-time evaluation, one per lexical variable.
Since it’s just faking it, lexical-let still uses dynamic bindings, which are more expensive than lexical bindings. It varies depending on the C compiler that built Emacs, but dynamic variable accesses (opcode varref) take around 30% longer than lexical variable accesses (opcode stack-ref). Assignment is far worse, where dynamic variable assignment (varset) takes 650% longer than lexical variable assignment (stack-set). How I measured all this is a topic for another article.
The “lexical” variables are accessed using symbol-value, a full function call, so they’re even slower than normal dynamic variables.
Because converted lambda expressions are constructed dynamically at run-time within the body of lexical-let, the resulting closure is only partially byte-compiled even if the code as a whole has been byte-compiled. In contrast, lexical-binding: t closures are fully compiled. How this works is worth its own article.
Converted lambda expressions include the additional internal function invocation, making them slower.

While lexical-let is clever, and occasionally useful prior to Emacs 24, it may come at a hefty performance cost if evaluated frequently. There’s no reason to use it anymore.

Constraints on code generation

Another reason to be wary of dynamic scope is that it puts needless constraints on the compiler, preventing a number of important optimization opportunities. For example, consider the following function, bar:

(defun bar ()
  (let ((x 1)
        (y 2))
    (foo)
    (+ x y)))

Byte-compile this function under dynamic scope (lexical-binding: nil) and disassemble it to see what it looks like.

(byte-compile #'bar)
(disassemble #'bar)

That pops up a buffer with the disassembly listing:

     constant  1
     constant  2
     varbind   y
     varbind   x
     constant  foo
     call      0
     discard
     varref    x
     varref    y
     plus
     unbind    2
     return

It’s 12 instructions, 5 of which deal with dynamic bindings. The byte-compiler doesn’t always produce optimal byte-code, but this just so happens to be nearly optimal byte-code. The discard (a very fast instruction) isn’t necessary, but otherwise no more compiler smarts can improve on this. Since the variables x and y are visible to foo, they must be bound before the call and loaded after the call. While generally this function will return 3, the compiler cannot assume so since it ultimately depends on the behavior of foo. Its hands are tied.

Compare this to the lexical scope version (lexical-binding: t):

     constant  1
     constant  2
     constant  foo
     call      0
     discard
     stack-ref 1
     stack-ref 1
     plus
     return

It’s only 9 instructions, none of which are expensive dynamic variable instructions. And this isn’t even close to the optimal byte-code. In fact, as of Emacs 25.1 the byte-compiler often doesn’t produce the optimal byte-code for lexical scope code and still needs some work. Despite not firing on all cylinders, lexical scope still manages to beat dynamic scope in performance benchmarks.

Here’s the optimal byte-code, should the byte-compiler become smarter someday:

     constant  foo
     call      0
     constant  3
     return

It’s down to 4 instructions due to computing the math operation at compile time. Emacs’ byte-compiler only has rudimentary constant folding, so it doesn’t notice that x and y are constants and misses this optimization. I speculate this is due to its roots compiling under dynamic scope. Since x and y are no longer exposed to foo, the compiler has the opportunity to optimize them out of existence. I haven’t measured it, but I would expect this to be significantly faster than the dynamic scope version of this function.

Optional dynamic scope

You might be thinking, “What if I really do want x and y to be dynamically bound for foo?” This is often useful. Many of Emacs’ own functions are designed to have certain variables dynamically bound around them. For example, the print family of functions use the global variable standard-output to determine where to send output by default.

(let ((standard-output (current-buffer)))
  (princ "value = ")
  (prin1 value))

Have no fear: With lexical-binding: t you can have your cake and eat it too. Variables declared with defvar, defconst, or defvaralias are marked as “special” with an internal bit flag (declared_special in C). When the compiler detects one of these variables (special-variable-p), it uses a classical dynamic binding.

Declaring both x and y as special restores the original semantics, reverting bar back to its old byte-code definition (next time it’s compiled, that is). But it would be poor form to mark x or y as special: You’d de-optimize all code (compiled after the declaration) anywhere in Emacs that uses these names. As a package author, only do this with the namespace-prefixed variables that belong to you.

The only way to unmark a special variable is with the undocumented function internal-make-var-non-special. I expected makunbound to do this, but as of Emacs 25.1 it does not. This could possibly be considered a bug.

Accidental closures

I’ve said there are absolutely no advantages to lexical-binding: nil. It’s only the default for the sake of backwards-compatibility. However, there is one case where lexical-binding: t introduces a subtle issue that would otherwise not exist. Take this code for example (and never mind prin1-to-string for a moment):

;; -*- lexical-binding: t; -*-

(defun function-as-string ()
  (with-temp-buffer
    (prin1 (lambda () :example) (current-buffer))
    (buffer-string)))

This creates and serializes a closure, which is one of Elisp’s unique features. It doesn’t close over any variables, so it should be pretty simple. However, this function will only work correctly under lexical-binding: t when byte-compiled.

(function-as-string)
;; => "(closure ((temp-buffer . #<buffer  *temp*>) t) nil :example)"

The interpreter doesn’t analyze the closure, so just closes over everything. This includes the hidden variable temp-buffer created by the with-temp-buffer macro, resulting in an abstraction leak. Buffers aren’t readable, so this will signal an error if an attempt is made to read this function back into an s-expression. The byte-compiler fixes this by noticing temp-buffer isn’t actually closed over and so doesn’t include it in the closure, making it work correctly.

Under lexical-binding: nil it works correctly either way:

(function-as-string)
;; => "(lambda nil :example)"

This may seem contrived — it’s certainly unlikely — but it has come up in practice. Still, it’s no reason to avoid lexical-binding: t.

Use lexical scope in all new code

As I’ve said again and again, always use lexical-binding: t. Use dynamic variables judiciously. And lexical-let is no replacement. It has virtually none of the benefits, performs worse, and it only applies to let, not any of the other places bindings are created: function parameters, dotimes, dolist, and condition-case.

Zero-allocation Trie Traversal

2016-11-13T06:03:24Z

As part of a demonstration in an upcoming article, I wrote a simple trie implementation. A trie is a search tree where the keys are a sequence of symbols (i.e. strings). Strings with a common prefix share an initial path down the trie, and the keys themselves are stored implicitly by the structure of the trie. It’s commonly used as a sorted set or, when values are associated with nodes, an associative array.

This wasn’t my first time writing a trie. The curse of programming in C is rewriting the same data structures and algorithms over and over. It’s the problem C++ templates are intended to solve. This rewriting isn’t always bad since each implementation is typically customized for its specific use, often resulting in greater performance and a smaller resource footprint.

Every time I’ve rewritten a trie, my implementation is a little bit better than the last. This time around I discovered an approach for traversing, both depth-first and breadth-first, an arbitrarily-sized trie without memory allocation. I’m definitely not the first to discover something like this. There’s Deutsch-Schorr-Waite pointer reversal for binary graphs (1965) — which I originally learned from reading the Scheme 9 from Outer Space garbage collector source — and Morris in-order traversal (1979) for binary trees. The former requires two extra tag bits per node and the latter requires no modifications at all.

What’s a trie?

But before I go further, some background. A trie can come in many shapes and sizes, but in the simple case each node of a trie has as many pointers as its alphabet. For illustration purposes, imagine a trie for strings of only four characters: A, B, C, and D. Each node is essentially four pointers.

#define TRIE_ALPHABET_SIZE  4
#define TRIE_STATIC_INIT    {.flags = 0}
#define TRIE_TERMINAL_FLAG  (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};

It includes a flags field, where a single bit tracks whether or not a node is terminal — that is, a key terminates at this node. Terminal nodes are not necessarily leaf nodes, which is the case when one key is a prefix of another key. I could instead have used a 1-bit bit-field (e.g. int is_terminal : 1;) but I don’t like bit-fields.

A trie with the following keys, inserted in any order:

AAAAA
ABCD
CAA
CAD
CDBD

Looks like this (terminal nodes illustrated as small black squares):

The root of the trie is the empty string, and each child represents a trie prefixed with one of the symbols from the alphabet. This is a nice recursive definition, and it’s tempting to write recursive functions to process it. For example, here’s a recursive insertion function.

int
trie_insert_recursive(struct trie *t, const char *s)
{
    if (!*s) {
        t->flags |= TRIE_TERMINAL_FLAG;
        return 1;
    }

    int i = *s - 'A';
    if (!t->next[i]) {
        t->next[i] = malloc(sizeof(*t->next[i]));
        if (!t->next[i])
            return 0;
        *t->next[i] = (struct trie)TRIE_STATIC_INIT;
    }
    return trie_insert_recursive(t->next[i], s + 1);
}

If the string is empty (!*s), mark the current node as terminal. Otherwise recursively insert the substring under the appropriate child. That’s a tail call, and any optimizing compiler would optimize this call into a jump back to the beginning of the function (tail-call optimization), reusing the stack frame as if it were a simple loop.

If that’s not good enough, such as when optimization is disabled for debugging and the recursive definition is blowing the stack, this is trivial to convert to a safe, iterative function. I prefer this version anyway.

int
trie_insert(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i]) {
            t->next[i] = malloc(sizeof(*t->next[i]));
            if (!t->next[i])
                return 0;
            *t->next[i] = (struct trie)TRIE_STATIC_INIT;
        }
        t = t->next[i];
    }
    t->flags |= TRIE_TERMINAL_FLAG;
    return 1;
}

Finding a particular prefix in the trie iteratively is also easy. This would be used to narrow the trie to a chosen prefix before iterating over the keys (e.g. find all strings matching a prefix).

struct trie *
trie_find(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i])
            return NULL;
        t = t->next[i];
    }
    return t;
}
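As a usage sketch, an exact-match membership test can be layered on top of trie_find by checking the terminal flag of the node it returns. The name trie_contains is my own invention for this illustration and isn't part of the code above. The struct, macros, trie_insert, and trie_find are repeated so the block stands alone, and keys are assumed to contain only the letters A through D.

```c
#include <stddef.h>
#include <stdlib.h>

#define TRIE_ALPHABET_SIZE  4
#define TRIE_STATIC_INIT    {.flags = 0}
#define TRIE_TERMINAL_FLAG  (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};

/* Same as trie_insert above: returns 0 on allocation failure. */
int
trie_insert(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i]) {
            t->next[i] = malloc(sizeof(*t->next[i]));
            if (!t->next[i])
                return 0;
            *t->next[i] = (struct trie)TRIE_STATIC_INIT;
        }
        t = t->next[i];
    }
    t->flags |= TRIE_TERMINAL_FLAG;
    return 1;
}

/* Same as trie_find above: returns the node for prefix s, or NULL. */
struct trie *
trie_find(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i])
            return NULL;
        t = t->next[i];
    }
    return t;
}

/* Hypothetical helper: 1 if s was inserted as a whole key, 0 if it is
 * absent or merely a prefix of some other key. */
int
trie_contains(struct trie *t, const char *s)
{
    struct trie *node = trie_find(t, s);
    return node && (node->flags & TRIE_TERMINAL_FLAG);
}
```

With CAA and CAD inserted, trie_find succeeds for the prefix CA, but trie_contains reports it as absent because no key terminates at that node.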

Depth-first traversal is stack-oriented. The stack represents the path through the graph, and each new vertex is pushed into this stack as it’s visited. A recursive traversal function can implicitly use the call stack for storing this information, so no additional data structure is needed.

The downside is that the call is no longer tail-recursive, so a large trie will blow the stack. Also, the caller needs to provide a callback function because the stack cannot unwind to return a value: The stack has important state on it. Here’s a typedef for the callback.

typedef void (*trie_visitor)(const char *key, void *arg);

And here’s the recursive depth-first traversal function. The top-level caller passes the same buffer for buf and bufend, which must be large enough for the longest key plus its null terminator. The visited key will be written to this buffer and passed to the visitor.

void
trie_dfs_recursive(struct trie *t,
                   char *buf,
                   char *bufend,
                   trie_visitor v,
                   void *arg)
{
    if (t->flags & TRIE_TERMINAL_FLAG) {
        *bufend = 0;
        v(buf, arg);
    }

    for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
        if (t->next[i]) {
            *bufend = 'A' + i;
            trie_dfs_recursive(t->next[i], buf, bufend + 1, v, arg);
        }
    }
}
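To show how the callback, its arg pointer, and the buffer fit together, here’s a sketch of a visitor that counts keys and remembers the last one visited. The names visit_log, log_key, and trie_log_keys are made up for this example, and it assumes no key is longer than 15 characters. The trie definitions from above are repeated so the block stands alone.

```c
#include <stdlib.h>
#include <string.h>

#define TRIE_ALPHABET_SIZE  4
#define TRIE_STATIC_INIT    {.flags = 0}
#define TRIE_TERMINAL_FLAG  (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};

typedef void (*trie_visitor)(const char *key, void *arg);

/* Same as trie_insert above. */
int
trie_insert(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i]) {
            t->next[i] = malloc(sizeof(*t->next[i]));
            if (!t->next[i])
                return 0;
            *t->next[i] = (struct trie)TRIE_STATIC_INIT;
        }
        t = t->next[i];
    }
    t->flags |= TRIE_TERMINAL_FLAG;
    return 1;
}

/* Same as trie_dfs_recursive above. */
void
trie_dfs_recursive(struct trie *t, char *buf, char *bufend,
                   trie_visitor v, void *arg)
{
    if (t->flags & TRIE_TERMINAL_FLAG) {
        *bufend = 0;
        v(buf, arg);
    }
    for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
        if (t->next[i]) {
            *bufend = 'A' + i;
            trie_dfs_recursive(t->next[i], buf, bufend + 1, v, arg);
        }
    }
}

/* Hypothetical visitor state: keys seen so far and a copy of the last. */
struct visit_log {
    int count;
    char last[16];
};

/* Hypothetical visitor: arg points to a struct visit_log. */
void
log_key(const char *key, void *arg)
{
    struct visit_log *log = arg;
    log->count++;
    strncpy(log->last, key, sizeof(log->last) - 1);
    log->last[sizeof(log->last) - 1] = 0;
}

/* Hypothetical wrapper: visit every key in alphabetical order. */
struct visit_log
trie_log_keys(struct trie *t)
{
    char buf[16];  /* longest key (15, by assumption) plus terminator */
    struct visit_log log = {0};
    trie_dfs_recursive(t, buf, buf, log_key, &log);
    return log;
}
```

Because children are visited in index order, keys arrive sorted. For the five example keys above, the visitor sees AAAAA first and CDBD last.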

Heap-allocated Traversal Stack

Moving the traversal stack to the heap would eliminate the stack overflow problem and it would allow control to return to the caller. This is going to be a lot of code for an article, but bear with me.

First define an iterator object. The stack will need two pieces of information: which node did we come from (p) and through which pointer (i). When a node has been exhausted, this will allow return to the parent. The root field tracks when traversal is complete.

struct trie_iter {
    struct trie *root;
    char *buf;
    char *bufend;
    struct {
        struct trie *p;
        int i;
    } *stack;
};

A special value of -1 in i means it’s the first visit for this node and it should be visited by the callback if it’s terminal.

The iterator is initialized with trie_iter_init. The max indicates the maximum length of any key. A more elaborate implementation could automatically grow the stack to accommodate (e.g. realloc()), but I’m keeping it as simple as possible.

int
trie_iter_init(struct trie_iter *it, struct trie *t, size_t max)
{
    it->root = t;
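    /* Depth is one entry for the root plus one per key character. */
    it->stack = malloc(sizeof(*it->stack) * (max + 1));
    if (!it->stack)
        return 0;
    /* Room for the longest key plus its null terminator. */
    it->buf = it->bufend = malloc(max + 1);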
    if (!it->buf) {
        free(it->stack);
        return 0;
    }
    it->stack->p = t;
    it->stack->i = -1;
    return 1;
}

void
trie_iter_destroy(struct trie_iter *it)
{
    free(it->stack);
    it->stack = NULL;
    free(it->buf);
    it->buf = NULL;
}

And finally the complicated part. This uses the allocated stack to explore the trie in a loop until it hits a terminal, at which point it returns. A further call continues the traversal from where it left off. It’s like a hand-coded generator. With the way it’s written, the caller is obligated to follow through with the entire iteration before destroying the iterator, but this would be easy to correct.

int
trie_iter_next(struct trie_iter *it)
{
    for (;;) {
        struct trie *current = it->stack->p;
        int i = it->stack->i++;

        if (i == -1) {
            /* Return result if terminal node. */
            if (current->flags & TRIE_TERMINAL_FLAG) {
                *it->bufend = 0;
                return 1;
            }
            continue;
        }

        if (i == TRIE_ALPHABET_SIZE) {
            /* End of current node. */
            if (current == it->root)
                return 0;  // back at root, done
            it->stack--;
            it->bufend--;
            continue;
        }

        if (current->next[i]) {
            /* Push on next child node. */
            *it->bufend = 'A' + i;
            it->stack++;
            it->bufend++;
            it->stack->p = current->next[i];
            it->stack->i = -1;
        }
    }
}

This is much nicer for the caller since there’s no inversion of control.

struct trie_iter it;
trie_iter_init(&it, &trie_root, KEY_MAX);
while (trie_iter_next(&it)) {
    // ... do something with it.buf ...
}
trie_iter_destroy(&it);

There are a few downsides to this:

Initialization could fail (not checked in the example) since it allocates memory.
Either the caller has to keep track of the maximum key length, or the iterator grows the stack automatically, which would mean iteration could fail at any point in the middle.
In order to destroy the trie, it needs to be traversed: Freeing memory first requires allocating memory. If the program is out of memory, it cannot destroy the trie to clean up before handling the situation, nor to make more memory available. It’s not good for resilience.

Wouldn’t it be nice to traverse the trie without memory allocation?

Modifying the Trie

Rather than allocate a separate stack, the stack can be allocated across the individual nodes of the trie. Remember those p and i fields from before? Put them on the trie.

struct trie_v2 {
    struct trie_v2 *next[TRIE_ALPHABET_SIZE];
    struct trie_v2 *p;
    int i;
    unsigned flags;
};

This automatically scales with the size of the trie, so there will always be enough of this stack. With the stack “pre-allocated” like this, traversal requires no additional memory allocation.

The iterator itself becomes a little simpler. It cannot fail and it doesn’t need a destructor.

struct trie_v2_iter {
    struct trie_v2 *current;
    char *buf;
};

void
trie_v2_iter_init(struct trie_v2_iter *it, struct trie_v2 *t, char *buf)
{
    t->p = NULL;
    t->i = -1;
    it->current = t;
    it->buf = buf;
}

The iteration function itself is almost identical to before. Rather than increment a stack pointer, it uses p to chain the nodes as a linked list.

int
trie_v2_iter_next(struct trie_v2_iter *it)
{
    for (;;) {
        struct trie_v2 *current = it->current;
        int i = it->current->i++;

        if (i == -1) {
            /* Return result if terminal node. */
            if (current->flags & TRIE_TERMINAL_FLAG) {
                *it->buf = 0;
                return 1;
            }
            continue;
        }

        if (i == TRIE_ALPHABET_SIZE) {
            /* End of current node. */
            if (!current->p)
                return 0;
            it->current = current->p;
            it->buf--;
            continue;
        }

        if (current->next[i]) {
            /* Push on next child node. */
            *it->buf = 'A' + i;
            it->buf++;
            it->current = current->next[i];
            it->current->p = current;
            it->current->i = -1;
        }

    }
}

During traversal the iteration pointers look something like this:

This is not without its downsides:

Traversal is not re-entrant nor thread-safe. It’s not possible to run multiple in-place iterators side by side on the same trie since they’ll clobber each other.
It uses more memory — O(n) rather than O(max-key-length) — and sits on this extra memory for its entire lifetime.

Breadth-first Traversal

The same technique can be used for breadth-first search, which is queue-oriented rather than stack-oriented. The p pointers are instead chained into a queue, with a head and tail pointer variable for each end. As each node is visited, its children are pushed into the queue linked list.

This isn’t good for visiting keys by name. buf was itself a stack and played nicely with depth-first traversal, but there’s no easy way to build up a key in a buffer breadth-first. So instead here’s a function to destroy a trie breadth-first.

void
trie_v2_destroy(struct trie_v2 *t)
{
    struct trie_v2 *head = t;
    struct trie_v2 *tail = t;
    while (head) {
        for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
            struct trie_v2 *next = head->next[i];
            if (next) {
                next->p = NULL;
                tail->p = next;
                tail = next;
            }
        }
        struct trie_v2 *dead = head;
        head = head->p;
        free(dead);
    }
}

During its traversal the p pointers link up like so:

Further Research

In my real code there’s also a flag to indicate the node’s allocation type: static or heap. This allows a trie to be composed of nodes from both kinds of allocations while still safe to destroy. It might also be useful to pack a reference counter into this space so that a node could be shared by more than one trie.

For a production implementation it may be worth packing i into the flags field since it only needs a few bits, even with larger alphabets. Also, I bet, as in Deutsch-Schorr-Waite, the p field could be eliminated and instead one of the child pointers is temporarily reversed. With these changes, this technique would fit into the original struct trie without changes, eliminating the extra memory usage.
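As a rough sketch of the first idea, the child index can live in the bits of flags just above the terminal bit. Everything named here (TRIE_INDEX_SHIFT, TRIE_INDEX_BITS, TRIE_INDEX_MASK, trie_get_index, trie_set_index, trie_next_index) is hypothetical, and the 8-bit field width is an arbitrary assumption. The index is stored biased by one so that the “first visit” value of -1 is encoded as zero.

```c
#define TRIE_TERMINAL_FLAG  (1U << 0)

/* Assumed layout: bit 0 is the terminal flag, bits 1-8 hold (i + 1). */
#define TRIE_INDEX_SHIFT    1
#define TRIE_INDEX_BITS     8
#define TRIE_INDEX_MASK \
    (((1U << TRIE_INDEX_BITS) - 1) << TRIE_INDEX_SHIFT)

/* Extract the traversal index, undoing the +1 bias. */
int
trie_get_index(unsigned flags)
{
    return (int)((flags & TRIE_INDEX_MASK) >> TRIE_INDEX_SHIFT) - 1;
}

/* Return flags with the traversal index replaced by i (i >= -1),
 * leaving all other bits, including the terminal flag, untouched. */
unsigned
trie_set_index(unsigned flags, int i)
{
    unsigned biased = (unsigned)(i + 1);
    return (flags & ~TRIE_INDEX_MASK) |
           ((biased << TRIE_INDEX_SHIFT) & TRIE_INDEX_MASK);
}

/* Post-increment, standing in for "node->i++" in the iterator. */
int
trie_next_index(unsigned *flags)
{
    int i = trie_get_index(*flags);
    *flags = trie_set_index(*flags, i + 1);
    return i;
}
```

With this in place, the line int i = it->current->i++ in trie_v2_iter_next would become a call to trie_next_index on the node’s flags, and resetting a node for its first visit would be trie_set_index(flags, -1). A side effect of the bias is that a zero-initialized node is already in the first-visit state.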

Update: Over on Hacker News, psi-squared has interesting suggestions such as leaving the traversal pointers intact, particularly in the case of a breadth-first search, which, until the next trie modification, allows for concurrent follow-up traversals.

Makefile Assignments are Turing-Complete

2016-04-30T03:01:22Z

For over a decade now, GNU Make has almost exclusively been my build system of choice, either directly or indirectly. Unfortunately this means I unnecessarily depend on some GNU extensions — an annoyance when porting to the BSDs. In an effort to increase the portability of my Makefiles, I recently read the POSIX make specification. I learned two important things: 1) ~~POSIX make is so barren it’s not really worth striving for~~ (update: I’ve changed my mind), and 2) make’s macro assignment mechanism is Turing-complete.

If you want to see it in action for yourself before reading further, here’s a Makefile that implements Conway’s Game of Life (40x40) using only macro assignments.

life.mak (174kB) [or generate your own]

Run it with any make program in an ANSI terminal. It must literally be named life.mak. Beware: if you run it longer than a few minutes, your computer may begin thrashing.

make -f life.mak

It’s 100% POSIX-compatible except for the sleep 0.1 (fractional sleep), which is only needed for visual effect.

A POSIX workaround

Unlike virtually every real world implementation, POSIX make doesn’t support conditional parts. For example, you might want your Makefile’s behavior to change depending on the value of certain variables. In GNU Make it looks like this:

ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
else
    EXTRA_FLAGS = -Wbar
endif

Or BSD-style:

.ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
.else
    EXTRA_FLAGS = -Wbar
.endif

If the goal is to write a strictly POSIX Makefile, how could I work around the lack of conditional parts and maintain a similar interface? The selection of macro/variable to evaluate can be dynamically selected, allowing for some useful tricks. First define the option’s default:

USE_FOO = 0

Then define both sets of flags:

EXTRA_FLAGS_0 = -Wbar
EXTRA_FLAGS_1 = -ffoo -lfoo

Now dynamically select one of these macros for assignment to EXTRA_FLAGS.

EXTRA_FLAGS = $(EXTRA_FLAGS_$(USE_FOO))

The assignment on the command line overrides the assignment in the Makefile, so the user gets to override USE_FOO.

$ make              # EXTRA_FLAGS = -Wbar
$ make USE_FOO=0    # EXTRA_FLAGS = -Wbar
$ make USE_FOO=1    # EXTRA_FLAGS = -ffoo -lfoo

Before reading the POSIX specification, I didn’t realize that the left side of an assignment can get the same treatment. If I really want the “if defined” behavior back, I can use the macro to mangle the left-hand side. For example,

EXTRA_FLAGS = -O0 -g3
EXTRA_FLAGS$(DEBUG) = -O3 -DNDEBUG

Caveat: If DEBUG is set to empty, it may still result in true for ifdef depending on which make flavor you’re using, but will always appear to be unset in this hack.

$ make             # EXTRA_FLAGS = -O3 -DNDEBUG
$ make DEBUG=yes   # EXTRA_FLAGS = -O0 -g3

This last case had me thinking: This is very similar to the (ab)use of the x86 mov instruction in mov is Turing-complete. These macro assignments alone should be enough to compute any algorithm.

Macro Operations

Macro names are just keys to a global associative array. This can be used to build lookup tables. Here’s a Makefile to “compute” the square root of integers between 0 and 10.

sqrt_0  = 0.000000
sqrt_1  = 1.000000
sqrt_2  = 1.414214
sqrt_3  = 1.732051
sqrt_4  = 2.000000
sqrt_5  = 2.236068
sqrt_6  = 2.449490
sqrt_7  = 2.645751
sqrt_8  = 2.828427
sqrt_9  = 3.000000
sqrt_10 = 3.162278
result := $(sqrt_$(n))

The BSD flavors of make have a -V option for printing variables, which is an easy way to retrieve output. I used an “immediate” assignment (:=) for result since some versions of make won’t evaluate the expression before -V printing.

$ make -f sqrt.mak -V result n=8
2.828427

Without -V, a default target could be used instead:

output :
        @printf "$(result)\n"

There are no math operators, so performing arithmetic requires some creativity. For example, integers could be represented as a series of x characters. The number 4 is xxxx, the number 6 is xxxxxx, etc. Addition is concatenation (note: macros can have + in their names):

A      = xxx
B      = xxxx
A+B    = $(A)$(B)

However, since there’s no way to “slice” a value, subtraction isn’t possible. A more realistic approach to arithmetic would require lookup tables.

Branching

Branching could be achieved through more lookup tables. For example,

square_0  = 0
square_1  = 1
square_2  = 4
# ...
result := $($(op)_$(n))

And called as:

$ make n=5 op=sqrt    # 2.236068
$ make n=5 op=square  # 25

Or using the DEBUG trick above, use the condition to mask out the results of the unwanted branch. This is similar to the mov paper.

result           := $(op)($(n)) = $($(op)_$(n))
result$(verbose) := $($(op)_$(n))

And its usage:

$ make n=5 op=square             # 25
$ make n=5 op=square verbose=1   # square(5) = 25

What about loops?

Looping is a tricky problem. However, one of the most common build (anti?)patterns is the recursive Makefile. Borrowing from the mov paper, which used an unconditional jump to restart the program from the beginning, for Makefile Turing-completeness I can invoke the Makefile recursively, restarting the program with a new set of inputs.

Remember the output target above? I can loop by invoking make again with new inputs in this target:

output :
    @printf "$(result)\n"
    @$(MAKE) $(args)

Before going any further, now that loops have been added, the natural next question is halting. In reality, the operating system will take care of that after some millions of make processes have carelessly been invoked by this horribly inefficient scheme. However, we can do better. The program can clobber the MAKE variable when it’s ready to halt. Let’s formalize it.

loop = $(MAKE) $(args)
output :
    @printf "$(result)\n"
    @$(loop)

To halt, the program just needs to clear loop.

Suppose we want to count down to 0. There will be an initial count:

count = 6

A decrement table:

6 = 5
5 = 4
4 = 3
3 = 2
2 = 1
1 = 0
0 = loop

The last line will be used to halt by clearing the name on the right side. This is three star territory.

$($($(count))) =

The result (current iteration) loop value is computed from the lookup table.

result = $($(count))

The next loop value is passed via args. If loop was cleared above, this result will be discarded.

args = count=$(result)

With all that in place, invoking the Makefile will print a countdown from 5 to 0 and quit. This is the general structure for the Game of Life macro program.

Game of Life

A universal Turing machine has been implemented in Conway’s Game of Life. With all that heavy lifting done, one of the easiest methods today to prove a language’s Turing-completeness is to implement Conway’s Game of Life. Ignoring the criminal inefficiency of it, the Game of Life Turing machine could be run on the Game of Life simulation running on make’s macro assignments.

In the Game of Life program — the one linked at the top of this article — each cell is stored in a macro named xxyy, after its position. The top-left most cell is named 0000, then going left to right, 0100, 0200, etc. Providing input is a matter of assigning each of these macros. I chose X for alive and - for dead, but, as you’ll see, any two characters permitted in macro names would work as well.

$ make 0000=X 0100=- 0200=- 0300=X ...

The next part should be no surprise: The rules of the Game of Life are encoded as a 512-entry lookup table. The key is formed by concatenating the cell’s value along with all its neighbors, with itself in the center.

The “beginning” of the table looks like this:

--------- = -
X-------- = -
-X------- = -
XX------- = -
--X------ = -
X-X------ = -
-XX------ = -
XXX------ = X
---X----- = -
X--X----- = -
-X-X----- = -
XX-X----- = X
# ...

Note: The two right-hand X values here are the cell coming to life (exactly three living neighbors). Computing the next value (n0101) for 0101 is done like so:

n0101 = $($(0000)$(0100)$(0200)$(0001)$(0101)$(0201)$(0002)$(0102)$(0202))
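The table itself is easy to generate mechanically. Here’s a sketch in C of how each entry might be computed. The names life_entry and life_print_table are made up for this illustration, and the bit ordering (bit 0 is the leftmost character of the key, bit 4 is the center cell) is an assumption chosen to match the excerpt above. It isn’t necessarily how the linked generator works.

```c
#include <stdio.h>

/* Fill key with the 9-character macro name for pattern (0..511) and
 * return the center cell's next state: 'X' (alive) or '-' (dead).
 * Bit 4 is the center cell; the other eight bits are its neighbors. */
char
life_entry(unsigned pattern, char key[10])
{
    int neighbors = 0;
    for (int bit = 0; bit < 9; bit++) {
        int alive = (pattern >> bit) & 1;
        key[bit] = alive ? 'X' : '-';
        if (bit != 4)
            neighbors += alive;
    }
    key[9] = 0;

    /* Standard rules: birth on exactly three living neighbors,
     * survival on two or three. */
    int center = (pattern >> 4) & 1;
    int next = neighbors == 3 || (center && neighbors == 2);
    return next ? 'X' : '-';
}

/* Write all 512 macro assignments, one per line. */
void
life_print_table(FILE *out)
{
    char key[10];
    for (unsigned pattern = 0; pattern < 512; pattern++) {
        char value = life_entry(pattern, key);
        fprintf(out, "%s = %c\n", key, value);
    }
}
```

Calling life_print_table(stdout) emits the entries in the same order as the excerpt, starting with the all-dead neighborhood.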

Given these results, constructing the input to the next loop is simple:

args = 0000=$(n0000) 0100=$(n0100) 0200=$(n0200) ...

The display output, to be given to printf, is built similarly:

output = $(n0000)$(n0100)$(n0200)$(n0300)...

In the real version, this is decorated with an ANSI escape code that clears the terminal. The printf interprets the escape byte (\033) so that it doesn’t need to appear literally in the source.

And that’s all there is to it: Conway’s Game of Life running in a Makefile. Life, uh, finds a way.

Duck Typing vs. Type Erasure

2014-04-01T21:07:31Z

Consider the following C++ class.

#include <iostream>

template <typename T>
struct Caller {
  const T callee_;
  Caller(const T callee) : callee_(callee) {}
  void go() { callee_.call(); }
};

Caller can be parameterized to any type so long as it has a call() method. For example, introduce two types, Foo and Bar.

struct Foo {
  void call() const { std::cout << "Foo"; }
};

struct Bar {
  void call() const { std::cout << "Bar"; }
};

int main() {
  Caller<Foo> foo{Foo()};
  Caller<Bar> bar{Bar()};
  foo.go();
  bar.go();
  std::cout << std::endl;
  return 0;
}

This code compiles cleanly and, when run, emits “FooBar”. This is an example of duck typing, i.e., “If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.” Foo and Bar are unrelated types. They have no common inheritance, but by providing the expected interface, they both work with Caller. This is a special case of polymorphism.

Duck typing is normally only found in dynamically typed languages. Thanks to templates, a statically, strongly typed language like C++ can have duck typing without sacrificing any type safety.
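For comparison, this is roughly what the same program looks like in a dynamically typed language. It's a minimal Python sketch: Caller, Foo, and Bar mirror the C++ above, while Rock and the choice to return strings rather than print them are assumptions made for this example. The key difference from the template version is when the missing method is noticed: at run time rather than at compile time.

```python
class Caller:
    def __init__(self, callee):
        self.callee = callee

    def go(self):
        # "call" is looked up by name when go() runs;
        # no common base class or interface is required.
        return self.callee.call()

class Foo:
    def call(self):
        return "Foo"

class Bar:
    def call(self):
        return "Bar"

class Rock:
    """Hypothetical type with no call() method, i.e. not a duck."""

result = Caller(Foo()).go() + Caller(Bar()).go()
print(result)

try:
    Caller(Rock()).go()
except AttributeError as error:
    # The mismatch only surfaces here, when the method is actually invoked.
    print("not a duck:", error)
```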

Java Duck Typing

Let’s try the same thing in Java using generics.

class Caller<T> {
    final T callee;
    Caller(T callee) {
        this.callee = callee;
    }
    public void go() {
        callee.call();  // compiler error: cannot find symbol call
    }
}

class Foo {
    public void call() { System.out.print("Foo"); }
}

class Bar {
    public void call() { System.out.print("Bar"); }
}

public class Main {
    public static void main(String args[]) {
        Caller<Foo> f = new Caller<>(new Foo());
        Caller<Bar> b = new Caller<>(new Bar());
        f.go();
        b.go();
        System.out.println();
    }
}

The program is practically identical, but this will fail with a compile-time error. This is the result of type erasure. Unlike C++’s templates, there will only ever be one compiled version of Caller, and T will become Object. Since Object has no call() method, compilation fails. The generic type is only for enabling additional compiler checks later on.

C++ templates behave like macros, expanded by the compiler once for each different type of applied parameter. The call symbol is looked up later, after the type has been fully realized, not when the template is defined.

To fix this, Foo and Bar need a common ancestry. Let’s make this Callee.

interface Callee {
    void call();
}

Caller needs to be redefined such that T is a subtype of Callee.

class Caller<T extends Callee> {
    // ...
}

This now compiles cleanly because call() will be found in Callee. Finally, implement Callee.

class Foo implements Callee {
    // ...
}

class Bar implements Callee {
    // ...
}

This is no longer duck typing, just plain old polymorphism. Type erasure prohibits duck typing in Java (outside of dirty reflection hacks).

Signals and Slots and Events! Oh My!

Duck typing is useful for implementing the observer pattern without as much boilerplate. A class can participate in the observer pattern without inheriting from some specialized class or interface. For example, see the various signals and slots systems for C++. In contrast, Java has an EventListener type for everything:

KeyListener
MouseListener
MouseMotionListener
FocusListener
ActionListener, etc.

A class concerned with many different kinds of events, such as an event logger, would need to implement a large number of interfaces.
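To make the contrast concrete, here is a minimal sketch of a duck-typed signal in Python. It is not modeled on any particular signals and slots library. Signal, EventLogger, and the method names are all hypothetical. The only requirement on a slot is that it can be called with the emitted arguments, so the logger needs no listener interfaces at all.

```python
class Signal:
    """A bare-bones signal: anything callable can be connected as a slot."""

    def __init__(self):
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self, *args):
        # Slots are invoked in the order they were connected.
        for slot in self._slots:
            slot(*args)

class EventLogger:
    """Observes several kinds of events without implementing any interface."""

    def __init__(self):
        self.log = []

    def on_key(self, key):
        self.log.append(("key", key))

    def on_click(self, x, y):
        self.log.append(("click", x, y))

key_pressed = Signal()
mouse_clicked = Signal()

logger = EventLogger()
key_pressed.connect(logger.on_key)
mouse_clicked.connect(logger.on_click)

key_pressed.emit("a")
mouse_clicked.emit(3, 4)
print(logger.log)
```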

The Physical Analog for Encryption is the Hyperdrive

2012-08-06T00:00:00Z

I was recently watching GetDaved play through X-Wing Alliance, a game I myself played in college. I have a lot of nostalgia for it, especially because TIE Fighter was one of the first games I ever invested a lot of time into playing. Just hearing the sounds and music brings back relaxing memories.

In one of the early missions the player travels through hyperspace (which ain’t like dusting crops) to a storage area located in deep space. It’s a family business and the player is out there to take inventory of storage containers. Like when I saw the wormhole minefield in Deep Space 9, it got me thinking, “Why?” Why keep all these storage containers in deep space? There’s no defense or security out there to stop someone from stealing containers. It seems like it would be better to store those at the home base where they can be protected.

Storing items at random locations in deep space is actually very secure, more so than any lock! Space is huge. Even with faster-than-light travel, searching a galaxy for a storage location would be impractical. It would be as impractical as using brute-force to find an encryption key, another huge search space. Also, if the storage location has been in use for X years, you’d need to come within X light-years of it, at least, in order to find it, since even gravity itself is limited by the speed of light.

Physical locks are usually described as the physical analogy of cryptography. Honestly, it’s not a very good analogy. The brute-force method for bypassing a lock isn’t to keep trying different keys or combinations until it works. No, it’s to just smash something (a window, the lock) or pick the lock. When translated back into the crypto world that’s like breaking a cipher, which isn’t a practical attack in modern cryptography.

No, the physical analogy for cryptography is deep space storage. The only practical way to access deep space items is to learn the coordinates of the storage location, which is the equivalent of the encryption key. If the coordinates are lost or forgotten, the items are as good as destroyed, just like data.

There are actually some advantages of physical “encryption.” Ciphertext can be decrypted offline without being detected. It’s not possible to visit deep space storage without having a physical presence, which is certainly more detectable than offline decryption. There’s also the advantage that it’s somewhat easier to tell when the key (location) generation algorithm is busted or you’re just bad at picking passphrases: someone else’s stuff will already be there. A literal collision.

A Fractran Short Story

2010-03-09T00:00:00Z

Fractran is a Turing-complete esoteric programming language. A Fractran program is just an ordered list of positive, irreducible fractions. The program's output for an input n is the output of the program run on n multiplied by the first fraction in the list that results in an integer. If no such multiplication results in an integer, the output is the input n. Variables are encoded in the exponents of the prime factorization of the input and output.
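The evaluation rule above is short enough to sketch directly. This is a minimal Python interpreter, with the function name fractran chosen for this example. The sample program, the single fraction 3/2, is the usual textbook adder: given 2^a * 3^b it moves the exponent of 2 onto 3, leaving 3^(a+b).

```python
from fractions import Fraction

def fractran(program, n):
    """Run a Fractran program (an ordered list of Fractions) on integer n."""
    while True:
        for fraction in program:
            product = n * fraction
            if product.denominator == 1:
                # First fraction yielding an integer: continue from the product.
                n = int(product)
                break
        else:
            # No fraction yields an integer: the program halts with n.
            return n

adder = [Fraction(3, 2)]
print(fractran(adder, 2**3 * 3**2))  # 3**5, since 3 + 2 = 5
```

Note that reading the answer back out still means recovering the exponent from 243, that is, looking at the number's prime factorization.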

Some time ago I thought up an idea for a short story involving Fractran. A mathematician accidentally creates a Fractran program that can trivially factor large composites. Think something like O(log n). It's just the right magical string of, say, 31 fractions.

The story would be a first-person narrative of the mathematician's thoughts during a short time after the discovery, considering many of the consequences of the program. For example, it would render much of cryptography, which plays an essential role in the modern world, useless. He would also wonder if mankind should deserve such a discovery, considering how accidental it was.

This whole idea vanished once I realized that this Fractran program is actually completely trivial. It even runs in O(1) time. It's so trivial as to be worthless. Remember that Fractran stores its data in the number's prime factorization? The Fractran program that can factor any number in constant time is the identity function. To decode the output, which matches the input, all you need to do is factor it!

Interestingly, it doesn't seem to actually be possible to implement the identity function in Fractran (But somehow it's Turing-complete? Hmmm... more investigation needed.), unless you can define your program in terms of its input. For example, the program 1/(n+1) is the identity function for input n.

Unorderable Sets

2009-09-27T00:00:00Z

At Gavin's suggestion, I've been watching The Prisoner, a 1960s British television show. The main character is an ex-spy held prisoner in "the Village", an Orwellian, isolated, enclosed town. No one in the Village has a name, but is instead assigned a number. The main character's number is 6.

As far as I can tell, after number 2 the order of the numbers is not important. Number 56 is no more important than number 12. By using numbers to name things there is an implied ordering, even if the ordering is insignificant. It could be misleading to a newcomer.

Is there an unordered set that could be used to name things? More specifically, is there a set that cannot be ordered? If it is unorderable then there is no implicit ordering to cause confusion. It's easy to have an unorderable set in theory, but I think it is difficult to have in practice.

Using letters is obviously out, as the alphabet has an order. Words and names made of letters can be sorted according to the alphabet. However, the ability to order words is almost never used outside of indexing. If words are used to name things, a newcomer is unlikely to assume relationships based on ordering. No one will assume Alan is more important than Bob.

Large numbers also tend to lack an assumed order. I don't think anyone assumes a larger or smaller social security number has meaning, or a larger or smaller phone number. However, these values are also known to be handed out in some semi-random way.

But can we do better? For at least English speakers, is it possible to create an unorderable set? If the items in the set have a vocal pronunciation, then they can probably be ordered by their phonetics. That could be avoided by using non-standard phonetic components, like clicks and pops, which won't have a standard ordering (in English, anyway).

A set has an order if there is a total, transitive, relational operator for the set. If such an operator does not exist then the set isn't linearly ordered. I want a set that can't easily have such an operator.

If a set of symbols was created, how might they be presented so as to show no ordering? The order of the symbols in the original presentation might be considered the ordering, like how the alphabet is always presented in order. A circle could be used, but this is circularly ordered. I think there is also the issue of memorization. A human will have a much better time memorizing the symbols if memorized in some order. For example, try naming all the letters of the alphabet at random, without repeats. Or US states.

Thanks to modern day technology, with dynamic content, the set could be displayed in a random order each time it is viewed. For a web page, the server could select a random order, or a JavaScript program could reorder the images at random.
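As a rough sketch of the server-side version, here it is in Python. The function name presentation_order and the placeholder symbol names are invented for this example. Each call returns the same set in a fresh random order, so no single presentation becomes the canonical one.

```python
import random

def presentation_order(symbols, rng=random):
    """Return every symbol exactly once, in a freshly randomized order."""
    symbols = list(symbols)
    # random.sample draws without replacement, so nothing repeats or goes missing.
    return rng.sample(symbols, k=len(symbols))

# Placeholder names standing in for the images of the symbols.
symbols = ["glyph-a.png", "glyph-b.png", "glyph-c.png", "glyph-d.png"]
print(presentation_order(symbols))
print(presentation_order(symbols))
```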

There could be partially ordered sets, like hierarchies and DAGs. The ordering in The Prisoner is one of these. There is number 1, then number 2, then everyone else. Is there a partially ordered set in use that has unique names at the same level?

The penalties incurred by intentionally prohibiting order would likely outweigh the benefit of the set. If it's not orderable, we can't index it, and it's difficult to deal with. I expect it's much easier to just use numbers and tell people that the order isn't important, or just use an obviously unordered set.

Lisp Number Representations

2008-03-15T00:00:00Z

This exercise partly comes from a couple different chapters in the book The Little Schemer. The book is an introduction to the Scheme programming language, a dialect of Lisp. The purpose is to teach basic programming concepts in a way that anyone can follow along just as well as someone with a degree in, say, computer science. It is still very useful for us programmer types because there is some good practice to be had from reading and playing along.

First of all, Lisp is famous (infamous?) for lacking syntax. Any Lisp program is simply an S-expression, put simply, a list of lists. There is no operator precedence because operators are treated just like functions. This leads to prefix notation for mathematical expressions,

(+ 4 5)
=> 9

where the => indicates the result of evaluating the expression. We can apply as many operands as we want,

(+ 2 3 4 5 10)
=> 24

We can put another list right in there as an operand,

(+ 3 (* 2 5) 4)
=> 17

You get the idea. In a function, the value of the last expression is the return value. For example, here is the square function in Scheme, which squares its input,

(define (square x)
  (* x x))

Then we can use it,

(+ (square 2) (square 5))
=> 29

There are three important list operators to understand as well: car, cdr, and cons. car returns the first element in a list. In the example below, the ', a single quote, tells the interpreter or compiler that the list is to be treated as data and not to be executed. This is shorthand, or syntactic sugar, for the quote operator: (quote (stallman moglen)) is the same as '(stallman moglen).

(car '(stallman moglen lessig))
=> stallman

cdr returns the "rest" of a list (everything but the car of the list). When passing a list with only one element cdr returns the empty list: ().

(cdr '(stallman moglen lessig))
=> (moglen lessig)
(cdr '(stallman))
=> ()

We can ask if a list is empty or not with null?. #t and #f are true and false.

(null? '(stallman moglen lessig))
=> #f
(null? '())
=> #t

And finally, for lists, we have cons. This function allows us to build a list. It glues the first argument to the front of the list in the second argument,

(cons 'stallman '(moglen lessig))
=> (stallman moglen lessig)
(cons 'stallman '())
=> (stallman)

And one last function you need to know: eq?. It determines whether two atoms are the same atom,

(eq? 'stallman 'moglen)
=> #f
(eq? 'stallman 'stallman)
=> #t
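If Scheme is unfamiliar, it may help to see the list operators above modeled in another language. This is a minimal Python sketch, not how any real Lisp is implemented. It assumes a list is a chain of two-element tuples ending in None, and the helper names make_list and to_python are invented for the example.

```python
NIL = None  # stand-in for the empty list ()

def cons(head, tail):
    """Glue head onto the front of the list tail."""
    return (head, tail)

def car(cell):
    """First element of a non-empty list."""
    return cell[0]

def cdr(cell):
    """Everything but the first element."""
    return cell[1]

def is_null(lst):
    """Equivalent of null?: is this the empty list?"""
    return lst is NIL

def make_list(*items):
    """Build a cons-cell list from Python values, back to front."""
    result = NIL
    for item in reversed(items):
        result = cons(item, result)
    return result

def to_python(lst):
    """Convert a cons-cell list to a Python list for printing."""
    out = []
    while not is_null(lst):
        out.append(car(lst))
        lst = cdr(lst)
    return out

people = make_list("stallman", "moglen", "lessig")
print(car(people))
print(to_python(cdr(people)))
print(is_null(cdr(make_list("stallman"))))
print(to_python(cons("stallman", make_list("moglen", "lessig"))))
```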

Now, for this exercise we will pretend that the basic arithmetic functions have not been defined for us. Instead all we have is add1 and sub1, each of which adds or subtracts 1 from its argument respectively.

(add1 5)
=> 6
(sub1 5)
=> 4

Oh, I almost forgot. We also have the zero? function defined for us, which tells us if its argument is 0 or not. Notice that functions that return true or false, called predicates, have a ? on the end.

(zero? 2)
=> #f
(zero? 0)
=> #t

To make things simple, these definitions will only consider non-negative numbers. We can define the + function (for only two arguments) in terms of the three basic functions shown above. It might be interesting to try to write this yourself before you look any further. (Hint: define it recursively!)

;; Adds together n and m
(define (+ n m)
  (if (zero? m) n
      (add1 (+ n (sub1 m)))))

If the second argument is 0 we are done and simply return the first argument. If not, we add 1 to n + (m - 1). The - function is defined similarly.

;; Subtracts m from n
(define (- n m)
  (if (zero? m) n
      (sub1 (- n (sub1 m)))))

Multiplication is the act of performing addition many times. We can go on defining it in terms of addition,

(define (* n m)
  (if (zero? m) 0
      (+ n (* n (sub1 m)))))

(We'll leave division as an exercise for the reader as it gets a little more complicated than I need to go in order to get my overall point across.)
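For readers following along without a Scheme at hand, the three definitions above translate almost word for word. This is a minimal Python sketch with the names plus, minus, times, and is_zero standing in for +, -, *, and zero?. Since each call recurses once per unit of m, Python's default recursion limit keeps m fairly small.

```python
def add1(n):
    return n + 1

def sub1(n):
    return n - 1

def is_zero(n):
    return n == 0

def plus(n, m):
    # n + m = 1 + (n + (m - 1)), bottoming out at n + 0 = n
    return n if is_zero(m) else add1(plus(n, sub1(m)))

def minus(n, m):
    # n - m = (n - (m - 1)) - 1, bottoming out at n - 0 = n
    return n if is_zero(m) else sub1(minus(n, sub1(m)))

def times(n, m):
    # n * m = n + n * (m - 1), bottoming out at n * 0 = 0
    return 0 if is_zero(m) else plus(n, times(n, sub1(m)))

print(plus(3, 4), minus(5, 2), times(4, 2))
```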

We will leave math behind for a moment and take a look at The Roots of Lisp, an excellent paper written by Paul Graham about John McCarthy, the inventor (or perhaps discoverer?) of Lisp, and how Lisp came to be. It turns out that in order to have a fully functional Lisp engine we only need seven primitive operators: operators defined outside of the language itself as building blocks for the language. For Lisp these seven operators are (Scheme-ized for our purposes): eq?, atom?, car, cdr, cons, quote, and if.

Notice how none of these are math operators. You may wonder how we can possibly perform mathematical operations when we lack these facilities. The answer: we have to define our own representation for numbers! Let's try this, define a number as a list of empty lists. So, the number 3 is,

'(() () ())

And here is 0, 2, and 4,

'()
'(() ())
'(() () () ())

See how that works? Before, when we wanted to define addition and subtraction, we needed three other functions: zero?, add1, and sub1. With our number representation, how could we define add1 with our seven primitive operators? Our numbers are defined as lists, so we can use our list operators. To add 1 to a number, we append another empty list. Hey, that sounds a lot like cons!

(define (add1 n)
  (cons '() n))

Subtraction is removing an element from the list, which sounds a lot like cdr,

(define (sub1 n)
  (cdr n))

And to define zero? we need to check for an empty list. Notice this will also be the definition for null?.

(define (zero? n)
  (eq? '() n))

And now we are back where we started. In fact, you can use the exact definitions above to define +, -, and *, with one change: the literal 0 returned by * must become '(), our new zero. Our entire number representation depends on how we define add1, sub1, and zero?. Let's try it out,

;; 3 + 4
(+ '(() () ()) '(() () () ()))
=> (() () () () () () ())

;; 5 - 2
(- '(() () () () ()) '(() ()))
=> (() () ())

;; 2 * 2
(* '(() ()) '(() ()))
=> (() () () ())

;; 3 + 4 * 2
(+ (* '(() () () ()) '(() ())) '(() () ()))
=> (() () () () () () () () () () ())
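The same experiment can be sketched in Python to show that the arithmetic really does rest on nothing but the three primitives. This is a minimal illustration under a few assumptions: numbers are Python lists of empty lists, and make_arithmetic, encode, and decode are names invented for the example. The zero value is passed in explicitly because multiplication has to return the representation's own zero rather than the integer 0.

```python
def make_arithmetic(add1, sub1, is_zero, zero):
    """Build +, -, and * from nothing but the three primitives and a zero."""
    def plus(n, m):
        return n if is_zero(m) else add1(plus(n, sub1(m)))

    def minus(n, m):
        return n if is_zero(m) else sub1(minus(n, sub1(m)))

    def times(n, m):
        return zero if is_zero(m) else plus(n, times(n, sub1(m)))

    return plus, minus, times

# The list-of-empty-lists representation.
def add1(n):
    return [[]] + n      # like (cons '() n)

def sub1(n):
    return n[1:]         # like (cdr n)

def is_zero(n):
    return n == []       # like (eq? '() n)

plus, minus, times = make_arithmetic(add1, sub1, is_zero, [])

def encode(k):
    """Ordinary integer -> list of k empty lists (for convenience only)."""
    return [[] for _ in range(k)]

def decode(n):
    """List of empty lists -> ordinary integer (for printing only)."""
    return len(n)

# 3 + 4 * 2
result = plus(encode(3), times(encode(4), encode(2)))
print(result)
print(decode(result))
```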

Pretty cool, huh? We just added arithmetic (albeit extremely simple) to our basic Lisp engine. With some modifications we should be able to define and operate on negative integers and even define any rational number (limited by how much memory your computer's hardware can provide).

Now, thank goodness this isn't how real Lisp implementations actually handle numbers. It would be incredibly slow and impractical, not to mention annoying to read. Normally, numbers and math operators are primitive so that they are fast.