State machines are perhaps one of those concepts you heard about in college but never put into practice. Maybe you use them regularly. Either way, you certainly run into them often, from regular expressions to traffic lights.
I love when my current problem can be solved with a state machine. They’re fun to design and implement, and I have high confidence about correctness. They tend to be small, fast, and easy to reason about.
Inspired by a puzzle, I came up with this deterministic state machine for decoding Morse code. It accepts a dot ('.'), dash ('-'), or terminator (0) one at a time, advancing through a state machine step by step:
int morse_decode(int state, int c)
{
    static const unsigned char t[] = {
        0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
        0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
        0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
        0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
        0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
        0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
        0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
        0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a
    };
    int v = t[-state];
    switch (c) {
    case 0x00: return v >> 2 ? t[(v >> 2) + 63] : 0;
    case 0x2e: return v & 2 ? state*2 - 1 : 0;
    case 0x2d: return v & 1 ? state*2 - 2 : 0;
    default:   return 0;
    }
}
It typically compiles to under 200 bytes (table included), requires only a few bytes of memory to operate, and will fit on even the smallest of microcontrollers. The full source listing, documentation, and comprehensive test suite:
https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c
The state machine is trie-shaped, and the 100-byte table t is the static encoding of the Morse code trie:
Dots traverse left, dashes traverse right, and terminals emit the character at the current node (a terminal state). Stopping on a red node, or attempting to take an unlisted edge, is an error (invalid input).
Each node in the trie is a byte in the table. Dot and dash each have a bit indicating if their edge exists. The remaining bits index into a 1-based character table (at the end of t), and a 0 “index” indicates an empty (red) node. The nodes themselves are laid out as a binary heap in an array: the left and right children of the node at i are found at i*2+1 and i*2+2. No need to waste memory storing edges!
Since C sadly does not have multiple return values, I’m using the sign bit of the return value to create a kind of sum type. A negative return value is a state — which is why the state is negated internally before use. A positive result is a character output. If zero, the input was invalid. Only the initial state is non-negative (zero), which is fine since it’s, by definition, not possible to traverse to the initial state. No c input will produce a bad state.
In the original problem the terminals were missing. Despite being a state machine, morse_decode is a pure function. The caller can save their position in the trie by saving the state integer and trying different inputs from that state.
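For instance, here’s a minimal usage sketch of my own (not from the original source) decoding “.-” into 'A'. Each intermediate return is a negative state, and the terminator input yields the decoded character:
int decode_example(void)
{
    int state = 0;
    state = morse_decode(state, '.');  /* negative: intermediate state */
    state = morse_decode(state, '-');  /* negative: intermediate state */
    return morse_decode(state, 0);     /* 0x41: the character 'A' */
}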
The classic UTF-8 decoder state machine is Bjoern Hoehrmann’s Flexible and Economical UTF-8 Decoder. It packs the entire state machine into a relatively small table using clever tricks. It’s easily my favorite UTF-8 decoder.
I wanted to try my own hand at it, so I re-derived the same canonical UTF-8 automaton:
Then I encoded this diagram directly into a much larger (2,064-byte), less elegant table, too large to display inline here:
https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c
However, the trade-off is that the executable code is smaller, faster, and branchless again (by accident, I swear!):
int utf8_decode(int state, long *cp, int byte)
{
    static const signed char table[8][256] = { /* ... */ };
    static const unsigned char masks[2][8] = { /* ... */ };
    int next = table[state][byte];
    *cp = (*cp << 6) | (byte & masks[!state][next&7]);
    return next;
}
Like Bjoern’s decoder, there’s a code point accumulator. The real state machine has 1,109,950 terminal states, and many more edges and nodes. The accumulator is an optimization to track exactly which edge was taken to which node without having to represent such a monstrosity.
Despite the huge table I’m pretty happy with it.
Here’s another state machine I came up with awhile back for counting words one Unicode code point at a time while accounting for Unicode’s various kinds of whitespace. If your input is bytes, then plug this into the above UTF-8 state machine to convert bytes to code points! This one uses a switch instead of a lookup table since the table would be sparse (i.e. let the compiler figure it out).
/* State machine counting words in a sequence of code points.
 *
 * The current word count is the absolute value of the state, so
 * the initial state is zero. Code points are fed into the state
 * machine one at a time, each call returning the next state.
 */
long word_count(long state, long codepoint)
{
    switch (codepoint) {
    case 0x0009: case 0x000a: case 0x000b: case 0x000c: case 0x000d:
    case 0x0020: case 0x0085: case 0x00a0: case 0x1680: case 0x2000:
    case 0x2001: case 0x2002: case 0x2003: case 0x2004: case 0x2005:
    case 0x2006: case 0x2007: case 0x2008: case 0x2009: case 0x200a:
    case 0x2028: case 0x2029: case 0x202f: case 0x205f: case 0x3000:
        return state < 0 ? -state : state;
    default:
        return state < 0 ? state : -1 - state;
    }
}
I’m particularly happy with the edge-triggered state transition mechanism. The sign of the state tracks whether the “signal” is “high” (inside of a word) or “low” (outside of a word), and so it counts rising edges.
The counter is not technically part of the state machine — it isn’t really “finite,” though for practical reasons it eventually overflows — but is rather an external count of the times the state machine transitions from low to high, which is the actual, useful output.
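To make the mechanism concrete, here’s a short usage sketch of my own (not from the original article) counting the two words in “A B”:
long count_example(void)
{
    long state = 0;
    state = word_count(state, 'A');     /* -1: rising edge, word 1 begins */
    state = word_count(state, ' ');     /*  1: falling edge, count holds  */
    state = word_count(state, 'B');     /* -2: rising edge, word 2 begins */
    return state < 0 ? -state : state;  /* => 2 */
}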
Reader challenge: Find a slick, efficient way to encode all those code points as a table rather than rely on whatever the compiler generates for the switch (chain of branches, jump table?).
In languages that support them, state machines can be implemented using coroutines, including generators. I do particularly like the idea of compiler-synthesized coroutines as state machines, though this is a rare treat. The state is implicit in the coroutine at each yield, so the programmer doesn’t have to manage it explicitly. (Though often that explicit control is powerful!)
Unfortunately in practice it always feels clunky. The following implements the word count state machine (albeit in a rather un-Pythonic way). The generator returns the current count and is continued by sending it another code point:
WHITESPACE = {
    0x0009, 0x000a, 0x000b, 0x000c, 0x000d,
    0x0020, 0x0085, 0x00a0, 0x1680, 0x2000,
    0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
    0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
}

def wordcount():
    count = 0
    while True:
        while True:
            # low signal
            codepoint = yield count
            if codepoint not in WHITESPACE:
                count += 1
                break
        while True:
            # high signal
            codepoint = yield count
            if codepoint in WHITESPACE:
                break
However, the generator ceremony dominates the interface, so you’d probably want to wrap it in something nicer — at which point there’s really no reason to use the generator in the first place:
wc = wordcount()
next(wc) # prime the generator
wc.send(ord('A')) # => 1
wc.send(ord(' ')) # => 1
wc.send(ord('B')) # => 2
wc.send(ord(' ')) # => 2
Same idea in Lua, which famously has full coroutines:
local WHITESPACE = {
    [0x0009]=true, [0x000a]=true, [0x000b]=true, [0x000c]=true,
    [0x000d]=true, [0x0020]=true, [0x0085]=true, [0x00a0]=true,
    [0x1680]=true, [0x2000]=true, [0x2001]=true, [0x2002]=true,
    [0x2003]=true, [0x2004]=true, [0x2005]=true, [0x2006]=true,
    [0x2007]=true, [0x2008]=true, [0x2009]=true, [0x200a]=true,
    [0x2028]=true, [0x2029]=true, [0x202f]=true, [0x205f]=true,
    [0x3000]=true
}

function wordcount()
    local count = 0
    while true do
        while true do
            -- low signal
            local codepoint = coroutine.yield(count)
            if not WHITESPACE[codepoint] then
                count = count + 1
                break
            end
        end
        while true do
            -- high signal
            local codepoint = coroutine.yield(count)
            if WHITESPACE[codepoint] then
                break
            end
        end
    end
end
Except for initially priming the coroutine, at least coroutine.wrap() hides the fact that it’s a coroutine:
wc = coroutine.wrap(wordcount)
wc() -- prime the coroutine
wc(string.byte('A')) -- => 1
wc(string.byte(' ')) -- => 1
wc(string.byte('B')) -- => 2
wc(string.byte(' ')) -- => 2
Finally, a couple more examples not worth describing in detail here. First a Unicode case folding state machine:
https://github.com/skeeto/scratch/blob/master/misc/casefold.c
It’s just an interface to do a lookup into the official case folding table. It was an experiment, and I probably wouldn’t use it in a real program.
Second, I’ve mentioned my UTF-7 encoder and decoder before. It’s not obvious from the interface, but internally both the encoder and decoder are just state machines, which is what allows them to “pause” between any pair of input/output bytes.
Machine learning is a trendy topic, so naturally it’s often used for inappropriate purposes where a simpler, more efficient, and more reliable solution suffices. The other day I saw an illustrative and fun example of this: Neural Network Cars and Genetic Algorithms. The video demonstrates 2D cars driven by a neural network with weights determined by a genetic algorithm. However, the entire scheme can be replaced by a first-degree polynomial without any loss in capability. The machine learning part is overkill.
Above demonstrates my implementation using a polynomial to drive the cars. My wife drew the background. There’s no path-finding; these cars are just feeling their way along the track, “following the rails” so to speak.
My intention is not to pick on this project in particular. The likely motivation in the first place was a desire to apply a neural network to something. Many of my own projects are little more than a vehicle to try something new, so I can sympathize. Though a professional setting is different, where machine learning should be viewed with a more skeptical eye than it’s usually given. For instance, don’t use active learning to select sample distribution when a quasirandom sequence will do.
In the video, the car has a limited turn radius, and minimum and maximum speeds. (I’ve retained these constraints in my own simulation.) There are five sensors — forward, forward-diagonals, and sides — each sensing the distance to the nearest wall. These are fed into a 3-layer neural network, and the outputs determine throttle and steering. Sounds pretty cool!
A key feature of neural networks is that the outputs are a nonlinear function of the inputs. However, steering a 2D car is simple enough that a linear function is more than sufficient, and neural networks are unnecessary. Here are my equations:
steering = C0*input1 - C0*input3
throttle = C1*input2
I only need three of the original inputs — forward for throttle, and diagonals for steering — and the driver has just two parameters, C0 and C1, the polynomial coefficients. Optimal values depend on the track layout and car configuration, but for my simulation, most values above 0 and below 1 are good enough in most cases. It’s less a matter of crashing and more about navigating the course quickly.
The lengths of the red lines below are the driver’s three inputs:
These polynomials are obviously much faster than a neural network, but they’re also easy to understand and debug. I can confidently reason about the entire range of possible inputs rather than worry about a trained neural network responding strangely to untested inputs.
Instead of doing anything fancy, my program generates the coefficients at random to explore the space. If I wanted to generate a good driver for a course, I’d run a few thousand of these and pick the coefficients that complete the course in the shortest time. For instance, these coefficients make for a fast, capable driver for the course featured at the top of the article:
C0 = 0.896336973, C1 = 0.0354805067
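Here’s a minimal sketch of that random search, my own illustration rather than the actual program. The simulate() function is a hypothetical stand-in returning the course completion time for a pair of coefficients, or infinity on a crash:
#include <math.h>
#include <stdlib.h>

double simulate(double c0, double c1);  /* hypothetical: course time, INFINITY on crash */

void search(double *best_c0, double *best_c1)
{
    double best_time = INFINITY;
    for (int i = 0; i < 4096; i++) {
        double c0 = rand() / (double)RAND_MAX;  /* random point in [0, 1] */
        double c1 = rand() / (double)RAND_MAX;
        double t = simulate(c0, c1);
        if (t < best_time) {
            best_time = t;
            *best_c0 = c0;
            *best_c1 = c1;
        }
    }
}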
Many constants can complete the track, but some will be faster than others. If I was developing a racing game using this as the AI, I’d not just pick constants that successfully complete the track, but the ones that do it quickly. Here’s what the spread can look like:
If you want to play around with this yourself, here’s my C source code that implements this driving AI and generates the videos and images above:
Racetracks are just images drawn in your favorite image editing program using the colors documented in the source header.
Despite the goal of JSON being a subset of JavaScript — which it failed to achieve (update: this was fixed) — parsing JSON is quite unlike parsing a programming language. For invalid inputs, the specific cause of error is often counter-intuitive. Normally this doesn’t matter, but I recently ran into a case where it does.
Consider this invalid input to a JSON parser:
[01]
To a human this might be interpreted as an array containing a number. Either the leading zero is ignored, or it indicates octal, as it does in many languages, including JavaScript. In either case the number in the array would be 1.
However, JSON does not support leading zeros, neither ignoring them nor supporting octal notation. Here’s the railroad diagram for numbers from the JSON specification:
Or in regular expression form:
-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?
If a token starts with 0 then it can only be followed by '.', 'e', or 'E'. It cannot be followed by a digit. So, the natural human response to mentally parsing [01] is: This input is invalid because it contains a number with a leading zero, and leading zeros are not accepted. But this is not actually why parsing fails!
A simple model is a parser that consumes tokens from a lexer. The lexer’s job is to read individual code points (characters) from the input and group them into tokens. The possible tokens are string, number, left brace, right brace, left bracket, right bracket, comma, true, false, and null. The lexer skips over insignificant whitespace, and it doesn’t care about structure, like matching braces and brackets. That’s the parser’s job.
In some instances the lexer can fail to parse a token. For example, if while looking for a new token the lexer reads the character %, then the input must be invalid. No token starts with this character. So in some cases invalid input will be detected by the lexer.
The parser consumes tokens from the lexer and, using some state, ensures the sequence of tokens is valid. For example, arrays must be a well-formed sequence of left bracket, value, comma, value, comma, etc., right bracket. One way to reject input with trailing garbage is for the lexer to also produce an EOF (end of file/input) token when there are no more tokens, and the parser can specifically check for that token before accepting the input as valid.
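As a rough sketch of my own (not from any particular parser), the token kinds in this model might be represented like so:
/* Hypothetical token kinds for the lexer/parser model described above. */
enum json_token {
    TOK_STRING, TOK_NUMBER,
    TOK_LBRACE, TOK_RBRACE,      /* { } */
    TOK_LBRACKET, TOK_RBRACKET,  /* [ ] */
    TOK_COMMA,
    TOK_TRUE, TOK_FALSE, TOK_NULL,
    TOK_EOF,                     /* lets the parser reject trailing garbage */
};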
Getting back to the input [01], a JSON parser receives a left bracket token, then updates its bookkeeping to track that it’s parsing an array. When looking for the next token, the lexer sees the character 0 followed by 1. According to the railroad diagram, this is a number token (it starts with 0), but 1 cannot be part of this token, so it produces a number token with the contents “0”. Everything is still fine.
Next the lexer sees 1 followed by ]. Since ] cannot be part of a number, it produces another number token with the contents “1”. The parser receives this token but, since it’s parsing an array, it expects either a comma token or a right bracket. Since this is neither, the parser fails with an error about an unexpected number. The parser will not complain about leading zeros because JSON has no concept of leading zeros. Human intuition is right, but for the wrong reasons.
Try this for yourself in your favorite JSON parser. Or even just pop up the JavaScript console in your browser and try it out:
JSON.parse('[01]');
Firefox reports:
SyntaxError: JSON.parse: expected ‘,’ or ‘]’ after array element
Chromium reports:
SyntaxError: Unexpected number in JSON
Edge reports (note it says “number” not “digit”):
Error: Invalid number at position:3
In all cases the parsers accepted a zero as the first array element, then rejected the input after the second number token for being a bad sequence of tokens. In other words, this is a parser error rather than a lexer error, as a human might intuit.
My JSON parser comes with a testing tool that shows the token stream up until the parser rejects the input, useful for understanding these situations:
$ echo '[01]' | tests/stream
struct expect seq[] = {
    {JSON_ARRAY},
    {JSON_NUMBER, "0"},
    {JSON_ERROR},
};
There’s an argument to be made here that perhaps the human readable error message should mention leading zeros, since that’s likely the cause of the invalid input. That is, a human probably thought JSON allowed leading zeros, and so the clearer message would tell the human that JSON does not allow leading zeros. This is the “more art than science” part of parsing.
It’s the same story with this invalid input:
[truefalse]
From this input, the lexer unambiguously produces left bracket, true, false, right bracket. It’s still up to the parser to reject this input. The only reason we never see truefalse in valid JSON is that the overall structure never allows these tokens to be adjacent, not because they’d be ambiguous. Programming languages have identifiers, and in a programming language this would parse as the identifier truefalse rather than true followed by false. From this point of view, JSON seems quite strange.
Just as before, Firefox reports:
SyntaxError: JSON.parse: expected ‘,’ or ‘]’ after array element
Chromium reports the same error as it does for [true false]:
SyntaxError: Unexpected token f in JSON
Edge’s message is probably a minor bug in their JSON parser:
Error: Expected ‘]’ at position:10
Position 10 is the last character in false. The lexer consumed false from the input, produced a “false” token, then the parser rejected the input. When it reported the error, it chose the end of the invalid token as the error position rather than the start, despite the fact that the only two valid tokens (comma, right bracket) are both a single character. It should also say “Expected ‘]’ or ‘,’” (as Firefox does) rather than just “]”.
That’s all pretty academic. Except for producing nice error messages, nobody really cares so much why the input was rejected. The mismatch between intuition and reality isn’t important.
However, it does come up with concatenated JSON. Some parsers, including mine, will optionally consume multiple JSON values, one after another, from the same input. Here’s an example from one of my favorite command line tools, jq:
echo '{"x":0,"y":1}{"x":2,"y":3}{"x":4,"y":5}' | jq '.x + .y'
1
5
9
The input contains three unambiguously-concatenated JSON objects, so the parser produces three distinct objects. Now consider this input, this time outside of the context of an array:
01
Is this invalid, one number, or two numbers? According to the lexer and parser model described above, this is valid and unambiguously two concatenated numbers. Here’s what my parser says:
$ echo '01' | tests/stream
struct expect seq[] = {
    {JSON_NUMBER, "0"},
    {JSON_DONE},
    {JSON_NUMBER, "1"},
    {JSON_DONE},
    {JSON_ERROR},
};
Note: The JSON_DONE “token” indicates acceptance, and the JSON_ERROR token is an EOF indicator, not a hard error. Since jq allows leading zeros in its JSON input, it’s ambiguous and parses this as the number 1, so asking its opinion on this input isn’t so interesting. I surveyed some other JSON parsers that accept concatenated JSON:
For my parser it’s the same story for truefalse:
$ echo 'truefalse' | tests/stream
struct expect seq[] = {
    {JSON_TRUE, "true"},
    {JSON_DONE},
    {JSON_FALSE, "false"},
    {JSON_DONE},
    {JSON_ERROR},
};
Neither rejecting nor accepting this input is wrong, per se. Concatenated JSON is outside of the scope of JSON itself, and concatenating arbitrary JSON objects without a whitespace delimiter can lead to weird and ill-formed input. This is all a great argument in favor of Newline Delimited JSON and its two simple rules: values are delimited by '\n', and each line is itself a valid JSON value.
This solves the concatenation issue, and, even more, it works well with parsers not supporting concatenation: Split the input on newlines and pass each line to your JSON parser.
'((add . (lambda (a b) (+ a b)))
  (sub . (lambda (a b) (- a b)))
  (mul . (lambda (a b) (* a b)))
  (div . (lambda (a b) (/ a b))))
It looks like it would work, and indeed it does work in this case. However, there are good reasons to actually evaluate those lambda expressions. Eventually invoking the lambda expressions in the quoted form above is equivalent to using eval. So, instead, prefer the backquote form:
`((add . ,(lambda (a b) (+ a b)))
  (sub . ,(lambda (a b) (- a b)))
  (mul . ,(lambda (a b) (* a b)))
  (div . ,(lambda (a b) (/ a b))))
There are a lot of interesting things to say about this, but let’s first reduce it to two very simple cases:
(lambda (x) x)
'(lambda (x) x)
What’s the difference between these two forms? The first is a lambda expression, and it evaluates to a function object. The other is a quoted list that looks like a lambda expression, and it evaluates to a list — a piece of data.
A naive evaluation of these expressions in *scratch* (C-x C-e) suggests they are identical, and so it would seem that quoting a lambda expression doesn’t really matter:
(lambda (x) x)
;; => (lambda (x) x)
'(lambda (x) x)
;; => (lambda (x) x)
However, there are two common situations where this is not the case: byte compilation and lexical scope.
It’s a little trickier to evaluate these forms byte compiled in the scratch buffer since that doesn’t happen automatically. But if it did, it would look like this:
;;; -*- lexical-binding: nil; -*-
(lambda (x) x)
;; => #[(x) "\010\207" [x] 1]
'(lambda (x) x)
;; => (lambda (x) x)
The #[...] is the syntax for a byte-code function object. As discussed in detail in my byte-code internals article, it’s a special vector object that contains byte-code, and other metadata, for evaluation by Emacs’ virtual stack machine. Elisp is one of very few languages with readable function objects, and this feature is core to its ahead-of-time byte compilation.
The quote, by definition, prevents evaluation, and so inhibits byte
compilation of the lambda expression. It’s vital that the byte compiler
does not try to guess the programmer’s intent and compile the expression
anyway, since that would interfere with lists that just so happen to
look like lambda expressions — i.e. any list containing the lambda
symbol.
There are three reasons you want your lambda expressions to get byte compiled:
Byte-compiled functions are significantly faster. That’s the main purpose for byte compilation after all.
The compiler performs static checks, producing warnings and errors ahead of time. This lets you spot certain classes of problems before they occur. The static analysis is even better under lexical scope due to its tighter semantics.
Under lexical scope, byte-compiled closures may use less memory. More specifically, they won’t accidentally keep objects alive longer than necessary. I’ve never seen a name for this implementation issue, but I call it overcapturing. More on this later.
While it’s common for personal configurations to skip byte compilation, Elisp should still generally be written as if it were going to be byte compiled. General rule of thumb: Ensure your lambda expressions are actually evaluated.
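A quick illustration of that rule of thumb, my own example rather than one from the manual (kill-buffer-hook is a real hook; do-cleanup is a hypothetical stand-in):
;; Quoted: merely a list, never byte-compiled, invisible to static checks.
(add-hook 'kill-buffer-hook '(lambda () (do-cleanup)))  ; avoid

;; Evaluated: a genuine function object the compiler can compile and check.
(add-hook 'kill-buffer-hook (lambda () (do-cleanup)))   ; prefer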
As I’ve stressed many times, you should always use lexical scope. There’s no practical disadvantage or trade-off involved. Just do it.
Once lexical scope is enabled, the two expressions diverge even without byte compilation:
;;; -*- lexical-binding: t; -*-
(lambda (x) x)
;; => (closure (t) (x) x)
'(lambda (x) x)
;; => (lambda (x) x)
Under lexical scope, lambda expressions evaluate to closures. Closures capture their lexical environment in their closure object — nothing in this particular case. It’s a type of function object, making it a valid first argument to funcall.
Since the quote prevents the second expression from being evaluated, semantically it evaluates to a list that just so happens to look like a (non-closure) function object. Invoking a data object as a function is like using eval — i.e. executing data as code. Everyone already knows eval should not be used lightly.
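For illustration, here’s a snippet of my own, evaluated in the interpreter. Both calls return :foo, but the second invokes a bare list as code, which is effectively eval:
(funcall (lambda (x) x) :foo)   ; calls a real function object
(funcall '(lambda (x) x) :foo)  ; interprets a list as code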
It’s a little more interesting to look at a closure that actually captures a variable, so here’s a definition for constantly, a higher-order function that returns a closure that accepts any number of arguments and returns a particular constant:
(defun constantly (x)
  (lambda (&rest _) x))
Without byte compiling it, here’s an example of its return value:
(constantly :foo)
;; => (closure ((x . :foo) t) (&rest _) x)
The environment has been captured as an association list (with a trailing t), and we can plainly see that the variable x is bound to the symbol :foo in this closure. Consider that we could manipulate this data structure (e.g. setcdr or setf) to change the binding of x for this closure. This is essentially how closures mutate their own environment. Moreover, closures from the same environment share structure, so such mutations are also shared. More on this later.
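To see this in action, here’s a sketch of my own, evaluated in the interpreter under lexical-binding: t, where the closure is still in its list form:
(let ((c (constantly :foo)))
  ;; Rebind x by mutating the closure's own environment alist.
  (setcdr (car (cadr c)) :bar)
  (funcall c))
;; => :bar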
Semantically, closures are distinct objects (via eq), even if the variables they close over are bound to the same value. This is because they each have a distinct environment attached to them, even if in some invisible way.
(eq (constantly :foo) (constantly :foo))
;; => nil
Without byte compilation, this is true even when there’s no lexical environment to capture:
(defun dummy ()
  (lambda () t))
(eq (dummy) (dummy))
;; => nil
The byte compiler is smart, though. As an optimization, the same closure object is reused when possible, avoiding unnecessary work, including multiple object allocations. Though this is a bit of an abstraction leak. A function can (ab)use this to introspect whether it’s been byte compiled:
(defun have-i-been-compiled-p ()
  (let ((funcs (vector nil nil)))
    (dotimes (i 2)
      (setf (aref funcs i) (lambda ())))
    (eq (aref funcs 0) (aref funcs 1))))
(have-i-been-compiled-p)
;; => nil
(byte-compile 'have-i-been-compiled-p)
(have-i-been-compiled-p)
;; => t
The trick here is to evaluate the exact same non-capturing lambda expression twice, which requires a loop (or at least some sort of branch). Semantically we should think of these closures as being distinct objects, but, if we squint our eyes a bit, we can see the effects of the behind-the-scenes optimization.
Don’t actually do this in practice, of course. That’s what byte-code-function-p is for, which won’t rely on a subtle implementation detail.
I mentioned before that one of the potential gotchas of not byte compiling your lambda expressions is overcapturing closure variables in the interpreter.
To evaluate lisp code, Emacs has both an interpreter and a virtual machine. The interpreter evaluates code in list form: cons cells, numbers, symbols, etc. The byte compiler is like the interpreter, but instead of directly executing those forms, it emits byte-code that, when evaluated by the virtual machine, produces identical visible results to the interpreter — in theory.
What this means is that Emacs contains two different implementations of Emacs Lisp, one in the interpreter and one in the byte compiler. The Emacs developers have been maintaining and expanding these implementations side-by-side for decades. A pitfall to this approach is that the implementations can, and do, diverge in their behavior. We saw this above with that introspective function, and it comes up in practice with advice.
Another way they diverge is in closure variable capture. For example:
;;; -*- lexical-binding: t; -*-
(defun overcapture (x y)
  (when y
    (lambda () x)))
(overcapture :x :some-big-value)
;; => (closure ((y . :some-big-value) (x . :x) t) nil x)
Notice that the closure captured y even though it’s unnecessary. This is because the interpreter doesn’t, and shouldn’t, take the time to analyze the body of the lambda to determine which variables should be captured. That would need to happen at run-time each time the lambda is evaluated, which would make the interpreter much slower. Overcapturing can get pretty messy if macros are introducing their own hidden variables.
On the other hand, the byte compiler can do this analysis just once at compile-time. And it’s already doing the analysis as part of its job. It can avoid this problem easily:
(overcapture :x :some-big-value)
;; => #[0 "\300\207" [:x] 1]
It’s clear that :some-big-value isn’t present in the closure. But… how does this work?
Recall from the internals article that the four core elements of a byte-code function object are a parameter descriptor, a string of byte-code, a constants vector, and a maximum stack usage — the four arguments to make-byte-code seen below.
While evaluating a lambda expression might seem to require compiling a whole new function each time, there’s actually not that much to it: the behavior of the function remains the same; only the closed-over environment changes.
What this means is that closures produced by a common lambda expression can all share the same byte-code string (second element). Their bodies are identical, so they compile to the same byte-code. Where they differ are in their constants vector (third element), which gets filled out according to the closed over environment. It’s clear just from examining the outputs:
(constantly :a)
;; => #[128 "\300\207" [:a] 2]
(constantly :b)
;; => #[128 "\300\207" [:b] 2]
constantly has three of the four components of the closure in its own constant pool. Its job is to construct the constants vector, and then assemble the whole thing into a byte-code function object (#[...]). Here it is with M-x disassemble:
0 constant make-byte-code
1 constant 128
2 constant "\300\207"
4 constant vector
5 stack-ref 4
6 call 1
7 constant 2
8 call 4
9 return
(Note: since the byte compiler doesn’t produce perfectly optimal code, I’ve simplified it for this discussion.)
It pushes most of its constants on the stack. Then the stack-ref at offset 5 puts x on the stack. Then it calls vector to create the constants vector (6). Finally, it constructs the function object (#[...]) by calling make-byte-code (8).
Since this might be clearer, here’s the same thing expressed back in terms of Elisp:
(defun constantly (x)
  (make-byte-code 128 "\300\207" (vector x) 2))
To see the disassembly of the closure’s byte-code:
(disassemble (constantly :x))
The result isn’t very surprising:
0 constant :x
1 return
Things get a little more interesting when mutation is involved. Consider this adder closure generator, which mutates its environment every time it’s called:
(defun adder ()
  (let ((total 0))
    (lambda () (cl-incf total))))

(let ((count (adder)))
  (funcall count)
  (funcall count)
  (funcall count))
;; => 3
(adder)
;; => #[0 "\300\211\242T\240\207" [(0)] 2]
The adder essentially works like this:
(defun adder ()
  (make-byte-code 0 "\300\211\242T\240\207" (vector (list 0)) 2))
In theory, this closure could operate by mutating its constants vector directly. But that wouldn’t be much of a constants vector, now would it!? Instead, mutated variables are boxed inside a cons cell. Closures don’t share constant vectors, so the main reason for boxing is to share variables between closures from the same environment. That is, they have the same cons in each of their constant vectors.
There’s no equivalent Elisp for the closure in adder, so here’s the disassembly:
0 constant (0)
1 dup
2 car-safe
3 add1
4 setcar
5 return
It puts two references to the boxed integer on the stack (constant, dup), unboxes the top one (car-safe), increments that unboxed integer, stores it back in the box (setcar) via the bottom reference, leaving the incremented value behind to be returned.
This all gets a little more interesting when closures interact:
(defun fancy-adder ()
  (let ((total 0))
    `(:add ,(lambda () (cl-incf total))
      :set ,(lambda (v) (setf total v))
      :get ,(lambda () total))))
(let ((counter (fancy-adder)))
  (funcall (plist-get counter :set) 100)
  (funcall (plist-get counter :add))
  (funcall (plist-get counter :add))
  (funcall (plist-get counter :get)))
;; => 102
(fancy-adder)
;; => (:add #[0 "\300\211\242T\240\207" [(0)] 2]
;; :set #[257 "\300\001\240\207" [(0)] 3]
;; :get #[0 "\300\242\207" [(0)] 1])
This is starting to resemble object oriented programming, with methods acting upon fields stored in a common, closed-over environment.
All three closures share a common variable, total. Since I didn’t use print-circle, this isn’t obvious from the last result, but each of those (0) conses is the same object. When one closure mutates the box, they all see the change. Here’s essentially how fancy-adder is transformed by the byte compiler:
(defun fancy-adder ()
  (let ((box (list 0)))
    (list :add (make-byte-code 0 "\300\211\242T\240\207" (vector box) 2)
          :set (make-byte-code 257 "\300\001\240\207" (vector box) 3)
          :get (make-byte-code 0 "\300\242\207" (vector box) 1))))
The backquote in the original fancy-adder brings this article full circle. This final example wouldn’t work correctly if those lambdas weren’t evaluated properly.
I use pseudo-random number generators (PRNGs) a whole lot. They’re an essential component in lots of algorithms and processes:
Monte Carlo simulations, where PRNGs are used to compute numeric estimates for problems that are difficult or impossible to solve analytically.
Monte Carlo tree search AI, where massive numbers of games are played out randomly in search of an optimal move. This is a specific application of the last item.
Genetic algorithms, where a PRNG creates the initial population, and then later guides in mutation and breeding of selected solutions.
Cryptography, where a cryptographically-secure PRNG (CSPRNG) produces output that is predictable for recipients who know a particular secret, but not for anyone else. This article is only concerned with plain PRNGs.
For the first three “simulation” uses, there are two primary factors that drive the selection of a PRNG. These factors can be at odds with each other:
The PRNG should be very fast. The application should spend its time running the actual algorithms, not generating random numbers.
PRNG output should have robust statistical qualities. Bits should appear to be independent and the output should closely follow the desired distribution. Poor quality output will negatively affect the algorithms using it. Also just as important is how you use it, but this article will focus only on generating bits.
In other situations, such as in cryptography or online gambling, another important property is that an observer can’t learn anything meaningful about the PRNG’s internal state from its output. For the three simulation cases I care about, this is not a concern. Only speed and quality properties matter.
Depending on the programming language, the PRNGs found in various
standard libraries may be of dubious quality. They’re slower than they
need to be, or have poorer quality than required. In some cases, such
as rand()
in C, the algorithm isn’t specified, and you can’t rely on
it for anything outside of trivial examples. In other cases the
algorithm and behavior is specified, but you could easily do better
yourself.
My preference is to BYOPRNG: Bring Your Own Pseudo-random Number Generator. You get reliable, identical output everywhere. Also, in the case of C and C++ — and if you do it right — by embedding the PRNG in your project, it will get inlined and unrolled, making it far more efficient than a slow call into a dynamic library.
A fast PRNG is going to be small, making it a great candidate for embedding as, say, a header library. That leaves just one important question, “Can the PRNG be small and have high quality output?” In the 21st century, the answer to this question is an emphatic “yes!”
For the past few years my main go-to for a drop-in PRNG has been xorshift*. The body of the function is 6 lines of C, and its entire state is a 64-bit integer, directly seeded. However, there are a number of choices here, including other variants of Xorshift. How do I know which one is best? The only way to know is to test it, hence my 64-bit PRNG shootout:
Sure, there are other such shootouts, but they’re all missing something I want to measure. I also want to test in an environment very close to how I’d use these PRNGs myself.
Before getting into the details of the benchmark and each generator, here are the results. These tests were run on an i7-6700 (Skylake) running Linux 4.9.0.
Speed (MB/s)
PRNG FAIL WEAK gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline X X 15000 13100
blowfishcbc16 0 1 169 157
blowfishcbc4 0 5 725 676
blowfishctr16 1 3 187 184
blowfishctr4 1 5 890 1000
mt64 1 7 1700 1970
pcg64 0 4 4150 3290
rc4 0 5 366 185
spcg64 0 8 5140 4960
xoroshiro128+ 0 6 8100 7720
xorshift128+ 0 2 7660 6530
xorshift64* 0 3 4990 5060
And the actual dieharder outputs:
The clear winner is xoroshiro128+, with a function body of just 7 lines of C. It’s clearly the fastest, and the output had no observed statistical failures. However, that’s not the whole story. A couple of the other PRNGs have advantages that situationally make them better suited than xoroshiro128+. I’ll go over these in the discussion below.
These two versions of GCC and Clang were chosen because these are the latest available in Debian 9 “Stretch.” It’s easy to build and run the benchmark yourself if you want to try a different version.
In the speed benchmark, the PRNG is initialized, a 1-second alarm(1)
is set, then the PRNG fills a large volatile
buffer of 64-bit unsigned
integers again and again as quickly as possible until the alarm fires.
The amount of memory written is measured as the PRNG’s speed.
The baseline “PRNG” writes zeros into the buffer. This represents the absolute speed limit that no PRNG can exceed.
The purpose of making the buffer volatile is to force the entire output to actually be “consumed” as far as the compiler is concerned. Otherwise the compiler plays nasty tricks to make the program do as little work as possible. Another way to deal with this would be to write(2) the buffer, but of course I didn’t want to introduce unnecessary I/O into a benchmark.
On Linux, SIGALRM was impressively consistent between runs, meaning it was perfectly suitable for this benchmark. To account for any process scheduling wonkiness, the benchmark was run 8 times and only the fastest time was kept.
The SIGALRM handler sets a volatile global variable that tells the generator to stop. The PRNG call was unrolled 8 times to keep the alarm check from significantly impacting the benchmark. You can see the effect for yourself by changing UNROLL to 1 (i.e. “don’t unroll”) in the code. Unrolling beyond 8 times had no measurable effect in my tests.
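Here’s a rough sketch of that harness, reconstructed by me from the description above. The names, buffer size, and lack of unrolling are my own simplifications; see the linked shootout for the real code:
#include <signal.h>
#include <stdint.h>
#include <unistd.h>

#define N (1 << 16)
static volatile sig_atomic_t running;
static volatile uint64_t buf[N];

static void alarm_handler(int signum)
{
    (void)signum;
    running = 0;
}

/* Returns approximate bytes of PRNG output generated in one second. */
uint64_t benchmark(uint64_t (*prng)(uint64_t *), uint64_t *state)
{
    uint64_t total = 0;
    running = 1;
    signal(SIGALRM, alarm_handler);
    alarm(1);
    while (running) {
        for (int i = 0; i < N; i++)
            buf[i] = prng(state);  /* volatile store "consumes" the output */
        total += sizeof(buf);
    }
    return total;
}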
Due to the PRNGs being inlined, this unrolling makes the benchmark
less realistic, and it shows in the results. Using volatile
for the
buffer helped to counter this effect and reground the results. This is
a fuzzy problem, and there’s not really any way to avoid it, but I
will also discuss this below.
To measure the statistical quality of each PRNG — mostly as a sanity check — the raw binary output was run through dieharder 3.31.1:
prng | dieharder -g200 -a -m4
This statistical analysis has no timing characteristics and the results should be the same everywhere. You would only need to re-run it to test with a different version of dieharder, or a different analysis tool.
There’s not much information to glean from this part of the shootout. It mostly confirms that all of these PRNGs would work fine for simulation purposes. The WEAK results are not very significant and are only useful for breaking ties. Even a true RNG will get some WEAK results. For example, the x86 RDRAND instruction (not included in the actual shootout) got 7 WEAK results in my tests.
The FAIL results are more significant, but a single failure doesn’t mean much. A non-failing PRNG should be preferred to an otherwise equal PRNG with a failure.
Admittedly the definition for “64-bit PRNG” is rather vague. My high performance targets are all 64-bit platforms, so the highest PRNG throughput will be built on 64-bit operations (if not wider). The original plan was to focus on PRNGs built from 64-bit operations.
Curiosity got the best of me, so I included some PRNGs that don’t use any 64-bit operations. I just wanted to see how they stacked up.
One of the reasons I wrote a Blowfish implementation was to evaluate its performance and statistical qualities, so naturally I included it in the benchmark. It only uses 32-bit addition and 32-bit XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit integer. There are two different properties that combine to make four variants in the benchmark: number of rounds and block mode.
Blowfish normally uses 16 rounds. This makes it a lot slower than a non-cryptographic PRNG but gives it a security margin. I don’t care about the security margin, so I included a 4-round variant. As expected, it’s about four times faster.
The other feature I tested is the block mode: Cipher Block Chaining (CBC) versus Counter (CTR) mode. In CBC mode it encrypts zeros as plaintext. This just means it’s encrypting its last output. The ciphertext is the PRNG’s output.
In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster than CBC in the 16-round variant and 23% faster in the 4-round variant. The reason is simple, and it’s in part an artifact of unrolling the generation loop in the benchmark.
In CBC mode, each output depends on the previous, but in CTR mode all
blocks are independent. Work can begin on the next output before the
previous output is complete. The x86 architecture uses out-of-order
execution to achieve many of its performance gains: Instructions may
be executed in a different order than they appear in the program,
though their observable effects must generally be ordered
correctly. Breaking dependencies between instructions allows
out-of-order execution to be fully exercised. It also gives the
compiler more freedom in instruction scheduling, though the volatile
accesses cannot be reordered with respect to each other (hence it
helping to reground the benchmark).
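To make the dependency difference concrete, here’s a sketch of my own of the two modes as PRNG steps. blowfish_encrypt64 is a hypothetical stand-in for the real keyed cipher:
#include <stdint.h>

uint64_t blowfish_encrypt64(uint64_t block);  /* hypothetical keyed cipher */

/* CBC as a PRNG: each output feeds the next encryption, forming a
 * dependency chain, so consecutive calls cannot overlap. */
uint64_t cbc_next(uint64_t *prev)
{
    *prev = blowfish_encrypt64(*prev);
    return *prev;
}

/* CTR as a PRNG: blocks are independent, so the CPU can overlap them. */
uint64_t ctr_next(uint64_t *counter)
{
    return blowfish_encrypt64((*counter)++);
}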
Statistically, the 4-round cipher was not significantly worse than the 16-round cipher. For simulation purposes the 4-round cipher would be perfectly sufficient, though xoroshiro128+ is still more than 9 times faster without sacrificing quality.
On the other hand, CTR mode had a single failure in both the 4-round (dab_filltree2) and 16-round (dab_filltree) variants. At least for Blowfish, is there something that makes CTR mode less suitable than CBC mode as a PRNG?
In the end Blowfish is too slow and too complicated to serve as a simulation PRNG. This was entirely expected, but it’s interesting to see how it stacks up.
Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. I tested the 64-bit variant, MT19937-64, and there are four problems:
It’s between 1/4 and 1/5 the speed of xoroshiro128+.
It’s got a large state: 2,500 bytes. Versus xoroshiro128+’s 16 bytes.
Its implementation is three times bigger than xoroshiro128+, and much more complicated.
It had one statistical failure (dab_filltree2).
Curiously my implementation is 16% faster with Clang than GCC. Since Mersenne Twister isn’t seriously in the running, I didn’t take time to dig into why.
Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.
The Permuted Congruential Generator (PCG) has some really interesting history behind it, particularly with its somewhat unusual paper, controversial for both its excessive length (58 pages) and informal style. It’s in close competition with Xorshift and xoroshiro128+. I was really interested in seeing how it stacked up.
PCG is really just a Linear Congruential Generator (LCG) that doesn’t output the lowest bits (too poor quality), and has an extra permutation step to make up for the LCG’s other weaknesses. I included two variants in my benchmark: the official PCG and a “simplified” PCG (sPCG) with a simple permutation step. sPCG is just the first PCG presented in the paper (34 pages in!).
Here’s essentially what the simplified version looks like:
uint32_t
spcg32(uint64_t s[1])
{
    uint64_t m = 0x9b60933458e17d7d;
    uint64_t a = 0xd737232eeccdf7ed;
    *s = *s * m + a;
    int shift = 29 - (*s >> 61);
    return *s >> shift;
}
The third line with the modular multiplication and addition is the LCG. The bit shift is the permutation. This PCG uses the most significant three bits of the result to determine which 32 bits to output. That’s the novel component of PCG.
The two constants are entirely my own devising. It’s two 64-bit primes generated using Emacs’ M-x calc: 2 64 ^ k r k n k p k p k p.
Heck, that’s so simple that I could easily memorize this and code it from scratch on demand. Key takeaway: This is one way that PCG is situationally better than xoroshiro128+. In a pinch I could use Emacs to generate a couple of primes and code the rest from memory. If you participate in coding competitions, take note.
However, you probably also noticed PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software.
Instead, I doubled up on everything to run two PRNGs in parallel. Despite the doubling in state size, the period doesn’t get any larger since the PRNGs don’t interact with each other. We get something in return, though. Remember what I said about out-of-order execution? Except for the last step combining their results, since the two PRNGs are independent, doubling up shouldn’t quite halve the performance, particularly with the benchmark loop unrolling business.
Here’s my doubled-up version:
uint64_t
spcg64(uint64_t s[2])
{
    uint64_t m  = 0x9b60933458e17d7d;
    uint64_t a0 = 0xd737232eeccdf7ed;
    uint64_t a1 = 0x8b260b70b8e98891;
    uint64_t p0 = s[0];
    uint64_t p1 = s[1];
    s[0] = p0 * m + a0;
    s[1] = p1 * m + a1;
    int r0 = 29 - (p0 >> 61);
    int r1 = 29 - (p1 >> 61);
    uint64_t high = p0 >> r0;
    uint32_t low  = p1 >> r1;
    return (high << 32) | low;
}
The “full” PCG has some extra shifts that makes it 25% (GCC) to 50% (Clang) slower than the “simplified” PCG, but it does halve the WEAK results.
In this 64-bit form, both are significantly slower than xoroshiro128+. However, if you find yourself only needing 32 bits at a time (always throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is faster than using xoroshiro128+ and throwing away half its output.
RC4 is another CSPRNG where I was curious how it would stack up. It only uses 8-bit operations, and it generates a 64-bit integer one byte at a time. It’s the slowest after 16-round Blowfish and generally not useful as a simulation PRNG.
xoroshiro128+ is the obvious winner in this benchmark and it seems to be the best 64-bit simulation PRNG available. If you need a fast, quality PRNG, just drop these 11 lines into your C or C++ program:
uint64_t
xoroshiro128plus(uint64_t s[2])
{
    uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    uint64_t result = s0 + s1;
    s1 ^= s0;
    s[0] = ((s0 << 55) | (s0 >> 9)) ^ s1 ^ (s1 << 14);
    s[1] = (s1 << 36) | (s1 >> 28);
    return result;
}
There’s one important caveat: that 16-byte state must be well-seeded. Having lots of zero bytes will lead to terrible initial output until the generator mixes it all up. Having all zero bytes will completely break the generator. If you’re going to seed from, say, the unix epoch, then XOR it with 16 static random bytes.
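For example, here’s a minimal sketch of that seeding advice. The constants are arbitrary bytes of my own invention; any fixed random bytes will do:
#include <stdint.h>
#include <time.h>

void xoroshiro_seed(uint64_t s[2])
{
    uint64_t seed = (uint64_t)time(0);
    s[0] = seed ^ 0x4a0a1d3c8b7f52e9;  /* 16 static random bytes, */
    s[1] = seed ^ 0xd16e2f80935cc4b7;  /* split across two words  */
}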
These generators are closely related and, like I said, xorshift64* was what I used for years. Looks like it’s time to retire it.
uint64_t
xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}
However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will tolerate weak seeding so long as it’s not literally zero. Zero will also break this generator.
If it weren’t for xoroshiro128+, then xorshift128+ would have been the winner of the benchmark and my new favorite choice.
uint64_t
xorshift128plus(uint64_t s[2])
{
    uint64_t x = s[0];
    uint64_t y = s[1];
    s[0] = y;
    x ^= x << 23;
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
    return s[1] + y;
}
It’s a lot like xoroshiro128+, including the need to be well-seeded, but it’s just slow enough to lose out. There’s no reason to use xorshift128+ instead of xoroshiro128+.
My own takeaway (until I re-evaluate some years in the future):
Things can change significantly between platforms, though. Here’s the shootout on a ARM Cortex-A53:
Speed (MB/s)
PRNG gcc-5.4.0 clang-3.8.0
------------------------------------
baseline 2560 2400
blowfishcbc16 36.5 45.4
blowfishcbc4 135 173
blowfishctr16 36.4 45.2
blowfishctr4 133 168
mt64 207 254
pcg64 980 712
rc4 96.6 44.0
spcg64 1021 948
xoroshiro128+ 2560 1570
xorshift128+ 2560 1520
xorshift64* 1360 1080
LLVM is not as mature on this platform, but, with GCC, both xoroshiro128+ and xorshift128+ matched the baseline! It seems memory is the bottleneck.
So don’t necessarily take my word for it. You can run this shootout in your own environment — perhaps even tossing in more PRNGs — to find what’s appropriate for your own situation.
This article is about why lexical-binding exists at a file level when there was already lexical-let (from cl-lib), prompted by my previous article on JIT byte-code compilation. The specific context is Emacs Lisp, but these concepts apply to language design in general.
Until Emacs 24.1 (June 2012), Elisp only had dynamically scoped variables — a feature, mostly by accident, common to old lisp dialects. While dynamic scope has some selective uses, it’s widely regarded as a mistake for local variables, and virtually no other languages have adopted it.
Way back in 1993, Dave Gillespie’s deviously clever lexical-let macro was committed to the cl package, providing a rudimentary form of opt-in lexical scope. The macro walks its body replacing local variable names with guaranteed-unique gensym names: the exact same technique used in macros to create “hygienic” bindings that aren’t visible to the macro body. It essentially “fakes” lexical scope within Elisp’s dynamic scope by preventing variable name collisions.
For example, here’s one of the consequences of dynamic scope:
(defun inner ()
  (setq v :inner))

(defun outer ()
  (let ((v :outer))
    (inner)
    v))

(outer)
;; => :inner
The “local” variable v in outer is visible to its callee, inner, which can access and manipulate it. The meaning of the free variable v in inner depends entirely on the run-time call stack. It might be a global variable, or it might be a local variable for a caller, direct or indirect.
Using lexical-let deconflicts these names, giving the effect of lexical scope:
(defvar v)

(defun lexical-outer ()
  (lexical-let ((v :outer))
    (inner)
    v))

(lexical-outer)
;; => :outer
But there’s more to lexical scope than this. Closures only make sense in the context of lexical scope, and the most useful feature of lexical-let is that lambda expressions evaluate to closures. The macro implements this using a technique called closure conversion. Additional parameters are added to the original lambda function, one for each lexical variable (and not just each closed-over variable), and the whole thing is wrapped in another lambda function that invokes the original lambda function with the additional parameters filled with the closed-over variables — yes, the variables (e.g. symbols) themselves, not just their values (e.g. pass-by-reference). The last point means different closures can properly close over the same variables, and they can bind new values.
To roughly illustrate how this works, the first lambda expression below, which closes over the lexical variables x and y, would be converted into the latter by lexical-let. The #: is Elisp’s syntax for uninterned symbols. So #:x is a symbol x, but not the symbol x (see print-gensym).
;; Before conversion:
(lambda ()
  (+ x y))

;; After conversion:
(lambda (&rest args)
  (apply (lambda (x y)
           (+ (symbol-value x)
              (symbol-value y)))
         '#:x '#:y args))
I’ve said on multiple occasions that lexical-binding: t has significant advantages, both in performance and static analysis, and so it should be used for all future Elisp code. The only reason it’s not the default is because it breaks some old (badly written) code. However, lexical-let doesn’t realize any of these advantages! In fact, it has worse performance than straightforward dynamic scope with let:
New symbol objects are allocated and initialized (make-symbol) on each run-time evaluation, one per lexical variable.
Since it’s just faking it, lexical-let still uses dynamic bindings, which are more expensive than lexical bindings. It varies depending on the C compiler that built Emacs, but dynamic variable accesses (opcode varref) take around 30% longer than lexical variable accesses (opcode stack-ref). Assignment is far worse, where dynamic variable assignment (varset) takes 650% longer than lexical variable assignment (stack-set). How I measured all this is a topic for another article.
The “lexical” variables are accessed using symbol-value, a full function call, so they’re even slower than normal dynamic variables.
Because converted lambda expressions are constructed dynamically at run-time within the body of lexical-let, the resulting closure is only partially byte-compiled even if the code as a whole has been byte-compiled. In contrast, lexical-binding: t closures are fully compiled. How this works is worth its own article.
Converted lambda expressions include the additional internal function invocation, making them slower.
While lexical-let is clever, and occasionally useful prior to Emacs 24, it may come at a hefty performance cost if evaluated frequently. There’s no reason to use it anymore.
Another reason to be wary of dynamic scope is that it puts needless constraints on the compiler, preventing a number of important optimization opportunities. For example, consider the following function, bar:
(defun bar ()
  (let ((x 1)
        (y 2))
    (foo)
    (+ x y)))
Byte-compile this function under dynamic scope (lexical-binding: nil) and disassemble it to see what it looks like.
(byte-compile #'bar)
(disassemble #'bar)
That pops up a buffer with the disassembly listing:
0 constant 1
1 constant 2
2 varbind y
3 varbind x
4 constant foo
5 call 0
6 discard
7 varref x
8 varref y
9 plus
10 unbind 2
11 return
It’s 12 instructions, 5 of which deal with dynamic bindings. The byte-compiler doesn’t always produce optimal byte-code, but this just so happens to be nearly optimal byte-code. The discard (a very fast instruction) isn’t necessary, but otherwise no more compiler smarts can improve on this. Since the variables x and y are visible to foo, they must be bound before the call and loaded after the call. While generally this function will return 3, the compiler cannot assume so since it ultimately depends on the behavior of foo. Its hands are tied.
Compare this to the lexical scope version (lexical-binding: t):
0 constant 1
1 constant 2
2 constant foo
3 call 0
4 discard
5 stack-ref 1
6 stack-ref 1
7 plus
8 return
It’s only 8 instructions, none of which are expensive dynamic variable instructions. And this isn’t even close to the optimal byte-code. In fact, as of Emacs 25.1 the byte-compiler often doesn’t produce the optimal byte-code for lexical scope code and still needs some work. Despite not firing on all cylinders, lexical scope still manages to beat dynamic scope in performance benchmarks.
Here’s the optimal byte-code, should the byte-compiler become smarter someday:
0 constant foo
1 call 0
2 constant 3
3 return
It’s down to 4 instructions due to computing the math operation at compile time. Emacs’ byte-compiler only has rudimentary constant folding, so it doesn’t notice that x and y are constants, and misses this optimization. I speculate this is due to its roots compiling under dynamic scope. Since x and y are no longer exposed to foo, the compiler has the opportunity to optimize them out of existence. I haven’t measured it, but I would expect this to be significantly faster than the dynamic scope version of this function.
You might be thinking, “What if I really do want x and y to be dynamically bound for foo?” This is often useful. Many of Emacs’ own functions are designed to have certain variables dynamically bound around them. For example, the print family of functions use the global variable standard-output to determine where to send output by default.
(let ((standard-output (current-buffer)))
(princ "value = ")
(prin1 value))
Have no fear: With lexical-binding: t
you can have your cake and
eat it too. Variables declared with defvar
, defconst
, or
defvaralias
are marked as “special” with an internal bit flag
(declared_special
in C). When the compiler detects one of these
variables (special-variable-p
), it uses a classical dynamic binding.
Declaring both x
and y
as special restores the original semantics,
reverting bar
back to its old byte-code definition (next time it’s
compiled, that is). But it would be poor form to mark x
or y
as
special: You’d de-optimize all code (compiled after the declaration)
anywhere in Emacs that uses these names. As a package author, only do
this with the namespace-prefixed variables that belong to you.
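For illustration only (again, poor form for names this short), restoring the dynamic-scope semantics of bar looks like this:

;; Marks x and y as special: every binding of these names becomes a
;; dynamic binding in code compiled afterward.
(defvar x)
(defvar y)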
The only way to unmark a special variable is with the undocumented
function internal-make-var-non-special
. I expected makunbound
to
do this, but as of Emacs 25.1 it does not. This could possibly be
considered a bug.
I’ve said there are absolutely no advantages to lexical-binding: nil
.
It’s only the default for the sake of backwards-compatibility. However,
there is one case where lexical-binding: t
introduces a subtle issue
that would otherwise not exist. Take this code for example (and
never mind prin1-to-string
for a moment):
;; -*- lexical-binding: t; -*-
(defun function-as-string ()
(with-temp-buffer
(prin1 (lambda () :example) (current-buffer))
(buffer-string)))
This creates and serializes a closure, which is one of Elisp’s unique
features. It doesn’t close over any variables, so it should be
pretty simple. However, this function will only work correctly under
lexical-binding: t
when byte-compiled.
(function-as-string)
;; => "(closure ((temp-buffer . #<buffer *temp*>) t) nil :example)"
The interpreter doesn’t analyze the closure, so it just closes over
everything. This includes the hidden variable temp-buffer
created by
the with-temp-buffer
macro, resulting in an abstraction leak.
Buffers aren’t readable, so this will signal an error if an attempt is
made to read this function back into an s-expression. The
byte-compiler fixes this by noticing temp-buffer
isn’t actually
closed over and so doesn’t include it in the closure, making it work
correctly.
Under lexical-binding: nil
it works correctly either way:
(function-as-string)
;; -> "(lambda nil :example)"
This may seem contrived — it’s certainly unlikely — but it has come
up in practice. Still, it’s no reason to avoid lexical-binding: t
.
As I’ve said again and again, always use lexical-binding: t
. Use
dynamic variables judiciously. And lexical-let
is no replacement. It
has virtually none of the benefits, performs worse, and it only
applies to let
, not any of the other places bindings are created:
function parameters, dotimes
, dolist
, and condition-case
.
This wasn’t my first time writing a trie. The curse of programming in C is rewriting the same data structures and algorithms over and over. It’s the problem C++ templates are intended to solve. This rewriting isn’t always bad since each implementation is typically customized for its specific use, often resulting in greater performance and a smaller resource footprint.
Every time I’ve rewritten a trie, my implementation is a little bit better than the last. This time around I discovered an approach for traversing, both depth-first and breadth-first, an arbitrarily-sized trie without memory allocation. I’m definitely not the first to discover something like this. There’s Deutsch-Schorr-Waite pointer reversal for binary graphs (1965) — which I originally learned from reading the Scheme 9 from Outer Space garbage collector source — and Morris in-order traversal (1979) for binary trees. The former requires two extra tag bits per node and the latter requires no modifications at all.
But before I go further, some background. A trie can come in many shapes and sizes, but in the simple case each node of a trie has as many pointers as its alphabet. For illustration purposes, imagine a trie for strings of only four characters: A, B, C, and D. Each node is essentially four pointers.
#define TRIE_ALPHABET_SIZE 4
#define TRIE_STATIC_INIT {.flags = 0}
#define TRIE_TERMINAL_FLAG (1U << 0)
struct trie {
struct trie *next[TRIE_ALPHABET_SIZE];
unsigned flags;
};
It includes a flags
field, where a single bit tracks whether or not
a node is terminal — that is, a key terminates at this node. Terminal
nodes are not necessarily leaf nodes, which is the case when one key
is a prefix of another key. I could instead have used a 1-bit
bit-field (e.g. int is_terminal : 1;
) but I don’t like bit-fields.
A trie with the following keys, inserted in any order:
AAAAA
ABCD
CAA
CAD
CDBD
Looks like this (terminal nodes illustrated as small black squares):
The root of the trie is the empty string, and each child represents a trie prefixed with one of the symbols from the alphabet. This is a nice recursive definition, and it’s tempting to write recursive functions to process it. For example, here’s a recursive insertion function.
int
trie_insert_recursive(struct trie *t, const char *s)
{
if (!*s) {
t->flags |= TRIE_TERMINAL_FLAG;
return 1;
}
int i = *s - 'A';
if (!t->next[i]) {
t->next[i] = malloc(sizeof(*t->next[i]));
if (!t->next[i])
return 0;
*t->next[i] = (struct trie)TRIE_STATIC_INIT;
}
return trie_insert_recursive(t->next[i], s + 1);
}
If the string is empty (!*s
), mark the current node as terminal.
Otherwise recursively insert the substring under the appropriate
child. That’s a tail call, and any optimizing compiler would optimize
this call into a jump back to the beginning of the function
(tail-call optimization), reusing the stack frame as if it were a
simple loop.
If that’s not good enough, such as when optimization is disabled for debugging and the recursive definition is blowing the stack, this is trivial to convert to a safe, iterative function. I prefer this version anyway.
int
trie_insert(struct trie *t, const char *s)
{
for (; *s; s++) {
int i = *s - 'A';
if (!t->next[i]) {
t->next[i] = malloc(sizeof(*t->next[i]));
if (!t->next[i])
return 0;
*t->next[i] = (struct trie)TRIE_STATIC_INIT;
}
t = t->next[i];
}
t->flags |= TRIE_TERMINAL_FLAG;
return 1;
}
Finding a particular prefix in the trie iteratively is also easy. This would be used to narrow the trie to a chosen prefix before iterating over the keys (e.g. find all strings matching a prefix).
struct trie *
trie_find(struct trie *t, const char *s)
{
for (; *s; s++) {
int i = *s - 'A';
if (!t->next[i])
return NULL;
t = t->next[i];
}
return t;
}
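For example, a hypothetical usage sketch against the keys from above, where root is the trie's root node:

/* Narrow to the prefix "CA"; the subtrie then holds "A" and "D". */
struct trie *sub = trie_find(&root, "CA");
if (sub) {
    /* ... iterate over the keys of sub ... */
}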
Depth-first traversal is stack-oriented. The stack represents the path through the graph, and each new vertex is pushed into this stack as it’s visited. A recursive traversal function can implicitly use the call stack for storing this information, so no additional data structure is needed.
The downside is that the call is no longer tail-recursive, so a large trie will blow the stack. Also, the caller needs to provide a callback function because the stack cannot unwind to return a value: The stack has important state on it. Here’s a typedef for the callback.
typedef void (*trie_visitor)(const char *key, void *arg);
And here’s the recursive depth-first traversal function. The top-level
caller passes the same buffer for buf
and bufend
, which must be at
least as large as the largest key. The visited key will be written to
this buffer and passed to the visitor.
void
trie_dfs_recursive(struct trie *t,
char *buf,
char *bufend,
trie_visitor v,
void *arg)
{
if (t->flags & TRIE_TERMINAL_FLAG) {
*bufend = 0;
v(buf, arg);
}
for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
if (t->next[i]) {
*bufend = 'A' + i;
trie_dfs_recursive(t->next[i], buf, bufend + 1, v, arg);
}
}
}
Moving the traversal stack to the heap would eliminate the stack overflow problem and it would allow control to return to the caller. This is going to be a lot of code for an article, but bear with me.
First define an iterator object. The stack will need two pieces of
information: which node did we come from (p
) and through which
pointer (i
). When a node has been exhausted, this will allow return
to the parent. The root
field tracks when traversal is complete.
struct trie_iter {
struct trie *root;
char *buf;
char *bufend;
struct {
struct trie *p;
int i;
} *stack;
};
A special value of -1 in i
means it’s the first visit for this node
and it should be visited by the callback if it’s terminal.
The iterator is initialized with trie_iter_init
. The max
indicates
the maximum length of any key. A more elaborate implementation could
automatically grow the stack to accommodate (e.g. realloc()), but I’m
keeping it as simple as possible.
int
trie_iter_init(struct trie_iter *it, struct trie *t, size_t max)
{
it->root = t;
it->stack = malloc(sizeof(*it->stack) * max);
if (!it->stack)
return 0;
it->buf = it->bufend = malloc(max);
if (!it->buf) {
free(it->stack);
return 0;
}
it->stack->p = t;
it->stack->i = -1;
return 1;
}
void
trie_iter_destroy(struct trie_iter *it)
{
free(it->stack);
it->stack = NULL;
free(it->buf);
it->buf = NULL;
}
And finally the complicated part. This uses the allocated stack to explore the trie in a loop until it hits a terminal, at which point it returns. A further call continues the traversal from where it left off. It’s like a hand-coded generator. With the way it’s written, the caller is obligated to follow through with the entire iteration before destroying the iterator, but this would be easy to correct.
int
trie_iter_next(struct trie_iter *it)
{
for (;;) {
struct trie *current = it->stack->p;
int i = it->stack->i++;
if (i == -1) {
/* Return result if terminal node. */
if (current->flags & TRIE_TERMINAL_FLAG) {
*it->bufend = 0;
return 1;
}
continue;
}
if (i == TRIE_ALPHABET_SIZE) {
/* End of current node. */
if (current == it->root)
return 0; // back at root, done
it->stack--;
it->bufend--;
continue;
}
if (current->next[i]) {
/* Push on next child node. */
*it->bufend = 'A' + i;
it->stack++;
it->bufend++;
it->stack->p = current->next[i];
it->stack->i = -1;
}
}
}
This is much nicer for the caller since there’s no inversion of control.
struct trie_iter it;
trie_iter_init(&it, &trie_root, KEY_MAX);
while (trie_iter_next(&it)) {
// ... do something with it.buf ...
}
trie_iter_destroy(&it);
There are a few downsides to this:
Initialization could fail (not checked in the example) since it allocates memory.
Either the caller has to keep track of the maximum key length, or the iterator grows the stack automatically, which would mean iteration could fail at any point in the middle.
In order to destroy the trie, it needs to be traversed: Freeing memory first requires allocating memory. If the program is out of memory, it cannot destroy the trie to clean up before handling the situation, nor to make more memory available. It’s not good for resilience.
Wouldn’t it be nice to traverse the trie without memory allocation?
Rather than allocate a separate stack, the stack can be allocated
across the individual nodes of the trie. Remember those p
and i
fields from before? Put them on the trie.
struct trie_v2 {
struct trie_v2 *next[TRIE_ALPHABET_SIZE];
struct trie_v2 *p;
int i;
unsigned flags;
};
This automatically scales with the size of the trie, so there will always be enough of this stack. With the stack “pre-allocated” like this, traversal requires no additional memory allocation.
The iterator itself becomes a little simpler. It cannot fail and it doesn’t need a destructor.
struct trie_v2_iter {
struct trie_v2 *current;
char *buf;
};
void
trie_v2_iter_init(struct trie_v2_iter *it, struct trie_v2 *t, char *buf)
{
t->p = NULL;
t->i = -1;
it->current = t;
it->buf = buf;
}
The iteration function itself is almost identical to before. Rather
than increment a stack pointer, it uses p
to chain the nodes as a
linked list.
int
trie_v2_iter_next(struct trie_v2_iter *it)
{
for (;;) {
struct trie_v2 *current = it->current;
int i = it->current->i++;
if (i == -1) {
/* Return result if terminal node. */
if (current->flags & TRIE_TERMINAL_FLAG) {
*it->buf = 0;
return 1;
}
continue;
}
if (i == TRIE_ALPHABET_SIZE) {
/* End of current node. */
if (!current->p)
return 0;
it->current = current->p;
it->buf--;
continue;
}
if (current->next[i]) {
/* Push on next child node. */
*it->buf = 'A' + i;
it->buf++;
it->current = current->next[i];
it->current->p = current;
it->current->i = -1;
}
}
}
During traversal the iteration pointers look something like this:
This is not without its downsides:
Traversal is not re-entrant nor thread-safe. It’s not possible to run multiple in-place iterators side by side on the same trie since they’ll clobber each other.
It uses more memory — O(n) rather than O(max-key-length) — and sits on this extra memory for its entire lifetime.
The same technique can be used for breadth-first search, which is
queue-oriented rather than stack-oriented. The p
pointers are
instead chained into a queue, with a head
and tail
pointer
variable for each end. As each node is visited, its children are
pushed into the queue linked list.
This isn’t good for visiting keys by name. buf
was itself a stack
and played nicely with depth-first traversal, but there’s no easy way
to build up a key in a buffer breadth-first. So instead here’s a
function to destroy a trie breadth-first.
void
trie_v2_destroy(struct trie_v2 *t)
{
struct trie_v2 *head = t;
struct trie_v2 *tail = t;
while (head) {
for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
struct trie_v2 *next = head->next[i];
if (next) {
next->p = NULL;
tail->p = next;
tail = next;
}
}
struct trie_v2 *dead = head;
head = head->p;
free(dead);
}
}
During its traversal the p
pointers link up like so:
In my real code there’s also a flag to indicate the node’s allocation type: static or heap. This allows a trie to be composed of nodes from both kinds of allocations while still safe to destroy. It might also be useful to pack a reference counter into this space so that a node could be shared by more than one trie.
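A sketch of what such a flag might look like (TRIE_HEAP_FLAG is my name here, not from the real code):

#define TRIE_HEAP_FLAG (1U << 1)  /* hypothetical: node was malloc()'d */

/* In trie_v2_destroy(), free only the nodes that came from the heap: */
if (dead->flags & TRIE_HEAP_FLAG)
    free(dead);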
For a production implementation it may be worth packing i
into the
flags
field since it only needs a few bits, even with larger
alphabets. Also, I bet, as in Deutsch-Schorr-Waite, the p
field
could be eliminated and instead one of the child pointers is
temporarily reversed. With these changes, this technique would fit
into the original struct trie
without changes, eliminating the extra
memory usage.
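Here’s a sketch of the i packing (the bit layout is my own choosing). Since i ranges from -1 to TRIE_ALPHABET_SIZE, a +1 bias fits it into three bits alongside the other flag bits:

#define TRIE_I_SHIFT 2
#define TRIE_I_MASK  (7U << TRIE_I_SHIFT)

static int
trie_get_i(const struct trie *t)
{
    return (int)((t->flags & TRIE_I_MASK) >> TRIE_I_SHIFT) - 1;
}

static void
trie_set_i(struct trie *t, int i)
{
    t->flags = (t->flags & ~TRIE_I_MASK) |
               ((unsigned)(i + 1) << TRIE_I_SHIFT);
}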
Update: Over on Hacker News, psi-squared has interesting suggestions such as leaving the traversal pointers intact, particularly in the case of a breadth-first search, which, until the next trie modification, allows for concurrent follow-up traversals.
If you want to see it in action for yourself before reading further, here's a Makefile that implements Conway's Game of Life (40x40) using only macro assignments.
Run it with any make program in an ANSI terminal. It must literally
be named life.mak
. Beware: if you run it longer than a few minutes,
your computer may begin thrashing.
make -f life.mak
It’s 100% POSIX-compatible except for the sleep 0.1
(fractional
sleep), which is only needed for visual effect.
Unlike virtually every real world implementation, POSIX make doesn’t support conditional parts. For example, you might want your Makefile’s behavior to change depending on the value of certain variables. In GNU Make it looks like this:
ifdef USE_FOO
EXTRA_FLAGS = -ffoo -lfoo
else
EXTRA_FLAGS = -Wbar
endif
Or BSD-style:
.ifdef USE_FOO
EXTRA_FLAGS = -ffoo -lfoo
.else
EXTRA_FLAGS = -Wbar
.endif
If the goal is to write a strictly POSIX Makefile, how could I work around the lack of conditional parts and maintain a similar interface? The selection of macro/variable to evaluate can be dynamically selected, allowing for some useful tricks. First define the option’s default:
USE_FOO = 0
Then define both sets of flags:
EXTRA_FLAGS_0 = -Wbar
EXTRA_FLAGS_1 = -ffoo -lfoo
Now dynamically select one of these macros for assignment to
EXTRA_FLAGS
.
EXTRA_FLAGS = $(EXTRA_FLAGS_$(USE_FOO))
The assignment on the command line overrides the assignment in the
Makefile, so the user gets to override USE_FOO
.
$ make # EXTRA_FLAGS = -Wbar
$ make USE_FOO=0 # EXTRA_FLAGS = -Wbar
$ make USE_FOO=1 # EXTRA_FLAGS = -ffoo -lfoo
Before reading the POSIX specification, I didn't realize that the left side of an assignment can get the same treatment. If I really want the "if defined" behavior back, I can use a macro to mangle the left-hand side. For example,
EXTRA_FLAGS = -O0 -g3
EXTRA_FLAGS$(DEBUG) = -O3 -DNDEBUG
Caveat: If DEBUG
is set to empty, it may still result in true for
ifdef
depending on which make flavor you’re using, but will always
appear to be unset in this hack.
$ make # EXTRA_FLAGS = -O3 -DNDEBUG
$ make DEBUG=yes # EXTRA_FLAGS = -O0 -g3
This last case had me thinking: This is very similar to the (ab)use of
the x86 mov
instruction in mov is Turing-complete. These
macro assignments alone should be enough to compute any algorithm.
Macro names are just keys to a global associative array. This can be used to build lookup tables. Here’s a Makefile to “compute” the square root of integers between 0 and 10.
sqrt_0 = 0.000000
sqrt_1 = 1.000000
sqrt_2 = 1.414214
sqrt_3 = 1.732051
sqrt_4 = 2.000000
sqrt_5 = 2.236068
sqrt_6 = 2.449490
sqrt_7 = 2.645751
sqrt_8 = 2.828427
sqrt_9 = 3.000000
sqrt_10 = 3.162278
result := $(sqrt_$(n))
The BSD flavors of make have a -V
option for printing variables,
which is an easy way to retrieve output. I used an “immediate”
assignment (:=
) for result
since some versions of make won’t
evaluate the expression before -V
printing.
$ make -f sqrt.mak -V result n=8
2.828427
Without -V
, a default target could be used instead:
output :
@printf "$(result)\n"
There are no math operators, so performing arithmetic requires some
creativity. For example, integers could be represented as a
series of x characters. The number 4 is xxxx
, the number 6 is
xxxxxx
, etc. Addition is concatenation (note: macros can have +
in
their names):
A = xxx
B = xxxx
A+B = $(A)$(B)
However, since there’s no way to “slice” a value, subtraction isn’t possible. A more realistic approach to arithmetic would require lookup tables.
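For example, a sketch of table-driven addition, with only a few entries shown (a real table would be generated to cover the full range):

A = 2
B = 3
add_2_2 = 4
add_2_3 = 5
add_3_2 = 5
add_3_3 = 6
sum := $(add_$(A)_$(B))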
Branching could be achieved through more lookup tables. For example,
square_0 = 0
square_1 = 1
square_2 = 4
# ...
result := $($(op)_$(n))
And called as:
$ make n=5 op=sqrt # 2.236068
$ make n=5 op=square # 25
Or using the DEBUG
trick above, use the condition to mask out the
results of the unwanted branch. This is similar to the mov
paper.
result := $(op)($(n)) = $($(op)_$(n))
result$(verbose) := $($(op)_$(n))
And its usage:
$ make n=5 op=square # 25
$ make n=5 op=square verbose=1 # square(5) = 25
Looping is a tricky problem. However, one of the most common build
(anti?)patterns is the recursive Makefile. Borrowing from the
mov
paper, which used an unconditional jump to restart the program
from the beginning, I can achieve Turing-completeness by invoking
the Makefile recursively, restarting the program with a new set of
inputs.
Remember the output target above? I can loop by invoking make again with new inputs in this target,
output :
@printf "$(result)\n"
@$(MAKE) $(args)
Before going any further, now that loops have been added, the natural
next question is halting. In reality, the operating system will take
care of that after some millions of make processes have carelessly
been invoked by this horribly inefficient scheme. However, we can do
better. The program can clobber the MAKE
variable when it’s ready to
halt. Let’s formalize it.
loop = $(MAKE) $(args)
output :
@printf "$(result)\n"
@$(loop)
To halt, the program just needs to clear loop
.
Suppose we want to count down to 0. There will be an initial count:
count = 6
A decrement table:
6 = 5
5 = 4
4 = 3
3 = 2
2 = 1
1 = 0
0 = loop
The last line will be used to halt by clearing the name on the right side. This is three star territory.
$($($(count))) =
The result for the current iteration is computed from the lookup table.
result = $($(count))
The next loop value is passed via args
. If loop
was cleared above,
this result will be discarded.
args = count=$(result)
With all that in place, invoking the Makefile will print a countdown from 5 to 0 and quit. This is the general structure for the Game of Life macro program.
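Assembled into one file, the countdown might look like this sketch. It assumes the file is the default Makefile so that plain $(MAKE) re-reads it, and loop is assigned before the clearing line so that the clear can override it:

count = 6
loop = $(MAKE) $(args)

6 = 5
5 = 4
4 = 3
3 = 2
2 = 1
1 = 0
0 = loop

$($($(count))) =

result = $($(count))
args = count=$(result)

output :
	@printf "$(result)\n"
	@$(loop)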
A universal Turing machine has been implemented in Conway’s Game of Life. With all that heavy lifting done, one of the easiest methods today to prove a language’s Turing-completeness is to implement Conway’s Game of Life. Ignoring the criminal inefficiency of it, the Game of Life Turing machine could be run on the Game of Life simulation running on make’s macro assignments.
In the Game of Life program — the one linked at the top of this
article — each cell is stored in a macro named xxyy, after its
position. The top-left most cell is named 0000, then going left to
right, 0100, 0200, etc. Providing input is a matter of assigning each
of these macros. I chose X
for alive and -
for dead, but, as
you’ll see, any two characters permitted in macro names would work as
well.
$ make 0000=X 0100=- 0200=- 0300=X ...
The next part should be no surprise: The rules of the Game of Life are encoded as a 512-entry lookup table. The key is formed by concatenating the cell’s value along with all its neighbors, with itself in the center.
The “beginning” of the table looks like this:
--------- = -
X-------- = -
-X------- = -
XX------- = -
--X------ = -
X-X------ = -
-XX------ = -
XXX------ = X
---X----- = -
X--X----- = -
-X-X----- = -
XX-X----- = X
# ...
Note: The two right-hand X
values here are the cell coming to life
(exactly three living neighbors). Computing the next value (n0101)
for 0101 is done like so:
n0101 = $($(0000)$(0100)$(0200)$(0001)$(0101)$(0201)$(0002)$(0102)$(0202))
Given these results, constructing the input to the next loop is simple:
args = 0000=$(n0000) 0100=$(n0100) 0200=$(n0200) ...
The display output, to be given to printf
, is built similarly:
output = $(n0000)$(n0100)$(n0200)$(n0300)...
In the real version, this is decorated with an ANSI escape code that
clears the terminal. The printf
interprets the escape byte (\033
)
so that it doesn’t need to appear literally in the source.
And that’s all there is to it: Conway’s Game of Life running in a Makefile. Life, uh, finds a way.
#include <iostream>
template <typename T>
struct Caller {
const T callee_;
Caller(const T callee) : callee_(callee) {}
void go() { callee_.call(); }
};
Caller can be parameterized to any type so long as it has a call()
method. For example, introduce two types, Foo and Bar.
struct Foo {
void call() const { std::cout << "Foo"; }
};
struct Bar {
void call() const { std::cout << "Bar"; }
};
int main() {
Caller<Foo> foo{Foo()};
Caller<Bar> bar{Bar()};
foo.go();
bar.go();
std::cout << std::endl;
return 0;
}
This code compiles cleanly and, when run, emits “FooBar”. This is an example of duck typing — i.e., “If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.” Foo and Bar are unrelated types. They have no common inheritance, but by providing the expected interface, they both work with Caller. This is a special case of polymorphism.
Duck typing is normally only found in dynamically typed languages. Thanks to templates, a statically, strongly typed language like C++ can have duck typing without sacrificing any type safety.
Let’s try the same thing in Java using generics.
class Caller<T> {
final T callee;
Caller(T callee) {
this.callee = callee;
}
public void go() {
callee.call(); // compiler error: cannot find symbol call
}
}
class Foo {
public void call() { System.out.print("Foo"); }
}
class Bar {
public void call() { System.out.print("Bar"); }
}
public class Main {
public static void main(String args[]) {
Caller<Foo> f = new Caller<>(new Foo());
Caller<Bar> b = new Caller<>(new Bar());
f.go();
b.go();
System.out.println();
}
}
The program is practically identical, but this will fail with a
compile-time error. This is the result of type erasure. Unlike C++’s
templates, there will only ever be one compiled version of Caller, and
T will become Object. Since Object has no call()
method, compilation
fails. The generic type is only for enabling additional compiler
checks later on.
C++ templates behave like macros, expanded by the compiler once for
each different type of applied parameter. The call
symbol is looked
up later, after the type has been fully realized, not when the
template is defined.
To fix this, Foo and Bar need a common ancestry. Let’s make this
Callee
.
interface Callee {
void call();
}
Caller needs to be redefined such that T is a subclass of Callee.
class Caller<T extends Callee> {
// ...
}
This now compiles cleanly because call()
will be found in Callee
.
Finally, implement Callee.
class Foo implements Callee {
// ...
}
class Bar implements Callee {
// ...
}
This is no longer duck typing, just plain old polymorphism. Type erasure prohibits duck typing in Java (outside of dirty reflection hacks).
Duck typing is useful for implementing the observer pattern without as much boilerplate. A class can participate in the observer pattern without inheriting from some specialized class or interface. For example, see the various signal and slots systems for C++. In contrast, Java has an EventListener type for everything:
A class concerned with many different kinds of events, such as an event logger, would need to inherit a large number of interfaces.
In one of the early missions the player travels through hyperspace (which ain't like dusting crops) to a storage area located in deep space. It's a family business and the player is out there to take inventory of storage containers. Like when I saw the wormhole minefield in Deep Space 9, it got me thinking, "Why?" Why keep all these storage containers in deep space? There's no defense or security out there to stop someone from stealing containers. It seems like it would be better to store those at the home base where they can be protected.
Storing items at random locations in deep space is actually very secure — more so than any lock! Space is huge. Even with faster-than-light travel, searching a galaxy for a storage location would be impractical. It would be as impractical as using brute-force to find an encryption key — another huge search space. Also, if the storage location has been in use for X years, you'd need to come within X light-years of it, at least, in order to find it, since even gravity itself is limited by the speed of light.
Physical locks are usually described as the physical analogy of cryptography. Honestly, it’s not a very good analogy. The brute-force method for bypassing a lock isn’t to keep trying different keys or combinations until it works. No, it’s to just smash something (a window, the lock) or pick the lock. When translated back into the crypto world that’s like breaking a cipher, which isn’t a practical attack in modern cryptography.
No, the physical analogy for cryptography is deep space storage. The only practical way to access deep space items is to learn the coordinates of the storage location, which is the equivalent of the encryption key. If the coordinates are lost or forgotten, the items are as good as destroyed, just like data.
There are actually some advantages of physical “encryption.” Ciphertext can be decrypted offline without being detected. It’s not possible to visit deep space storage without having a physical presence, which is certainly more detectable than offline decryption. There’s also the advantage that it’s somewhat easier to tell when the key (location) generation algorithm is busted or you’re just bad at picking passphrases: someone else’s stuff will already be there. A literal collision.
Fractran is a Turing-complete esoteric programming language. A Fractran program is just an ordered list of positive, irreducible fractions. The program's output for an input n is the output of the program run on n multiplied by the first fraction in the list that results in an integer. If no such multiplication results in an integer, the output is the input n. Variables are encoded in the exponents of the prime factorization of the input and output.
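For a concrete taste, the classic single-fraction program 3/2 computes addition: starting from n = 2^a * 3^b, each step trades a factor of 2 for a factor of 3, halting at 3^(a+b). A minimal C sketch of the evaluation loop:

#include <stdio.h>

/* Run the one-fraction Fractran program {3/2} to completion. */
long
fractran_add(long n)
{
    while (n % 2 == 0)   /* the fraction applies while 2 divides n */
        n = n / 2 * 3;   /* multiply by 3/2 exactly */
    return n;
}

int
main(void)
{
    /* 2 + 3: input 2^2 * 3^3 = 108, output 3^5 = 243 */
    printf("%ld\n", fractran_add(108));
    return 0;
}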
Some time ago I thought up an idea for a short story involving Fractran. A mathematician accidentally creates a Fractran program that can trivially factor large composites. Think something like O(log n). It's just the right magical string of, say, 31 fractions.
The story would be a first-person narrative of the mathematician's thoughts during a short time after the discovery, considering many of the consequences of the program. For example, it would render much of cryptography, which plays an essential role in the modern world, useless. He would also wonder if mankind should deserve such a discovery, considering how accidental it was.
This whole idea vanished once I realized that this Fractran program is actually completely trivial. It even runs in O(1) time. It's so trivial as to be worthless. Remember that Fractran stores its data in the number's prime factorization? The Fractran program that can factor any number in constant time is the identity function. To decode the output, which matches the input, all you need to do is factor it!
Interestingly, it doesn't seem to actually be possible to implement
the identity function in Fractran (But somehow it's
Turing-complete? Hmmm... more investigation needed.), unless you
can define your program in terms of its input. For example, the
program 1/(n+1)
is the identity function for
input n.
At Gavin's suggestion, I've been watching The Prisoner, a 1960s British television show. The main character is an ex-spy held prisoner in "the Village", an Orwellian, isolated, enclosed town. No one in the Village has a name, but is instead assigned a number. The main character's number is 6.
As far as I can tell, after number 2 the order of the numbers is not important. Number 56 is no more important than number 12. By using numbers to name things there is an implied ordering, even if the ordering is insignificant. It could be misleading to a newcomer.
Is there an unordered set that could be used to name things? More specifically, is there a set that cannot be ordered? If it is unorderable then there is no implicit ordering to cause confusion. It's easy to have an unorderable set in theory, but I think it is difficult to have one in practice.
Using letters is obviously out, as the alphabet has an order. Words and names made of letters can be sorted according to the alphabet. However, the ability to order words is almost never used outside of indexing. If words are used to name things, a newcomer is unlikely to assume relationships based on ordering. No one will assume Alan is more important than Bob.
Large numbers also tend to lack an assumed order. I don't think anyone assumes a larger or smaller social security number has meaning, or a larger or smaller phone number. However, these values are also known to be handed out in some semi-random way.
But can we do better? For at least English speakers, is it possible to create an unorderable set? If the items in the set have a vocal pronunciation, then they can probably be ordered by their phonetics. That could be avoided by using non-standard phonetic components, like clicks and pops, which won't have a standard ordering (in English, anyway).
A set has an order if there is a total, transitive, relational operator for the set. If such an operator does not exist then the set isn't linearly ordered. I want a set that can't easily have such an operator.
If a set of symbols was created, how might they be presented so as to show no ordering? The order of the symbols in the original presentation might be considered the ordering, like how the alphabet is always presented in order. A circle could be used, but this is circularly ordered. I think there is also the issue of memorization. A human will have a much better time memorizing the symbols if memorized in some order. For example, try naming all the letters of the alphabet at random, without repeats. Or US states.
Thanks to modern day technology, with dynamic content, the set could be displayed in a random order each time it is viewed. For a web page, the server could select a random order, or a JavaScript program could reorder the images at random.
There could be partially ordered sets, like hierarchies and DAGs. The ordering in The Prisoner is one of these. There is number 1, then number 2, then everyone else. Is there a partially ordered set in use that has unique names at the same level?
The penalties incurred by intentionally prohibiting order would likely outweigh the benefit of the set. If it's not orderable, we can't index it, and it's difficult to deal with. I expect it's much easier to just use numbers and tell people that the order isn't important, or just use an obviously unordered set.
This is related to a project I am working on and will post here soon. I imagine that, with a little more effort, this algorithm could turn into a short amateur paper.
Suppose you want to use a computer to simulate the roll of two
six-sided dice (notated 2d6
). The simplest approach would
be to replicate the results the same way you would roll dice:
independently and randomly generate two numbers between 1 and 6
inclusive. We can easily do this for any number of dice: just iterate
and roll each die.
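Something like this minimal C sketch, where rand_uniform is an assumed helper returning a uniform random integer in the given inclusive range:

/* Roll n six-sided dice independently, summing as we go. */
int
roll_dice(int n)
{
    if (n == 0)
        return 0;
    return rand_uniform(1, 6) + roll_dice(n - 1);
}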
However, generating a number between 1 and 6 wastes small amounts of entropy. A six-sided die only takes about 2.58 bits of entropy to generate. Since we can only use bits discretely we have to spend 3 bits, throwing out 0.42 bits. On top of that, when we pull out 3 bits and they are out of range (0 or 7) we have to throw them out and try again.
What if we wanted to roll 10 dice, or 100 dice, or 1000 dice? Do we really need to generate that many numbers individually? That's a lot of wasted entropy adding up, entropy which can be expensive to gather. Well, we could instead use the probability distribution of the roll so that only a single number needs to be generated.
For a 2d6
roll, there are 36 unique possible outcomes
(6^2). We could select a number between 0 and 35, then choose that
specific roll. The roll can be recovered with a series of integer
division and modulus operations on a number u drawn from a uniform
distribution.
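A sketch of the decoding in C (my formulation; for n dice, u is drawn uniformly from 0 to 6^n - 1):

/* Decode an n-dice roll from a single uniform draw u in [0, 6^n - 1].
 * Beyond about 24 dice, 6^n overflows 64 bits and real code needs
 * arbitrary precision, as discussed below. */
void
decode_roll(unsigned long long u, int n, int dice[])
{
    for (int i = 0; i < n; i++) {
        dice[i] = u % 6 + 1;  /* modulus peels off one die */
        u /= 6;               /* integer division discards it */
    }
}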
If we're only interested in the sum, we could save memory by making
this tail recursive — or iterative — and summing the dice as we
calculate them. Ignoring the exponent, this is O(n)
, not
better than the simple algorithm in terms of growth rate. This
algorithm is more efficient when it comes to entropy, though.
Consider 3d6, with 216 possible outcomes. Ideally the simple algorithm
takes three 3-bit rolls, consuming 9 bits, of which about 1.25 bits
(0.42 * 3) go unused. In the entropy-efficient
algorithm we need about 7.75 bits, so it only consumes 8 bits of
entropy. We saved a bit. That gap only gets larger with more dice. For
100d6
the simple algorithm uses 41 more bits than
necessary.
The efficient roll is basically defragmenting the individual rolls on the entropy stream.
In a non-ideal world, though, some cases don't work out well. In
12d6
, almost half the numbers (compared to 25% in the
case of 1d6) from the uniform distribution will be out of range and a
lot more bits would be needed. On average, rolling dice individually
(or only some of them individually) for 12d6
will
be more efficient.
The efficient algorithm is only more efficient above a point near
where mod(log2(s), 1) < mod(log2(n^s), 1), for s rolls of an n-sided
die.
And, all of this doesn't come without a cost. You must pay the piper,
and this algorithm is paid with CPU and memory. Notice that exponent
there? That has to be done to exact precision (no floating point), and
it grows very quickly. If you want to roll more than a handful of
dice, you will be crunching some large numbers. Rolling just
100d6
means you have to work with a 78 digit
integer. 10000d6
is a 7782 digit integer. These can't be
done in floating point because the resolution of floating point is too
low: some rolls would not be possible.
The exponent could be memoized to trade some of that CPU time for more memory usage. Still, pretty costly. If you don't value your entropy, the tradeoff might not be worth it.
I can't see a way around performing that calculation. We need to know that big number exactly. Perhaps a mathematician might be able to manipulate the formulas such that it's not so expensive.
If you're rolling lots of dice and you want to preserve binary
entropy, try it out. If you want to be really efficient queue up rolls
— or generate them ahead of time — so that the number of outcomes is
just below a power of two. In the case of d6
, some good
number of dice to roll are 17 (~43.94 bits), 29 (~74.96), 41
(~105.983), 94 (~242.986 bits), 200 (~516.993), 253 (~653.995 bits),
306 (~791.99853 bits), and 971 (~2509.99859 bits). (Notice these get
closer and closer to an integer number of bits.)
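A quick way to hunt for such batch sizes (a sketch; the 0.06-bit waste cutoff is an arbitrary choice):

#include <math.h>
#include <stdio.h>

/* Print d6 batch sizes whose total entropy nearly fills a whole
 * number of bits, i.e. where ceil(n*log2(6)) - n*log2(6) is tiny. */
int
main(void)
{
    for (int n = 1; n <= 1000; n++) {
        double bits = n * log2(6.0);
        if (ceil(bits) - bits < 0.06)
            printf("%4d dice: %.3f bits\n", n, bits);
    }
    return 0;
}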
This exercise partly comes from a couple different chapters in the book The Little Schemer. The book is an introduction to the Scheme programming language, a dialect of Lisp. The purpose is to teach basic programming concepts in a way that anyone can follow along just as well as someone with a degree in, say, computer science. It is still very useful for us programmer types because there are some good practices to pick up from reading and playing along.
First of all, Lisp is famous (infamous?) for lacking syntax. Any Lisp program is simply an S-expression: a list of lists. There is no operator precedence because operators are treated just like functions. This leads to prefix notation for mathematical expressions,
(+ 4 5) => 9
where the =>
indicates the result of evaluating the
expression. We can apply as many operands as we want,
(+ 2 3 4 5 10) => 24
We can put another list right in there as an operand,
(+ 3 (* 2 5) 4) => 17
You get the idea. In a function, the value of the last expression is
the return value. For example, here is the square
function in Scheme, which squares its input,
(define (square x) (* x x))
Then we can use it,
(+ (square 2) (square 5)) => 29
There are three important list operators to understand as
well: car
, cdr
,
and cons
. car
returns the first element in a
list. In the example below, the '
, a single quote, tells
the interpreter or compiler that the list is to be treated as data and
not to be executed. This is shorthand, or syntactic sugar, for
the quote
operator: (quote (stallman
moglen))
is the same as '(stallman moglen)
.
(car '(stallman moglen lessig)) => stallman
cdr
returns the "rest" of a list (everything but
the car
of the list). When passing a list with only one
element cdr
returns the empty list: ()
.
(cdr '(stallman moglen lessig)) => (moglen lessig)
(cdr '(stallman)) => ()
We can ask if a list is empty or not
with null?
. #t
and #f
are true
and false.
(null? '(stallman moglen lessig)) => #f
(null? '()) => #t
And finally, for lists, we have cons
. This function
allows us to build a list. It glues the first argument to the front of
the list in the second argument,
(cons 'stallman '(moglen lessig)) => (stallman moglen lessig)
(cons 'stallman '()) => (stallman)
And one last function you need to know: eq?
. It
determines whether two atoms are the same atom,
(eq? 'stallman 'moglen) => #f
(eq? 'stallman 'stallman) => #t
Now, for this exercise we will pretend that the basic arithmetic
functions have not been defined for us. Instead all we have
is add1
and sub1
, each of which adds or
subtracts 1 from its argument respectively.
(add1 5) => 6
(sub1 5) => 4
Oh, I almost forgot. We also have the zero?
function
defined for us, which tells us if its argument is 0 or not. Notice
that functions that return true or false, called predicates, have
a ?
on the end.
(zero? 2) => #f
(zero? 0) => #t
To make things simple, these definitions will only consider positive
numbers. We can define the +
function (for only two
arguments) in terms of the three basic functions shown above. It might
be interesting to try to write this yourself before you look any
further. (Hint: define it recursively!)
;; Adds together n and m
(define (+ n m)
  (if (zero? m)
      n
      (add1 (+ n (sub1 m)))))
If the second argument is 0 we are done and simply return the first
argument. If not, we add 1 to n + (m -
1)
. The -
function is defined similarly.
;; Subtracts m from n
(define (- n m)
  (if (zero? m)
      n
      (sub1 (- n (sub1 m)))))
Multiplication is the act of performing addition many times. We can go on defining it in terms of addition,
(define (* n m)
  (if (zero? m)
      0
      (+ n (* n (sub1 m)))))
(We'll leave division as an exercise for the reader as it gets a little more complicated than I need to go in order to get my overall point across.)
We will leave math behind for a moment and take a look at
The Roots of
Lisp. In that link is an excellent paper written by Paul Graham
about John McCarthy, the inventor (or perhaps discoverer?) of Lisp,
and how Lisp came to be. It turns out that in order to have a fully
functional Lisp engine we only need seven primitive operators:
operators defined outside of the language itself as building blocks
for the language. For Lisp these seven operators are (Scheme-ized for
our purposes): eq?
, atom?
, car
,
cdr
, cons
, quote
, and
if
.
Notice how none of these are math operators. You may wonder how we can possibly perform mathematical operations when we lack these facilities. The answer: we have to define our own representation for numbers! Let's try this, define a number as a list of empty lists. So, the number 3 is,
'(() () ())
And here is 0, 2, and 4,
'()
'(() ())
'(() () () ())
See how that works? Before, when we wanted to define addition and
subtraction, we needed three other
functions: zero?
, add1
,
and sub1
. With our number representation, how could we
define add1
with our seven primitive operators? Our
numbers are defined as lists, so we can use our list operators. To add
1 to a number, we append another empty list. Hey, that sounds a lot
like cons
!
(define (add1 n) (cons '() n))
Subtraction is removing an element from the list, which sounds a lot
like cdr
,
(define (sub1 n) (cdr n))
And to define zero?
we need to check for an empty
list. Notice this will also be the definition for null?
.
(define (zero? n) (eq? '() n))
And now we are back where we started. In fact, you can use the exact
definitions above to define +
, -
,
and *
. Our entire method of number representation depends on
how we define add1
, sub1
,
and zero?
. Let's try it out,
;; 3 + 4
(+ '(() () ()) '(() () () ())) => (() () () () () () ())
;; 5 - 2
(- '(() () () () ()) '(() ())) => (() () ())
;; 2 * 2
(* '(() ()) '(() ())) => (() () () ())
;; 3 + 4 * 2
(+ (* '(() () () ()) '(() ())) '(() () ())) => (() () () () () () () () () () ())
Pretty cool, huh? We just added arithmetic (albeit extremely simple) to our basic Lisp engine. With some modifications we should be able to define and operate on negative integers and even define any rational number (limited by how much memory your computer's hardware can provide).
Now, thank goodness this isn't how real Lisp implementations actually handle numbers. It would be incredibly slow and impractical, not to mention annoying to read. Normally, numbers and math operators are primitive so that they are fast.