Articles tagged lua at null program

State machines are wonderful tools

2020-12-31T22:48:13Z

This article was discussed on Hacker News.

I love when my current problem can be solved with a state machine. They’re fun to design and implement, and I have high confidence about correctness. They tend to:

Present minimal, tidy interfaces
Require few, fixed resources
Hold no opinions about input and output
Have a compact, concise implementation
Be easy to reason about

State machines are perhaps one of those concepts you heard about in college but never put into practice. Maybe you use them regularly. Regardless, you certainly run into them regularly, from regular expressions to traffic lights.

Morse code decoder state machine

Inspired by a puzzle, I came up with this deterministic state machine for decoding Morse code. It accepts a dot ('.'), dash ('-'), or terminator (0) one at a time, advancing through a state machine step by step:

int morse_decode(int state, int c)
{
    static const unsigned char t[] = {
        0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
        0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
        0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
        0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
        0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
        0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
        0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
        0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a
    };
    int v = t[-state];
    switch (c) {
    case 0x00: return v >> 2 ? t[(v >> 2) + 63] : 0;
    case 0x2e: return v &  2 ? state*2 - 1 : 0;
    case 0x2d: return v &  1 ? state*2 - 2 : 0;
    default:   return 0;
    }
}

It typically compiles to under 200 bytes (table included), requires only a few bytes of memory to operate, and will fit on even the smallest of microcontrollers. The full source listing, documentation, and comprehensive test suite:

https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c

The state machine is trie-shaped, and the 100-byte table t is the static encoding of the Morse code trie:

Dots traverse left, dashes right, terminals emit the character at the current node (terminal state). Stopping on red nodes, or attempting to take an unlisted edge is an error (invalid input).

Each node in the trie is a byte in the table. Dot and dash each have a bit indicating if their edge exists. The remaining bits index into a 1-based character table (at the end of t), and a 0 “index” indicates an empty (red) node. The nodes themselves are laid out as a binary heap in an array: the left and right children of the node at i are found at i*2+1 and i*2+2. No need to waste memory storing edges!
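The heap layout can be sketched in Python. This is only an illustration under assumptions: `MORSE_SUBSET`, `child`, `build_heap`, and `lookup` are hypothetical names, only seven letters are included, and a plain list of characters stands in for the packed bytes (a bounds check replaces the edge bits).

```python
# Hypothetical subset of Morse code, just enough to fill three levels.
MORSE_SUBSET = {
    "E": ".", "T": "-",
    "I": "..", "A": ".-", "N": "-.", "M": "--",
    "S": "...",
}

def child(i, symbol):
    # Heap layout: a dot goes to the left child, a dash to the right.
    return i * 2 + (1 if symbol == "." else 2)

def build_heap(codes):
    """Place each character at the heap index reached by its code."""
    placed = {}
    for ch, code in codes.items():
        i = 0
        for symbol in code:
            i = child(i, symbol)
        placed[i] = ch
    nodes = [None] * (max(placed) + 1)  # None marks an empty node
    for i, ch in placed.items():
        nodes[i] = ch
    return nodes

def lookup(nodes, code):
    """Walk from the root; return the character, or None if invalid."""
    i = 0
    for symbol in code:
        i = child(i, symbol)
        if i >= len(nodes):
            return None
    return nodes[i]

NODES = build_heap(MORSE_SUBSET)
```

No edges are stored anywhere: the position of a node in the list is enough to find its children.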

Since C sadly does not have multiple return values, I’m using the sign bit of the return value to create a kind of sum type. A negative return value is a state — which is why the state is negated internally before use. A positive result is a character output. If zero, the input was invalid. Only the initial state is non-negative (zero), which is fine since it’s, by definition, not possible to traverse to the initial state. No c input will produce a bad state.

In the original problem the terminals were missing. Despite being a state machine, morse_decode is a pure function. The caller can save their position in the trie by saving the state integer and trying different inputs from that state.
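To show how a caller might drive such a pure function, here is a Python transliteration of the C listing above with a small driver. The table bytes are copied from that listing; `T` and `decode_letter` are hypothetical names, and the driver is just one possible way to use the interface.

```python
# Table copied from the C listing: 64 trie nodes, then "0-9A-Z".
T = bytes([
    0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
    0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
    0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
    0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
    0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
    0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
    0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
    0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a,
])

def morse_decode(state, c):
    """Negative result: next state. Positive: a character. Zero: error."""
    v = T[-state]
    if c == 0:
        return T[(v >> 2) + 63] if v >> 2 else 0
    if c == ord("."):
        return state * 2 - 1 if v & 2 else 0
    if c == ord("-"):
        return state * 2 - 2 if v & 1 else 0
    return 0

def decode_letter(code):
    """Feed dots and dashes, then the terminator; None on bad input."""
    state = 0
    for symbol in code:
        state = morse_decode(state, ord(symbol))
        if state == 0:
            return None
    result = morse_decode(state, 0)
    return chr(result) if result > 0 else None

# Save a position in the trie and try two continuations from it.
prefix = morse_decode(0, ord("."))
after_dash = morse_decode(prefix, ord("-"))
after_dot = morse_decode(prefix, ord("."))
```

Because the function keeps no hidden state, `prefix` can be reused as many times as needed.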

UTF-8 decoder state machine

The classic UTF-8 decoder state machine is Bjoern Hoehrmann’s Flexible and Economical UTF-8 Decoder. It packs the entire state machine into a relatively small table using clever tricks. It’s easily my favorite UTF-8 decoder.

I wanted to try my own hand at it, so I re-derived the same canonical UTF-8 automaton:

Then I encoded this diagram directly into a much larger (2,064-byte), less elegant table, too large to display inline here:

https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c

However, the trade-off is that the executable code is smaller, faster, and branchless again (by accident, I swear!):

int utf8_decode(int state, long *cp, int byte)
{
    static const signed char table[8][256] = { /* ... */ };
    static const unsigned char masks[2][8] = { /* ... */ };
    int next = table[state][byte];
    *cp = (*cp << 6) | (byte & masks[!state][next&7]);
    return next;
}

Like Bjoern’s decoder, there’s a code point accumulator. The real state machine has 1,109,950 terminal states, and many more edges and nodes. The accumulator is an optimization to track exactly which edge was taken to which node without having to represent such a monstrosity.

Despite the huge table I’m pretty happy with it.
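The accumulator idea can be sketched in Python without any table. This is not the table-driven decoder linked above: `utf8_step` and `decode` are hypothetical names, the state is simply the number of continuation bytes still expected, and the sketch does not reject overlong three- and four-byte forms or surrogates as a complete decoder must.

```python
ERROR = -1

def utf8_step(remaining, cp, byte):
    """Consume one byte; return (continuation bytes left, accumulator)."""
    if remaining == 0:
        if byte < 0x80:
            return 0, byte
        if 0xC2 <= byte <= 0xDF:
            return 1, byte & 0x1F
        if 0xE0 <= byte <= 0xEF:
            return 2, byte & 0x0F
        if 0xF0 <= byte <= 0xF4:
            return 3, byte & 0x07
        return ERROR, 0
    if 0x80 <= byte <= 0xBF:
        # Shift in six more payload bits, as in the C listing.
        return remaining - 1, (cp << 6) | (byte & 0x3F)
    return ERROR, 0

def decode(data):
    """Decode bytes into a list of code points, one byte at a time."""
    out = []
    remaining, cp = 0, 0
    for byte in data:
        remaining, cp = utf8_step(remaining, cp, byte)
        if remaining == ERROR:
            raise ValueError("invalid UTF-8")
        if remaining == 0:
            out.append(cp)
    if remaining != 0:
        raise ValueError("truncated UTF-8")
    return out
```

The small state only tracks the shape of the sequence, while the accumulator carries the payload bits that would otherwise need over a million states.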

Word count state machine

Here’s another state machine I came up with a while back for counting words one Unicode code point at a time while accounting for Unicode’s various kinds of whitespace. If your input is bytes, then plug this into the above UTF-8 state machine to convert bytes to code points! This one uses a switch instead of a lookup table since the table would be sparse (i.e. let the compiler figure it out).

/* State machine counting words in a sequence of code points.
 *
 * The current word count is the absolute value of the state, so
 * the initial state is zero. Code points are fed into the state
 * machine one at a time, each call returning the next state.
 */
long word_count(long state, long codepoint)
{
    switch (codepoint) {
    case 0x0009: case 0x000a: case 0x000b: case 0x000c: case 0x000d:
    case 0x0020: case 0x0085: case 0x00a0: case 0x1680: case 0x2000:
    case 0x2001: case 0x2002: case 0x2003: case 0x2004: case 0x2005:
    case 0x2006: case 0x2007: case 0x2008: case 0x2009: case 0x200a:
    case 0x2028: case 0x2029: case 0x202f: case 0x205f: case 0x3000:
        return state < 0 ? -state : state;
    default:
        return state < 0 ? state : -1 - state;
    }
}

I’m particularly happy with the edge-triggered state transition mechanism. The sign of the state tracks whether the “signal” is “high” (inside of a word) or “low” (outside of a word), and so it counts rising edges.

The counter is not technically part of the state machine — though it eventually overflows for practical reasons, it isn’t really “finite” — but is rather an external count of the times the state machine transitions from low to high, which is the actual, useful output.
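To watch the edges being counted, here is a Python transliteration of the C function with a small tracing helper. The `trace` helper is a hypothetical addition for illustration; the whitespace set is copied from the switch above.

```python
WHITESPACE = frozenset([
    0x0009, 0x000a, 0x000b, 0x000c, 0x000d,
    0x0020, 0x0085, 0x00a0, 0x1680, 0x2000,
    0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
    0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
])

def word_count(state, codepoint):
    if codepoint in WHITESPACE:
        # Signal goes (or stays) low: make the state non-negative.
        return -state if state < 0 else state
    # Signal stays high, or rises and bumps the count.
    return state if state < 0 else -1 - state

def trace(text):
    """Return the state after each code point, starting from zero."""
    states = []
    state = 0
    for ch in text:
        state = word_count(state, ord(ch))
        states.append(state)
    return states
```

For the input "a b" the states are -1, 1, -2: negative while inside a word, and the absolute value is the running count.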

Reader challenge: Find a slick, efficient way to encode all those code points as a table rather than rely on whatever the compiler generates for the switch (chain of branches, jump table?).
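One possible direction, not necessarily slick: most of the code points fall into a few contiguous runs, so a sorted table of ranges plus a binary search covers them all. The sketch below is in Python, with `STARTS`, `ENDS`, and `is_space` as hypothetical names; the ranges are derived from the switch above.

```python
from bisect import bisect_right

# Inclusive ranges, sorted by starting code point.
STARTS = (0x0009, 0x0020, 0x0085, 0x00a0, 0x1680,
          0x2000, 0x2028, 0x202f, 0x205f, 0x3000)
ENDS   = (0x000d, 0x0020, 0x0085, 0x00a0, 0x1680,
          0x200a, 0x2029, 0x202f, 0x205f, 0x3000)

def is_space(codepoint):
    """True if the code point lies inside one of the ranges."""
    i = bisect_right(STARTS, codepoint) - 1
    return i >= 0 and codepoint <= ENDS[i]
```

Ten ranges need at most four comparisons to locate, at the cost of a branchy search rather than a direct index.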

Coroutines and generators as state machines

In languages that support them, state machines can be implemented using coroutines, including generators. I do particularly like the idea of compiler-synthesized coroutines as state machines, though this is a rare treat. The state is implicit in the coroutine at each yield, so the programmer doesn’t have to manage it explicitly. (Though often that explicit control is powerful!)

Unfortunately in practice it always feels clunky. The following implements the word count state machine (albeit in a rather un-Pythonic way). The generator returns the current count and is continued by sending it another code point:

WHITESPACE = {
    0x0009, 0x000a, 0x000b, 0x000c, 0x000d,
    0x0020, 0x0085, 0x00a0, 0x1680, 0x2000,
    0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
    0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
}

def wordcount():
    count = 0
    while True:
        while True:
            # low signal
            codepoint = yield count
            if codepoint not in WHITESPACE:
                count += 1
                break
        while True:
            # high signal
            codepoint = yield count
            if codepoint in WHITESPACE:
                break

However, the generator ceremony dominates the interface, so you’d probably want to wrap it in something nicer — at which point there’s really no reason to use the generator in the first place:

wc = wordcount()
next(wc)  # prime the generator
wc.send(ord('A'))  # => 1
wc.send(ord(' '))  # => 1
wc.send(ord('B'))  # => 2
wc.send(ord(' '))  # => 2

Same idea in Lua, which famously has full coroutines:

local WHITESPACE = {
    [0x0009]=true,[0x000a]=true,[0x000b]=true,[0x000c]=true,
    [0x000d]=true,[0x0020]=true,[0x0085]=true,[0x00a0]=true,
    [0x1680]=true,[0x2000]=true,[0x2001]=true,[0x2002]=true,
    [0x2003]=true,[0x2004]=true,[0x2005]=true,[0x2006]=true,
    [0x2007]=true,[0x2008]=true,[0x2009]=true,[0x200a]=true,
    [0x2028]=true,[0x2029]=true,[0x202f]=true,[0x205f]=true,
    [0x3000]=true
}

function wordcount()
    local count = 0
    while true do
        while true do
            -- low signal
            local codepoint = coroutine.yield(count)
            if not WHITESPACE[codepoint] then
                count = count + 1
                break
            end
        end
        while true do
            -- high signal
            local codepoint = coroutine.yield(count)
            if WHITESPACE[codepoint] then
                break
            end
        end
    end
end

Except for initially priming the coroutine, at least coroutine.wrap() hides the fact that it’s a coroutine.

wc = coroutine.wrap(wordcount)
wc()  -- prime the coroutine
wc(string.byte('A'))  -- => 1
wc(string.byte(' '))  -- => 1
wc(string.byte('B'))  -- => 2
wc(string.byte(' '))  -- => 2

Extra examples

Finally, a couple more examples not worth describing in detail here. First a Unicode case folding state machine:

https://github.com/skeeto/scratch/blob/master/misc/casefold.c

It’s just an interface to do a lookup into the official case folding table. It was an experiment, and I probably wouldn’t use it in a real program.

Second, I’ve mentioned my UTF-7 encoder and decoder before. It’s not obvious from the interface, but internally it’s just a state machine for both encoder and decoder, which is what allows it to “pause” between any pair of input/output bytes.

Looking for Entropy in All the Wrong Places

2019-04-30T22:50:09Z

Imagine we’re writing a C program and we need some random numbers. Maybe it’s for a game, or for a Monte Carlo simulation, or for cryptography. The standard library has a rand() function for some of these purposes.

int r = rand();

There are some problems with this. Typically the implementation is a rather poor PRNG, and we can do much better. It’s a poor choice for Monte Carlo simulations, and outright dangerous for cryptography. Furthermore, it’s usually a dynamic function call, which has a high overhead compared to how little the function actually does. In glibc, it’s also synchronized, adding even more overhead.

But, more importantly, this function returns the same sequence of values each time the program runs. If we want different numbers each time the program runs, it needs to be seeded — but seeded with what? Regardless of what PRNG we ultimately use, we need inputs unique to this particular execution.

The right places

On any modern unix-like system, the classical approach is to open /dev/urandom and read some bytes. It’s not part of POSIX but it is a de facto standard. These random bits are seeded from the physical world by the operating system, making them highly unpredictable and uncorrelated. They’re suitable for keying a CSPRNG and, from there, generating all the secure random bits you will ever need (perhaps with fast-key-erasure). Why not /dev/random? Because on Linux it’s pointlessly superstitious, which has basically ruined that path for everyone.

/* Returns zero on failure. */
int
getbits(void *buf, size_t len)
{
    int result = 0;
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        result = fread(buf, len, 1, f);
        fclose(f);
    }
    return result;
}

int
main(void)
{
    unsigned seed;
    if (getbits(&seed, sizeof(seed))) {
        srand(seed);
    } else {
        die();
    }

    /* ... */
}

Note how there are two different places getbits() could fail, with multiple potential causes.

It could fail to open the file. Perhaps the program isn’t running on a modern unix-like system. Perhaps it’s running in a chroot and /dev/urandom wasn’t created. Perhaps there are too many file descriptors already open. Perhaps there isn’t enough memory available to open a file. Perhaps the file permissions disallow it or it’s blocked by Mandatory Access Control (MAC).
It could fail to read the file. This essentially can’t happen unless the system is severely misconfigured, in which case a successful read would be suspect anyway. In this case it’s probably still a good idea to check the result.

The need for creating a file descriptor is a serious issue for libraries. Libraries that quietly create and close file descriptors can interfere with the main program, especially if it’s asynchronous. The main program might rely on file descriptors being consecutive, predictable, or monotonic (example). File descriptors are also a limited resource, so the library may exhaust a file descriptor slot needed for the main program. For a network service, a remote attacker could perhaps open enough sockets to deny a file descriptor to getbits(), blocking the program from gathering entropy.

/dev/urandom is simple, but it’s not an ideal API.

getentropy(2)

Wouldn’t it be nicer if our program could just directly ask the operating system to fill a buffer with random bits? That’s what the OpenBSD folks thought, so they introduced a getentropy(2) system call. When called correctly it cannot fail!

int getentropy(void *buf, size_t buflen);

Other operating systems followed suit, including Linux, though on Linux getentropy(2) is a library function implemented using getrandom(2), the actual system call. It’s been in the Linux kernel since version 3.17 (October 2014), but the libc wrapper didn’t appear in glibc until version 2.25 (February 2017). So as of this writing, there are many systems where it’s still not practical to use, even if their kernel is new enough.

For now on Linux you may still want to check, and have a strategy in place, for an ENOSYS result. Some systems are still running kernels that are 5 years old, or older.
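As an illustration of such a strategy, here is a minimal Python sketch that tries the system call first and falls back to /dev/urandom on ENOSYS. The name `get_entropy` is hypothetical, and it assumes a unix-like host; Python's `os.getrandom` is only available on Linux builds, hence the `getattr` probe.

```python
import errno
import os

def get_entropy(n):
    """Return n random bytes, preferring getrandom(2) when it exists."""
    getrandom = getattr(os, "getrandom", None)  # absent on non-Linux builds
    if getrandom is not None:
        try:
            return getrandom(n)  # small requests are not cut short
        except OSError as e:
            if e.errno != errno.ENOSYS:
                raise
            # The kernel is too old for the system call: fall through.
    with open("/dev/urandom", "rb") as f:
        return f.read(n)
```

The fallback path brings back every file descriptor caveat described earlier, which is why it is only the second choice.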

OpenBSD also has another trick up its trick-filled sleeves: the .openbsd.randomdata section. Just as the .bss section is filled with zeros, the .openbsd.randomdata section is filled with securely-generated random bits. You could put your PRNG state in this section and it will be seeded as part of loading the program. Cool!

RtlGenRandom()

Windows doesn’t have /dev/urandom. Instead it has:

CryptGenRandom()
CryptAcquireContext()
CryptReleaseContext()

Though in typical Win32 fashion, the API is ugly, overly-complicated, and has multiple possible failure points. It’s essentially impossible to use without referencing documentation. Ugh.

However, Windows 98 and later has RtlGenRandom(), which has a much more reasonable interface. Looks an awful lot like getentropy(2), eh?

BOOLEAN RtlGenRandom(
  PVOID RandomBuffer,
  ULONG RandomBufferLength
);

The problem is that it’s not quite an official API, and no promises are made about it. In practice, so much software now depends on it that the API is unlikely to ever break. Despite the prototype above, this function is actually named SystemFunction036(), and you have to supply your own prototype. Here’s my little drop-in snippet that turns it nearly into getentropy(2):

#ifdef _WIN32
#  define WIN32_LEAN_AND_MEAN
#  include <windows.h>
#  pragma comment(lib, "advapi32.lib")
   BOOLEAN NTAPI SystemFunction036(PVOID, ULONG);
#  define getentropy(buf, len) (SystemFunction036(buf, len) ? 0 : -1)
#endif

It works in Wine, too, where, at least in my version, it reads from /dev/urandom.

The wrong places

That’s all well and good, but suppose we’re masochists. We want our program to be maximally portable so we’re sticking strictly to functionality found in the standard C library. That means no getentropy(2) and no RtlGenRandom(). We can still try to open /dev/urandom, but it might fail, or it might not actually be useful, so we’ll want a backup.

The usual approach found in a thousand tutorials is time(3):

srand(time(NULL));

It would be better to use an integer hash function to mix up the result from time(0) before using it as a seed. Otherwise two programs started close in time may have similar initial sequences.

srand(triple32(time(NULL)));

The more pressing issue is that time(3) has a resolution of one second. If the program is run twice inside of a second, both runs will have the same sequence of numbers. It would be better to use a higher resolution clock, but standard C doesn’t provide a clock with greater than one second resolution. That normally requires calling into POSIX or Win32.

So, we need to find some other sources of entropy unique to each execution of the program.

Quick and dirty “string” hash function

Before we get into that, we need a way to mix these different sources together. Here’s a small, 32-bit “string” hash function. The loop is the same algorithm as Java’s hashCode(), and I appended my own integer hash as a finalizer for much better diffusion.

uint32_t
hash32s(const void *buf, size_t len, uint32_t h)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++)
        h = h * 31 + p[i];
    h ^= h >> 17;
    h *= UINT32_C(0xed5ad4bb);
    h ^= h >> 11;
    h *= UINT32_C(0xac4c1b51);
    h ^= h >> 15;
    h *= UINT32_C(0x31848bab);
    h ^= h >> 14;
    return h;
}

It accepts a starting hash value, which is essentially a “context” for the digest that allows different inputs to be appended together. The finalizer acts as an implicit “stop” symbol in between inputs.

I used fixed-width integers, but it could be written nearly as concisely using only unsigned long and some masking to truncate to 32-bits. I leave this as an exercise to the reader.

Some of the values to be mixed in will be pointers themselves. These could instead be cast to integers and passed through an integer hash function, but using string hash avoids various caveats. Besides, one of the inputs will be a string, so we’ll need this function anyway.
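The chaining behavior is easier to see in a language with arbitrary-precision integers, where the 32-bit truncation has to be written out. Below is a Python transliteration of hash32s; the `mix` helper and `MASK32` are hypothetical names added for illustration.

```python
MASK32 = 0xFFFFFFFF

def hash32s(buf, h):
    """Hash a bytes object, starting from the context value h."""
    for byte in buf:
        h = (h * 31 + byte) & MASK32
    # Finalizer: alternating xorshifts and odd multiplications.
    h ^= h >> 17
    h = (h * 0xED5AD4BB) & MASK32
    h ^= h >> 11
    h = (h * 0xAC4C1B51) & MASK32
    h ^= h >> 15
    h = (h * 0x31848BAB) & MASK32
    h ^= h >> 14
    return h

def mix(sources, h=0):
    """Feed several inputs through the hash, threading the context."""
    for source in sources:
        h = hash32s(source, h)
    return h
```

Since the finalizer runs after every input, `mix([b"ab", b"c"])` and `mix([b"a", b"bc"])` are hashed as different sequences even though their concatenations match.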

Randomized pointers (ASLR, random stack gap, etc.)

Attackers can use predictability to their advantage, so modern systems use unpredictability to improve security. Memory addresses for various objects and executable code are randomized since some attacks require an attacker to know their addresses. We can skim entropy from these pointers to seed our PRNG.

Address Space Layout Randomization (ASLR) is when executable code and its associated data is loaded to a random offset by the loader. Code designed for this is called Position Independent Code (PIC). This has long been used when loading dynamic libraries so that all of the libraries on a system don’t have to coordinate with each other to avoid overlapping.

To improve security, it has more recently been extended to programs themselves. On both modern unix-like systems and Windows, position-independent executables (PIE) are now the default.

To skim entropy from ASLR, we just need the address of one of our functions. All the functions in our program will have the same relative offset, so there’s no reason to use more than one. An obvious choice is main():

    uint32_t h = 0;  /* initial hash value */
    int (*mainptr)() = main;
    h = hash32s(&mainptr, sizeof(mainptr), h);

Notice I had to store the address of main() in a variable, and then treat the pointer itself as a buffer for the hash function? It’s not hashing the machine code behind main, just its address. The symbol main doesn’t store an address, so it can’t be given to the hash function to represent its address. This is analogous to an array versus a pointer.

On a typical x86-64 Linux system, and when this is a PIE, that’s about 3 bytes worth of entropy. On 32-bit systems, virtual memory is so tight that it’s worth a lot less. We might want more entropy than that, and we want to cover the case where the program isn’t compiled as a PIE.

On unix-like systems, programs are typically dynamically linked against the C library, libc. Each shared object gets its own ASLR offset, so we can skim more entropy from each shared object by picking a function or variable from each. Let’s do malloc(3) for libc ASLR:

    void *(*mallocptr)() = malloc;
    h = hash32s(&mallocptr, sizeof(mallocptr), h);

Allocators themselves often randomize the addresses they return so that data objects are stored at unpredictable addresses. In particular, glibc uses different strategies for small (brk(2)) versus big (mmap(2)) allocations. That’s two different sources of entropy:

    void *small = malloc(1);        /* 1 byte */
    h = hash32s(&small, sizeof(small), h);
    free(small);

    void *big = malloc(1UL << 20);  /* 1 MB */
    h = hash32s(&big, sizeof(big), h);
    free(big);

Finally the stack itself is often mapped at a random address, or at least started with a random gap, so that local variable addresses are also randomized.

    void *ptr = &ptr;
    h = hash32s(&ptr, sizeof(ptr), h);

Time sources

We haven’t used time(3) yet! Let’s still do that, using the full width of time_t this time around:

    time_t t = time(0);
    h = hash32s(&t, sizeof(t), h);

We do have another time source to consider: clock(3). It returns an approximation of the processor time used by the program. There’s a tiny bit of noise and inconsistency between repeated calls. We can use this to extract a little bit of entropy over many repeated calls.

Naively we might try to use it like this:

    /* Note: don't use this */
    for (int i = 0; i < 1000; i++) {
        clock_t c = clock();
        h = hash32s(&c, sizeof(c), h);
    }

The problem is that the resolution for clock() is typically rough enough that modern computers can execute multiple instructions between ticks. On Windows, where CLOCKS_PER_SEC is low, that entire loop will typically complete before the result from clock() increments even once. With that arrangement we’re hardly getting anything from it! So here’s a better version:

    for (int i = 0; i < 1000; i++) {
        unsigned long counter = 0;
        clock_t start = clock();
        while (clock() == start)
            counter++;
        h = hash32s(&start, sizeof(start), h);
        h = hash32s(&counter, sizeof(counter), h);
    }

The counter makes the resolution of the clock no longer important. If it’s low resolution, then we’ll get lots of noise from the counter. If it’s high resolution, then we get noise from the clock value itself. Running the hash function an extra time between overall clock(3) samples also helps with noise.
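The same counting trick can be sketched in Python, using `time.process_time()` as a stand-in for clock(3). That substitution is an assumption, as is the name `sample_clock`; the sketch only collects the raw samples and leaves the mixing out.

```python
import time

def sample_clock(rounds=100):
    """Collect (clock value, spin count) pairs for later mixing."""
    samples = []
    for _ in range(rounds):
        counter = 0
        start = time.process_time()
        # Spin until the clock ticks, counting how long that takes.
        while time.process_time() == start:
            counter += 1
        samples.append((start, counter))
    return samples
```

With a coarse clock the counters are large and noisy; with a fine clock they are mostly zero and the clock values themselves carry the noise.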

A legitimate use of tmpnam(3)

We’ve got one more source of entropy available: tmpnam(3). This function generates a unique, temporary file name. It’s dangerous to use as intended because it doesn’t actually create the file. There’s a race between generating the name for the file and actually creating it.

Fortunately we don’t actually care about the name as a filename. We’re using this to sample entropy not directly available to us. In an attempt to get a unique name, the standard C library draws on its own sources of entropy.

    char buf[L_tmpnam] = {0};
    tmpnam(buf);
    h = hash32s(buf, sizeof(buf), h);

The rather unfortunate downside is that lots of modern systems produce a linker warning when they see tmpnam(3) being linked, even though in this case it’s completely harmless.

So what goes into a temporary filename? It depends on the implementation.

glibc and musl

Both get a high resolution timestamp and generate the filename directly from the timestamp (no hashing, etc.). Unfortunately glibc does a very poor job of also mixing getpid(2) into the timestamp before using it, and probably makes things worse by doing so.

On these platforms, this is a way to sample a high resolution timestamp without calling anything non-standard.

dietlibc

In the latest release as of this writing it uses rand(3), which makes this useless. It’s also a bug since the C library isn’t allowed to affect the state of rand(3) outside of rand(3) and srand(3). I submitted a bug report and this has since been fixed.

In the next release it will use a generator seeded by the ELF AT_RANDOM value if available, or ASLR otherwise. This makes it moderately useful.

libiberty

Generated from getpid(2) alone, with a counter to handle multiple calls. It’s basically a way to sample the process ID without actually calling getpid(2).

BSD libc / Bionic (Android)

Actually gathers real entropy from the operating system (via arc4random(2)), which means we’re getting a lot of mileage out of this one.

uclibc

Its implementation is obviously forked from glibc. However, it first tries to read entropy from /dev/urandom, and only if that fails does it fall back to glibc’s original high resolution clock XOR getpid(2) method (still not hashing it).

Finishing touches

Finally, still use /dev/urandom if it’s available. This doesn’t require us to trust that the output is anything useful since it’s just being mixed into the other inputs.

    char rnd[4];
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        if (fread(rnd, sizeof(rnd), 1, f))
            h = hash32s(rnd, sizeof(rnd), h);
        fclose(f);
    }

When we’re all done gathering entropy, set the seed from the result.

    srand(h);   /* or whatever you're seeding */

That’s bound to find some entropy on just about any host. Though definitely don’t rely on the results for cryptography.

Lua

I recently tackled this problem in Lua. It has a no-batteries-included design, demanding very little of its host platform: nothing more than an ANSI C implementation. Because of this, a Lua program has even fewer options for gathering entropy than C. But it’s still not impossible!

To further complicate things, Lua code is often run in a sandbox with some features removed. For example, Lua has os.time() and os.clock() wrapping the C equivalents, allowing for the same sorts of entropy sampling. When run in a sandbox, os might not be available. Similarly, io might not be available for accessing /dev/urandom.

Have you ever printed a table, though? Or a function? It evaluates to a string containing the object’s address.

$ lua -e 'print(math)'
table: 0x559577668a30
$ lua -e 'print(math)'
table: 0x55e4a3679a30

Since the raw pointer values are leaked to Lua, we can skim allocator entropy like before. Here’s the same hash function in Lua 5.3:

local function hash32s(buf, h)
    for i = 1, #buf do
        h = h * 31 + buf:byte(i)
    end
    h = h & 0xffffffff
    h = h ~ (h >> 17)
    h = h * 0xed5ad4bb
    h = h & 0xffffffff
    h = h ~ (h >> 11)
    h = h * 0xac4c1b51
    h = h & 0xffffffff
    h = h ~ (h >> 15)
    h = h * 0x31848bab
    h = h & 0xffffffff
    h = h ~ (h >> 14)
    return h
end

Now hash a bunch of pointers in the global environment:

local h = hash32s(tostring({}), 0)  -- hash a new table's address
for varname, value in pairs(_G) do
    h = hash32s(varname, h)
    h = hash32s(tostring(value), h)
    if type(value) == 'table' then
        for k, v in pairs(value) do
            h = hash32s(tostring(k), h)
            h = hash32s(tostring(v), h)
        end
    end
end

math.randomseed(h)

Unfortunately this doesn’t actually work well on one platform I tested: Cygwin. Cygwin has few security features, notably lacking ASLR, and having a largely deterministic allocator.

When to use it

In practice it’s not really necessary to use these sorts of tricks of gathering entropy from odd places. It’s something that comes up more in coding challenges and exercises than in real programs. I’m probably already making platform-specific calls in programs substantial enough to need it anyway.

On a few occasions I have thought about these things when debugging. ASLR makes return pointers on the stack slightly randomized on each run, which can change the behavior of some kinds of bugs. Allocator and stack randomization does similar things to most of your pointers. GDB tries to disable some of these features during debugging, but it doesn’t get everything.

The CPython Bytecode Compiler is Dumb

2019-02-24T21:56:35Z

This article was discussed on Hacker News.

Due to sheer coincidence of several unrelated tasks converging on Python at work, I recently needed to brush up on my Python skills. So far for me, Python has been little more than a fancy extension language for BeautifulSoup, though I also used it to participate in the recent tradition of writing one’s own static site generator, in this case for my wife’s photo blog. I’ve been reading through Fluent Python by Luciano Ramalho, and it’s been quite effective at getting me up to speed.

As I write Python, like with Emacs Lisp, I can’t help but consider what exactly is happening inside the interpreter. I wonder if the code I’m writing is putting undue constraints on the bytecode compiler and limiting its options. Ultimately I’d like the code I write to drive the interpreter efficiently and effectively. The Zen of Python says there should be “only one obvious way to do it,” but in practice there’s a lot of room for expression. Given multiple ways to express the same algorithm or idea, I tend to prefer the one that compiles to the more efficient bytecode.

Fortunately CPython, the main and most widely used implementation of Python, is very transparent about its bytecode. It’s easy to inspect and reason about its bytecode. The disassembly listing is easy to read and understand, and I can always follow it without consulting the documentation. This contrasts sharply with modern JavaScript engines and their opaque use of JIT compilation, where performance is guided by obeying certain patterns (hidden classes, etc.), helping the compiler understand my program’s types, and being careful not to unnecessarily constrain the compiler.

So, besides just catching up with Python the language, I’ve been studying the bytecode disassembly of the functions that I write. One fact has become quite apparent: the CPython bytecode compiler is pretty dumb. With a few exceptions, it’s a very literal translation of a Python program, and there is almost no optimization. Below I’ll demonstrate a case where it’s possible to detect one of the missed optimizations without inspecting the bytecode disassembly thanks to a small abstraction leak in the optimizer.

To be clear: This isn’t to say CPython is bad, or even that it should necessarily change. In fact, as I’ll show, dumb bytecode compilers are par for the course. In the past I’ve lamented how the Emacs Lisp compiler could do a better job, but CPython and Lua are operating at the same level. There are benefits to a dumb and straightforward bytecode compiler: the compiler itself is simpler, easier to maintain, and more amenable to modification (e.g. as Python continues to evolve). It’s also easier to debug Python (pdb) because it’s such a close match to the source listing.

Update: Darius Bacon points out that Guido van Rossum himself said, “Python is about having the simplest, dumbest compiler imaginable.” So this is all very much by design.

The consensus seems to be that if you want or need better performance, use something other than Python. (And if you can’t do that, at least use PyPy.) That’s a fairly reasonable and healthy goal. Still, if I’m writing Python, I’d like to do the best I can, which means exploiting the optimizations that are available when possible.

Disassembly examples

I’m going to compare three bytecode compilers in this article: CPython 3.7, Lua 5.3, and Emacs 26.1. Each of these languages is dynamically typed, is primarily executed on a bytecode virtual machine, and makes its disassembly listings easy to access. One caveat: CPython and Emacs use a stack-based virtual machine while Lua uses a register-based virtual machine.

For CPython I’ll be using the dis module. For Emacs Lisp I’ll use M-x disassemble, and all code will use lexical scoping. In Lua I’ll use luac -l on the command line.

Local variable elimination

Will the bytecode compiler eliminate local variables? Keeping the variable around potentially involves allocating memory for it, assigning to it, and accessing it. Take this example:

def foo():
    x = 0
    y = 1
    return x

This function is equivalent to:

def foo():
    return 0

Despite this, CPython completely misses this optimization for both x and y:

  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               0 (x)
  3           4 LOAD_CONST               2 (1)
              6 STORE_FAST               1 (y)
  4           8 LOAD_FAST                0 (x)
             10 RETURN_VALUE

It assigns both variables, and even loads again from x for the return. Missed optimizations, but, as I said, by keeping these variables around, debugging is more straightforward. Users can always inspect variables.
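The same conclusion can be checked programmatically rather than by reading the listing. This is a minimal sketch, assuming a reasonably recent CPython; the helper name store_targets is my own, and opcode names are an implementation detail that can shift between versions.

```python
import dis

def foo():
    x = 0
    y = 1
    return x

def store_targets(func):
    """Names of the locals the bytecode assigns to (hypothetical helper)."""
    return [ins.argval for ins in dis.get_instructions(func)
            if ins.opname == "STORE_FAST"]

# Both locals survive compilation, including the unused y.
print(foo.__code__.co_varnames)
print(store_targets(foo))
```

If the compiler had eliminated the variables, co_varnames would be empty and no STORE_FAST instructions would remain.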

How about Lua?

function foo()
    local x = 0
    local y = 1
    return x
end

It also misses this optimization, though it matters a little less due to its architecture (the return instruction references a register regardless of whether or not that register is allocated to a local variable):

        1       [2]     LOADK           0 -1    ; 0
        2       [3]     LOADK           1 -2    ; 1
        3       [4]     RETURN          0 2
        4       [5]     RETURN          0 1

Emacs Lisp also misses it:

(defun foo ()
  (let ((x 0)
        (y 1))
    x))

Disassembly:

constant  0
constant  1
stack-ref 1
return

All three are on the same page.

Constant folding

Does the bytecode compiler evaluate simple constant expressions at compile time? This is simple and everyone does it.

def foo():
    return 1 + 2 * 3 / 4

Disassembly:

  2           0 LOAD_CONST               1 (2.5)
              2 RETURN_VALUE

Lua:

function foo()
    return 1 + 2 * 3 / 4
end

Disassembly:

        1       [2]     LOADK           0 -1    ; 2.5
        2       [2]     RETURN          0 2
        3       [3]     RETURN          0 1

Emacs Lisp:

(defun foo ()
  (+ 1 (/ (* 2 3) 4.0)))

Disassembly:

constant  2.5
return

That’s something we can count on so long as the operands are all numeric literals (or also, for Python, string literals) that are visible to the compiler. Don’t count on your operator overloads to work here, though.
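Here is a small sketch of how that boundary can be probed from Python itself, assuming CPython. The function names and the has_binary_op helper are hypothetical, and the check relies on arithmetic opcodes starting with BINARY, which holds for the CPython versions I know of but is not a documented guarantee.

```python
import dis

def folded():
    return 1 + 2 * 3 / 4

def not_folded():
    x = 2                      # a local variable hides the value from the folder
    return 1 + x * 3 / 4

def has_binary_op(func):
    """True if any arithmetic instruction remains in the bytecode (hypothetical helper)."""
    return any(ins.opname.startswith("BINARY")
               for ins in dis.get_instructions(func))

print(2.5 in folded.__code__.co_consts)  # the folded result is stored as a constant
print(has_binary_op(folded))             # no arithmetic left to do at run time
print(has_binary_op(not_folded))         # the arithmetic happens on every call
```

Both functions return 2.5, but only the first has its arithmetic done at compile time.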

Allocation optimization

Optimizers often perform escape analysis, to determine if objects allocated in a function ever become visible outside of that function. If they don’t then these objects could potentially be stack-allocated (instead of heap-allocated) or even be eliminated entirely.

None of the bytecode compilers are this sophisticated. However CPython does have a trick up its sleeve: tuple optimization. Since tuples are immutable, in certain circumstances CPython will reuse them and avoid both the constructor and the allocation.

def foo():
    return (1, 2, 3)

Check it out, the tuple is used as a constant:

  2           0 LOAD_CONST               1 ((1, 2, 3))
              2 RETURN_VALUE

We can detect this by evaluating foo() is foo(), which is True. Deviate from this too much, though, and the optimization is disabled. Remember how CPython can’t optimize away variables, and that they break constant folding? They break this, too:

def foo():
    x = 1
    return (x, 2, 3)

Disassembly:

  2           0 LOAD_CONST               1 (1)
              2 STORE_FAST               0 (x)
  3           4 LOAD_FAST                0 (x)
              6 LOAD_CONST               2 (2)
              8 LOAD_CONST               3 (3)
             10 BUILD_TUPLE              3
             12 RETURN_VALUE

This function might document that it always returns a simple tuple, but we can tell if it’s being optimized or not using is like before: foo() is foo() is now False! In some future version of Python with a cleverer bytecode compiler, that expression might evaluate to True. (Unless the Python language specification is specific about this case, which I didn’t check.)
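Putting the two cases side by side makes the leak easy to see. This is just a sketch with function names of my own choosing, and the identity results are a CPython implementation detail, so they should not be relied on in real code.

```python
def constant_tuple():
    return (1, 2, 3)

def built_tuple():
    x = 1
    return (x, 2, 3)

# Same object every call: the tuple lives in the function's constants.
print(constant_tuple() is constant_tuple())
# A fresh tuple is built on each call.
print(built_tuple() is built_tuple())
# The values are equal either way; only identity reveals the difference.
print(constant_tuple() == built_tuple())
```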

Note: Curiously PyPy replicates this exact behavior when examined with is. Was that deliberate? I’m impressed that PyPy matches CPython’s semantics so closely here.

Putting a mutable value, such as a list, in the tuple will also break this optimization. But that’s not the compiler being dumb. That’s a hard constraint on the compiler: the caller might change the mutable component of the tuple, so it must always return a fresh copy.

Neither Lua nor Emacs Lisp has a language-level equivalent of an immutable tuple, so there’s nothing to compare.

Other than the tuples situation in CPython, none of the bytecode compilers eliminate unnecessary intermediate objects.

def foo():
    return [1024][0]

Disassembly:

  2           0 LOAD_CONST               1 (1024)
              2 BUILD_LIST               1
              4 LOAD_CONST               2 (0)
              6 BINARY_SUBSCR
              8 RETURN_VALUE

Lua:

function foo()
    return ({1024})[1]
end

Disassembly:

        1       [2]     NEWTABLE        0 1 0
        2       [2]     LOADK           1 -1    ; 1024
        3       [2]     SETLIST         0 1 1   ; 1
        4       [2]     GETTABLE        0 0 -2  ; 1
        5       [2]     RETURN          0 2
        6       [3]     RETURN          0 1

Emacs Lisp:

(defun foo ()
  (car (list 1024)))

Disassembly:

constant  1024
list1
car
return

Don’t expect too much

I could go on with lots of examples, looking at loop optimizations and so on, and each case is almost certainly unoptimized. The general rule of thumb is to simply not expect much from these bytecode compilers. They’re very literal in their translation.

Working so much in C has put me in the habit of expecting all obvious optimizations from the compiler. This frees me to be more expressive in my code. Lots of things are cost-free thanks to these optimizations, such as breaking a complex expression up into several variables, naming my constants, or not using a local variable to manually cache memory accesses. I’m confident the compiler will optimize away my expressiveness. The catch is that clever compilers can take things too far, so I’ve got to be mindful of how they might undermine my intentions, i.e. when I’m doing something unusual or not strictly permitted.

These bytecode compilers will never truly surprise me. The cost is that being more expressive in Python, Lua, or Emacs Lisp may reduce performance at run time because it shows in the bytecode. Usually this doesn’t matter, but sometimes it does.

The 3n + 1 Conjecture

2008-01-29T00:00:00Z

The 3n + 1 conjecture, also known as the Collatz conjecture, is based around this recursive function,

f(n) = n / 2 if n is even
f(n) = 3n + 1 if n is odd

The conjecture is this,

This process will eventually reach the number 1, regardless of which positive integer is chosen initially.

The way I am defining this may not be entirely accurate, as I took a shortcut to make it a bit simpler. I am not a mathematician (IANAM) — but sometimes I pretend to be one. For a really solid definition, click through to the Wikipedia article in the link above.

A sample run, starting at 7, would look like this: 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1. The sequence starting at 7 contains 17 numbers. So 7 has a cycle-length of 17. Currently, there is no known positive integer that does not eventually lead to 1. If the conjecture is true, then none exists to be found.
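To make the sample run concrete, here is a minimal sketch that generates the sequence and counts its members. It is written in Python rather than Lua purely for illustration, and the function name collatz_sequence is my own.

```python
def collatz_sequence(n):
    """Return the full sequence from n down to 1."""
    seq = [n]
    while n > 1:
        # Halve even numbers; map odd numbers to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        seq.append(n)
    return seq

seq = collatz_sequence(7)
print(seq)
print(len(seq))  # the cycle-length of 7
```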

I first found out about the problem when I saw it on UVa Online Judge. UVa Online Judge is a system that has a couple thousand programming problems to do. Users can submit solution programs written in C, C++, Java, or Pascal. For normal submissions, the fastest program wins.

Anyway, the way UVa Online Judge runs this problem is by providing the solution program pairs of integers on stdin as text. Each pair defines an inclusive range of integers, and the program must return the longest Collatz cycle-length among all the integers inside that range. They don't tell you which ranges they are checking, except that all integers will be less than 1,000,000 and the sequences will never overflow a 32-bit integer (allowing shortcuts to be made to increase performance).

The simple approach would be defining a function that returns the cycle length (Lua programming language),

function collatz_len (n)
   local c = 1

   while n > 1 do
      c = c + 1
      if n % 2 == 0 then
         n = n / 2
      else
         n = 3 * n + 1
      end
   end

   return c
end

Then we have a function to check over a range (assuming n <= m here),

function check_range (n, m)
   local largest = 0

   for i = n, m do
      local len = collatz_len (i)

      if len > largest then
         largest = len
      end

   end

   return largest
end

And top it off with the i/o. (I am just learning Lua, so I hope I did this part properly!)

while true do
   local n, m = io.stdin:read("*number", "*number")

   -- check for eof
   if n == nil or m == nil then
      break
   end

   print (n .. " " .. m .. " " .. check_range(n, m))
end

Notice anything extremely inefficient? We are doing the same work over and over again! Take, for example, this range: 7, 22. When we start with 7, we get the sequence shown above: 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1. Nine of these numbers (7, 22, 11, 17, 13, 20, 10, 16, and 8) are part of the range that we are looking at. When we get up to 22, we are going to walk down the same sequence again, less the 7. To make things more efficient, we apply some dynamic programming and store previously calculated cycle-lengths in an array. Once we get to a value we already calculated, we just look it up.
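The idea can be sketched in a few lines. This is an illustration in Python, not the C submission itself. It uses a dictionary instead of a fixed-size array, and the names _cache, cycle_length, and check_range are assumptions of the sketch.

```python
# Known cycle-lengths, seeded with the base case.
_cache = {1: 1}

def cycle_length(n):
    """Cycle-length of n, remembering every value seen along the way."""
    path = []
    while n not in _cache:
        path.append(n)
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    length = _cache[n]
    # Walk back up the path, filling in each newly learned length.
    for m in reversed(path):
        length += 1
        _cache[m] = length
    return length

def check_range(lo, hi):
    """Longest cycle-length among the integers lo..hi inclusive."""
    return max(cycle_length(i) for i in range(lo, hi + 1))

print(cycle_length(7))
print(check_range(1, 10))
```

After the first call with 7, every number in its sequence is already in the cache, so later lookups such as 22 finish without recomputing anything. The loop is iterative on purpose, so long sequences cannot exhaust the call stack.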

I used dynamic programming in my submission, which I wrote up in C. You can grab my source here. It fills in a large array (1000000 entries) as values are found, so no cycle-length is calculated twice. When I submitted this program, it ranked 60 out of about 300,000 entries. There are probably a number of tweaks that can increase performance, such as increasing the size of the array, but I didn't care much about inching closer to the top. I would bet that the very top entries did some trial-and-error and determined what ranges are tested, using the results to seed their program accordingly. You could take my code and submit it yourself, but that wouldn't be very honest, would it?

So why am I going through all of this describing such a simple problem? Well, it is because of this neat feature of Lua that applies well to this problem. Lua is kind of like Lisp. In Lisp, everything is a list ("list processing" --> Lisp). In Lua, (almost) everything is an associative array. (Maybe they should have called it Assp? Or Hashp? I am kidding.) An object is a hash with fields containing function references. There is even some syntactic sugar to help this along.

The cool thing is that we can create a hash with default entries that reference a function that calculates the Collatz cycle-length of its key. Once the cycle-length is calculated, the function reference is replaced with the value, so the function is never called again from that point. The function only actually determines the next integer, then references the hash to get the cycle-length of that next integer.

Now this hash looks like it is infinitely large. This is really a form of lazy evaluation: no values are calculated until they are needed (this is one of my favorite things about Haskell). We don't need to explicitly ask for it to be calculated, either. We just go along looking up values in the array as if they were always there. Here is how you do it,

collatz_len = { 1 }

setmetatable (collatz_len, {
   __index = function (name, n)
      -- compute the missing entry from the next value in the sequence
      if n % 2 == 0 then
         name[n] = name[n / 2] + 1
      else
         name[n] = name[3 * n + 1] + 1
      end
      return name[n]
   end
})

So we replace the collatz_len function with this array (and replace the call to an array reference) and we have applied dynamic programming to our old program. If I run the two programs with this sample input,

10 1000
1000 3000
300 500

and look at average running times, the dynamic programming version runs 87% faster than the original.

One problem with this, though, is the use of recursion. In Lua, it is really easy to hit recursion limits. For example, accessing element 10000 will cause the program to crash. This will probably get fixed someday, or in some implementation of Lua.
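For comparison, the same lazy-table trick can be sketched in Python, where a dict subclass may define __missing__ to compute an absent entry on first lookup. This is only an analogy to the Lua metatable above, and the class name CollatzTable is my own. It shares the recursion caveat, since each missing entry triggers a nested lookup and Python enforces its own recursion limit.

```python
class CollatzTable(dict):
    """Dict that computes a missing cycle-length on first lookup."""

    def __missing__(self, n):
        nxt = n // 2 if n % 2 == 0 else 3 * n + 1
        # Looking up nxt may recurse into __missing__ again.
        self[n] = self[nxt] + 1
        return self[n]

collatz_len = CollatzTable({1: 1})
print(collatz_len[7])    # computed on demand
print(len(collatz_len))  # every value along the way is now stored
```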

I thought there might be a way to do this in Perl, by changing the default hash value from undef to something else, but I was mildly disappointed to find out that this is not true.

Here is the source for the original program and the one with dynamic programming (BSD licenced): collatz_simple.lua and collatz.lua