Endlessh: an SSH Tarpit

This article was discussed on Hacker News and on reddit.

I’m a big fan of tarpits: network services that intentionally insert delays in their protocols, slowing down clients by forcing them to wait. This arrests the speed at which a bad actor can attack or probe the host system, and it ties up some of the attacker’s resources that might otherwise be spent attacking another host. When done well, a tarpit imposes more cost on the attacker than the defender.

The Internet is a very hostile place, and anyone who’s ever stood up an Internet-facing IPv4 host has witnessed the immediate and continuous attacks against their server. I’ve maintained such a server for nearly six years now, and more than 99% of my incoming traffic has ill intent. One part of my defenses has been tarpits in various forms. The latest addition is an SSH tarpit I wrote a couple of months ago:

Endlessh: an SSH tarpit

This program opens a socket and pretends to be an SSH server. However, it actually just ties up SSH clients with false promises indefinitely — or at least until the client eventually gives up. After cloning the repository, here’s how you can try it out for yourself (default port 2222):

$ make
$ ./endlessh &
$ ssh -p2222 localhost

Your SSH client will hang there and wait for at least several days before finally giving up. Like a mammoth in the La Brea Tar Pits, it got itself stuck and can’t get itself out. As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.

My Internet-facing Endlessh server listens on port 22, which is the standard SSH port. I long ago moved my real SSH server off to another port where it sees a whole lot less SSH traffic — essentially none. This makes the logs a whole lot more manageable. And (hopefully) Endlessh convinces attackers not to look around for an SSH server on another port.

How does it work? Endlessh exploits a little paragraph in RFC 4253, the SSH protocol specification. Immediately after the TCP connection is established, and before negotiating the cryptography, both ends send an identification string:

SSH-protoversion-softwareversion SP comments CR LF

The RFC also notes:

The server MAY send other lines of data before sending the version string.

There is no limit on the number of lines, just that these lines must not begin with “SSH-” since that would be ambiguous with the identification string, and lines must not be longer than 255 characters including CRLF. So Endlessh sends an endless stream of randomly-generated “other lines of data” without ever intending to send a version string. By default it waits 10 seconds between each line. This slows down the protocol, but prevents it from actually timing out.

This means Endlessh need not know anything about cryptography or the vast majority of the SSH protocol. It’s dead simple.

Implementation strategies

Ideally the tarpit’s resource footprint should be as small as possible. It’s just a security tool, and the server does have an actual purpose that doesn’t include being a tarpit. It should tie up the attacker’s resources, not the server’s, and should generally be unnoticeable. (Take note all those who write the awful “security” products I have to tolerate at my day job.)

Even when many clients have been trapped, Endlessh spends more than 99.999% of its time waiting around, doing nothing. It wouldn’t even be accurate to call it I/O-bound. If anything, it’s timer-bound, waiting around before sending off the next line of data. The most precious resource to conserve is memory.

Processes

The most straightforward way to implement something like Endlessh is a fork server: accept a connection, fork, and the child simply alternates between sleep(3) and write(2):

for (;;) {
    ssize_t r;
    char line[256];

    sleep(DELAY);
    generate_line(line);
    r = write(fd, line, strlen(line));
    if (r == -1 && errno != EINTR) {
        exit(0);
    }
}
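
The generate_line() function referenced above isn’t shown. Here’s one way such a helper might look (a hypothetical sketch, not the actual Endlessh source); all it has to do is produce a short line that doesn’t begin with “SSH-” and stays under the 255-byte limit, CRLF included.

#include <stdlib.h>

/* Hypothetical line generator: random lowercase filler, so a line can
 * never begin with "SSH-", terminated with CRLF and kept well under the
 * 255-byte limit. The real program generates its junk differently, but
 * any line that follows those two rules will do. */
void
generate_line(char *line)               /* line holds at least 256 bytes */
{
    int len = 3 + rand() % 32;
    for (int i = 0; i < len; i++)
        line[i] = 'a' + rand() % 26;
    line[len + 0] = '\r';
    line[len + 1] = '\n';
    line[len + 2] = 0;
}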

A process per connection is a lot of overhead when connections are expected to be up hours or even weeks at a time. An attacker who knows about this could exhaust the server’s resources with little effort by opening up lots of connections.

Threads

A better option is, instead of processes, to create a thread per connection. On Linux this is practically the same thing, but it’s still better. However, you still have to allocate a stack for the thread and the kernel will have to spend some resources managing the thread.
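
For illustration, a thread-per-connection version might look something like this (a hedged sketch using POSIX threads, not how Endlessh is actually built; generate_line() is the same hypothetical helper as before):

#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DELAY 10

void generate_line(char *line);   /* the hypothetical helper sketched earlier */

static void *
tarpit_thread(void *arg)
{
    int fd = *(int *)arg;
    free(arg);
    for (;;) {
        char line[256];
        sleep(DELAY);
        generate_line(line);
        ssize_t r = write(fd, line, strlen(line));
        if (r == -1 && errno != EINTR)
            break;
    }
    close(fd);
    return 0;
}

/* In the accept loop, each new connection gets its own detached thread:
 *
 *     int *fdp = malloc(sizeof(*fdp));
 *     *fdp = accept(server, 0, 0);
 *     pthread_t thr;
 *     pthread_create(&thr, 0, tarpit_thread, fdp);
 *     pthread_detach(thr);
 */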

Poll

For Endlessh I went for an even more lightweight version: a single-threaded poll(2) server, analogous to stackless green threads. The overhead per connection is about as low as it gets.

Clients that are being delayed are not registered in poll(2). Their only overhead is the socket object in the kernel, and another 78 bytes to track them in Endlessh. Most of those bytes are used only for accurate logging. Only those clients that are overdue for a new line are registered for poll(2).

When clients are waiting, but no clients are overdue, poll(2) is essentially used in place of sleep(3). Though since it still needs to manage the accept server socket, it (almost) never actually waits on nothing.
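
Here’s a rough sketch of that core loop (hypothetical and heavily simplified, with no error handling, accepting, or line writing shown; it is not the actual Endlessh source). The key idea is that the poll(2) timeout doubles as the sleep until the next client is due:

#include <poll.h>
#include <time.h>

struct client {
    int fd;
    long long send_next;                /* when the next line is due (ms) */
};

/* assumed millisecond clock helper */
static long long
epoch_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
}

static void
tarpit_loop(int server, struct client *clients, int nclients)
{
    for (;;) {
        long long now = epoch_ms();
        struct pollfd fds[1 + nclients];
        int nfds = 0;
        int timeout = -1;

        /* always watch the accept socket */
        fds[nfds].fd = server;
        fds[nfds].events = POLLIN;
        nfds++;

        for (int i = 0; i < nclients; i++) {
            long long wait = clients[i].send_next - now;
            if (wait <= 0) {
                /* overdue: register this client for writability */
                fds[nfds].fd = clients[i].fd;
                fds[nfds].events = POLLOUT;
                nfds++;
                timeout = 0;
            } else if (timeout == -1 || wait < timeout) {
                timeout = (int)wait;    /* sleep until the next deadline */
            }
        }

        poll(fds, nfds, timeout);

        /* ... accept any new client, write one line to each writable
           client, then push its send_next forward by the delay ... */
    }
}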

There’s an option to limit the total number of client connections so that it doesn’t get out of hand. In this case it will stop polling the accept socket until a client disconnects. I probably shouldn’t have bothered with this option and instead relied on ulimit, a feature already provided by the operating system.

I could have used epoll (Linux) or kqueue (BSD), which would be much more efficient than poll(2). The problem with poll(2) is that it’s constantly registering and unregistering Endlessh on each of the overdue sockets each time around the main loop. This is by far the most CPU-intensive part of Endlessh, and it’s all inflicted on the kernel. Most of the time, even with thousands of clients trapped in the tarpit, only a small number of them are polled at once, so I opted for better portability instead.

One consequence of not polling connections that are waiting is that disconnections aren’t noticed in a timely fashion. This makes the logs less accurate than I like, but otherwise it’s pretty harmless. Unfortunately even if I wanted to fix this, the poll(2) interface isn’t quite equipped for it anyway.

Raw sockets

With a poll(2) server, the biggest overhead remaining is in the kernel, where it allocates send and receive buffers for each client and manages the proper TCP state. The next step to reducing this overhead is Endlessh opening a raw socket and speaking TCP itself, bypassing most of the operating system’s TCP/IP stack.

Much of the TCP connection state doesn’t matter to Endlessh and doesn’t need to be tracked. For example, it doesn’t care about any data sent by the client, so no receive buffer is needed, and any data that arrives could be dropped on the floor.

Even more, raw sockets would allow for some even nastier tarpit tricks. Despite the long delays between data lines, the kernel itself responds very quickly on the TCP layer and below. ACKs are sent back quickly and so on. An astute attacker could detect that the delay is artificial, imposed above the TCP layer by an application.

If Endlessh worked at the TCP layer, it could tarpit the TCP protocol itself. It could introduce artificial “noise” to the connection that requires packet retransmissions, delay ACKs, etc. It would look a lot more like network problems than a tarpit.

I haven’t taken Endlessh this far, nor do I plan to do so. At the moment attackers either have a hard timeout, so this wouldn’t matter, or they’re pretty dumb and Endlessh already works well enough.

asyncio and other tarpits

Since writing Endlessh I’ve learned about Python’s asyncio, and it’s actually a near perfect fit for this problem. I should have just used it in the first place. The hard part is already implemented within asyncio, and the problem isn’t CPU-bound, so being written in Python doesn’t matter.

Here’s a simplified (no logging, no configuration, etc.) version of Endlessh implemented in about 20 lines of Python 3.7:

import asyncio
import random

async def handler(_reader, writer):
    try:
        while True:
            await asyncio.sleep(10)
            writer.write(b'%x\r\n' % random.randint(0, 2**32))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 2222)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Since Python coroutines are stackless, the per-connection memory overhead is comparable to the C version. So it seems asyncio is perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up attackers trying to exploit HTTP servers. It slowly sends a random, endless HTTP header:

import asyncio
import random

async def handler(_reader, writer):
    writer.write(b'HTTP/1.1 200 OK\r\n')
    try:
        while True:
            await asyncio.sleep(5)
            header = random.randint(0, 2**32)
            value = random.randint(0, 2**32)
            writer.write(b'X-%x: %x\r\n' % (header, value))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Try it out for yourself. Firefox and Chrome will spin on that server for hours before giving up. I have yet to see curl actually time out on its own with its default settings (--max-time/-m does work correctly, though).

Parting exercise for the reader: Using the examples above as a starting point, implement an SMTP tarpit using asyncio. Bonus points for using TLS connections and testing it against real spammers.

An Async / Await Library for Emacs Lisp

As part of building my Python proficiency, I’ve learned how to use asyncio. This new language feature first appeared in Python 3.5 (PEP 492, September 2015). JavaScript grew a nearly identical feature in ES2017 (June 2017). An async function can pause to await on an asynchronously computed result, much like a generator pausing when it yields a value.

In fact, both Python and JavaScript async functions are essentially just fancy generator functions with some specialized syntax and semantics. That is, they’re stackless coroutines. Both languages already had generators, so their generator-like async functions are a natural extension that — unlike stackful coroutines — do not require significant, new runtime plumbing.

Emacs officially got generators in 25.1 (September 2016), though, unlike Python and JavaScript, it didn’t require any additional support from the compiler or runtime. It’s implemented entirely using Lisp macros. In other words, it’s just another library, not a core language feature. In theory, the generator library could be easily backported to the first Emacs release to properly support lexical closures, Emacs 24.1 (June 2012).

For the same reason, stackless async/await coroutines can also be implemented as a library. So that’s what I did, letting Emacs’ generator library do most of the heavy lifting. The package is called aio:

It’s modeled more closely on JavaScript’s async functions than Python’s asyncio, with the core representation being promises rather than coroutine objects. I just have an easier time reasoning about promises than coroutines.

I’m definitely not the first person to realize this was possible, and was beaten to the punch by two years. Wanting to avoid fragmentation, I set aside all formality in my first iteration on the idea, not even bothering with namespacing my identifiers. It was to be only an educational exercise. However, I got quite attached to my little toy. Once I got my head wrapped around the problem, everything just sort of clicked into place so nicely.

In this article I will show step-by-step one way to build async/await on top of generators, laying out one concept at a time and then building upon each. But first, some examples to illustrate the desired final result.

aio example

Ignoring all its problems for a moment, suppose you want to use url-retrieve to fetch some content from a URL and return it. To keep this simple, I’m going to omit error handling. Also assume that lexical-binding is t for all examples. Besides, lexical scope is required by the generator library, and therefore also required by aio.

The most naive approach is to fetch the content synchronously:

(defun fetch-fortune-1 (url)
  (let ((buffer (url-retrieve-synchronously url)))
    (with-current-buffer buffer
      (prog1 (buffer-string)
        (kill-buffer)))))

The result is returned directly, and errors are communicated by an error signal (i.e. Emacs’ version of exceptions). This is convenient, but the function will block the main thread, locking up Emacs until the result has arrived. This is obviously very undesirable, so, in practice, everyone nearly always uses the asynchronous version:

(defun fetch-fortune-2 (url callback)
  (url-retrieve url (lambda (_status)
                      (funcall callback (buffer-string)))))

The main thread no longer blocks, but it’s a whole lot less convenient. The result isn’t returned to the caller, and instead the caller supplies a callback function. The result, whether success or failure, will be delivered via callback, so the caller must split itself into two pieces: the part before the callback and the callback itself. Errors cannot be delivered using an error signal because of the inverted flow control.

The situation gets worse if, say, you need to fetch results from two different URLs. You either fetch results one at a time (inefficient), or you manage two different callbacks that could be invoked in any order, and therefore have to coordinate.

Wouldn’t it be nice for the function to work like the first example, but be asynchronous like the second example? Enter async/await:

(aio-defun fetch-fortune-3 (url)
  (let ((buffer (aio-await (aio-url-retrieve url))))
    (with-current-buffer buffer
      (prog1 (buffer-string)
        (kill-buffer)))))

A function defined with aio-defun is just like defun except that it can use aio-await to pause and wait on any other function defined with aio-defun — or, more specifically, any function that returns a promise. Borrowing Python parlance: returning a promise makes a function awaitable. If there’s an error, it’s delivered as an error signal from aio-url-retrieve, just like the first example. When called, this function returns immediately with a promise object that represents a future result. The caller might look like this:

(defcustom fortune-url ...)

(aio-defun display-fortune ()
  (interactive)
  (message "%s" (aio-await (fetch-fortune-3 fortune-url))))

How wonderfully clean that looks! And, yes, it even works with interactive like that. I can M-x display-fortune and a fortune is printed in the minibuffer as soon as the result arrives from the server. In the meantime Emacs doesn’t block and I can continue my work.

You can’t do anything you couldn’t already do before. It’s just a nicer way to organize the same callbacks: implicit rather than explicit.

Promises, simplified

The core object at play is the promise. Promises are already a rather simple concept, but aio promises have been distilled to their essence, as they’re only needed for this singular purpose. More on this later.

As I said, a promise represents a future result. In practical terms, a promise is just an object to which one can subscribe with a callback. When the result is ready, the callbacks are invoked. Another way to put it is that promises reify the concept of callbacks. A callback is no longer just the idea of an extra argument on a function. It’s a first-class thing that itself can be passed around as a value.

Promises have two slots: the final promise result and a list of subscribers. A nil result means the result hasn’t been computed yet. It’s so simple I’m not even bothering with cl-struct.

(defun aio-promise ()
  "Create a new promise object."
  (record 'aio-promise nil ()))

(defsubst aio-promise-p (object)
  (and (eq 'aio-promise (type-of object))
       (= 3 (length object))))

(defsubst aio-result (promise)
  (aref promise 1))

To subscribe to a promise, use aio-listen:

(defun aio-listen (promise callback)
  (let ((result (aio-result promise)))
    (if result
        (run-at-time 0 nil callback result)
      (push callback (aref promise 2)))))

If the result isn’t ready yet, add the callback to the list of subscribers. If the result is ready, call the callback in the next event loop turn using run-at-time. This is important because it keeps all the asynchronous components isolated from one another. They won’t see each other’s frames on the call stack, nor frames from aio. This is so important that the Promises/A+ specification is explicit about it.

The other half of the equation is resolving a promise, which is done with aio-resolve. Unlike other promises, aio promises don’t care whether the promise is being fulfilled (success) or rejected (error). Instead a promise is resolved using a value function — or, usually, a value closure. Subscribers receive this value function and extract the value by invoking it with no arguments.

Why? This lets the promise’s resolver decide the semantics of the result. Instead of returning a value, this function can instead signal an error, propagating an error signal that terminated an async function. Because of this, the promise doesn’t need to know how it’s being resolved.

When a promise is resolved, subscribers are each scheduled in their own event loop turns in the same order that they subscribed. If a promise has already been resolved, nothing happens. (Thought: Perhaps this should be an error in order to catch API misuse?)

(defun aio-resolve (promise value-function)
  (unless (aio-result promise)
    (let ((callbacks (nreverse (aref promise 2))))
      (setf (aref promise 1) value-function
            (aref promise 2) ())
      (dolist (callback callbacks)
        (run-at-time 0 nil callback value-function)))))

If you’re not an async function, you might subscribe to a promise like so:

(aio-listen promise (lambda (v)
                      (message "%s" (funcall v))))

The simplest example of a non-async function that creates and delivers on a promise is a “sleep” function:

(defun aio-sleep (seconds &optional result)
  (let ((promise (aio-promise))
        (value-function (lambda () result)))
    (prog1 promise
      (run-at-time seconds nil
                   #'aio-resolve promise value-function))))

Similarly, here’s a “timeout” promise that delivers a special timeout error signal at a given time in the future.

(defun aio-timeout (seconds)
  (let ((promise (aio-promise))
        (value-function (lambda () (signal 'aio-timeout nil))))
    (prog1 promise
      (run-at-time seconds nil
                   #'aio-resolve promise value-function))))

That’s all there is to promises.

Evaluate in the context of a promise

Before we get into pausing functions, let’s deal with the slightly simpler matter of delivering their return values using a promise. What we need is a way to evaluate a “body” and capture its result in a promise. If the body exits due to a signal, we want to capture that as well.

Here’s a macro that does just this:

(defmacro aio-with-promise (promise &rest body)
  `(aio-resolve ,promise
                (condition-case error
                    (let ((result (progn ,@body)))
                      (lambda () result))
                  (error (lambda ()
                           (signal (car error) ; rethrow
                                   (cdr error)))))))

The body result is captured in a closure and delivered to the promise. If there’s an error signal, it’s “rethrown” into subscribers by the promise’s value function.

This is where Emacs Lisp has a serious weak spot. There’s not really a concept of rethrowing a signal. Unlike a language with explicit exception objects that can capture a snapshot of the backtrace, the original backtrace is completely lost where the signal is caught. There’s no way to “reattach” it to the signal when it’s rethrown. This is unfortunate because it would greatly help debugging if you got to see the full backtrace on the other side of the promise.

Async functions

So we have promises and we want to pause a function on a promise. Generators have iter-yield for pausing an iterator’s execution. To tackle this problem:

  1. Yield the promise to pause the iterator.
  2. Subscribe a callback on the promise that continues the generator (iter-next) with the promise’s result as the yield result.

All the hard work is done in either side of the yield, so aio-await is just a simple wrapper around iter-yield:

(defmacro aio-await (expr)
  `(funcall (iter-yield ,expr)))

Remember, that funcall is here to extract the promise value from the value function. If it signals an error, this propagates directly into the iterator just as if it had been a direct call — minus an accurate backtrace.

So aio-lambda / aio-defun needs to wrap the body in a generator (iter-lambda), invoke it to produce a generator, then drive the generator using callbacks. Here’s a simplified, unhygienic definition of aio-lambda:

(defmacro aio-lambda (arglist &rest body)
  `(lambda (&rest args)
     (let ((promise (aio-promise))
           (iter (apply (iter-lambda ,arglist
                          (aio-with-promise promise
                            ,@body))
                        args)))
       (prog1 promise
         (aio--step iter promise nil)))))

The body is evaluated inside aio-with-promise with the result delivered to the promise returned directly by the async function.

Before returning, the iterator is handed to aio--step, which drives the iterator forward until it delivers its first promise. When the iterator yields a promise, aio--step attaches a callback back to itself on the promise as described above. Immediately driving the iterator up to the first yielded promise “primes” it, which is important for getting the ball rolling on any asynchronous operations.

If the iterator ever yields something other than a promise, it’s delivered right back into the iterator.

(defun aio--step (iter promise yield-result)
  (condition-case _
      (cl-loop for result = (iter-next iter yield-result)
               then (iter-next iter (lambda () result))
               until (aio-promise-p result)
               finally (aio-listen result
                                   (lambda (value)
                                     (aio--step iter promise value))))
    (iter-end-of-sequence)))

When the iterator is done, nothing more needs to happen since the iterator resolves its own return value promise.

The definition of aio-defun just uses aio-lambda with defalias. There’s nothing to it.

That’s everything you need! Everything else in the package is merely useful, awaitable functions like aio-sleep and aio-timeout.

Composing promises

Unfortunately url-retrieve doesn’t support timeouts. We can work around this by composing two promises: a url-retrieve promise and an aio-timeout promise. First define a promise-returning function, aio-select, that takes a list of promises and returns (as another promise) the first promise to resolve:

(defun aio-select (promises)
  (let ((result (aio-promise)))
    (prog1 result
      (dolist (promise promises)
        (aio-listen promise (lambda (_)
                              (aio-resolve
                               result
                               (lambda () promise))))))))

We give aio-select both our url-retrieve and timeout promises, and it tells us which resolved first:

(aio-defun fetch-fortune-4 (url timeout)
  (let* ((promises (list (aio-url-retrieve url)
                         (aio-timeout timeout)))
         (fastest (aio-await (aio-select promises)))
         (buffer (aio-await fastest)))
    (with-current-buffer buffer
      (prog1 (buffer-string)
        (kill-buffer)))))

Cool! Note: This will not actually cancel the URL request, just move the async function forward earlier and prevent it from getting the result.

Threads

Despite aio being entirely about managing concurrent, asynchronous operations, it has nothing at all to do with threads — as in Emacs 26’s support for kernel threads. All async functions and promise callbacks are expected to run only on the main thread. That’s not to say an async function can’t await on a result from another thread. It just must be done very carefully.

Processes

The package also includes two functions for realizing promises on processes, whether they be subprocesses or network sockets.

For example, this function loops over each chunk of output (typically 4kB) from the process, as delivered to a filter function:

(aio-defun process-chunks (process)
  (cl-loop for chunk = (aio-await (aio-process-filter process))
           while chunk
           do (... process chunk ...)))

Exercise for the reader: Write an awaitable function that returns a line at a time rather than a chunk at a time. You can build it on top of aio-process-filter.

I considered wrapping functions like start-process so that their aio versions would return a promise representing some kind of result from the process. However there are so many different ways to create and configure processes that I would have ended up duplicating all the process functions. Focusing on the filter and sentinel, and letting the caller create and configure the process is much cleaner.

Unfortunately Emacs has no asynchronous API for writing output to a process. Both process-send-string and process-send-region will block if the pipe or socket is full. There is no callback, so you cannot await on writing output. Maybe there’s a way to do it with a dedicated thread?

Another issue is that the process-send-* functions are preemptible, made necessary because they block. The aio-process-* functions leave a gap (i.e. between filter awaits) where no filter or sentinel function is attached. It’s a consequence of promises being single-fire. The gap is harmless so long as the async function doesn’t await something else or get preempted. This needs some more thought.

Update: These process functions no longer exist and have been replaced by a small framework for building chains of promises. See aio-make-callback.

Testing aio

The test suite for aio is a bit unusual. Emacs’ built-in test suite, ERT, doesn’t support asynchronous tests. Furthermore, tests are generally run in batch mode, where Emacs invokes a single function and then exits rather than pump an event loop. Batch mode can only handle asynchronous process I/O, not the async functions of aio. So it’s not possible to run the tests in batch mode.

Instead I hacked together a really crude callback-based test suite. It runs in non-batch mode and writes the test results into a buffer (run with make check). Not ideal, but it works.

One of the tests is a sleep sort (with reasonable tolerances). It’s a pretty neat demonstration of what you can do with aio:

(aio-defun sleep-sort (values)
  (let ((promises (mapcar (lambda (v) (aio-sleep v v)) values)))
    (cl-loop while promises
             for next = (aio-await (aio-select promises))
             do (setf promises (delq next promises))
             collect (aio-await next))))

To see it in action (M-x sleep-sort-demo):

(aio-defun sleep-sort-demo ()
  (interactive)
  (let ((values '(0.1 0.4 1.1 0.2 0.8 0.6)))
    (message "%S" (aio-await (sleep-sort values)))))

Async/await is pretty awesome

I’m quite happy with how this all came together. Once I had the concepts straight — particularly resolving to value functions — everything made sense and all the parts fit together well, and mostly by accident. That feels good.

Python Decorators: Syntactic Artificial Sweetener

Python has a feature called function decorators. With a little bit of syntax, the behavior of a function or class can be modified in useful ways. Python comes with a few decorators, but most of the useful ones are found in third-party libraries.

PEP 318 suggests a very simple, but practical decorator called synchronized, though it doesn’t provide a concrete example. Consider this function that increments a global counter:

counter = 0

def increment():
    global counter
    counter = counter + 1

If this function is called from multiple threads, there’s a race condition — though, at least for CPython, it’s not a data race thanks to the Global Interpreter Lock (GIL). Incrementing the counter is not an atomic operation, as illustrated by its byte code:

 0 LOAD_GLOBAL              0 (counter)
 3 LOAD_CONST               1 (1)
 6 BINARY_ADD
 7 STORE_GLOBAL             0 (counter)
10 LOAD_CONST               0 (None)
13 RETURN_VALUE

The variable is loaded, operated upon, and stored. Another thread could be scheduled between any of these instructions and cause an undesired result. It’s easy to see that in practice:

from threading import Thread

def worker():
    for i in range(200000):
        increment()

threads = [Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)

The increment function is called exactly 1.6 million times, but on my system I get different results on each run:

$ python3 example.py 
1306205
$ python3 example.py 
1162418
$ python3 example.py 
1076801

I could change the definition of increment() to use synchronization, but wouldn’t it be nice if I could just tell Python to synchronize this function? This is where a function decorator shines:

from threading import Lock

def synchronized(f):
    lock = Lock()
    def wrapper():
        with lock:
            return f()
    return wrapper

The synchronized function is a higher order function that accepts a function and returns a function — or, more specifically, a callable. The purpose is to wrap and decorate the function it’s given. In this case the function is wrapped in a mutual exclusion lock. Note: This implementation is very simple and only works for functions that accept no arguments.

To use it, I just add a single line to increment:

@synchronized
def increment():
    global counter
    counter = counter + 1

With this change my program now always prints 1600000.

Syntactic “sugar”

Everyone is quick to point out that this is just syntactic sugar, and that you can accomplish this without the @ syntax. For example, the last definition of increment is equivalent to:

def increment():
    ...

increment = synchronized(increment)

Decorators can also be parameterized. For example, Python’s functools module has an lru_cache decorator for memoizing a function:

@lru_cache(maxsize=32)
def expensive(n):
    ...

Which is equivalent to this very direct source transformation:

def expensive(n):
    ...

expensive = lru_cache(maxsize=32)(expensive)

So what comes after the @ isn’t just a name. In fact, it looks like it can be any kind of expression that evaluates to a function decorator. Or is it?

Syntactic artificial sweetener

Reality is often disappointing. Let’s try using an “identity” decorator defined using lambda. This decorator will accomplish nothing, but it will test if we can decorate a function using a lambda expression.

@lambda f: f
def foo():
    pass

But Python complains:

    @lambda f: f
          ^
SyntaxError: invalid syntax

Maybe Python is absolutely literal about the syntax sugar thing, and it’s more like a kind of macro replacement. Let’s try wrapping it in parentheses:

@(lambda f: f)
def foo(n):
    pass

Nope, same error, but now pointing at the opening parenthesis. Getting desperate now:

@[synchronized][0]
def foo():
    pass

Again, syntax error. What’s going on?

Pattern matching

The problem is that the Python grammar doesn’t accept an arbitrary expression after @. It matches a very specific pattern that just so happens to look like a Python expression. It’s not syntactic sugar, it’s syntactic artificial sweetener!

decorator ::= "@" dotted_name ["(" [argument_list [","]] ")"] NEWLINE

In a way, this puts Python in the ranks of PHP 5 and Matlab: two languages with completely screwed up grammars that can only parse specific constructions that the developers had anticipated. For example, in PHP 5 (fixed in PHP 7):

function foo() {
    return function() {
        return 0;
    };
}

foo()();

That is a syntax error:

PHP Parse error:  syntax error, unexpected '(', expecting ',' or ';'

Or in any version of Matlab:

    magic(4)(:)

That is a syntax error:

Unbalanced or unexpected parenthesis or bracket

In Python’s defense, this strange, limited syntax is only in a single place rather than everywhere, but I still wonder why it was defined that way.

Update: Clément Pit-Claudel pointed out the explanation in the PEP, which references a 2004 email by Guido van Rossum:

I have a gut feeling about this one. I’m not sure where it comes from, but I have it. It may be that I want the compiler to be able to recognize certain decorators.

So while it would be quite easy to change the syntax to @test in the future, I’d like to stick with the more restricted form unless a real use case is presented where allowing @test would increase readability. (@foo().bar() doesn’t count because I don’t expect you’ll ever need that).

The CPython Bytecode Compiler is Dumb

This article was discussed on Hacker News.

Due to sheer coincidence of several unrelated tasks converging on Python at work, I recently needed to brush up on my Python skills. So far for me, Python has been little more than a fancy extension language for BeautifulSoup, though I also used it to participate in the recent tradition of writing one’s own static site generator, in this case for my wife’s photo blog. I’ve been reading through Fluent Python by Luciano Ramalho, and it’s been quite effective at getting me up to speed.

As I write Python, like with Emacs Lisp, I can’t help but consider what exactly is happening inside the interpreter. I wonder if the code I’m writing is putting undue constraints on the bytecode compiler and limiting its options. Ultimately I’d like the code I write to drive the interpreter efficiently and effectively. The Zen of Python says there should be “only one obvious way to do it,” but in practice there’s a lot of room for expression. Given multiple ways to express the same algorithm or idea, I tend to prefer the one that compiles to the more efficient bytecode.

Fortunately CPython, the main and most widely used implementation of Python, is very transparent about its bytecode, which is easy to inspect and reason about. The disassembly listing is easy to read and understand, and I can always follow it without consulting the documentation. This contrasts sharply with modern JavaScript engines and their opaque use of JIT compilation, where performance is guided by obeying certain patterns (hidden classes, etc.), helping the compiler understand my program’s types, and being careful not to unnecessarily constrain the compiler.

So, besides just catching up with Python the language, I’ve been studying the bytecode disassembly of the functions that I write. One fact has become quite apparent: the CPython bytecode compiler is pretty dumb. With a few exceptions, it’s a very literal translation of a Python program, and there is almost no optimization. Below I’ll demonstrate a case where it’s possible to detect one of the missed optimizations without inspecting the bytecode disassembly thanks to a small abstraction leak in the optimizer.

To be clear: This isn’t to say CPython is bad, or even that it should necessarily change. In fact, as I’ll show, dumb bytecode compilers are par for the course. In the past I’ve lamented how the Emacs Lisp compiler could do a better job, but CPython and Lua are operating at the same level. There are benefits to a dumb and straightforward bytecode compiler: the compiler itself is simpler, easier to maintain, and more amenable to modification (e.g. as Python continues to evolve). It’s also easier to debug Python (pdb) because it’s such a close match to the source listing.

Update: Darius Bacon points out that Guido van Rossum himself said, “Python is about having the simplest, dumbest compiler imaginable.” So this is all very much by design.

The consensus seems to be that if you want or need better performance, use something other than Python. (And if you can’t do that, at least use PyPy.) That’s a fairly reasonable and healthy goal. Still, if I’m writing Python, I’d like to do the best I can, which means exploiting the optimizations that are available when possible.

Disassembly examples

I’m going to compare three bytecode compilers in this article: CPython 3.7, Lua 5.3, and Emacs 26.1. All three languages are dynamically typed, are primarily executed on a bytecode virtual machine, and make it easy to access their disassembly listings. One caveat: CPython and Emacs use a stack-based virtual machine while Lua uses a register-based virtual machine.

For CPython I’ll be using the dis module. For Emacs Lisp I’ll use M-x disassemble, and all code will use lexical scoping. In Lua I’ll use lua -l on the command line.

Local variable elimination

Will the bytecode compiler eliminate local variables? Keeping the variable around potentially involves allocating memory for it, assigning to it, and accessing it. Take this example:

def foo():
    x = 0
    y = 1
    return x

This function is equivalent to:

def foo():
    return 0

Despite this, CPython completely misses this optimization for both x and y:

  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               0 (x)
  3           4 LOAD_CONST               2 (1)
              6 STORE_FAST               1 (y)
  4           8 LOAD_FAST                0 (x)
             10 RETURN_VALUE

It assigns both variables, and even loads again from x for the return. Missed optimizations, but, as I said, by keeping these variables around, debugging is more straightforward. Users can always inspect variables.

How about Lua?

function foo()
    local x = 0
    local y = 1
    return x
end

It also misses this optimization, though it matters a little less due to its architecture (the return instruction references a register regardless of whether or not that register is allocated to a local variable):

        1       [2]     LOADK           0 -1    ; 0
        2       [3]     LOADK           1 -2    ; 1
        3       [4]     RETURN          0 2
        4       [5]     RETURN          0 1

Emacs Lisp also misses it:

(defun foo ()
  (let ((x 0)
        (y 1))
    x))

Disassembly:

0	constant  0
1	constant  1
2	stack-ref 1
3	return

All three are on the same page.

Constant folding

Does the bytecode compiler evaluate simple constant expressions at compile time? This is simple and everyone does it.

def foo():
    return 1 + 2 * 3 / 4

Disassembly:

  2           0 LOAD_CONST               1 (2.5)
              2 RETURN_VALUE

Lua:

function foo()
    return 1 + 2 * 3 / 4
end

Disassembly:

        1       [2]     LOADK           0 -1    ; 2.5
        2       [2]     RETURN          0 2
        3       [3]     RETURN          0 1

Emacs Lisp:

(defun foo ()
  (+ 1 (/ (* 2 3) 4.0)))

Disassembly:

0	constant  2.5
1	return

That’s something we can count on so long as the operands are all numeric literals (or also, for Python, string literals) that are visible to the compiler. Don’t count on your operator overloads to work here, though.

Allocation optimization

Optimizers often perform escape analysis, to determine if objects allocated in a function ever become visible outside of that function. If they don’t then these objects could potentially be stack-allocated (instead of heap-allocated) or even be eliminated entirely.

None of the bytecode compilers are this sophisticated. However CPython does have a trick up its sleeve: tuple optimization. Since tuples are immutable, in certain circumstances CPython will reuse them and avoid both the constructor and the allocation.

def foo():
    return (1, 2, 3)

Check it out, the tuple is used as a constant:

  2           0 LOAD_CONST               1 ((1, 2, 3))
              2 RETURN_VALUE

Which we can detect by evaluating foo() is foo(), which is True. Deviate from this too much, though, and the optimization is disabled. Remember how CPython can’t optimize away variables, and that they break constant folding? They break this, too:

def foo():
    x = 1
    return (x, 2, 3)

Disassembly:

  2           0 LOAD_CONST               1 (1)
              2 STORE_FAST               0 (x)
  3           4 LOAD_FAST                0 (x)
              6 LOAD_CONST               2 (2)
              8 LOAD_CONST               3 (3)
             10 BUILD_TUPLE              3
             12 RETURN_VALUE

This function might document that it always returns a simple tuple, but we can tell if it’s being optimized or not using is like before: foo() is foo() is now False! In some future version of Python with a cleverer bytecode compiler, that expression might evaluate to True. (Unless the Python language specification is specific about this case, which I didn’t check.)

Note: Curiously PyPy replicates this exact behavior when examined with is. Was that deliberate? I’m impressed that PyPy matches CPython’s semantics so closely here.

Putting a mutable value, such as a list, in the tuple will also break this optimization. But that’s not the compiler being dumb. That’s a hard constraint on the compiler: the caller might change the mutable component of the tuple, so it must always return a fresh copy.

Neither Lua nor Emacs Lisp has a language-level equivalent of an immutable tuple, so there’s nothing to compare.

Other than the tuples situation in CPython, none of the bytecode compilers eliminate unnecessary intermediate objects.

def foo():
    return [1024][0]

Disassembly:

  2           0 LOAD_CONST               1 (1024)
              2 BUILD_LIST               1
              4 LOAD_CONST               2 (0)
              6 BINARY_SUBSCR
              8 RETURN_VALUE

Lua:

function foo()
    return ({1024})[1]
end

Disassembly:

        1       [2]     NEWTABLE        0 1 0
        2       [2]     LOADK           1 -1    ; 1024
        3       [2]     SETLIST         0 1 1   ; 1
        4       [2]     GETTABLE        0 0 -2  ; 1
        5       [2]     RETURN          0 2
        6       [3]     RETURN          0 1

Emacs Lisp:

(defun foo ()
  (car (list 1024)))

Disassembly:

0	constant  1024
1	list1
2	car
3	return

Don’t expect too much

I could go on with lots of examples, looking at loop optimizations and so on, and each case is almost certainly unoptimized. The general rule of thumb is to simply not expect much from these bytecode compilers. They’re very literal in their translation.

Working so much in C has put me in the habit of expecting all obvious optimizations from the compiler. This frees me to be more expressive in my code. Lots of things are cost-free thanks to these optimizations, such as breaking a complex expression up into several variables, naming my constants, or not using a local variable to manually cache memory accesses. I’m confident the compiler will optimize away my expressiveness. The catch is that clever compilers can take things too far, so I’ve got to be mindful of how they might undermine my intentions — i.e. when I’m doing something unusual or not strictly permitted.

These bytecode compilers will never truly surprise me. The cost is that being more expressive in Python, Lua, or Emacs Lisp may reduce performance at run time because it shows in the bytecode. Usually this doesn’t matter, but sometimes it does.


The Day I Fell in Love with Fuzzing

This article was discussed on Hacker News and on reddit.

In 2007 I wrote a pair of modding tools, binitools, for a space trading and combat simulation game named Freelancer. The game stores its non-art assets in the format of “binary INI” files, or “BINI” files. The motivation for the binary format over traditional INI files was probably performance: it’s faster to load and read these files than it is to parse arbitrary text in INI format.

Much of the in-game content can be changed simply by modifying these files — changing item names, editing commodity prices, tweaking ship statistics, or even adding new ships to the game. The binary nature makes them unsuitable for in-place modification, so the natural approach is to convert them to text INI files, make the desired modifications using a text editor, then convert back to the BINI format and replace the file in the game’s installation.

I didn’t reverse engineer the BINI format, nor was I the first person to create tools to edit them. The existing tools weren’t to my tastes, and I had my own vision for how they should work — an interface more closely following the Unix tradition despite the target being a Windows game.

When I got started, I had just learned how to use yacc (really Bison) and lex (really flex), as well as Autoconf, so I went all-in with these newly-discovered tools. It was exciting to try them out in a real-world situation, though I slavishly aped the practices of other open source projects without really understanding why things were the way they were. Due to the use of yacc/lex and the configure script build, compiling the project required a full, Unix-like environment. This is all visible in the original version of the source.

The project was moderately successful in two ways. First, I was able to use the tools to modify the game. Second, other people were using the tools, since the binaries I built show up in various collections of Freelancer modding tools online.

The Rewrite

That’s the way things were until mid-2018 when I revisited the project. Ever look at your own old code and wonder what the heck you were thinking? My INI format was far more rigid and strict than necessary, I was doing questionable things when writing out binary data, and the build wasn’t even working correctly.

With an additional decade of experience under my belt, I knew I could do way better if I were to rewrite these tools today. So, over the course of a few days, I did, from scratch. That’s what’s visible in the master branch today.

I like to keep things simple, which meant no more Autoconf and instead a simple, portable Makefile. No more yacc or lex, and instead a hand-coded parser. Using only conforming, portable C. The result was so simple that I can build using Visual Studio in a single, short command, so the Makefile isn’t all that necessary. With one small tweak (replace stdint.h with a typedef), I can even build and run binitools in DOS.

The new version is faster, leaner, cleaner, and simpler. It’s far more flexible about its INI input, so it’s easier to use. But is it more correct?

Fuzzing

I’ve been interested in fuzzing for years, especially american fuzzy lop, or afl. However, I wasn’t having success with it. I’d fuzz some of the tools I use regularly, and it wouldn’t find anything of note, at least not before I gave up. I fuzzed my JSON library, and somehow it turned up nothing. Surely my JSON parser couldn’t be that robust already, could it? Fuzzing just wasn’t accomplishing anything for me. (As it turns out, my JSON library is quite robust, thanks in large part to various contributors!)

So I’ve got this relatively new INI parser, and while it can successfully parse and correctly re-assemble the game’s original set of BINI files, it hasn’t really been exercised that much. Surely there’s something in here for a fuzzer to find. Plus I don’t even have to write a line of code in order to run afl against it. The tools already read from standard input by default, which is perfect.

Assuming you’ve got the necessary tools installed (make, gcc, afl), here’s how easy it is to start fuzzing binitools:

$ make CC=afl-gcc
$ mkdir in out
$ echo '[x]' > in/empty
$ afl-fuzz -i in -o out -- ./bini

The bini utility takes INI as input and produces BINI as output, so it’s far more interesting to fuzz than its inverse, unbini. Since unbini parses relatively simple binary data, there are (probably) no bugs for the fuzzer to find. I did try anyway just in case.

In my example above, I swapped out the default compiler for afl’s GCC wrapper (CC=afl-gcc). It calls GCC in the background, but in doing so adds its own instrumentation to the binary. When fuzzing, afl-fuzz uses that instrumentation to monitor the program’s execution path. The afl whitepaper explains the technical details.

I also created input and output directories, placing a minimal, working example into the input directory, which gives afl a starting point. As afl runs, it mutates a queue of inputs and observes the changes on the program’s execution. The output directory contains the results and, more importantly, a corpus of inputs that cause unique execution paths. In other words, the fuzzer output will be lots of inputs that exercise many different edge cases.

The most exciting and dreaded result is a crash. The first time I ran it against binitools, bini had many such crashes. Within minutes, afl was finding a number of subtle and interesting bugs in my program, which was incredibly useful. It even discovered an unlikely stale pointer bug by exercising different orderings for various memory allocations. This particular bug was the turning point that made me realize the value of fuzzing.

Not all the bugs it found led to crashes. I also combed through the outputs to see what sorts of inputs were succeeding, what was failing, and observe how my program handled various edge cases. It was rejecting some inputs I thought should be valid, accepting some I thought should be invalid, and interpreting some in ways I hadn’t intended. So even after I fixed the crashing inputs, I still made tweaks to the parser to fix each of these troublesome inputs.

Building a test suite

Once I combed out all the fuzzer-discovered bugs, and I agreed with the parser on how all the various edge cases should be handled, I turned the fuzzer’s corpus into a test suite — though not directly.

I had run the fuzzer in parallel — a process that is explained in the afl documentation — so I had lots of redundant inputs. By redundant I mean that the inputs are different but have the same execution path. Fortunately afl has a tool to deal with this: afl-cmin, the corpus minimization tool. It eliminates all the redundant inputs.

Second, many of these inputs were longer than necessary in order to invoke their unique execution path. There’s afl-tmin, the test case minimizer, which I used to further shrink my test corpus.

I sorted the valid from invalid inputs and checked them into the repository. Have a look at all the wacky inputs invented by the fuzzer starting from my single, minimal input:

This essentially locks down the parser, and the test suite ensures a particular build behaves in a very specific way. This is most useful for ensuring that builds on other platforms and by other compilers are indeed behaving identically with respect to their outputs. My test suite even revealed a bug in diet libc, as binitools doesn’t pass the tests when linked against it. If I were to make non-trivial changes to the parser, I’d essentially need to scrap the current test suite and start over, having afl generate an entire new corpus for the new parser.

Fuzzing has certainly proven itself to be a powerful technique. It found a number of bugs that I likely wouldn’t have otherwise discovered on my own. I’ve since gotten more savvy on its use and have used it on other software — not just software I’ve written myself — and discovered more bugs. It’s got a permanent slot on my software developer toolbelt.

A JavaScript Typed Array Gotcha

JavaScript’s prefix increment and decrement operators can be surprising when applied to typed arrays. It caught me by surprise when I was porting some C code over to JavaScript. Just using your brain to execute this code, what do you believe is the value of r?

let array = new Uint8Array([255]);
let r = ++array[0];

The increment and decrement operators originated in the B programming language. Its closest living relative today is C, and, as far as these operators are concerned, C can be considered an ancestor of JavaScript. So what is the value of r in this similar C code?

uint8_t array[] = {255};
int r = ++array[0];

Of course, if they were the same then there would be nothing to write about, so that should make it easier to guess if you aren’t sure. The answer: In JavaScript, r is 256. In C, r is 0.

What happened to me was that I wrote an 80-bit integer increment routine in C like this:

uint8_t array[10];
/* ... */
for (int i = 9; i >= 0; i--)
    if (++array[i])
        break;

But I was getting the wrong result over in JavaScript from essentially the same code:

let array = new Uint8Array(10);
/* ... */
for (let i = 9; i >= 0; i--)
    if (++array[i])
        break;

So what’s going on here?

JavaScript specification

The ES5 specification says this about the prefix increment operator:

  1. Let expr be the result of evaluating UnaryExpression.

  2. Throw a SyntaxError exception if the following conditions are all true: [omitted]

  3. Let oldValue be ToNumber(GetValue(expr)).

  4. Let newValue be the result of adding the value 1 to oldValue, using the same rules as for the + operator (see 11.6.3).

  5. Call PutValue(expr, newValue).

  6. Return newValue.

So, oldValue is 255. This is a double precision float because all numbers in JavaScript (outside of the bitwise operations) are double precision floating point. Add 1 to this value to get 256, which is newValue. When newValue is stored in the array via PutValue(), it’s converted to an unsigned 8-bit integer, which truncates it to 0.

However, newValue is returned, not the value that was actually stored in the array!

Since JavaScript is dynamically typed, this difference didn’t actually matter until typed arrays were introduced. I suspect if typed arrays had been in JavaScript from the beginning, the specified behavior would be more in line with C.

This behavior isn’t limited to the prefix operators. Consider assignment:

let array = new Uint8Array([255]);
let r = (array[0] = array[0] + 1);
let s = (array[0] += 1);

Both r and s will still be 256. The result of the assignment operators is a similar story:

LeftHandSideExpression = AssignmentExpression is evaluated as follows:

  1. Let lref be the result of evaluating LeftHandSideExpression.

  2. Let rref be the result of evaluating AssignmentExpression.

  3. Let rval be GetValue(rref).

  4. Throw a SyntaxError exception if the following conditions are all true: [omitted]

  5. Call PutValue(lref, rval).

  6. Return rval.

Again, the result of the expression is independent of how it was stored with PutValue().

C specification

I’ll be referencing the original C89/C90 standard. The C specification requires a little more work to get to the bottom of the issue. Starting with 3.3.3.1 (Prefix increment and decrement operators):

The value of the operand of the prefix ++ operator is incremented. The result is the new value of the operand after incrementation. The expression ++E is equivalent to (E+=1).

Later in 3.3.16.2 (Compound assignment):

A compound assignment of the form E1 op = E2 differs from the simple assignment expression E1 = E1 op (E2) only in that the lvalue E1 is evaluated only once.

Then finally in 3.3.16 (Assignment operators):

An assignment operator stores a value in the object designated by the left operand. An assignment expression has the value of the left operand after the assignment, but is not an lvalue.

So the result is explicitly the value after assignment. Let’s look at this step by step after rewriting the expression.

int r = (array[0] = array[0] + 1);

In C, all integer operations are performed with at least int precision. Smaller integers are implicitly promoted to int before the operation. The value of array[0] is 255, and, since uint8_t is smaller than int, it gets promoted to int. Additionally, the literal constant 1 is also an int, so there are actually two reasons for this promotion.

So since these are int values, the result of the addition is 256, like in JavaScript. To store the result, this value is then demoted to uint8_t and truncated to 0. Finally, this post-assignment 0 is the result of the expression, not the right-hand result as in JavaScript.
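
A minimal, self-contained program makes the promotion and the truncation concrete:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint8_t array[] = {255};
    int sum = array[0] + 1;     /* arithmetic happens at int width: 256 */
    int r = ++array[0];         /* stored back as uint8_t, truncated: 0 */
    printf("%d %d %d\n", sum, r, array[0]);     /* prints: 256 0 0 */
    return 0;
}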

Specifications are useful

These situations are why I prefer programming languages that have a formal and approachable specification. If there’s no specification and I’m observing undocumented, idiosyncratic behavior, is this just some subtle quirk of the current implementation — e.g. something that might change without notice in the future — or is it intended behavior that I can rely upon for correctness?

A Survey of $RANDOM

Most Bourne shell clones support a special RANDOM environment variable that evaluates to a random value between 0 and 32,767 (i.e. 15 bits). Assignment to the variable seeds the generator. This variable is an extension and did not appear in the original Unix Bourne shell. Despite this, the different Bourne-like shells that implement it have converged on the same interface, but only the interface: each implementation differs in interesting ways. In this article we’ll explore how $RANDOM is implemented in various Bourne-like shells.

Unfortunately I was unable to determine the origin of $RANDOM. Nobody was doing a good job tracking source code changes before the mid-1990s, so that history appears to be lost. Bash was first released in 1989, but the earliest version I could find was 1.14.7, released in 1996. KornShell was first released in 1983, but the earliest source I could find was from 1993. In both cases $RANDOM already existed. My guess is that it first appeared in one of these two shells, probably KornShell.

Update: Quentin Barnes has informed me that his 1986 copy of KornShell (a.k.a. ksh86) implements $RANDOM. This predates Bash and makes it likely that this feature originated in KornShell.

Bash

Of all the shells I’m going to discuss, Bash has the most interesting history. It never made use of srand(3) / rand(3) and instead uses its own generator, which is generally what I prefer. Prior to Bash 4.0, it used the crummy linear congruential generator (LCG) found in the C89 standard:

static unsigned long rseed = 1;

static int
brand ()
{
  rseed = rseed * 1103515245 + 12345;
  return ((unsigned int)((rseed >> 16) & 32767));
}

For some reason it was naïvely decided that $RANDOM should never produce the same value twice in a row. The caller of brand() filters the output and discards repeats before returning to the shell script. This actually reduces the quality of the generator further since it increases correlation between separate outputs.
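The filter itself is simple. Roughly speaking it looks like the sketch below, which is my own paraphrase of the idea rather than the literal Bash source; the function name is made up.

/* Sketch: keep drawing until the value differs from the last one
   handed to the script, then remember it for next time. */
static int last_value = -1;

static int
filtered_brand ()
{
  int v;

  do
    v = brand ();
  while (v == last_value);

  last_value = v;
  return v;
}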

When the shell starts up, rseed is seeded from the PID and the current time in seconds. These values are literally summed and used as the seed.

/* Note: not the literal code, but equivalent. */
rseed = getpid() + time(0);

Subshells, which fork and initially share an rseed, are given similar treatment:

rseed = rseed + getpid() + time(0);

Notice there’s no hashing or mixing of these values, so there’s no avalanche effect. Proper mixing would have prevented shells that start around the same time from having related initial random sequences.
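As an extreme illustration of how little plain addition mixes, consider two hypothetical shells where one starts a second later but happens to receive a PID one lower: the sum, and therefore the entire $RANDOM sequence, is identical. This is my own example with made-up values, not code from Bash.

#include <stdio.h>

int
main (void)
{
  /* Hypothetical PIDs and timestamps: one second apart, PIDs one apart,
     yet the additive "mixing" yields the same seed for both shells. */
  unsigned long seed_a = 12345UL + 1556000000UL;   /* pid + time */
  unsigned long seed_b = 12344UL + 1556000001UL;   /* (pid - 1) + (time + 1) */

  printf ("%lu %lu\n", seed_a, seed_b);            /* identical */
  return 0;
}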

With Bash 4.0, released in 2009, the algorithm was changed to a Park–Miller multiplicative LCG from 1988:

static int
brand ()
{
  long h, l;

  /* can't seed with 0. */
  if (rseed == 0)
    rseed = 123459876;
  h = rseed / 127773;
  l = rseed % 127773;
  rseed = 16807 * l - 2836 * h;
  return ((unsigned int)(rseed & 32767));
}

There’s actually a subtle mistake in this implementation compared to the generator described in the paper. This function will generate different numbers than the paper, and it will generate different numbers on different hosts! More on that later.

This algorithm is a much better choice than the previous LCG. There were many more options available in 2009 than in 1989, but, honestly, this generator is pretty reasonable for this application. Bash is so slow that you’re practically never going to generate enough numbers for the small state to matter. And since the Park–Miller algorithm is older than Bash, they could have used it from the start.

I considered submitting a patch to switch to something more modern. However, given Bash’s constraints, that’s easier said than done. Portability to weird systems is still a concern, and I expect they’d reject a patch that started making use of long long in the PRNG. They still support pre-ANSI C compilers that lack 64-bit arithmetic.

However, what still really could be improved is seeding. In Bash 4.x here’s what it looks like:

static void
seedrand ()
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  sbrand (tv.tv_sec ^ tv.tv_usec ^ getpid ());
}

Seeding is both better and worse. It’s better in that the seed comes from a higher resolution clock (microseconds), so two shells started close together in time have more variation. However, the values are “mixed” with XOR, which, in this case, is worse than addition.

For example, imagine two Bash shells started one microsecond apart, such that both tv_usec and getpid() are incremented by one. Those increments are likely to cancel each other out under XOR, and the two shells end up with the same seed.
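Here’s a tiny standalone demonstration of that cancellation, again with made-up values rather than anything taken from Bash. When both tv_usec and the PID happen to be even, each increment flips only the lowest bit, and the two flips cancel under XOR:

#include <stdio.h>

int
main (void)
{
  unsigned long sec  = 1556000000UL;   /* same second for both shells */
  unsigned long usec = 123456UL;       /* even, so +1 only flips bit 0 */
  unsigned long pid  = 10000UL;        /* even, so +1 only flips bit 0 */

  unsigned long seed_a = sec ^ usec ^ pid;
  unsigned long seed_b = sec ^ (usec + 1) ^ (pid + 1);

  printf ("%lu %lu\n", seed_a, seed_b);  /* identical seeds */
  return 0;
}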

Instead, each of those quantities should be hashed before mixing. Here’s a rough example using my triple32() hash (adapted to glorious GNU-style pre-ANSI C):

static unsigned long
triple32 (x)
     unsigned long x;
{
  x ^= x >> 17;
  x *= 0xed5ad4bbUL;
  x &= 0xffffffffUL;
  x ^= x >> 11;
  x *= 0xac4c1b51UL;
  x &= 0xffffffffUL;
  x ^= x >> 15;
  x *= 0x31848babUL;
  x &= 0xffffffffUL;
  x ^= x >> 14;
  return x;
}

static void
seedrand ()
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  sbrand (triple32 (tv.tv_sec) ^
          triple32 (triple32 (tv.tv_usec) ^ getpid ()));
}

Earlier I said there’s a mistake in the Bash implementation of Park–Miller. Take a closer look at the types and the assignment to rseed:

  /* The variables */
  long h, l;
  unsigned long rseed;

  /* The assignment */
  rseed = 16807 * l - 2836 * h;

The result of the subtraction can be negative, and that negative value is converted to unsigned long. The C standard says ULONG_MAX + 1 is added to bring the value back into range. ULONG_MAX varies by platform (typically long is either 32 or 64 bits), so the results also vary. Here’s how the paper defined it:

  unsigned long test;

  test = 16807 * l - 2836 * h;
  if (test > 0)
    rseed = test;
  else
    rseed = test + 2147483647;
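The wrap-around itself is just C’s unsigned conversion rule at work, and since the width of unsigned long differs between platforms, so does the wrapped value. Here’s a small standalone illustration of the mechanism; it’s my own example with arbitrary h and l, not code from Bash or the paper.

#include <inttypes.h>
#include <stdio.h>

int
main (void)
{
  /* e.g. l = 5 and h = 100 make the intermediate result negative */
  long result = 16807L * 5 - 2836L * 100;   /* 84035 - 283600 */

  /* Converting to unsigned adds (max + 1), so the width matters. */
  uint32_t r32 = (uint32_t)result;   /* as if unsigned long were 32 bits */
  uint64_t r64 = (uint64_t)result;   /* as if unsigned long were 64 bits */

  printf ("%" PRIu32 " %" PRIu64 "\n", r32, r64);
  return 0;
}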

As far as I can tell, this mistake doesn’t hurt the quality of the generator. It does mean, however, that the sequence depends on the width of long, which is easy to see with 32-bit and 64-bit builds of Bash:

$ 32/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 13634

$ 64/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 29115

Zsh

In contrast to Bash, Zsh is the most straightforward: defer to rand(3). Its $RANDOM can return the same value twice in a row, assuming that rand(3) does.

zlong
randomgetfn(UNUSED(Param pm))
{
    return rand() & 0x7fff;
}

void
randomsetfn(UNUSED(Param pm), zlong v)
{
    srand((unsigned int)v);
}

A cool side effect is that you could override it with a custom generator if you wanted:

int
rand(void)
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

Usage:

$ gcc -shared -fPIC -o rand.so rand.c
$ LD_PRELOAD=./rand.so zsh -c 'echo $RANDOM $RANDOM $RANDOM'
4 4 4

This trick also applies to the rest of the shells below.

KornShell (ksh)

KornShell originated in 1983, but it was finally released under an open source license in 2005. There’s a clone of KornShell called Public Domain Korn Shell (pdksh) that’s been forked a dozen different ways, but I’ll get to that next.

KornShell defers to rand(3), but it does some additional naïve filtering on the output. When the shell starts up, it generates 10 values from rand(). If any of them are larger than 32,767, it will shift all generated numbers right by three bits.

#define RANDMASK 0x7fff

    for (n = 0; n < 10; n++) {
        // Don't use lower bits when rand() generates large numbers.
        if (rand() > RANDMASK) {
            rand_shift = 3;
            break;
        }
    }

Why not just look at RAND_MAX? I guess they didn’t think of it.
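A hypothetical check along these lines would have done the job, deciding the shift once from RAND_MAX instead of sampling the generator (though see the update below for why this wasn’t an option):

#include <stdlib.h>

#define RANDMASK 0x7fff

/* Hypothetical alternative: no sampling loop needed. */
static int rand_shift = (RAND_MAX > RANDMASK) ? 3 : 0;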

Update: Quentin Barnes pointed out that RAND_MAX didn’t exist until POSIX standardization in 1988. The constant first appeared in Unix in 1990. This KornShell code either predates the standard or needed to work on systems that predate the standard.

Like Bash, KornShell doesn’t allow repeated values. I suspect one shell got this idea from the other.

    do {
        cur = (rand() >> rand_shift) & RANDMASK;
    } while (cur == last);

Who came up with this strange idea first?

OpenBSD’s Public Domain Korn Shell (pdksh)

I picked the OpenBSD variant of pdksh since it’s the only pdksh fork I ever touch in practice, and its $RANDOM is the most interesting of the pdksh forks — at least since 2014.

Like Zsh, pdksh simply defers to rand(3). However, OpenBSD’s rand(3) is infamously and proudly non-standard. By default it returns non-deterministic, cryptographic-quality results seeded from system entropy (via the misnamed arc4random(3)), à la /dev/urandom. Its $RANDOM inherits this behavior.

    setint(vp, (int64_t) (rand() & 0x7fff));

However, if a value is assigned to $RANDOM in order to seed it, it reverts to its old pre-2014 deterministic generation via srand_deterministic(3).

    srand_deterministic((unsigned int)intval(vp));

OpenBSD’s deterministic rand(3) is the crummy LCG from the C89 standard, just like Bash 3.x. So if you assign to $RANDOM, you’ll get nearly the same results as Bash 3.x and earlier — the only difference being that it can repeat numbers.

That’s a slick upgrade to the old interface without breaking anything, making it my favorite version of $RANDOM in any shell.
