<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged python at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/python/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/python/feed/"/>
  <updated>2026-04-07T03:24:16Z</updated>
  <id>urn:uuid:7a5665a7-6c04-451e-a43c-f7284c9cfcca</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>Assertions should be more debugger-oriented</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/06/26/"/>
    <id>urn:uuid:22ae914c-971b-4cee-ba48-a189db1b6df6</id>
    <updated>2022-06-26T18:51:04Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="python"/><category term="java"/>
    <content type="html">
      <![CDATA[<p>Prompted by <a href="https://www.youtube.com/watch?v=r9eQth4Q5jg">a 20 minute video</a>, over the past month I’ve improved my
debugger skills. I’d shamefully acquired a bad habit: avoiding a debugger
until exhausting dumber, insufficient methods. My <em>first</em> choice should be
a debugger, but I had allowed a bit of friction to dissuade me. With some
thoughtful practice and deliberate effort clearing the path, my bad habit
is finally broken — at least when a good debugger is available. It feels
like I’ve leveled up and, <a href="/blog/2017/04/01/">like touch typing</a>, this was a skill I’d
neglected far too long. One friction point was the less-than-optimal
<code class="language-plaintext highlighter-rouge">assert</code> feature in basically every programming language implementation.
It ought to work better with debuggers.</p>

<p>An assertion verifies a program invariant, and so if one fails then
there’s undoubtedly a defect in the program. In other words, assertions
make programs more sensitive to defects, allowing problems to be caught
more quickly and accurately. Counter-intuitively, crashing early and often
makes for more robust and reliable software in the long run. For exactly
this reason, assertions go especially well with <a href="/blog/2019/01/25/">fuzzing</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">assert</span><span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">);</span>   <span class="c1">// bounds check</span>
<span class="n">assert</span><span class="p">((</span><span class="kt">ssize_t</span><span class="p">)</span><span class="n">size</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// suspicious size_t</span>
<span class="n">assert</span><span class="p">(</span><span class="n">cur</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">!=</span> <span class="n">cur</span><span class="p">);</span>    <span class="c1">// circular reference?</span>
</code></pre></div></div>

<p>They’re sometimes abused for error handling, which is a reason they’ve
also been (wrongfully) discouraged at times. For example, failing to open
a file is an error, not a defect, so an assertion is inappropriate.</p>
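<p>As a sketch of the distinction (the function and file names here are hypothetical): a missing file is an expected error to be handled, while a caller violating the function’s contract is a defect worth asserting.</p>

```python
# Hypothetical sketch: error vs. defect. A missing file is an error
# we handle; a non-string path is a caller defect we assert.
def load_config(path):
    assert isinstance(path, str)  # defect: caller broke the contract
    try:
        with open(path) as f:     # error: expected, handled gracefully
            return f.read()
    except OSError:
        return None               # e.g. report it and fall back to defaults
```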

<p>Normal programs have implicit assertions all over, even if we don’t
usually think of them as assertions. In some cases they’re checked by the
hardware. Examples of implicit assertion failures:</p>

<ul>
  <li>Out-of-bounds indexing</li>
  <li>Dereferencing null/nil/None</li>
  <li>Dividing by zero</li>
  <li>Certain kinds of integer overflow (e.g. <code class="language-plaintext highlighter-rouge">-ftrapv</code>)</li>
</ul>

<p>Programs are generally not intended to recover from these situations
because, had they been anticipated, the invalid operation wouldn’t have
been attempted in the first place. The program simply crashes because
there’s no better alternative. Sanitizers, including Address Sanitizer
(ASan) and Undefined Behavior Sanitizer (UBSan), are in essence
additional, implicit assertions, checking invariants that aren’t normally
checked.</p>

<p>Ideally a failing assertion should have these two effects:</p>

<ul>
  <li>
    <p>Execution should <em>immediately</em> stop. The program is in an unknown state,
so it’s neither safe to “clean up” nor attempt to recover. Additional
execution will only make debugging more difficult, and may obscure the
defect.</p>
  </li>
  <li>
    <p>When run under a debugger — or visited as a core dump — it should break
exactly at the failed assertion, ready for inspection. I should not need
to dig around the call stack to figure out where the failure occurred. I
certainly shouldn’t need to manually set a breakpoint and restart the
program hoping to fail the assertion a second time. The whole reason for
using a debugger is to save time, so if it’s wasting my time then it’s
failing at its primary job.</p>
  </li>
</ul>

<p>I examined standard <code class="language-plaintext highlighter-rouge">assert</code> features across various language
implementations, and none strictly meet the criteria. Fortunately, in some
cases, it’s trivial to build a better assertion, and you can substitute
your own definition. First, let’s discuss the way assertions disappoint.</p>

<h3 id="a-test-assertion">A test assertion</h3>

<p>My test for C and C++ is minimal but establishes some state and gives me a
variable to inspect:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;assert.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">5</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then I compile and debug in the most straightforward way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -g -o test test.c
$ gdb test
(gdb) r
(gdb) bt
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">r</code> in GDB stands for <code class="language-plaintext highlighter-rouge">run</code>, which immediately breaks because of the
<code class="language-plaintext highlighter-rouge">assert</code>. The <code class="language-plaintext highlighter-rouge">bt</code> prints a backtrace. On a typical Linux distribution
that shows this backtrace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  __GI_raise
#1  __GI_abort
#2  __assert_fail_base
#3  __GI___assert_fail
#4  main
</code></pre></div></div>

<p>Well, actually, it’s much messier than this, but I manually cleaned it up:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linu
x/raise.c:50
#1  0x00007ffff7df4537 in __GI_abort () at abort.c:79
#2  0x00007ffff7df440f in __assert_fail_base (fmt=0x7ffff7f5d
128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x
55555555600b "i &lt; 5", file=0x555555556004 "test.c", line=6, f
unction=&lt;optimized out&gt;) at assert.c:92
#3  0x00007ffff7e03662 in __GI___assert_fail (assertion=0x555
55555600b "i &lt; 5", file=0x555555556004 "test.c", line=6, func
tion=0x555555556011 &lt;__PRETTY_FUNCTION__.0&gt; "main") at assert
.c:101
#4  0x0000555555555178 in main () at test.c:6
</code></pre></div></div>

<p>That’s a lot to take in at a glance, and about 95% of it is noise that
will never contain useful information. Most notably, GDB didn’t stop at
the failing assertion. Instead there’s <em>four stack frames</em> of libc junk I
have to navigate before I can even begin debugging.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) up
(gdb) up
(gdb) up
(gdb) up
</code></pre></div></div>

<p>I must wade through this for every assertion failure. This is some of the
friction that made me avoid the debugger in the first place. glibc loves
indirection, so maybe the other libc implementations do better? How about
musl?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  setjmp
#1  raise
#2  ??
#3  ??
#4  ??
#5  ??
#6  ??
#7  ??
#8  ??
#9  ??
#10 ??
#11 ??
</code></pre></div></div>

<p>Oops, without musl debugging symbols I can’t debug assertions at all
because GDB can’t read the stack, so it’s lost. If you’re on Alpine you
can install <code class="language-plaintext highlighter-rouge">musl-dbg</code>, but otherwise you’ll probably need to build your
own from source. With debugging symbols, musl is no better than glibc:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  __restore_sigs
#1  raise
#2  abort
#3  __assert_fail
#4  main
</code></pre></div></div>

<p>Same with FreeBSD:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  thr_kill
#1  in raise
#2  in abort
#3  __assert
#4  main
</code></pre></div></div>

<p>OpenBSD has one fewer frame:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  thrkill
#1  _libc_abort
#2  _libc___assert2
#3  main
</code></pre></div></div>

<p>How about on Windows with Mingw-w64?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Inferior 1 (process 7864) exited with code 03]
</code></pre></div></div>

<p>Oops, on Windows GDB doesn’t break at all on <code class="language-plaintext highlighter-rouge">assert</code>. You must first set
a breakpoint on <code class="language-plaintext highlighter-rouge">abort</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) b abort
</code></pre></div></div>

<p>Besides that, it’s the most straightforward so far:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0 msvcrt!abort
#1 msvcrt!_assert
#2 main
</code></pre></div></div>

<p>With MSVC (default CRT) I get something slightly different:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0 abort
#1 common_assert_to_stderr
#2 _wassert
#3 main
#4 __scrt_common_main_seh
</code></pre></div></div>

<p>RemedyBG leaves me at the <code class="language-plaintext highlighter-rouge">abort</code> like GDB does elsewhere. Visual Studio
recognizes that I don’t care about its stack frames and instead puts the
focus on the assertion, ready for debugging. The other stack frames are
there, but basically invisible. It’s the only case that practically meets
all my criteria!</p>

<p>I can’t entirely blame these implementations. The C standard requires that
<code class="language-plaintext highlighter-rouge">assert</code> print a diagnostic and call <code class="language-plaintext highlighter-rouge">abort</code>, and that <code class="language-plaintext highlighter-rouge">abort</code> raise
<code class="language-plaintext highlighter-rouge">SIGABRT</code>. There’s not much implementations can do, and it’s up to the
debugger to be smarter about it.</p>

<h3 id="sanitizers">Sanitizers</h3>

<p>ASan doesn’t break GDB on assertion failures, which is yet another source
of friction. You can work around this with an environment variable:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export ASAN_OPTIONS=abort_on_error=1:print_legend=0
</code></pre></div></div>

<p>This works, but it’s the worst case of all: I get 7 junk stack frames on
top of the failed assertion. It’s also very noisy when it traps, so the
<code class="language-plaintext highlighter-rouge">print_legend=0</code> helps to cut it down a bit. I want this variable so often
that I set it in my shell’s <code class="language-plaintext highlighter-rouge">.profile</code> so that it’s always set.</p>

<p>With UBSan you can use <code class="language-plaintext highlighter-rouge">-fsanitize-undefined-trap-on-error</code>, which behaves
like the improved assertion. It traps directly on the defect with no junk
frames, though it prints no diagnostic. As a bonus, it also means you
don’t need to link <code class="language-plaintext highlighter-rouge">libubsan</code>. Thanks to the bonus, it fully supplants
<code class="language-plaintext highlighter-rouge">-ftrapv</code> for me on all platforms.</p>
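<p>As a concrete command line (reusing the earlier test program’s name), compiling with both flags makes any detected undefined behavior trap in place, directly under GDB:</p>

```
$ cc -g -fsanitize=undefined -fsanitize-undefined-trap-on-error -o test test.c
$ gdb test
(gdb) r
```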

<p><strong>Update November 2022</strong>: This “stop” hook eliminates ASan friction by
popping runtime frames — functions with the reserved <code class="language-plaintext highlighter-rouge">__</code> prefix — from
the call stack so that they’re not in the way when GDB takes control. It
requires Python support, which is the purpose of the feature-sniff outer
condition.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if !$_isvoid($_any_caller_matches)
    define hook-stop
        while $_thread &amp;&amp; $_any_caller_matches("^__")
            up-silently
        end
    end
end
</code></pre></div></div>

<p>This is now part of my <code class="language-plaintext highlighter-rouge">.gdbinit</code>.</p>

<h3 id="a-better-assertion">A better assertion</h3>

<p>At least when under a debugger, here’s a much better assertion macro for
GCC and Clang:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define assert(c) if (!(c)) __builtin_trap()
</span></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">__builtin_trap</code> inserts a trap instruction — a built-in breakpoint. By
not calling a function to raise a signal, there are no junk stack frames
and no need to breakpoint on <code class="language-plaintext highlighter-rouge">abort</code>. It stops exactly where it should as
quickly as possible. This definition works reliably with GCC across all
platforms, too. On MSVC the equivalent is <code class="language-plaintext highlighter-rouge">__debugbreak</code>. If you’re really
in a pinch then do whatever it takes to trigger a fault, like
dereferencing a null pointer. A more complete definition might be:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef DEBUG
#  if __GNUC__
#    define assert(c) if (!(c)) __builtin_trap()
#  elif _MSC_VER
#    define assert(c) if (!(c)) __debugbreak()
#  else
#    define assert(c) if (!(c)) *(volatile int *)0 = 0
#  endif
#else
#  define assert(c)
#endif
</span></code></pre></div></div>

<p>None of these print a diagnostic, but that’s unnecessary when a debugger
is involved.</p>

<h3 id="other-languages">Other languages</h3>

<p>Unfortunately the situation <a href="https://github.com/rust-lang/rust/issues/21102">mostly gets worse</a> with other language
implementations, and it’s generally not possible to build a better
assertion. Assertions typically have exception-like semantics, if not
literally just another exception, and so they are far less reliable. If a
failed assertion raises an exception, then the program won’t stop until
it’s unwound the stack — running destructors and such along the way — all
the way to the top level looking for a handler. It only knows there’s a
problem when nobody was there to catch it.</p>

<p><a href="https://go.dev/doc/faq#assertions">Go officially doesn’t have assertions</a>, though panics are a kind of
assertion. However, panics have exception-like semantics, and so suffer
the problems of exceptions. A Go version of my test:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">defer</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"DEFER"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="m">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="m">5</span> <span class="p">{</span>
            <span class="nb">panic</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I run this under Go’s premier debugger, <a href="https://github.com/go-delve/delve">Delve</a>, the unrecovered
panic causes it to break. So far so good. However, I get two junk frames:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0 runtime.fatalpanic
#1 runtime.gopanic
#2 main.main
#3 runtime.main
#4 runtime.goexit
</code></pre></div></div>

<p>It only knows to stop because the Go runtime called <code class="language-plaintext highlighter-rouge">fatalpanic</code>, but the
backtrace is a fiction: The program continued to run after the panic,
enough to run all the registered defers (including printing “DEFER”),
unwinding the stack to the top level, and only then did it <code class="language-plaintext highlighter-rouge">fatalpanic</code>.
Fortunately it’s still possible to inspect all those stack frames even if
some variables may have changed while unwinding, but it’s more like
inspecting a core dump than a paused process.</p>

<p>The situation in Python is similar: <code class="language-plaintext highlighter-rouge">assert</code> raises AssertionError — a
plain old exception — and <code class="language-plaintext highlighter-rouge">pdb</code> won’t break until the stack has unwound,
exiting context managers and such. Only once the exception reaches the top
level does it enter “post mortem debugging,” like a core dump. At least
there are no junk stack frames on top. If you’re using asyncio then your
program may continue running for quite a while before the right tasks are
scheduled and the exception finally propagates to the top level, if ever.</p>
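<p>A Python version of my test, mirroring the C one (the script layout is my own; the behavior described is standard <code class="language-plaintext highlighter-rouge">pdb</code>):</p>

```python
# test.py: a Python version of the earlier test. The assertion
# fails once i reaches 5, raising a plain AssertionError.
def main():
    for i in range(10):
        assert i < 5, i
```

<p>With a <code class="language-plaintext highlighter-rouge">main()</code> call at the bottom, running it as <code class="language-plaintext highlighter-rouge">python -m pdb test.py</code> drops into post-mortem debugging only after the exception has unwound to the top level.</p>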

<p>The worst offender of all is Java. First, <code class="language-plaintext highlighter-rouge">jdb</code> never breaks for unhandled
exceptions. It’s up to you to set a breakpoint before the exception is
thrown. But it gets worse: assertions are disabled under <code class="language-plaintext highlighter-rouge">jdb</code>. The Java
<code class="language-plaintext highlighter-rouge">assert</code> statement is worse than useless.</p>

<h3 id="addendum-dont-exit-the-debugger">Addendum: Don’t exit the debugger</h3>

<p>The largest friction-reducing change I made is never exiting the debugger.
Previously I would enter GDB, run my program, exit, edit/rebuild, repeat.
However, there’s no reason to exit GDB! It automatically and reliably
reloads symbols and updates breakpoints on symbols. It remembers your run
configuration, so re-running is just <code class="language-plaintext highlighter-rouge">r</code> rather than interacting with
shell history.</p>

<p>My workflow on all platforms (<a href="/blog/2020/05/15/">including Windows</a>) is a vertically
maximized Vim window and a vertically maximized terminal window. The new
part for me: The terminal runs a long-term GDB session exclusively, with
<code class="language-plaintext highlighter-rouge">file</code> set to the program I’m writing, usually set by the initial command
line.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gdb myprogram
gdb&gt;
</code></pre></div></div>

<p>Alternatively, use <code class="language-plaintext highlighter-rouge">file</code> after starting GDB. This is occasionally useful when my
project has multiple binaries and I want to examine a different program.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; file myprogram
</code></pre></div></div>

<p>I use <code class="language-plaintext highlighter-rouge">make</code> and Vim’s <code class="language-plaintext highlighter-rouge">:mak</code> command for building from within the editor,
so I don’t need to change context to build. The quickfix list takes me
straight to warnings/errors. Often I’m writing something that takes input
from standard input. So I use the <code class="language-plaintext highlighter-rouge">run</code> (<code class="language-plaintext highlighter-rouge">r</code>) command to set this up
(along with any command line arguments).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; r &lt;test.txt
</code></pre></div></div>

<p>You can redirect standard output as well. It remembers these settings for
plain <code class="language-plaintext highlighter-rouge">run</code> later, so I can test my program by entering <code class="language-plaintext highlighter-rouge">r</code> and nothing
else.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; r
</code></pre></div></div>

<p>My usual workflow is edit, <code class="language-plaintext highlighter-rouge">:mak</code>, <code class="language-plaintext highlighter-rouge">r</code>, repeat. If I want to test a
different input or use different options, change the run configuration
using <code class="language-plaintext highlighter-rouge">run</code> again:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; r -a -b -c &lt;test2.txt
</code></pre></div></div>

<p>On Windows you cannot recompile while the program is running. If GDB is
sitting on a breakpoint but I want to build, use <code class="language-plaintext highlighter-rouge">kill</code> (<code class="language-plaintext highlighter-rouge">k</code>) to stop it
without exiting GDB.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; k
</code></pre></div></div>

<p>GDB has an annoying, flow-breaking yes/no prompt for this, so I recommend
<code class="language-plaintext highlighter-rouge">set confirm no</code> in your <code class="language-plaintext highlighter-rouge">.gdbinit</code> to disable it.</p>

<p>Sometimes a program is stuck in a loop and I need it to break in the
debugger. I try to avoid CTRL-C in the terminal since it can confuse
GDB. A safer option is to signal the process from Vim with <code class="language-plaintext highlighter-rouge">pkill</code>, which
GDB will catch (except on Windows):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:!pkill myprogram
</code></pre></div></div>

<p>I suspect many people don’t know this, but if you’re on Windows and
<a href="/blog/2021/03/11/">developing a graphical application</a>, you can <a href="https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-registerhotkey">press F12</a> in the
debuggee’s window to immediately break the program in the attached
debugger. This is a general platform feature and works with any native
debugger. I’ve been using it quite a lot.</p>

<p>On that note, you can run commands from GDB with <code class="language-plaintext highlighter-rouge">!</code>, which is another way
to avoid having an extra terminal window around:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; !git diff
</code></pre></div></div>

<p>In any case, GDB will re-read the binary on the next <code class="language-plaintext highlighter-rouge">run</code> and update
breakpoints, so it’s mostly seamless. If there’s a function I want to
debug, I set a breakpoint on it, then run.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; b somefunc
gdb&gt; r
</code></pre></div></div>

<p>Alternatively I’ll use a line number, which I read from Vim, though GDB,
not being involved in the editing process, cannot track how that line
moves between builds.</p>

<p>An empty command repeats the last command, so once I’m at a breakpoint,
I’ll type <code class="language-plaintext highlighter-rouge">next</code> (<code class="language-plaintext highlighter-rouge">n</code>) — or <code class="language-plaintext highlighter-rouge">step</code> (<code class="language-plaintext highlighter-rouge">s</code>) to enter function calls — then
press enter each time I want to advance a line, often with my eye on the
context in Vim in the other window:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; n
gdb&gt;
gdb&gt;
</code></pre></div></div>

<p>(<del>I wish GDB could print a source listing around the breakpoint as
context, like Delve, but no such feature exists. The woeful <code class="language-plaintext highlighter-rouge">list</code> command
is inadequate.</del> <strong>Update</strong>: GDB’s TUI is a reasonable compromise for GUI
applications or terminal applications running under a separate tty/console
with either <code class="language-plaintext highlighter-rouge">tty</code> or <code class="language-plaintext highlighter-rouge">set new-console</code>. I can access it everywhere since
w64devkit now supports GDB TUI.)</p>

<p>If I want to advance to the next breakpoint, I use <code class="language-plaintext highlighter-rouge">continue</code> (<code class="language-plaintext highlighter-rouge">c</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; c
</code></pre></div></div>

<p>If I’m walking through a loop, I want to see how variables change, but
it’s tedious to keep <code class="language-plaintext highlighter-rouge">print</code>ing (<code class="language-plaintext highlighter-rouge">p</code>) the same variables again and again.
So I use <code class="language-plaintext highlighter-rouge">display</code> (<code class="language-plaintext highlighter-rouge">disp</code>) to display an expression with each prompt,
much like the “watch” window in Visual Studio. For example, if my loop
variable is <code class="language-plaintext highlighter-rouge">i</code> over some string <code class="language-plaintext highlighter-rouge">str</code>, this will show me the current
character in character format (<code class="language-plaintext highlighter-rouge">/c</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; disp/c str[i]
</code></pre></div></div>

<p>You can accumulate multiple expressions. Use <code class="language-plaintext highlighter-rouge">undisplay</code> to remove them.</p>

<p>Too many breakpoints? Use <code class="language-plaintext highlighter-rouge">info breakpoints</code> (<code class="language-plaintext highlighter-rouge">i b</code>) to list them, then
<code class="language-plaintext highlighter-rouge">delete</code> (<code class="language-plaintext highlighter-rouge">d</code>) the unwanted ones by ID.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; i b
gdb&gt; d 3 5 8
</code></pre></div></div>

<p>GDB has many more features than this, but 10 commands cover 99% of use
cases: <code class="language-plaintext highlighter-rouge">r</code>, <code class="language-plaintext highlighter-rouge">c</code>, <code class="language-plaintext highlighter-rouge">n</code>, <code class="language-plaintext highlighter-rouge">s</code>, <code class="language-plaintext highlighter-rouge">disp</code>, <code class="language-plaintext highlighter-rouge">k</code>, <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">i</code>, <code class="language-plaintext highlighter-rouge">d</code>, <code class="language-plaintext highlighter-rouge">p</code>.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Compressing and embedding a Wordle word list</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/03/07/"/>
    <id>urn:uuid:95e1a2c2-c1b6-4472-9954-7bc76b4bab10</id>
    <updated>2022-03-07T03:22:41Z</updated>
    <category term="c"/><category term="python"/><category term="compression"/>
    <content type="html">
      <![CDATA[<p><a href="https://en.wikipedia.org/wiki/Wordle">Wordle</a> is all the rage, resulting in an explosion of hobbyist clones,
with new ones appearing every day. At the current rate I estimate by the
end of 2022 that 99% of all new software releases will be Wordle clones.
That’s no surprise since the rules are simple, it’s more fun to implement
and study than to actually play, and the hard part is building a decent
user interface. Such implementations go back <a href="https://www.youtube.com/watch?v=Yi2mTMWC4BM&amp;t=1270s">at least 30 years</a>.
Implementers get to decide on a platform, language, and the particular
subject of this article: how to handle the word list. Is it a separate
file/database or <a href="/blog/2016/11/15/">embedded in the program</a>? If embedded, is it
worth compressing? In this article I’ll present a simple, tailored Wordle
list compression strategy that beats general purpose compressors.</p>

<p>Last week one particular <a href="/blog/2020/11/17/">QuickBASIC</a> clone, <a href="http://grahamdowney.com/software/WorDOSle/WorDOSle.htm">WorDOSle</a>, caught my
eye. It embeds its word list despite the dire constraints of its 16-bit
platform. The original Wordle list (<a href="https://gist.github.com/cfreshman/cdcdf777450c5b5301e439061d29694c">1</a>, <a href="https://gist.github.com/cfreshman/a03ef2cba789d8cf00c08f767e0fad7b">2</a>) has 12,972 words which,
naively stored, would consume 77,832 bytes (5 letters, plus newline).
Sadly this exceeds a 16-bit address space. Eliminating the redundant
newline delimiter brings it down to 64,860 bytes — just small enough to
fit in an 8086 segment, but probably still difficult to manage from
QuickBASIC.</p>

<p>The author made a trade-off, reducing the word list to a more manageable,
if meager, 2,318 words, wisely excluding delimiters. Otherwise no further
effort was made towards reducing the size. The list is sorted, and the program
cleverly tests words against the list in place using a binary search.</p>
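
<p>That lookup is easy to sketch in Python (my own illustration, not the
WorDOSle source; <code class="language-plaintext highlighter-rouge">bisect</code> supplies the binary search):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect

# Membership test against a sorted, delimiter-free word list, in the
# spirit of WorDOSle's in-place binary search (not its actual code).
def is_word(words, guess):
    i = bisect.bisect_left(words, guess)
    return i != len(words) and words[i] == guess

assert is_word(["apple", "baker", "crane"], "baker")
assert not is_word(["apple", "baker", "crane"], "zzzzz")
</code></pre></div></div>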

<h3 id="compaction-baseline">Compaction baseline</h3>

<p>Before getting into any real compression technologies, there’s low-hanging
fruit to investigate. Each word is exactly five case-insensitive English
letters: a–z. To illustrate, here are the first 100 5-letter words from a
short Wordle word list.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
</code></pre></div></div>

<p>In ASCII/UTF-8 form it’s 8 bits per letter, 5 bytes per word, but I only
need 5 bits per letter, or more specifically, ~4.7 bits (<code class="language-plaintext highlighter-rouge">log2(26)</code>) per
letter. If I instead treat each word as a base-26 number, I can pack each
word into 3 bytes (<code class="language-plaintext highlighter-rouge">26**5</code> needs only ~23.5 bits). That’s a 40% savings
just from a smarter representation.</p>

<p>With 12,972 words, that’s <strong>38,916 bytes</strong> for the whole list. Any
compression I apply must at least beat this size in order to be worth
using.</p>
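
<p>For illustration, here’s what such a base-26 packer might look like (my
own sketch, names mine; the byte order is an arbitrary choice):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Treat "aaaaa".."zzzzz" as a base-26 number. 26**5 is about 11.9
# million, which always fits in 3 bytes (2**24 is about 16.8 million).
def pack3(word):
    n = 0
    for c in word:
        n = n*26 + ord(c) - ord('a')
    return n.to_bytes(3, "little")

assert pack3("aaaaa") == b"\x00\x00\x00"
assert int.from_bytes(pack3("zzzzz"), "little") == 26**5 - 1
</code></pre></div></div>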

<h3 id="letter-frequency">Letter frequency</h3>

<p>Not all letters occur at the same frequency. Here’s the letter frequency
for the original Wordle word list:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a:5990  e:6662  i:3759  m:1976  q: 112  u:2511  y:2074
b:1627  f:1115  j: 291  n:2952  r:4158  v: 694  z: 434
c:2028  g:1644  k:1505  o:4438  s:6665  w:1039
d:2453  h:1760  l:3371  p:2019  t:3295  x: 288
</code></pre></div></div>

<p>When encoding a word, I can save space by spending fewer bits on frequent
letters like <code class="language-plaintext highlighter-rouge">e</code> at the cost of spending more bits on infrequent letters
like <code class="language-plaintext highlighter-rouge">q</code>. There are multiple approaches, but the simplest is <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman
coding</a>. It’s not the most efficient, but it’s so easy I can
almost code it in my sleep.</p>

<p>While my ultimate target is C, I did the frequency analysis, explored the
problem space, and implemented my compressors in Python. I don’t normally
like to use Python, but it <em>is</em> good for one-shot, disposable data
science-y stuff like this. The decompressor will be implemented in C,
partially via meta-programming: Python code generating my C code. Here’s
my letter histogram code:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">sys</span><span class="p">.</span><span class="n">stdin</span><span class="p">]</span>
<span class="n">hist</span> <span class="o">=</span> <span class="n">collections</span><span class="p">.</span><span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">itertools</span><span class="p">.</span><span class="n">chain</span><span class="p">(</span><span class="o">*</span><span class="n">words</span><span class="p">):</span>
    <span class="n">hist</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>

<p>To build a Huffman coding tree, I’ll need a min-heap (priority queue)
initially filled with nodes representing each letter and its frequency.
While the heap has more than one element, I pop off the two lowest
frequency nodes, create a new parent node with the sum of their
frequencies, and push it into the heap. When the heap has one element, the
remaining element is the root of the Huffman coding tree.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">huffman</span><span class="p">(</span><span class="n">hist</span><span class="p">):</span>
    <span class="n">heap</span> <span class="o">=</span> <span class="p">[(</span><span class="n">n</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">hist</span><span class="p">.</span><span class="n">items</span><span class="p">()]</span>
    <span class="n">heapq</span><span class="p">.</span><span class="n">heapify</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span>
    <span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappop</span><span class="p">(</span><span class="n">heap</span><span class="p">),</span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappop</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span>
        <span class="n">node</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
        <span class="n">heapq</span><span class="p">.</span><span class="n">heappush</span><span class="p">(</span><span class="n">heap</span><span class="p">,</span> <span class="n">node</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">heap</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>

<span class="n">tree</span> <span class="o">=</span> <span class="n">huffman</span><span class="p">(</span><span class="n">hist</span><span class="p">)</span>
</code></pre></div></div>

<p>(By the way, I love that <code class="language-plaintext highlighter-rouge">heapq</code> operates directly on a plain <code class="language-plaintext highlighter-rouge">list</code>
rather than being its own data structure.) This produces the following
Huffman coding tree (via <code class="language-plaintext highlighter-rouge">pprint</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((('e', 's'),
  (('t', 'l'), (('g', ('v', 'w')), ('h', 'm')))),
 ((('i', ('p', 'c')),
   ('r', ('y', ('f', ('z', ('j', ('q', 'x'))))))),
  (('o', ('d', 'u')), ('a', ('n', ('k', 'b'))))))
</code></pre></div></div>

<p>It would be more useful to actually see the encodings.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="s">""</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">prefix</span><span class="o">+</span><span class="s">"0"</span><span class="p">)</span> <span class="o">+</span> \
               <span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">prefix</span><span class="o">+</span><span class="s">"1"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">[(</span><span class="n">tree</span><span class="p">,</span> <span class="n">prefix</span><span class="p">)]</span>
</code></pre></div></div>

<p>I used <code class="language-plaintext highlighter-rouge">isinstance</code> to distinguish leaves (<code class="language-plaintext highlighter-rouge">str</code>) from internal nodes
(<code class="language-plaintext highlighter-rouge">tuple</code>). With <code class="language-plaintext highlighter-rouge">sorted(flatten(tree))</code>, I get something like Morse Code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[('a', '1110'),       ('j', '10111110'),   ('s', '001'),
 ('b', '111111'),     ('k', '111110'),     ('t', '0100'),
 ('c', '10011'),      ('l', '0101'),       ('u', '11011'),
 ('d', '11010'),      ('m', '01111'),      ('v', '011010'),
 ('e', '000'),        ('n', '11110'),      ('w', '011011'),
 ('f', '101110'),     ('o', '1100'),       ('x', '101111111'),
 ('g', '01100'),      ('p', '10010'),      ('y', '10110'),
 ('h', '01110'),      ('q', '101111110'),  ('z', '1011110'),
 ('i', '1000'),       ('r', '1010')]
</code></pre></div></div>

<p>In terms of encoded bit length, what is the shortest and longest?</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">codes</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">))</span>
<span class="n">lengths</span> <span class="o">=</span> <span class="p">[(</span><span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">w</span><span class="p">),</span> <span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">]</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">min(lengths)</code> is “esses” at 15 bits, and <code class="language-plaintext highlighter-rouge">max(lengths)</code> is “qajaq” at 34
bits. In other words, the worst case is worse than the compact, 24-bit
representation! However, the total is better: <code class="language-plaintext highlighter-rouge">sum(w[0] for w in lengths)</code>
reports 281,956 bits, or 35,245 bytes. Packed appropriately, that shaves
off ~3.5kB, though it comes at the cost of losing random access, and
therefore binary search.</p>

<p>Speaking of bit packing, I’m ready to compress the entire word list into a
bit stream:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bits</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">)</span>
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">bits</code> begins with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11101110011100001101011101110010110001000111011101...
</code></pre></div></div>

<p>On the C side I’ll pack these into 32-bit integers, least significant bit
first. I abused <code class="language-plaintext highlighter-rouge">textwrap</code> to dice it up, and I also need to reverse each
set of bits before converting to an integer.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u32</span> <span class="o">=</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">b</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">2</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">textwrap</span><span class="p">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">bits</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">32</span><span class="p">)]</span>
</code></pre></div></div>

<p>I now have my compressed data as a sequence of 32-bit integers. Next, some
meta-programming:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"static const uint32_t words[</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">u32</span><span class="p">)</span><span class="si">}</span><span class="s">] ="</span><span class="p">,</span> <span class="s">"{"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">u</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">u32</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">i</span><span class="o">%</span><span class="mi">6</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">    "</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"0x</span><span class="si">{</span><span class="n">u</span><span class="si">:</span><span class="mi">08</span><span class="n">x</span><span class="si">}</span><span class="s">,"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">};"</span><span class="p">)</span>
</code></pre></div></div>

<p>That produces a C table, the beginnings of my decompressor. The array
length isn’t necessary since the C compiler can figure it out, but being
explicit allows human readers to know the size at a glance, too. Observe
how the final 32-bit integer isn’t entirely filled.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">words</span><span class="p">[</span><span class="mi">8812</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mh">0x4eeb0e77</span><span class="p">,</span><span class="mh">0xb8caee23</span><span class="p">,</span><span class="mh">0xffb892bb</span><span class="p">,</span><span class="mh">0x397fddf2</span><span class="p">,</span><span class="mh">0xddfcbfee</span><span class="p">,</span><span class="mh">0x5ff7997f</span><span class="p">,</span>
    <span class="c1">// ...</span>
    <span class="mh">0x7b4e66bd</span><span class="p">,</span><span class="mh">0x35ebcccd</span><span class="p">,</span><span class="mh">0x8f9af60f</span><span class="p">,</span><span class="mh">0x0000000c</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
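
<p>As a sanity check, the sizes tie together (using the 281,956-bit total
from earlier):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nbits = 281956             # total Huffman-coded bits, all 12,972 words
nwords32 = -(-nbits // 32) # ceiling division: 32-bit integers needed
assert nwords32 == 8812    # matches the array length above
assert nbits % 32 == 4     # only 4 bits used in the final integer
</code></pre></div></div>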

<p>Now, how to go about building the rest of the decompressor? I have a
Huffman coding tree, which is <em>an awful lot</em> <a href="/blog/2020/12/31/">like a state machine</a>,
eh? I can even have Python generate a state transition table from the
Huffman tree:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">states</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">):</span>
        <span class="n">child</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="n">child</span>
        <span class="n">states</span><span class="p">.</span><span class="n">extend</span><span class="p">((</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">))</span>
        <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">states</span><span class="p">,</span> <span class="n">child</span><span class="o">+</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">states</span><span class="p">,</span> <span class="n">child</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="nb">ord</span><span class="p">(</span><span class="n">tree</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">states</span>

<span class="n">states</span> <span class="o">=</span> <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="p">[</span><span class="bp">None</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>The central idea: positive entries are leaves, and negative entries are
internal nodes. The negated value is the index of the left child, with the
right child immediately following. In <code class="language-plaintext highlighter-rouge">transitions</code>, the caller reserves
space in the state table for callees, hence starting with <code class="language-plaintext highlighter-rouge">[None]</code>. I’ll
show the actual table in C form after some more meta-programming:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"static const int8_t states[</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">)</span><span class="si">}</span><span class="s">] ="</span><span class="p">,</span> <span class="s">"{"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">states</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">i</span><span class="o">%</span><span class="mi">12</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">    "</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">s</span><span class="si">:</span><span class="mi">4</span><span class="si">}</span><span class="s">,"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">};"</span><span class="p">)</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">int8_t</code> since I know these values will all fit in an octet, and
it must be signed because of the negatives. The result:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int8_t</span> <span class="n">states</span><span class="p">[</span><span class="mi">51</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
      <span class="o">-</span><span class="mi">1</span><span class="p">,</span>  <span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">19</span><span class="p">,</span>  <span class="o">-</span><span class="mi">5</span><span class="p">,</span>  <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="mi">101</span><span class="p">,</span> <span class="mi">115</span><span class="p">,</span>  <span class="o">-</span><span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">11</span><span class="p">,</span> <span class="mi">116</span><span class="p">,</span> <span class="mi">108</span><span class="p">,</span> <span class="o">-</span><span class="mi">13</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">17</span><span class="p">,</span> <span class="mi">103</span><span class="p">,</span> <span class="o">-</span><span class="mi">15</span><span class="p">,</span> <span class="mi">118</span><span class="p">,</span> <span class="mi">119</span><span class="p">,</span> <span class="mi">104</span><span class="p">,</span> <span class="mi">109</span><span class="p">,</span> <span class="o">-</span><span class="mi">21</span><span class="p">,</span> <span class="o">-</span><span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">27</span><span class="p">,</span> <span class="mi">105</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">25</span><span class="p">,</span> <span class="mi">112</span><span class="p">,</span>  <span class="mi">99</span><span class="p">,</span> <span class="mi">114</span><span class="p">,</span> <span class="o">-</span><span class="mi">29</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="o">-</span><span class="mi">31</span><span class="p">,</span> <span class="mi">102</span><span class="p">,</span> <span class="o">-</span><span class="mi">33</span><span class="p">,</span> <span class="mi">122</span><span class="p">,</span> <span class="o">-</span><span class="mi">35</span><span class="p">,</span> <span class="mi">106</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">37</span><span class="p">,</span> <span class="mi">113</span><span class="p">,</span> <span class="mi">120</span><span class="p">,</span> <span class="o">-</span><span class="mi">41</span><span class="p">,</span> <span class="o">-</span><span class="mi">45</span><span class="p">,</span> <span class="mi">111</span><span class="p">,</span> <span class="o">-</span><span class="mi">43</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">117</span><span class="p">,</span>  <span class="mi">97</span><span class="p">,</span> <span class="o">-</span><span class="mi">47</span><span class="p">,</span> <span class="mi">110</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">49</span><span class="p">,</span> <span class="mi">107</span><span class="p">,</span>  <span class="mi">98</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The first entry is -1, meaning: on a 0 bit transition to state 1, and on
a 1 bit to state 2 (i.e. the entry immediately following state 1). The
decompressor reads one bit at a time, walking the state table until it
hits a positive value, which is an ASCII code. I’ve decided on this
function prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">next</span><span class="p">(</span><span class="kt">char</span> <span class="n">word</span><span class="p">[</span><span class="mi">5</span><span class="p">],</span> <span class="kt">int32_t</span> <span class="n">n</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">n</code> is the bit index, which starts at zero. The function decodes the
word at the given index, then returns the bit index for the next word.
Callers can iterate the entire word list without decompressing the whole
list at once. Finally the decompressor code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">next</span><span class="p">(</span><span class="kt">char</span> <span class="n">word</span><span class="p">[</span><span class="mi">5</span><span class="p">],</span> <span class="kt">int32_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">5</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">n</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">n</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// next bit</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="n">word</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">n</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When compiled, this is about 80 bytes of instructions on both x86-64 and
ARM64. This, along with the 51 bytes for the state table, should be
counted against the compression size. That’s roughly 35,376 bytes total
(35,245 + 51 + 80).</p>

<p>Trying it out, this program indeed reproduces the original word list:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">word</span><span class="p">[]</span> <span class="o">=</span> <span class="s">".....</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">12972</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Searching 12,972 words linearly isn’t too bad, even for an old 16-bit
machine. However, if you really need to speed it up, you could build a
little run time index to track various bit positions in the list. For
example, the first word starting with <code class="language-plaintext highlighter-rouge">b</code> is at bit offset 15,743. If the
word I’m looking up begins with <code class="language-plaintext highlighter-rouge">b</code> then I can start there and stop at the
first <code class="language-plaintext highlighter-rouge">c</code>, decompressing just 909 words.</p>
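
<p>Such an index is cheap to build in one pass. Here’s a sketch, assuming a
hypothetical <code class="language-plaintext highlighter-rouge">next_word(n)</code> that decodes the word at bit offset
<code class="language-plaintext highlighter-rouge">n</code> and returns it along with the next offset, mirroring the C
<code class="language-plaintext highlighter-rouge">next()</code> above:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def first_letter_index(next_word, nwords):
    index = [None] * 26    # bit offset of first word per initial letter
    n = 0
    for _ in range(nwords):
        word, n2 = next_word(n)
        c = ord(word[0]) - ord('a')
        if index[c] is None:
            index[c] = n   # first occurrence of this initial letter
        n = n2
    return index
</code></pre></div></div>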

<h3 id="taking-it-to-the-next-level-run-length-encoding">Taking it to the next level: run-length encoding</h3>

<p>Here’s the 100-word word list sample again. The sorting is deliberate:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
</code></pre></div></div>

<p>If I look at words column-wise, I see a long run of <code class="language-plaintext highlighter-rouge">a</code>, then a long run
of <code class="language-plaintext highlighter-rouge">b</code>, etc. Even the second column has long runs. I should really exploit
this somehow. The first scheme would have worked equally well on a
shuffled list as on a sorted one, which indicates that it's storing
unnecessary information, namely the word list order. (Rule of thumb:
Compression should work better on sorted inputs.)</p>
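<p>That rule of thumb is easy to check empirically. A quick illustration using a synthetic word list and zlib, rather than this post's coder:</p>

```python
import itertools
import random
import zlib

# A synthetic word list that is born sorted: every 5-letter string over
# a small alphabet. Not the Wordle list.
words = ["".join(p) for p in itertools.product("abcdef", repeat=5)]

shuffled = words[:]
random.seed(1)
random.shuffle(shuffled)

sorted_size = len(zlib.compress("".join(words).encode()))
shuffled_size = len(zlib.compress("".join(shuffled).encode()))
# The sorted order compresses smaller: shuffling adds information
# (the list order) that the compressor has to pay for.
assert sorted_size < shuffled_size
```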

<p>For this second scheme, I’ll pivot the whole list so that I can encode it
in column-order. (This is roughly how one part of bzip2 works, by the
way.) I’ll use run-length encoding (RLE) to communicate “91 ‘a’, 135 ‘b’,
etc.”, then I’ll encode these RLE tokens using Huffman coding, per the
first scheme, since there will be lots of repeated tokens.</p>

<p>First, pivot the word list:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pivot</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">))</span>
</code></pre></div></div>
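<p>On a three-word sample the effect of the pivot is easy to see:</p>

```python
words = ["abbey", "abide", "about"]
pivot = "".join("".join(w[i] for w in words) for i in range(5))
assert pivot == "aaabbbbioeduyet"  # columns: aaa bbb bio edu yet
```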

<p>Next compute the RLE token stream. The stream works in pairs, first
indicating a letter (1–26), then the run length.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokens</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">offset</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot</span><span class="p">):</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">pivot</span><span class="p">[</span><span class="n">offset</span><span class="p">]</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">offset</span>
    <span class="k">while</span> <span class="n">offset</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot</span><span class="p">)</span> <span class="ow">and</span> <span class="n">pivot</span><span class="p">[</span><span class="n">offset</span><span class="p">]</span> <span class="o">==</span> <span class="n">c</span><span class="p">:</span>
        <span class="n">offset</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">-</span> <span class="nb">ord</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">offset</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
</code></pre></div></div>

<p>I’ve biased the letter representation by 1 — i.e. 1–26 instead of 0–25 —
since I’m going to encode all the tokens using the same Huffman tree.
(Exercise for the reader: Does compression improve with two distinct
Huffman trees, one for letters and the other for runs?) There are no
zero-length runs, and I want there to be as few unique tokens as possible.</p>

<p><code class="language-plaintext highlighter-rouge">tokens</code> looks like so (e.g. 737 ‘a’, 909 ‘b’, …):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1, 737, 2, 909, 3, 922, 4, 685, 5, 303, 6, 598, ...]
</code></pre></div></div>

<p>The original Wordle list results in 139 unique tokens. A few tokens appear
many times, but most of them appear only once. Reusing my Huffman coding tree
builder from before:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tree</span> <span class="o">=</span> <span class="n">huffman</span><span class="p">(</span><span class="n">collections</span><span class="p">.</span><span class="n">Counter</span><span class="p">(</span><span class="n">tokens</span><span class="p">))</span>
</code></pre></div></div>
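<p>That <code class="language-plaintext highlighter-rouge">huffman</code> builder, and the <code class="language-plaintext highlighter-rouge">flatten</code> used below, came from the first scheme earlier in the post. For anyone following along in isolation, a minimal stand-in that produces the same nested-tuple trees might look like this (a sketch, not the original):</p>

```python
import collections
import heapq
import itertools

def huffman(freq):
    # Build a Huffman tree as nested 2-tuples; leaves are the tokens.
    count = itertools.count()  # tiebreaker so heapq never compares trees
    heap = [(n, next(count), token) for token, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(count), (left, right)))
    return heap[0][2]

def flatten(tree, prefix=""):
    # Yield (token, code) pairs: '0' descends left, '1' descends right.
    if isinstance(tree, tuple):
        yield from flatten(tree[0], prefix + "0")
        yield from flatten(tree[1], prefix + "1")
    else:
        yield (tree, prefix)
```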

<p>This makes for a more complex and interesting tree:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1,
 ((((18, 20), (25, (((10, 24), (26, 22)), 8))),
   (5,
    ((11,
      ((23,
        ((17,
          (((35, (46, 76)), ((82, 93), (104, 111))),
           (((165, 168), 27), (28, (((30, 39), 31), 38))))),
         ((((((40, 41), ((44, 48), 45)),
             ((53, (54, 56)), 55)),
            ((((57, 59), 58), ((60, 61), (62, 63))),
             ((64, (65, 66)), ((67, 70), 68)))),
           (((((71, 75), 74), (77, (78, 79))),
             (((80, 85), 87), 81)),
            ((((90, 91), (92, 97)), (96, (99, 100))),
             (((101, 103), 102),
              ((105, 106), (109, 110)))))),
          ((((((113, 114), 117), ((120, 121), (125, 129))),
             (((130, 133), (137, 139)), (138, (140, 142)))),
            ((((144, 145), (147, 153)), (148, (166, 175))),
             (((181, 183), (187, 189)),
              ((193, 202), (220, 242))))),
           (((((262, 303), (325, 376)),
              ((413, 489), (577, 598))),
             (((628, 638), (685, 693)),
              ((737, 815), (859, 909)))),
            ((((922, 1565), 29), 32), (34, (33, 43)))))))),
       6)),
     3))),
  ((19, 2),
   ((4, (15, (21, 16))), ((14, 9), (12, (13, 7)))))))
</code></pre></div></div>

<p>Peeking at the first 21 elements of <code class="language-plaintext highlighter-rouge">sorted(flatten(tree))</code>, which chops
off the long tail of large-valued, single-occurrence tokens:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(1, '0'),            (8, '100111'),       (15, '111010'),
 (2, '1101'),         (9, '111101'),       (16, '1110111'),
 (3, '10111'),        (10, '10011000'),    (17, '1011010100'),
 (4, '11100'),        (11, '101100'),      (18, '10000'),
 (5, '1010'),         (12, '111110'),      (19, '1100'),
 (6, '1011011'),      (13, '1111110'),     (20, '10001'),
 (7, '1111111'),      (14, '111100'),      (21, '1110110')]
</code></pre></div></div>

<p>Huffman-encoding the RLE stream is more straightforward:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">codes</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">))</span>
<span class="n">bits</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">tokens</span><span class="p">)</span>
</code></pre></div></div>
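<p>Before generating any C it's worth sanity-checking the stream with a round trip. A tiny tree-walking decoder (my own helper, not from the post):</p>

```python
def hdecode(bits, tree):
    # Walk the nested-tuple Huffman tree: '0' goes left, '1' goes right,
    # emitting a token and restarting at the root on every leaf.
    out, node = [], tree
    for b in bits:
        node = node[int(b)]
        if not isinstance(node, tuple):
            out.append(node)
            node = tree
    return out

# With the real data: hdecode(bits, tree) == tokens.
# Toy check against a hand-built tree where 1='0', 2='10', 3='11':
assert hdecode("01011", (1, (2, 3))) == [1, 2, 3]
```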

<p>This time <code class="language-plaintext highlighter-rouge">len(bits)</code> is 164,958, or 20,620 bytes! A huge difference,
around 40% additional savings!</p>

<p>Slicing and dicing 32-bit integers and printing the table works the same
as before. However, this time the state table has larger values (e.g. that
run of 909), and so its elements will be <code class="language-plaintext highlighter-rouge">int16_t</code>. I copy-pasted the
original meta-programming code and made the appropriate adjustments:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int16_t</span> <span class="n">states</span><span class="p">[</span><span class="mi">277</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
      <span class="o">-</span><span class="mi">1</span><span class="p">,</span>   <span class="mi">1</span><span class="p">,</span>  <span class="o">-</span><span class="mi">3</span><span class="p">,</span>  <span class="o">-</span><span class="mi">5</span><span class="p">,</span><span class="o">-</span><span class="mi">257</span><span class="p">,</span>  <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="o">-</span><span class="mi">21</span><span class="p">,</span>  <span class="o">-</span><span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">11</span><span class="p">,</span>  <span class="mi">18</span><span class="p">,</span>  <span class="mi">20</span><span class="p">,</span>  <span class="mi">25</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">13</span><span class="p">,</span> <span class="o">-</span><span class="mi">15</span><span class="p">,</span>   <span class="mi">8</span><span class="p">,</span> <span class="o">-</span><span class="mi">17</span><span class="p">,</span> <span class="o">-</span><span class="mi">19</span><span class="p">,</span>  <span class="mi">10</span><span class="p">,</span>  <span class="mi">24</span><span class="p">,</span>  <span class="mi">26</span><span class="p">,</span>  <span class="mi">22</span><span class="p">,</span>   <span class="mi">5</span><span class="p">,</span> <span class="o">-</span><span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">25</span><span class="p">,</span>
       <span class="mi">3</span><span class="p">,</span>  <span class="mi">11</span><span class="p">,</span> <span class="o">-</span><span class="mi">27</span><span class="p">,</span> <span class="o">-</span><span class="mi">29</span><span class="p">,</span>   <span class="mi">6</span><span class="p">,</span>  <span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">31</span><span class="p">,</span> <span class="o">-</span><span class="mi">33</span><span class="p">,</span> <span class="o">-</span><span class="mi">63</span><span class="p">,</span>  <span class="mi">17</span><span class="p">,</span> <span class="o">-</span><span class="mi">35</span><span class="p">,</span> <span class="o">-</span><span class="mi">37</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">49</span><span class="p">,</span> <span class="o">-</span><span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">43</span><span class="p">,</span>  <span class="mi">35</span><span class="p">,</span> <span class="o">-</span><span class="mi">41</span><span class="p">,</span>  <span class="mi">46</span><span class="p">,</span>  <span class="mi">76</span><span class="p">,</span> <span class="o">-</span><span class="mi">45</span><span class="p">,</span> <span class="o">-</span><span class="mi">47</span><span class="p">,</span>  <span class="mi">82</span><span class="p">,</span>  <span class="mi">93</span><span class="p">,</span> <span class="mi">104</span><span class="p">,</span>
     <span class="mi">111</span><span class="p">,</span> <span class="o">-</span><span class="mi">51</span><span class="p">,</span> <span class="o">-</span><span class="mi">55</span><span class="p">,</span> <span class="o">-</span><span class="mi">53</span><span class="p">,</span>  <span class="mi">27</span><span class="p">,</span> <span class="mi">165</span><span class="p">,</span> <span class="mi">168</span><span class="p">,</span>  <span class="mi">28</span><span class="p">,</span> <span class="o">-</span><span class="mi">57</span><span class="p">,</span> <span class="o">-</span><span class="mi">59</span><span class="p">,</span>  <span class="mi">38</span><span class="p">,</span> <span class="o">-</span><span class="mi">61</span><span class="p">,</span>
      <span class="mi">31</span><span class="p">,</span>  <span class="mi">30</span><span class="p">,</span>  <span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">65</span><span class="p">,</span><span class="o">-</span><span class="mi">155</span><span class="p">,</span> <span class="o">-</span><span class="mi">67</span><span class="p">,</span><span class="o">-</span><span class="mi">109</span><span class="p">,</span> <span class="o">-</span><span class="mi">69</span><span class="p">,</span> <span class="o">-</span><span class="mi">85</span><span class="p">,</span> <span class="o">-</span><span class="mi">71</span><span class="p">,</span> <span class="o">-</span><span class="mi">79</span><span class="p">,</span> <span class="o">-</span><span class="mi">73</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">75</span><span class="p">,</span>  <span class="mi">40</span><span class="p">,</span>  <span class="mi">41</span><span class="p">,</span> <span class="o">-</span><span class="mi">77</span><span class="p">,</span>  <span class="mi">45</span><span class="p">,</span>  <span class="mi">44</span><span class="p">,</span>  <span class="mi">48</span><span class="p">,</span> <span class="o">-</span><span class="mi">81</span><span class="p">,</span>  <span class="mi">55</span><span class="p">,</span>  <span class="mi">53</span><span class="p">,</span> <span class="o">-</span><span class="mi">83</span><span class="p">,</span>  <span class="mi">54</span><span class="p">,</span>
      <span class="mi">56</span><span class="p">,</span> <span class="o">-</span><span class="mi">87</span><span class="p">,</span> <span class="o">-</span><span class="mi">99</span><span class="p">,</span> <span class="o">-</span><span class="mi">89</span><span class="p">,</span> <span class="o">-</span><span class="mi">93</span><span class="p">,</span> <span class="o">-</span><span class="mi">91</span><span class="p">,</span>  <span class="mi">58</span><span class="p">,</span>  <span class="mi">57</span><span class="p">,</span>  <span class="mi">59</span><span class="p">,</span> <span class="o">-</span><span class="mi">95</span><span class="p">,</span> <span class="o">-</span><span class="mi">97</span><span class="p">,</span>  <span class="mi">60</span><span class="p">,</span>
      <span class="mi">61</span><span class="p">,</span>  <span class="mi">62</span><span class="p">,</span>  <span class="mi">63</span><span class="p">,</span><span class="o">-</span><span class="mi">101</span><span class="p">,</span><span class="o">-</span><span class="mi">105</span><span class="p">,</span>  <span class="mi">64</span><span class="p">,</span><span class="o">-</span><span class="mi">103</span><span class="p">,</span>  <span class="mi">65</span><span class="p">,</span>  <span class="mi">66</span><span class="p">,</span><span class="o">-</span><span class="mi">107</span><span class="p">,</span>  <span class="mi">68</span><span class="p">,</span>  <span class="mi">67</span><span class="p">,</span>
      <span class="mi">70</span><span class="p">,</span><span class="o">-</span><span class="mi">111</span><span class="p">,</span><span class="o">-</span><span class="mi">129</span><span class="p">,</span><span class="o">-</span><span class="mi">113</span><span class="p">,</span><span class="o">-</span><span class="mi">123</span><span class="p">,</span><span class="o">-</span><span class="mi">115</span><span class="p">,</span><span class="o">-</span><span class="mi">119</span><span class="p">,</span><span class="o">-</span><span class="mi">117</span><span class="p">,</span>  <span class="mi">74</span><span class="p">,</span>  <span class="mi">71</span><span class="p">,</span>  <span class="mi">75</span><span class="p">,</span>  <span class="mi">77</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">121</span><span class="p">,</span>  <span class="mi">78</span><span class="p">,</span>  <span class="mi">79</span><span class="p">,</span><span class="o">-</span><span class="mi">125</span><span class="p">,</span>  <span class="mi">81</span><span class="p">,</span><span class="o">-</span><span class="mi">127</span><span class="p">,</span>  <span class="mi">87</span><span class="p">,</span>  <span class="mi">80</span><span class="p">,</span>  <span class="mi">85</span><span class="p">,</span><span class="o">-</span><span class="mi">131</span><span class="p">,</span><span class="o">-</span><span class="mi">143</span><span class="p">,</span><span class="o">-</span><span class="mi">133</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">139</span><span class="p">,</span><span class="o">-</span><span class="mi">135</span><span class="p">,</span><span class="o">-</span><span class="mi">137</span><span class="p">,</span>  <span class="mi">90</span><span class="p">,</span>  <span class="mi">91</span><span class="p">,</span>  <span class="mi">92</span><span class="p">,</span>  <span class="mi">97</span><span class="p">,</span>  <span class="mi">96</span><span class="p">,</span><span class="o">-</span><span class="mi">141</span><span class="p">,</span>  <span class="mi">99</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span><span class="o">-</span><span class="mi">145</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">149</span><span class="p">,</span><span class="o">-</span><span class="mi">147</span><span class="p">,</span> <span class="mi">102</span><span class="p">,</span> <span class="mi">101</span><span class="p">,</span> <span class="mi">103</span><span class="p">,</span><span class="o">-</span><span class="mi">151</span><span class="p">,</span><span class="o">-</span><span class="mi">153</span><span class="p">,</span> <span class="mi">105</span><span class="p">,</span> <span class="mi">106</span><span class="p">,</span> <span class="mi">109</span><span class="p">,</span> <span class="mi">110</span><span class="p">,</span><span class="o">-</span><span class="mi">157</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">213</span><span class="p">,</span><span class="o">-</span><span class="mi">159</span><span class="p">,</span><span class="o">-</span><span class="mi">185</span><span class="p">,</span><span class="o">-</span><span class="mi">161</span><span class="p">,</span><span class="o">-</span><span class="mi">173</span><span class="p">,</span><span class="o">-</span><span class="mi">163</span><span class="p">,</span><span class="o">-</span><span class="mi">167</span><span class="p">,</span><span class="o">-</span><span class="mi">165</span><span class="p">,</span> <span class="mi">117</span><span class="p">,</span> <span class="mi">113</span><span class="p">,</span> <span class="mi">114</span><span class="p">,</span><span class="o">-</span><span class="mi">169</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">171</span><span class="p">,</span> <span class="mi">120</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="mi">125</span><span class="p">,</span> <span class="mi">129</span><span class="p">,</span><span class="o">-</span><span class="mi">175</span><span class="p">,</span><span class="o">-</span><span class="mi">181</span><span class="p">,</span><span class="o">-</span><span class="mi">177</span><span class="p">,</span><span class="o">-</span><span class="mi">179</span><span class="p">,</span> <span class="mi">130</span><span class="p">,</span> <span class="mi">133</span><span class="p">,</span> <span class="mi">137</span><span class="p">,</span>
     <span class="mi">139</span><span class="p">,</span> <span class="mi">138</span><span class="p">,</span><span class="o">-</span><span class="mi">183</span><span class="p">,</span> <span class="mi">140</span><span class="p">,</span> <span class="mi">142</span><span class="p">,</span><span class="o">-</span><span class="mi">187</span><span class="p">,</span><span class="o">-</span><span class="mi">199</span><span class="p">,</span><span class="o">-</span><span class="mi">189</span><span class="p">,</span><span class="o">-</span><span class="mi">195</span><span class="p">,</span><span class="o">-</span><span class="mi">191</span><span class="p">,</span><span class="o">-</span><span class="mi">193</span><span class="p">,</span> <span class="mi">144</span><span class="p">,</span>
     <span class="mi">145</span><span class="p">,</span> <span class="mi">147</span><span class="p">,</span> <span class="mi">153</span><span class="p">,</span> <span class="mi">148</span><span class="p">,</span><span class="o">-</span><span class="mi">197</span><span class="p">,</span> <span class="mi">166</span><span class="p">,</span> <span class="mi">175</span><span class="p">,</span><span class="o">-</span><span class="mi">201</span><span class="p">,</span><span class="o">-</span><span class="mi">207</span><span class="p">,</span><span class="o">-</span><span class="mi">203</span><span class="p">,</span><span class="o">-</span><span class="mi">205</span><span class="p">,</span> <span class="mi">181</span><span class="p">,</span>
     <span class="mi">183</span><span class="p">,</span> <span class="mi">187</span><span class="p">,</span> <span class="mi">189</span><span class="p">,</span><span class="o">-</span><span class="mi">209</span><span class="p">,</span><span class="o">-</span><span class="mi">211</span><span class="p">,</span> <span class="mi">193</span><span class="p">,</span> <span class="mi">202</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="mi">242</span><span class="p">,</span><span class="o">-</span><span class="mi">215</span><span class="p">,</span><span class="o">-</span><span class="mi">245</span><span class="p">,</span><span class="o">-</span><span class="mi">217</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">231</span><span class="p">,</span><span class="o">-</span><span class="mi">219</span><span class="p">,</span><span class="o">-</span><span class="mi">225</span><span class="p">,</span><span class="o">-</span><span class="mi">221</span><span class="p">,</span><span class="o">-</span><span class="mi">223</span><span class="p">,</span> <span class="mi">262</span><span class="p">,</span> <span class="mi">303</span><span class="p">,</span> <span class="mi">325</span><span class="p">,</span> <span class="mi">376</span><span class="p">,</span><span class="o">-</span><span class="mi">227</span><span class="p">,</span><span class="o">-</span><span class="mi">229</span><span class="p">,</span> <span class="mi">413</span><span class="p">,</span>
     <span class="mi">489</span><span class="p">,</span> <span class="mi">577</span><span class="p">,</span> <span class="mi">598</span><span class="p">,</span><span class="o">-</span><span class="mi">233</span><span class="p">,</span><span class="o">-</span><span class="mi">239</span><span class="p">,</span><span class="o">-</span><span class="mi">235</span><span class="p">,</span><span class="o">-</span><span class="mi">237</span><span class="p">,</span> <span class="mi">628</span><span class="p">,</span> <span class="mi">638</span><span class="p">,</span> <span class="mi">685</span><span class="p">,</span> <span class="mi">693</span><span class="p">,</span><span class="o">-</span><span class="mi">241</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">243</span><span class="p">,</span> <span class="mi">737</span><span class="p">,</span> <span class="mi">815</span><span class="p">,</span> <span class="mi">859</span><span class="p">,</span> <span class="mi">909</span><span class="p">,</span><span class="o">-</span><span class="mi">247</span><span class="p">,</span><span class="o">-</span><span class="mi">253</span><span class="p">,</span><span class="o">-</span><span class="mi">249</span><span class="p">,</span>  <span class="mi">32</span><span class="p">,</span><span class="o">-</span><span class="mi">251</span><span class="p">,</span>  <span class="mi">29</span><span class="p">,</span> <span class="mi">922</span><span class="p">,</span>
    <span class="mi">1565</span><span class="p">,</span>  <span class="mi">34</span><span class="p">,</span><span class="o">-</span><span class="mi">255</span><span class="p">,</span>  <span class="mi">33</span><span class="p">,</span>  <span class="mi">43</span><span class="p">,</span><span class="o">-</span><span class="mi">259</span><span class="p">,</span><span class="o">-</span><span class="mi">261</span><span class="p">,</span>  <span class="mi">19</span><span class="p">,</span>   <span class="mi">2</span><span class="p">,</span><span class="o">-</span><span class="mi">263</span><span class="p">,</span><span class="o">-</span><span class="mi">269</span><span class="p">,</span>   <span class="mi">4</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">265</span><span class="p">,</span>  <span class="mi">15</span><span class="p">,</span><span class="o">-</span><span class="mi">267</span><span class="p">,</span>  <span class="mi">21</span><span class="p">,</span>  <span class="mi">16</span><span class="p">,</span><span class="o">-</span><span class="mi">271</span><span class="p">,</span><span class="o">-</span><span class="mi">273</span><span class="p">,</span>  <span class="mi">14</span><span class="p">,</span>   <span class="mi">9</span><span class="p">,</span>  <span class="mi">12</span><span class="p">,</span><span class="o">-</span><span class="mi">275</span><span class="p">,</span>  <span class="mi">13</span><span class="p">,</span>
       <span class="mi">7</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>(Since 277 is prime it will never wrap to a nice rectangle no matter what
width I plug in. Ugh.)</p>
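<p>The adjusted generator itself isn't shown, but the table layout is straightforward to reproduce: a leaf slot holds its token, while an internal slot holds the negated index of its left child, with the right child in the next slot. A stand-in serializer (my reconstruction, not the original code):</p>

```python
def serialize(tree):
    # Depth-first flattening of a nested-tuple Huffman tree into the
    # states[] layout: negative entries are internal nodes (negated
    # left-child index), non-negative entries are leaf tokens.
    states = [None]
    def visit(node, i):
        if isinstance(node, tuple):
            j = len(states)
            states[i] = -j              # children land at j and j+1
            states.extend([None, None])
            visit(node[0], j)
            visit(node[1], j + 1)
        else:
            states[i] = node
    visit(tree, 0)
    return states

assert serialize((1, (2, 3))) == [-1, 1, -3, 2, 3]
```

<p>Run on the tree above, this reproduces the table's opening entries (-1, 1, -3, -5, -257, …): that -257 link jumps past the entire left subtree to reach its sibling's child pair.</p>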

<p>With column-wise compression it’s not possible to iterate a word at a
time. The entire list must be decompressed at once. The interface now
looks like so, where the caller supplies a <code class="language-plaintext highlighter-rouge">12972*5</code>-byte buffer to be
filled:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">decompress</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Exercise for the reader: Modify this to decompress into the 24-bit compact
form, so the caller only needs a <code class="language-plaintext highlighter-rouge">12972*3</code>-byte buffer.</p>

<p>Here’s my decoder, much like before:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">decompress</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">164958</span><span class="p">;)</span> <span class="p">{</span>
        <span class="c1">// Decode letter</span>
        <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">i</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">+</span> <span class="mi">96</span><span class="p">;</span>

        <span class="c1">// Decode run-length</span>
        <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">i</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>

        <span class="c1">// Fill columns</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">,</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">buf</span><span class="p">[</span><span class="n">y</span><span class="o">*</span><span class="mi">5</span><span class="o">+</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">y</span> <span class="o">==</span> <span class="mi">12972</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
            <span class="n">x</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And my new test exactly reproduces the original list:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">12972</span><span class="o">*</span><span class="mi">5L</span><span class="p">];</span>
    <span class="n">decompress</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>

    <span class="kt">char</span> <span class="n">word</span><span class="p">[]</span> <span class="o">=</span> <span class="s">".....</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">12972</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">buf</span><span class="o">+</span><span class="n">i</span><span class="o">*</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">);</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Totalling it up:</p>

<ul>
  <li>Compressed data is 20,620 bytes</li>
  <li>State table is 554 bytes</li>
  <li>Decompressor is about 200 bytes</li>
</ul>

<p>That’s a total of 21,374 bytes. Surprisingly, this beats general-purpose
compressors!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PROGRAM     VERSION   SIZE
bzip2 -9    1.0.8     33,752
gzip -9     1.10      30,338
zstd -19    1.4.8     27,098
brotli -9   1.0.9     26,031
xz -9e      5.2.5     16,656
lzip -9     1.22      16,608
</code></pre></div></div>

<p>Only <code class="language-plaintext highlighter-rouge">xz</code> and <code class="language-plaintext highlighter-rouge">lzip</code> come out ahead on the raw compressed data, but they lose
after accounting for an embedded decompressor (on the order of 10kB). Clearly
there’s an advantage to customizing compression to a particular dataset.</p>

<p><em>Update</em>: <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCAKF7Hnc4nVKS%3D2adUjyiRb5yBZUdw5z0K_Fb9kFbaW5S6i7POw%40mail.gmail.com%3E">Johannes Rudolph has pointed out</a> a compression scheme for
a Game Boy Wordle clone last month that gets it <a href="http://alexanderpruss.blogspot.com/2022/02/game-boy-wordle-how-to-compress-12972.html">down to 17,871 bytes,
<em>and</em> supports iteration</a>. I improved on this scheme to <a href="https://github.com/skeeto/scratch/blob/master/misc/wordle.c">further
reduce it to 16,659 bytes</a>.</p>

]]>
    </content>
  </entry>

  <entry>
    <title>OpenBSD's pledge and unveil from Python</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/09/15/"/>
    <id>urn:uuid:cd3857dd-270c-430e-824d-6512688687a3</id>
    <updated>2021-09-15T02:46:56Z</updated>
    <category term="bsd"/><category term="c"/><category term="python"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=28535255">on Hacker News</a>.</em></p>

<p>Years ago, OpenBSD gained two new security system calls, <a href="https://man.openbsd.org/pledge.2"><code class="language-plaintext highlighter-rouge">pledge(2)</code></a>
(originally <a href="https://www.openbsd.org/papers/tame-fsec2015/mgp00001.html"><code class="language-plaintext highlighter-rouge">tame(2)</code></a>) and <a href="https://man.openbsd.org/unveil.2"><code class="language-plaintext highlighter-rouge">unveil(2)</code></a>. With both, an application
surrenders capabilities at run-time. The idea is to perform initialization
like usual, then drop capabilities before handling untrusted input,
limiting unwanted side effects. This feature is applicable even where type
safety isn’t an issue, such as in Python, where a program might still get
tricked into accessing sensitive files or making network connections when
it shouldn’t. So how can a Python program access these system calls?</p>

<p>As <a href="/blog/2021/06/29/">discussed previously</a>, it’s quite easy to access C APIs from
Python through its <a href="https://docs.python.org/3/library/ctypes.html"><code class="language-plaintext highlighter-rouge">ctypes</code></a> package, and this is no exception.
In this article I show how to do it. Here’s the full source if you want to
dive in: <a href="https://github.com/skeeto/scratch/tree/master/misc/openbsd.py"><strong><code class="language-plaintext highlighter-rouge">openbsd.py</code></strong></a>.</p>

<!--more-->

<p>I’ve chosen these extra constraints:</p>

<ul>
  <li>
    <p>These functions are extra safety features, not necessary for
correctness, so attempts to call them on systems where they don’t exist
will silently do nothing, as though they succeeded. They’re provided as a
best effort.</p>
  </li>
  <li>
    <p>Systems other than OpenBSD may support these functions, now or in the
future, and it would be nice to automatically make use of them when
available. This means no checking for OpenBSD specifically but instead
<em>feature sniffing</em> for their presence.</p>
  </li>
  <li>
    <p>The interfaces should be Pythonic as though they were implemented in
Python itself. Raise exceptions for errors, and accept strings since
they’re more convenient than bytes.</p>
  </li>
</ul>

<p>For reference, here are the function prototypes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">pledge</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">promises</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">execpromises</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">unveil</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">permissions</span><span class="p">);</span>
</code></pre></div></div>

<p>The <a href="https://flak.tedunangst.com/post/string-interfaces">string-oriented interface of <code class="language-plaintext highlighter-rouge">pledge</code></a> will make this a whole
lot easier to implement.</p>

<h3 id="finding-the-functions">Finding the functions</h3>

<p>The first step is to grab functions through <code class="language-plaintext highlighter-rouge">ctypes</code>. Like a lot of Python
documentation, this area is frustratingly imprecise and under-documented.
I want to grab a handle to the already-linked libc and search for either
function. However, getting that handle is a little different on each
platform, and in the process I saw four different exceptions, only one of
which is documented.</p>

<p>I came up with passing None to <code class="language-plaintext highlighter-rouge">ctypes.CDLL</code>, which ultimately just passes
<code class="language-plaintext highlighter-rouge">NULL</code> to <a href="https://man.openbsd.org/dlopen.3"><code class="language-plaintext highlighter-rouge">dlopen(3)</code></a>. That’s really all I wanted. Currently on
Windows this is a TypeError. Once the handle is in hand, try to access the
<code class="language-plaintext highlighter-rouge">pledge</code> attribute, which will fail with AttributeError if it doesn’t
exist. In the event of any exception, just assume the behavior isn’t
available. If found, I also define the function prototype for <code class="language-plaintext highlighter-rouge">ctypes</code>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_pledge</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">_pledge</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">CDLL</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">use_errno</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">pledge</span>
    <span class="n">_pledge</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_int</span>
    <span class="n">_pledge</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_char_p</span><span class="p">,</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_char_p</span>
<span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
    <span class="n">_pledge</span> <span class="o">=</span> <span class="bp">None</span>
</code></pre></div></div>

<p>Catching a broad Exception isn’t great, but it’s the best we can do since
the documentation is incomplete. From this block I’ve seen TypeError,
AttributeError, FileNotFoundError, and OSError. I wouldn’t be surprised if
there are more possibilities, and I don’t want to risk missing them.</p>

<p>Note that I’m catching Exception rather than using a bare <code class="language-plaintext highlighter-rouge">except</code>. My
code will not catch KeyboardInterrupt nor SystemExit. This is deliberate,
and I never want to catch these.</p>

<p>The same story for <code class="language-plaintext highlighter-rouge">unveil</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_unveil</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">_unveil</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">CDLL</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">use_errno</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">unveil</span>
    <span class="n">_unveil</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_int</span>
    <span class="n">_unveil</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_char_p</span><span class="p">,</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_char_p</span>
<span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
    <span class="n">_unveil</span> <span class="o">=</span> <span class="bp">None</span>
</code></pre></div></div>

<h3 id="pythonic-wrappers">Pythonic wrappers</h3>

<p>The next and final step is to wrap the low-level calls in interfaces that
hides their C and <code class="language-plaintext highlighter-rouge">ctypes</code> nature.</p>

<p>Python strings must be encoded to bytes before they can be passed to C
functions. Rather than make the caller worry about this, we’ll let them
pass friendly strings and have the wrapper do the conversion. Either may
also be <code class="language-plaintext highlighter-rouge">NULL</code>, so None is allowed.</p>
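<p>This encode-or-pass-through logic is small enough to capture in a helper
(an illustrative sketch; the article’s wrappers inline it instead):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def _encode(s):
    # Encode str to bytes for the C call; pass None (NULL) and bytes through.
    return s.encode() if isinstance(s, str) else s
</code></pre></div></div>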

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">pledge</span><span class="p">(</span><span class="n">promises</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="n">execpromises</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">_pledge</span><span class="p">:</span>
        <span class="k">return</span>  <span class="c1"># unimplemented
</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_pledge</span><span class="p">(</span><span class="bp">None</span> <span class="k">if</span> <span class="n">promises</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">promises</span><span class="p">.</span><span class="n">encode</span><span class="p">(),</span>
                <span class="bp">None</span> <span class="k">if</span> <span class="n">execpromises</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">execpromises</span><span class="p">.</span><span class="n">encode</span><span class="p">())</span>
    <span class="k">if</span> <span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
        <span class="n">errno</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">get_errno</span><span class="p">()</span>
        <span class="k">raise</span> <span class="nb">OSError</span><span class="p">(</span><span class="n">errno</span><span class="p">,</span> <span class="n">os</span><span class="p">.</span><span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">))</span>
</code></pre></div></div>

<p>As usual, a return of -1 means there was an error, in which case we fetch
<code class="language-plaintext highlighter-rouge">errno</code> and raise the appropriate OSError.</p>
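<p>Both wrappers end with this same error convention, which could be factored
into a tiny helper (a sketch, not part of the article’s code):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ctypes
import os

def _raise_errno():
    # Fetch the errno recorded via use_errno and raise the matching OSError.
    errno = ctypes.get_errno()
    raise OSError(errno, os.strerror(errno))
</code></pre></div></div>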

<p><code class="language-plaintext highlighter-rouge">unveil</code> works a little differently since the first argument is a path.
Python functions that accept paths, such as <code class="language-plaintext highlighter-rouge">open</code>, generally accept
either strings or bytes. On unix-like systems, <a href="https://simonsapin.github.io/wtf-8/">paths are fundamentally
bytestrings</a> and not necessarily Unicode, so it’s necessary to accept
bytes. Since strings are nearly always more convenient, these functions accept both.
The <code class="language-plaintext highlighter-rouge">unveil</code> wrapper here will do the same. If it’s a string, encode it,
otherwise pass it straight through.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">unveil</span><span class="p">(</span><span class="n">path</span><span class="p">:</span> <span class="n">Union</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">,</span> <span class="bp">None</span><span class="p">],</span> <span class="n">permissions</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">_unveil</span><span class="p">:</span>
        <span class="k">return</span>  <span class="c1"># unimplemented
</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_unveil</span><span class="p">(</span><span class="n">path</span><span class="p">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="nb">str</span><span class="p">)</span> <span class="k">else</span> <span class="n">path</span><span class="p">,</span>
                <span class="bp">None</span> <span class="k">if</span> <span class="n">permissions</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">permissions</span><span class="p">.</span><span class="n">encode</span><span class="p">())</span>
    <span class="k">if</span> <span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
        <span class="n">errno</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">get_errno</span><span class="p">()</span>
        <span class="k">raise</span> <span class="nb">OSError</span><span class="p">(</span><span class="n">errno</span><span class="p">,</span> <span class="n">os</span><span class="p">.</span><span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">))</span>
</code></pre></div></div>

<p>That’s it!</p>

<h3 id="trying-it-out">Trying it out</h3>

<p>Let’s start with <code class="language-plaintext highlighter-rouge">unveil</code>. Initially a process has access to the whole
file system with the usual restrictions. On the first call to <code class="language-plaintext highlighter-rouge">unveil</code>
it’s immediately restricted to some subset of the tree. Each call reveals
a little more until a final <code class="language-plaintext highlighter-rouge">NULL</code> call locks it in place for the rest of
the process’s existence.</p>

<p>Suppose a program has been tricked into accessing your shell history,
perhaps by mishandling a path:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hackme</span><span class="p">():</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">pathlib</span><span class="p">.</span><span class="n">Path</span><span class="p">.</span><span class="n">home</span><span class="p">()</span> <span class="o">/</span> <span class="s">".bash_history"</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"You've been hacked!"</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">FileNotFoundError</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Blocked by unveil."</span><span class="p">)</span>

<span class="n">hackme</span><span class="p">()</span>
</code></pre></div></div>

<p>If you’re a Bash user, this prints:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You've been hacked!
</code></pre></div></div>

<p>Using our new feature to restrict the program’s access first:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># restrict access to static program data
</span><span class="n">unveil</span><span class="p">(</span><span class="s">"/usr/share"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span>
<span class="n">unveil</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>

<span class="n">hackme</span><span class="p">()</span>
</code></pre></div></div>

<p>On OpenBSD this now prints:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Blocked by unveil.
</code></pre></div></div>

<p>Working just as it should!</p>

<p>With <code class="language-plaintext highlighter-rouge">pledge</code> we declare what abilities we’d like to keep by supplying a
list of promises, <em>pledging</em> to use only those abilities afterward. A
common case is the <code class="language-plaintext highlighter-rouge">stdio</code> promise which allows reading and writing of
open files, but not <em>opening</em> files. A program might open its log file,
then drop the ability to open files while retaining the ability to write
to its log.</p>

<p>An invalid or unknown promise is an error. Does that work?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt;&gt;&gt; pledge("doesntexist", None)
OSError: [Errno 22] Invalid argument
</code></pre></div></div>

<p>So far so good. How about the functionality itself?</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pledge</span><span class="p">(</span><span class="s">"stdio"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="n">hackme</span><span class="p">()</span>
</code></pre></div></div>

<p>The program is instantly killed when making the disallowed system call:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Abort trap (core dumped)
</code></pre></div></div>

<p>If you want something a little softer, include the <code class="language-plaintext highlighter-rouge">error</code> promise:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pledge</span><span class="p">(</span><span class="s">"stdio error"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="n">hackme</span><span class="p">()</span>
</code></pre></div></div>

<p>Instead it’s an exception, which is a lot easier to debug from Python:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OSError: [Errno 78] Function not implemented
</code></pre></div></div>

<p>The core dump isn’t going to be much help to a Python program, so you
probably always want to use this promise. In general you need to be extra
careful about <code class="language-plaintext highlighter-rouge">pledge</code> in complex runtimes like Python’s which may
reasonably need to do many arbitrary, undocumented things at any time.</p>

]]>
    </content>
  </entry>

  <entry>
    <title>State machines are wonderful tools</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/12/31/"/>
    <id>urn:uuid:c93d7a7b-6ae0-4b7e-afa6-424ef40b9d9c</id>
    <updated>2020-12-31T22:48:13Z</updated>
    <category term="compsci"/><category term="c"/><category term="python"/><category term="lua"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25601821">on Hacker News</a>.</em></p>

<p>I love when my current problem can be solved with a state machine. They’re
fun to design and implement, and I have high confidence about correctness.
They tend to:</p>

<ol>
  <li>Present <a href="/blog/2018/06/10/">minimal, tidy interfaces</a></li>
  <li>Require few, fixed resources</li>
  <li>Hold no opinions about input and output</li>
  <li>Have a compact, concise implementation</li>
  <li>Be easy to reason about</li>
</ol>

<p>State machines are perhaps one of those concepts you heard about in
college but never put into practice. Maybe you use them often.
Regardless, you certainly run into them all the time, from <a href="https://swtch.com/~rsc/regexp/">regular
expressions</a> to traffic lights.</p>
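
<p>Even the traffic light fits the model: a few states and a fixed transition
table (an illustrative sketch, not from the article):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_NEXT = {"green": "yellow", "yellow": "red", "red": "green"}

def traffic_step(state):
    # Advance the light one step through its fixed cycle.
    return _NEXT[state]
</code></pre></div></div>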

<!--more-->

<h3 id="morse-code-decoder-state-machine">Morse code decoder state machine</h3>

<p>Inspired by <a href="https://possiblywrong.wordpress.com/2020/11/21/among-us-morse-code-puzzle/">a puzzle</a>, I came up with this deterministic state
machine for decoding <a href="https://en.wikipedia.org/wiki/Morse_code">Morse code</a>. It accepts a dot (<code class="language-plaintext highlighter-rouge">'.'</code>), dash
(<code class="language-plaintext highlighter-rouge">'-'</code>), or terminator (0) one at a time, advancing through a state
machine step by step:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">morse_decode</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">t</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mh">0x03</span><span class="p">,</span> <span class="mh">0x3f</span><span class="p">,</span> <span class="mh">0x7b</span><span class="p">,</span> <span class="mh">0x4f</span><span class="p">,</span> <span class="mh">0x2f</span><span class="p">,</span> <span class="mh">0x63</span><span class="p">,</span> <span class="mh">0x5f</span><span class="p">,</span> <span class="mh">0x77</span><span class="p">,</span> <span class="mh">0x7f</span><span class="p">,</span> <span class="mh">0x72</span><span class="p">,</span>
        <span class="mh">0x87</span><span class="p">,</span> <span class="mh">0x3b</span><span class="p">,</span> <span class="mh">0x57</span><span class="p">,</span> <span class="mh">0x47</span><span class="p">,</span> <span class="mh">0x67</span><span class="p">,</span> <span class="mh">0x4b</span><span class="p">,</span> <span class="mh">0x81</span><span class="p">,</span> <span class="mh">0x40</span><span class="p">,</span> <span class="mh">0x01</span><span class="p">,</span> <span class="mh">0x58</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x68</span><span class="p">,</span> <span class="mh">0x51</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x88</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x8c</span><span class="p">,</span> <span class="mh">0x92</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x02</span><span class="p">,</span>
        <span class="mh">0x03</span><span class="p">,</span> <span class="mh">0x18</span><span class="p">,</span> <span class="mh">0x14</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x0c</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x08</span><span class="p">,</span> <span class="mh">0x1c</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x24</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x28</span><span class="p">,</span> <span class="mh">0x04</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x30</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x33</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span>
        <span class="mh">0x36</span><span class="p">,</span> <span class="mh">0x37</span><span class="p">,</span> <span class="mh">0x38</span><span class="p">,</span> <span class="mh">0x39</span><span class="p">,</span> <span class="mh">0x41</span><span class="p">,</span> <span class="mh">0x42</span><span class="p">,</span> <span class="mh">0x43</span><span class="p">,</span> <span class="mh">0x44</span><span class="p">,</span> <span class="mh">0x45</span><span class="p">,</span> <span class="mh">0x46</span><span class="p">,</span>
        <span class="mh">0x47</span><span class="p">,</span> <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x49</span><span class="p">,</span> <span class="mh">0x4a</span><span class="p">,</span> <span class="mh">0x4b</span><span class="p">,</span> <span class="mh">0x4c</span><span class="p">,</span> <span class="mh">0x4d</span><span class="p">,</span> <span class="mh">0x4e</span><span class="p">,</span> <span class="mh">0x4f</span><span class="p">,</span> <span class="mh">0x50</span><span class="p">,</span>
        <span class="mh">0x51</span><span class="p">,</span> <span class="mh">0x52</span><span class="p">,</span> <span class="mh">0x53</span><span class="p">,</span> <span class="mh">0x54</span><span class="p">,</span> <span class="mh">0x55</span><span class="p">,</span> <span class="mh">0x56</span><span class="p">,</span> <span class="mh">0x57</span><span class="p">,</span> <span class="mh">0x58</span><span class="p">,</span> <span class="mh">0x59</span><span class="p">,</span> <span class="mh">0x5a</span>
    <span class="p">};</span>
    <span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="o">-</span><span class="n">state</span><span class="p">];</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mh">0x00</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span> <span class="o">?</span> <span class="n">t</span><span class="p">[(</span><span class="n">v</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="mi">63</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="mh">0x2e</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span>  <span class="mi">2</span> <span class="o">?</span> <span class="n">state</span><span class="o">*</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="mh">0x2d</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span>  <span class="mi">1</span> <span class="o">?</span> <span class="n">state</span><span class="o">*</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="nl">default:</span>   <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It typically compiles to under 200 bytes (table included), requires only a
few bytes of memory to operate, and will fit on even the smallest of
microcontrollers. The full source listing, documentation, and
comprehensive test suite:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c">https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c</a></p>

<p>The state machine is trie-shaped, and the 100-byte table <code class="language-plaintext highlighter-rouge">t</code> is the static
<a href="/blog/2016/11/15/">encoding of the Morse code trie</a>:</p>

<p><a href="/img/diagram/morse.dot"><img src="/img/diagram/morse.svg" alt="" /></a></p>

<p>Dots traverse left, dashes right, and terminals emit the character at the
current node (the terminal state). Stopping on a red node, or attempting
to take an unlisted edge, is an error (invalid input).</p>

<p>Each node in the trie is a byte in the table. Dot and dash each have a bit
indicating if their edge exists. The remaining bits index into a 1-based
character table (at the end of <code class="language-plaintext highlighter-rouge">t</code>), and a 0 “index” indicates an empty
(red) node. The nodes themselves are laid out as <a href="https://en.wikipedia.org/wiki/Binary_heap#Heap_implementation">a binary heap in an
array</a>: the left and right children of the node at <code class="language-plaintext highlighter-rouge">i</code> are found at
<code class="language-plaintext highlighter-rouge">i*2+1</code> and <code class="language-plaintext highlighter-rouge">i*2+2</code>. No need to <a href="/blog/2020/10/19/#minimax-costs">waste memory storing edges</a>!</p>
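<p>To make the indexing concrete, here’s a minimal sketch (in Python, using
plain characters rather than the packed byte encoding above) of walking a
Morse trie stored as a binary heap in an array:</p>

```python
# Heap-in-array Morse trie: the children of node i live at i*2+1 (dot)
# and i*2+2 (dash). Empty strings mark red (invalid) nodes; index 0 is
# the root. Only the first three levels are filled in here.
TRIE = ['', 'E', 'T', 'I', 'A', 'N', 'M',
        'S', 'U', 'R', 'W', 'D', 'K', 'G', 'O']

def decode(morse):
    i = 0
    for symbol in morse:
        i = i*2 + (1 if symbol == '.' else 2)
        if i >= len(TRIE) or not TRIE[i]:
            return None  # unlisted edge: invalid input
    return TRIE[i] or None  # the root and red nodes emit nothing

decode('.-')  # 'A' (dot goes left, dash goes right)
decode('--')  # 'M'
```

No edges are stored anywhere; the child positions fall out of the index
arithmetic alone.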

<p>Since C sadly does not have multiple return values, I’m using the sign bit
of the return value to create a kind of sum type. A negative return value
is a state — which is why the state is negated internally before use. A
positive result is a character output. If zero, the input was invalid.
Only the initial state is non-negative (zero), which is fine since it’s,
by definition, not possible to traverse to the initial state. No <code class="language-plaintext highlighter-rouge">c</code> input
will produce a bad state.</p>

<p>In the original problem the terminals were missing. Despite being a <em>state
machine</em>, <code class="language-plaintext highlighter-rouge">morse_decode</code> is a pure function. The caller can save their
position in the trie by saving the state integer and trying different
inputs from that state.</p>
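<p>Here’s a hypothetical Python analogue of that convention (not the C code
above, and with a toy two-level trie): a negative return is a resumable
state, a positive return is a character code, and zero is an error:</p>

```python
# Sign-bit "sum type" sketch: states are negated trie node indices,
# so only the initial state is zero. None marks the end of a letter.
TRIE = ['', 'E', 'T', 'I', 'A', 'N', 'M']

def morse_decode(state, c):
    i = -state  # recover the node index from the negated state
    if c == '.':
        j = i*2 + 1
    elif c == '-':
        j = i*2 + 2
    elif c is None:  # end of letter: emit the current node's character
        return ord(TRIE[i]) if i and TRIE[i] else 0
    else:
        return 0
    return -j if j < len(TRIE) and TRIE[j] else 0

state = morse_decode(0, '.')       # negative: a saved, resumable state
state = morse_decode(state, '-')   # still mid-letter
morse_decode(state, None)          # positive: ord('A')
```

Because the function is pure, the caller can stash any intermediate state
and later resume from it with different inputs.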

<h3 id="utf-8-decoder-state-machine">UTF-8 decoder state machine</h3>

<p>The classic UTF-8 decoder state machine is <a href="https://bjoern.hoehrmann.de/utf-8/decoder/dfa/">Bjoern Hoehrmann’s Flexible
and Economical UTF-8 Decoder</a>. It packs the entire state machine into
a relatively small table using clever tricks. It’s easily my favorite
UTF-8 decoder.</p>

<p>I wanted to try my own hand at it, so I re-derived the same canonical
UTF-8 automaton:</p>

<p><a href="/img/diagram/utf8.dot"><img src="/img/diagram/utf8.svg" alt="" /></a></p>

<p>Then I encoded this diagram directly into a much larger (2,064-byte), less
elegant table, too large to display inline here:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c">https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c</a></p>

<p>However, the trade-off is that the executable code is smaller, faster, and
<a href="/blog/2017/10/06/">branchless again</a> (by accident, I swear!):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">utf8_decode</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">cp</span><span class="p">,</span> <span class="kt">int</span> <span class="n">byte</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">signed</span> <span class="kt">char</span> <span class="n">table</span><span class="p">[</span><span class="mi">8</span><span class="p">][</span><span class="mi">256</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">masks</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="kt">int</span> <span class="n">next</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">byte</span><span class="p">];</span>
    <span class="o">*</span><span class="n">cp</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">cp</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">byte</span> <span class="o">&amp;</span> <span class="n">masks</span><span class="p">[</span><span class="o">!</span><span class="n">state</span><span class="p">][</span><span class="n">next</span><span class="o">&amp;</span><span class="mi">7</span><span class="p">]);</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like Bjoern’s decoder, there’s a code point accumulator. The <em>real</em> state
machine has 1,109,950 terminal states, and many more edges and nodes. The
accumulator is an optimization to track exactly which edge was taken to
which node without having to represent such a monstrosity.</p>

<p>Despite the huge table I’m pretty happy with it.</p>

<h3 id="word-count-state-machine">Word count state machine</h3>

<p>Here’s another state machine I came up with a while back for counting words
one Unicode code point at a time while accounting for Unicode’s various
kinds of whitespace. If your input is bytes, then plug this into the above
UTF-8 state machine to convert bytes to code points! This one uses a
switch instead of a lookup table since the table would be sparse (i.e.
<a href="/blog/2019/12/09/">let the compiler figure it out</a>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* State machine counting words in a sequence of code points.
 *
 * The current word count is the absolute value of the state, so
 * the initial state is zero. Code points are fed into the state
 * machine one at a time, each call returning the next state.
 */</span>
<span class="kt">long</span> <span class="nf">word_count</span><span class="p">(</span><span class="kt">long</span> <span class="n">state</span><span class="p">,</span> <span class="kt">long</span> <span class="n">codepoint</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">codepoint</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mh">0x0009</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000a</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000b</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000c</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000d</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x0020</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x0085</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x00a0</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x1680</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2000</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2001</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2002</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2003</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2004</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2005</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2006</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2007</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2008</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2009</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x200a</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2028</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2029</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x202f</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x205f</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x3000</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">state</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">?</span> <span class="o">-</span><span class="n">state</span> <span class="o">:</span> <span class="n">state</span><span class="p">;</span>
    <span class="nl">default:</span>
        <span class="k">return</span> <span class="n">state</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">?</span> <span class="n">state</span> <span class="o">:</span> <span class="o">-</span><span class="mi">1</span> <span class="o">-</span> <span class="n">state</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’m particularly happy with the <em>edge-triggered</em> state transition
mechanism. The sign of the state tracks whether the “signal” is “high”
(inside of a word) or “low” (outside of a word), and so it counts rising
edges.</p>

<p><a href="/img/diagram/wordcount.dot"><img src="/img/diagram/wordcount.svg" alt="" /></a></p>

<p>The counter is not <em>technically</em> part of the state machine — though it
eventually overflows for practical reasons, it isn’t really “finite” — but
is rather an external count of the times the state machine transitions
from low to high, which is the actual, useful output.</p>
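<p>Driving the machine looks like this (a hypothetical Python harness with a
truncated, ASCII-only whitespace set; the real machine uses the full list
of code points above):</p>

```python
# ASCII whitespace only, for brevity.
WHITESPACE = {0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x20}

def word_count(state, codepoint):
    # Mirrors the C function: abs(state) is the count, and a negative
    # state means the signal is high (inside a word).
    if codepoint in WHITESPACE:
        return -state if state < 0 else state
    return state if state < 0 else -1 - state  # rising edge: count += 1

def count_words(text):
    state = 0
    for ch in text:
        state = word_count(state, ord(ch))
    return abs(state)

count_words('hello  state machine world')  # 4
```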

<p><em>Reader challenge</em>: Find a slick, efficient way to encode all those code
points as a table rather than rely on whatever the compiler generates for
the <code class="language-plaintext highlighter-rouge">switch</code> (chain of branches, jump table?).</p>
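<p>One candidate answer, sketched in Python: store the code points as a
sorted array and binary-search it, which is compact and O(log n) per
lookup regardless of how the compiler would have lowered the switch:</p>

```python
import bisect

# Sorted table of the whitespace code points from the switch above.
WS = [0x0009, 0x000a, 0x000b, 0x000c, 0x000d,
      0x0020, 0x0085, 0x00a0, 0x1680, 0x2000,
      0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
      0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
      0x2028, 0x2029, 0x202f, 0x205f, 0x3000]

def is_space(cp):
    # Binary search for cp; membership iff we land exactly on it.
    i = bisect.bisect_left(WS, cp)
    return i < len(WS) and WS[i] == cp
```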

<h3 id="coroutines-and-generators-as-state-machines">Coroutines and generators as state machines</h3>

<p>In languages that support them, state machines can be implemented using
coroutines, including generators. I do particularly like the idea of
<a href="/blog/2018/05/31/">compiler-synthesized coroutines</a> as state machines, though this is a
rare treat. The state is implicit in the coroutine at each yield, so the
programmer doesn’t have to manage it explicitly. (Though often that
explicit control is powerful!)</p>

<p>Unfortunately in practice it always feels clunky. The following implements
the word count state machine (albeit in a rather un-Pythonic way). The
generator returns the current count and is continued by sending it another
code point:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WHITESPACE</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mh">0x0009</span><span class="p">,</span> <span class="mh">0x000a</span><span class="p">,</span> <span class="mh">0x000b</span><span class="p">,</span> <span class="mh">0x000c</span><span class="p">,</span> <span class="mh">0x000d</span><span class="p">,</span>
    <span class="mh">0x0020</span><span class="p">,</span> <span class="mh">0x0085</span><span class="p">,</span> <span class="mh">0x00a0</span><span class="p">,</span> <span class="mh">0x1680</span><span class="p">,</span> <span class="mh">0x2000</span><span class="p">,</span>
    <span class="mh">0x2001</span><span class="p">,</span> <span class="mh">0x2002</span><span class="p">,</span> <span class="mh">0x2003</span><span class="p">,</span> <span class="mh">0x2004</span><span class="p">,</span> <span class="mh">0x2005</span><span class="p">,</span>
    <span class="mh">0x2006</span><span class="p">,</span> <span class="mh">0x2007</span><span class="p">,</span> <span class="mh">0x2008</span><span class="p">,</span> <span class="mh">0x2009</span><span class="p">,</span> <span class="mh">0x200a</span><span class="p">,</span>
    <span class="mh">0x2028</span><span class="p">,</span> <span class="mh">0x2029</span><span class="p">,</span> <span class="mh">0x202f</span><span class="p">,</span> <span class="mh">0x205f</span><span class="p">,</span> <span class="mh">0x3000</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">def</span> <span class="nf">wordcount</span><span class="p">():</span>
    <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># low signal
</span>            <span class="n">codepoint</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">count</span>
            <span class="k">if</span> <span class="n">codepoint</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">WHITESPACE</span><span class="p">:</span>
                <span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
                <span class="k">break</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># high signal
</span>            <span class="n">codepoint</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">count</span>
            <span class="k">if</span> <span class="n">codepoint</span> <span class="ow">in</span> <span class="n">WHITESPACE</span><span class="p">:</span>
                <span class="k">break</span>
</code></pre></div></div>

<p>However, the generator ceremony dominates the interface, so you’d probably
want to wrap it in something nicer — at which point there’s really no
reason to use the generator in the first place:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wc</span> <span class="o">=</span> <span class="n">wordcount</span><span class="p">()</span>
<span class="nb">next</span><span class="p">(</span><span class="n">wc</span><span class="p">)</span>  <span class="c1"># prime the generator
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">'A'</span><span class="p">))</span>  <span class="c1"># =&gt; 1
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span>  <span class="c1"># =&gt; 1
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">'B'</span><span class="p">))</span>  <span class="c1"># =&gt; 2
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span>  <span class="c1"># =&gt; 2
</span></code></pre></div></div>
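<p>For comparison, here’s a sketch of the non-generator alternative such a
wrapper would collapse into, with the state held in two explicit fields
and no priming step (again with a truncated whitespace set):</p>

```python
WHITESPACE = {0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x20}  # ASCII subset

class WordCounter:
    # Explicit-state alternative to the generator: in_word is the
    # high/low signal, count is the number of rising edges seen.
    def __init__(self):
        self.count = 0
        self.in_word = False

    def feed(self, codepoint):
        if codepoint in WHITESPACE:
            self.in_word = False
        elif not self.in_word:
            self.in_word = True
            self.count += 1  # rising edge: a new word began
        return self.count

wc = WordCounter()
[wc.feed(ord(c)) for c in 'A B']  # final count: 2
```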

<p>Same idea in Lua, which famously has full coroutines:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">local</span> <span class="n">WHITESPACE</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">[</span><span class="mh">0x0009</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000a</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000b</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000c</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x000d</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x0020</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x0085</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x00a0</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x1680</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2000</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2001</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2002</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2003</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2004</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2005</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2006</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2007</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2008</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2009</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x200a</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2028</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2029</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x202f</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x205f</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x3000</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span>
<span class="p">}</span>

<span class="k">function</span> <span class="nf">wordcount</span><span class="p">()</span>
    <span class="kd">local</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
        <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
            <span class="c1">-- low signal</span>
            <span class="kd">local</span> <span class="n">codepoint</span> <span class="o">=</span> <span class="nb">coroutine.yield</span><span class="p">(</span><span class="n">count</span><span class="p">)</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">WHITESPACE</span><span class="p">[</span><span class="n">codepoint</span><span class="p">]</span> <span class="k">then</span>
                <span class="n">count</span> <span class="o">=</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span>
                <span class="k">break</span>
            <span class="k">end</span>
        <span class="k">end</span>
        <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
            <span class="c1">-- high signal</span>
            <span class="kd">local</span> <span class="n">codepoint</span> <span class="o">=</span> <span class="nb">coroutine.yield</span><span class="p">(</span><span class="n">count</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">WHITESPACE</span><span class="p">[</span><span class="n">codepoint</span><span class="p">]</span> <span class="k">then</span>
                <span class="k">break</span>
            <span class="k">end</span>
        <span class="k">end</span>
    <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Except for initially priming the coroutine, at least <code class="language-plaintext highlighter-rouge">coroutine.wrap()</code>
hides the fact that it’s a coroutine.</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wc</span> <span class="o">=</span> <span class="nb">coroutine.wrap</span><span class="p">(</span><span class="n">wordcount</span><span class="p">)</span>
<span class="n">wc</span><span class="p">()</span>  <span class="c1">-- prime the coroutine</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">'A'</span><span class="p">))</span>  <span class="c1">-- =&gt; 1</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>  <span class="c1">-- =&gt; 1</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">'B'</span><span class="p">))</span>  <span class="c1">-- =&gt; 2</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>  <span class="c1">-- =&gt; 2</span>
</code></pre></div></div>

<h3 id="extra-examples">Extra examples</h3>

<p>Finally, a couple more examples not worth describing in detail here. First
a Unicode case folding state machine:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/misc/casefold.c">https://github.com/skeeto/scratch/blob/master/misc/casefold.c</a></p>

<p>It’s just an interface to do a lookup into the <a href="https://www.unicode.org/Public/13.0.0/ucd/CaseFolding.txt">official case folding
table</a>. It was an experiment, and I <em>probably</em> wouldn’t use it in a
real program.</p>

<p>Second, I’ve mentioned <a href="https://github.com/skeeto/utf-7">my UTF-7 encoder and decoder</a> before. It’s
not obvious from the interface, but internally it’s just a state machine
for both encoder and decoder, which is what allows it to “pause”
between any pair of input/output bytes.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Asynchronously Opening and Closing Files in Asyncio</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/09/04/"/>
    <id>urn:uuid:ae94da45-f65d-4c72-a10e-9e421ea843ec</id>
    <updated>2020-09-04T01:36:20Z</updated>
    <category term="c"/><category term="linux"/><category term="python"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p>Python <a href="https://docs.python.org/3/library/asyncio.html">asyncio</a> has support for asynchronous networking,
subprocesses, and interprocess communication. However, it has nothing
for asynchronous file operations — opening, reading, writing, or
closing. This is likely in part because operating systems themselves
also lack these facilities. If a file operation takes a long time,
perhaps because the file is on a network mount, then the entire Python
process will hang. It’s possible to work around this, so let’s build a
utility that can asynchronously open and close files.</p>

<p>The usual way to work around the lack of operating system support for a
particular asynchronous operation is to <a href="http://docs.libuv.org/en/v1.x/design.html#file-i-o">dedicate threads to waiting on
those operations</a>. By using a thread pool, we can even avoid the
overhead of spawning threads when we need them. Plus asyncio is designed
to play nicely with thread pools anyway.</p>
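<p>The core of the trick fits in a few lines. A hypothetical sketch using
asyncio’s default executor (not the <code class="language-plaintext highlighter-rouge">aopen()</code> built out below):</p>

```python
import asyncio

async def open_in_thread(path, mode='r'):
    # Run the blocking open() on the default thread pool so the event
    # loop keeps servicing other tasks while the call is stuck.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, open, path, mode)
```

Any other slow file operation — read, write, close — can be pushed
through <code class="language-plaintext highlighter-rouge">run_in_executor()</code> the same way.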

<h3 id="test-setup">Test setup</h3>

<p>Before we get started, we’ll need some way to test that it’s working. We
need a slow file system. One thought is to <a href="/blog/2018/06/23/">use ptrace to intercept the
relevant system calls</a>, though this isn’t quite so simple. The
other threads need to continue running while the thread waiting on
<code class="language-plaintext highlighter-rouge">open(2)</code> is paused, but ptrace pauses the whole process. Fortunately
there’s a simpler solution anyway: <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code>.</p>

<p>Setting the <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code> environment variable to the name of a shared
object will cause the loader to load this shared object ahead of
everything else, allowing that shared object to override other
libraries. I’m on x86-64 Linux (Debian), and so I’m looking to override
<code class="language-plaintext highlighter-rouge">open64(2)</code> in glibc. Here’s my <code class="language-plaintext highlighter-rouge">open64.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include</span> <span class="cpf">&lt;dlfcn.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span>
<span class="nf">open64</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">int</span> <span class="n">mode</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">"/tmp/"</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">)</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">RTLD_NEXT</span><span class="p">,</span> <span class="s">"open64"</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">f</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">mode</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now Python must go through my C function when it opens files. If the
file resides anywhere under <code class="language-plaintext highlighter-rouge">/tmp/</code>, opening the file will be delayed by 3
seconds. Since I still want to actually open a file, I use <code class="language-plaintext highlighter-rouge">dlsym()</code> to
access the <em>real</em> <code class="language-plaintext highlighter-rouge">open64()</code> in glibc. I build it like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -fPIC -o open64.so open64.c -ldl
</code></pre></div></div>

<p>And to test that it works with Python, let’s time how long it takes to
open <code class="language-plaintext highlighter-rouge">/tmp/x</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ touch /tmp/x
$ time LD_PRELOAD=./open64.so python3 -c 'open("/tmp/x")'

real    0m3.021s
user    0m0.014s
sys     0m0.005s
</code></pre></div></div>

<p>Perfect! (Note: It’s a little strange putting <code class="language-plaintext highlighter-rouge">time</code> <em>before</em> setting the
environment variable, but that’s because I’m using Bash, where <code class="language-plaintext highlighter-rouge">time</code> is
special: it’s the shell’s version of the command.)</p>

<h3 id="thread-pools">Thread pools</h3>

<p>Python’s standard <code class="language-plaintext highlighter-rouge">open()</code> is most commonly used as a <em>context manager</em>
so that the file is automatically closed no matter what happens.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'output.txt'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'hello world'</span><span class="p">,</span> <span class="nb">file</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
</code></pre></div></div>

<p>I’d like my asynchronous open to follow this pattern using <a href="https://www.python.org/dev/peps/pep-0492/"><code class="language-plaintext highlighter-rouge">async
with</code></a>. It’s like <code class="language-plaintext highlighter-rouge">with</code>, but the context manager is acquired and
released asynchronously. I’ll call my version <code class="language-plaintext highlighter-rouge">aopen()</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">with</span> <span class="n">aopen</span><span class="p">(</span><span class="s">'output.txt'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
    <span class="p">...</span>
</code></pre></div></div>

<p>So <code class="language-plaintext highlighter-rouge">aopen()</code> will need to return an <em>asynchronous context manager</em>, an
object with methods <code class="language-plaintext highlighter-rouge">__aenter__</code> and <code class="language-plaintext highlighter-rouge">__aexit__</code> that both return
<a href="https://docs.python.org/3/glossary.html#term-awaitable"><em>awaitables</em></a>. Usually this is by virtue of these methods being
<a href="https://docs.python.org/3/glossary.html#term-coroutine-function"><em>coroutine functions</em></a>, but a normal function that directly returns
an awaitable also works, which is what I’ll be doing for <code class="language-plaintext highlighter-rouge">__aenter__</code>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">_AsyncOpen</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">):</span>
        <span class="p">...</span>

    <span class="k">def</span> <span class="nf">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="p">...</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exc_type</span><span class="p">,</span> <span class="n">exc</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="p">...</span>
</code></pre></div></div>

<p>Ultimately we have to call <code class="language-plaintext highlighter-rouge">open()</code>. The arguments for <code class="language-plaintext highlighter-rouge">open()</code> will be
given to the constructor to be used later. This will make more sense
when you see the definition for <code class="language-plaintext highlighter-rouge">aopen()</code>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_args</span> <span class="o">=</span> <span class="n">args</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_kwargs</span> <span class="o">=</span> <span class="n">kwargs</span>
</code></pre></div></div>

<p>When it’s time to actually open the file, Python will call <code class="language-plaintext highlighter-rouge">__aenter__</code>.
We can’t call <code class="language-plaintext highlighter-rouge">open()</code> directly since that will block, so we’ll use a
thread pool to wait on it. Rather than create a thread pool, we’ll use
the one that comes with the current event loop. The <code class="language-plaintext highlighter-rouge">run_in_executor()</code>
method runs a function in a thread pool — where <code class="language-plaintext highlighter-rouge">None</code> means use the
default pool — returning an asyncio future representing the future
result, in this case the opened file object.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">def</span> <span class="nf">thread_open</span><span class="p">():</span>
            <span class="k">return</span> <span class="nb">open</span><span class="p">(</span><span class="o">*</span><span class="bp">self</span><span class="p">.</span><span class="n">_args</span><span class="p">,</span> <span class="o">**</span><span class="bp">self</span><span class="p">.</span><span class="n">_kwargs</span><span class="p">)</span>
        <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_future</span> <span class="o">=</span> <span class="n">loop</span><span class="p">.</span><span class="n">run_in_executor</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">thread_open</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_future</span>
</code></pre></div></div>

<p>Since this <code class="language-plaintext highlighter-rouge">__aenter__</code> is not a coroutine function, it returns the
future directly as its awaitable result. The caller will await it.</p>

<p>The default thread pool is limited in size (as of Python 3.8, to
<code class="language-plaintext highlighter-rouge">min(32, os.cpu_count() + 4)</code> threads), which I suppose is a sensible
default for CPU-bound operations but not for I/O-bound operations like
these. In a real program we may want to use a larger thread pool.</p>
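<p>If the default pool is a poor fit, the loop’s default executor can be swapped out. A minimal sketch (the pool size 64 is an arbitrary illustration, not a value from the original program):</p>

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def main():
    # Install a larger pool as the loop's default executor so that
    # run_in_executor(None, ...) can service many blocking calls at once.
    loop = asyncio.get_event_loop()
    loop.set_default_executor(ThreadPoolExecutor(max_workers=64))
    # Any blocking function now rides on the bigger pool.
    return await loop.run_in_executor(None, lambda: sum(range(10)))

print(asyncio.run(main()))  # prints 45
```

<p>Every <code class="language-plaintext highlighter-rouge">run_in_executor(None, ...)</code> call in the program then shares this larger pool, with no other changes required.</p>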

<p>Closing a file may block, so we’ll do that in a thread pool as well.
First pull the file object <a href="/blog/2020/07/30/">from the future</a>, then close it in the
thread pool, waiting until the file has actually closed:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">async</span> <span class="k">def</span> <span class="nf">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exc_type</span><span class="p">,</span> <span class="n">exc</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="nb">file</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_future</span>
        <span class="k">def</span> <span class="nf">thread_close</span><span class="p">():</span>
            <span class="nb">file</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
        <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
        <span class="k">await</span> <span class="n">loop</span><span class="p">.</span><span class="n">run_in_executor</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">thread_close</span><span class="p">)</span>
</code></pre></div></div>

<p>The open and close are paired in this context manager, but it may be
concurrent with an arbitrary number of other <code class="language-plaintext highlighter-rouge">_AsyncOpen</code> context
managers. There will be some upper limit to the number of open files, so
<strong>we need to be careful not to use too many of these things
concurrently</strong>, something <a href="/blog/2020/05/24/">which easily happens when using unbounded
queues</a>. Lacking back pressure, all it takes is for tasks to be
opening files slightly faster than they close them.</p>
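<p>One way to add that back pressure is a counting semaphore capping how many tasks may hold a file open at once. A self-contained sketch, with a deliberately tiny cap and a sleep standing in for the slow open/write/close:</p>

```python
import asyncio

async def main():
    MAX_OPEN = 2  # small for demonstration; tune for your descriptor limit
    sem = asyncio.Semaphore(MAX_OPEN)
    state = {'active': 0, 'peak': 0}

    async def task(i):
        async with sem:  # wait for a free slot before "opening"
            state['active'] += 1
            state['peak'] = max(state['peak'], state['active'])
            await asyncio.sleep(0.01)  # stands in for slow open/write/close
            state['active'] -= 1

    await asyncio.gather(*(task(i) for i in range(10)))
    print('peak concurrency:', state['peak'])  # prints: peak concurrency: 2

asyncio.run(main())
```

<p>In the real program the <code class="language-plaintext highlighter-rouge">async with sem:</code> would wrap the <code class="language-plaintext highlighter-rouge">async with aopen(...)</code> block, so excess tasks queue on the semaphore instead of exhausting file descriptors.</p>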

<p>With all the hard work done, the definition for <code class="language-plaintext highlighter-rouge">aopen()</code> is trivial:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">aopen</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">_AsyncOpen</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>

<p>That’s it! Let’s try it out with the <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code> test.</p>

<h3 id="a-test-drive">A test drive</h3>

<p>First define a “heartbeat” task that will tell us the asyncio loop is
still chugging away while we wait on opening the file.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">heartbeat</span><span class="p">():</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">'HEARTBEAT'</span><span class="p">)</span>
</code></pre></div></div>

<p>Here’s a test function for <code class="language-plaintext highlighter-rouge">aopen()</code> that asynchronously opens a file
under <code class="language-plaintext highlighter-rouge">/tmp/</code> named by an integer, (synchronously) writes that integer
to the file, then asynchronously closes it.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">write</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">aopen</span><span class="p">(</span><span class="sa">f</span><span class="s">'/tmp/</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="nb">file</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">main()</code> function creates the heartbeat task and opens 4 files
concurrently through the intercepted file-opening routine:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">beat</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">heartbeat</span><span class="p">())</span>
    <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">write</span><span class="p">(</span><span class="n">i</span><span class="p">))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span>
    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
    <span class="n">beat</span><span class="p">.</span><span class="n">cancel</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>The result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ LD_PRELOAD=./open64.so python3 aopen.py
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
$ cat /tmp/{0,1,2,3}
0
1
2
3
</code></pre></div></div>

<p>As expected, 6 heartbeats corresponding to the 3 seconds that all 4 tasks
spent concurrently waiting on the intercepted <code class="language-plaintext highlighter-rouge">open()</code>. Here’s the full
source if you want to try it out for yourself:</p>

<p><a href="https://gist.github.com/skeeto/89af673a0a0d24de32ad19ee505c8dbd">https://gist.github.com/skeeto/89af673a0a0d24de32ad19ee505c8dbd</a></p>

<h3 id="caveat-no-asynchronous-reads-and-writes">Caveat: no asynchronous reads and writes</h3>

<p><em>Only</em> opening and closing the file is asynchronous. Reads and writes are
unchanged, still fully synchronous and blocking, so this is only a half
solution. A full solution is not nearly as simple because asyncio is
async/await. Asynchronous reads and writes would require all new APIs
<a href="https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/">with different coloring</a>. You’d need an <code class="language-plaintext highlighter-rouge">aprint()</code> to complement
<code class="language-plaintext highlighter-rouge">print()</code>, and so on, each returning an <code class="language-plaintext highlighter-rouge">awaitable</code> to be awaited.</p>
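<p>For illustration, here’s what such a wrapper might look like. <code class="language-plaintext highlighter-rouge">aprint()</code> is a hypothetical name, not a real API; it just pushes the blocking <code class="language-plaintext highlighter-rouge">print()</code> into the same default thread pool:</p>

```python
import asyncio

async def aprint(*args, **kwargs):
    # Hypothetical differently-colored print(): run the blocking
    # print() call in the event loop's default thread pool.
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(None, lambda: print(*args, **kwargs))

asyncio.run(aprint('hello world'))  # prints: hello world
```
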

<p>This is one of the unfortunate downsides of async/await. I strongly
prefer conventional, preemptive concurrency, <em>but</em> we don’t always have
that luxury.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Conventions for Command Line Options</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/08/01/"/>
    <id>urn:uuid:9be2ce0e-298e-4085-8789-49674aecfeeb</id>
    <updated>2020-08-01T00:34:23Z</updated>
    <category term="tutorial"/><category term="posix"/><category term="c"/><category term="python"/><category term="go"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=24020952">on Hacker News</a> and critiqued <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/MyOptionsConventions">on
Wandering Thoughts</a> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/UnixOptionsConventions">2</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/python/ArgparseSomeUnixNotes">3</a>).</em></p>

<p>Command line interfaces have varied throughout their brief history but
have largely converged to some common, sound conventions. The core
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap12.html">originates from unix</a>, and the Linux ecosystem extended it,
particularly via the GNU project. Unfortunately some tools initially
<em>appear</em> to follow the conventions, but subtly get them wrong, usually
for no practical benefit. I believe in many cases the authors simply
didn’t know any better, so I’d like to review the conventions.</p>

<!--more-->

<h3 id="short-options">Short Options</h3>

<p>The simplest case is the <em>short option</em> flag. An option is a hyphen —
specifically HYPHEN-MINUS U+002D — followed by one alphanumeric
character. Capital letters are acceptable. The letters themselves <a href="http://www.catb.org/~esr/writings/taoup/html/ch10s05.html">have
conventional meanings</a> and are worth following if possible.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -c
</code></pre></div></div>

<p>Flags can be grouped together into one program argument. This is both
convenient and unambiguous. It’s also one of those often missed details
when programs use hand-coded argument parsers, and the lack of support
irritates me.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -abc
program -acb
</code></pre></div></div>

<p>The next simplest case is short options that take arguments. The
argument follows the option.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -i input.txt -o output.txt
</code></pre></div></div>

<p>The space is optional, so the option and argument can be packed together
into one program argument. Since the argument is required, this is still
unambiguous. This is another often-missed feature in hand-coded parsers.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -iinput.txt -ooutput.txt
</code></pre></div></div>

<p>This does not prohibit grouping. When grouped, the option accepting an
argument must be last.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -abco output.txt
program -abcooutput.txt
</code></pre></div></div>

<p>This technique is used to create another category, <em>optional option
arguments</em>. The option’s argument can be optional but still unambiguous
so long as the space is always omitted when the argument is present.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -c       # omitted
program -cblue   # provided
program -c blue  # omitted (blue is a new argument)

program -c -x   # two separate flags
program -c-x    # -c with argument "-x"
</code></pre></div></div>

<p>Optional option arguments should be used judiciously since they can be
surprising, but they have their uses.</p>

<p>Options can typically appear in any order — something parsers often
achieve via <em>permutation</em> — but non-options typically follow options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b foo bar
program -b -a foo bar
</code></pre></div></div>

<p>GNU-style programs usually allow options and non-options to be mixed,
though I don’t consider this to be essential.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a foo -b bar
program foo -a -b bar
program foo bar -a -b
</code></pre></div></div>

<p>If a non-option looks like an option because it starts with a hyphen,
use <code class="language-plaintext highlighter-rouge">--</code> to demarcate options from non-options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -- -x foo bar
</code></pre></div></div>

<p>An advantage of requiring that non-options follow options is that the
first non-option demarcates the two groups, so <code class="language-plaintext highlighter-rouge">--</code> is less often
needed.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># note: without argument permutation
program -a -b foo -x bar  # 2 options, 3 non-options
</code></pre></div></div>
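<p>Python’s standard <code class="language-plaintext highlighter-rouge">getopt</code> module implements these short option rules, including grouping and packed arguments, so it’s a convenient way to experiment with them:</p>

```python
import getopt

# 'abco:' declares flags -a, -b, -c, plus -o which requires an argument.
opts, args = getopt.getopt(['-abco', 'output.txt', 'foo'], 'abco:')
print(opts)  # [('-a', ''), ('-b', ''), ('-c', ''), ('-o', 'output.txt')]
print(args)  # ['foo']
```
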

<h3 id="long-options">Long options</h3>

<p>Since short options can be cryptic, and there are such a limited number
of them, more complex programs support long options. A long option
starts with two hyphens followed by one or more alphanumeric, lowercase
words. Hyphens separate words. Using two hyphens prevents long options
from being confused for grouped short options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --reverse --ignore-backups
</code></pre></div></div>

<p>Occasionally flags are paired with a mutually exclusive inverse flag
that begins with <code class="language-plaintext highlighter-rouge">--no-</code>. This avoids a future <em>flag day</em> where the
default is changed in the release that also adds the flag implementing
the original behavior.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --sort
program --no-sort
</code></pre></div></div>

<p>Long options can similarly accept arguments.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --output output.txt --block-size 1024
</code></pre></div></div>

<p>These may optionally be connected to the argument with an equals sign
<code class="language-plaintext highlighter-rouge">=</code>, much like omitting the space for a short option argument.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --output=output.txt --block-size=1024
</code></pre></div></div>

<p>Like before, this opens the door to optional option arguments. Due
to the required <code class="language-plaintext highlighter-rouge">=</code> this is still unambiguous.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --color --reverse
program --color=never --reverse
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">--</code> retains its original behavior of disambiguating option-like
non-option arguments:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --reverse -- --foo bar
</code></pre></div></div>
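<p>The <code class="language-plaintext highlighter-rouge">getopt</code> module handles the long option conventions, too: the <code class="language-plaintext highlighter-rouge">=</code> form and the <code class="language-plaintext highlighter-rouge">--</code> terminator behave as described above:</p>

```python
import getopt

# A trailing '=' in a long option name marks a required argument.
longopts = ['reverse', 'output=']
argv = ['--output=out.txt', '--reverse', '--', '--foo', 'bar']
opts, args = getopt.getopt(argv, '', longopts)
print(opts)  # [('--output', 'out.txt'), ('--reverse', '')]
print(args)  # ['--foo', 'bar']
```
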

<h3 id="subcommands">Subcommands</h3>

<p>Some programs, such as Git, have subcommands each with their own
options. The main program itself may still have its own options distinct
from subcommand options. The program’s options come before the
subcommand and subcommand options follow the subcommand. Options are
never permuted around the subcommand.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -c subcommand -x -y -z
program -abc subcommand -xyz
</code></pre></div></div>

<p>Above, the <code class="language-plaintext highlighter-rouge">-a</code>, <code class="language-plaintext highlighter-rouge">-b</code>, and <code class="language-plaintext highlighter-rouge">-c</code> options are for <code class="language-plaintext highlighter-rouge">program</code>, and the
others are for <code class="language-plaintext highlighter-rouge">subcommand</code>. So, really, the subcommand is another
command line of its own.</p>
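<p>A sketch of how a parser might honor this split (the helper below is hypothetical): parse the program’s own options, take the first non-option as the subcommand, and leave everything after it for the subcommand’s parser. <code class="language-plaintext highlighter-rouge">getopt</code> works well here because it stops at the first non-option rather than permuting:</p>

```python
import getopt

def split_command_line(argv):
    # getopt stops at the first non-option, which we treat as the
    # subcommand; the remainder is the subcommand's own command line,
    # to be parsed separately with its own option set.
    opts, rest = getopt.getopt(argv, 'abc')
    if not rest:
        raise SystemExit('missing subcommand')
    return opts, rest[0], rest[1:]

opts, sub, subargs = split_command_line(['-a', '-b', 'commit', '-m', 'msg'])
print(opts)     # [('-a', ''), ('-b', '')]
print(sub)      # commit
print(subargs)  # ['-m', 'msg']
```
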

<h3 id="option-parsing-libraries">Option parsing libraries</h3>

<p>There’s little excuse for not getting these conventions right assuming
you’re interested in following the conventions. Short options can be
parsed correctly in <a href="https://github.com/skeeto/getopt">just ~60 lines of C code</a>. Long options are
<a href="https://github.com/skeeto/optparse">just slightly more complex</a>.</p>

<p>GNU’s <code class="language-plaintext highlighter-rouge">getopt_long()</code> supports long option abbreviation — with no way to
disable it (!) — but <a href="https://utcc.utoronto.ca/~cks/space/blog/python/ArgparseAbbreviatedOptions">this should be avoided</a>.</p>

<p>Go’s <a href="https://golang.org/pkg/flag/">flag package</a> intentionally deviates from the conventions.
It only supports long option semantics, via a single hyphen. This makes
it impossible to support grouping even if all options are only one
letter. Also, the only way to combine option and argument into a single
command line argument is with <code class="language-plaintext highlighter-rouge">=</code>. It’s sound, but I miss both features
every time I write programs in Go. That’s why I <a href="https://github.com/skeeto/optparse-go">wrote my own argument
parser</a>. Not only does it have a nicer feature set, I like the API a
lot more, too.</p>

<p>Python’s primary option parsing library is <code class="language-plaintext highlighter-rouge">argparse</code>, and I just can’t
stand it. Despite appearing to follow convention, it actually breaks
convention <em>and</em> its behavior is unsound. For instance, the following
program has two options, <code class="language-plaintext highlighter-rouge">--foo</code> and <code class="language-plaintext highlighter-rouge">--bar</code>. The <code class="language-plaintext highlighter-rouge">--foo</code> option accepts
an optional argument, and the <code class="language-plaintext highlighter-rouge">--bar</code> option is a simple flag.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">sys</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--foo'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">nargs</span><span class="o">=</span><span class="s">'?'</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s">'X'</span><span class="p">)</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--bar'</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s">'store_true'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">:]))</span>
</code></pre></div></div>

<p>Here are some example runs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py
Namespace(bar=False, foo='X')

$ python parse.py --foo
Namespace(bar=False, foo=None)

$ python parse.py --foo=arg
Namespace(bar=False, foo='arg')

$ python parse.py --bar --foo
Namespace(bar=True, foo=None)

$ python parse.py --foo arg
Namespace(bar=False, foo='arg')
</code></pre></div></div>

<p>Everything looks good except the last. If the <code class="language-plaintext highlighter-rouge">--foo</code> argument is
optional then why did it consume <code class="language-plaintext highlighter-rouge">arg</code>? What happens if I follow it with
<code class="language-plaintext highlighter-rouge">--bar</code>? Will it consume it as the argument?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py --foo --bar
Namespace(bar=True, foo=None)
</code></pre></div></div>

<p>Nope! Unlike <code class="language-plaintext highlighter-rouge">arg</code>, it left <code class="language-plaintext highlighter-rouge">--bar</code> alone, so instead of following the
unambiguous conventions, it has its own ambiguous semantics and attempts
to remedy them with a “smart” heuristic: “If an optional argument <em>looks
like</em> an option, then it must be an option!” Non-option arguments can
never follow an option with an optional argument, which makes that
feature pretty useless. Since <code class="language-plaintext highlighter-rouge">argparse</code> does not properly support <code class="language-plaintext highlighter-rouge">--</code>,
that does not help.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py --foo -- arg
usage: parse.py [-h] [--foo [FOO]] [--bar]
parse.py: error: unrecognized arguments: -- arg
</code></pre></div></div>

<p>Please, stick to the conventions unless you have <em>really</em> good reasons
to break them!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Exactly-Once Initialization in Asynchronous Python</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/07/30/"/>
    <id>urn:uuid:c6796958-9178-47be-8411-8f48c2c85d83</id>
    <updated>2020-07-30T23:39:12Z</updated>
    <category term="python"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=24007354">on Hacker News</a>.</em></p>

<p>A common situation in <a href="https://docs.python.org/3/library/asyncio.html">asyncio</a> Python programs is asynchronous
initialization. Some resource must be initialized exactly once before it
can be used, but the initialization itself is asynchronous — such as an
<a href="https://github.com/MagicStack/asyncpg">asyncpg</a> database. Let’s talk about a couple of solutions.</p>

<!--more-->

<p>The naive “solution” would be to track the initialization state in a
variable:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">initialized</span> <span class="o">=</span> <span class="bp">False</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">one_time_setup</span><span class="p">():</span>
    <span class="s">"Do not call more than once!"</span>
    <span class="p">...</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">maybe_initialize</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">initialized</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">initialized</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">one_time_setup</span><span class="p">()</span>
        <span class="n">initialized</span> <span class="o">=</span> <span class="bp">True</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">initialized</code> flag exists because we expect the function to be
called more than once. However, if it might be called from concurrent
tasks there’s a <em>race condition</em>. If the second caller arrives while the
first is awaiting <code class="language-plaintext highlighter-rouge">one_time_setup()</code>, the function will be called a
second time.</p>
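<p>The race is easy to demonstrate. In this self-contained sketch, two concurrent tasks call the naive <code class="language-plaintext highlighter-rouge">maybe_initialize()</code> while <code class="language-plaintext highlighter-rouge">one_time_setup()</code> is slow, and the setup runs twice:</p>

```python
import asyncio

initialized = False
calls = 0

async def one_time_setup():
    global calls
    calls += 1
    await asyncio.sleep(0.01)  # slow, asynchronous initialization

async def maybe_initialize():
    global initialized
    if not initialized:
        await one_time_setup()  # a second caller can arrive during this await
        initialized = True

async def main():
    await asyncio.gather(maybe_initialize(), maybe_initialize())
    print('one_time_setup() calls:', calls)  # prints: one_time_setup() calls: 2

asyncio.run(main())
```
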

<p>Switching the order of the call and the assignment won’t help:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">maybe_initialize</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">initialized</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">initialized</span><span class="p">:</span>
        <span class="n">initialized</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="k">await</span> <span class="n">one_time_setup</span><span class="p">()</span>
</code></pre></div></div>

<p>Since asyncio is cooperative, the first caller doesn’t give up control
to other tasks until the <code class="language-plaintext highlighter-rouge">await</code>, meaning <code class="language-plaintext highlighter-rouge">one_time_setup()</code> will
never be called twice. However, the second caller may return before
<code class="language-plaintext highlighter-rouge">one_time_setup()</code> has completed. What we want is for <code class="language-plaintext highlighter-rouge">one_time_setup()</code>
to be called exactly once, but for no caller to return until that call
has completed.</p>

<h3 id="mutual-exclusion">Mutual exclusion</h3>

<p>My first thought was to use a <a href="https://docs.python.org/3/library/asyncio-sync.html#lock">mutex lock</a>. This will protect the
variable <em>and</em> prevent follow-up callers from progressing too soon. Tasks
arriving while <code class="language-plaintext highlighter-rouge">one_time_setup()</code> is still running will block on the
lock.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">initialized</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">initialized_lock</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Lock</span><span class="p">()</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">maybe_initialize</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">initialized</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">initialized_lock</span><span class="p">:</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">initialized</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">one_time_setup</span><span class="p">()</span>
            <span class="n">initialized</span> <span class="o">=</span> <span class="bp">True</span>
</code></pre></div></div>

<p>Unfortunately this has a serious downside: <strong>asyncio locks are
associated with the <a href="https://docs.python.org/3/library/asyncio-eventloop.html">loop</a> where they were created</strong>. Since the
lock variable is global, <code class="language-plaintext highlighter-rouge">maybe_initialize()</code> can only be called from
the same loop that loaded the module. <code class="language-plaintext highlighter-rouge">asyncio.run()</code> creates a new loop
so it’s incompatible.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create a loop: always an error
</span><span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">maybe_initialize</span><span class="p">())</span>

<span class="c1"># reuse the loop: maybe an error
</span><span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
</span><span class="n">loop</span><span class="p">.</span><span class="n">run_until_complete</span><span class="p">(</span><span class="n">maybe_initialize</span><span class="p">())</span>
</code></pre></div></div>

<p>(IMHO, it was a mistake for the asyncio API to include explicit loop
objects. It’s a low-level concept that unavoidably leaks through most
high-level abstractions.)</p>

<p>A workaround is to create the lock lazily. Thank goodness creating a
lock isn’t itself asynchronous!</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">initialized</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">initialized_lock</span> <span class="o">=</span> <span class="bp">None</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">maybe_initialize</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">initialized</span><span class="p">,</span> <span class="n">initialized_lock</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">initialized_lock</span><span class="p">:</span>
        <span class="n">initialized_lock</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Lock</span><span class="p">()</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">initialized_lock</span><span class="p">:</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">initialized</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">one_time_setup</span><span class="p">()</span>
            <span class="n">initialized</span> <span class="o">=</span> <span class="bp">True</span>
</code></pre></div></div>

<p>This is better, but <code class="language-plaintext highlighter-rouge">maybe_initialize()</code> can still only ever be called
from a single loop.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">maybe_initialize</span><span class="p">())</span> <span class="c1"># ok
</span><span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">maybe_initialize</span><span class="p">())</span> <span class="c1"># error!
</span></code></pre></div></div>

<h3 id="once">Once</h3>

<p>The pthreads API provides <a href="https://pubs.opengroup.org/onlinepubs/007908799/xsh/pthread_once.html"><code class="language-plaintext highlighter-rouge">pthread_once</code></a> to solve this problem.
C++11 similarly has <a href="https://en.cppreference.com/w/cpp/thread/call_once"><code class="language-plaintext highlighter-rouge">std::call_once</code></a>. We can build something
similar using a future-like object.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">future</span> <span class="o">=</span> <span class="bp">None</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">maybe_initialize</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">future</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">future</span><span class="p">:</span>
        <span class="n">future</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">one_time_setup</span><span class="p">())</span>
    <span class="k">await</span> <span class="n">future</span>
</code></pre></div></div>

<p>Awaiting a coroutine more than once is an error, but <a href="https://docs.python.org/3/library/asyncio-task.html#task-object">tasks</a> are
future-like objects and can be awaited more than once. At least on
CPython, they can also be awaited in other loops! So not only is this
simpler, it also solves the loop problem!</p>
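<p>The difference is easy to demonstrate. This sketch (names are mine,
not from the text above) awaits the same bare coroutine twice, which
raises, then awaits the same task twice, which simply hands back the
cached result:</p>

```python
import asyncio

async def setup():
    await asyncio.sleep(0)
    return 42

async def coroutine_twice():
    coro = setup()
    await coro
    try:
        await coro  # a bare coroutine cannot be awaited again
    except RuntimeError as e:
        return str(e)

async def task_twice():
    task = asyncio.create_task(setup())
    a = await task  # runs the coroutine to completion
    b = await task  # a finished task just returns its stored result
    return a, b

print(asyncio.run(coroutine_twice()))
print(asyncio.run(task_twice()))  # (42, 42)
```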

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">maybe_initialize</span><span class="p">())</span> <span class="c1"># ok
</span><span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">maybe_initialize</span><span class="p">())</span> <span class="c1"># still ok
</span></code></pre></div></div>

<p>This can be tidied up nicely in a <code class="language-plaintext highlighter-rouge">@once</code> decorator:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">once</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
    <span class="n">future</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">once_wrapper</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="k">nonlocal</span> <span class="n">future</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">future</span><span class="p">:</span>
            <span class="n">future</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">func</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">))</span>
        <span class="k">return</span> <span class="k">await</span> <span class="n">future</span>
    <span class="k">return</span> <span class="n">once_wrapper</span>
</code></pre></div></div>

<p>No more need for <code class="language-plaintext highlighter-rouge">maybe_initialize()</code>, just decorate the original
<code class="language-plaintext highlighter-rouge">one_time_setup()</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">once</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">one_time_setup</span><span class="p">():</span>
    <span class="p">...</span>
</code></pre></div></div>
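<p>A quick check that the decorator behaves as intended — this repeats
the <code class="language-plaintext highlighter-rouge">@once</code> definition from above and applies it to a hypothetical
counting setup function, then hits it from ten concurrent callers:</p>

```python
import asyncio

def once(func):
    future = None
    async def once_wrapper(*args, **kwargs):
        nonlocal future
        if not future:
            future = asyncio.create_task(func(*args, **kwargs))
        return await future
    return once_wrapper

call_count = 0

@once
async def one_time_setup():
    global call_count
    call_count += 1
    await asyncio.sleep(0.01)  # simulated async initialization

async def main():
    # Ten concurrent callers: none return before setup completes,
    # and the body runs exactly once.
    await asyncio.gather(*[one_time_setup() for _ in range(10)])

asyncio.run(main())
print(call_count)  # 1
```

One caveat worth noting: only the first call’s arguments matter, since
later calls just await the task created by the first.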

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Latency in Asynchronous Python</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/05/24/"/>
    <id>urn:uuid:529e2382-d4ec-47a9-93a8-f450311e5a05</id>
    <updated>2020-05-24T02:44:50Z</updated>
    <category term="python"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p>This week I was debugging a misbehaving Python program that makes
significant use of <a href="https://docs.python.org/3/library/asyncio.html">Python’s asyncio</a>. The program would
eventually take very long periods of time to respond to network
requests. My first suspicion was a CPU-heavy coroutine hogging the
thread, preventing the socket coroutines from running, but an
inspection with <code class="language-plaintext highlighter-rouge">pdb</code> showed this wasn’t the case. Instead, the
program’s author had made a couple of fundamental mistakes using
asyncio. Let’s discuss them using small examples.</p>

<p>Setting the stage: There’s a heartbeat coroutine that “beats” once per
second. A real program would send out a packet as the heartbeat, but
here it just prints how late it was scheduled.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">heartbeat</span><span class="p">():</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
        <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">delay</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span> <span class="o">-</span> <span class="mi">1</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'heartbeat delay = </span><span class="si">{</span><span class="n">delay</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">s'</span><span class="p">)</span>
</code></pre></div></div>

<p>Running this with <code class="language-plaintext highlighter-rouge">asyncio.run(heartbeat())</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heartbeat delay = 0.001s
heartbeat delay = 0.001s
heartbeat delay = 0.001s
</code></pre></div></div>

<p>It’s consistently 1ms late, but good enough, especially considering
what’s to come. A program that <em>only</em> sends a heartbeat is pretty
useless, so a real program will be busy working on other things
concurrently. In this example, we have little 10ms payloads of work to
do, which are represented by this <code class="language-plaintext highlighter-rouge">process()</code> function:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">JOB_DURATION</span> <span class="o">=</span> <span class="mf">0.01</span>  <span class="c1"># 10ms
</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">process</span><span class="p">():</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">JOB_DURATION</span><span class="p">)</span> <span class="c1"># simulate CPU time
</span></code></pre></div></div>

<p>That’s a synchronous sleep because it’s standing in for actual CPU work.
Maybe it’s parsing JSON in a loop or crunching numbers in NumPy. Use
your imagination. During this 10ms no other coroutines can be scheduled
because this is, after all, still <a href="https://rachelbythebay.com/w/2020/03/07/costly/">just a single-threaded program</a>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">JOB_COUNT</span> <span class="o">=</span> <span class="mi">200</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">heartbeat</span><span class="p">())</span>

    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">2.5</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="s">'begin processing'</span><span class="p">)</span>
    <span class="n">count</span> <span class="o">=</span> <span class="n">JOB_COUNT</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">JOB_COUNT</span><span class="p">):</span>
        <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">process</span><span class="p">())</span>

    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>

<p>This program starts the heartbeat coroutine in a task. A coroutine
doesn’t make progress unless someone is waiting on it, and that someone
can be a task. So it will continue along independently without
prodding.</p>

<p>The arbitrary 2.5 second sleep simulates waiting, say, for a network
request. In the output we’ll see the heartbeat tick a couple of times,
then it will create and process 200 jobs concurrently. In a real program
we’d have some way to collect the results, but we can ignore that part
for now. They’re <em>only</em> 10ms, so the effect on the heartbeat should be
pretty small, right?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heartbeat delay = 0.001s
heartbeat delay = 0.001s
begin processing
heartbeat delay = 1.534s
heartbeat delay = 0.001s
heartbeat delay = 0.001s
</code></pre></div></div>

<p>The heartbeat was delayed for 1.5 seconds by a mere 200 tasks doing
only 10ms of work each. What happened?</p>

<p>Python calls the object that schedules tasks a <em>loop</em>, and this is no
coincidence. Everything to be scheduled gets put into a loop and is
scheduled round robin, one after another. The 200 tasks got scheduled
ahead of the heartbeat, and so it doesn’t get scheduled again until each
of those tasks either yields (<code class="language-plaintext highlighter-rouge">await</code>) or completes.</p>
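<p>The round-robin, first-in-first-out behavior is easy to observe in
isolation. In this small sketch (my own example, not from the
misbehaving program), five tasks are created and then the creator
yields with <code class="language-plaintext highlighter-rouge">sleep(0)</code>, landing at the back of the line behind all of
them:</p>

```python
import asyncio

order = []

async def job(i):
    # Completes without ever awaiting, so it finishes in one slot.
    order.append(i)

async def main():
    for i in range(5):
        asyncio.create_task(job(i))
    # sleep(0) reschedules main behind the five tasks just queued.
    await asyncio.sleep(0)

asyncio.run(main())
print(order)  # [0, 1, 2, 3, 4]: tasks run in creation order
```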

<p>It really didn’t take much to significantly hamper the heartbeat, and,
with a <a href="/blog/2019/02/24/">dumb bytecode compiler</a>, 10ms may not be much work at all.
The lesson here is to avoid spawning many tasks if latency is an
important consideration.</p>

<h3 id="a-semaphore-is-not-the-answer">A semaphore is not the answer</h3>

<p>My first idea at a solution: What if we used a semaphore to limit the
number of “active” tasks at a time? Then perhaps the heartbeat wouldn’t
have to compete with so many other tasks for time.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WORKER_COUNT</span> <span class="o">=</span> <span class="mi">4</span>  <span class="c1"># max "active" jobs at a time
</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">main_with_semaphore</span><span class="p">():</span>
    <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">heartbeat</span><span class="p">())</span>

    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">2.5</span><span class="p">)</span>

    <span class="n">sem</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="n">WORKER_COUNT</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">process</span><span class="p">():</span>
        <span class="k">await</span> <span class="n">sem</span><span class="p">.</span><span class="n">acquire</span><span class="p">()</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">JOB_DURATION</span><span class="p">)</span>
        <span class="n">sem</span><span class="p">.</span><span class="n">release</span><span class="p">()</span>

    <span class="k">print</span><span class="p">(</span><span class="s">'begin processing'</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">JOB_COUNT</span><span class="p">):</span>
        <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">process</span><span class="p">())</span>

    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>

<p>When the heartbeat sleep completes, about half the jobs will be complete
and the other half blocked on the semaphore. So perhaps the heartbeat
gets to skip ahead of all the blocked tasks since they’re not yet ready
to run?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heartbeat delay = 0.001s
heartbeat delay = 0.001s
begin processing
heartbeat delay = 1.537s
heartbeat delay = 0.001s
heartbeat delay = 0.001s
</code></pre></div></div>

<p>It made no difference whatsoever because the tasks each “held their
place” in line in the loop! Even reducing <code class="language-plaintext highlighter-rouge">WORKER_COUNT</code> to 1 would have
no effect. As soon as a task completes, it frees the task waiting next
in line. The semaphore does practically nothing here.</p>

<h3 id="solving-it-with-a-job-queue">Solving it with a job queue</h3>

<p>Here’s what does work: a <a href="https://docs.python.org/3/library/asyncio-queue.html">job queue</a>. Create a queue to be populated
with coroutines (not tasks), and have a small number of tasks run jobs
from the queue. Since this is a real solution, I’ve made this example
more complete.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">main_with_queue</span><span class="p">():</span>
    <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">heartbeat</span><span class="p">())</span>

    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">2.5</span><span class="p">)</span>

    <span class="n">queue</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Queue</span><span class="p">(</span><span class="n">maxsize</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">worker</span><span class="p">():</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="n">coro</span> <span class="o">=</span> <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>
            <span class="k">await</span> <span class="n">coro</span>  <span class="c1"># consider using try/except
</span>            <span class="n">queue</span><span class="p">.</span><span class="n">task_done</span><span class="p">()</span>
    <span class="n">workers</span> <span class="o">=</span> <span class="p">[</span><span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">worker</span><span class="p">())</span>
                   <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">WORKER_COUNT</span><span class="p">)]</span>

    <span class="k">print</span><span class="p">(</span><span class="s">'begin processing'</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">JOB_COUNT</span><span class="p">):</span>
        <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">process</span><span class="p">())</span>
    <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">join</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'end processing'</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">workers</span><span class="p">:</span>
        <span class="n">w</span><span class="p">.</span><span class="n">cancel</span><span class="p">()</span>

    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">task_done()</code> and <code class="language-plaintext highlighter-rouge">join()</code> methods make it trivial to
synchronize on full job completion. I also take the time to destroy the
worker tasks. It’s harmless to leave them blocked on the queue. They’ll
be garbage collected, so it’s not a resource leak. However, CPython
complains about garbage collecting running tasks because it looks like a
mistake — and it usually is.</p>

<p>If you read carefully you might have noticed the queue’s maximum size is
set to 1: not much of a “queue”! <a href="https://golang.org/">Go</a> developers will recognize this
as being (nearly) an <em>unbuffered channel</em>, the default and most common
kind of channel. So it’s more a synchronized rendezvous between producer
(<code class="language-plaintext highlighter-rouge">put()</code>) and consumer (<code class="language-plaintext highlighter-rouge">get()</code>). The producer waits at the queue with a
job until a task is free to come take it. A task waits at the queue
until a producer arrives with a job for it.</p>
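<p>The rendezvous is visible in a stripped-down sketch (names are
illustrative): with <code class="language-plaintext highlighter-rouge">maxsize=1</code>, the producer’s <code class="language-plaintext highlighter-rouge">put()</code> blocks until
the consumer has drained the previous item, so the two proceed in near
lock step:</p>

```python
import asyncio

async def demo():
    queue = asyncio.Queue(maxsize=1)
    events = []

    async def producer():
        for i in range(3):
            await queue.put(i)        # blocks while the queue is full
            events.append(f'put {i}')

    async def consumer():
        for _ in range(3):
            i = await queue.get()
            events.append(f'got {i}')
            queue.task_done()

    await asyncio.gather(producer(), consumer())
    return events

print(asyncio.run(demo()))
```

Each <code class="language-plaintext highlighter-rouge">put</code> after the first cannot complete until the matching
<code class="language-plaintext highlighter-rouge">get</code> has made room.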

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heartbeat delay = 0.001s
heartbeat delay = 0.001s
begin processing
heartbeat delay = 0.014s
heartbeat delay = 0.020s
end processing
heartbeat delay = 0.002s
heartbeat delay = 0.001s
</code></pre></div></div>

<p>The output shows that the impact to the heartbeat was modest — about
the best we could hope for from async/await — and the heartbeat
continued while jobs were running. The more concurrency — the more
worker tasks running on the queue — the greater the latency.</p>

<p>Note: Increasing the <code class="language-plaintext highlighter-rouge">WORKER_COUNT</code> in this toy example won’t have an
impact on latency since the jobs aren’t actually concurrent. They start,
run, and complete before another worker task can draw from the queue.
Putting a couple awaits in <code class="language-plaintext highlighter-rouge">process()</code> allows for concurrency:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WORKER_COUNT</span> <span class="o">=</span> <span class="mi">200</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">process</span><span class="p">():</span>
    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.01</span><span class="p">)</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">JOB_DURATION</span><span class="p">)</span>
    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.01</span><span class="p">)</span>
</code></pre></div></div>

<p>Since there are so many worker tasks, this is back to the initial
problem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heartbeat delay = 0.001s
heartbeat delay = 0.001s
begin processing
heartbeat delay = 1.655s
end processing
heartbeat delay = 0.001s
heartbeat delay = 0.001s
</code></pre></div></div>

<p>As <code class="language-plaintext highlighter-rouge">WORKER_COUNT</code> decreases, so does heartbeat latency.</p>

<h3 id="unbounded-queues">Unbounded queues</h3>

<p>Here’s another defect from the same program. Create an unbounded queue,
a producer, and a consumer. The consumer prints the queue size so we can
see what’s happening:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">producer_consumer</span><span class="p">():</span>
    <span class="n">queue</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Queue</span><span class="p">()</span>
    <span class="n">done</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Condition</span><span class="p">()</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">producer</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100_000</span><span class="p">):</span>
            <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">join</span><span class="p">()</span>
        <span class="k">async</span> <span class="k">with</span> <span class="n">done</span><span class="p">:</span>
            <span class="n">done</span><span class="p">.</span><span class="n">notify</span><span class="p">()</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">consumer</span><span class="p">():</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'qsize = </span><span class="si">{</span><span class="n">queue</span><span class="p">.</span><span class="n">qsize</span><span class="p">()</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
            <span class="n">queue</span><span class="p">.</span><span class="n">task_done</span><span class="p">()</span>

    <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">producer</span><span class="p">())</span>
    <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">consumer</span><span class="p">())</span>

    <span class="k">async</span> <span class="k">with</span> <span class="n">done</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">done</span><span class="p">.</span><span class="n">wait</span><span class="p">()</span>
</code></pre></div></div>

<p>The output of this program begins:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qsize = 99999
qsize = 99998
qsize = 99997
qsize = 99996
...
</code></pre></div></div>

<p>So the entire queue is populated before the consumer does anything at
all: tons of latency for whatever is being consumed. Since the queue is
unbounded, the producer never needs to yield. You might be tempted to
use <code class="language-plaintext highlighter-rouge">asyncio.sleep(0)</code> in the producer to yield explicitly:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">async</span> <span class="k">def</span> <span class="nf">producer</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100_000</span><span class="p">):</span>
            <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">put</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>  <span class="c1"># yield
</span>        <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">join</span><span class="p">()</span>
        <span class="k">async</span> <span class="k">with</span> <span class="n">done</span><span class="p">:</span>
            <span class="n">done</span><span class="p">.</span><span class="n">notify</span><span class="p">()</span>
</code></pre></div></div>

<p>This even seems to work! The output looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qsize = 0
qsize = 0
qsize = 0
qsize = 0
</code></pre></div></div>

<p>However, this is fragile and not a real solution. If the consumer yields
just two times in its own loop, it’s nearly back to where we started:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">async</span> <span class="k">def</span> <span class="nf">consumer</span><span class="p">():</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">queue</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'qsize = </span><span class="si">{</span><span class="n">queue</span><span class="p">.</span><span class="n">qsize</span><span class="p">()</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
            <span class="n">queue</span><span class="p">.</span><span class="n">task_done</span><span class="p">()</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>The output shows that the producer gradually creeps ahead of the
consumer. On each consumer iteration, the producer iterates twice:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qsize = 0
qsize = 1
qsize = 2
qsize = 3
...
</code></pre></div></div>

<p>There’s a really simple solution to this: <a href="https://lucumr.pocoo.org/2020/1/1/async-pressure/">Never, ever use unbounded
queues.</a> In fact <strong>every unbounded <code class="language-plaintext highlighter-rouge">asyncio.Queue()</code> is a bug</strong>.
It’s a serious API defect that asyncio allows unbounded queues to be
created at all. The default <code class="language-plaintext highlighter-rouge">maxsize</code> should have been <em>actually</em> zero
(unbuffered), not infinite. Because unbounded is the default, virtually
every example of <code class="language-plaintext highlighter-rouge">asyncio.Queue</code> — online, offline, and even the
official documentation — is broken in some way.</p>
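<p>To make the fix concrete, here is a minimal sketch in plain Python (the item count and the <code class="language-plaintext highlighter-rouge">None</code> sentinel are illustrative, not from the program above): with <code class="language-plaintext highlighter-rouge">maxsize=1</code>, <code class="language-plaintext highlighter-rouge">put()</code> suspends until the consumer catches up, so the producer can never run ahead and latency stays bounded.</p>

```python
import asyncio

# Sketch of the bounded fix: maxsize=1 makes put() apply backpressure,
# so the producer suspends whenever it gets ahead of the consumer.
async def main():
    queue = asyncio.Queue(maxsize=1)  # bounded: put() now awaits
    sizes = []

    async def producer():
        for i in range(100):
            await queue.put(i)  # suspends until the consumer makes room
        await queue.put(None)   # illustrative sentinel: stop the consumer

    async def consumer():
        while True:
            item = await queue.get()
            if item is None:
                return
            sizes.append(queue.qsize())

    await asyncio.gather(producer(), consumer())
    return sizes

sizes = asyncio.run(main())
```

<p>Running this, every observed <code class="language-plaintext highlighter-rouge">qsize</code> stays at most 1, no matter how either task yields.</p>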

<h3 id="important-takeaways">Important takeaways</h3>

<ol>
  <li>The default <code class="language-plaintext highlighter-rouge">asyncio.Queue()</code> is <em>always</em> wrong.</li>
  <li><code class="language-plaintext highlighter-rouge">asyncio.sleep(0)</code> is <em>nearly always</em> used incorrectly.</li>
  <li>Use a <code class="language-plaintext highlighter-rouge">maxsize=1</code> job queue instead of spawning many identical tasks.</li>
</ol>

<p>Python linters should be updated to warn about 1 and 2 by default.</p>

<p>Update: A couple of people have pointed out <a href="https://trio.readthedocs.io/en/stable/reference-core.html#buffering-in-channels">an argument in the Trio
documentation for unbounded queues</a>. This argument conflates two
different concepts: data structure queues and concurrent communication
infrastructure queues. To distinguish, the latter is often called a
channel. An unbounded <em>queue</em> (<code class="language-plaintext highlighter-rouge">collections.deque</code>) is necessary, but
an unbounded <em>channel</em> (<code class="language-plaintext highlighter-rouge">asyncio.Queue</code>) is always wrong. The Trio
documentation describes a web crawler, which is fundamentally a
breadth-first search (read: queue-oriented) of a graph. So this is a
plain old BFS queue, not a channel, which is why it’s reasonable for it
to be unbounded.</p>
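<p>A tiny sketch of that distinction (the graph here is mine, purely for illustration): a breadth-first-search frontier is a plain data-structure queue, private to one task. Nothing communicates through it, so it is fine for it to grow without bound.</p>

```python
from collections import deque

# A BFS frontier is an unbounded *queue*, not a channel: it lives
# inside one task and no concurrent peer is pushing into it.
def bfs(start, neighbors):
    seen = {start}
    frontier = deque([start])  # unbounded, and that's fine here
    order = []
    while frontier:
        node = frontier.popleft()
        order.append(node)
        for n in neighbors(node):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return order
```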

]]>
    </content>
  </entry>
  <entry>
    <title>Endlessh: an SSH Tarpit</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/03/22/"/>
    <id>urn:uuid:5429ee15-3d42-4af2-8690-f7f402870dd0</id>
    <updated>2019-03-22T17:26:45Z</updated>
    <category term="netsec"/><category term="python"/><category term="c"/><category term="posix"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=19465967">on Hacker News</a> (<a href="https://news.ycombinator.com/item?id=24491453">later</a>), <a href="https://old.reddit.com/r/programming/comments/b4iq00/endlessh_an_ssh_tarpit/">on
reddit</a> (<a href="https://old.reddit.com/r/netsec/comments/b4dwjl/endlessh_an_ssh_tarpit/">also</a>), featured in <a href="https://www.youtube.com/watch?v=bM65iyRRW0A&amp;t=3m52s">BSD Now 294</a>.
Also check out <a href="https://github.com/bediger4000/ssh-tarpit-behavior">this Endlessh analysis</a>.</em></p>

<p>I’m a big fan of tarpits: a network service that intentionally inserts
delays in its protocol, slowing down clients by forcing them to wait.
This arrests the speed at which a bad actor can attack or probe the
host system, and it ties up some of the attacker’s resources that
might otherwise be spent attacking another host. When done well, a
tarpit imposes more cost on the attacker than the defender.</p>

<!--more-->

<p>The Internet is a very hostile place, and anyone who’s ever stood up
an Internet-facing IPv4 host has witnessed the immediate and
continuous attacks against their server. I’ve maintained <a href="/blog/2017/06/15/">such a
server</a> for nearly six years now, and more than 99% of my
incoming traffic has ill intent. One part of my defenses has been
tarpits in various forms. The latest addition is an SSH tarpit I wrote
a couple of months ago:</p>

<p><a href="https://github.com/skeeto/endlessh"><strong>Endlessh: an SSH tarpit</strong></a></p>

<p>This program opens a socket and pretends to be an SSH server. However,
it actually just ties up SSH clients with false promises indefinitely
— or at least until the client eventually gives up. After cloning the
repository, here’s how you can try it out for yourself (default port
2222):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make
$ ./endlessh &amp;
$ ssh -p2222 localhost
</code></pre></div></div>

<p>Your SSH client will hang there and wait for at least several days
before finally giving up. Like a mammoth in the La Brea Tar Pits, it
got itself stuck and can’t get itself out. As I write, my
Internet-facing SSH tarpit currently has 27 clients trapped in it. A
few of these have been connected for weeks. In one particular spike it
had 1,378 clients trapped at once, lasting about 20 hours.</p>

<p>My Internet-facing Endlessh server listens on port 22, which is the
standard SSH port. I long ago moved my real SSH server off to another
port where it sees a whole lot less SSH traffic — essentially none.
This makes the logs a whole lot more manageable. And (hopefully)
Endlessh convinces attackers not to look around for an SSH server on
another port.</p>

<p>How does it work? Endlessh exploits <a href="https://tools.ietf.org/html/rfc4253#section-4.2">a little paragraph in RFC
4253</a>, the SSH protocol specification. Immediately after the TCP
connection is established, and before negotiating the cryptography,
both ends send an identification string:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SSH-protoversion-softwareversion SP comments CR LF
</code></pre></div></div>

<p>The RFC also notes:</p>

<blockquote>
  <p>The server MAY send other lines of data before sending the version
string.</p>
</blockquote>

<p>There is no limit on the number of lines, just that these lines must
not begin with “SSH-” since that would be ambiguous with the
identification string, and lines must not be longer than 255
characters including CRLF. So <strong>Endlessh sends an <em>endless</em> stream of
randomly-generated “other lines of data”</strong> without ever intending to
send a version string. By default it waits 10 seconds between each
line. This slows down the protocol, but prevents it from actually
timing out.</p>
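<p>A hypothetical helper (not Endlessh’s actual generator) shows how easy it is to satisfy the RFC 4253 constraints: a short line of hex digits can never spell out “SSH-” and stays well under the 255-byte limit, CRLF included.</p>

```python
import random

# Generate one bogus pre-banner line: must not start with "SSH-" and
# must be at most 255 bytes including the trailing CRLF (RFC 4253).
def bogus_banner_line():
    line = b'%x\r\n' % random.getrandbits(32)  # hex digits can't spell "SSH-"
    assert not line.startswith(b'SSH-') and len(line) <= 255
    return line
```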

<p>This means Endlessh need not know anything about cryptography or the
vast majority of the SSH protocol. It’s dead simple.</p>

<h3 id="implementation-strategies">Implementation strategies</h3>

<p>Ideally the tarpit’s resource footprint should be as small as
possible. It’s just a security tool, and the server does have an
actual purpose that doesn’t include being a tarpit. It should tie up
the attacker’s resources, not the server’s, and should generally be
unnoticeable. (Take note all those who write the awful “security”
products I have to tolerate at my day job.)</p>

<p>Even when many clients have been trapped, Endlessh spends more than
99.999% of its time waiting around, doing nothing. It wouldn’t even be
accurate to call it I/O-bound. If anything, it’s <em>timer-bound</em>,
waiting around before sending off the next line of data. <strong>The most
precious resource to conserve is <em>memory</em>.</strong></p>

<h4 id="processes">Processes</h4>

<p>The most straightforward way to implement something like Endlessh is a
fork server: accept a connection, fork, and the child simply alternates
between <code class="language-plaintext highlighter-rouge">sleep(3)</code> and <code class="language-plaintext highlighter-rouge">write(2)</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">ssize_t</span> <span class="n">r</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">line</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>

    <span class="n">sleep</span><span class="p">(</span><span class="n">DELAY</span><span class="p">);</span>
    <span class="n">generate_line</span><span class="p">(</span><span class="n">line</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">write</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">line</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">line</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">errno</span> <span class="o">!=</span> <span class="n">EINTR</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A process per connection is a lot of overhead when connections are
expected to be up hours or even weeks at a time. An attacker who knows
about this could exhaust the server’s resources with little effort by
opening up lots of connections.</p>

<h4 id="threads">Threads</h4>

<p>A better option is, instead of processes, to create a thread per
connection. On Linux <a href="/blog/2015/05/15/">this is practically the same thing</a>, but it’s
still better. However, you still have to allocate a stack for the thread
and the kernel will have to spend some resources managing the thread.</p>
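<p>A hypothetical Python sketch of the thread-per-connection strategy (the delay and line contents are placeholders, and listener setup is omitted): each accepted client gets a thread that alternates between sleeping and writing one line, mirroring the fork-server loop above.</p>

```python
import random
import socket
import threading
import time

# One thread per trapped client: sleep, write a bogus line, repeat,
# until the client finally hangs up and the write fails.
def trap(conn, delay=10.0):
    try:
        while True:
            time.sleep(delay)
            conn.sendall(b'%x\r\n' % random.getrandbits(32))
    except OSError:
        pass  # client gave up
    finally:
        conn.close()

def serve(listener, delay=10.0):
    while True:
        conn, _addr = listener.accept()
        threading.Thread(target=trap, args=(conn, delay), daemon=True).start()
```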

<h4 id="poll">Poll</h4>

<p>For Endlessh I went for an even more lightweight version: a
single-threaded <code class="language-plaintext highlighter-rouge">poll(2)</code> server, analogous to stackless green threads.
The overhead per connection is about as low as it gets.</p>

<p>Clients that are being delayed are not registered in <code class="language-plaintext highlighter-rouge">poll(2)</code>. Their
only overhead is the socket object in the kernel, and another 78 bytes
to track them in Endlessh. Most of those bytes are used only for
accurate logging. Only those clients that are overdue for a new line
are registered for <code class="language-plaintext highlighter-rouge">poll(2)</code>.</p>

<p>When clients are waiting, but no clients are overdue, <code class="language-plaintext highlighter-rouge">poll(2)</code> is
essentially used in place of <code class="language-plaintext highlighter-rouge">sleep(3)</code>. Though since it still needs
to manage the <em>accept</em> server socket, it (almost) never actually waits
on <em>nothing</em>.</p>
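<p>A small sketch of that timeout computation (the data layout here is hypothetical, not Endlessh’s actual structures): with a next-line deadline per waiting client, the <code class="language-plaintext highlighter-rouge">poll(2)</code> timeout is simply the time until the soonest deadline, so the server sleeps inside <code class="language-plaintext highlighter-rouge">poll(2)</code> rather than <code class="language-plaintext highlighter-rouge">sleep(3)</code>.</p>

```python
# Given each waiting client's next-send deadline (seconds), compute the
# poll(2) timeout in milliseconds: time until the soonest deadline,
# clamped at zero for clients that are already overdue.
def poll_timeout_ms(deadlines, now):
    if not deadlines:
        return -1  # no one waiting: block on the accept socket alone
    return max(0, int((min(deadlines) - now) * 1000))
```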

<p>There’s an option to limit the total number of client connections so
that it doesn’t get out of hand. In this case it will stop polling the
accept socket until a client disconnects. I probably shouldn’t have
bothered with this option and instead relied on <code class="language-plaintext highlighter-rouge">ulimit</code>, a feature
already provided by the operating system.</p>

<p>I could have used epoll (Linux) or kqueue (BSD), which would be much
more efficient than <code class="language-plaintext highlighter-rouge">poll(2)</code>. The problem with <code class="language-plaintext highlighter-rouge">poll(2)</code> is that it’s
constantly registering and unregistering Endlessh on each of the
overdue sockets each time around the main loop. This is by far the
most CPU-intensive part of Endlessh, and it’s all inflicted on the
kernel. Most of the time, even with thousands of clients trapped in
the tarpit, only a small number of them are polled at once, so I opted
for better portability instead.</p>

<p>One consequence of not polling connections that are waiting is that
disconnections aren’t noticed in a timely fashion. This makes the logs
less accurate than I like, but otherwise it’s pretty harmless.
Unfortunately even if I wanted to fix this, the <code class="language-plaintext highlighter-rouge">poll(2)</code> interface
isn’t quite equipped for it anyway.</p>

<h4 id="raw-sockets">Raw sockets</h4>

<p>With a <code class="language-plaintext highlighter-rouge">poll(2)</code> server, the biggest overhead remaining is in the
kernel, where it allocates send and receive buffers for each client
and manages the proper TCP state. The next step to reducing this
overhead is Endlessh opening a <em>raw socket</em> and speaking TCP itself,
bypassing most of the operating system’s TCP/IP stack.</p>

<p>Much of the TCP connection state doesn’t matter to Endlessh and doesn’t
need to be tracked. For example, it doesn’t care about any data sent by
the client, so no receive buffer is needed, and any data that arrives
could be dropped on the floor.</p>

<p>Even more, raw sockets would allow for some even nastier tarpit tricks.
Despite the long delays between data lines, the kernel itself responds
very quickly on the TCP layer and below. ACKs are sent back quickly and
so on. An astute attacker could detect that the delay is artificial,
imposed above the TCP layer by an application.</p>

<p>If Endlessh worked at the TCP layer, it could <a href="https://nyman.re/super-simple-ssh-tarpit/">tarpit the TCP protocol
itself</a>. It could introduce artificial “noise” to the connection
that requires packet retransmissions, delay ACKs, etc. It would look a
lot more like network problems than a tarpit.</p>

<p>I haven’t taken Endlessh this far, nor do I plan to do so. At the
moment attackers either have a hard timeout, so this wouldn’t matter,
or they’re pretty dumb and Endlessh already works well enough.</p>

<h3 id="asyncio-and-other-tarpits">asyncio and other tarpits</h3>

<p>Since writing Endlessh <a href="/blog/2019/03/10/">I’ve learned about Python’s <code class="language-plaintext highlighter-rouge">asyncio</code></a>, and
it’s actually a near perfect fit for this problem. I should have just
used it in the first place. The hard part is already implemented within
<code class="language-plaintext highlighter-rouge">asyncio</code>, and the problem isn’t CPU-bound, so being written in Python
<a href="/blog/2019/02/24/">doesn’t matter</a>.</p>

<p>Here’s a simplified (no logging, no configuration, etc.) version of
Endlessh implemented in about 20 lines of Python 3.7:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">_reader</span><span class="p">,</span> <span class="n">writer</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
            <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'%x</span><span class="se">\r\n</span><span class="s">'</span> <span class="o">%</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">))</span>
            <span class="k">await</span> <span class="n">writer</span><span class="p">.</span><span class="n">drain</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">ConnectionResetError</span><span class="p">:</span>
        <span class="k">pass</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">server</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">start_server</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="s">'0.0.0.0'</span><span class="p">,</span> <span class="mi">2222</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">server</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">server</span><span class="p">.</span><span class="n">serve_forever</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>Since Python coroutines are stackless, the per-connection memory
overhead is comparable to the C version. So it seems asyncio is
perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up
attackers trying to exploit HTTP servers. It slowly sends a random,
endless HTTP header:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">_reader</span><span class="p">,</span> <span class="n">writer</span><span class="p">):</span>
    <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'HTTP/1.1 200 OK</span><span class="se">\r\n</span><span class="s">'</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
            <span class="n">header</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'X-%x: %x</span><span class="se">\r\n</span><span class="s">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">header</span><span class="p">,</span> <span class="n">value</span><span class="p">))</span>
            <span class="k">await</span> <span class="n">writer</span><span class="p">.</span><span class="n">drain</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">ConnectionResetError</span><span class="p">:</span>
        <span class="k">pass</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">server</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">start_server</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="s">'0.0.0.0'</span><span class="p">,</span> <span class="mi">8080</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">server</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">server</span><span class="p">.</span><span class="n">serve_forever</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>Try it out for yourself. Firefox and Chrome will spin on that server
for hours before giving up. I have yet to see curl actually time out on
its own with its default settings (<code class="language-plaintext highlighter-rouge">--max-time</code>/<code class="language-plaintext highlighter-rouge">-m</code> does work
correctly, though).</p>

<p>Parting exercise for the reader: Using the examples above as a starting
point, implement an SMTP tarpit using asyncio. Bonus points for using
TLS connections and testing it against real spammers.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>An Async / Await Library for Emacs Lisp</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/03/10/"/>
    <id>urn:uuid:5d1462fa-a30d-432e-9a4f-827eb67862b2</id>
    <updated>2019-03-10T20:57:03Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="lisp"/><category term="python"/><category term="javascript"/><category term="lang"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p>As part of <a href="/blog/2019/02/24/">building my Python proficiency</a>, I’ve learned how to
use <a href="https://docs.python.org/3/library/asyncio.html">asyncio</a>. This new language feature <a href="https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-492">first appeared in
Python 3.5</a> (<a href="https://www.python.org/dev/peps/pep-0492/">PEP 492</a>, September 2015). JavaScript grew <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/async_function">a
nearly identical feature</a> in ES2017 (June 2017). An async function
can pause to await on an asynchronously computed result, much like a
generator pausing when it yields a value.</p>

<p>In fact, both Python and JavaScript async functions are essentially just
fancy generator functions with some specialized syntax and semantics.
That is, they’re <a href="https://blog.varunramesh.net/posts/stackless-vs-stackful-coroutines/">stackless coroutines</a>. Both languages already had
generators, so their generator-like async functions are a natural
extension that — unlike <a href="/blog/2017/06/21/"><em>stackful</em> coroutines</a> — do not require
significant, new runtime plumbing.</p>
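<p>A toy Python illustration of just how generator-like async functions are (this is mine, not from <code class="language-plaintext highlighter-rouge">aio</code>): a coroutine object can be driven by hand with <code class="language-plaintext highlighter-rouge">send()</code>, exactly the way an event loop drives it, and its final value rides out on <code class="language-plaintext highlighter-rouge">StopIteration</code>.</p>

```python
# Drive a coroutine manually, the way an event loop would.
async def add_later(a, b):
    return a + b

coro = add_later(1, 2)
try:
    coro.send(None)          # start it; with nothing to await, it runs to the end
except StopIteration as stop:
    result = stop.value      # the return value is delivered via the exception
```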

<p>Emacs <a href="/blog/2018/05/31/">officially got generators in 25.1</a> (September 2016),
though, unlike Python and JavaScript, it didn’t require any additional
support from the compiler or runtime. It’s implemented entirely using
Lisp macros. In other words, it’s just another library, not a core
language feature. In theory, the generator library could be easily
backported to the first Emacs release to <a href="/blog/2016/12/22/">properly support lexical
closures</a>, Emacs 24.1 (June 2012).</p>

<p>For the same reason, stackless async/await coroutines can also be
implemented as a library. So that’s what I did, letting Emacs’ generator
library do most of the heavy lifting. The package is called <code class="language-plaintext highlighter-rouge">aio</code>:</p>

<ul>
  <li><strong><a href="https://github.com/skeeto/emacs-aio">https://github.com/skeeto/emacs-aio</a></strong></li>
</ul>

<p>It’s modeled more closely on JavaScript’s async functions than Python’s
asyncio, with the core representation being <em>promises</em> rather than a
coroutine objects. I just have an easier time reasoning about promises
than coroutines.</p>

<p>I’m definitely <a href="https://github.com/chuntaro/emacs-async-await">not the first person to realize this was
possible</a>, and was beaten to the punch by two years. Wanting to
<a href="http://www.winestockwebdesign.com/Essays/Lisp_Curse.html">avoid fragmentation</a>, I set aside all formality in my first
iteration on the idea, not even bothering with namespacing my
identifiers. It was to be only an educational exercise. However, I got
quite attached to my little toy. Once I got my head wrapped around the
problem, everything just sort of clicked into place so nicely.</p>

<p>In this article I will show step-by-step one way to build async/await
on top of generators, laying out one concept at a time and then
building upon each. But first, some examples to illustrate the desired
final result.</p>

<h3 id="aio-example">aio example</h3>

<p>Ignoring <a href="/blog/2016/06/16/">all its problems</a> for a moment, suppose you want to use
<code class="language-plaintext highlighter-rouge">url-retrieve</code> to fetch some content from a URL and return it. To keep
this simple, I’m going to omit error handling. Also assume that
<code class="language-plaintext highlighter-rouge">lexical-binding</code> is <code class="language-plaintext highlighter-rouge">t</code> for all examples. Besides, lexical scope
is required by the generator library, and therefore also by <code class="language-plaintext highlighter-rouge">aio</code>.</p>

<p>The most naive approach is to fetch the content synchronously:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">fetch-fortune-1</span> <span class="p">(</span><span class="nv">url</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">buffer</span> <span class="p">(</span><span class="nv">url-retrieve-synchronously</span> <span class="nv">url</span><span class="p">)))</span>
    <span class="p">(</span><span class="nv">with-current-buffer</span> <span class="nv">buffer</span>
      <span class="p">(</span><span class="nb">prog1</span> <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">kill-buffer</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The result is returned directly, and errors are communicated by an error
signal (i.e. Emacs’ version of exceptions). This is convenient, but the
function will block the main thread, locking up Emacs until the result
has arrived. This is obviously very undesirable, so, in practice,
everyone nearly always uses the asynchronous version:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">fetch-fortune-2</span> <span class="p">(</span><span class="nv">url</span> <span class="nv">callback</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">url-retrieve</span> <span class="nv">url</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">_status</span><span class="p">)</span>
                      <span class="p">(</span><span class="nb">funcall</span> <span class="nv">callback</span> <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The main thread no longer blocks, but it’s a whole lot less
convenient. The result isn’t returned to the caller, and instead the
caller supplies a callback function. The result, whether success or
failure, will be delivered via callback, so the caller must split
itself into two pieces: the part before the callback and the callback
itself. Errors cannot be delivered using an error signal because of the
inverted flow control.</p>

<p>The situation gets worse if, say, you need to fetch results from two
different URLs. You either fetch results one at a time (inefficient),
or you manage two different callbacks that could be invoked in any
order, and therefore have to coordinate.</p>

<p><em>Wouldn’t it be nice for the function to work like the first example,
but be asynchronous like the second example?</em> Enter async/await:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">aio-defun</span> <span class="nv">fetch-fortune-3</span> <span class="p">(</span><span class="nv">url</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">buffer</span> <span class="p">(</span><span class="nv">aio-await</span> <span class="p">(</span><span class="nv">aio-url-retrieve</span> <span class="nv">url</span><span class="p">))))</span>
    <span class="p">(</span><span class="nv">with-current-buffer</span> <span class="nv">buffer</span>
      <span class="p">(</span><span class="nb">prog1</span> <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">kill-buffer</span><span class="p">)))))</span>
</code></pre></div></div>

<p>A function defined with <code class="language-plaintext highlighter-rouge">aio-defun</code> is just like <code class="language-plaintext highlighter-rouge">defun</code> except that
it can use <code class="language-plaintext highlighter-rouge">aio-await</code> to pause and wait on any other function defined
with <code class="language-plaintext highlighter-rouge">aio-defun</code> — or, more specifically, any function that returns a
promise. Borrowing Python parlance: Returning a promise makes a
function <em>awaitable</em>. If there’s an error, it’s delivered as an error
signal from <code class="language-plaintext highlighter-rouge">aio-url-retrieve</code>, just like the first example. When
called, this function returns immediately with a promise object that
represents a future result. The caller might look like this:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defcustom</span> <span class="nv">fortune-url</span> <span class="o">...</span><span class="p">)</span>

<span class="p">(</span><span class="nv">aio-defun</span> <span class="nv">display-fortune</span> <span class="p">()</span>
  <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">message</span> <span class="s">"%s"</span> <span class="p">(</span><span class="nv">aio-await</span> <span class="p">(</span><span class="nv">fetch-fortune-3</span> <span class="nv">fortune-url</span><span class="p">))))</span>
</code></pre></div></div>

<p>How wonderfully clean that looks! And, yes, it even works with
<code class="language-plaintext highlighter-rouge">interactive</code> like that. I can <code class="language-plaintext highlighter-rouge">M-x display-fortune</code> and a fortune is
printed in the minibuffer as soon as the result arrives from the
server. In the meantime Emacs doesn’t block and I can continue my
work.</p>

<p>You can’t do anything you couldn’t already do before. It’s just a
nicer way to organize the same callbacks: <em>implicit</em> rather than
<em>explicit</em>.</p>

<h3 id="promises-simplified">Promises, simplified</h3>

<p>The core object at play is the <em>promise</em>. Promises are already a
rather simple concept, but <code class="language-plaintext highlighter-rouge">aio</code> promises have been distilled to their
essence, as they’re only needed for this singular purpose. More on
this later.</p>

<p>As I said, a promise represents a future result. In practical terms, a
promise is just an object to which one can subscribe with a callback.
When the result is ready, the callbacks are invoked. Another way to
put it is that <em>promises <a href="https://en.wikipedia.org/wiki/Reification_(computer_science)">reify</a> the concept of callbacks</em>. A
callback is no longer just the idea of an extra argument to a function.
It’s a first-class <em>thing</em> that itself can be passed around as a
value.</p>

<p>Promises have two slots: the final promise <em>result</em> and a list of
<em>subscribers</em>. A <code class="language-plaintext highlighter-rouge">nil</code> result means the result hasn’t been computed
yet. It’s so simple I’m not even <a href="/blog/2018/02/14/">bothering with <code class="language-plaintext highlighter-rouge">cl-struct</code></a>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">aio-promise</span> <span class="p">()</span>
  <span class="s">"Create a new promise object."</span>
  <span class="p">(</span><span class="nv">record</span> <span class="ss">'aio-promise</span> <span class="no">nil</span> <span class="p">()))</span>

<span class="p">(</span><span class="nv">defsubst</span> <span class="nv">aio-promise-p</span> <span class="p">(</span><span class="nv">object</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">eq</span> <span class="ss">'aio-promise</span> <span class="p">(</span><span class="nb">type-of</span> <span class="nv">object</span><span class="p">))</span>
       <span class="p">(</span><span class="nb">=</span> <span class="mi">3</span> <span class="p">(</span><span class="nb">length</span> <span class="nv">object</span><span class="p">))))</span>

<span class="p">(</span><span class="nv">defsubst</span> <span class="nv">aio-result</span> <span class="p">(</span><span class="nv">promise</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">aref</span> <span class="nv">promise</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>

<p>To subscribe to a promise, use <code class="language-plaintext highlighter-rouge">aio-listen</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">aio-listen</span> <span class="p">(</span><span class="nv">promise</span> <span class="nv">callback</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">result</span> <span class="p">(</span><span class="nv">aio-result</span> <span class="nv">promise</span><span class="p">)))</span>
    <span class="p">(</span><span class="k">if</span> <span class="nv">result</span>
        <span class="p">(</span><span class="nv">run-at-time</span> <span class="mi">0</span> <span class="no">nil</span> <span class="nv">callback</span> <span class="nv">result</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">push</span> <span class="nv">callback</span> <span class="p">(</span><span class="nb">aref</span> <span class="nv">promise</span> <span class="mi">2</span><span class="p">)))))</span>
</code></pre></div></div>

<p>If the result isn’t ready yet, add the callback to the list of
subscribers. If the result is ready, <em>call the callback in the next
event loop turn</em> using <code class="language-plaintext highlighter-rouge">run-at-time</code>. This is important because it
keeps all the asynchronous components isolated from one another. They
won’t see each others’ frames on the call stack, nor frames from
<code class="language-plaintext highlighter-rouge">aio</code>. This is so important that the <a href="https://promisesaplus.com/">Promises/A+ specification</a>
is explicit about it.</p>

<p>The other half of the equation is resolving a promise, which is done
with <code class="language-plaintext highlighter-rouge">aio-resolve</code>. Unlike other promises, <code class="language-plaintext highlighter-rouge">aio</code> promises don’t care
whether the promise is being <em>fulfilled</em> (success) or <em>rejected</em>
(error). Instead a promise is resolved using a <em>value function</em> — or,
usually, a <em>value closure</em>. Subscribers receive this value function
and extract the value by invoking it with no arguments.</p>

<p>Why? This lets the promise’s resolver decide the semantics of the
result. Rather than returning a value, this function can signal an
error, propagating the error signal that terminated an async function.
Because of this, the promise doesn’t need to know how it’s being
resolved.</p>
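
<p>Concretely, the resolver builds one of two kinds of closure. Both
calls below are purely illustrative (<code class="language-plaintext highlighter-rouge">my-error</code> is a made-up error
symbol):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; Success: subscribers invoking the value function get a value
(aio-resolve promise (lambda () "some result"))

;; Failure: subscribers invoking the value function get an error signal
(aio-resolve promise (lambda () (signal 'my-error nil)))
</code></pre></div></div>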

<p>When a promise is resolved, subscribers are each scheduled in their own
event loop turns in the same order that they subscribed. If a promise
has already been resolved, nothing happens. (Thought: Perhaps this
should be an error in order to catch API misuse?)</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">aio-resolve</span> <span class="p">(</span><span class="nv">promise</span> <span class="nv">value-function</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">unless</span> <span class="p">(</span><span class="nv">aio-result</span> <span class="nv">promise</span><span class="p">)</span>
    <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">callbacks</span> <span class="p">(</span><span class="nb">nreverse</span> <span class="p">(</span><span class="nb">aref</span> <span class="nv">promise</span> <span class="mi">2</span><span class="p">))))</span>
      <span class="p">(</span><span class="nb">setf</span> <span class="p">(</span><span class="nb">aref</span> <span class="nv">promise</span> <span class="mi">1</span><span class="p">)</span> <span class="nv">value-function</span>
            <span class="p">(</span><span class="nb">aref</span> <span class="nv">promise</span> <span class="mi">2</span><span class="p">)</span> <span class="p">())</span>
      <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">callback</span> <span class="nv">callbacks</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">run-at-time</span> <span class="mi">0</span> <span class="no">nil</span> <span class="nv">callback</span> <span class="nv">value-function</span><span class="p">)))))</span>
</code></pre></div></div>

<p>If you’re not an async function, you might subscribe to a promise like
so:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">aio-listen</span> <span class="nv">promise</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">v</span><span class="p">)</span>
                      <span class="p">(</span><span class="nv">message</span> <span class="s">"%s"</span> <span class="p">(</span><span class="nb">funcall</span> <span class="nv">v</span><span class="p">))))</span>
</code></pre></div></div>

<p>The simplest example of a non-async function that creates and delivers
on a promise is a “sleep” function:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">aio-sleep</span> <span class="p">(</span><span class="nv">seconds</span> <span class="k">&amp;optional</span> <span class="nv">result</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">promise</span> <span class="p">(</span><span class="nv">aio-promise</span><span class="p">))</span>
        <span class="p">(</span><span class="nv">value-function</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="nv">result</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">prog1</span> <span class="nv">promise</span>
      <span class="p">(</span><span class="nv">run-at-time</span> <span class="nv">seconds</span> <span class="no">nil</span>
                   <span class="nf">#'</span><span class="nv">aio-resolve</span> <span class="nv">promise</span> <span class="nv">value-function</span><span class="p">))))</span>
</code></pre></div></div>

<p>Similarly, here’s a “timeout” promise that delivers a special timeout
error signal at a given time in the future.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">aio-timeout</span> <span class="p">(</span><span class="nv">seconds</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">promise</span> <span class="p">(</span><span class="nv">aio-promise</span><span class="p">))</span>
        <span class="p">(</span><span class="nv">value-function</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="p">(</span><span class="nb">signal</span> <span class="ss">'aio-timeout</span> <span class="no">nil</span><span class="p">))))</span>
    <span class="p">(</span><span class="nb">prog1</span> <span class="nv">promise</span>
      <span class="p">(</span><span class="nv">run-at-time</span> <span class="nv">seconds</span> <span class="no">nil</span>
                   <span class="nf">#'</span><span class="nv">aio-resolve</span> <span class="nv">promise</span> <span class="nv">value-function</span><span class="p">))))</span>
</code></pre></div></div>

<p>That’s all there is to promises.</p>
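
<p>Putting the pieces together, here’s the complete lifecycle of a
promise as a toy example — create, subscribe, then resolve:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(let ((promise (aio-promise)))
  ;; Subscribe before the result exists...
  (aio-listen promise
              (lambda (value-function)
                (message "result: %S" (funcall value-function))))
  ;; ...then resolve. The callback runs on a later event loop turn.
  (aio-resolve promise (lambda () 42)))
</code></pre></div></div>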

<h3 id="evaluate-in-the-context-of-a-promise">Evaluate in the context of a promise</h3>

<p>Before we get into pausing functions, let’s deal with the slightly
simpler matter of delivering their return values using a promise. What
we need is a way to evaluate a “body” and capture its result in a
promise. If the body exits due to a signal, we want to capture that as
well.</p>

<p>Here’s a macro that does just this:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defmacro</span> <span class="nv">aio-with-promise</span> <span class="p">(</span><span class="nv">promise</span> <span class="k">&amp;rest</span> <span class="nv">body</span><span class="p">)</span>
  <span class="o">`</span><span class="p">(</span><span class="nv">aio-resolve</span> <span class="o">,</span><span class="nv">promise</span>
                <span class="p">(</span><span class="nv">condition-case</span> <span class="nb">error</span>
                    <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">result</span> <span class="p">(</span><span class="k">progn</span> <span class="o">,@</span><span class="nv">body</span><span class="p">)))</span>
                      <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="nv">result</span><span class="p">))</span>
                  <span class="p">(</span><span class="nb">error</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
                           <span class="p">(</span><span class="nb">signal</span> <span class="p">(</span><span class="nb">car</span> <span class="nb">error</span><span class="p">)</span> <span class="c1">; rethrow</span>
                                   <span class="p">(</span><span class="nb">cdr</span> <span class="nb">error</span><span class="p">)))))))</span>
</code></pre></div></div>

<p>The body result is captured in a closure and delivered to the promise.
If there’s an error signal, it’s “<em>rethrown</em>” into subscribers by the
promise’s value function.</p>

<p>This is where Emacs Lisp has a serious weak spot. There’s not really a
concept of rethrowing a signal. Unlike a language with explicit
exception objects that can capture a snapshot of the backtrace, the
original backtrace is completely lost where the signal is caught.
There’s no way to “reattach” it to the signal when it’s rethrown. This
is unfortunate because it would greatly help debugging if you got to see
the full backtrace on the other side of the promise.</p>

<h3 id="async-functions">Async functions</h3>

<p>So we have promises and we want to pause a function on a promise.
Generators have <code class="language-plaintext highlighter-rouge">iter-yield</code> for pausing an iterator’s execution. To
tackle this problem:</p>

<ol>
  <li>Yield the promise to pause the iterator.</li>
  <li>Subscribe a callback on the promise that continues the generator
(<code class="language-plaintext highlighter-rouge">iter-next</code>) with the promise’s result as the yield result.</li>
</ol>

<p>All the hard work is done on either side of the yield, so <code class="language-plaintext highlighter-rouge">aio-await</code> is
just a simple wrapper around <code class="language-plaintext highlighter-rouge">iter-yield</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defmacro</span> <span class="nv">aio-await</span> <span class="p">(</span><span class="nv">expr</span><span class="p">)</span>
  <span class="o">`</span><span class="p">(</span><span class="nb">funcall</span> <span class="p">(</span><span class="nv">iter-yield</span> <span class="o">,</span><span class="nv">expr</span><span class="p">)))</span>
</code></pre></div></div>

<p>Remember, that <code class="language-plaintext highlighter-rouge">funcall</code> is here to extract the promise value from the
value function. If it signals an error, this propagates directly into
the iterator just as if it had been a direct call — minus an accurate
backtrace.</p>

<p>So <code class="language-plaintext highlighter-rouge">aio-lambda</code> / <code class="language-plaintext highlighter-rouge">aio-defun</code> needs to wrap the body in a generator function
(<code class="language-plaintext highlighter-rouge">iter-lambda</code>), invoke it to produce an iterator, then drive that
iterator using callbacks. Here’s a simplified, unhygienic definition of
<code class="language-plaintext highlighter-rouge">aio-lambda</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defmacro</span> <span class="nv">aio-lambda</span> <span class="p">(</span><span class="nv">arglist</span> <span class="k">&amp;rest</span> <span class="nv">body</span><span class="p">)</span>
  <span class="o">`</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="k">&amp;rest</span> <span class="nv">args</span><span class="p">)</span>
     <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">promise</span> <span class="p">(</span><span class="nv">aio-promise</span><span class="p">))</span>
           <span class="p">(</span><span class="nv">iter</span> <span class="p">(</span><span class="nb">apply</span> <span class="p">(</span><span class="nv">iter-lambda</span> <span class="o">,</span><span class="nv">arglist</span>
                          <span class="p">(</span><span class="nv">aio-with-promise</span> <span class="nv">promise</span>
                            <span class="o">,@</span><span class="nv">body</span><span class="p">))</span>
                        <span class="nv">args</span><span class="p">)))</span>
       <span class="p">(</span><span class="nb">prog1</span> <span class="nv">promise</span>
         <span class="p">(</span><span class="nv">aio--step</span> <span class="nv">iter</span> <span class="nv">promise</span> <span class="no">nil</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The body is evaluated inside <code class="language-plaintext highlighter-rouge">aio-with-promise</code> with the result
delivered to the promise returned directly by the async function.</p>

<p>Before returning, the iterator is handed to <code class="language-plaintext highlighter-rouge">aio--step</code>, which drives
the iterator forward until it delivers its first promise. When the
iterator yields a promise, <code class="language-plaintext highlighter-rouge">aio--step</code> attaches a callback back to
itself on the promise as described above. Immediately driving the
iterator up to the first yielded promise “primes” it, which is
important for getting the ball rolling on any asynchronous operations.</p>

<p>If the iterator ever yields something other than a promise, it’s
delivered right back into the iterator.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">aio--step</span> <span class="p">(</span><span class="nv">iter</span> <span class="nv">promise</span> <span class="nv">yield-result</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">condition-case</span> <span class="nv">_</span>
      <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">result</span> <span class="nb">=</span> <span class="p">(</span><span class="nv">iter-next</span> <span class="nv">iter</span> <span class="nv">yield-result</span><span class="p">)</span>
               <span class="nv">then</span> <span class="p">(</span><span class="nv">iter-next</span> <span class="nv">iter</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="nv">result</span><span class="p">))</span>
               <span class="nv">until</span> <span class="p">(</span><span class="nv">aio-promise-p</span> <span class="nv">result</span><span class="p">)</span>
               <span class="nv">finally</span> <span class="p">(</span><span class="nv">aio-listen</span> <span class="nv">result</span>
                                   <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">value</span><span class="p">)</span>
                                     <span class="p">(</span><span class="nv">aio--step</span> <span class="nv">iter</span> <span class="nv">promise</span> <span class="nv">value</span><span class="p">))))</span>
    <span class="p">(</span><span class="nv">iter-end-of-sequence</span><span class="p">)))</span>
</code></pre></div></div>

<p>When the iterator is done, nothing more needs to happen since the
iterator resolves its own return value promise.</p>

<p>The definition of <code class="language-plaintext highlighter-rouge">aio-defun</code> just uses <code class="language-plaintext highlighter-rouge">aio-lambda</code> with <code class="language-plaintext highlighter-rouge">defalias</code>.
There’s nothing to it.</p>
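
<p>That is, stripped of details like docstrings and declarations, the
macro could be sketched as:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; Simplified sketch of aio-defun
(defmacro aio-defun (name arglist &amp;rest body)
  `(defalias ',name (aio-lambda ,arglist ,@body)))
</code></pre></div></div>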

<p>That’s everything you need! Everything else in the package is merely
useful, awaitable functions like <code class="language-plaintext highlighter-rouge">aio-sleep</code> and <code class="language-plaintext highlighter-rouge">aio-timeout</code>.</p>

<h3 id="composing-promises">Composing promises</h3>

<p>Unfortunately <code class="language-plaintext highlighter-rouge">url-retrieve</code> doesn’t support timeouts. We can work
around this by composing two promises: a <code class="language-plaintext highlighter-rouge">url-retrieve</code> promise and
<code class="language-plaintext highlighter-rouge">aio-timeout</code> promise. First define a promise-returning function,
<code class="language-plaintext highlighter-rouge">aio-select</code> that takes a list of promises and returns (as another
promise) the first promise to resolve:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">aio-select</span> <span class="p">(</span><span class="nv">promises</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">result</span> <span class="p">(</span><span class="nv">aio-promise</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">prog1</span> <span class="nv">result</span>
      <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">promise</span> <span class="nv">promises</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">aio-listen</span> <span class="nv">promise</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">_</span><span class="p">)</span>
                              <span class="p">(</span><span class="nv">aio-resolve</span>
                               <span class="nv">result</span>
                               <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="nv">promise</span><span class="p">))))))))</span>
</code></pre></div></div>

<p>We give <code class="language-plaintext highlighter-rouge">aio-select</code> both our <code class="language-plaintext highlighter-rouge">aio-url-retrieve</code> and <code class="language-plaintext highlighter-rouge">aio-timeout</code> promises,
and it tells us which resolved first:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">aio-defun</span> <span class="nv">fetch-fortune-4</span> <span class="p">(</span><span class="nv">url</span> <span class="nv">timeout</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">promises</span> <span class="p">(</span><span class="nb">list</span> <span class="p">(</span><span class="nv">aio-url-retrieve</span> <span class="nv">url</span><span class="p">)</span>
                         <span class="p">(</span><span class="nv">aio-timeout</span> <span class="nv">timeout</span><span class="p">)))</span>
         <span class="p">(</span><span class="nv">fastest</span> <span class="p">(</span><span class="nv">aio-await</span> <span class="p">(</span><span class="nv">aio-select</span> <span class="nv">promises</span><span class="p">)))</span>
         <span class="p">(</span><span class="nv">buffer</span> <span class="p">(</span><span class="nv">aio-await</span> <span class="nv">fastest</span><span class="p">)))</span>
    <span class="p">(</span><span class="nv">with-current-buffer</span> <span class="nv">buffer</span>
      <span class="p">(</span><span class="nb">prog1</span> <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">kill-buffer</span><span class="p">)))))</span>
</code></pre></div></div>

<p>Cool! Note: this will not actually cancel the URL request. It only
lets the async function resume sooner, without ever seeing the
result.</p>

<h3 id="threads">Threads</h3>

<p>Despite <code class="language-plaintext highlighter-rouge">aio</code> being entirely about managing concurrent, asynchronous
operations, it has nothing at all to do with threads — as in Emacs 26’s
support for kernel threads. All async functions and promise callbacks
are expected to run <em>only</em> on the main thread. That’s not to say an
async function can’t await on a result from another thread. It just must
be <a href="/blog/2017/02/14/">done very carefully</a>.</p>

<h3 id="processes">Processes</h3>

<p>The package also includes two functions for realizing promises on
processes, whether they be subprocesses or network sockets.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">aio-process-filter</code></li>
  <li><code class="language-plaintext highlighter-rouge">aio-process-sentinel</code></li>
</ul>

<p>For example, this function loops over each chunk of output (typically
4kB) from the process, as delivered to a filter function:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">aio-defun</span> <span class="nv">process-chunks</span> <span class="p">(</span><span class="nv">process</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">chunk</span> <span class="nb">=</span> <span class="p">(</span><span class="nv">aio-await</span> <span class="p">(</span><span class="nv">aio-process-filter</span> <span class="nv">process</span><span class="p">))</span>
           <span class="nv">while</span> <span class="nv">chunk</span>
           <span class="nb">do</span> <span class="p">(</span><span class="o">...</span> <span class="nv">process</span> <span class="nv">chunk</span> <span class="o">...</span><span class="p">)))</span>
</code></pre></div></div>

<p>Exercise for the reader: Write an awaitable function that returns a line
at a time rather than a chunk at a time. You can build it on top of
<code class="language-plaintext highlighter-rouge">aio-process-filter</code>.</p>

<p>I considered wrapping functions like <code class="language-plaintext highlighter-rouge">start-process</code> so that their <code class="language-plaintext highlighter-rouge">aio</code>
versions would return a promise representing some kind of result from
the process. However there are <em>so</em> many different ways to create and
configure processes that I would have ended up duplicating all the
process functions. Focusing on the filter and sentinel, and letting the
caller create and configure the process is much cleaner.</p>

<p>Unfortunately Emacs has no asynchronous API for writing output to a
process. Both <code class="language-plaintext highlighter-rouge">process-send-string</code> and <code class="language-plaintext highlighter-rouge">process-send-region</code> will block
if the pipe or socket is full. There is no callback, so you cannot await
on writing output. Maybe there’s a way to do it with a dedicated thread?</p>

<p>Another issue is that the <code class="language-plaintext highlighter-rouge">process-send-*</code> functions <a href="/blog/2013/01/14/">are
preemptible</a>, made necessary because they block. The
<code class="language-plaintext highlighter-rouge">aio-process-*</code> functions leave a gap (i.e. between filter awaits)
where no filter or sentinel function is attached. It’s a consequence
of promises being single-fire. The gap is harmless so long as the
async function doesn’t await something else or get preempted. This
needs some more thought.</p>

<p><strong><em>Update</em></strong>: These process functions no longer exist and have been
replaced by a small framework for building chains of promises. See
<code class="language-plaintext highlighter-rouge">aio-make-callback</code>.</p>

<h3 id="testing-aio">Testing aio</h3>

<p>The test suite for <code class="language-plaintext highlighter-rouge">aio</code> is a bit unusual. Emacs’ built-in test suite,
ERT, doesn’t support asynchronous tests. Furthermore, tests are
generally run in batch mode, where Emacs invokes a single function and
then exits, rather than pumping an event loop. Batch mode can only handle
asynchronous process I/O, not the async functions of <code class="language-plaintext highlighter-rouge">aio</code>. So it’s
not possible to run the tests in batch mode.</p>

<p>Instead I hacked together a really crude callback-based test suite. It
runs in non-batch mode and writes the test results into a buffer
(run with <code class="language-plaintext highlighter-rouge">make check</code>). Not ideal, but it works.</p>

<p>One of the tests is a sleep sort (with reasonable tolerances). It’s a
pretty neat demonstration of what you can do with <code class="language-plaintext highlighter-rouge">aio</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">aio-defun</span> <span class="nv">sleep-sort</span> <span class="p">(</span><span class="nb">values</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">promises</span> <span class="p">(</span><span class="nb">mapcar</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">v</span><span class="p">)</span> <span class="p">(</span><span class="nv">aio-sleep</span> <span class="nv">v</span> <span class="nv">v</span><span class="p">))</span> <span class="nb">values</span><span class="p">)))</span>
    <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">while</span> <span class="nv">promises</span>
             <span class="nv">for</span> <span class="nv">next</span> <span class="nb">=</span> <span class="p">(</span><span class="nv">aio-await</span> <span class="p">(</span><span class="nv">aio-select</span> <span class="nv">promises</span><span class="p">))</span>
             <span class="nb">do</span> <span class="p">(</span><span class="nb">setf</span> <span class="nv">promises</span> <span class="p">(</span><span class="nv">delq</span> <span class="nv">next</span> <span class="nv">promises</span><span class="p">))</span>
             <span class="nv">collect</span> <span class="p">(</span><span class="nv">aio-await</span> <span class="nv">next</span><span class="p">))))</span>
</code></pre></div></div>

<p>To see it in action (<code class="language-plaintext highlighter-rouge">M-x sleep-sort-demo</code>):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">aio-defun</span> <span class="nv">sleep-sort-demo</span> <span class="p">()</span>
  <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nb">values</span> <span class="o">'</span><span class="p">(</span><span class="mf">0.1</span> <span class="mf">0.4</span> <span class="mf">1.1</span> <span class="mf">0.2</span> <span class="mf">0.8</span> <span class="mf">0.6</span><span class="p">)))</span>
    <span class="p">(</span><span class="nv">message</span> <span class="s">"%S"</span> <span class="p">(</span><span class="nv">aio-await</span> <span class="p">(</span><span class="nv">sleep-sort</span> <span class="nb">values</span><span class="p">)))))</span>
</code></pre></div></div>

<h3 id="asyncawait-is-pretty-awesome">Async/await is pretty awesome</h3>

<p>I’m quite happy with how this all came together. Once I had the
concepts straight — particularly resolving to value functions —
everything made sense and all the parts fit together well, and mostly
by accident. That feels good.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Python Decorators: Syntactic Artificial Sweetener</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/03/08/"/>
    <id>urn:uuid:588255e6-70a2-4733-bf58-ca9a857930f3</id>
    <updated>2019-03-08T23:00:49Z</updated>
    <category term="python"/><category term="lang"/>
    <content type="html">
      <![CDATA[<p>Python has a feature called <em>function decorators</em>. With a little bit of
syntax, the behavior of a function or class can be modified in useful
ways. Python comes with a few decorators, but most of the useful ones
are found in third-party libraries.</p>

<p><a href="https://www.python.org/dev/peps/pep-0318/">PEP 318</a> suggests a very simple but practical decorator called
<code class="language-plaintext highlighter-rouge">synchronized</code>, though it doesn’t provide a concrete example. Consider
this function that increments a global counter:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">counter</span> <span class="o">=</span> <span class="mi">0</span>

<span class="k">def</span> <span class="nf">increment</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">counter</span>
    <span class="n">counter</span> <span class="o">=</span> <span class="n">counter</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>

<p>If this function is called from multiple threads, there’s a <em>race
condition</em> — though, at least for CPython, it’s <a href="https://blog.regehr.org/archives/490">not a <em>data
race</em></a> thanks to the Global Interpreter Lock (GIL). Incrementing
the counter is not an atomic operation, as illustrated by <a href="/blog/2019/02/24/">its byte
code</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 0 LOAD_GLOBAL              0 (counter)
 3 LOAD_CONST               1 (1)
 6 BINARY_ADD
 7 STORE_GLOBAL             0 (counter)
10 LOAD_CONST               0 (None)
13 RETURN_VALUE
</code></pre></div></div>

<p>The variable is loaded, operated upon, and stored. Another thread
could be scheduled between any of these instructions and cause an
undesired result. It’s easy to see that in practice:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">threading</span> <span class="kn">import</span> <span class="n">Thread</span>

<span class="k">def</span> <span class="nf">worker</span><span class="p">():</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">200000</span><span class="p">):</span>
        <span class="n">increment</span><span class="p">()</span>

<span class="n">threads</span> <span class="o">=</span> <span class="p">[</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">worker</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">8</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">thread</span> <span class="ow">in</span> <span class="n">threads</span><span class="p">:</span>
    <span class="n">thread</span><span class="p">.</span><span class="n">start</span><span class="p">()</span>
<span class="k">for</span> <span class="n">thread</span> <span class="ow">in</span> <span class="n">threads</span><span class="p">:</span>
    <span class="n">thread</span><span class="p">.</span><span class="n">join</span><span class="p">()</span>

<span class="k">print</span><span class="p">(</span><span class="n">counter</span><span class="p">)</span>
</code></pre></div></div>

<p>The increment function is called exactly 1.6 million times, but on my
system I get different results on each run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python3 example.py 
1306205
$ python3 example.py 
1162418
$ python3 example.py 
1076801
</code></pre></div></div>

<p>I could change the definition of <code class="language-plaintext highlighter-rouge">increment()</code> to use synchronization,
but wouldn’t it be nice if I could just tell Python to synchronize this
function? This is where a function decorator shines:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">threading</span> <span class="kn">import</span> <span class="n">Lock</span>

<span class="k">def</span> <span class="nf">synchronized</span><span class="p">(</span><span class="n">f</span><span class="p">):</span>
    <span class="n">lock</span> <span class="o">=</span> <span class="n">Lock</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">wrapper</span><span class="p">():</span>
        <span class="k">with</span> <span class="n">lock</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">f</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">wrapper</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">synchronized</code> function is a higher order function that accepts a
function and returns a function — or, more specifically, a <em>callable</em>.
The purpose is to wrap and <em>decorate</em> the function it’s given. In this
case the function is wrapped in a mutual exclusion lock. Note: This
implementation is very simple and only works for functions that accept
no arguments.</p>

<p>To use it, I just add a single line to <code class="language-plaintext highlighter-rouge">increment</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">synchronized</span>
<span class="k">def</span> <span class="nf">increment</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">counter</span>
    <span class="n">counter</span> <span class="o">=</span> <span class="n">counter</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>

<p>With this change my program now always prints 1600000.</p>

<h3 id="syntactic-sugar">Syntactic “sugar”</h3>

<p>Everyone is quick to point out that this is just syntactic sugar, and
that you can accomplish this without the <code class="language-plaintext highlighter-rouge">@</code> syntax. For example, the
last definition of <code class="language-plaintext highlighter-rouge">increment</code> is equivalent to:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">increment</span><span class="p">():</span>
    <span class="p">...</span>

<span class="n">increment</span> <span class="o">=</span> <span class="n">synchronized</span><span class="p">(</span><span class="n">increment</span><span class="p">)</span>
</code></pre></div></div>

<p>Decorators can also be parameterized. For example, Python’s
<code class="language-plaintext highlighter-rouge">functools</code> module has an <code class="language-plaintext highlighter-rouge">lru_cache</code> decorator for memoizing a
function:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">lru_cache</span><span class="p">(</span><span class="n">maxsize</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">expensive</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="p">...</span>
</code></pre></div></div>

<p>Which is equivalent to this very direct source transformation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">expensive</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="p">...</span>

<span class="n">expensive</span> <span class="o">=</span> <span class="n">lru_cache</span><span class="p">(</span><span class="n">maxsize</span><span class="o">=</span><span class="mi">32</span><span class="p">)(</span><span class="n">expensive</span><span class="p">)</span>
</code></pre></div></div>
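<p>Writing a parameterized decorator yourself takes one more layer of
nesting: the outer function receives the parameters and returns the
actual decorator. For illustration, a hypothetical <code class="language-plaintext highlighter-rouge">repeat</code>
decorator (my own example, not from the article):</p>

```python
from functools import wraps

def repeat(times):
    """Decorator factory: repeat(times) runs first, and its
    return value is the decorator actually applied."""
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            result = None
            for _ in range(times):
                result = f(*args, **kwargs)
            return result
        return wrapper
    return decorator

@repeat(times=3)
def greet(name):
    return "Hello, " + name + "!"
```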

<p>So what comes after the <code class="language-plaintext highlighter-rouge">@</code> isn’t just a name. In fact, it <em>looks</em>
like it can be any kind of expression that evaluates to a function
decorator. Or is it?</p>

<h3 id="syntactic-artificial-sweetener">Syntactic artificial sweetener</h3>

<p>Reality is often disappointing. Let’s try using an “identity” decorator
defined using <code class="language-plaintext highlighter-rouge">lambda</code>. This decorator will accomplish nothing, but it
will test if we can decorate a function using a lambda expression.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="k">lambda</span> <span class="n">f</span><span class="p">:</span> <span class="n">f</span>
<span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">pass</span>
</code></pre></div></div>

<p>But Python complains:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    @lambda f: f
          ^
SyntaxError: invalid syntax
</code></pre></div></div>

<p>Maybe Python is absolutely literal about the syntax sugar thing, and
it’s more like a kind of macro replacement. Let’s try wrapping it in
parentheses:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="p">(</span><span class="k">lambda</span> <span class="n">f</span><span class="p">:</span> <span class="n">f</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="k">pass</span>
</code></pre></div></div>

<p>Nope, same error, but now pointing at the opening parenthesis. Getting
desperate now:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="p">[</span><span class="n">synchronized</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">pass</span>
</code></pre></div></div>

<p>Again, syntax error. What’s going on?</p>

<h3 id="pattern-matching">Pattern matching</h3>

<p>The problem is that Python’s grammar doesn’t parse an arbitrary
expression after <code class="language-plaintext highlighter-rouge">@</code>. It <a href="https://docs.python.org/3/reference/compound_stmts.html#function-definitions">matches a very specific pattern</a> that
just so happens to <em>look</em> like a Python expression. It’s not syntactic
sugar, it’s syntactic artificial sweetener!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>decorator ::= "@" dotted_name ["(" [argument_list [","]] ")"] NEWLINE
</code></pre></div></div>
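<p>Since the pattern does accept a <code class="language-plaintext highlighter-rouge">dotted_name</code>, an arbitrary
expression can still be smuggled in by binding it to a name first — a
trivial workaround:</p>

```python
# The decorator position accepts only a (dotted) name with an optional
# call, so bind the arbitrary expression to a plain name first.
identity = lambda f: f

@identity  # a bare name satisfies the restricted grammar
def foo():
    return 42
```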

<p>In a way, this puts Python in the ranks of PHP 5 and Matlab: two
languages with completely screwed up grammars that can only parse
specific constructions that the developers had anticipated. For
example, in PHP 5 (fixed in PHP 7):</p>

<div class="language-php highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="n">foo</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="k">function</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">};</span>
<span class="p">}</span>

<span class="nf">foo</span><span class="p">()();</span>
</code></pre></div></div>

<p>That is a syntax error:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PHP Parse error:  syntax error, unexpected '(', expecting ',' or ';'
</code></pre></div></div>

<p>Or <a href="/blog/2008/08/29/">in any version of Matlab</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    magic(4)(:)
</code></pre></div></div>

<p>That is a syntax error:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Unbalanced or unexpected parenthesis or bracket
</code></pre></div></div>

<p>In Python’s defense, this strange, limited syntax is only in a single
place rather than everywhere, but I still wonder why it was defined
that way.</p>

<p>Update: Clément Pit-Claudel pointed out the explanation in the PEP,
which references <a href="https://mail.python.org/pipermail/python-dev/2004-August/046711.html">a 2004 email by Guido van Rossum</a>:</p>

<blockquote>
  <p>I have a gut feeling about this one.  I’m not sure where it comes
from, but I have it.  It may be that I want the compiler to be able to
recognize certain decorators.</p>

  <p>So while it would be quite easy to change the syntax to @test in the
future, I’d like to stick with the more restricted form unless a real
use case is presented where allowing @test would increase readability.
(@foo().bar() doesn’t count because I don’t expect you’ll ever need
that).</p>
</blockquote>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>The CPython Bytecode Compiler is Dumb</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/02/24/"/>
    <id>urn:uuid:4348d611-858b-4f48-a6f5-6e4b93f71a34</id>
    <updated>2019-02-24T21:56:35Z</updated>
    <category term="python"/><category term="lua"/><category term="lang"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was <a href="https://news.ycombinator.com/item?id=19241545">discussed on Hacker News</a>.</em></p>

<p>Due to sheer coincidence of several unrelated tasks converging on
Python at work, I recently needed to brush up on my Python skills. So
far for me, Python has been little more than <a href="/blog/2017/05/15/">a fancy extension
language for BeautifulSoup</a>, though I also used it to participate
in the recent tradition of <a href="https://github.com/skeeto/qualbum">writing one’s own static site
generator</a>, in this case for <a href="http://photo.nullprogram.com/">my wife’s photo blog</a>.
I’ve been reading through <em>Fluent Python</em> by Luciano Ramalho, and it’s
been quite effective at getting me up to speed.</p>

<!--more-->

<p>As I write Python, <a href="/blog/2014/01/04/">like with Emacs Lisp</a>, I can’t help but
consider what exactly is happening inside the interpreter. I wonder if
the code I’m writing is putting undue constraints on the bytecode
compiler and limiting its options. Ultimately I’d like the code I
write <a href="/blog/2017/01/30/">to drive the interpreter efficiently and effectively</a>.
<a href="https://www.python.org/dev/peps/pep-0020/">The Zen of Python</a> says there should be “only one obvious way to do
it,” but in practice there’s a lot of room for expression. Given
multiple ways to express the same algorithm or idea, I tend to prefer
the one that compiles to the more efficient bytecode.</p>

<p>Fortunately CPython, the main and most widely used implementation of
Python, is very transparent about its bytecode, which makes it easy to
inspect and reason about. The disassembly listing is readable enough
that I can always follow it without consulting the
documentation. This contrasts sharply with modern JavaScript engines
and their opaque use of JIT compilation, where performance is guided
by obeying certain patterns (<a href="https://www.youtube.com/watch?v=UJPdhx5zTaw">hidden classes</a>, etc.), helping the
compiler <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">understand my program’s types</a>, and being careful
not to unnecessarily constrain the compiler.</p>

<p>So, besides just catching up with Python the language, I’ve been
studying the bytecode disassembly of the functions that I write. One
fact has become quite apparent: <strong>the CPython bytecode compiler is
pretty dumb</strong>. With a few exceptions, it’s a very literal translation
of a Python program, and there is almost <a href="https://legacy.python.org/workshops/1998-11/proceedings/papers/montanaro/montanaro.html">no optimization</a>.
Below I’ll demonstrate a case where it’s possible to detect one of the
missed optimizations without inspecting the bytecode disassembly
thanks to a small abstraction leak in the optimizer.</p>

<p>To be clear: This isn’t to say CPython is bad, or even that it should
necessarily change. In fact, as I’ll show, <strong>dumb bytecode compilers
are par for the course</strong>. In the past I’ve lamented how the Emacs Lisp
compiler could do a better job, but CPython and Lua are operating at
the same level. There are benefits to a dumb and straightforward
bytecode compiler: the compiler itself is simpler, easier to maintain,
and more amenable to modification (e.g. as Python continues to
evolve). It’s also easier to debug Python (<code class="language-plaintext highlighter-rouge">pdb</code>) because it’s such a
close match to the source listing.</p>

<p><em>Update</em>: <a href="https://codewords.recurse.com/issues/seven/dragon-taming-with-tailbiter-a-bytecode-compiler">Darius Bacon points out</a> that Guido van Rossum
himself said, “<a href="https://books.google.com/books?id=bIxWAgAAQBAJ&amp;pg=PA26&amp;lpg=PA26&amp;dq=%22Python+is+about+having+the+simplest,+dumbest+compiler+imaginable.%22&amp;source=bl&amp;ots=2OfDoWX321&amp;sig=ACfU3U32jKZBE3VkJ0gvkKbxRRgD0bnoRg&amp;hl=en&amp;sa=X&amp;ved=2ahUKEwjZ1quO89bgAhWpm-AKHfckAxUQ6AEwAHoECAkQAQ#v=onepage&amp;q=%22Python%20is%20about%20having%20the%20simplest%2C%20dumbest%20compiler%20imaginable.%22&amp;f=false">Python is about having the simplest, dumbest compiler
imaginable.</a>” So this is all very much by design.</p>

<p>The consensus seems to be that if you want or need better performance,
use something other than Python. (And if you can’t do that, at least use
<a href="https://pypy.org/">PyPy</a>.) That’s a fairly reasonable and healthy goal. Still, if
I’m writing Python, I’d like to do the best I can, which means
exploiting the optimizations that <em>are</em> available when possible.</p>

<h3 id="disassembly-examples">Disassembly examples</h3>

<p>I’m going to compare three bytecode compilers in this article: CPython
3.7, Lua 5.3, and Emacs 26.1. Each of these languages is dynamically
typed, is primarily executed on a bytecode virtual machine, and makes
its disassembly listing easy to access. One caveat: CPython and Emacs
use a stack-based virtual machine while Lua uses a register-based
virtual machine.</p>

<p>For CPython I’ll be using the <code class="language-plaintext highlighter-rouge">dis</code> module. For Emacs Lisp I’ll use <code class="language-plaintext highlighter-rouge">M-x
disassemble</code>, and all code will use lexical scoping. In Lua I’ll use
<code class="language-plaintext highlighter-rouge">lua -l</code> on the command line.</p>
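<p>For example, the CPython listings below can be reproduced with a
couple of lines:</p>

```python
import dis

def foo():
    x = 0
    y = 1
    return x

# Prints the bytecode listing to stdout; the exact opcode names
# vary somewhat between CPython versions (the article uses 3.7).
dis.dis(foo)
```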

<h3 id="local-variable-elimination">Local variable elimination</h3>

<p>Will the bytecode compiler eliminate local variables? Keeping the
variable around potentially involves allocating memory for it, assigning
to it, and accessing it. Take this example:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">y</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>

<p>This function is equivalent to:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="mi">0</span>
</code></pre></div></div>

<p>Despite this, CPython completely misses this optimization for both <code class="language-plaintext highlighter-rouge">x</code>
and <code class="language-plaintext highlighter-rouge">y</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               0 (x)
  3           4 LOAD_CONST               2 (1)
              6 STORE_FAST               1 (y)
  4           8 LOAD_FAST                0 (x)
             10 RETURN_VALUE
</code></pre></div></div>

<p>It assigns both variables, and even loads again from <code class="language-plaintext highlighter-rouge">x</code> for the return.
These are missed optimizations, but, as I said, keeping these variables
around makes debugging more straightforward: users can always inspect them.</p>

<p>How about Lua?</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="kd">local</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="kd">local</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">x</span>
<span class="k">end</span>
</code></pre></div></div>

<p>It also misses this optimization, though it matters a little less due to
its architecture (the return instruction references a register
regardless of whether or not that register is allocated to a local
variable):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     LOADK           0 -1    ; 0
        2       [3]     LOADK           1 -2    ; 1
        3       [4]     RETURN          0 2
        4       [5]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp also misses it:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">x</span> <span class="mi">0</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">y</span> <span class="mi">1</span><span class="p">))</span>
    <span class="nv">x</span><span class="p">))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  0
1	constant  1
2	stack-ref 1
3	return
</code></pre></div></div>

<p>All three are on the same page.</p>

<h3 id="constant-folding">Constant folding</h3>

<p>Does the bytecode compiler evaluate simple constant expressions at
compile time? This is simple and everyone does it.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">/</span> <span class="mi">4</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (2.5)
              2 RETURN_VALUE
</code></pre></div></div>

<p>Lua:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">/</span> <span class="mi">4</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     LOADK           0 -1    ; 2.5
        2       [2]     RETURN          0 2
        3       [3]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">+</span> <span class="mi">1</span> <span class="p">(</span><span class="nb">/</span> <span class="p">(</span><span class="nb">*</span> <span class="mi">2</span> <span class="mi">3</span><span class="p">)</span> <span class="mf">4.0</span><span class="p">))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  2.5
1	return
</code></pre></div></div>

<p>That’s something we can count on so long as the operands are all
numeric literals (or also, for Python, string literals) that are
visible to the compiler. Don’t count on your operator overloads to
work here, though.</p>
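<p>The folding is visible without even reading the disassembly: the
computed value lands in the function’s constant pool, which can be
checked directly (a quick test of my own, not from the article):</p>

```python
def foo():
    return 1 + 2 * 3 / 4

# The folded result is stored as a compile-time constant.
assert 2.5 in foo.__code__.co_consts
assert foo() == 2.5
```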

<h3 id="allocation-optimization">Allocation optimization</h3>

<p>Optimizers often perform <em>escape analysis</em>, to determine if objects
allocated in a function ever become visible outside of that function. If
they don’t then these objects could potentially be stack-allocated
(instead of heap-allocated) or even be eliminated entirely.</p>

<p>None of the bytecode compilers are this sophisticated. However, CPython
does have a trick up its sleeve: tuple optimization. Since tuples are
immutable, in certain circumstances CPython will reuse them and avoid
both the constructor and the allocation.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>

<p>Check it out, the tuple is used as a constant:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 ((1, 2, 3))
              2 RETURN_VALUE
</code></pre></div></div>

<p>Which we can detect by evaluating <code class="language-plaintext highlighter-rouge">foo() is foo()</code>, which is <code class="language-plaintext highlighter-rouge">True</code>.
Deviate from this too much, though, and the optimization is disabled.
Remember how CPython can’t optimize away variables, and that they
break constant folding? They break this, too:

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (1)
              2 STORE_FAST               0 (x)
  3           4 LOAD_FAST                0 (x)
              6 LOAD_CONST               2 (2)
              8 LOAD_CONST               3 (3)
             10 BUILD_TUPLE              3
             12 RETURN_VALUE
</code></pre></div></div>

<p>This function might document that it always returns a simple tuple,
but we can tell whether it’s being optimized using <code class="language-plaintext highlighter-rouge">is</code> like before:
<code class="language-plaintext highlighter-rouge">foo() is foo()</code> is now <code class="language-plaintext highlighter-rouge">False</code>! In some future version of Python with
a cleverer bytecode compiler, that expression might evaluate to
<code class="language-plaintext highlighter-rouge">True</code>. (Unless the <a href="https://docs.python.org/3/reference/">Python language specification</a> is specific
about this case, which I didn’t check.)</p>
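<p>Both behaviors are easy to check side by side (verified against the
CPython behavior described above; a future, cleverer compiler could
change the second result):</p>

```python
def constant_tuple():
    return (1, 2, 3)

def built_tuple():
    x = 1
    return (x, 2, 3)

# The literal tuple is stored as a constant and reused on every call,
assert constant_tuple() is constant_tuple()
# while the variable forces BUILD_TUPLE to allocate a fresh object.
assert built_tuple() is not built_tuple()
```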

<p>Note: Curiously PyPy replicates this exact behavior when examined with
<code class="language-plaintext highlighter-rouge">is</code>. Was that deliberate? I’m impressed that PyPy matches CPython’s
semantics so closely here.</p>

<p>Putting a mutable value, such as a list, in the tuple will also break
this optimization. But that’s not the compiler being dumb. That’s a
hard constraint on the compiler: the caller might change the mutable
component of the tuple, so it must always return a fresh copy.</p>

<p>Neither Lua nor Emacs Lisp has a language-level equivalent of an
immutable tuple, so there’s nothing to compare.</p>

<p>Other than the tuples situation in CPython, none of the bytecode
compilers eliminate unnecessary intermediate objects.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">[</span><span class="mi">1024</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (1024)
              2 BUILD_LIST               1
              4 LOAD_CONST               2 (0)
              6 BINARY_SUBSCR
              8 RETURN_VALUE
</code></pre></div></div>
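
<p>A sketch to confirm the intermediate list is genuinely constructed at
run time (opcode names are a CPython detail and vary slightly across
versions):</p>

```python
import dis

def foo():
    return [1024][0]

# BUILD_LIST in the instruction stream shows the throwaway list
# is really created and indexed, not folded away by the compiler.
ops = [ins.opname for ins in dis.get_instructions(foo)]
print(ops)
assert 'BUILD_LIST' in ops
```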

<p>Lua:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">({</span><span class="mi">1024</span><span class="p">})[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     NEWTABLE        0 1 0
        2       [2]     LOADK           1 -1    ; 1024
        3       [2]     SETLIST         0 1 1   ; 1
        4       [2]     GETTABLE        0 0 -2  ; 1
        5       [2]     RETURN          0 2
        6       [3]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">car</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1024</span><span class="p">)))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  1024
1	list1
2	car
3	return
</code></pre></div></div>

<h3 id="dont-expect-too-much">Don’t expect too much</h3>

<p>I could go on with many more examples, looking at loop optimizations
and so on, and in each case the result is almost certainly unoptimized.
The general rule of thumb is simply not to expect much from these
bytecode compilers. They’re very literal in their translation.</p>

<p>Working so much in C has put me in the habit of expecting all obvious
optimizations from the compiler. This frees me to be more expressive
in my code. Lots of things are cost-free thanks to these
optimizations, such as breaking a complex expression up into several
variables, naming my constants, or not using a local variable to
manually cache memory accesses. I’m confident the compiler will
optimize away my expressiveness. The catch is that <a href="/blog/2018/05/01/">clever compilers
can take things too far</a>, so I’ve got to be mindful of how it might
undermine my intentions — i.e. when I’m doing something unusual or not
strictly permitted.</p>

<p>These bytecode compilers will never truly surprise me. The cost is
that being more expressive in Python, Lua, or Emacs Lisp may reduce
performance at run time, because the extra work shows up in the
bytecode. Usually this doesn’t matter, but sometimes it does.</p>
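
<p>One concrete consequence (a CPython-specific sketch): the attribute
lookup inside a loop body is not hoisted for you, so the old trick of
caching a bound method in a local is still a genuine, if small,
optimization:</p>

```python
import dis

def plain(data):
    out = []
    for x in data:
        out.append(x)       # attribute lookup repeated on every iteration
    return out

def cached(data):
    out = []
    append = out.append     # hoisted by hand; the compiler won't do it
    for x in data:
        append(x)
    return out

# The repeated lookup is visible in plain()'s bytecode.
ops = [ins.opname for ins in dis.get_instructions(plain)]
assert 'LOAD_METHOD' in ops or 'LOAD_ATTR' in ops
assert plain([1, 2, 3]) == cached([1, 2, 3]) == [1, 2, 3]
```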

]]>
    </content>
  </entry>
  <entry>
    <title>Web Scraping into an E-book with BeautifulSoup and Pandoc</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/05/15/"/>
    <id>urn:uuid:8e05a4a5-4601-3717-d1ef-c03ea2413025</id>
    <updated>2017-05-15T02:39:20Z</updated>
    <category term="python"/><category term="web"/>
    <content type="html">
      <![CDATA[<p>I recently learned how to use <a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, a Python library for
manipulating HTML and XML parse trees, and it’s been a fantastic
addition to my virtual toolbelt. In the past when I’ve needed to process
raw HTML, I’ve tried nasty hacks with Unix pipes, or <a href="/blog/2013/01/24/">routing the
content through a web browser</a> so that I could manipulate it via
the DOM API. None of that worked very well, but now I finally have
BeautifulSoup to fill that gap. It’s got a selector interface and,
except for rendering, it’s basically as comfortable with HTML as
JavaScript.</p>

<p>Today’s problem was that I wanted to read <a href="http://daviddfriedman.blogspot.com/2017/05/something-different-or-maybe-not.html">a recommended</a> online
book called <a href="https://banter-latte.com/portfolio/interviewing-leather/"><em>Interviewing Leather</em></a>, a story set “in a world where
caped heroes fight dastardly villains on an everyday basis.” I say
“online book” because the 39,403-word story is distributed as a series
of 14 blog posts. I’d rather not read it on the website in a browser,
instead preferring it in e-book form where it’s more comfortable. The
<a href="/blog/2015/09/03/">last time I did this</a>, I manually scraped the entire book into
Markdown, spent a couple of weeks editing it for mistakes, and finally
sent the Markdown to <a href="http://pandoc.org/">Pandoc</a> to convert into an e-book.</p>

<p>For this book, I just want a quick-and-dirty scrape in order to shift
formats. I’ve never read it and I may not even like it (<em>update</em>: I
enjoyed it), so I definitely don’t want to spend much time on the
conversion. Despite <a href="/blog/2017/04/01/">having fun with typing lately</a>, I’d also
prefer to keep all the formatting — italics, etc. — without re-entering
it all manually.</p>

<p>Fortunately Pandoc can consume HTML as input, so, in theory, I can feed
it the original HTML and preserve all of the original markup. The
challenge is that the HTML is spread across 14 pages surrounded by all
the expected blog cruft. I need some way to extract the book content
from each page, concatenate it together along with chapter headings, and
send the result to Pandoc. Enter BeautifulSoup.</p>

<p>First, I need to construct the skeleton HTML document. Rather than code
my own HTML, I’m going to build it with BeautifulSoup. I start by
creating a completely empty document and adding a doctype to it.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span><span class="p">,</span> <span class="n">Doctype</span>

<span class="n">doc</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">()</span>
<span class="n">doc</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">Doctype</span><span class="p">(</span><span class="s">'html'</span><span class="p">))</span>
</code></pre></div></div>

<p>Next I create the <code class="language-plaintext highlighter-rouge">html</code> root element, then add the <code class="language-plaintext highlighter-rouge">head</code> and <code class="language-plaintext highlighter-rouge">body</code>
elements. I also add a <code class="language-plaintext highlighter-rouge">title</code> element. The original content has fancy
Unicode markup — left and right quotation marks, em dash, etc. — so it’s
important to declare the page as UTF-8, since otherwise these characters
are likely to be interpreted incorrectly. It always feels odd declaring
the encoding within the content being encoded, but that’s just the way
things are.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">html</span> <span class="o">=</span> <span class="n">doc</span><span class="p">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">'html'</span><span class="p">,</span> <span class="n">lang</span><span class="o">=</span><span class="s">'en-US'</span><span class="p">)</span>
<span class="n">doc</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">html</span><span class="p">)</span>
<span class="n">head</span> <span class="o">=</span> <span class="n">doc</span><span class="p">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">'head'</span><span class="p">)</span>
<span class="n">html</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">head</span><span class="p">)</span>
<span class="n">meta</span> <span class="o">=</span> <span class="n">doc</span><span class="p">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">'meta'</span><span class="p">,</span> <span class="n">charset</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span>
<span class="n">head</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">meta</span><span class="p">)</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">doc</span><span class="p">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">'title'</span><span class="p">)</span>
<span class="n">title</span><span class="p">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">'Interviewing Leather'</span>
<span class="n">head</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">title</span><span class="p">)</span>
<span class="n">body</span> <span class="o">=</span> <span class="n">doc</span><span class="p">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">'body'</span><span class="p">)</span>
<span class="n">html</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">body</span><span class="p">)</span>
</code></pre></div></div>

<p>If I <code class="language-plaintext highlighter-rouge">print(doc.prettify())</code> then I see the skeleton I want:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">&lt;!DOCTYPE html&gt;</span>
<span class="nt">&lt;html</span> <span class="na">lang=</span><span class="s">"en-US"</span><span class="nt">&gt;</span>
 <span class="nt">&lt;head&gt;</span>
  <span class="nt">&lt;meta</span> <span class="na">charset=</span><span class="s">"utf-8"</span><span class="nt">/&gt;</span>
  <span class="nt">&lt;title&gt;</span>
   Interviewing Leather
  <span class="nt">&lt;/title&gt;</span>
 <span class="nt">&lt;/head&gt;</span>
 <span class="nt">&lt;body&gt;</span>
 <span class="nt">&lt;/body&gt;</span>
<span class="nt">&lt;/html&gt;</span>
</code></pre></div></div>

<p>Next, I assemble a list of the individual blog posts. When I was
actually writing the script, I first downloaded them locally with <a href="/blog/2016/06/16/">my
favorite download tool</a>, curl, and ran the script against local
copies. I didn’t want to hit the web server each time I tested. (Note:
I’ve truncated these URLs to fit in this article.)</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chapters</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"https://banter-latte.com/2007/06/26/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/07/03/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/07/10/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/07/17/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/07/24/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/07/31/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/08/07/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/08/14/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/08/21/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/08/28/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/09/04/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/09/20/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/09/25/..."</span><span class="p">,</span>
    <span class="s">"https://banter-latte.com/2007/10/02/..."</span>
<span class="p">]</span>
</code></pre></div></div>

<p>I visit a few of these pages in my browser to determine which part of
the page I want to extract. I want to look closely enough to see what
I’m doing, but not <em>too</em> closely, so as not to spoil myself! Right clicking
the content in the browser and selecting “Inspect Element” (Firefox) or
“Inspect” (Chrome) pops up a pane to structurally navigate the page.
“View Page Source” would work, too, especially since this is static
content, but I find the developer pane easier to read. Plus it hides
most of the content, revealing only the structure.</p>

<p>The content is contained in a <code class="language-plaintext highlighter-rouge">div</code> with the class <code class="language-plaintext highlighter-rouge">entry-content</code>. I
can use a selector to isolate this element and extract its child <code class="language-plaintext highlighter-rouge">p</code>
elements. However, it’s not quite so simple. Each chapter starts with a
bit of commentary that’s not part of the book and that I don’t want to
include in my extract. It’s separated from the real content by an <code class="language-plaintext highlighter-rouge">hr</code>
element. There’s also a footer below another <code class="language-plaintext highlighter-rouge">hr</code> element, likely put
there by someone who wasn’t paying attention to the page structure. It’s
not quite the shining example of semantic markup, but it’s regular
enough I can manage.</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;body&gt;</span>
  <span class="nt">&lt;main</span> <span class="na">class=</span><span class="s">"site-main"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"entry-body"</span><span class="nt">&gt;</span>
      <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"entry-content"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;p&gt;</span>A little intro.<span class="nt">&lt;/p&gt;</span>
        <span class="nt">&lt;p&gt;</span>Some more intro.<span class="nt">&lt;/p&gt;</span>
        <span class="nt">&lt;hr/&gt;</span>
        <span class="nt">&lt;p&gt;</span>Actual book content.<span class="nt">&lt;/p&gt;</span>
        <span class="nt">&lt;p&gt;</span>More content.<span class="nt">&lt;/p&gt;</span>
        <span class="nt">&lt;hr/&gt;</span>
        <span class="nt">&lt;p&gt;</span>Footer navigation junk.<span class="nt">&lt;/p&gt;</span>
      <span class="nt">&lt;/div&gt;</span>
    <span class="nt">&lt;/div&gt;</span>
  <span class="nt">&lt;/main&gt;</span>
<span class="nt">&lt;/body&gt;</span>
</code></pre></div></div>
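
<p>The hr-counting extraction this structure calls for can be sketched
with stdlib-only stand-ins for the child elements (hypothetical data
mirroring the markup above):</p>

```python
# Hypothetical (name, text) pairs standing in for the children of
# div.entry-content in the structure shown above.
children = [
    ('p', 'A little intro.'),
    ('p', 'Some more intro.'),
    ('hr', None),
    ('p', 'Actual book content.'),
    ('p', 'More content.'),
    ('hr', None),
    ('p', 'Footer navigation junk.'),
]

# Keep only the paragraphs between the first and second hr.
hr_count = 0
kept = []
for name, text in children:
    if name == 'hr':
        hr_count += 1
    elif name == 'p' and hr_count == 1:
        kept.append(text)

print(kept)   # ['Actual book content.', 'More content.']
```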

<p>The next step is visiting each of these pages. I use <code class="language-plaintext highlighter-rouge">enumerate</code> since I
want the chapter numbers when inserting <code class="language-plaintext highlighter-rouge">h1</code> chapter elements. Pandoc
will use these to build the table of contents.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">chapter</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">chapters</span><span class="p">):</span>
    <span class="c1"># Construct h1 for the chapter
</span>    <span class="n">header</span> <span class="o">=</span> <span class="n">doc</span><span class="p">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">'h1'</span><span class="p">)</span>
    <span class="n">header</span><span class="p">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">'Chapter %d'</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,)</span>
    <span class="n">body</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>
</code></pre></div></div>

<p>Next grab the page content using <code class="language-plaintext highlighter-rouge">urllib</code> and parse it with
BeautifulSoup. I’m using a selector to locate the <code class="language-plaintext highlighter-rouge">div</code> with the
book content.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1"># Load chapter content
</span>    <span class="k">with</span> <span class="n">urllib</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">chapter</span><span class="p">)</span> <span class="k">as</span> <span class="n">url</span><span class="p">:</span>
        <span class="n">page</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="n">content</span> <span class="o">=</span> <span class="n">page</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="s">'.entry-content'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p>Finally I iterate over the child elements of the <code class="language-plaintext highlighter-rouge">div.entry-content</code>
element. I keep a running count of <code class="language-plaintext highlighter-rouge">hr</code> elements and only extract
content while I’ve seen exactly one <code class="language-plaintext highlighter-rouge">hr</code> element.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1"># Append content between hr elements
</span>    <span class="n">hr_count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">content</span><span class="p">.</span><span class="n">children</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">child</span><span class="p">.</span><span class="n">name</span> <span class="o">==</span> <span class="s">'hr'</span><span class="p">:</span>
            <span class="n">hr_count</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">elif</span> <span class="n">child</span><span class="p">.</span><span class="n">name</span> <span class="o">==</span> <span class="s">'p'</span> <span class="ow">and</span> <span class="n">hr_count</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">child</span><span class="p">.</span><span class="n">attrs</span> <span class="o">=</span> <span class="p">{}</span>
            <span class="k">if</span> <span class="n">child</span><span class="p">.</span><span class="n">string</span> <span class="o">==</span> <span class="s">'#'</span><span class="p">:</span>
                <span class="n">body</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">'hr'</span><span class="p">))</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">body</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
</code></pre></div></div>

<p>If it’s a <code class="language-plaintext highlighter-rouge">p</code> element, I copy it into the output document, taking a
moment to strip away any attributes present on the <code class="language-plaintext highlighter-rouge">p</code> tag, since, for
some reason, some of these elements have old-fashioned alignment
attributes in the original content.</p>

<p>The original content also uses the text “<code class="language-plaintext highlighter-rouge">#</code>” by itself in a <code class="language-plaintext highlighter-rouge">p</code> to
separate sections rather than using the appropriate markup. Despite
being semantically incorrect, I’m thankful for this since more <code class="language-plaintext highlighter-rouge">hr</code>
elements would have complicated matters further. I convert these to the
correct markup for the final document.</p>

<p>Finally I pretty print the result:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">prettify</span><span class="p">())</span>
</code></pre></div></div>

<p>Alternatively I could pipe it through <a href="http://tidy.sourceforge.net/">tidy</a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python3 extract.py | tidy -indent -utf8 &gt; output.html
</code></pre></div></div>

<p>A brief inspection with a browser indicates that everything seems to
have come out correctly. I won’t know for sure, though, until I actually
read through the whole book. Finally I have Pandoc perform the
conversion.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pandoc -t epub3 -o output.epub output.html
</code></pre></div></div>

<p>And that’s it! It’s ready to read offline in my e-book reader of
choice. The crude version of my script took around 15–20 minutes to
write and test, so I had an e-book conversion in under 30 minutes.
That’s about as long as I was willing to spend to get it. Tidying the
script up for this article took a lot longer.</p>

<p>I don’t have permission to share the resulting e-book, but I can share
my script so that you can generate your own, at least as long as it’s
hosted at the same place with the same structure.</p>

<ul>
  <li><a href="/download/leather/extract.py" class="download">extract.py</a></li>
</ul>

]]>
    </content>
  </entry>
  <entry>
    <title>The Adversarial Implementation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/05/03/"/>
    <id>urn:uuid:e6f370f9-1d35-3295-3bd5-74ae20c52a0e</id>
    <updated>2017-05-03T17:51:53Z</updated>
    <category term="c"/><category term="python"/><category term="lang"/>
    <content type="html">
      <![CDATA[<p>When <a href="/blog/2017/03/30/">coding against a standard</a>, whether it’s a programming
language specification or an open API with multiple vendors, a common
concern is the conformity of a particular construct to the standard.
This cannot be determined simply by experimentation, since a piece of
code may work correctly due only to the specifics of a particular
implementation. It works <em>today</em> with <em>this</em> implementation, but it
may not work <em>tomorrow</em> or with a <em>different</em> implementation.
Sometimes an implementation will warn about the use of non-standard
behavior, but this isn’t always the case.</p>

<p>When I’m reasoning about whether or not something is allowed, I like to
imagine an <em>adversarial implementation</em>. If the standard allows some
freedom, this implementation takes an imaginative or unique approach. It
chooses <a href="/blog/2016/05/30/">non-obvious interpretations</a> with possibly unexpected,
but valid, results. This is nearly the opposite of <a href="https://groups.google.com/forum/m/#!msg/boring-crypto/48qa1kWignU/o8GGp2K1DAAJ">djb’s hypothetical
boringcc</a>, though some of the ideas are similar.</p>

<p>Many argue that <a href="http://yarchive.net/comp/linux/gcc.html">this is already the case</a> with modern C and C++
optimizing compilers. Compiler writers are already <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">creative with the
standard</a> in order to squeeze out more performance, even if it’s
at odds with the programmer’s actual intentions. The most prominent
example in C and C++ is <em>strict aliasing</em>, where the optimizer is
deliberately blinded to certain kinds of aliasing because the standard
allows it to be, eliminating some (possibly important) loads. This
happens despite the compiler’s ability to trivially prove that two
particular objects really do alias.</p>

<p>I want to be clear that I’m not talking about the <a href="http://www.catb.org/jargon/html/N/nasal-demons.html">nasal daemon</a>
kind of creativity. That’s not a helpful thought experiment. What I
mean is this: <strong>Can I imagine a conforming implementation that breaks
any assumptions made by the code?</strong></p>

<p>In practice, compilers typically have to bridge multiple
specifications: the language standard, the <a href="/blog/2016/11/17/">platform ABI</a>, and
the operating system interface (process startup, syscalls, etc.). This
really ties their hands on how creative they can be with any one of the
specifications. Depending on the situation, the imaginary adversarial
implementation isn’t necessarily running on any particular platform.
If our program is expected to have a long life, useful for many years
to come, we should avoid making too many assumptions about future
computers and imagine an adversarial compiler with few limitations.</p>

<h3 id="c-example">C example</h3>

<p>Take this bit of C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">foo</span><span class="p">));</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">printf</code> function is variadic, and it relies entirely on the format
string in order to correctly handle all its arguments. The <code class="language-plaintext highlighter-rouge">%d</code>
specifier means that its matching argument is of type <code class="language-plaintext highlighter-rouge">int</code>. The result
of the <code class="language-plaintext highlighter-rouge">sizeof</code> operator is an integer of type <code class="language-plaintext highlighter-rouge">size_t</code>, which has a
different sign and may even be a different size.</p>

<p>Typically this code will work just fine. An <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">size_t</code> are
generally passed the same way, the actual value probably fits in an
<code class="language-plaintext highlighter-rouge">int</code>, and two’s complement means the signedness isn’t an issue due to
the value being positive. From the <code class="language-plaintext highlighter-rouge">printf</code> point of view, it
typically can’t detect that the type is wrong, so everything works by
chance. In fact, it’s hard to imagine a real situation where this
wouldn’t work fine.</p>

<p>However, this is still undefined behavior — a scenario where a creative
adversarial implementation can break things. (The portable fix is the
<code class="language-plaintext highlighter-rouge">%zu</code> specifier, or an explicit cast to <code class="language-plaintext highlighter-rouge">int</code>.) In this case there are a
few options for an adversarial implementation:</p>

<ol>
  <li>Arguments of type <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">size_t</code> are passed differently, so
<code class="language-plaintext highlighter-rouge">printf</code> will load the argument from the wrong place.</li>
  <li>The implementation doesn’t use two’s complement and even small
positive values have different bit representations.</li>
  <li>The type of <code class="language-plaintext highlighter-rouge">foo</code> is given crazy padding for arbitrary reasons that
makes it so large it doesn’t fit in an <code class="language-plaintext highlighter-rouge">int</code>.</li>
</ol>

<p>What’s interesting about #1 is that <em>this has actually happened</em>. For
example, here’s a C source file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">);</span>

<span class="kt">float</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">int</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">foo</span><span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And in another source file:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">x</span><span class="p">;</span>  <span class="c1">// ignore x</span>
    <span class="k">return</span> <span class="n">y</span> <span class="o">*</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The type of argument <code class="language-plaintext highlighter-rouge">x</code> differs between the prototype and the
definition, which is undefined behavior. However, since this argument
is ignored, this code will still work correctly on many different
real-world computers, particularly where <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">int</code> arguments
are passed the same way (i.e. on the stack).</p>

<p>However, in 2003 the x86-64 CPU arrived with its new System V ABI.
Floating point and integer arguments were now passed differently, and
the types of preceding arguments mattered when deciding which register
to use. Some constructs that worked fine, by chance, prior to 2003 would
soon stop working due to what may have seemed like an adversarial
implementation years before.</p>

<h3 id="python-example">Python example</h3>

<p>Let’s look at some Python. This snippet opens a file a million times
without closing any handles.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000000</span><span class="p">):</span>
    <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"/dev/null"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span>
</code></pre></div></div>

<p>Assuming you have a <code class="language-plaintext highlighter-rouge">/dev/null</code>, this code will work fine without
throwing any exceptions on CPython, the most widely used Python
implementation. CPython uses a deterministic reference counting scheme,
and the handle is automatically closed as soon as its variable falls out
of scope. It’s like having an invisible <code class="language-plaintext highlighter-rouge">f.close()</code> at the end of the
block.</p>
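<p>You can watch this deterministic cleanup happen. A minimal sketch, not from
the original post, that uses <code>weakref</code> to observe the file object being
finalized the instant its last reference is dropped (true on CPython, but not
guaranteed by the language specification):</p>

```python
import os
import tempfile
import weakref

# Create a scratch file so the example is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path)
ref = weakref.ref(f)

f = None  # drop the only strong reference
# On CPython the reference count hits zero immediately, so the file
# object is finalized -- and the handle closed -- right here, with no
# garbage collection pause. On PyPy, ref() could still be live.
print(ref() is None)

os.remove(path)
```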

<p>However, this code is incorrect. The deterministic handle closing is
an implementation behavior, <a href="https://docs.python.org/3/reference/datamodel.html">not part of the specification</a>. The
operating system limits the number of files a process can have open at
once, and there’s a risk that this resource will run out even though
none of those handles are reachable. Imagine an adversarial Python
implementation trying to break this code. It could sufficiently delay
garbage collection, or even <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20100809-00/?p=13203">have infinite memory</a>, omitting
garbage collection altogether.</p>

<p>Like before, such an implementation eventually did come about: PyPy, a
Python implementation written in Python with a JIT compiler. It uses (by
default) something closer to mark-and-sweep, not reference counting, and
those handles <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/NondeterministicGCII">are left open</a> until the next collection.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt;&gt;&gt;&gt; for i in range(1, 1000000):
....     f = open("/dev/null", "r")
.... 
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 2, in &lt;module&gt;
IOError: [Errno 24] Too many open files: '/dev/null'
</code></pre></div></div>
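<p>The portable fix is to close each handle explicitly rather than lean on the
collector. A minimal sketch (not from the original post) using a <code>with</code>
statement, which closes the file deterministically on every implementation:</p>

```python
import os

# Same loop as before, but each handle is closed before the next open,
# so the loop cannot exhaust the process's file descriptor limit.
for i in range(1, 1000000):
    with open(os.devnull, "r") as f:
        pass  # f.close() runs when the block exits, even on PyPy
```

<p>In Python the loop variable and <code>f</code> survive the loop, so afterwards
<code>f.closed</code> is <code>True</code> regardless of when (or whether) the
collector runs.</p>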

<h3 id="a-tool-for-understanding-specifications">A tool for understanding specifications</h3>

<p>This fits right in with a broader method of self-improvement:
Occasionally put yourself in the implementor’s shoes. Think about what
it would take to correctly implement the code that you write, either
as a language or the APIs that you call. On reflection, you may find
that some of those things that <em>seem</em> cheap may not be. Your
assumptions may be reasonable, but not guaranteed. (Though it may be
that “reasonable” is perfectly sufficient for your situation.)</p>

<p>An adversarial implementation is one that challenges an assumption
you’ve taken for granted by turning it on its head.</p>

]]>
    </content>
  </entry>

</feed>
