Articles tagged go at null program

Guidelines for computing sizes and subscripts

2024-05-24T22:25:10Z

Occasionally we need to compute the size of an object that does not yet exist, or a subscript that may fall out of bounds. It’s easy to miss the edge cases where results overflow, creating a nasty, subtle bug, even in the presence of type safety. Ideally such computations happen in specialized code, such as inside an allocator (calloc, reallocarray) and not outside by the allocatee (i.e. malloc). Mitigations exist with different trade-offs: arbitrary precision, or using a wider fixed integer — i.e. 128-bit integers on 64-bit hosts. In the typical case, working only with fixed size-type integers, I’ve come up with a set of guidelines to avoid overflows in the edge cases.

1. Range check before computing a result. No exceptions.
2. Do not cast unless you know a priori the operand is in range.
3. Never mix unsigned and signed operands. Prefer signed. If you need to convert an operand, see (2).
4. Do not add unless you know a priori the result is in range.
5. Do not multiply unless you know a priori the result is in range.
6. Do not subtract unless you know a priori both signed operands are non-negative. For unsigned, that the second operand is not larger than the first (treat it like (4)).
7. Do not divide unless you know a priori the denominator is positive.
8. Make it correct first. Make it fast later, if needed.

These guidelines are also useful when reviewing code, tracking in your mind whether the invariants are held at each step. If not, you’ve likely found a bug. If in doubt, use assertions to document and check invariants. I compiled this list during code review, so for me that’s where it’s most useful.

Range check, then compute

Not strictly necessary when overflow is well-defined, i.e. wraparound, but it’s like defensive driving. It’s simpler and clearer to check with basic arithmetic rather than reason from a wraparound, i.e. a negative result. Checked math functions are fine, too, if you check the overflow boolean before accessing the result.

// bad
len++;
if (len <= 0) error();

// good
if (len == MAX) error();
len++;
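For the checked-math route, here is a minimal sketch using `__builtin_add_overflow`, a GCC and Clang builtin, so it assumes one of those compilers. The helper name `checked_increment` is made up for illustration. The point is that the overflow flag is tested before the result is ever read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: store len+1 in *out and return true, or return
// false on overflow, leaving *out untouched.
bool checked_increment(ptrdiff_t len, ptrdiff_t *out)
{
    assert(out);
    ptrdiff_t r;
    if (__builtin_add_overflow(len, (ptrdiff_t)1, &r)) {
        return false;  // check the flag first, never read r
    }
    *out = r;
    return true;
}
```

Because the caller gets a boolean rather than a wrapped value, there is nothing to reason about after the fact.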

Casting

Casting from signed to unsigned, it’s as simple as knowing the value is non-negative, which is likely if you’re following (1). If a negative size has appeared, there’s already been a bug earlier in the program, and the only reasonable course of action is to abort, not handle it like an error.
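As a sketch of that rule, the assertion documents the invariant and the cast follows it. The name `to_size` is hypothetical, and the code assumes `size_t` can represent every non-negative `ptrdiff_t`, which holds on mainstream hosts.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: convert a signed size to size_t.
size_t to_size(ptrdiff_t len)
{
    assert(len >= 0);    // negative size: a defect upstream, so stop here
    return (size_t)len;  // cast is safe because the operand is in range
}
```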

Addition

To check if addition will overflow, subtract one of the operands from the maximum value.

if (b > MAX - a) error();
r = a + b;
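Wrapped up as a function, the same check might look like the sketch below. The name `add_sizes` is made up, `MAX` becomes `PTRDIFF_MAX`, and the assertions spell out the assumption that both operands are non-negative sizes, which is what keeps `PTRDIFF_MAX - a` itself from overflowing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: *r = a + b, or false if the sum would overflow.
bool add_sizes(ptrdiff_t a, ptrdiff_t b, ptrdiff_t *r)
{
    assert(a >= 0);
    assert(b >= 0);
    if (b > PTRDIFF_MAX - a) {
        return false;  // range check happens before the addition
    }
    *r = a + b;
    return true;
}
```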

In pointer arithmetic addition, it’s a common mistake to compute the result pointer then compare it to the bounds. If the check failed, then the pointer already overflowed, i.e. undefined behavior. Major pieces of software, like glibc, are riddled with such pointer overflows. (Now that you’re aware of it, you’ll start noticing it everywhere. Sorry.)

// bad: never do this
beg += size;
if (beg > end) error();

To do this correctly, check integers not pointers. Like before, subtract before adding.

available = end - beg;
if (size > available) error();
beg += size;
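A small sketch of the correct form as a function, with a signed size so the comparison does not mix signedness. The name `take` and its interface are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: carve size bytes off the front of [*beg, end).
// Returns the old *beg, or NULL if fewer than size bytes remain.
char *take(char **beg, char *end, ptrdiff_t size)
{
    assert(size >= 0);
    assert(*beg <= end);
    ptrdiff_t available = end - *beg;  // integers, not pointers
    if (size > available) {
        return NULL;  // *beg was never moved out of bounds
    }
    char *r = *beg;
    *beg += size;
    return r;
}
```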

Mind mixing signed and unsigned operands for the comparison operator (3), e.g. an unsigned size on the left and signed difference on the right.

Multiplication and division

If you’re working this out on your own, multiplication seems tricky until you’ve internalized a simple pattern. Just as we subtracted before adding, we need to divide before multiplying. Divide the maximum value by one of the operands:

if (a>0 && b>MAX/a) error();
r = a * b;

It’s often permitted for one or both to be zero, so mind divide-by-zero, which is handled above by the first condition. Sometimes size must be positive, e.g. the result of the sizeof operator in C, in which case we should prefer it as the denominator.

assert(size  >  0);
assert(count >= 0);
if (count > MAX/size) error();
total = count * size;
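The same check as a self-contained function, as a sketch. `total_size` is a hypothetical name, and `PTRDIFF_MAX` stands in for `MAX`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: *total = count * size, or false on overflow.
bool total_size(ptrdiff_t count, ptrdiff_t size, ptrdiff_t *total)
{
    assert(size  >  0);  // positive, so it is safe as the denominator
    assert(count >= 0);
    if (count > PTRDIFF_MAX/size) {
        return false;  // divide before multiplying
    }
    *total = count * size;
    return true;
}
```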

With arena allocation there are usually two concerns. First, will it overflow when computing the total size, i.e. count * size? Second, is the total size within the arena capacity. Naively that’s two checks, but we can kill two birds with one stone: Check both at once by using the current arena capacity as the maximum value when considering overflow.

if (count > (end - beg)/size) error();
total = count * size;

One condition pulling double duty.

Subtraction

With signed sizes, the negative range is a long “runway” allowing a single unchecked subtraction before overflow might occur. In essence, we were exploiting this in order to check addition. The most common mistake with unsigned subtraction is not accounting for overflow when going below zero.

// note: signed "i" only
for (i = end - stride; i >= beg; i -= stride) ...

This loop will go awry if i is unsigned and beg <= stride.
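To see the signed version behave, here is a sketch that just counts iterations of that loop. The function name is made up, and the assertions state the assumptions that make both subtractions safe: non-negative bounds and a positive stride.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: how many times the backward loop body runs.
ptrdiff_t count_backward(ptrdiff_t beg, ptrdiff_t end, ptrdiff_t stride)
{
    assert(beg >= 0);
    assert(end >= 0);
    assert(stride > 0);
    ptrdiff_t visits = 0;
    // i may dip below zero once, using the negative "runway", and the
    // loop condition then stops the loop before another subtraction.
    for (ptrdiff_t i = end - stride; i >= beg; i -= stride) {
        visits++;
    }
    return visits;
}
```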

In special cases we can get away with a second subtraction without an overflow check if we know some properties of our operands. For example, my arena allocators look like this:

padding = -beg & (align - 1);
if (count >= (end - beg - padding)/size) error();

That’s two subtractions in a row. However, end - beg describes the size of a realized object, and align is a small constant (e.g. 2^(0–6)). It could only overflow if the entirety of memory was occupied by the arena.

Bonus, advanced note: This check is actually pulling triple duty. Notice that I used >= instead of >. The arena can’t fill exactly to the brim, but it handles the extreme edge case where count is zero, the arena is nearly full, but the bump pointer is unaligned. The result of subtracting padding is negative, which rounds to zero by integer division, and would pass a > check. That wouldn’t be a problem except that aligning the bump pointer would break the invariant beg <= end.
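Putting the pieces together, a bump allocator built around that check might look like the following sketch. The `Arena` struct layout, the function name, and the explicit casts around the padding computation are my assumptions here, filled in so the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;  // bump pointer
    char *end;  // one past the last byte
} Arena;

// Sketch: returns room for count objects of the given size and
// alignment, or NULL if the request does not fit. align must be a
// power of two.
void *arena_alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    assert(count >= 0);
    assert(size  >  0);
    assert(align >  0 && (align & (align - 1)) == 0);
    assert(a->beg <= a->end);
    ptrdiff_t padding = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (count >= (a->end - a->beg - padding)/size) {
        return NULL;  // overflow, out of memory, or padding does not fit
    }
    char *r = a->beg + padding;
    a->beg = r + count*size;  // cannot overflow: checked above
    return r;
}
```

A zero-count request against a nearly full, unaligned arena is rejected rather than pushing `beg` past `end`, which is the third duty described above.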

Try it for yourself

Next time you’re reviewing code that computes sizes or subscripts, bring the list up and see how well it follows the guidelines. If it misses one, try to contrive an input that causes an overflow. If it follows guidelines and you can still contrive such an input, then perhaps the list could use another item!

Solving "Two Sum" in C with a tiny hash table

2023-06-26T19:38:18Z

I came across a question: How does one efficiently solve Two Sum in C? There’s a naive quadratic time solution, but also an amortized linear time solution using a hash table. Without a built-in or standard library hash table, the latter sounds onerous. However, a mask-step-index table, a hash table construction suitable for many problems, requires only a few lines of code. This approach is useful even when a standard hash table is available, because by exploiting the known problem constraints, it beats typical generic hash table performance by an order of magnitude (demo).

The Two Sum exercise, restated:

Given an integer array and target, return the distinct indices of two elements that sum to the target.

In particular, the solution doesn’t find elements, but their indices. The exercise also constrains input ranges — important but easy to overlook:

2 <= count <= 10⁴
-10⁹ <= nums[i] <= 10⁹
-10⁹ <= target <= 10⁹

Notably, indices fit in a 16-bit integer with lots of room to spare. In fact, it will fit in a 14-bit address space (16,384) with still plenty of overhead. Elements fit in a signed 32-bit integer, and we can add and subtract elements without overflow, if just barely. The last constraint isn’t redundant, but it’s not readily exploitable either.

The naive solution is to linearly search the array for the complement. With nested loops, it’s obviously quadratic time. At 10k elements, we expect an abysmal 25M comparisons on average.

int16_t count = ...;
int32_t *nums = ...;

for (int16_t i = 0; i < count-1; i++) {
    for (int16_t j = i+1; j < count; j++) {
        if (nums[i]+nums[j] == target) {
            // found
        }
    }
}

The nums array is “keyed” by index. It would be better to also have the inverse mapping: key on elements to obtain the nums index. Then for each element we could compute the complement and find its index, if any, using this second mapping.

The input range is finite, so an inverse map is simple. Allocate an array, one element per integer in range, and store the index there. However, the input range is 2 billion, and even with 16-bit indices that’s a 4GB array. Feasible on 64-bit hosts, but wasteful. The exercise is certainly designed to make it so. This array would be very sparse, at most less than half a percent of its elements populated. That’s a hint: Associative arrays are far more appropriate for representing such sparse mappings. That is, a hash table.

Using Go’s built-in hash table:

func TwoSumWithMap(nums []int32, target int32) (int, int, bool) {
    seen := make(map[int32]int16)
    for i, num := range nums {
        complement := target - num
        if j, ok := seen[complement]; ok {
            return int(j), i, true
        }
        seen[num] = int16(i)
    }
    return 0, 0, false
}

In essence, the hash table folds the sparse 2 billion element array onto a smaller array, with collision resolution when elements inevitably land in the same slot. For this exercise, that small array could be as small as 10,000 elements because that’s the most we’d ever need to track. For folding the large key space onto the smaller, we could use modulo. For collision resolution, we could keep walking the table.

int16_t seen[10000] = {0};

// Find or insert nums[index].
int16_t lookup(int32_t *nums, int16_t index)
{
    int i = (uint32_t)nums[index] % 10000;  // cast first: % on a negative int can be negative
    for (;;) {
        int16_t j = seen[i] - 1;  // unbias
        if (j < 0) {  // empty slot
            seen[i] = index + 1;  // insert biased index
            return -1;
        } else if (nums[j] == nums[index]) {
            return j;  // match found
        }
        i = (i + 1) % 10000;  // keep looking
    }
}

Take note of a few details:

1. An empty slot is zero, and an empty table is a zero-initialized array. Since zero is a valid value, and all values are non-negative, it biases values by 1 in the table.
2. The nums array is part of the table structure, necessary for lookups. The two mappings, element-by-index and index-by-element, share structure.
3. It uses open addressing with linear probing, and so walks the table until it either finds the element or hits an empty slot.
4. The “hash” function is modulo. If inputs are not random, they’ll tend to bunch up in the table. Combined with linear probing, that makes for lots of collisions. For the worst case, imagine sequentially ordered inputs.
5. Sometimes the table will almost completely fill, and lookups will be no better than the linear scans of the naive solution.
6. Most subtle of all: This hash table is not enough for the exercise. The keyed-on element may not even be in nums, and when lookup fails, that element is not inserted in the table. Instead, a different element is inserted. The conventional solution has at least two hash table lookups. In the Go code, it’s seen[complement] for lookups and seen[num] for inserts.

To solve (4) we’ll use a hash function to more uniformly distribute elements in the table. We’ll also probe the table in a random-ish order that depends on the key. In practice there will be little bunching even for non-random inputs.

To solve (5) we’ll use a larger table: 2¹⁴ or 16,384 elements. This has breathing room, and with a power of two we can use a fast mask instead of a slow division (though in practice, compilers usually implement division by a constant denominator with modular multiplication).

To solve (6) we’ll key complements together under the same key. It looks for the complement, but on failure it inserts the current element in the empty slot. In other words, this solution will only need a single hash table lookup per element!

Laying down some groundwork:

typedef struct {
    int16_t i, j;
    _Bool ok;
} TwoSum;

TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
    TwoSum r = {0};
    int16_t seen[1<<14] = {0};
    for (int16_t n = 0; n < count; n++) {
        // ...
    }
    return r;
}

The seen array is a 32KiB hash table large enough for all inputs, small enough that it can be a local variable. In the loop:

        int32_t complement = target - nums[n];
        int32_t key = complement>nums[n] ? complement : nums[n];
        uint32_t hash = key * 489183053u;
        unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
        unsigned step = hash>>13 | 1;

Compute the complement, then apply a “max” operation to derive a key. Any commutative operation works, though obviously addition would be a poor choice. XOR is similar enough to cause many collisions. Multiplication works well, and is probably better if the ternary produces a branch.
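To make the shared key concrete, here is the “max” derivation pulled out into a tiny function, as a sketch. The name `pair_key` is hypothetical, and the assertions encode the exercise constraints that keep the subtraction from overflowing.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: the key shared by num and its complement.
int32_t pair_key(int32_t num, int32_t target)
{
    // Exercise constraints, which rule out overflow in the subtraction.
    assert(num    >= -1000000000 && num    <= 1000000000);
    assert(target >= -1000000000 && target <= 1000000000);
    int32_t complement = target - num;
    return complement > num ? complement : num;
}
```

With a target of 9, both 2 and 7 map to the key 7, so the two halves of a pair probe the same sequence of slots.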

The hash function is multiplication with a randomly-chosen prime. As we’ll see in a moment, step will also add-shift the hash before use. The initial index will be the bottom 14 bits of this hash. For step, recall from the MSI article that it must be odd so that every slot is eventually probed. I shift out 13 bits and then override the 14th bit, so step effectively skips over the 14 bits used for the initial table index.
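If the odd-step requirement seems arbitrary, this sketch checks it by brute force on a smaller 64-slot table. The function name and table size are made up for the demonstration, and the probing line is the same mask-and-step update used in the solution.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical check: does probing from start with this step touch all
// 64 slots before repeating one?
bool probes_every_slot(unsigned start, unsigned step)
{
    enum { EXP = 6, LEN = 1 << EXP };
    bool seen[LEN] = {0};
    unsigned mask = LEN - 1;
    unsigned i = start;
    for (int n = 0; n < LEN; n++) {
        i = (i + step) & mask;
        if (seen[i]) {
            return false;  // revisited a slot before covering the table
        }
        seen[i] = true;
    }
    return true;
}
```

An odd step shares no factor with a power-of-two table length, so the walk covers every slot. An even step cycles through only a fraction of them.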

I used unsigned because I don’t really care about the width of the hash table index, but more importantly, I want defined overflow from all the bit twiddling, even in the face of implicit promotion. As a bonus, it can help in reasoning about indirection: seen indices are unsigned, nums indices are int16_t.

        for (unsigned i = hash;;) {
            i = (i + step) & mask;
            int16_t j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = n + 1;  // bias and insert
                break;
            } else if (nums[j] == complement) {
                r.i = j;
                r.j = n;
                r.ok = 1;
                return r;
            }
        }

The step is added before using the index the first time, helping to scatter the start point and reduce collisions. If it’s an empty slot, insert the current element, not the complement — which wouldn’t be possible anyway. Unlike conventional solutions, this doesn’t require another hash and lookup. If it finds the complement, problem solved, otherwise keep going.

Putting it all together, it’s only slightly longer than solutions using a generic hash table:

TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
    TwoSum r = {0};
    int16_t seen[1<<14] = {0};
    for (int16_t n = 0; n < count; n++) {
        int32_t complement = target - nums[n];
        int32_t key = complement>nums[n] ? complement : nums[n];
        uint32_t hash = key * 489183053u;
        unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
        unsigned step = hash>>13 | 1;
        for (unsigned i = hash;;) {
            i = (i + step) & mask;
            int16_t j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = n + 1;  // bias and insert
                break;
            } else if (nums[j] == complement) {
                r.i = j;
                r.j = n;
                r.ok = 1;
                return r;
            }
        }
    }
    return r;
}

Applying this technique to Go:

func TwoSumWithBespoke(nums []int32, target int32) (int, int, bool) {
    var seen [1 << 14]int16
    for n, num := range nums {
        complement := target - num
        hash := int(num * complement * 489183053)
        mask := len(seen) - 1
        step := hash>>13 | 1
        for i := hash; ; {
            i = (i + step) & mask
            j := int(seen[i] - 1) // unbias
            if j < 0 {
                seen[i] = int16(n) + 1 // bias
                break
            } else if nums[j] == complement {
                return j, n, true
            }
        }
    }
    return 0, 0, false
}

With Go 1.20 this is an order of magnitude faster than map[int32]int16, which isn’t surprising. I used multiplication as the key operator because, in my first take, Go produced a branch for the “max” operation — at a 25% performance penalty on random inputs.

A full-featured, generic hash table may be overkill for your problem, and a bit of hashed indexing with collision resolution over a small array might be sufficient. The problem constraints might open up such shortcuts.

Assertions should be more debugger-oriented

2022-06-26T18:51:04Z

Prompted by a 20 minute video, over the past month I’ve improved my debugger skills. I’d shamefully acquired a bad habit: avoiding a debugger until exhausting dumber, insufficient methods. My first choice should be a debugger, but I had allowed a bit of friction to dissuade me. With some thoughtful practice and deliberate effort clearing the path, my bad habit is finally broken — at least when a good debugger is available. It feels like I’ve leveled up and, like touch typing, this was a skill I’d neglected far too long. One friction point was the less-than-optimal assert feature in basically every programming language implementation. It ought to work better with debuggers.

An assertion verifies a program invariant, and so if one fails then there’s undoubtedly a defect in the program. In other words, assertions make programs more sensitive to defects, allowing problems to be caught more quickly and accurately. Counter-intuitively, crashing early and often makes for more robust and reliable software in the long run. For exactly this reason, assertions go especially well with fuzzing.

assert(i >= 0 && i < len);   // bounds check
assert((ssize_t)size >= 0);  // suspicious size_t
assert(cur->next != cur);    // circular reference?

They’re sometimes abused for error handling, which is a reason they’ve also been (wrongfully) discouraged at times. For example, failing to open a file is an error, not a defect, so an assertion is inappropriate.
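A sketch of the distinction, using a hypothetical `readable` helper: a null path is a defect in the caller, so it gets an assertion, while a file that cannot be opened is an ordinary run-time error, so it gets a return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

// Hypothetical helper: reports whether path can be opened for reading.
bool readable(const char *path)
{
    assert(path);  // defect: the caller broke the contract
    FILE *f = fopen(path, "rb");
    if (!f) {
        return false;  // error: expected to happen, so handle it
    }
    fclose(f);
    return true;
}
```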

Normal programs have implicit assertions all over, even if we don’t usually think of them as assertions. In some cases they’re checked by the hardware. Examples of implicit assertion failures:

Out-of-bounds indexing
Dereferencing null/nil/None
Dividing by zero
Certain kinds of integer overflow (e.g. -ftrapv)

Programs are generally not intended to recover from these situations because, had they been anticipated, the invalid operation wouldn’t have been attempted in the first place. The program simply crashes because there’s no better alternative. Sanitizers, including Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan), are in essence additional, implicit assertions, checking invariants that aren’t normally checked.

Ideally a failing assertion should have these two effects:

Execution should immediately stop. The program is in an unknown state, so it’s neither safe to “clean up” nor attempt to recover. Additional execution will only make debugging more difficult, and may obscure the defect.
When run under a debugger — or visited as a core dump — it should break exactly at the failed assertion, ready for inspection. I should not need to dig around the call stack to figure out where the failure occurred. I certainly shouldn’t need to manually set a breakpoint and restart the program hoping to fail the assertion a second time. The whole reason for using a debugger is to save time, so if it’s wasting my time then it’s failing at its primary job.

I examined standard assert features across various language implementations, and none strictly meet the criteria. Fortunately, in some cases, it’s trivial to build a better assertion, and you can substitute your own definition. First, let’s discuss the way assertions disappoint.

A test assertion

My test for C and C++ is minimal but establishes some state and gives me a variable to inspect:

#include <assert.h>

int main(void)
{
    for (int i = 0; i < 10; i++) {
        assert(i < 5);
    }
}

Then I compile and debug in the most straightforward way:

$ cc -g -o test test.c
$ gdb test
(gdb) r
(gdb) bt

The r in GDB stands for run, which immediately breaks because of the assert. The bt prints a backtrace. On a typical Linux distribution that shows this backtrace:

#0  __GI_raise
#1  __GI_abort
#2  __assert_fail_base
#3  __GI___assert_fail
#4  main

Well, actually, it’s much messier than this, but I manually cleaned it up:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linu
x/raise.c:50
#1  0x00007ffff7df4537 in __GI_abort () at abort.c:79
#2  0x00007ffff7df440f in __assert_fail_base (fmt=0x7ffff7f5d
128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x
55555555600b "i < 5", file=0x555555556004 "test.c", line=6, f
unction=) at assert.c:92
#3  0x00007ffff7e03662 in __GI___assert_fail (assertion=0x555
55555600b "i < 5", file=0x555555556004 "test.c", line=6, func
tion=0x555555556011 <__PRETTY_FUNCTION__.0> "main") at assert
.c:101
#4  0x0000555555555178 in main () at test.c:6

That’s a lot to take in at a glance, and about 95% of it is noise that will never contain useful information. Most notably, GDB didn’t stop at the failing assertion. Instead there are four stack frames of libc junk I have to navigate before I can even begin debugging.

(gdb) up
(gdb) up
(gdb) up
(gdb) up

I must wade through this for every assertion failure. This is some of the friction that made me avoid the debugger in the first place. glibc loves indirection, so maybe the other libc implementations do better? How about musl?

#0  setjmp
#1  raise
#2  ??
#3  ??
#4  ??
#5  ??
#6  ??
#7  ??
#8  ??
#9  ??
#10 ??
#11 ??

Oops, without musl debugging symbols I can’t debug assertions at all because GDB can’t read the stack, so it’s lost. If you’re on Alpine you can install musl-dbg, but otherwise you’ll probably need to build your own from source. With debugging symbols, musl is no better than glibc:

#0  __restore_sigs
#1  raise
#2  abort
#3  __assert_fail
#4  main

Same with FreeBSD:

#0  thr_kill
#1  raise
#2  abort
#3  __assert
#4  main

OpenBSD has one fewer frame:

#0  thrkill
#1  _libc_abort
#2  _libc___assert2
#3  main

How about on Windows with Mingw-w64?

[Inferior 1 (process 7864) exited with code 03]

Oops, on Windows GDB doesn’t break at all on assert. You must first set a breakpoint on abort:

(gdb) b abort

Besides that, it’s the most straightforward so far:

#0 msvcrt!abort
#1 msvcrt!_assert
#2 main

With MSVC (default CRT) I get something slightly different:

#0 abort
#1 common_assert_to_stderr
#2 _wassert
#3 main
#4 __scrt_common_main_seh

RemedyBG leaves me at the abort like GDB does elsewhere. Visual Studio recognizes that I don’t care about its stack frames and instead puts the focus on the assertion, ready for debugging. The other stack frames are there, but basically invisible. It’s the only case that practically meets all my criteria!

I can’t entirely blame these implementations. The C standard requires that assert print a diagnostic and call abort, and that abort raises SIGABRT. There’s not much implementations can do, and it’s up to the debugger to be smarter about it.

Sanitizers

ASan doesn’t break GDB on assertion failures, which is yet another source of friction. You can work around this with an environment variable:

export ASAN_OPTIONS=abort_on_error=1:print_legend=0

This works, but it’s the worst case of all: I get 7 junk stack frames on top of the failed assertion. It’s also very noisy when it traps, so the print_legend=0 helps to cut it down a bit. I want this variable so often that I set it in my shell’s .profile so that it’s always set.

With UBSan you can use -fsanitize-undefined-trap-on-error, which behaves like the improved assertion. It traps directly on the defect with no junk frames, though it prints no diagnostic. As a bonus, it also means you don’t need to link libubsan. Thanks to the bonus, it fully supplants -ftrapv for me on all platforms.

Update November 2022: This “stop” hook eliminates ASan friction by popping runtime frames — functions with the reserved __ prefix — from the call stack so that they’re not in the way when GDB takes control. It requires Python support, which is the purpose of the feature-sniff outer condition.

if !$_isvoid($_any_caller_matches)
    define hook-stop
        while $_thread && $_any_caller_matches("^__")
            up-silently
        end
    end
end

This is now part of my .gdbinit.

A better assertion

At least when under a debugger, here’s a much better assertion macro for GCC and Clang:

#define assert(c) if (!(c)) __builtin_trap()

__builtin_trap inserts a trap instruction — a built-in breakpoint. By not calling a function to raise a signal, there are no junk stack frames and no need to breakpoint on abort. It stops exactly where it should as quickly as possible. This definition works reliably with GCC across all platforms, too. On MSVC the equivalent is __debugbreak. If you’re really in a pinch then do whatever it takes to trigger a fault, like dereferencing a null pointer. A more complete definition might be:

#ifdef DEBUG
#  if __GNUC__
#    define assert(c) if (!(c)) __builtin_trap()
#  elif _MSC_VER
#    define assert(c) if (!(c)) __debugbreak()
#  else
#    define assert(c) if (!(c)) *(volatile int *)0 = 0
#  endif
#else
#  define assert(c)
#endif

None of these print a diagnostic, but that’s unnecessary when a debugger is involved.
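One caveat with the statement-style macros: inside an unbraced if/else, the macro's own `if` captures the following `else`. A possible variant, sketched here and not taken from the definitions above, makes the macro an expression instead. It assumes GCC or Clang, and `ASSERT` and `clamp_index` are hypothetical names.

```c
// Assumes GCC or Clang. ASSERT is a hypothetical name chosen to avoid
// colliding with the standard assert macro.
#define ASSERT(c) ((c) ? (void)0 : __builtin_trap())

// Hypothetical example: clamp i into [0, len), or -1 for an empty
// range. The unbraced if/else around ASSERT parses as written because
// the macro expands to an expression, not an if statement.
int clamp_index(int i, int len)
{
    if (len > 0)
        ASSERT(len <= 1 << 20);
    else
        return -1;  // empty range
    if (i < 0) return 0;
    return i < len ? i : len - 1;
}
```

It still traps at the failing line with no extra stack frames, since `__builtin_trap` is emitted inline.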

Other languages

Unfortunately the situation mostly gets worse with other language implementations, and it’s generally not possible to build a better assertion. Assertions typically have exception-like semantics, if not literally just another exception, and so they are far less reliable. If a failed assertion raises an exception, then the program won’t stop until it’s unwound the stack — running destructors and such along the way — all the way to the top level looking for a handler. It only knows there’s a problem when nobody was there to catch it.

Go officially doesn’t have assertions, though panics are a kind of assertion. However, panics have exception-like semantics, and so suffer the problems of exceptions. A Go version of my test:

package main

import "fmt"

func main() {
    defer fmt.Println("DEFER")
    for i := 0; i < 10; i++ {
        if i >= 5 {
            panic(i)
        }
    }
}

If I run this under Go’s premier debugger, Delve, the unrecovered panic causes it to break. So far so good. However, I get two junk frames:

#0 runtime.fatalpanic
#1 runtime.gopanic
#2 main.main
#3 runtime.main
#4 runtime.goexit

It only knows to stop because the Go runtime called fatalpanic, but the backtrace is a fiction: The program continued to run after the panic, enough to run all the registered defers (including printing “DEFER”), unwinding the stack to the top level, and only then did it fatalpanic. Fortunately it’s still possible to inspect all those stack frames even if some variables may have changed while unwinding, but it’s more like inspecting a core dump than a paused process.

The situation in Python is similar: assert raises AssertionError, a plain old exception, and pdb won’t break until the stack has unwound, exiting context managers and such. Only once the exception reaches the top level does it enter “post mortem debugging,” like a core dump. At least there are no junk stack frames on top. If you’re using asyncio then your program may continue running for quite a while before the right tasks are scheduled and the exception finally propagates to the top level, if ever.

The worst offender of all is Java. First jdb never breaks for unhandled exceptions. It’s up to you to set a breakpoint before the exception is thrown. But it gets worse: assertions are disabled under jdb. The Java assert statement is worse than useless.

Addendum: Don’t exit the debugger

The largest friction-reducing change I made is never exiting the debugger. Previously I would enter GDB, run my program, exit, edit/rebuild, repeat. However, there’s no reason to exit GDB! It automatically and reliably reloads symbols and updates breakpoints on symbols. It remembers your run configuration, so re-running is just r rather than interacting with shell history.

My workflow on all platforms (including Windows) is a vertically maximized Vim window and a vertically maximized terminal window. The new part for me: The terminal runs a long-term GDB session exclusively, with file set to the program I’m writing, usually set by the initial command line.

$ gdb myprogram
gdb>

Alternatively use file after starting GDB. Occasionally useful if my project has multiple binaries, and I want to examine a different program.

gdb> file myprogram

I use make and Vim’s :mak command for building from within the editor, so I don’t need to change context to build. The quickfix list takes me straight to warnings/errors. Often I’m writing something that takes input from standard input. So I use the run (r) command to set this up (along with any command line arguments).

gdb> r



You can redirect standard output as well. It remembers these settings for
plain run later, so I can test my program by entering r and nothing
else.

gdb> r


My usual workflow is edit, :mak, r, repeat. If I want to test a
different input or use different options, change the run configuration
using run again:

gdb> r -a -b -c 


On Windows you cannot recompile while the program is running. If GDB is
sitting on a breakpoint but I want to build, use kill (k) to stop it
without exiting GDB.

gdb> k


GDB has an annoying, flow-breaking yes/no prompt for this, so I recommend
set confirm no in your .gdbinit to disable it.

Sometimes a program is stuck in a loop and I need it to break in the
debugger. I try to avoid CTRL-C in the terminal since it can confuse
GDB. A safer option is to signal the process from Vim with pkill, which
GDB will catch (except on Windows):

:!pkill myprogram


I suspect many people don’t know this, but if you’re on Windows and
developing a graphical application, you can press F12 in the
debuggee’s window to immediately break the program in the attached
debugger. This is a general platform feature and works with any native
debugger. I’ve been using it quite a lot.

On that note, you can run commands from GDB with !, which is another way
to avoid having an extra terminal window around:

gdb> !git diff


In any case, GDB will re-read the binary on the next run and update
breakpoints, so it’s mostly seamless. If there’s a function I want to
debug, I set a breakpoint on it, then run.

gdb> b somefunc
gdb> r


Alternatively I’ll use a line number, which I read from Vim. Though GDB,
not being involved in the editing process, cannot track how that line
moves between builds.

An empty command repeats the last command, so once I’m at a breakpoint,
I’ll type next (n) — or step (s) to enter function calls — then
press enter each time I want to advance a line, often with my eye on the
context in Vim in the other window:

gdb> n
gdb>
gdb>


(I wish GDB could print a source listing around the breakpoint as
context, like Delve, but no such feature exists. The woeful list command
is inadequate. Update: GDB’s TUI is a reasonable compromise for GUI
applications or terminal applications running under a separate tty/console
with either tty or set new-console. I can access it everywhere since
w64devkit now supports GDB TUI.)

If I want to advance to the next breakpoint, I use continue (c):

gdb> c


If I’m walking through a loop, I want to see how variables change, but
it’s tedious to keep printing (p) the same variables again and again.
So I use display (disp) to display an expression with each prompt,
much like the “watch” window in Visual Studio. For example, if my loop
variable is i over some string str, this will show me the current
character in character format (/c).

gdb> disp/c str[i]


You can accumulate multiple expressions. Use undisplay to remove them.

Too many breakpoints? Use info breakpoints (i b) to list them, then
delete (d) the unwanted ones by ID.

gdb> i b
gdb> d 3 5 8


GDB has many more features than this, but 10 commands cover 99% of use
cases: r, c, n, s, disp, k, b, i, d, p.



A flexible, lightweight, spin-lock barrier

2022-03-13T23:55:08Z

This article was discussed on Hacker News.

The other day I wanted to try the famous memory reordering experiment
for myself. It’s the double-slit experiment of concurrency, where a
program can observe an “impossible” result on common hardware, as
though a thread had time-traveled. While getting thread timing as tight as
possible, I designed a possibly-novel thread barrier. It’s purely
spin-locked, the entire footprint is a zero-initialized integer, it
automatically resets, it can be used across processes, and the entire
implementation is just three to four lines of code.



Here’s the entire barrier implementation for two threads in C11.

// Spin-lock barrier for two threads. Initialize *barrier to zero.
void barrier_wait(_Atomic uint32_t *barrier)
{
    uint32_t v = ++*barrier;
    if (v & 1) {
        for (v &= 2; (*barrier&2) == v;);
    }
}


Or in Go:

func BarrierWait(barrier *uint32) {
    v := atomic.AddUint32(barrier, 1)
    if v&1 == 1 {
        v &= 2
        for atomic.LoadUint32(barrier)&2 == v {
        }
    }
}


Even more, these two implementations are compatible with each other. C
threads and Go goroutines can synchronize on a common barrier using these
functions. Also note how it only uses two bits.

When I was done with my experiment, I did a quick search online for other
spin-lock barriers to see if anyone came up with the same idea. I found a
couple of subtly-incorrect spin-lock barriers, and some
straightforward barrier constructions using a mutex spin-lock.

Before diving into how this works, and how to generalize it, let’s discuss
the circumstances that led to its design.

Experiment

Here’s the setup for the memory reordering experiment, where w0 and w1
are initialized to zero.

thread#1    thread#2
w0 = 1      w1 = 1
r1 = w1     r0 = w0


Considering all the possible orderings, it would seem that at least one of
r0 or r1 is 1. There seems to be no ordering where r0 and r1 could
both be 0. However, if raced precisely, this is a frequent or possibly
even majority occurrence on common hardware, including x86 and ARM.
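
To make the “no such ordering” intuition concrete, here is a small Python sketch that enumerates every interleaving that keeps each thread’s own operations in program order. It is my own illustration, not part of the experiment’s source, and the labels A1, A2, B1, B2 are hypothetical names for the four loads and stores above. It models a single shared memory where operations happen one at a time, which is exactly the assumption real hardware breaks.

```python
from itertools import permutations

# Hypothetical labels for the four operations in the table above.
# thread#1: A1 is "w0 = 1", A2 is "r1 = w1"
# thread#2: B1 is "w1 = 1", B2 is "r0 = w0"
OPS = ("A1", "A2", "B1", "B2")

def run(order):
    """Execute one interleaving on a single shared memory; return (r0, r1)."""
    mem = {"w0": 0, "w1": 0}
    reg = {}
    for op in order:
        if op == "A1":
            mem["w0"] = 1
        elif op == "A2":
            reg["r1"] = mem["w1"]
        elif op == "B1":
            mem["w1"] = 1
        else:  # B2
            reg["r0"] = mem["w0"]
    return reg["r0"], reg["r1"]

def interleavings():
    """Yield orderings that keep each thread's operations in program order."""
    for order in permutations(OPS):
        a_in_order = order.index("A1") < order.index("A2")
        b_in_order = order.index("B1") < order.index("B2")
        if a_in_order and b_in_order:
            yield order

outcomes = {run(order) for order in interleavings()}
print(sorted(outcomes))
```

The pair (0, 0) is absent from the set of outcomes, so observing it on a real machine means the hardware did not execute any of these interleavings.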

How to go about running this experiment? These are concurrent loads and
stores, so it’s tempting to use volatile for w0 and w1. However,
this would constitute a data race — undefined behavior in at least C and
C++ — and so we couldn’t really reason much about the results, at least
not without first verifying the compiler’s assembly. These are variables
in a high-level language, not architecture-level stores/loads, even with
volatile.

So my first idea was to use a bit of inline assembly for all accesses that
would otherwise be data races. x86-64:

static int experiment(int *w0, int *w1)
{
    int r1;
    __asm volatile (
        "movl  $1, %1\n"
        "movl  %2, %0\n"
        : "=r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}


ARM64 (to try on my Raspberry Pi):

static int experiment(int *w0, int *w1)
{
    int r1 = 1;
    __asm volatile (
        "str  %w0, %1\n"
        "ldr  %w0, %2\n"
        : "+r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}


This is from the point-of-view of thread#1, but I can swap the arguments
for thread#2. I’m expecting this to be inlined, and encouraging it with
static.

Alternatively, I could use C11 atomics with a relaxed memory order:

static int experiment(_Atomic int *w0, _Atomic int *w1)
{
    atomic_store_explicit(w0, 1, memory_order_relaxed);
    return atomic_load_explicit(w1, memory_order_relaxed);
}


Since this is a race and I want both threads to run their two experiment
instructions as simultaneously as possible, it would be wise to use some
sort of starting barrier… exactly the purpose of a thread barrier! It
will hold the threads back until they’re both ready.

int w0, w1, r0, r1;

// thread#1                   // thread#2
w0 = w1 = 0;
BARRIER;                      BARRIER;
r1 = experiment(&w0, &w1);    r0 = experiment(&w1, &w0);
BARRIER;                      BARRIER;

if (!r0 && !r1) {
    puts("impossible!");
}


The second thread goes straight into the barrier, but the first thread
does a little more work to initialize the experiment and a little more at
the end to check the result. The second barrier ensures they’re both done
before checking.

Running this only once isn’t so useful, so each thread loops a few million
times, hence the re-initialization in thread#1. The barriers keep them in
lockstep.

Barrier selection

On my first attempt, I made the obvious decision for the barrier: I used
pthread_barrier_t. I was already using pthreads for spawning the
extra thread, including on Windows, so this was convenient.

However, my initial results were disappointing. I only observed an
“impossible” result around one in a million trials. With some debugging I
determined that the pthreads barrier was just too damn slow, throwing off
the timing. This was especially true with winpthreads, bundled with
Mingw-w64, which in addition to the per-barrier mutex, grabs a global
lock twice per wait to manage the barrier’s reference counter.

All pthreads implementations I used were quick to yield to the system
scheduler. The first thread to arrive at the barrier would go to sleep,
the second thread would wake it up, and it was rare they’d actually race
on the experiment. This is perfectly reasonable for a pthreads barrier
designed for the general case, but I really needed a spin-lock barrier.
That is, the first thread to arrive spins in a loop until the second
thread arrives, and it never interacts with the scheduler. This happens so
frequently and quickly that it should only spin for a few iterations.

Barrier design

Spin locking means atomics. By default, atomics have sequentially
consistent ordering and will provide the necessary synchronization for the
non-atomic experiment variables. Stores (e.g. to w0, w1) made before
the barrier will be visible to all other threads upon passing through the
barrier. In other words, the initialization will propagate before either
thread exits the first barrier, and results propagate before either thread
exits the second barrier.

I know statically that there are only two threads, simplifying the
implementation. The plan: When threads arrive, they atomically increment a
shared variable to indicate such. The first to arrive will see an odd
number, telling it to atomically read the variable in a loop until the
other thread changes it to an even number.

At first it might seem that, with just two threads, a single bit would
suffice. If the bit is set, the other thread hasn’t arrived. If clear,
both threads have arrived.

void broken_wait1(_Atomic unsigned *barrier)
{
    ++*barrier;
    while (*barrier&1);
}

Or to avoid an extra load, use the result directly:

void broken_wait2(_Atomic unsigned *barrier)
{
    if (++*barrier & 1) {
        while (*barrier&1);
    }
}


Neither of these works correctly, and the other mutex-free barriers I found
all have the same defect. Consider the broader picture: Between atomic
loads in the first thread spin-lock loop, suppose the second thread
arrives, passes through the barrier, does its work, hits the next barrier,
and increments the counter. Both threads see an odd counter simultaneously
and deadlock. No good.

To fix this, the wait function must also track the phase. The first
barrier is the first phase, the second barrier is the second phase, etc.
Conveniently the rest of the integer acts like a phase counter!
Writing this out more explicitly:

void barrier_wait(_Atomic unsigned *barrier)
{
    unsigned observed = ++*barrier;
    unsigned thread_count = observed & 1;
    if (thread_count != 0) {
        // not last arrival, watch for phase change
        unsigned init_phase = observed >> 1;
        for (;;) {
            unsigned current_phase = *barrier >> 1;
            if (current_phase != init_phase) {
                break;
            }
        }
    }
}


The key: When the last thread arrives, it overflows the thread counter to
zero and increments the phase counter in one operation.

By the way, I’m using unsigned since it may eventually overflow, and
even _Atomic int overflow is undefined for the ++ operator. However,
if you use atomic_fetch_add or C++ std::atomic then overflow is
defined and you can use int.

Threads can never be more than one phase apart by definition, so only one
bit is needed for the phase counter, making this effectively a two-phase,
two-bit barrier. In my final implementation, rather than shift (>>), I
mask (&) the phase bit with 2.
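
The deadlock and its fix can be traced by hand. The following Python sketch replays the problematic interleaving against a plain integer: thread A arrives first, then thread B arrives, runs ahead, and reaches the next barrier before A loads the counter again. It is an illustration only; the function names are mine, and nothing here is actually concurrent.

```python
def spins_one_bit(counter):
    """Broken rule: keep spinning while the low bit is set."""
    return counter & 1 != 0

def spins_two_bit(counter, init_phase):
    """Fixed rule: keep spinning until the phase bit (2) changes."""
    return counter & 2 == init_phase

counter = 0

counter += 1            # A arrives at barrier 1, sees 1 (odd): must wait
a_phase = counter & 2   # phase bit A observed on arrival

counter += 1            # B arrives at barrier 1, sees 2 (even): passes
counter += 1            # B arrives at barrier 2, sees 3 (odd): must wait
b_phase = counter & 2   # phase bit B observed on arrival

# Only now does A get around to loading the counter again.
print("one bit:  A spins:", spins_one_bit(counter),
      "B spins:", spins_one_bit(counter))
print("two bits: A spins:", spins_two_bit(counter, a_phase),
      "B spins:", spins_two_bit(counter, b_phase))

counter += 1            # A arrives at barrier 2, sees 4 (even): passes
print("after A arrives, B spins:", spins_two_bit(counter, b_phase))
```

With the one-bit rule both threads keep spinning on the same odd counter. With the phase bit, A notices the phase changed and moves on, and B is released as soon as A arrives at the second barrier.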

With this spin-lock barrier, the experiment observes r0 = r1 = 0 in ~10%
of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.

Generalizing to more threads

Two threads required two bits. This generalizes to log2(n)+1 bits for
n threads, where n is a power of two. You may have already figured out
how to support more threads: spend more bits on the thread counter.

// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
void barrier_waitn(_Atomic unsigned *barrier, int n)
{
    unsigned v = ++*barrier;
    if (v & (n - 1)) {
        for (v &= n; (*barrier&n) == v;);
    }
}


Note: It never makes sense for n to exceed the logical core count!
If it does, then at least one thread must not be actively running. The
spin-lock ensures it does not get scheduled promptly, and the barrier will
waste lots of resources doing nothing in the meantime.

If the barrier is used little enough that you won’t overflow the overall
barrier integer — maybe just use a uint64_t — an implementation could
support arbitrary thread counts with the same principle using modular
division instead of the & operator. The denominator is ideally a
compile-time constant in order to avoid paying for division in the
spin-lock loop.
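
As a sketch of that division-based idea, here is a model in Python. It is only a model: Python has no atomic integers, so a threading.Lock stands in for the atomic increment, and time.sleep(0) in the loop yields to other threads, which a true spin-lock would not do. The class, its method, and the demo function are my own names. Python integers never overflow, so the overflow caveat does not arise here.

```python
import threading
import time

class SpinBarrier:
    """Model of a spin barrier for any thread count n >= 1."""

    def __init__(self, n):
        self.n = n
        self.value = 0                 # the entire barrier state
        self._lock = threading.Lock()  # stands in for an atomic increment

    def wait(self):
        with self._lock:
            self.value += 1
            v = self.value
        if v % self.n != 0:
            # Not the last arrival: wait for the phase (quotient) to change.
            phase = v // self.n
            while self.value // self.n == phase:
                time.sleep(0)          # yield instead of truly spinning

def demo(n=3, rounds=20):
    """Run n threads through the barrier; True if nobody passed early."""
    barrier = SpinBarrier(n)
    arrived = [0] * rounds
    lock = threading.Lock()
    ok = []

    def worker():
        for r in range(rounds):
            with lock:
                arrived[r] += 1
            barrier.wait()
            # Past the barrier, every thread must have checked in.
            ok.append(arrived[r] == n)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(ok) == n * rounds and all(ok)

print(demo())
```

The quotient plays the role of the phase counter and the remainder plays the role of the thread counter, just as the high and low bits do in the power-of-two version.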

While C11 _Atomic seems like it would be useful, unsurprisingly it is
not supported by one major, stubborn implementation. If you’re
using C++11 or later, then go ahead and use std::atomic since it’s
well-supported. In real, practical C programs, I will continue using dual
implementations: interlocked functions on MSVC, and GCC built-ins (also
supported by Clang) everywhere else.

#if __GNUC__
#  define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
#  define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
#  define BARRIER_INC(x) _InterlockedIncrement(x)
#  define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif

// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
static void barrier_wait(int *barrier, int n)
{
    int v = BARRIER_INC(barrier);
    if (v & (n - 1)) {
        for (v &= n; (BARRIER_GET(barrier)&n) == v;);
    }
}


This has the nice bonus that the interface does not have the _Atomic
qualifier, nor std::atomic template. It’s just a plain old int, making
the interface simpler and easier to use. It’s something I’ve grown to
appreciate from Go.

If you’d like to try the experiment yourself: reorder.c. If
you’d like to see a test of Go and C sharing a thread barrier:
coop.go.

I’m intentionally not providing the spin-lock barrier as a library. First,
it’s too trivial and small for that, and second, I believe context is
everything. Now that you understand the principle, you can whip up
your own, custom-tailored implementation when the situation calls for it,
just as the one in my experiment is hard-coded for exactly two threads.




Test cross-architecture without leaving home
2021-08-21T23:59:33Z
I like to test my software across different environments, on strange
platforms, and with alternative implementations. Each has its
own quirks and oddities that can shake bugs out earlier. C is particularly
good at this since it has such a wide selection of compilers and runs on
everything. For instance I count at least 7 distinct C compilers in Debian
alone. One advantage of writing portable software is access to a
broader testing environment, and it’s one reason I prefer to target
standards rather than specific platforms.

However, I’ve long struggled with architecture diversity. My work and
testing has been almost entirely on x86, with ARM as a distant second
(Raspberry Pi and friends). Big endian hosts are particularly rare.
However, I recently learned a trick for quickly and conveniently accessing
many different architectures without even leaving my laptop: QEMU User
Emulation. Debian and its derivatives support this very well and
require almost no setup or configuration.



Cross-compilation Example

While there are many options, my main cross-testing architecture has been
PowerPC. It’s 32-bit big endian, while I’m generally working on 64-bit
little endian, which is exactly the sort of mismatch I’m going for. I use
a Debian-supplied cross-compiler and qemu-user tools. The binfmt
support is especially slick, so that’s how I usually use it.

# apt install gcc-powerpc-linux-gnu qemu-user-binfmt


binfmt_misc is a kernel module that teaches Linux how to recognize
arbitrary binary formats. For instance, there’s a Wine binfmt so that
Linux programs can transparently exec(3) Windows .exe binaries. In the
case of QEMU User Mode, binaries for foreign architectures are loaded into
a QEMU virtual machine configured in user mode. In user mode there’s no
guest operating system, and instead the virtual machine translates guest
system calls to the host operating system.

The first package gives me powerpc-linux-gnu-gcc. The prefix is the
architecture tuple describing the instruction set and system ABI.
To try this out, I have a little test program that inspects its execution
environment:

#include <stdio.h>

int main(void)
{
    char *w = "?";
    switch (sizeof(void *)) {
    case 1: w = "8";  break;
    case 2: w = "16"; break;
    case 4: w = "32"; break;
    case 8: w = "64"; break;
    }

    char *b = "?";
    switch (*(char *)(int []){1}) {
    case 0: b = "big";    break;
    case 1: b = "little"; break;
    }

    printf("%s-bit, %s endian\n", w, b);
}


When I run this natively on x86-64:

$ gcc test.c
$ ./a.out
64-bit, little endian


Running it on PowerPC via QEMU:

$ powerpc-linux-gnu-gcc -static test.c
$ ./a.out
32-bit, big endian


Thanks to binfmt, I could execute it as though the PowerPC binary were a
native binary. With just a couple of environment variables in the right
place, I could pretend I’m developing on PowerPC — aside from emulation
performance penalties of course.

However, you might have noticed I pulled a sneaky on ya: -static. So far
what I’ve shown only works with static binaries. There’s no dynamic loader
available to run dynamically-linked binaries. Fortunately this is easy to
fix in two steps. The first step is to install the dynamic linker for
PowerPC:

# apt install libc6-powerpc-cross


The second is to tell QEMU where to find it since, unfortunately, it
cannot currently do so on its own.

$ export QEMU_LD_PREFIX=/usr/powerpc-linux-gnu


Now I can leave out the -static:

$ powerpc-linux-gnu-gcc test.c
$ ./a.out
32-bit, big endian


A practical example: Remember binitools? I’m now ready to run its
fuzz-generated test suite on this cross-testing platform.

$ git clone https://github.com/skeeto/binitools
$ cd binitools/
$ make check CC=powerpc-linux-gnu-gcc
...
PASS: 668/668


Or if I’m going to be running make often:

$ export CC=powerpc-linux-gnu-gcc
$ make -e check


Recall: make’s -e flag gives environment variables precedence over
assignments in the Makefile, so I don’t need to pass CC=... on the
command line each time.

When setting up a test suite for your own programs, consider how difficult
it would be to run the tests under customized circumstances like this. The
easier it is to run your tests, the more they’re going to be run. I’ve run
into many projects with such overly-complex test builds that even enabling
sanitizers in the test suite was a pain, let alone cross-architecture
testing.

Dependencies? There might be a way to use Debian’s multiarch support
to install these packages, but I haven’t been able to figure it out. You
likely need to build dependencies yourself using the cross compiler.

Testing with Go

None of this is limited to C (or even C++). I’ve also successfully used
this to test Go libraries and programs cross-architecture. This isn’t
nearly as important since it’s harder to write unportable Go than C — e.g.
dumb pointer tricks are literally labeled “unsafe”. However, Go
(gc) trivializes cross-compilation and is statically compiled, so it’s
incredibly simple. Once you’ve installed qemu-user-binfmt it’s entirely
transparent:

$ GOARCH=mips64 go test


That’s all there is to cross-platform testing. If for some reason binfmt
doesn’t work (WSL) or you don’t want to install it, there’s just one extra
step (package named example):

$ GOARCH=mips64 go test -c
$ qemu-mips64-static example.test


The -c option builds a test binary but doesn’t run it, instead allowing
you to choose where and how to run it.

It even works with cgo — if you’re willing to jump through the same
hoops as with C of course:

package main

// #include <stdint.h>
// uint16_t v = 0x1234;
// char *hi = (char *)&v + 0;
// char *lo = (char *)&v + 1;
import "C"
import "fmt"

func main() {
	fmt.Printf("%02x %02x\n", *C.hi, *C.lo)
}


With go run on x86-64:

$ CGO_ENABLED=1 go run example.go
34 12


Via QEMU User Mode:

$ export CGO_ENABLED=1
$ export GOARCH=mips64
$ export CC=mips64-linux-gnuabi64-gcc
$ export QEMU_LD_PREFIX=/usr/mips64-linux-gnuabi64
$ go run example.go
12 34
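
The two outputs are simply the two byte orders of the same 16-bit value. As a quick cross-check that needs no cross-compiler, Python’s struct module can lay out 0x1234 in either order explicitly. This is my own illustration and does not involve cgo or QEMU; the helper name is hypothetical.

```python
import struct
import sys

def hexbytes(fmt, value=0x1234):
    """Pack value with the given struct format; show its bytes in order."""
    return " ".join(f"{b:02x}" for b in struct.pack(fmt, value))

little = hexbytes("<H")  # lowest-addressed byte is the least significant
big = hexbytes(">H")     # lowest-addressed byte is the most significant
native = hexbytes("=H")  # byte order of the host running the interpreter

print("little:", little)
print("big:   ", big)
print("native:", native, f"({sys.byteorder} endian)")
```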


I was pleasantly surprised how well this all works.

One dimension

Despite the variety, all these architectures are still “running” the same
operating system, Linux, and so they only vary on one dimension. For most
programs primarily targeting x86-64 Linux, PowerPC Linux is practically
the same thing, while x86-64 OpenBSD is foreign territory despite sharing
an architecture and ABI (System V). Testing across operating
systems still requires spending the time to install, configure, and
maintain these extra hosts. That’s an article for another time.




More DLL fun with w64devkit: Go, assembly, and Python
2021-06-29T21:50:30Z
My previous article explained how to work with dynamic-link libraries
(DLLs) using w64devkit. These techniques also apply to other
circumstances, including with languages and ecosystems outside of C and
C++. In particular, w64devkit is a great complement to Go and reliably
fulfills all the needs of cgo (Go’s C interop) and can even
bootstrap Go itself. As before, this article is in large part an exercise
in capturing practical information I’ve picked up over time.

Go: bootstrap and cgo

The primary Go implementation, confusingly named “gc”, is an
incredible piece of software engineering. This is apparent when
building the Go toolchain itself, a process that is fast, reliable, easy,
and simple. It was originally written in C, but was re-written in Go
starting with Go 1.5. The C compiler in w64devkit can build the original C
implementation which then can be used to bootstrap any more recent
version. It’s so easy that I personally never use official binary releases
and always bootstrap from source.

You will need the Go 1.4 source, go1.4-bootstrap-20171003.tar.gz.
This “bootstrap” tarball is the last Go 1.4 release plus a few additional
bugfixes. You will also need the source of the actual version of Go you
want to use, such as Go 1.16.5 (latest version as of this writing).

Start by building Go 1.4 using w64devkit. On Windows, Go is built using a
batch script and no special build system is needed. Since it shouldn’t be
invoked with the BusyBox ash shell, I use cmd.exe explicitly.

$ tar xf go1.4-bootstrap-20171003.tar.gz
$ mv go/ bootstrap
$ (cd bootstrap/src/ && cmd /c make)


In about 30 seconds you’ll have a fully-working Go 1.4 toolchain. Next use
it to build the desired toolchain. You can move this new toolchain after
it’s built if necessary.

$ export GOROOT_BOOTSTRAP="$PWD/bootstrap"
$ tar xf go1.16.5.src.tar.gz
$ (cd go/src/ && cmd /c make)


At this point you can delete the bootstrap toolchain. You probably also
want to put Go on your PATH.

$ rm -rf bootstrap/
$ printf 'PATH="$PATH;%s/go/bin"\n' "$PWD" >>~/.profile
$ source ~/.profile


Not only is Go now available, so is the full power of cgo. (Including its
costs if used.)

Vim suggestions

Since w64devkit is oriented so much around Vim, here’s my personal Vim
configuration for Go. I don’t need or want fancy plugins, just access to
goimports and a couple of corrections to Vim’s built-in Go support ([[
and ]] navigation). The included ctags understands Go, so tags
navigation works the same as it does with C. \i saves the current
buffer, runs goimports, and populates the quickfix list with any errors.
Similarly :make invokes go build and, as expected, populates the
quickfix list.

autocmd FileType go setlocal makeprg=go\ build
autocmd FileType go map <silent> <buffer> <leader>i
    \ :update \|
    \ :cexpr system("goimports -w " . expand("%")) \|
    \ :silent edit<cr>
autocmd FileType go map <buffer> [[
    \ ?^\(func\\|var\\|type\\|import\\|package\)\><cr>
autocmd FileType go map <buffer> ]]
    \ /^\(func\\|var\\|type\\|import\\|package\)\><cr>


Go only comes with gofmt but goimports is just one command away, so
there’s little excuse not to have it:

$ go install golang.org/x/tools/cmd/goimports@latest


Thanks to GOPROXY, all Go dependencies are accessible without (or before)
installing Git, so this tool installation works with nothing more than
w64devkit and a bootstrapped Go toolchain.

cgo DLLs

The intricacies of cgo are beyond the scope of this article, but the gist
is that a Go source file contains C source in a comment followed by
import "C". The imported C object provides access to C types and
functions. Go functions marked with an //export comment, as well as the
commented C code, are accessible to C. The latter means we can use Go to
implement a C interface in a DLL, and the caller will have no idea they’re
actually talking to Go.

To illustrate, here’s a little C interface. To keep it simple, I’ve
specifically sidestepped some more complicated issues, particularly
involving memory management.

// Which DLL am I running?
int version(void);

// Generate 64 bits from a CSPRNG.
unsigned long long rand64(void);

// Compute the Euclidean norm.
float dist(float x, float y);


Here’s a C implementation which I’m calling “version 1”.

#include <math.h>
#include <windows.h>
#include <ntsecapi.h>

__declspec(dllexport)
int
version(void)
{
    return 1;
}

__declspec(dllexport)
unsigned long long
rand64(void)
{
    unsigned long long x;
    RtlGenRandom(&x, sizeof(x));
    return x;
}

__declspec(dllexport)
float
dist(float x, float y)
{
    return sqrtf(x*x + y*y);
}


As discussed in the previous article, each function is exported using
__declspec so that they’re available for import. As before:

$ cc -shared -Os -s -o hello1.dll hello1.c


Side note: This could be trivially converted into a C++ implementation
just by adding extern "C" to each declaration. It disables C++ features
like name mangling, and follows the C ABI so that the C++ functions appear
as C functions. Compiling the C++ DLL is exactly the same.

Suppose we wanted to implement this in Go instead of C. We already have
all the tools needed to do so. Here’s a Go implementation, “version 2”:

package main

import "C"
import (
	"crypto/rand"
	"encoding/binary"
	"math"
)

//export version
func version() C.int {
	return 2
}

//export rand64
func rand64() C.ulonglong {
	var buf [8]byte
	rand.Read(buf[:])
	r := binary.LittleEndian.Uint64(buf[:])
	return C.ulonglong(r)
}

//export dist
func dist(x, y C.float) C.float {
	return C.float(math.Sqrt(float64(x*x + y*y)))
}

func main() {
}


Note the use of C types for all arguments and return values. The main
function is required since this is the main package, but it will never be
called. The DLL is built like so:

$ go build -buildmode=c-shared -o hello2.dll hello2.go


Without the -o option, the DLL will lack an extension. This works fine
since it’s mostly only convention on Windows, but it may be confusing
without it.

What if we need an import library? This will be required when linking with
the MSVC toolchain. In the previous article we asked Binutils to generate
one using --out-implib. For Go we have to handle this ourselves via
gendef and dlltool.

$ gendef hello2.dll
$ dlltool -l hello2.lib -d hello2.def


The only way anyone upgrading would know version 2 was implemented in Go
is that the DLL is a lot bigger (a few MB vs. a few kB) since it now
contains an entire Go runtime.

NASM assembly DLL

We could also go the other direction and implement the DLL using plain
assembly. It won’t even require linking against a C runtime.

w64devkit includes two assemblers: GAS (Binutils) which is used by GCC,
and NASM which has friendlier syntax. I prefer the latter whenever
possible — exactly why I included NASM in the distribution. So here’s how
I implemented “version 3” in NASM assembly.

bits 64

section .text

global DllMainCRTStartup
export DllMainCRTStartup
DllMainCRTStartup:
	mov eax, 1
	ret

global version
export version
version:
	mov eax, 3
	ret

global rand64
export rand64
rand64:
	rdrand rax
	ret

global dist
export dist
dist:
	mulss  xmm0, xmm0
	mulss  xmm1, xmm1
	addss  xmm0, xmm1
	sqrtss xmm0, xmm0
	ret


The global directive is common in NASM assembly and causes the named
symbol to have the external linkage needed when linking the DLL. The
export directive is Windows-specific and is equivalent to dllexport in
C.

Every DLL must have an entrypoint, usually named DllMainCRTStartup. The
return value indicates if the DLL successfully loaded. So far this has
been handled automatically by the C implementation, but at this low level
we must define it explicitly.

Here’s how to assemble and link the DLL:

$ nasm -fwin64 -o hello3.o hello3.s
$ ld -shared -s -o hello3.dll hello3.o


Call the DLLs from Python

Python has a nice, built-in C interop, ctypes, that allows Python to
call arbitrary C functions in shared libraries, including DLLs, without
writing C to glue it together. To tie this all off, here’s a Python
program that loads all of the DLLs above and invokes each of the
functions:

import ctypes

def load(version):
    hello = ctypes.CDLL(f"./hello{version}.dll")
    hello.version.restype = ctypes.c_int
    hello.version.argtypes = ()
    hello.dist.restype = ctypes.c_float
    hello.dist.argtypes = (ctypes.c_float, ctypes.c_float)
    hello.rand64.restype = ctypes.c_ulonglong
    hello.rand64.argtypes = ()
    return hello

for hello in load(1), load(2), load(3):
    print("version", hello.version())
    print("rand   ", f"{hello.rand64():016x}")
    print("dist   ", hello.dist(3, 4))


After loading the DLL with CDLL the program defines each function
prototype so that Python knows how to call it. Unfortunately it’s not
possible to build Python with w64devkit, so you’ll also need to install
the standard CPython distribution in order to run it. Here’s the output:

$ python finale.py
version 1
rand    b011ea9bdbde4bdf
dist    5.0
version 2
rand    f7c86ff06ae3d1a2
dist    5.0
version 3
rand    2a35a05b0482c898
dist    5.0


That output is the result of four different languages interfacing in one
process: C, Go, x86-64 assembly, and Python. Pretty neat if you ask me!




Conventions for Command Line Options
2020-08-01T00:34:23Z
This article was discussed on Hacker News and critiqued on
Wandering Thoughts (2, 3).

Command line interfaces have varied throughout their brief history but
have largely converged to some common, sound conventions. The core
originates from Unix, and the Linux ecosystem extended it,
particularly via the GNU project. Unfortunately some tools initially
appear to follow the conventions, but subtly get them wrong, usually
for no practical benefit. I believe in many cases the authors simply
didn’t know any better, so I’d like to review the conventions.



Short Options

The simplest case is the short option flag. An option is a hyphen —
specifically HYPHEN-MINUS U+002D — followed by one alphanumeric
character. Capital letters are acceptable. The letters themselves have
conventional meanings and are worth following if possible.

program -a -b -c


Flags can be grouped together into one program argument. This is both
convenient and unambiguous. It’s also one of those often missed details
when programs use hand-coded argument parsers, and the lack of support
irritates me.

program -abc
program -acb


The next simplest case is a short option that takes an argument. The
argument follows the option.

program -i input.txt -o output.txt


The space is optional, so the option and argument can be packed together
into one program argument. Since the argument is required, this is still
unambiguous. This is another often-missed feature in hand-coded parsers.

program -iinput.txt -ooutput.txt


This does not prohibit grouping. When grouped, the option accepting an
argument must be last.

program -abco output.txt
program -abcooutput.txt


This technique is used to create another category, optional option
arguments. The option’s argument can be optional but still unambiguous
so long as the space is always omitted when the argument is present.

program -c       # omitted
program -cblue   # provided
program -c blue  # omitted (blue is a new argument)

program -c -x   # two separate flags
program -c-x    # -c with argument "-x"


Optional option arguments should be used judiciously since they can be
surprising, but they have their uses.
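
The attached-only rule is simple to implement. Below is a minimal Python sketch of a short option parser with optional option arguments. It is my own illustration, not code from any particular library, and the function name and its flags/optional parameters are hypothetical. To stay short it leaves out options with required arguments.

```python
def parse_short(argv, flags, optional):
    """Parse short options from argv.

    flags    -- characters that are plain flags
    optional -- characters that accept an optional, attached-only argument
    Returns (options, rest); options is a list of (char, argument-or-None).
    """
    opts = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--":
            i += 1
            break                  # explicit end of options
        if len(arg) < 2 or arg[0] != "-":
            break                  # first non-option ends option parsing
        j = 1
        while j < len(arg):
            c = arg[j]
            j += 1
            if c in optional:
                # Whatever remains in this argument is the option's
                # argument. A following argv element is never consumed.
                opts.append((c, arg[j:] or None))
                break
            elif c in flags:
                opts.append((c, None))
            else:
                raise ValueError(f"unknown option -{c}")
        i += 1
    return opts, argv[i:]

for argv in (["-c"], ["-cblue"], ["-c", "blue"], ["-c", "-x"], ["-c-x"]):
    print(argv, "->", parse_short(argv, flags="x", optional="c"))
```

Because the parser never looks at the next argument for an optional value, "-c blue" leaves blue as a non-option and "-c -x" remains two separate flags.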

Options can typically appear in any order — something parsers often
achieve via permutation — but non-options typically follow options.

program -a -b foo bar
program -b -a foo bar


GNU-style programs usually allow options and non-options to be mixed,
though I don’t consider this to be essential.

program -a foo -b bar
program foo -a -b bar
program foo bar -a -b


If a non-option looks like an option because it starts with a hyphen,
use -- to demarcate options from non-options.

program -a -b -- -x foo bar


An advantage of requiring that non-options follow options is that the
first non-option demarcates the two groups, so -- is less often
needed.

# note: without argument permutation
program -a -b foo -x bar  # 2 options, 3 non-options


Long options

Since short options can be cryptic, and there are such a limited number
of them, more complex programs support long options. A long option
starts with two hyphens followed by one or more alphanumeric, lowercase
words. Hyphens separate words. Using two hyphens prevents long options
from being confused for grouped short options.

program --reverse --ignore-backups


Occasionally flags are paired with a mutually exclusive inverse flag
that begins with --no-. This avoids a future flag day where the
default is changed in the release that also adds the flag implementing
the original behavior.

program --sort
program --no-sort


Long options can similarly accept arguments.

program --output output.txt --block-size 1024


These may optionally be connected to the argument with an equals sign
=, much like omitting the space for a short option argument.

program --output=output.txt --block-size=1024


Like before, this opens up the doors for optional option arguments. Due
to the required = this is still unambiguous.

program --color --reverse
program --color=never --reverse


The -- retains its original behavior of disambiguating option-like
non-option arguments:

program --reverse -- --foo bar
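
Here is a minimal Python sketch of these long option rules, written for illustration rather than taken from any library. The function name and the spec format (a dictionary mapping each option name to "flag", "required", or "optional") are hypothetical. It handles only long options and performs no abbreviation matching.

```python
def parse_long(argv, spec):
    """Parse long options; spec maps name -> "flag"/"required"/"optional".

    Returns (options, rest); options is a list of (name, argument-or-None).
    """
    opts = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--":
            i += 1
            break                  # everything after this is a non-option
        if not arg.startswith("--"):
            break                  # first non-option ends option parsing
        name, eq, value = arg[2:].partition("=")
        if name not in spec:
            raise ValueError(f"unknown option --{name}")
        kind = spec[name]
        i += 1
        if kind == "flag":
            if eq:
                raise ValueError(f"--{name} takes no argument")
            opts.append((name, None))
        elif kind == "optional":
            # Only an attached "=value" counts; the next argument is
            # never consumed.
            opts.append((name, value if eq else None))
        else:  # required
            if not eq:
                if i >= len(argv):
                    raise ValueError(f"--{name} requires an argument")
                value = argv[i]
                i += 1
            opts.append((name, value))
    return opts, argv[i:]

spec = {"output": "required", "block-size": "required",
        "color": "optional", "reverse": "flag"}
print(parse_long(["--output", "output.txt", "--block-size=1024"], spec))
print(parse_long(["--color", "--reverse"], spec))
print(parse_long(["--color=never", "--reverse"], spec))
print(parse_long(["--reverse", "--", "--foo", "bar"], spec))
```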


Subcommands

Some programs, such as Git, have subcommands each with their own
options. The main program itself may still have its own options distinct
from subcommand options. The program’s options come before the
subcommand and subcommand options follow the subcommand. Options are
never permuted around the subcommand.

program -a -b -c subcommand -x -y -z
program -abc subcommand -xyz


Above, the -a, -b, and -c options are for program, and the
others are for subcommand. So, really, the subcommand is another
command line of its own.
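
One way to get this behavior is to parse twice with a non-permuting parser. The sketch below uses Python’s standard getopt.getopt, which stops at the first non-option; that first non-option is the subcommand, and what follows is parsed as a second command line. The parse function and its two option strings are assumptions made up for this example.

```python
import getopt

def parse(argv, program_opts="abc", subcommand_opts="xyz"):
    """Return (program options, subcommand, subcommand options, operands)."""
    # getopt.getopt stops at the first non-option, which is the subcommand.
    opts, rest = getopt.getopt(argv, program_opts)
    if not rest:
        raise SystemExit("missing subcommand")
    subcommand, sub_argv = rest[0], rest[1:]
    # The remainder is a command line of its own with its own options.
    sub_opts, operands = getopt.getopt(sub_argv, subcommand_opts)
    return opts, subcommand, sub_opts, operands

print(parse(["-a", "-b", "-c", "subcommand", "-x", "-y", "-z"]))
print(parse(["-abc", "subcommand", "-xyz"]))
```

Since nothing is permuted, a program option placed after the subcommand is checked against the subcommand’s options instead, and is rejected if the subcommand does not define it.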

Option parsing libraries

There’s little excuse for not getting these conventions right assuming
you’re interested in following the conventions. Short options can be
parsed correctly in just ~60 lines of C code. Long options are
just slightly more complex.

GNU’s getopt_long() supports long option abbreviation — with no way to
disable it (!) — but this should be avoided.

Go’s flag package intentionally deviates from the conventions.
It only supports long option semantics, via a single hyphen. This makes
it impossible to support grouping even if all options are only one
letter. Also, the only way to combine option and argument into a single
command line argument is with =. It’s sound, but I miss both features
every time I write programs in Go. That’s why I wrote my own argument
parser. Not only does it have a nicer feature set, I like the API a
lot more, too.

Python’s primary option parsing library is argparse, and I just can’t
stand it. Despite appearing to follow convention, it actually breaks
convention and its behavior is unsound. For instance, the following
program has two options, --foo and --bar. The --foo option accepts
an optional argument, and the --bar option is a simple flag.

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--foo', type=str, nargs='?', default='X')
parser.add_argument('--bar', action='store_true')
print(parser.parse_args(sys.argv[1:]))


Here are some example runs:

$ python parse.py
Namespace(bar=False, foo='X')

$ python parse.py --foo
Namespace(bar=False, foo=None)

$ python parse.py --foo=arg
Namespace(bar=False, foo='arg')

$ python parse.py --bar --foo
Namespace(bar=True, foo=None)

$ python parse.py --foo arg
Namespace(bar=False, foo='arg')


Everything looks good except the last. If the --foo argument is
optional then why did it consume arg? What happens if I follow it with
--bar? Will it consume it as the argument?

$ python parse.py --foo --bar
Namespace(bar=True, foo=None)


Nope! Unlike arg, it left --bar alone, so instead of following the
unambiguous conventions, it has its own ambiguous semantics and attempts
to remedy them with a “smart” heuristic: “If an optional argument looks
like an option, then it must be an option!” Non-option arguments can
never follow an option with an optional argument, which makes that
feature pretty useless. Since argparse does not properly support --,
that does not help.

$ python parse.py --foo -- arg
usage: parse.py [-h] [--foo [FOO]] [--bar]
parse.py: error: unrecognized arguments: -- arg


Please, stick to the conventions unless you have really good reasons
to break them!




A Go Module Testbed
2020-02-13T01:03:24Z
I had recently lamented that due to Go’s strict module security
policy it was unreasonably difficult to experiment and practice with
modules. Modules can only be fetched from servers with valid TLS
certificates, including both the module path and repository servers.
Setting up a small, local experiment meant creating a certificate
authority, generating and signing certificates, and installing these all
in the right places. I’d much rather relax Go’s security policy for the
experiment.

As a result of that complaint, I learned that the upcoming Go
1.14 has a new feature: GOINSECURE. It’s like the old
-insecure option, but safer due to being finer grained: a whitelist of
exceptions. It’s exactly what I needed. Since then I’ve been using it to
run small module experiments. It started as some scripts, but I
eventually formalized it into its own little project.

https://github.com/skeeto/go-module-testbed [requires Go 1.14]



It’s first and foremost a shell script, and the Go source is only there
as a server. The interface is like a Python virtual environment
where “activating” the environment in a shell allows Go run from that
shell to interact with the testbed servers. The script establishes the
testbed environment and starts both servers in that environment. It
optionally accepts a testbed directory as an argument, defaulting to the
working directory as the testbed.

$ ./go-module-testbed


In addition to running the servers in the foreground, the script
populates the testbed directory with an activate script, src/
containing module Git repositories, and www/ containing the static web
server contents. These are initialized with a module named
127.0.0.1/example at v1.0.0. Why not localhost as the domain? The
domain part of a module path must contain at least one dot, and IP
addresses are acceptable.

The server logs requests to standard output so you can see each request
Go makes to the server. This has been an important part of learning
what exactly Go is requesting from the web server hosting the module
path.

There’s one giant caveat: Modules must be hosted on a privileged port.
Normally that’s 443 (HTTPS), though in this case it’s 80 (HTTP). Since
it’s a privileged port, you’ll need to do some system configuration. On
Linux it’s easy enough just to temporarily forward the testbed port 8001
to port 80.

# iptables -t nat -I OUTPUT -p tcp -d 127.0.0.1 \
           --dport 80 -j REDIRECT --to-ports 8001


Unfortunately this means, outside of doing something with namespaces or
containers, there can only be one testbed per host at a time. My goal
is just to run small, local, temporary experiments, so this isn’t a big
deal for me, but I wish it could be better.

Activating the environment

With the server running and the port forwarding configured, source the
activate script from a shell:

$ source activate


This sets up an isolated, disposable GOPATH so that the testbed is
completely isolated from your normal development. It also updates
PATH, unconditionally enables modules (GO111MODULE=on), whitelists
the testbed servers in GOINSECURE, and sets GOPRIVATE so that the
testbed modules don’t leak anywhere outside the testbed environment.

To ensure that it’s all working, try installing the demo command
from the example module:

$ go get 127.0.0.1/example/cmd/demo
go: downloading 127.0.0.1/example v1.0.0
go: found 127.0.0.1/example/cmd/demo in 127.0.0.1/example v1.0.0
$ demo
Example v1.0.0


Non-testbed modules are still accessible like normal, though all fetched
and built artifacts are isolated in the testbed environment:

$ go get nullprogram.com/x/passphrase2pgp
$ go get golang.org/x/tools/cmd/goimports


So you can mix your experiments and practice with real modules.

Running experiments

From here you could practice creating a new minor version of the example
module, and see how it appears to the module’s users.

$ sed -i s/v1.0.0/v1.1.0/ src/example/example.go 
$ git -C src/example/ commit -a -m 'Bump to v1.1.0'
[master 7a3cf82] Bump to v1.1.0
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git -C src/example/ tag -a v1.1.0 -m v1.1.0
$ go get 127.0.0.1/example/cmd/demo
go: downloading 127.0.0.1/example v1.1.0
go: found 127.0.0.1/example/cmd/demo in 127.0.0.1/example v1.1.0
$ demo
Example v1.1.0


Or try something more challenging: release a v2.0.0, which requires changing the
module path.

$ cd src/example/
$ go mod edit -module 127.0.0.1/example/v2 go.mod
$ sed -i s/v1.0.0/v2.0.0/ example.go 
$ git commit -a -m 'Bump to v2.0.0'
[master bf5c4cf] Bump to v2.0.0
 2 files changed, 2 insertions(+), 2 deletions(-)
$ git tag -a v2.0.0 -m v2.0.0
$ cd ../../www/example/
$ mkdir v2
$ sed 's#e git#e/v2 git#' index.html >v2/index.html
$ cd ../../
$ go get 127.0.0.1/example/v2/cmd/demo
go: downloading 127.0.0.1/example/v2 v2.0.0
go: downloading 127.0.0.1/example v1.0.0
go: found 127.0.0.1/example/v2/cmd/demo in 127.0.0.1/example/v2 v2.0.0
go: finding module for package 127.0.0.1/example
go: found 127.0.0.1/example in 127.0.0.1/example v1.0.0


I was able to figure this all out specifically because of my testbed.
Adding a /v2 module path on the web server was not obvious, and it’s
glossed over in the tutorials.

Nested modules

One of the under-documented corners of Go modules is nested modules.
That is, repositories that contain more than one module. (Note: These
are not called submodules since that would be confusing in the context
of Git.) The Go module testbed is a great place to try them out, and to
learn why they should never be used. Even if I never plan to use them, I
still want to understand them since I might need to debug them someday.

There are two tricky parts to nested modules: the version tag and the
module path. Neither is documented as far as I’ve seen, so I had to
figure them out from official examples.

$ mkdir src/example/nested
$ cd src/example/nested/
$ go mod init 127.0.0.1/example/nested
go: creating new go.mod: module 127.0.0.1/example/nested
$ echo package nested >nested.go
$ git add .
$ git commit -m 'Add a nested module'
[master c5b1a29] Add a nested module
 2 files changed, 4 insertions(+)
 create mode 100644 nested/go.mod
 create mode 100644 nested/nested.go
$ git tag -a nested/v1.2.3 -m v1.2.3
$ cd ../../../
$ mkdir www/example/nested
$ cp www/example/index.html www/example/nested/
$ go get 127.0.0.1/example/nested
go: downloading 127.0.0.1/example v1.0.0
go: downloading 127.0.0.1/example/nested v1.2.3
go: 127.0.0.1/example/nested upgrade => v1.2.3


Module versions are derived from the Git tag, which is global to the
repository. So how are nested module versions indicated? They get
namespaced tags, as shown above with nested/v1.2.3. If I didn’t create
this tag, it would be as if I didn’t tag any version of that module.

The second unintuitive part is the web server’s response to ?go-get=1.
At the nested module path, the response must indicate the containing
module and where to get it. In other words, it’s the same response as
the containing module, which is why I merely copied index.html.
Returning a 404 for the module path is no good — another thing I’ve
learned from the module testbed.
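To make that concrete, here is a sketch of a handler that answers ?go-get=1 identically for the module root and for any path nested beneath it. It is not the testbed’s server: goImportHandler and fetch are names invented for the example, the repository URL is a placeholder, and the only thing taken as given is the go-import meta tag format (import prefix, version control system, repository root).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// goImportHandler responds to ?go-get=1 with the same go-import meta
// tag no matter which path under the module root was requested, so a
// nested module path points back at the containing repository.
func goImportHandler(root, repo string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("go-get") != "1" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintf(w, `<meta name="go-import" content="%s git %s">`+"\n", root, repo)
	})
}

// fetch runs one request through the handler without opening a socket.
func fetch(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", path, nil))
	body, _ := io.ReadAll(rec.Result().Body)
	return string(body)
}

func main() {
	// Hypothetical repository URL, for illustration only.
	h := goImportHandler("127.0.0.1/example", "http://127.0.0.1/example.git")
	fmt.Print(fetch(h, "/example?go-get=1"))
	fmt.Print(fetch(h, "/example/nested?go-get=1"))
}
```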

There are still many things I have yet to try or practice in my module
testbed. It’s great that now when I have a niggling question about
modules or go get behavior, I can get an answer within a minute or so
without needing to dig through useless online search results.




Go's Tooling is an Undervalued Technology
2020-01-21T23:59:59Z
This article was discussed on Hacker News, on reddit, and
on Lobsters.

Regardless of your opinions of the Go programming language, the
primary implementation of Go, gc, is an incredible piece of software
engineering. Everyone ought to be blown away by it! Yet not only is it
undervalued in general, even the Go community itself doesn’t fully
appreciate it. It’s not perfect, but it has unique features never before
seen in a toolchain.



In this article, when I say “Go” I’m referring to the gc compiler.

Building Go

Since Go 1.5, Go is implemented in Go. It also has no external
dependencies, so to build Go only a Go compiler is required. On my
laptop, building the latest version of Go takes only 43 seconds. That
includes the compiler, linker, all cross-compilers, and the standard
library.

$ tar xzf go1.13.6.src.tar.gz
$ cd go/src/
$ ./make.bash


Cross-compiling Go — as in to build a Go toolchain for another platform
supported by Go — only requires setting a couple of environment
variables (GOOS, GOARCH). So, in a mere 43 seconds I can compile an
entire toolchain for any supported host! If you already have a Go
compiler on your system, there’s no reason to bother with binary
releases. Just grab the source and build it. Anyone can manage their own
toolchain with ease!

Anyone who’s ever built a GCC or Clang+LLVM toolchain, especially
anyone who’s built cross-compiler toolchains, should find this situation
totally bonkers. How could it possibly be so easy and so fast? GCC’s
configure script wouldn’t even finish before Go was already
built.

Yes, this comparison is a bit apples and oranges. Both GCC and LLVM are
more advanced compilers and produce much more efficient code, so of
course there’s more to them, and of course they take longer to build.
But does that completely justify the difference? This goes double for
GCC and LLVM cross-compiler toolchains, which are, for the most part,
very complex and difficult to build.

If you don’t already have Go, all you need is a C compiler and the Go
1.4 source code. Bootstrapping through Go 1.4 is easy, and
I’ve done it a number of times. I keep a copy of the Go 1.4 tarball just
for this reason.

How Go could improve: The linker could be better. Binaries are
already too big, and getting bigger with each release. This problem is
acknowledged by the Go developers:


  The original linker was also simpler than it is now and its
implementation fit in one Turing award winner’s head, so there’s
little abstraction or modularity. Unfortunately, as the linker grew
and evolved, it retained its lack of structure, and our sole Turing
award winner retired.


The story for native interop (cgo) isn’t great either and
requires trading away Go’s biggest strengths.

Update 2026: Go bootstrap is now complicated, and bootstrapping
from source is decreasingly practical. However, Peter0x44 has scripts
to help, which also patch Go’s linker to support w64devkit’s more capable
default object format.

Package Management

Go has decentralized package management — or, more accurately, module
management. There’s no central module manager or module registration. To
use a Go module, it need only be hosted on a reachable network with a
valid HTTPS certificate. Modules are named by a module path that
includes its network location. This means there’s no land grab for
popular module names.

An organization using Go does not need to trust an external package
repository (PyPI, etc.), nor do they need to run an internal package
repository for their own internal packages. In general it’s sufficient
just to leverage the organization’s already-existing source control
system.

Dependencies are locked to a particular version cryptographically. The
upstream source cannot change a published module for those that already
depend on it. They could still publish a new version with hostile
changes, but one should be cautious about updating dependencies —
a deliberate action — or even having dependencies in the first
place (also).

With decentralized module management, you might think that each
dependency host is a single point of failure — and you would be exactly
right. If any dependency disappears, you can no longer build in a fresh
checkout. Go has a solution for this: a module proxy. Before fetching
the dependency directly, Go (optionally, configured via GOPROXY)
checks with a module proxy that may have cached the dependency. This
eliminates the single point of failure. Google hosts a free module proxy
service for the internet, but organizations should probably run their
own module proxy internally, at least for external dependencies. This
neatly solves the left-pad problem.

Honestly, this is a breath of fresh air. Decentralized modules are a great
idea and avoid most of the issues of centralized package repositories.

How Go could improve: Go’s module management is a little too gung-ho
about HTTPS and certificates. The module documentation is still
incomplete, and the only way to get answers to some of my questions was
either to find the relevant source code in Go or to simply experiment.

Normally I could experiment using my local system, but Go refuses to do
anything with modules unless I go through HTTPS with valid certificates.
Needing to do a bunch of pointless configuration (creating a dummy CA,
dummy localhost certificates, and setting it all up) really kills my
momentum and motivation, and it delayed me in learning the new module
system. Before modules, Go supported an -insecure flag, which was
great for this sort of experimentation, but they removed it out of fear
of misuse. I’ll decide my own risks, thank you very much.

An example of a question without a documented answer: If my module path
is example.com/foo but my web server 301 redirects this request to
example.com/foo/, will Go follow this redirect and re-append
?go-get=1? (Yes.) Did I want to configure an HTTPS server just to test
this? (No.)

Update: I’ve been alerted that Go 1.14 will introduce
GOINSECURE as a finer-grained form of the old -insecure option.
This nicely solves my experimentation issue!

Vendoring

I still haven’t even gotten to one of the most powerful and unique
module features — a feature which the Go developers initially didn’t
want to include. If you have a vendor/ directory at the root of your
module, and you use -mod=vendor when compiling, Go will look in that
directory for modules. Go’s build system before modules (GOPATH) had a
similar mechanism.

This is called vendoring and the practice pre-dates Go itself. Just
check your dependency sources directly into source control alongside
your own sources and hook them into your build. Organizations will often
use this internally to lock down dependencies and to avoid depending on
external resources. Typically, vendoring is a lot of work. The project’s
build system must cooperate with the dependency’s build system. Then
eventually you may want to update a vendored dependency, which may
require more build changes.

These issues have led to the rise of header libraries and
amalgamations in C and C++: libraries that are trivial to
integrate into any project.

Go’s module system fully automates vendoring, which it can do
because it already orchestrates builds. A single command populates the
vendor/ directory with all of the module’s current dependencies:

$ go mod vendor


Normally you might follow this up by checking it into source control,
but that’s not the only way it’s useful. Instead a project could merely
include the vendor/ directory in its release source tarball. That
tarball would be the entire, standalone source for that release. With
all external dependencies packed into the tarball, the program could be
built entirely offline on any system with a Go compiler. This is
incredibly useful for me personally.

Some open source projects not written in Go have dependencies-included
releases like this (example), but it’s a ton of work. So, of
course, it’s usually not done. However, any Go project (not using cgo)
can accomplish this trivially without even thinking about it. This is
such a big deal, and nobody’s talking about it!

There’s lots of discussion about Go the programming language, but I
hardly see discussion about the amazing engineering that’s gone into Go
itself. It’s an under-appreciated piece of technology!




The Long Key ID Collider
2019-07-22T21:27:02Z
Over the last couple weeks I’ve spent a lot more time working with
OpenPGP keys. It’s a consequence of polishing my passphrase-derived
PGP key generator. I’ve tightened up the internals, and it’s
enabled me to explore the corners of the format, try interesting
things, and observe how various OpenPGP implementations respond to
weird inputs.

For one particularly cool trick, take a look at these two (private)
keys I generated yesterday. Here’s the first:

-----BEGIN PGP PRIVATE KEY BLOCK-----

xVgEXTU3gxYJKwYBBAHaRw8BAQdAjJgvdh3N2pegXPEuMe25nJ3gI7k8gEgQvCor
AExppm4AAQC0TNsuIRHkxaGjLNN6hQowRMxLXAMrkZfMcp1DTG8GBg1TzQ9udWxs
cHJvZ3JhbS5jb23CXgQTFggAEAUCXTU3gwkQmSpe7h0QSfoAAGq0APwOtCFVCxpv
d/gzKUg0SkdygmriV1UmrQ+KYx9dhzC6xwEAqwDGsSgSbCqPdkwqi/tOn+MwZ5N9
jYxy48PZGZ2V3ws=
=bBGR
-----END PGP PRIVATE KEY BLOCK-----


And the second:

-----BEGIN PGP PRIVATE KEY BLOCK-----

xVgEXTU3gxYJKwYBBAHaRw8BAQdAzjSPKjpOuJoLP6G0z7pptx4sBNiqmgEI0xiH
Z4Xb16kAAP0Qyon06UB2/gOeV/KjAjCi91MeoUd7lsA5yn82RR5bOxAkzQ9udWxs
cHJvZ3JhbS5jb23CXgQTFggAEAUCXTU3gwkQmSpe7h0QSfoAAEv4AQDLRqx10v3M
bwVnJ8BDASAOzrPw+Rz1tKbjG9r45iE7NQEAhm9QVtFd8SN337kIWcq8wXA6j1tY
+UeEsjg+SHzkqA4=
=QnLn
-----END PGP PRIVATE KEY BLOCK-----


Concatenate these and then import them into GnuPG to have a look at
them. To avoid littering in your actual keyring, especially with
private keys, use the --homedir option to set up a temporary
keyring. I’m going to omit that option in the examples.

$ gpg --import < keys.asc
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: Total number processed: 2
gpg:               imported: 2
gpg:       secret keys read: 2
gpg:   secret keys imported: 2


The user ID is “nullprogram.com” since I made these and that’s me
taking credit. “992A5EEE1D1049FA” is called the long key ID: a
64-bit value that identifies the key. It’s the lowest 64 bits of the
full key ID, a 160-bit SHA-1 hash. In the old days everyone used a
short key ID to identify keys, which was the lowest 32 bits of the
full key ID. For these keys, that would be “1D1049FA”. However, this was
deemed way too short, and everyone has since switched to long key
IDs, or even the full 160-bit key ID.

The key ID is nothing more than a SHA-1 hash of the key creation date —
unsigned 32-bit unix epoch seconds — and the public key material. So
secret keys have the same key ID as their associated public key. This
makes sense since they’re a key pair and they go together.

Look closely and you’ll notice that both keypairs have the same long
key ID. If you hadn’t already guessed from the title of this article,
these are two different keys with the same long key ID. In other
words, I’ve created a long key ID collision. The GnuPG
--list-keys command prints the full key ID since it’s so important:

$ gpg --list-keys
---------------------
pub   ed25519 2019-07-22 [SCA]
      A422F8B0E1BF89802521ECB2992A5EEE1D1049FA
uid           [ unknown] nullprogram.com

pub   ed25519 2019-07-22 [SCA]
      F43BC80C4FC2603904E7BE02992A5EEE1D1049FA
uid           [ unknown] nullprogram.com


I was only targeting the lower 64 bits, but I actually managed to
collide the lowest 68 bits by chance. So a long key ID still isn’t
enough to truly identify any particular key.

This isn’t news, of course. Nor am I the first person to create a long
key ID collision. In 2013, David Leon Gil published a long key ID
collision for two 4096-bit RSA public keys. However, that is the
only other example I was able to find. He did not include the private
keys and did not elaborate on how he did it. I know he did generate
viable keys, not just garbage for the public key portions, since they’re
both self-signed.

Creating these keys was trickier than I had anticipated, and there’s an
old, clever trick that makes it work. Building atop the work I did for
passphrase2pgp, I created a standalone tool that will create a long key
ID collision and print the two keypairs to standard output:


  https://github.com/skeeto/pgpcollider


Example usage:

$ go get -u github.com/skeeto/pgpcollider
$ pgpcollider --verbose > keys.asc


This can take up to a day to complete when run like this. The tool can
optionally coordinate many machines — see the --server / -S and
--client / -C options — to work together, greatly reducing the total
time. It took around 4 hours to create the keys above on a single
machine, generating around 1 billion extra keys in the process. As
discussed below, I actually got lucky that it only took 1 billion. If
you modify the program to do short key ID collisions, it only takes a
few seconds.

The rest of this article is about how it works.

Birthday Attacks

An important detail is that this technique doesn’t target any specific
key ID. Cloning someone’s long key ID is still very expensive. No,
this is a birthday attack. To find a collision in a space of
2^64, on average I only need to generate 2^32 samples — the square root
of that space. That’s perfectly feasible on a regular desktop computer.
To collide long key IDs, I need only generate about 4 billion IDs and
efficiently do membership tests on that set as I go.

That last step is easier said than done. Naively, that might look like
this (pseudo-code):

seen := map of long key IDs to keys
loop forever {
    key := generateKey()
    longID := key.ID[12:20]
    if longID in seen {
        output seen[longID]
        output key
        break
    } else {
        seen[longID] = key
    }
}
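The same loop can be run for real at toy scale. In this sketch the “key ID” is just a truncated SHA-1 of the seed, cut down to 32 bits so a collision shows up after roughly 2^16 samples instead of 2^32. toyID, idBits, and naiveCollide are stand-ins invented for the example, not part of the actual tool.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

const idBits = 32 // toy ID width; real long key IDs are 64 bits

// toyID stands in for "generate a key, take its long key ID": the low
// idBits bits of the last 8 bytes of SHA-1(seed).
func toyID(seed uint64) uint64 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], seed)
	sum := sha1.Sum(buf[:])
	return binary.BigEndian.Uint64(sum[12:20]) & (1<<idBits - 1)
}

// naiveCollide remembers every ID it has seen until one repeats, then
// returns the two seeds and the number of IDs generated.
func naiveCollide() (a, b uint64, n int) {
	seen := map[uint64]uint64{} // toy ID -> seed that produced it
	for seed := uint64(0); ; seed++ {
		id := toyID(seed)
		if prev, ok := seen[id]; ok {
			return prev, seed, len(seen) + 1
		}
		seen[id] = seed
	}
}

func main() {
	a, b, n := naiveCollide()
	fmt.Printf("seeds %d and %d share ID %08x after %d samples\n", a, b, toyID(a), n)
}
```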


Consider the size of that map. Each long ID is 8 bytes, and we expect
to store around 2^32 of them. That’s at minimum 32 GB of storage
just to track all the long IDs. The map itself is going to have some
overhead, too. Since these are literally random lookups, this all
mostly needs to be in RAM or else lookups are going to be very slow
and impractical.

And I haven’t even counted the keys yet. As a saving grace, these are
Ed25519 keys, so that’s 32 bytes for the public key and 32 bytes for the
private key, which I’ll need if I want to make a self-signature. (The
signature itself will be larger than the secret key.) That’s around
256GB more storage, though at least this can be stored on the hard
drive. However, to address these from the map I’d need at least 38 bits,
plus some more in case it goes over. Just call it another 8 bytes.

So that’s, at a bare minimum, 64GB of RAM plus 256GB of other storage.
Since nothing is ideal, we’ll need more than this. This is all still
feasible, but will require expensive hardware. We can do a lot better.
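The arithmetic behind those figures, spelled out as a few lines of Go. It reads “GB” as 2^30 bytes and takes the 2^32 sample estimate at face value; the helper name gib is made up.

```go
package main

import "fmt"

const samples = 1 << 32 // expected number of IDs before a collision

// gib returns the storage for entries items of bytesEach bytes, in GiB.
func gib(entries, bytesEach uint64) uint64 {
	return (entries * bytesEach) >> 30
}

func main() {
	fmt.Println("long IDs alone:         ", gib(samples, 8), "GiB")     // 8-byte ID
	fmt.Println("IDs plus key references:", gib(samples, 8+8), "GiB")   // plus 8-byte reference
	fmt.Println("stored keys:            ", gib(samples, 32+32), "GiB") // public + private
}
```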

Keys from seeds

The first thing you might notice is that we can jettison that 256GB of
storage by being a little more clever about how we generate keys. Since
we don’t actually care about the security of these keys, we can generate
each key from a seed much smaller than the key itself. Instead of using
8 bytes to reference a key in storage, just use those 8 bytes to store
the seed used to make the key.

counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    key := generateKey(seed)
    longID := key.ID[12:20]
    if longID in seen {
        output generateKey(seen[longID])
        output key
        break
    } else {
        seen[longID] = seed
    }
}


I’m incrementing a counter to generate the seeds because I don’t want
the birthday paradox to apply to my seeds. Each really must
be unique. I’m using SplitMix64 for the PRNG since I learned it’s the
fastest for Go, so a simple increment to generate seeds is
perfectly fine.

Ultimately, this still uses utterly excessive amounts of memory.
Wouldn’t it be crazy if we could somehow get this 64GB map down to just
a few MBs of RAM? Well, we can!

Rainbow tables

For decades, password crackers have faced a similar problem. They want
to precompute the hashes for billions of popular passwords so that they
can efficiently reverse those password hashes later. However, storing
all those hashes would be unnecessarily expensive, or even infeasible.

So they don’t. Instead they use rainbow tables. Password hashes
are chained together into a hash chain, where a password hash leads to a
new password, then to a hash, and so on. Then only store the beginning
and the end of each chain.

To look up a hash in the rainbow table, run the hash chain algorithm
starting from the target hash and, for each hash, check if it matches
the end of one of the chains. If so, recompute that chain and note the
step just before the target hash value. That’s the corresponding
password.

For example, suppose the password “foo” hashes to 9bfe98eb, and we
have a reduction function that maps a hash to some password. In this
case, it maps 9bfe98eb to “bar”. A trivial reduction function could
just be an index into a list of passwords. A hash chain starting from
“foo” might look like this:

foo -> 9bfe98eb -> bar -> 27af0841 -> baz -> d9d4bbcb


In reality a chain would be a lot longer. Another chain starting from
“apple” might look like this:

apple -> 7bbc06bc -> candle -> 82a46a63 -> dog -> 98c85d0a


We only store the tuples (foo, d9d4bbcb) and (apple, 98c85d0a) in
our database. If the chains had been one million hashes long, we’d
still only store those two tuples. That’s literally a 1:1000000
compression ratio!

Later on we’re faced with reversing the hash 27af0841, which isn’t
listed directly in the database. So we run the chain forward from that
hash until either we hit the maximum chain length (i.e. password not in
the table), or we recognize a hash:

27af0841 -> baz -> d9d4bbcb


That d9d4bbcb hash is listed as being in the “foo” hash chain. So we
regenerate that hash chain to discover that “bar” leads to 27af0841.
Password cracked!

Collider rainbow table

My collider works very similarly. A hash chain works like this: Start
with a 64-bit seed as before, generate a key, get the long key ID,
then use the long key ID as the seed for the next key.



There’s one big difference. In the rainbow table the purpose is to run
the hash function backwards by looking at the previous step in the
chain. For the collider, I want to know if any of the hash chains
collide. So long as each chain starts from a unique seed, it would mean
we’ve found two different seeds that lead to the same long key ID.

Alternatively, it could be two different seeds that lead to the same
key, which wouldn’t be useful, but that’s trivial to avoid.

A simple and efficient way to check if two chains contain the same
sequence is to stop them at the same place in that sequence. Rather than
run the hash chains for some fixed number of steps, they stop when they
reach a distinguishing point. In my collider a distinguishing point is
where the long key ID ends with at least N 0 bits, where N determines
the average chain length. I chose 17 bits.

func computeChain(seed) {
    loop forever {
        key := generateKey(seed)
        longID := key.ID[12:20]
        if distinguished(longID) {
            return longID
        }
        seed = longID
    }
}


If two different hash chains end on the same distinguishing point,
they’re guaranteed to have collided somewhere in the middle.



To determine where two chains collided, regenerate each chain and find
the first long key ID that they have in common. The keys at the step just
before it are the colliding keys.

counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    longID := computeChain(seed)
    if longID in seen {
        output findCollision(seed, seen[longID])
        break
    } else {
        seen[longID] = seed
    }
}
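Here is the whole scheme as a runnable toy. It is not the tool’s code. The “key ID” is a truncated SHA-1 of the seed, IDs are 32 bits wide, and a distinguishing point ends in 6 zero bits. Two details are additions needed at this small scale: chains are abandoned after maxSteps in case they fall into a cycle with no distinguishing point, and starting seeds are taken from above the ID range so that one chain’s start can never sit in the middle of another.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

const (
	idBits   = 32           // toy ID width (the real target is 64)
	dpBits   = 6            // distinguishing points end in 6 zero bits
	maxSteps = 64 << dpBits // give up on chains stuck in a cycle
)

// toyID stands in for "generate a key from seed, return its long ID".
func toyID(seed uint64) uint64 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], seed)
	sum := sha1.Sum(buf[:])
	return binary.BigEndian.Uint64(sum[12:20]) & (1<<idBits - 1)
}

func distinguished(id uint64) bool { return id&(1<<dpBits-1) == 0 }

// computeChain follows a chain to its first distinguishing point.
func computeChain(seed uint64) (end uint64, ok bool) {
	for i := 0; i < maxSteps; i++ {
		id := toyID(seed)
		if distinguished(id) {
			return id, true
		}
		seed = id
	}
	return 0, false
}

// findCollision regenerates both chains and returns the two seeds
// that come just before the first ID the chains have in common.
func findCollision(a, b uint64) (uint64, uint64) {
	pred := map[uint64]uint64{} // ID on chain a -> seed that produced it
	s := a
	for {
		id := toyID(s)
		pred[id] = s
		if distinguished(id) {
			break
		}
		s = id
	}
	s = b
	for {
		id := toyID(s)
		if p, ok := pred[id]; ok {
			return p, s
		}
		s = id
	}
}

func collide() (uint64, uint64) {
	seen := map[uint64]uint64{} // distinguishing point -> starting seed
	for counter := uint64(1) << 40; ; counter++ {
		end, ok := computeChain(counter)
		if !ok {
			continue
		}
		if prev, dup := seen[end]; dup {
			return findCollision(prev, counter)
		}
		seen[end] = counter
	}
}

func main() {
	x, y := collide()
	fmt.Printf("seeds %#x and %#x share ID %08x\n", x, y, toyID(x))
}
```

The map holds one entry per chain rather than one per key, which is where the memory savings come from.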


Hash chain computation is embarrassingly parallel, so the load can be
spread efficiently across CPU cores. With these rainbow(-like) tables,
my tool can generate and track billions of keys in mere megabytes of
memory. The additional computational cost is the time it takes to
generate a couple more chains than otherwise necessary.
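Since each chain depends only on its own starting seed, spreading the work across cores takes little more than a worker pool. This sketch uses toy stand-ins made up for the example (a 32-bit truncated SHA-1 as the ID, 6-bit distinguishing points, and a step limit for chains that never reach one). It shows only the fan-out, not the tool’s client/server mode.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"runtime"
	"sync"
)

const noEnd = ^uint64(0) // marks a chain that was abandoned

// toyID stands in for "generate a key from seed, return its long ID".
func toyID(seed uint64) uint64 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], seed)
	sum := sha1.Sum(buf[:])
	return binary.BigEndian.Uint64(sum[12:20]) & (1<<32 - 1)
}

// computeChain follows a chain to its first ID ending in 6 zero bits.
func computeChain(seed uint64) uint64 {
	for i := 0; i < 4096; i++ {
		id := toyID(seed)
		if id&63 == 0 {
			return id
		}
		seed = id
	}
	return noEnd
}

// chainEnds computes the end of every chain, one goroutine per worker.
// Each result goes to its own slice index, so no locking is needed.
func chainEnds(seeds []uint64, workers int) []uint64 {
	ends := make([]uint64, len(seeds))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				ends[i] = computeChain(seeds[i])
			}
		}()
	}
	for i := range seeds {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return ends
}

func main() {
	seeds := make([]uint64, 1000)
	for i := range seeds {
		seeds[i] = 1<<40 + uint64(i)
	}
	ends := chainEnds(seeds, runtime.NumCPU())
	fmt.Println("chains computed:", len(ends))
}
```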




Predictable, Passphrase-Derived PGP Keys
2019-07-10T04:18:29Z
tl;dr: passphrase2pgp.

One of my long-term concerns has been losing my core cryptographic keys,
or just not having access to them when I need them. I keep my important
data backed up, and if that data is private then I store it encrypted.
My keys are private, but how am I supposed to encrypt them? The chicken
or the egg?

The OpenPGP solution is to (optionally) encrypt secret keys using a key
derived from a passphrase. GnuPG prompts the user for this passphrase
when generating keys and when using secret keys. This protects the keys
at rest, and, with some caution, they can be included as part of regular
backups. The OpenPGP specification, RFC 4880, has many options
for deriving a key from this passphrase, called String-to-Key, or S2K,
algorithms. None of the options are great.

In 2012, I selected the strongest S2K configuration at the time and,
along with a very strong passphrase, put my GnuPG keyring on the
internet as part of my public dotfiles repository. It was a
kind of super-backup that would guarantee their availability anywhere
I’d need them.

My timing was bad because, with the release of GnuPG 2.1 in 2014, GnuPG
fundamentally changed its secret keyring format. S2K options are now
(quietly!) ignored when deriving the protection keys. Instead it
auto-calibrates to much weaker settings. With this new version of GnuPG,
I could no longer update the keyring in my dotfiles repository without
significantly downgrading its protection.

By 2017 I was pretty irritated with the whole situation. I let my
OpenPGP keys expire, and then I wrote my own tool to replace
the only feature of GnuPG I was actively using: encrypting my backups
with asymmetric encryption. One of its core features is that the
asymmetric keypair can be derived from a passphrase using a memory-hard
key derivation function (KDF). Attackers must commit a significant
quantity of memory (expensive) when attempting to crack the passphrase,
making the passphrase that much more effective.

Since the asymmetric keys themselves, not just the keys protecting them,
are derived from a passphrase, I never need to back them up! They’re
also always available whenever I need them. My keys are essentially
stored entirely in my brain as if I was a character in a William
Gibson story.

Tackling OpenPGP key generation

At the time I had expressed my interest in having this feature for
OpenPGP keys. It’s something I’ve wanted for a long time. I first took
a crack at it in 2013 (now the old-version branch) for
generating RSA keys. RSA isn’t that complicated but it’s very
easy to screw up. Since I was rolling it from scratch, I didn’t
really trust myself not to subtly get it wrong. Plus I never figured out
how to self-sign the key. GnuPG doesn’t accept secret keys that aren’t
self-signed, so it was never useful.

I took another crack at it in 2018 with a much more brute force
approach. When a program needs to generate keys, it will either read
from /dev/u?random or, on more modern systems, call getentropy(3).
These are all ultimately system calls, and I know how to intercept
those with Ptrace. If I want to control key generation for any
program, not just GnuPG, I could intercept these inputs and replace them
with the output of a CSPRNG keyed by a passphrase.

Keyed: Linux Entropy Interception

In practice this doesn’t work at all. Real programs like GnuPG and
OpenSSH’s ssh-keygen don’t rely solely on these entropy inputs. They
also grab entropy from other places, like getpid(2),
gettimeofday(2), and even extract their own scheduler and execution
time noise. Without modifying these programs I couldn’t realistically
control their key generation.

Besides, even if it did work, it would still be fragile and unreliable
since these programs could always change how they use the inputs. So,
ultimately, it was more of an experiment than something practical.

passphrase2pgp

For regular readers, it’s probably obvious that I recently learned
Go. While searching for good project ideas for cutting my teeth, I
noticed that Go’s “extended” standard library has a lot of useful
cryptographic support, so the idea of generating the keys myself may be
worth revisiting.

Something else also happened since my previous attempt: The OpenPGP
ecosystem now has widespread support for elliptic curve cryptography. So
instead of RSA, I could generate a Curve25519 keypair, which, by design,
is basically impossible to screw up. Not only would I be generating
keys on my own terms, I’d be doing it in style, baby.

There are two different ways to use Curve25519:


  Digital signatures: Ed25519 (EdDSA)
  Diffie–Hellman (encryption): X25519 (ECDH)


In GnuPG terms, the first would be a “sign only” key and the second is
an “encrypt only” key. But can’t you usually do both after you generate
a new OpenPGP key? If you’ve used GnuPG, you’ve probably seen the terms
“primary key” and “subkey”, but you probably haven’t had to think about
them since it’s all usually automated.

The primary key is the one associated directly with your identity.
It’s always a signature key. The OpenPGP specification says this is a
signature key only by convention, but, practically speaking, it really
must be since signatures are what hold everything together. Like
packaging tape.

If you want to use encryption, independently generate an encryption key,
then sign that key with the primary key, binding that key as a subkey
to the primary key. This all happens automatically with GnuPG.

Fun fact: Two different primary keys can have the same subkey. Anyone
could even bind any of your subkeys to their primary key! They only need
to sign the public key! Though, of course, they couldn’t actually use
your key since they’d lack the secret key. It would just be really
confusing, and could, perhaps in certain situations, even cause some
OpenPGP clients to malfunction. (Note to self: This demands
investigation!)

It’s also possible to have signature subkeys. What good is that?
Paranoid folks will keep their primary key only on a secure, air-gapped
machine, then use only subkeys on regular systems. The subkeys can be revoked and
replaced independently of the primary key if something were to go wrong.

In Go, generating an X25519 key pair is this simple (yes, it actually
takes array pointers, which is rather weird):

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/curve25519"
)

func main() {
	var seckey, pubkey [32]byte
	if _, err := rand.Read(seckey[:]); err != nil {
		panic(err) // no entropy, no key
	}
	seckey[0] &= 248
	seckey[31] &= 127
	seckey[31] |= 64
	curve25519.ScalarBaseMult(&pubkey, &seckey)
	fmt.Printf("pub %x\n", pubkey[:])
	fmt.Printf("sec %x\n", seckey[:])
}


The three bitwise operations are optional since it will do these
internally, but it ensures that the secret key is in its canonical form.
The actual Diffie–Hellman exchange requires just one more function call:
curve25519.ScalarMult().
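
To sketch what that exchange looks like end to end, here is a version
built on the standard library's crypto/ecdh package (Go 1.20+) instead
of curve25519.ScalarMult(). That substitution, the helper names, and the
fixed, obviously insecure secrets are my own assumptions for the sake of
a reproducible example:

```go
package main

import (
	"bytes"
	"crypto/ecdh"
	"fmt"
)

// publicKey computes the X25519 public key for a 32-byte secret key.
func publicKey(seckey []byte) ([]byte, error) {
	priv, err := ecdh.X25519().NewPrivateKey(seckey)
	if err != nil {
		return nil, err
	}
	return priv.PublicKey().Bytes(), nil
}

// sharedSecret combines our secret key with the peer's public key.
func sharedSecret(seckey, peerPub []byte) ([]byte, error) {
	priv, err := ecdh.X25519().NewPrivateKey(seckey)
	if err != nil {
		return nil, err
	}
	pub, err := ecdh.X25519().NewPublicKey(peerPub)
	if err != nil {
		return nil, err
	}
	return priv.ECDH(pub)
}

func main() {
	// Example-only secrets: never use fixed bytes for real keys.
	alice := bytes.Repeat([]byte{0x11}, 32)
	bob := bytes.Repeat([]byte{0x22}, 32)

	alicePub, err := publicKey(alice)
	if err != nil {
		panic(err)
	}
	bobPub, err := publicKey(bob)
	if err != nil {
		panic(err)
	}

	// Each side needs only its own secret and the other's public key.
	s1, err := sharedSecret(alice, bobPub)
	if err != nil {
		panic(err)
	}
	s2, err := sharedSecret(bob, alicePub)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(s1, s2)) // true
}
```

Both sides arrive at the same 32 bytes without either secret key ever
leaving its owner.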

For Ed25519, the API is higher-level:

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/ed25519"
)

func main() {
	seed := make([]byte, ed25519.SeedSize)
	if _, err := rand.Read(seed); err != nil {
		panic(err) // no entropy, no key
	}
	key := ed25519.NewKeyFromSeed(seed)
	fmt.Printf("pub %x\n", key[32:])
	fmt.Printf("sec %x\n", key[:32])
}


Signing a message with this key is just one function call:
ed25519.Sign().
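
As a small illustration, here is a sign-then-verify round trip. I'm
using the standard library's crypto/ed25519 (available since Go 1.13)
rather than the x/crypto import above, and the helper signAndVerify and
the all-zero seed are made up for the example:

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// signAndVerify signs msg with the key derived from seed, then reports
// whether that signature verifies against check.
func signAndVerify(seed, msg, check []byte) bool {
	key := ed25519.NewKeyFromSeed(seed) // panics unless len(seed) == SeedSize
	sig := ed25519.Sign(key, msg)
	pub := key.Public().(ed25519.PublicKey)
	return ed25519.Verify(pub, check, sig)
}

func main() {
	seed := make([]byte, ed25519.SeedSize) // all zeros: example only
	msg := []byte("hello")
	fmt.Println(signAndVerify(seed, msg, msg))             // true
	fmt.Println(signAndVerify(seed, msg, []byte("hellp"))) // false
}
```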

Unfortunately that’s the easy part. The other 400 lines of the real
program are concerned only with encoding these values in the complex
OpenPGP format. That’s the hard part. GnuPG’s --list-packets option
was really useful for debugging this part.

OpenPGP specification

(Feel free to skip this section if the OpenPGP wire format isn’t
interesting to you.)

Following the specification was a real challenge, especially since many
of the details for Curve25519 only appear in still incomplete (and still
erroneous) updates to the specification. I certainly don’t envy the
people who have to parse arbitrary OpenPGP packets. It’s finicky and has
arbitrary parts that don’t seem to serve any purpose, such as redundant
prefix and suffix bytes on signature inputs. Fortunately I only had to
worry about the subset that represents an unencrypted secret key export.

OpenPGP data is broken up into packets. Each packet begins with a tag
identifying its type, followed by a length, which itself is a variable
length. All the packets produced by passphrase2pgp are short, so I could
pretend lengths were all a single byte long.
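
To make the framing concrete, here is a sketch of a packet header with a
single-byte length. It uses the "new" packet format from RFC 4880, where
the first byte is 0xC0 combined with the tag and a one-octet length
covers bodies of up to 191 bytes. Whether passphrase2pgp emits this
exact header format isn't stated here, and the packet function is a
hypothetical helper:

```go
package main

import (
	"errors"
	"fmt"
)

// Packet tags from RFC 4880, section 4.3.
const (
	tagSignature    = 2
	tagSecretKey    = 5
	tagSecretSubkey = 7
	tagUserID       = 13
)

// packet prepends a new-format header with a one-octet length to body.
// One-octet lengths only cover bodies shorter than 192 bytes.
func packet(tag byte, body []byte) ([]byte, error) {
	if len(body) >= 192 {
		return nil, errors.New("body too long for a one-octet length")
	}
	out := []byte{0xc0 | tag, byte(len(body))}
	return append(out, body...), nil
}

func main() {
	p, err := packet(tagUserID, []byte("Foo"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", p) // cd 03 46 6f 6f
}
```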

For a secret key export with one subkey, we need the following packets
in this order:


  Secret-Key: Public-Key packet with secret key appended
  User ID: just a length-prefixed, UTF-8 string
  Signature: binds Public-Key packet (1) and User ID packet (2)
  Secret-Subkey: Public-Subkey packet with secret subkey appended
  Signature: binds Public-Key packet (1) and Public-Subkey packet (4)


A Public-Key packet contains the creation date, key type, and public key
data. A Secret-Key packet is the same, but with the secret key literally
appended on the end and a different tag. The Key ID is (essentially) a
SHA-1 hash of the Public-Key packet, meaning the creation date is part
of the Key ID. That’s important for later.
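
Here's a rough sketch of why the date matters. Per RFC 4880, a version 4
fingerprint is the SHA-1 hash of the byte 0x99, a two-byte length, and
the Public-Key packet body, and the Key ID is its low 64 bits. The key
material below is an opaque placeholder (the real encoding carries a
curve OID and an MPI), the algorithm number 22 for EdDSA is taken from
the draft updates, and both function names are mine:

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// publicKeyBody builds a simplified version 4 Public-Key packet body:
// version, creation time, algorithm, then the key material.
func publicKeyBody(created uint32, algo byte, material []byte) []byte {
	body := []byte{4, 0, 0, 0, 0, algo}
	binary.BigEndian.PutUint32(body[1:5], created)
	return append(body, material...)
}

// keyID returns the low-order 64 bits of the version 4 fingerprint.
func keyID(body []byte) uint64 {
	h := sha1.New()
	h.Write([]byte{0x99, byte(len(body) >> 8), byte(len(body))})
	h.Write(body)
	sum := h.Sum(nil)
	return binary.BigEndian.Uint64(sum[12:])
}

func main() {
	material := make([]byte, 32) // placeholder, not a real encoding
	epoch := publicKeyBody(0, 22, material)
	later := publicKeyBody(1, 22, material)
	// Same key material, creation dates one second apart.
	fmt.Printf("%016X\n%016X\n", keyID(epoch), keyID(later))
}
```

The two Key IDs come out unrelated even though only the creation date
changed by one second.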

I had wondered if the SHAttered attack could be used to create
two different keys with the same full Key ID. However, there’s no slack
space anywhere in the input, so I doubt it.

User IDs are usually a RFC 2822 name and email address, but that’s only
convention. It can literally be an empty string, though that wouldn’t be
useful. OpenPGP clients that require anything more than an empty string,
such as GnuPG during key generation, are adding artificial restrictions.

The first Signature packet indicates the signature date, the signature
issuer’s Key ID, and then optional metadata about how the primary key is
to be used and the capabilities of the key owner’s client. The signature
itself is formed by appending the Public-Key packet portion of the
Secret-Key packet, the User ID packet, and the previously described
contents of the signature packet. The concatenation is hashed, the hash
is signed, and the signature is appended to the packet. Since the
options are included in the signature, they can’t be changed by another
person.

In theory the signature is redundant. A client could accept the
Secret-Key packet and User ID packet and consider the key imported. It
would then create its own self-signature since it has everything it
needs. However, my primary target for passphrase2pgp is GnuPG, and it
will not accept secret keys that are not self-signed.

The Secret-Subkey packet is exactly the same as the Secret-Key packet
except that it uses a different tag to indicate it’s a subkey.

The second Signature packet is constructed the same as the previous
signature packet. However, it signs the concatenation of the Public-Key
and Public-Subkey packets, binding the subkey to that primary key. This
key may similarly have its own options.

To create a public key export from this input, a client need only chop
off the secret keys and fix up the packet tags and lengths. The
signatures remain untouched since they didn’t include the secret keys.
That’s essentially what other people will receive about your key.

If someone else were to create a Signature packet binding your
Public-Subkey packet with their Public-Key packet, they could set their
own options on their version of the key. So my question is: Do clients
properly track this separate set of options and separate owner for the
same key? If not, they have a problem!

The format may not sound so complex from this description, but there are
a ton of little details that all need to be correct. To make matters
worse, the feedback is usually just a binary “valid” or “invalid”. The
world could use an OpenPGP format debugger.

Usage

There is one required argument: either --uid (-u) or --load
(-l). The former specifies a User ID since a key with an empty User ID
is pretty useless. It’s my own artificial restriction on the User ID.
The latter loads a previously-generated key which will come with a User
ID.

To generate a key for use in GnuPG, just pipe the output straight into
GnuPG:

$ passphrase2pgp --uid "Foo " | gpg --import


You will be prompted for a passphrase. That passphrase is run through
Argon2id, a memory-hard KDF, with the User ID as the salt.
Deriving the key requires 8 passes over 1GB of state, which takes my
current computers around 8 seconds. With the --paranoid (-x) option
enabled, that becomes 16 passes over 2GB (perhaps not paranoid enough?).
The output is 64 bytes: 32 bytes to seed the primary key and 32 bytes to
seed the subkey.
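
The shape of that derivation can be sketched as follows. To keep the
example free of third-party packages, standInKDF below is a plain
SHA-512 hash. It is NOT Argon2id, it is not memory-hard, and it must not
be used on real passphrases. A real implementation would call a
memory-hard KDF such as argon2.IDKey from golang.org/x/crypto/argon2.
Which half of the output seeds which key is also my assumption:

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha512"
	"fmt"
)

// standInKDF is a placeholder for the real KDF. It is NOT Argon2id and
// is unsafe for passphrases. It only produces 64 deterministic bytes
// from the passphrase with the User ID acting as the salt.
func standInKDF(passphrase, uid string) [64]byte {
	return sha512.Sum512([]byte(uid + "\x00" + passphrase))
}

// deriveKeys seeds the primary signing key from the first 32 bytes and
// returns the last 32 bytes as the subkey seed (assumed ordering).
func deriveKeys(passphrase, uid string) (ed25519.PrivateKey, []byte) {
	out := standInKDF(passphrase, uid)
	return ed25519.NewKeyFromSeed(out[:32]), out[32:]
}

func main() {
	a, _ := deriveKeys("correct horse", "Foo")
	b, _ := deriveKeys("correct horse", "Foo")
	c, _ := deriveKeys("correct horse", "Bar")
	// Same inputs reproduce the key. A different User ID does not.
	fmt.Println(a.Equal(b), a.Equal(c)) // true false
}
```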

Despite the aggressive KDF settings, you will still need to choose a
strong passphrase. Anyone who has your public key can mount an offline
attack. A 10-word Diceware or Pokerware passphrase is more than
sufficient (~128 bits) while also being quite reasonable to memorize.
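
The arithmetic behind that estimate is short enough to check. This
sketch assumes the standard Diceware list of 6^5 = 7776 words and words
chosen uniformly at random, and entropyBits is a hypothetical helper:

```go
package main

import (
	"fmt"
	"math"
)

// entropyBits returns the entropy, in bits, of a passphrase made of
// words drawn uniformly at random from a list of listSize words.
func entropyBits(words int, listSize float64) float64 {
	return float64(words) * math.Log2(listSize)
}

func main() {
	fmt.Printf("%.1f bits\n", entropyBits(10, 7776)) // 129.2 bits
}
```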

Since the User ID is the salt, an attacker couldn’t build a single
rainbow table to attack passphrases for different people. (Though your
passphrase really should be strong enough that this won’t matter!) The
cost is that you’ll need to use exactly the same User ID again to
reproduce the key. In theory you could change the User ID afterward to
whatever you like without affecting the Key ID, though it will require a
new self-signature.

The keys are not encrypted (no S2K), and there are few options you can
choose when generating the keys. If you want to change any of this, use
GnuPG’s --edit-key tool after importing. For example, to set a
protection passphrase:

$ gpg --edit-key Foo
gpg> passwd


There’s a lot that can be configured from this interface.

If you just need the public key to publish or share, the --public
(-p) option will suppress the private parts and output only a public
key. It works well in combination with ASCII armor, --armor (-a).
For example, to put your public key on the clipboard:

$ passphrase2pgp -u '...' -ap | xclip


The tool can create detached signatures (--sign, -S) entirely on its
own, too, so you don’t need to import the keys into GnuPG just to make
signatures:

$ passphrase2pgp --sign --uid '...' program.exe


This would create a file named program.exe.sig with the detached
signature, ready to be verified by another OpenPGP implementation. In
fact, you can hook it directly up to Git for signing your tags and
commits without GnuPG:

$ git config --global gpg.program passphrase2pgp


This only works for signing, and it cannot verify (verify-tag or
verify-commit).

It’s pretty tedious to enter the --uid option all the time, so, if
omitted, passphrase2pgp will infer the User ID from the environment
variables REALNAME and EMAIL. Combined with the KEYID environment
variable (see the README for details), you can easily get away with
never storing your keys: only generate them on demand when needed.

That’s how I intend to use passphrase2pgp. When I want to sign a file,
I’ll only need one option, one passphrase prompt, and a few seconds of
patience:

$ passphrase2pgp -S path/to/file


January 1, 1970

The first time you run the tool you might notice one offensive aspect of
its output: Your key will be dated January 1, 1970 — i.e. unix epoch
zero. This predates PGP itself by more than two decades, so it might
alarm people who receive your key.

Why do this? As I noted before, the creation date is part of the Key ID.
Use a different date, and, as far as OpenPGP is concerned, you have a
different key. Since users probably don’t want to remember a specific
datetime, at seconds resolution, in addition to their passphrase,
passphrase2pgp uses the same hard-coded date by default. A date of
January 1, 1970 is like NULL in a database: no data.

If you don’t like this, you can override it with the --time (-t) or
--now (-n) options, but it’s up to you to remain consistent.

Vanity Keys

If you’re interested in vanity keys — e.g. where the Key ID spells out
words or looks unusual — it wouldn’t take much work to hack up the
passphrase2pgp source into generating your preferred vanity keys. It
would easily beat anything else I could find online.

Reconsidering limited OpenPGP

Initially my intention was never to output an encryption subkey, and
passphrase2pgp would only be useful for signatures. By default it still
only produces a sign key, but you can still get an encryption subkey
with the --subkey (-s) option. I figured it might be useful to
generate an encryption key, even if it’s not output by default. Users
can always ask for it later if they have a need for it.

Why only a signing key? Nobody should be using OpenPGP for encryption
anymore. Use better tools instead and retire the 20th century
cryptography. If you don’t have an encryption subkey, nobody can
send you OpenPGP-encrypted messages.

In contrast, OpenPGP signatures are still kind of useful and lack a
practical alternative. The Web of Trust failed to reach critical mass,
but that doesn’t seem to matter much in practice. Important OpenPGP keys
can be bootstrapped off TLS by strategically publishing them on HTTPS
servers. Keybase.io has done interesting things in this area.

Further, GitHub officially supports OpenPGP signatures, and I
believe GitLab does too. This is another way to establish trust for a
key. IMHO, there’s generally too much emphasis on binding a person’s
legal identity to their OpenPGP key (e.g. the idea behind key-signing
parties). I suppose that’s useful for holding a person legally
accountable if they do something wrong. I’d prefer to trust a key which has
an established history of valuable community contributions, even if done
so only under a pseudonym.

So sometime in the future I may again advertise an OpenPGP public key.
If I do, those keys would certainly be generated with passphrase2pgp. I
may not even store the secret keys on a keyring, and instead generate
them on the fly only when I occasionally need them.




Go Slices are Fat Pointers
2019-06-30T21:27:19Z
This article was discussed on Hacker News.

One of the frequent challenges in C is that pointers are nothing but a
memory address. A callee who is passed a pointer doesn’t truly know
anything other than the type of object being pointed at, which says some
things about alignment and how that pointer can be used… maybe. If it’s
a pointer to void (void *) then not even that much is known.



The number of consecutive elements being pointed at is also not known.
It could be as few as zero, so dereferencing would be illegal. This can
be true even when the pointer is not null. Pointers can go one past the
end of an array, at which point it points to zero elements. For example:

void foo(int *);

void bar(void)
{
    int array[4];
    foo(array + 4);  // pointer one past the end
}


In some situations, the number of elements is known, at least to the
programmer. For example, the function might have a contract that says it
must be passed at least N elements, or exactly N elements. This
could be communicated through documentation.

/** Foo accepts 4 int values. */
void foo(int *);


Or it could be implied by the function’s prototype. Despite the
following function appearing to accept an array, that’s actually a
pointer, and the “4” isn’t relevant to the prototype.

void foo(int[4]);


C99 introduced a feature to make this a formal part of the prototype,
though, unfortunately, I’ve never seen a compiler actually use this
information.

void foo(int[static 4]);  // >= 4 elements, cannot be null


Another common pattern is for the callee to accept a count parameter.
For example, the POSIX write() function:

ssize_t write(int fd, const void *buf, size_t count);


The necessary information describing the buffer is split across two
arguments. That can become tedious, and it’s also a source of serious
bugs if the two parameters aren’t in agreement (buffer overflow,
information disclosure, etc.). Wouldn’t it be nice if this
information was packed into the pointer itself? That’s essentially the
definition of a fat pointer.

Fat pointers via bit hacks

If we assume some things about the target platform, we can encode fat
pointers inside a plain pointer with some dirty pointer
tricks, exploiting unused bits in the pointer value. For
example, currently on x86-64, only the lower 48 bits of a pointer are
actually used. The other 16 bits could carefully be used for other
information, like communicating the number of elements or bytes:

// NOTE: x86-64 only!
unsigned char buf[1000];
uintptr_t addr = (uintptr_t)buf & 0xffffffffffff;
uintptr_t pack = (sizeof(buf) << 48) | addr;
void *fatptr = (void *)pack;


The other side can unpack this to get the components back out. Obviously
16 bits for the count will often be insufficient, so this would more
likely be used for baggy bounds checks.

Further, if we know something about the alignment — say, that it’s
16-byte aligned — then we can also encode information in the least
significant bits, such as a type tag.
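
Both tricks, and the unpacking on the other side, can be sketched with
plain integer arithmetic. This is Go operating on uint64 values that
merely stand in for addresses. Doing this to real Go pointers would hide
them from the garbage collector, so treat it purely as an illustration
of the bit layout. The 48-bit and 16-byte-alignment assumptions come
from the text, and the function names are mine:

```go
package main

import "fmt"

const (
	addrBits = 48
	addrMask = (1 << addrBits) - 1
	tagMask  = 0xf // low 4 bits are free with 16-byte alignment
)

// pack stores count in the upper 16 bits and tag in the low 4 bits of
// a 16-byte-aligned, 48-bit address.
func pack(addr uint64, count uint16, tag uint8) uint64 {
	return (uint64(count) << addrBits) |
		(addr & addrMask &^ tagMask) |
		(uint64(tag) & tagMask)
}

// unpack splits a packed value back into its three components.
func unpack(p uint64) (addr uint64, count uint16, tag uint8) {
	addr = p & addrMask &^ tagMask
	count = uint16(p >> addrBits)
	tag = uint8(p & tagMask)
	return addr, count, tag
}

func main() {
	p := pack(0x7ffd12345670, 1000, 5)
	addr, count, tag := unpack(p)
	fmt.Printf("%#x %d %d\n", addr, count, tag) // 0x7ffd12345670 1000 5
}
```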

Fat pointers via a struct

That’s all fragile, non-portable, and rather limited. A more robust
approach is to lift pointers up into a richer, heavier type, like a
structure.

struct fatptr {
    void *ptr;
    size_t len;
};


Functions accepting these fat pointers no longer need to accept a count
parameter, and they’d generally accept the fat pointer by value.

ssize_t fatptr_write(int fd, struct fatptr);


In typical C implementations, the structure fields would be passed
practically, if not exactly, the same way as the individual parameters would
have been passed, so it’s really no less efficient. (Update June 2024:
Pengji Zhang pointed out that this applies only to the 2-element struct
fatptr, and not to 3-element slice headers discussed below.)

To help keep this straight, we might employ some macros:

#define COUNTOF(array) \
    (sizeof(array) / sizeof(array[0]))

#define FATPTR(ptr, count) \
    (struct fatptr){ptr, count}

#define ARRAYPTR(array) \
    FATPTR(array, COUNTOF(array))

/* ... */

unsigned char buf[40];
fatptr_write(fd, ARRAYPTR(buf));


There are obvious disadvantages of this approach, like type confusion
due to that void pointer, the inability to use const, and just being
weird for C. I wouldn’t use it in a real program, but bear with me.

Before I move on, I want to add one more field to that fat pointer
struct: capacity.

struct fatptr {
    void *ptr;
    size_t len;
    size_t cap;
};


This communicates not how many elements are present (len), but the total
size of the buffer, and so how much additional space is left (cap - len).
This allows callees to know how much room is left for, say, appending new
elements.

// Fill the remainder of an int buffer with a value.
void
fill(struct fatptr ptr, int value)
{
    int *buf = ptr.ptr;
    for (size_t i = ptr.len; i < ptr.cap; i++) {
        buf[i] = value;
    }
}


Since the callee modifies the fat pointer, it should be returned:

struct fatptr
fill(struct fatptr ptr, int value)
{
    int *buf = ptr.ptr;
    for (size_t i = ptr.len; i < ptr.cap; i++) {
        buf[i] = value;
    }
    ptr.len = ptr.cap;
    return ptr;
}


Congratulations, you’ve got slices! Except that in Go they’re a proper
part of the language and so don’t rely on hazardous hacks or tedious
bookkeeping. The fatptr_write() function above is nearly functionally
equivalent to the Writer.Write() method in Go, which accepts a slice:

type Writer interface {
	Write(p []byte) (n int, err error)
}


The buf and count parameters are packed together as a slice, and the
fd parameter is instead the receiver (the object being acted upon by
the method).

Go slices

Go famously has pointers, including internal pointers, but not pointer
arithmetic. You can take the address of (nearly) anything, but
you can’t make that pointer point at anything else, even if you took the
address of an array element. Pointer arithmetic would undermine Go’s
type safety, so it can only be done through special mechanisms in the
unsafe package.

But pointer arithmetic is really useful! It’s handy to take an address
of an array element, pass it to a function, and allow that function to
modify a slice (wink, wink) of the array. Slices are pointers that
support exactly this sort of pointer arithmetic, but safely. Unlike
the & operator which creates a simple pointer, the slice operator
derives a fat pointer.

func fill([]int, int) []int

var array [8]int

// len == 0, cap == 8, like &array[0]
fill(array[:0], 1)
// array is [1, 1, 1, 1, 1, 1, 1, 1]

// len == 0, cap == 4, like &array[4]
fill(array[4:4], 2)
// array is [1, 1, 1, 1, 2, 2, 2, 2]


The fill function could take a slice of the slice, effectively moving
the pointer around with pointer arithmetic, but without violating memory
safety due to the additional “fat pointer” information. In other words,
fat pointers can be derived from other fat pointers.
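
The fill declaration above has no body, so here is one plausible
implementation that mirrors the C version, along with a slice derived
from another slice. The body is my own sketch, not necessarily what the
original had in mind:

```go
package main

import "fmt"

// fill sets every element between len(s) and cap(s) to value and
// returns the slice extended to its full capacity.
func fill(s []int, value int) []int {
	n := len(s)
	s = s[:cap(s)] // reslicing up to cap stays inside the backing array
	for i := n; i < len(s); i++ {
		s[i] = value
	}
	return s
}

func main() {
	var array [8]int
	fill(array[:0], 1)
	fill(array[4:4], 2)
	fmt.Println(array) // [1 1 1 1 2 2 2 2]

	// Deriving a fat pointer from another fat pointer.
	s := array[2:6]  // len == 4, cap == 6
	sub := s[1:3]    // len == 2, cap == 5
	fmt.Println(len(sub), cap(sub)) // 2 5
}
```

Slicing past the capacity, such as s[:cap(s)+1], panics at run time
instead of silently walking off the end of the array.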

Slices aren’t as universal as pointers, at least at the moment. You can
take the address of any variable using &, but you can’t take a slice
of any variable, even if it would be logically sound.

var foo int

// attempt to make len = 1, cap = 1 slice backed by foo
var fooslice []int = foo[:]   // compile-time error!


That wouldn’t be very useful anyway. However, if you really wanted to
do this, the unsafe package can accomplish it. I believe the resulting
slice would be perfectly safe to use:

// Convert to one-element array, then slice
fooslice = (*[1]int)(unsafe.Pointer(&foo))[:]


Update: Chris Siebenmann speculated about why this requires
unsafe.
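
Since Go 1.17 the unsafe package also has a Slice function that
expresses the same conversion more directly. A minimal sketch, with
sliceOf as a hypothetical wrapper:

```go
package main

import (
	"fmt"
	"unsafe"
)

// sliceOf returns a len == 1, cap == 1 slice backed by *p.
func sliceOf(p *int) []int {
	return unsafe.Slice(p, 1)
}

func main() {
	foo := 1
	s := sliceOf(&foo)
	s[0] = 42                        // writes through to foo
	fmt.Println(foo, len(s), cap(s)) // 42 1 1
}
```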

Of course, slices are super flexible and have many more uses that look
less like fat pointers, but this is still how I tend to reason about
slices when I write Go.




UTF-8 String Indexing Strategies
2019-05-29T21:52:06Z
This article was discussed on Hacker News.

When designing or, in some cases, implementing a programming language
with built-in support for Unicode strings, an important decision must be
made about how to represent or encode those strings in memory. Not all
representations are equal, and there are trade-offs between different
choices.



One issue to consider is that strings typically feature random access
indexing of code points with a time complexity resembling constant
time (O(1)). However, not all string representations actually
support this well. Strings using variable length encoding, such as
UTF-8 or UTF-16, have O(n) time complexity indexing, ignoring
special cases (discussed below). The most obvious choice to achieve
O(1) time complexity — an array of 32-bit values, as in UCS-4 —
makes very inefficient use of memory, especially with typical strings.
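
To see where the O(n) comes from, here is a naive code point index over
a UTF-8 string. There is no way to jump straight to the i-th code point,
so it decodes everything before it. The function runeAt is a
hypothetical helper, not a standard library API:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAt returns the code point at code point index i by decoding from
// the start of the string, so its cost grows with i.
func runeAt(s string, i int) (rune, bool) {
	for i >= 0 && len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		if i == 0 {
			return r, true
		}
		s = s[size:] // skip this code point, however many bytes it took
		i--
	}
	return 0, false
}

func main() {
	r, ok := runeAt("π ≈ 3.14", 2)
	fmt.Printf("%c %v\n", r, ok) // ≈ true
}
```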

Despite this, UTF-8 is still chosen in a number of programming
languages, or at least in their implementations. In this article I’ll
discuss three examples — Emacs Lisp, Julia, and Go — and how each takes a
slightly different approach.

Emacs Lisp

Emacs Lisp has two different types of strings that generally can be used
interchangeably: unibyte and multibyte. In fact, the difference
between them is so subtle that I bet that most people writing Emacs Lisp
don’t even realize there are two kinds of strings.

Emacs Lisp uses UTF-8 internally to encode all “multibyte” strings and
buffers. To fully support arbitrary sequences of bytes in the files
being edited, Emacs uses its own extension of Unicode to
precisely and unambiguously represent raw bytes intermixed with text.
Any arbitrary sequence of bytes can be decoded into Emacs’ internal
representation, then losslessly re-encoded back into the exact same
sequence of bytes.

Unibyte strings and buffers are really just byte-strings. In practice,
they’re essentially ISO/IEC 8859-1, a.k.a. Latin-1. It’s a Unicode
string where all code points are below 256. Emacs prefers the smallest
and simplest string representation when possible, similar to CPython
3.3+.

(multibyte-string-p "hello")
;; => nil

(multibyte-string-p "π ≈ 3.14")
;; => t


Emacs Lisp strings are mutable, and therein lies the kicker: As soon as
you insert a code point above 255, Emacs quietly converts the string to
multibyte.

(defvar fish "fish")

(multibyte-string-p fish)
;; => nil

(setf (aref fish 2) ?ŝ
      (aref fish 3) ?o)

fish
;; => "fiŝo"

(multibyte-string-p fish)
;; => t


Constant time indexing into unibyte strings is straightforward, and
Emacs does the obvious thing when indexing into unibyte strings. It
helps that most strings in Emacs are probably unibyte, even when the
user isn’t working in English.

Most buffers are multibyte, even if those buffers are generally just
ASCII. Since Emacs uses gap buffers it generally doesn’t matter:
Nearly all accesses are tightly clustered around the point, so O(n)
indexing doesn’t often matter.

That leaves multibyte strings. Consider these idioms for iterating
across a string in Emacs Lisp:

(dotimes (i (length string))
  (let ((c (aref string i)))
    ...))

(cl-loop for c being the elements of string
         ...)


The latter expands into essentially the same as the former: An
incrementing index that uses aref to index to that code point. So is
iterating over a multibyte string — a common operation — an O(n^2)
operation?

The good news is that, at least in this case, no! It’s essentially just
as efficient as iterating over a unibyte string. Before going over why,
consider this little puzzle. Here’s a little string comparison function
that compares two strings a code point at a time, returning their first
difference:

(defun compare (string-a string-b)
  (cl-loop for a being the elements of string-a
           for b being the elements of string-b
           unless (eql a b)
           return (cons a b)))


Let’s examine benchmarks with some long strings (100,000 code points):

(benchmark-run
    (let ((a (make-string 100000 0))
          (b (make-string 100000 0)))
      (compare a b)))
;; => (0.012568031 0 0.0)


Using two zeroed unibyte strings, it takes 13ms. How about changing
the last code point in one of them to 256, converting it to a multibyte
string:

(benchmark-run
    (let ((a (make-string 100000 0))
          (b (make-string 100000 0)))
      (setf (aref a (1- (length a))) 256)
      (compare a b)))
;; => (0.012680513 0 0.0)


Same running time, so that multibyte string cost nothing more to iterate
across. Let’s try making them both multibyte:

(benchmark-run
    (let ((a (make-string 100000 0))
          (b (make-string 100000 0)))
      (setf (aref a (1- (length a))) 256
            (aref b (1- (length b))) 256)
      (compare a b)))
;; => (2.327959762 0 0.0)


That took 2.3 seconds: about 180x longer to run! Iterating over two
multibyte strings concurrently seems to have broken an optimization.
Can you reason about what’s happened?

To avoid the O(n) cost on this common indexing operation, Emacs keeps
a “bookmark” for the last indexing location into a multibyte string.
If the next access is nearby, it can start looking from this
bookmark, forwards or backwards. Like a gap buffer, this gives a big
advantage to clustered accesses, including iteration.

However, this string bookmark is global, one per Emacs instance, not
one per string. In the last benchmark, the two multibyte strings are
constantly fighting over a single string bookmark, and indexing in the
comparison function is reduced to O(n^2) time complexity.
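
The effect is easy to reproduce in miniature. The following is a toy
model in Go, not Emacs' actual implementation: a single shared bookmark
that only resumes forwards, with a step counter to expose the cost. All
of the names are invented for the illustration:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// indexer models one shared bookmark: the last string indexed, plus
// the code point index and byte offset reached in it.
type indexer struct {
	str     *string
	charIdx int
	byteIdx int
	steps   int // code points skipped over, to make the cost visible
}

// at returns the code point at code point index i of *s, resuming from
// the bookmark only if it refers to the same string at or before i.
func (ix *indexer) at(s *string, i int) rune {
	ci, bi := 0, 0
	if ix.str == s && ix.charIdx <= i {
		ci, bi = ix.charIdx, ix.byteIdx
	}
	for ; ci < i; ci++ {
		_, size := utf8.DecodeRuneInString((*s)[bi:])
		bi += size
		ix.steps++
	}
	ix.str, ix.charIdx, ix.byteIdx = s, ci, bi
	r, _ := utf8.DecodeRuneInString((*s)[bi:])
	return r
}

// scan indexes the first n code points of each string in lockstep,
// like the compare function, and returns the total number of steps.
func scan(n int, strs ...*string) int {
	var ix indexer
	for i := 0; i < n; i++ {
		for _, s := range strs {
			ix.at(s, i)
		}
	}
	return ix.steps
}

func main() {
	a := strings.Repeat("π", 100)
	b := strings.Repeat("π", 100)
	fmt.Println(scan(100, &a))     // 99
	fmt.Println(scan(100, &a, &b)) // 9900
}
```

With one string, each access advances the bookmark by a single code
point. With two, every access evicts the other string's bookmark and
has to start over from the beginning.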

So, Emacs pretends it has constant time access into its UTF-8 text
data, but it’s only faking it with some simple optimizations. This
usually works out just fine.

Julia

Another approach is to not pretend at all, and to make this limitation
of UTF-8 explicit in the interface. Julia took this approach, and it
was one of my complaints about the language. I don’t think
this is necessarily a bad choice, but I do still think it’s
inappropriate considering Julia’s target audience (i.e. Matlab users).

Julia strings are explicitly byte strings containing valid UTF-8 data.
All indexing occurs on bytes, which is trivially constant time, and
always decodes the multibyte code point starting at that byte. But
it is an error to index to a byte that doesn’t begin a code point.
That error is also trivially checked in constant time.

s = "π"

s[1]
# => 'π'

s[2]
# ERROR: UnicodeError: invalid character index
#  in getindex at ./strings/basic.jl:37


Slices are still over bytes, but they “round up” to the end of the
current code point:

s[1:1]
# => "π"


Iterating over a string requires helper functions which keep an internal
“bookmark” so that each access is constant time:

for i in eachindex(string)
    c = string[i]
    # ...
end


So Julia doesn’t pretend, it makes the problem explicit.
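
To show why that check is constant time, here is roughly how
Julia-style byte indexing could look in Go. In UTF-8, continuation bytes
always have the bit pattern 10xxxxxx, so a single byte inspection
decides whether an index starts a code point. Note that Go indexes from
0 where Julia indexes from 1, and charAt is a hypothetical helper:

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

var errInvalidIndex = errors.New("invalid character index")

// charAt decodes the code point starting at byte index i, or fails if
// i is out of range or lands on a continuation byte.
func charAt(s string, i int) (rune, error) {
	if i < 0 || i >= len(s) || !utf8.RuneStart(s[i]) {
		return 0, errInvalidIndex
	}
	r, _ := utf8.DecodeRuneInString(s[i:])
	return r, nil
}

func main() {
	s := "π"
	r, err := charAt(s, 0)
	fmt.Printf("%c %v\n", r, err) // π <nil>
	_, err = charAt(s, 1)
	fmt.Println(err) // invalid character index
}
```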

Go

Go is very similar to Julia, but takes an even more explicit view of
strings. All strings are byte strings and there are no restrictions on
their contents. Conventionally strings contain UTF-8 encoded text, but
this is not strictly required. There’s a unicode/utf8 package for
working with strings containing UTF-8 data.
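
For instance, here is a short sketch using a few functions from that
package on a string that is deliberately not valid UTF-8. The describe
wrapper is hypothetical, while the utf8 calls are the real ones:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length, the code point count, and whether
// the string is entirely valid UTF-8.
func describe(s string) (nbytes, nrunes int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
	// Two bytes for π plus one stray byte, which counts as one rune.
	fmt.Println(describe("π\xff")) // 3 2 false
}
```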

Beyond convention, the range clause also assumes the string contains
UTF-8 data, and it’s not an error if it does not. Bytes not containing
valid UTF-8 data appear as a REPLACEMENT CHARACTER (U+FFFD).

func main() {
    s := "π\xff"
    for _, r := range s {
        fmt.Printf("U+%04x\n", r)
    }
}

// U+03c0
// U+fffd


A further case of the language favoring UTF-8 is that converting a string
to []rune decodes strings into code points, like UCS-4, again using
REPLACEMENT CHARACTER:

func main() {
    s := "π\xff"
    r := []rune(s)
    fmt.Printf("U+%04x\n", r[0])
    fmt.Printf("U+%04x\n", r[1])
}

// U+03c0
// U+fffd


So, like Julia, there’s no pretending, and the programmer explicitly
must consider the problem.

Preferences

All-in-all I probably prefer how Julia and Go are explicit with
UTF-8’s limitations, rather than Emacs Lisp’s attempt to cover it up
with an internal optimization. Since the abstraction is leaky, it may
as well be made explicit.