Some sanity for C and C++ development on Windows

A hard reality of C and C++ software development on Windows is that there has never been a good, native C or C++ standard library implementation for the platform. A standard library should abstract over the underlying host facilities in order to ease portable software development. On Windows, C and C++ are so poorly hooked up to operating system interfaces that most portable or mostly-portable software (programs which work perfectly elsewhere) is subtly broken on Windows, particularly outside of the English-speaking world. The reasons are almost certainly political rather than technical, originally motivated by vendor lock-in, which adds insult to injury. This article is about what’s wrong, how it’s wrong, and some easy techniques to deal with it in portable software.

There are multiple C implementations, so how could they all be bad, even the early ones? Microsoft’s C runtime has defined how the standard library should work on the platform, and everyone else followed along for the sake of compatibility. I’m excluding Cygwin and its major fork, MSYS2, even though they don’t inherit any of these flaws. They change so much that they’re effectively whole new platforms, not truly “native” to Windows.

In practice, C++ standard libraries are implemented on top of a C standard library, which is why C++ shares the same problems. CPython dodges these issues: Though written in C, on Windows it bypasses the broken C standard library and directly calls the proprietary interfaces. Other language implementations, such as “gc” Go, simply aren’t built on C at all, and instead do things correctly in the first place: the behaviors the C runtimes should have had all along.

If you’re just working on one large project, bypassing the C runtime isn’t such a big deal, and you’re likely already doing so to access important platform functionality. You don’t really even need a C runtime. However, if you write many small programs, as I do, writing the same special Windows support for each one ends up being most of the work, and honestly makes properly supporting Windows not worth the trouble. I end up just accepting the broken defaults most of the time.

Before diving into the details, if you’re looking for a quick-and-easy solution for the Mingw-w64 toolchain, including w64devkit, which magically makes your C and C++ console programs behave well on Windows, I’ve put together a “library” named libwinsane. It solves all problems discussed in this article, except for one. No source changes are required: simply link it into your program.

What exactly is broken?

The Windows API comes in two flavors: narrow with an “A” (“ANSI”) suffix, and wide (Unicode, UTF-16) with a “W” suffix. The former is the legacy API, where an active code page maps 256 bytes onto (up to) 256 specific characters. On typical machines configured for European languages, this means code page 1252. Roughly speaking, Windows internally uses UTF-16, and calls through the narrow interface use the active code page to translate the narrow strings to wide strings. The result is that calls through the narrow API have limited access to the system.

The UTF-8 encoding was invented in 1992 and standardized by January 1993. UTF-8 was adopted by the unix world over the following years due to its backwards-compatibility with its existing interfaces. Programs could read and write Unicode data, access Unicode paths, pass Unicode arguments, and get and set Unicode environment variables without needing to change anything. Today UTF-8 has become the dominant text encoding format in the world, in large part due to the world wide web.

In July 1993, Microsoft introduced the wide Windows API with the release of Windows NT 3.1, placing all their bets on UCS-2 (later UTF-16) rather than UTF-8. This turned out to be a mistake, since UTF-16 is inferior to UTF-8 in practically every way, though admittedly some problems weren’t so obvious at the time.

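The major problem: The C and C++ standard libraries only hook up to the narrow Windows interfaces. The standard library, and therefore typical portable software on Windows, cannot handle anything but ASCII. The effective result is that these programs cannot accept non-ASCII arguments, cannot get or set non-ASCII environment variables, cannot access non-ASCII paths, and cannot read or write non-ASCII text on a console.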

Doing any of these requires calling proprietary functions, treating Windows as a special target. It’s part of what makes correctly porting software to Windows so painful.
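To make "treating Windows as a special target" concrete, here is a minimal sketch of the kind of wrapper a portable program ends up carrying for paths alone. The name fopen_utf8 is hypothetical, and the fixed MAX_PATH buffer is a simplifying assumption that rejects longer paths.

```c
#include <stdio.h>

#ifdef _WIN32
#include <wchar.h>
#include <windows.h>

/* Hypothetical helper: open a UTF-8 path by converting it to UTF-16 and
   calling the wide variant of fopen, bypassing the narrow interface. */
FILE *fopen_utf8(const char *path, const char *mode)
{
    wchar_t wpath[MAX_PATH];  /* assumption: paths fit in MAX_PATH */
    wchar_t wmode[16];
    /* A length of -1 means the input is null-terminated, and the
       terminator is converted too. A return of 0 means failure,
       including a destination buffer that is too small. */
    if (!MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, MAX_PATH) ||
        !MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16)) {
        return NULL;
    }
    return _wfopen(wpath, wmode);
}
#else
/* Elsewhere, paths are byte strings and plain fopen already works. */
FILE *fopen_utf8(const char *path, const char *mode)
{
    return fopen(path, mode);
}
#endif
```

Arguments, environment variables, and console I/O each need a similar shim, which is where the effort adds up across many small programs.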

The sensible solution would have been for the C runtime to speak UTF-8 and connect to the wide API. Alternatively, the narrow API could have been changed over to UTF-8, phasing out the old code page concept. In theory this is what the UTF-8 “code page” is about, though it doesn’t always work. There would have been compatibility problems with abruptly making such a change, but until very recently, this wasn’t even an option. Why couldn’t there be a switch I could flip to get sane behavior that works like every other platform?

How to mostly fix Unicode support

In 2019, Microsoft introduced a feature to allow programs to request UTF-8 as their active code page on start, along with supporting UTF-8 on more narrow API functions. This is like the magic switch I wanted, except that it involves embedding some ugly XML into your binary in a particular way. At least it’s now an option.

For Mingw-w64, that means writing a resource file like so:

#include <winuser.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "utf8.xml"

Compiling it with windres:

$ windres -o manifest.o manifest.rc

Then linking that into your program. Amazingly it mostly works! Programs can access Unicode arguments, Unicode environment variables, and Unicode paths, including with fopen, just as it’s worked on other platforms for decades. Since the active code page is set at load time, it happens before argv is constructed (from GetCommandLineA), which is why that works out.

Alternatively you could create a “side-by-side assembly” by placing that XML in a file with the same name as your EXE but with a .manifest suffix (after the .exe suffix), then placing that file next to your EXE. Just be mindful that there’s a “side-by-side” cache (WinSxS), and so it might not immediately pick up your changes.
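Whichever way the manifest is attached, a program can check at run time whether it took effect by asking for the active code page. This is a small sketch; the function name is hypothetical, and returning 1 on other platforms is an assumption that they already pass bytes through untouched.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns 1 if narrow API calls made by this
   process are interpreted as UTF-8, otherwise 0. */
int narrow_api_is_utf8(void)
{
#ifdef _WIN32
    return GetACP() == CP_UTF8;  /* CP_UTF8 is code page 65001 */
#else
    return 1;  /* assumption: no code page translation to worry about */
#endif
}
```

A result of 0 on Windows suggests the manifest was not picked up, or that the system predates the feature.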

What doesn’t work is console input and output since the console is external to the process, and so isn’t covered by the process’s active code page. It must be configured separately using a proprietary call:

SetConsoleOutputCP(CP_UTF8);

Annoying, but at least it’s not that painful. This only covers output, though, meaning programs can only print UTF-8. Unfortunately UTF-8 input still doesn’t work, and setting the input code page doesn’t do anything despite reporting success:

SetConsoleCP(CP_UTF8);  // doesn't work

If you care about reading interactive Unicode input, you’re stuck bypassing the C runtime since it’s still broken.
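Bypassing the C runtime for input might look roughly like the following sketch, which reads UTF-16 from the console with ReadConsoleW and converts it to UTF-8. The helper name and buffer sizes are hypothetical, and a very long line could in principle be split in the middle of a surrogate pair, which this sketch does not handle.

```c
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: read one line of input as UTF-8 into buf, which
   holds cap bytes and is always null-terminated on success. Returns the
   number of bytes stored, or -1 on error or end of input. */
int read_line_utf8(char *buf, int cap)
{
    if (cap < 2) return -1;
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_INPUT_HANDLE);
    DWORD mode;
    if (GetConsoleMode(h, &mode)) {
        /* Standard input is a console: read UTF-16 directly. In the
           default line mode the result typically ends with CR LF. */
        wchar_t wide[1024];
        DWORD n;
        if (!ReadConsoleW(h, wide, 1024, &n, NULL) || n == 0) {
            return -1;
        }
        /* Returns 0 on failure, including when buf is too small. */
        int len = WideCharToMultiByte(CP_UTF8, 0, wide, (int)n,
                                      buf, cap - 1, NULL, NULL);
        if (len == 0) return -1;
        buf[len] = 0;
        return len;
    }
#endif
    /* A pipe, a file, or not Windows at all: bytes pass through as-is. */
    if (!fgets(buf, cap, stdin)) return -1;
    return (int)strlen(buf);
}
```

The GetConsoleMode check matters because ReadConsoleW only works on console handles; redirected input still has to go through the ordinary byte-oriented path.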

Text stream translation

Another long-standing issue is that C and C++ on Windows have distinct “text” and “binary” streams, which they inherited from DOS. Mainly this means automatic newline conversion between CRLF and LF. The C standard explicitly allows for this, though unix-like platforms have never actually distinguished between text and binary streams.
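The translation is easy to observe with nothing but standard C. In this sketch (the helper name is hypothetical), the same two characters are written through a stream opened in a given mode, then the file is counted byte by byte in binary mode.

```c
#include <stdio.h>

/* Hypothetical helper: write "a\n" to path through a stream opened with
   the given mode, then return how many bytes actually landed in the
   file, or -1 on error. The file is removed afterwards. */
long bytes_written(const char *path, const char *mode)
{
    FILE *f = fopen(path, mode);
    if (!f) return -1;
    fputs("a\n", f);
    if (fclose(f)) return -1;

    f = fopen(path, "rb");  /* binary: count bytes exactly as stored */
    if (!f) return -1;
    long n = 0;
    while (fgetc(f) != EOF) n++;
    fclose(f);
    remove(path);
    return n;
}
```

With mode "wb" the count is 2 everywhere. With mode "w" it is still 2 on unix-likes, but a Windows C runtime stores 3 bytes because the newline is written as CRLF.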

The standard also specifies that standard input, output, and error are all open as text streams, and there’s no portable method to change the stream mode to binary — a serious deficiency with the standard. On unix-likes this doesn’t matter, but on Windows it means programs can’t read or write binary data on standard streams without calling a non-standard function. It also means reading and writing standard streams is slow, frequently a bottleneck unless I route around it.

Personally, I like writing binary data to standard output, including video, and sometimes binary filters that also read binary input. I do it so often that in probably half my C programs I have this snippet in main just so they work correctly on Windows:

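    #ifdef _WIN32
    int _setmode(int, int);  /* declared by hand to avoid including <io.h> */
    _setmode(0, 0x8000);     /* stdin:  0x8000 is _O_BINARY */
    _setmode(1, 0x8000);     /* stdout: 0x8000 is _O_BINARY */
    #endif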

That incantation sets standard input and output in the C runtime to binary mode without the need to include a header, making it compact, simple, and self-contained.
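For comparison, a sketch of the same thing spelled out with the headers and named constants might look like this. The function name is hypothetical; the calls themselves (_setmode, _fileno, _O_BINARY) come from the Windows C runtime.

```c
#include <stdio.h>

#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

/* Hypothetical helper: switch standard input and output to binary mode.
   Returns 0 on success, -1 on failure. A no-op on other platforms. */
int stdio_set_binary(void)
{
#ifdef _WIN32
    /* _setmode returns the previous mode, or -1 on failure. */
    if (_setmode(_fileno(stdin), _O_BINARY) == -1) return -1;
    if (_setmode(_fileno(stdout), _O_BINARY) == -1) return -1;
#endif
    return 0;
}
```

This is longer and drags in two non-standard headers, which is the trade-off the compact version avoids.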

This built-in newline translation, along with the Windows standard text editor, Notepad, lagging decades behind, meant that many other programs, including Git, grew their own, annoying, newline conversion misfeatures that cause other problems.

libwinsane

I introduced libwinsane at the beginning of the article, which fixes all this simply by being linked into a program. It includes the magic XML manifest .rsrc section, configures the console for UTF-8 output, and sets standard streams to binary before main (via a GCC constructor). I called it a “library”, but it’s actually a single object file. It can’t be a static library because a linker only pulls in archive members that resolve a reference, and nothing in the program references libwinsane. An object file named on the command line is always linked in.
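The constructor part of that idea can be sketched as follows. This is an illustration of the technique, not the actual libwinsane source: the function and flag names are hypothetical, setting standard error to binary as well is an assumption, and the manifest resource is a separate object that is not shown.

```c
#include <stdio.h>

#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#include <windows.h>
#endif

/* Hypothetical flag, present only so the effect can be observed. */
static int winsane_ran;

/* GCC and Clang run constructor functions before main, so a program
   gets this setup just by having the object file linked in. */
__attribute__((constructor))
static void winsane_init(void)
{
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);  /* console output is decoded as UTF-8 */
    _setmode(0, _O_BINARY);       /* stdin */
    _setmode(1, _O_BINARY);       /* stdout */
    _setmode(2, _O_BINARY);       /* stderr (assumption) */
#endif
    winsane_ran = 1;
}
```

Because nothing calls winsane_init by name, the program's own source stays untouched, which is the whole appeal.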

So normally this program:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char *arg = argv[argc-1];
    size_t len = strlen(arg);
    printf("%zu %s\n", len, arg);
}

Compiled and run:

C:\>cc -o example example.c
C:\>example π
1 p

As usual, the Unicode argument is silently mangled into one byte. Linked with libwinsane, it just works like everywhere else:

C:\>gcc -o example example.c libwinsane.o
C:\>example π
2 π
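The length of 2 is expected: strlen counts bytes, and π (U+03C0) encodes in UTF-8 as the two bytes 0xCF 0x80. A small sketch of the difference between bytes and code points, using a hypothetical counting helper that assumes valid UTF-8 input:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a valid UTF-8 string by
   skipping continuation bytes, which all have the bit pattern 10xxxxxx. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        n += ((unsigned char)*s & 0xc0) != 0x80;
    }
    return n;
}
```

For the string "\xcf\x80", strlen reports 2 while utf8_count reports 1.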

If you’re maintaining a substantial program, you probably want to copy and integrate the necessary parts of libwinsane into your project and build, rather than always link against this loose object file. This is more for convenience and for succinctly capturing the concept. You may even want to enable ANSI escape processing in your version.
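Enabling ANSI escape processing could be sketched like this. The helper name is hypothetical, the constant is defined by hand in case older headers lack it, and returning 1 on other platforms is an assumption that the terminal already interprets escapes.

```c
#ifdef _WIN32
#include <windows.h>
#ifndef ENABLE_VIRTUAL_TERMINAL_PROCESSING
#define ENABLE_VIRTUAL_TERMINAL_PROCESSING 0x0004
#endif
#endif

/* Hypothetical helper: returns 1 if escape sequences written to standard
   output should be interpreted, 0 if not (for example when output is
   redirected, or the console is too old to support the mode). */
int enable_ansi_escapes(void)
{
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;
    if (!GetConsoleMode(h, &mode)) return 0;  /* not a console */
    return SetConsoleMode(h, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING) != 0;
#else
    return 1;  /* assumption: the terminal handles escapes already */
#endif
}
```

Like the output code page, this is a property of the console rather than the process, so a careful program may want to restore the previous mode on exit.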

null program

Chris Wellons
