The wild west of Windows command line parsing

I’ve been experimenting again lately with writing software without a runtime aside from the operating system itself, both on Linux and Windows. Another way to look at it: I write and embed a bespoke, minimal runtime within the application. One of the runtime’s core jobs is retrieving command line arguments from the operating system. On Windows this is a deeper rabbit hole than I expected, and far more complex than I realized. There is no standard, and every runtime does it a little differently. Five different applications may see five different sets of arguments — even different argument counts — from the same input, and this is before any sort of option parsing. It’s truly a modern day Tower of Babel: “Confound their command line parsing, that they may not understand one another’s arguments.”

Unix-like systems pass the argv array directly from parent to child. On Linux it’s literally copied onto the child’s stack just above the stack pointer on entry. The runtime just bumps the stack pointer address a few bytes and calls it argv. Here’s a minimalist x86-64 Linux runtime in just 6 instructions (22 bytes):

_start: mov   edi, [rsp]     ; argc
        lea   rsi, [rsp+8]   ; argv
        call  main
        mov   edi, eax
        mov   eax, 60        ; SYS_exit
        syscall

It’s 5 instructions (20 bytes) on ARM64:

_start: ldr  w0, [sp]        ; argc
        add  x1, sp, 8       ; argv
        bl   main
        mov  w8, 93          ; SYS_exit
        svc  0

On Windows, argv is passed in serialized form as a string. That’s how MS-DOS did it (via the Program Segment Prefix), because that’s how CP/M did it. It made more sense when processes were mostly launched directly by humans: The string was literally typed by a human operator, and somebody has to parse it after all. Today, processes are nearly always launched by other programs, but despite this, must still serialize the argument array into a string as though a human had typed it out.

Windows itself provides an operating system routine for parsing command line strings: CommandLineToArgvW. Fetch the command line string with GetCommandLineW, pass it to this function, and you have your argc and argv. Plus maybe LocalFree to clean up. It’s only available in “wide” form, so if you want to work in UTF-8 you’ll also need WideCharToMultiByte. It’s around 20 lines of C rather than 6 lines of assembly, but it’s not too bad.

My GetCommandLineW

GetCommandLineW returns a pointer into static storage, which is why it doesn’t need to be freed. More specifically, it comes from the Process Environment Block. This got me thinking: Could I locate this address myself without the API call? First I needed to find the PEB. After some research I found a PEB pointer in the Thread Information Block, itself found via the gs register (x64, fs on x86), an old 386 segment register. Buried in the PEB is a UNICODE_STRING, with the command line string address. I worked out all the offsets for both x86 and x64, and the whole thing is just three instructions:

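wchar_t *cmdline_fetch(void)
{
    void *cmd = 0;
    #if __amd64
    __asm ("mov %%gs:(0x60), %0\n"   // TIB -> PEB
           "mov 0x20(%0), %0\n"      // PEB -> ProcessParameters
           "mov 0x78(%0), %0\n"      // ProcessParameters -> CommandLine.Buffer
           : "=r"(cmd));
    #elif __i386
    __asm ("mov %%fs:(0x30), %0\n"   // TIB -> PEB
           "mov 0x10(%0), %0\n"      // PEB -> ProcessParameters
           "mov 0x44(%0), %0\n"      // ProcessParameters -> CommandLine.Buffer
           : "=r"(cmd));
    #endif
    return cmd;
}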

From Windows XP through Windows 11, this returns exactly the same address as GetCommandLineW. There’s little reason to do it this way other than to annoy Raymond Chen, but it’s still neat and maybe has some super niche use. Technically some of these offsets are undocumented and/or subject to change, except Microsoft’s own static link CRT also hardcodes all these offsets. It’s easy to find: disassemble any statically linked program, look for the gs register, and you’ll find it using these offsets, too.

If you look carefully at the UNICODE_STRING you’ll see the length is given by a USHORT in units of bytes, despite being a 16-bit wchar_t string. This is the source of Windows’ maximum command line length of 32,767 characters (including terminator).
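The arithmetic behind that limit can be sketched in a few lines. The struct below only mirrors the documented UNICODE_STRING field order using portable fixed-width types; the lowercase names and the helper `max_code_units` are made up for this illustration so that it compiles anywhere, not just on Windows.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the documented UNICODE_STRING layout.
 * Names and types here are illustrative, not the Windows headers. */
struct unicode_string {
    uint16_t length;          /* bytes currently stored in buffer */
    uint16_t maximum_length;  /* bytes allocated for buffer */
    uint16_t *buffer;         /* 16-bit code units */
};

static_assert(sizeof(uint16_t) == 2, "code units are two bytes");

/* Largest buffer, in 16-bit code units, that a 16-bit byte count can describe. */
int max_code_units(void)
{
    return UINT16_MAX / (int)sizeof(uint16_t);   /* 65535 / 2, rounded down */
}
```

Because the byte count is rounded down to whole code units, the result is 32,767, and one of those units is the terminator.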

GetCommandLineW is from kernel32.dll, but CommandLineToArgvW is a bit more off the beaten path in shell32.dll. If you wanted to avoid linking to shell32.dll for important reasons, you’d need to do the command line parsing yourself. Many runtimes, including Microsoft’s own CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s messier than I expected, and when I started digging into it I wasn’t expecting it to involve a few days of research.

The CommandLineToArgvW documentation has a rough explanation: split arguments on whitespace (not defined), quoting is involved, and there’s something about counting backslashes, but only if they stop on a quote. It’s not quite enough to implement your own, and if you test against it, it’s quickly apparent that this documentation is at best incomplete. It links to a deprecated page about parsing C++ command line arguments with a few more details. Unfortunately the algorithm described on this page is not the algorithm used by CommandLineToArgvW, nor is it used by any runtime I could find. It even varies between Microsoft’s own CRTs. There is no canonical command line parsing result, not even a de facto standard.
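To make the backslash counting concrete, here is a minimal sketch of only the rough rules as usually described: space and tab separate arguments, 2n backslashes before a quote yield n backslashes and the quote toggles quoting, and 2n+1 backslashes before a quote yield n backslashes plus a literal quote. The function name `rough_split` is hypothetical, it works on plain char strings, and it deliberately ignores the special handling of argv[0] and the doubled-quote cases where real implementations diverge. It is not a reimplementation of any actual runtime.

```c
#include <assert.h>
#include <string.h>

/* Split cmd into NUL-terminated arguments stored in buf, which needs
 * strlen(cmd)+1 bytes. argv needs strlen(cmd)/2+2 slots. Returns argc. */
int rough_split(const char *cmd, char *buf, char **argv)
{
    assert(cmd && buf && argv);
    int argc = 0;
    for (const char *p = cmd;;) {
        p += strspn(p, " \t");                /* skip separators */
        if (!*p) break;
        argv[argc++] = buf;
        int quoted = 0;
        while (*p && (quoted || (*p != ' ' && *p != '\t'))) {
            if (*p == '\\') {
                size_t n = strspn(p, "\\");   /* length of backslash run */
                p += n;
                if (*p == '"') {
                    memset(buf, '\\', n/2);   /* halve the run */
                    buf += n/2;
                    if (n % 2) {              /* odd: the quote is literal */
                        *buf++ = '"';
                        p++;
                    }                         /* even: quote toggles on next pass */
                } else {
                    memset(buf, '\\', n);     /* no quote follows: all literal */
                    buf += n;
                }
            } else if (*p == '"') {
                quoted = !quoted;             /* bare quote toggles quoting */
                p++;
            } else {
                *buf++ = *p++;
            }
        }
        *buf++ = 0;
    }
    argv[argc] = 0;
    return argc;
}
```

Testing a sketch like this against a real parser is a quick way to see where the rough description stops being enough.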

I eventually came across David Deley’s How Command Line Parameters Are Parsed, which is the closest there is to an authoritative document on the matter. Unfortunately it focuses on runtimes rather than CommandLineToArgvW, and so some of those details aren’t captured. In particular, the first argument (i.e. argv[0]) follows entirely different rules, which really confused me for a while. The Wine documentation was helpful particularly for CommandLineToArgvW. As far as I can tell, they’ve re-implemented it perfectly, matching it bug-for-bug as they do.

My CommandLineToArgvW

Before finding any of this, I started building my own implementation, which I now believe matches CommandLineToArgvW. These other documents helped me figure out what I was missing. In my usual fashion, it’s a little state machine: cmdline.c. The interface:

int cmdline_to_argv8(const wchar_t *cmdline, char **argv);

Unlike the others, mine encodes straight into WTF-8, a superset of UTF-8 that can round-trip ill-formed UTF-16. The WTF-8 part is negative lines of code: invisible since it involves not reacting to ill-formed input. If you use the new-ish UTF-8 manifest Win32 feature then your program cannot handle command line strings with ill-formed UTF-16, a problem solved by WTF-8.
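As a rough illustration of why WTF-8 costs "negative lines of code", here is a minimal UTF-16 to WTF-8 encoder sketch. The name `wtf8_encode` and its interface are assumptions for this example and are not taken from cmdline.c. Valid surrogate pairs are combined into one 4-byte sequence, and everything else, including an unpaired surrogate, falls through the ordinary 1 to 3 byte rules. A strict UTF-8 encoder would need an extra check to reject the unpaired case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode n UTF-16 code units into dst, which needs up to 3*n bytes.
 * Returns the number of bytes written. */
size_t wtf8_encode(const uint16_t *src, size_t n, unsigned char *dst)
{
    assert(dst);
    unsigned char *out = dst;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        /* Combine a high surrogate with a following low surrogate. */
        if (c >= 0xd800 && c <= 0xdbff && i + 1 < n &&
            src[i+1] >= 0xdc00 && src[i+1] <= 0xdfff) {
            c = 0x10000 + ((c - 0xd800) << 10) + (src[++i] - 0xdc00u);
        }
        if (c < 0x80) {
            *out++ = (unsigned char)c;
        } else if (c < 0x800) {
            *out++ = (unsigned char)(0xc0 | (c >> 6));
            *out++ = (unsigned char)(0x80 | (c & 0x3f));
        } else if (c < 0x10000) {
            /* Unpaired surrogates land here with no special case. */
            *out++ = (unsigned char)(0xe0 | (c >> 12));
            *out++ = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            *out++ = (unsigned char)(0x80 | (c & 0x3f));
        } else {
            *out++ = (unsigned char)(0xf0 | (c >> 18));
            *out++ = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
            *out++ = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            *out++ = (unsigned char)(0x80 | (c & 0x3f));
        }
    }
    return (size_t)(out - dst);
}
```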

As documented, that argv must be a particular size: a pointer-aligned, 224kB (x64) or 160kB (x86) buffer, which covers the absolute worst case. That’s not too bad when the command line is limited to 32,766 UTF-16 characters. The worst case argument is a single long sequence of 3-byte UTF-8. 4-byte UTF-8 requires 2 UTF-16 code units (a surrogate pair), so there would only be half as many. The worst case argc is 16,383 (plus one more argv slot for the null pointer terminator), which is one argument for each pair of command line characters. The second half (roughly) of the argv is actually used as a char buffer for the arguments, so it’s all a single, fixed allocation. There is no error case since it cannot fail.

int mainCRTStartup(void)
{
    static char *argv[CMDLINE_ARGV_MAX];
    int argc = cmdline_to_argv8(cmdline_fetch(), argv);
    return main(argc, argv);
}
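The buffer sizes quoted earlier can be sanity-checked with a little arithmetic. This is only one conservative way to arrive at figures consistent with 224kB and 160kB, not necessarily how cmdline.c derives them: it adds the worst-case pointer array to the worst-case text, even though both cannot occur at once. `MAX_UNITS` and `worst_case_bytes` are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UNITS 32766   /* command line code units, excluding terminator */

/* Upper bound on bytes needed for the pointer slots plus argument text. */
size_t worst_case_bytes(size_t ptr_size)
{
    assert(ptr_size == 4 || ptr_size == 8);
    size_t slots = MAX_UNITS / 2 + 1;   /* 16,383 arguments + null slot */
    size_t text  = MAX_UNITS * 3 + 1;   /* all 3-byte sequences + one NUL */
    return slots * ptr_size + text;
}
```

With 8-byte pointers this comes to 229,371 bytes, just under 224 × 1024, and with 4-byte pointers it is 163,835 bytes, just under 160 × 1024.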

Also: Note the FUZZ option in my source. It has been pretty thoroughly fuzz tested. The fuzzer didn’t find anything, but it does make me more confident in the result.

I also peeked at some language runtimes to see how others handle it. Just as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft CRT. Also expected, CPython implicitly does whatever the underlying C runtime does, so its exact command line behavior depends on which version of Visual Studio was used to build the Python binary. OpenJDK pragmatically calls CommandLineToArgvW. Go (gc) does its own parsing, with behavior mixed between CommandLineToArgvW and some of Microsoft’s CRTs, but not quite matching either.

Building a command line string

I’ve always been boggled as to why there’s no complementary inverse to CommandLineToArgvW. When spawning processes with arbitrary arguments, everyone is left to implement the inverse of this under-specified and non-trivial command line format to serialize an argv. Hopefully the receiver parses it compatibly! There’s no falling back on a system routine to help out. This has led to a lot of repeated effort: it’s not limited to high level runtimes, but almost any extensible application (itself a kind of runtime). Fortunately serializing is not quite as complex as parsing since many of the edge cases simply don’t come up if done in a straightforward way.

Naturally, I also wrote my own implementation (same source):

int cmdline_from_argv8(wchar_t *cmdline, int len, char **argv);

Like before, it accepts a WTF-8 argv, meaning it can correctly pass through ill-formed UTF-16 arguments. It returns the actual command line length. Since this one can fail when argv is too large, it returns zero for an error.

char *argv[] = {"python.exe", "-c", code, 0};
wchar_t cmd[CMDLINE_CMD_MAX];
if (!cmdline_from_argv8(cmd, CMDLINE_CMD_MAX, argv)) {
    return "argv too large";
}
if (!CreateProcessW(0, cmd, /*...*/)) {
    return "CreateProcessW failed";
}
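For a sense of what the "straightforward way" looks like, here is a minimal sketch of quoting a single argument using the widely used backslash-doubling approach. It is not the code behind cmdline_from_argv8: the name `quote_arg` is hypothetical, it handles one argument at a time, and it works on plain char strings rather than converting WTF-8 to UTF-16. Arguments without spaces, tabs, or quotes pass through untouched. Otherwise the argument is wrapped in quotes, backslashes before a quote are doubled and the quote escaped, and trailing backslashes are doubled so they don’t swallow the closing quote.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write the quoted form of arg to dst as a NUL-terminated string.
 * dst needs 2*strlen(arg)+3 bytes. Returns a pointer to the NUL. */
char *quote_arg(char *dst, const char *arg)
{
    assert(dst && arg);
    if (*arg && !strpbrk(arg, " \t\"")) {
        size_t len = strlen(arg);           /* nothing special: copy as-is */
        memcpy(dst, arg, len + 1);
        return dst + len;
    }
    *dst++ = '"';
    for (;;) {
        size_t n = strspn(arg, "\\");       /* run of backslashes */
        arg += n;
        if (!*arg) {
            memset(dst, '\\', 2*n);         /* run precedes the closing quote */
            dst += 2*n;
            break;
        } else if (*arg == '"') {
            memset(dst, '\\', 2*n + 1);     /* double the run, escape the quote */
            dst += 2*n + 1;
        } else {
            memset(dst, '\\', n);           /* run is literal as-is */
            dst += n;
        }
        *dst++ = *arg++;
    }
    *dst++ = '"';
    *dst = 0;
    return dst;
}
```

Joining the quoted arguments with single spaces then produces a command line that the common parsers split back into the original argv, with the usual caveat that argv[0] follows its own rules.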

How do others handle this?

I don’t plan to write a language implementation anytime soon, where this might be needed, but it’s nice to know I’ve already solved this problem for myself!


null program

Chris Wellons
