nullprogram.com/blog/2022/02/18/
I’ve been experimenting again lately with writing software without a
runtime aside from the operating system itself, both on Linux and
Windows. Another way to look at it: I write and embed a bespoke, minimal
runtime within the application. One of the runtime’s core jobs is
retrieving command line arguments from the operating system. On Windows
this is a deeper rabbit hole than I expected, and far more complex than I
realized. There is no standard, and every runtime does it a little
differently. Five different applications may see five different sets of
arguments — even different argument counts — from the same input, and this
is before any sort of option parsing. It’s truly a modern day Tower of
Babel: “Confound their command line parsing, that they may not understand
one another’s arguments.”
Unix-like systems pass the argv array directly from parent to child. On
Linux it’s literally copied onto the child’s stack just above the stack
pointer on entry. The runtime just bumps the stack pointer address a few
bytes and calls it argv. Here’s a minimalist x86-64 Linux runtime in just
6 instructions (22 bytes):
_start: mov   edi, [rsp]    ; argc
        lea   rsi, [rsp+8]  ; argv
        call  main
        mov   edi, eax
        mov   eax, 60       ; SYS_exit
        syscall
It’s 5 instructions (20 bytes) on ARM64:
_start: ldr  w0, [sp]     ; argc
        add  x1, sp, 8    ; argv
        bl   main
        mov  w8, 93       ; SYS_exit
        svc  0
On Windows, argv is passed in serialized form as a string. That’s how
MS-DOS did it (via the Program Segment Prefix), because that’s how
CP/M did it. It made more sense when processes were mostly launched
directly by humans: The string was literally typed by a human operator,
and somebody has to parse it after all. Today, processes are nearly
always launched by other programs, but despite this, must still serialize
the argument array into a string as though a human had typed it out.
Windows itself provides an operating system routine for parsing command
line strings: CommandLineToArgvW. Fetch the command line string
with GetCommandLineW, pass it to this function, and you have your argc
and argv, plus maybe LocalFree to clean up. It’s only available in “wide”
form, so if you want to work in UTF-8 you’ll also need
WideCharToMultiByte. It’s around 20 lines of C rather than 6 lines of
assembly, but it’s not too bad.
My GetCommandLineW
GetCommandLineW returns a pointer into static storage, which is why it
doesn’t need to be freed. More specifically, it comes from the Process
Environment Block. This got me thinking: Could I locate this address
myself without the API call? First I needed to find the PEB. After some
research I found a PEB pointer in the Thread Information Block, itself
found via the gs register (x64; fs on x86), an old 386 segment register.
Buried in the PEB is a UNICODE_STRING with the command line string
address. I worked out all the offsets for both x86 and x64, and the whole
thing is just three instructions:
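wchar_t *cmdline_fetch(void)
{
    void *cmd = 0;
#if __amd64
    __asm ("mov %%gs:(0x60), %0\n"   /* TEB -> PEB */
           "mov 0x20(%0), %0\n"      /* PEB -> ProcessParameters */
           "mov 0x78(%0), %0\n"      /* CommandLine.Buffer */
           : "=r"(cmd));
#elif __i386
    __asm ("mov %%fs:(0x30), %0\n"   /* TEB -> PEB */
           "mov 0x10(%0), %0\n"      /* PEB -> ProcessParameters */
           "mov 0x44(%0), %0\n"      /* CommandLine.Buffer */
           : "=r"(cmd));
#endif
    return cmd;
}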
From Windows XP through Windows 11, this returns exactly the same address
as GetCommandLineW. There’s little reason to do it this way other than to
annoy Raymond Chen, but it’s still neat and maybe has some super niche
use. Technically some of these offsets are undocumented and/or subject to
change, except Microsoft’s own statically linked CRT also hardcodes all
these offsets. It’s easy to find: disassemble any statically linked
program, look for the gs register, and you’ll find it using these
offsets, too.
If you look carefully at the UNICODE_STRING you’ll see the length is
given by a USHORT in units of bytes, despite being a 16-bit wchar_t
string. This is the source of Windows’ maximum command line length
of 32,767 characters (including terminator).
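The arithmetic behind that limit can be checked with a few lines of portable C. This is only a sketch: the struct below is a hypothetical stand-in that mirrors the shape of UNICODE_STRING (two 16-bit byte counts plus a pointer), and the function name is mine, not part of any Windows header.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in with the same shape as UNICODE_STRING. Both
 * counts are in bytes, even though the buffer holds 16-bit units. */
struct unicode_string {
    uint16_t length;          /* bytes in use */
    uint16_t maximum_length;  /* bytes of storage */
    uint16_t *buffer;
};

/* Returns nonzero if a string of `units` 16-bit code units, plus its
 * null terminator, has a byte size that fits in a 16-bit counter. */
int fits_in_unicode_string(size_t units)
{
    assert(units < SIZE_MAX / 2);        /* keep the multiply from wrapping */
    size_t bytes = (units + 1) * sizeof(uint16_t);
    return bytes <= UINT16_MAX;
}
```

With these definitions, 32,766 units plus the terminator need 65,534 bytes and fit, while one more unit would need 65,536 bytes and does not.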
GetCommandLineW is from kernel32.dll, but CommandLineToArgvW is a bit
more off the beaten path in shell32.dll. If you wanted to avoid linking
to shell32.dll for important reasons, you’d need to do the
command line parsing yourself. Many runtimes, including Microsoft’s own
CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s
messier than I expected, and when I started digging into it I wasn’t
expecting it to involve a few days of research.
The CommandLineToArgvW documentation has a rough explanation: split
arguments on whitespace (not defined), quoting is involved, and there’s
something about counting backslashes, but only if they stop on a quote.
It’s not quite enough to implement your own, and if you test against it,
it’s quickly apparent that this documentation is at best incomplete. It
links to a deprecated page about parsing C++ command line arguments with a
few more details. Unfortunately the algorithm described on this page is
not the algorithm used by CommandLineToArgvW, nor is it used by any
runtime I could find. It even varies between Microsoft’s own CRTs. There
is no canonical command line parsing result, not even a de facto standard.
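To make the "counting backslashes" rule concrete, here is a minimal sketch of the rule as it is commonly summarized: a run of 2n backslashes before a quote yields n backslashes and the quote toggles quoting, while 2n+1 backslashes before a quote yield n backslashes and a literal quote. The function name and interface are hypothetical, it works on plain char rather than wchar_t, it treats only space and tab as whitespace, and it deliberately ignores the special cases (such as the first argument) where real implementations diverge. It is an illustration of the rule, not a match for CommandLineToArgvW or any particular CRT.

```c
#include <assert.h>
#include <string.h>

/* Split cmd into arguments using the simplified rules described above.
 * buf needs at least strlen(cmd)+1 bytes and argv needs max+1 slots.
 * Returns argc and null-terminates argv. */
int split_args(const char *cmd, char *buf, char **argv, int max)
{
    assert(cmd && buf && argv && max >= 0);
    int argc = 0;
    const char *p = cmd;
    for (;;) {
        p += strspn(p, " \t");               /* skip separators */
        if (!*p || argc == max) break;
        argv[argc++] = buf;
        int quoted = 0;
        while (*p && (quoted || (*p != ' ' && *p != '\t'))) {
            if (*p == '\\') {
                size_t n = strspn(p, "\\");  /* length of backslash run */
                p += n;
                if (*p == '"') {
                    /* Run stops on a quote: emit half the backslashes. */
                    memset(buf, '\\', n/2);
                    buf += n/2;
                    if (n % 2) {             /* odd: the quote is literal */
                        *buf++ = '"';
                        p++;
                    }                        /* even: quote toggles below */
                } else {
                    /* Run does not stop on a quote: copy it unchanged. */
                    memset(buf, '\\', n);
                    buf += n;
                }
            } else if (*p == '"') {
                quoted = !quoted;            /* toggle, do not copy */
                p++;
            } else {
                *buf++ = *p++;
            }
        }
        *buf++ = 0;
    }
    argv[argc] = 0;
    return argc;
}
```

Backslashes only matter when a run ends at a quote, which is why a path like c:\dir\file passes through untouched while a quoted directory ending in a backslash needs that backslash doubled.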
I eventually came across David Deley’s How Command Line Parameters Are
Parsed, which is the closest there is to an authoritative document on
the matter. Unfortunately it focuses on runtimes rather than
CommandLineToArgvW, and so some of those details aren’t captured. In
particular, the first argument (i.e. argv[0]) follows entirely different
rules, which really confused me for a while. The Wine documentation
was helpful particularly for CommandLineToArgvW. As far as I can tell,
they’ve re-implemented it perfectly, matching it bug-for-bug as they do.
My CommandLineToArgvW
Before finding any of this, I started building my own implementation,
which I now believe matches CommandLineToArgvW. These other documents
helped me figure out what I was missing. In my usual fashion, it’s a
little state machine: cmdline.c. The interface:
int cmdline_to_argv8(const wchar_t *cmdline, char **argv);
Unlike the others, mine encodes straight into WTF-8, a superset of
UTF-8 that can round-trip ill-formed UTF-16. The WTF-8 part is negative
lines of code: invisible since it involves not reacting to ill-formed
input. If you use the new-ish UTF-8 manifest Win32 feature then your
program cannot handle command line strings with ill-formed UTF-16, a
problem solved by WTF-8.
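To show what "not reacting to ill-formed input" can look like, here is a minimal sketch of a UTF-16 to WTF-8 encoder. The function name and interface are hypothetical and this is not the code from cmdline.c. The only difference from a strict UTF-8 encoder is the missing error path: an unpaired surrogate simply falls through to the ordinary 3-byte case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode n UTF-16 code units as WTF-8. dst needs room for 3*n bytes in
 * the worst case. Returns the number of bytes written. */
size_t wtf8_encode(unsigned char *dst, const uint16_t *src, size_t n)
{
    assert(dst && (src || !n));
    unsigned char *out = dst;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        /* Combine a well-formed surrogate pair into one code point. */
        if (c >= 0xd800 && c <= 0xdbff && i+1 < n &&
            src[i+1] >= 0xdc00 && src[i+1] <= 0xdfff) {
            c = 0x10000 + ((c - 0xd800) << 10) + (src[++i] - 0xdc00u);
        }
        /* An unpaired surrogate gets no special handling: it lands in
         * the 3-byte branch like any other 16-bit value. */
        if (c < 0x80) {
            *out++ = (unsigned char)c;
        } else if (c < 0x800) {
            *out++ = (unsigned char)(0xc0 | (c >> 6));
            *out++ = (unsigned char)(0x80 | (c & 0x3f));
        } else if (c < 0x10000) {
            *out++ = (unsigned char)(0xe0 | (c >> 12));
            *out++ = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            *out++ = (unsigned char)(0x80 | (c & 0x3f));
        } else {
            *out++ = (unsigned char)(0xf0 | (c >> 18));
            *out++ = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
            *out++ = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            *out++ = (unsigned char)(0x80 | (c & 0x3f));
        }
    }
    return (size_t)(out - dst);
}
```

A lone 0xD800, for example, comes out as the bytes ED A0 80, which a strict UTF-8 decoder would reject but which lets the original code unit be recovered exactly.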
As documented, that argv must be a particular size: a pointer-aligned
buffer of 224kB on x64 or 160kB on x86, which covers the absolute worst
case. That’s not too bad when the command line is limited to 32,766 UTF-16
characters. The worst case argument is a single long sequence of 3-byte
UTF-8. 4-byte UTF-8 requires 2 UTF-16 code units, so there would only be
half as many. The worst case argc is 16,383 (plus one more argv slot
for the null pointer terminator), which is one argument for each pair of
command line characters. The second half (roughly) of the argv is
actually used as a char buffer for the arguments, so it’s all a single,
fixed allocation. There is no error case since it cannot fail.
int mainCRTStartup(void)
{
    static char *argv[CMDLINE_ARGV_MAX];
    int argc = cmdline_to_argv8(cmdline_fetch(), argv);
    return main(argc, argv);
}
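The buffer sizes quoted above can be reproduced with some back-of-the-envelope arithmetic. The sketch below assumes the bound is obtained by simply adding the two worst cases together (the most pointer slots plus the most encoded bytes), which is my reading of the numbers rather than something taken from cmdline.c. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    CMD_MAX_UNITS = 32766,              /* UTF-16 units, no terminator */
    WORST_ARGC    = CMD_MAX_UNITS / 2   /* one argument per two units */
};

/* Conservative upper bound, in bytes, for an argv array that also
 * stores the encoded argument text, given the target's pointer size. */
size_t argv_worst_case_bytes(size_t pointer_size)
{
    assert(pointer_size > 0);
    size_t slots = (size_t)WORST_ARGC + 1;           /* plus null slot */
    size_t text  = 3 * (size_t)CMD_MAX_UNITS + 1;    /* 3-byte UTF-8 + NUL */
    return slots * pointer_size + text;
}
```

For 8-byte pointers this gives 229,371 bytes and for 4-byte pointers 163,835 bytes, which round up to 224 and 160 when expressed in units of 1,024 bytes.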
Also: Note the FUZZ option in my source. It has been pretty thoroughly
fuzz tested. It didn’t find anything, but it does make me more
confident in the result.
I also peeked at some language runtimes to see how others handle it. Just
as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft
CRT. Also expected, CPython implicitly does whatever the underlying C
runtime does, so its exact command line behavior depends on which version
of Visual Studio was used to build the Python binary. OpenJDK
pragmatically calls CommandLineToArgvW. Go (gc) does its own
parsing, with behavior mixed between CommandLineToArgvW and some of
Microsoft’s CRTs, but not quite matching either.
Building a command line string
I’ve always been boggled as to why there’s no complementary inverse to
CommandLineToArgvW. When spawning processes with arbitrary arguments,
everyone is left to implement the inverse of this under-specified and
non-trivial command line format to serialize an argv. Hopefully the
receiver parses it compatibly! There’s no falling back on a system routine
to help out. This has led to a lot of repeated effort: it’s not limited
to high level runtimes, but almost any extensible application (itself a
kind of runtime). Fortunately serializing is not quite as complex as
parsing since many of the edge cases simply don’t come up if done in a
straightforward way.
Naturally, I also wrote my own implementation (same source):
int cmdline_from_argv8(wchar_t *cmdline, int len, char **argv);
Like before, it accepts a WTF-8 argv, meaning it can correctly pass
through ill-formed UTF-16 arguments. It returns the actual command line
length. Since this one can fail when argv is too large, it returns
zero for an error.
char *argv[] = {"python.exe", "-c", code, 0};
wchar_t cmd[CMDLINE_CMD_MAX];
if (!cmdline_from_argv8(cmd, CMDLINE_CMD_MAX, argv)) {
    return "argv too large";
}
if (!CreateProcessW(0, cmd, /*...*/)) {
    return "CreateProcessW failed";
}
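For a sense of what the serializing direction involves, here is a minimal sketch of quoting a single argument. It assumes the receiver follows the common backslash-and-quote rules (backslashes are only special in front of a quote), uses plain char rather than wchar_t, and does nothing about cmd.exe metacharacters. The function name is hypothetical and this is not the code behind cmdline_from_argv8.

```c
#include <assert.h>
#include <string.h>

/* Quote one argument so that a parser following the common rules reads
 * it back unchanged. dst needs room for 2*strlen(arg)+3 bytes.
 * Returns the length written, not counting the null terminator. */
size_t quote_arg(char *dst, const char *arg)
{
    assert(dst && arg);
    /* Non-empty arguments without whitespace or quotes pass as-is. */
    if (*arg && !strpbrk(arg, " \t\n\v\"")) {
        size_t n = strlen(arg);
        memcpy(dst, arg, n + 1);
        return n;
    }
    char *out = dst;
    *out++ = '"';
    for (const char *p = arg;; p++) {
        size_t n = strspn(p, "\\");   /* length of backslash run */
        p += n;
        if (!*p) {
            /* Run precedes the closing quote: double it. */
            memset(out, '\\', 2*n);
            out += 2*n;
            break;
        }
        if (*p == '"') {
            n = 2*n + 1;              /* double the run, escape the quote */
        }
        memset(out, '\\', n);
        out += n;
        *out++ = *p;
    }
    *out++ = '"';
    *out = 0;
    return (size_t)(out - dst);
}
```

Joining the quoted arguments with single spaces then gives a command line string. Whether the receiving program parses it the same way is, as noted, up to its runtime.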
How do others handle this?
I don’t plan to write a language implementation anytime soon, where this
might be needed, but it’s nice to know I’ve already solved this problem
for myself!