nullprogram.com/blog/2025/02/17/
xxd is a versatile hexdump utility with a “reverse” feature, originally
written between 1990–1996. The Vim project soon adopted it, and it’s lived
there ever since. If you have Vim, you also have xxd. Its primary use
cases are (1) the basis for a hex editor due to its -r
reverse option
that can unhexdump its previous output, and (2) a data embedding tool
for C and C++ (-i
). The former provides Vim’s rudimentary hex editor
functionality. The second case is of special interest to w64devkit:
xxd -i
appears in many builds that embed arbitrary data. It’s
important that w64devkit has a compatible implementation, and a freshly
rewritten, improved xxd, rexxd, now replaces the original xxd (as
xxd
).
For those unfamiliar with xxd, examples are in order. Its default hexdump
output looks like this:
$ echo hello world | xxd | tee dump
00000000: 6865 6c6c 6f20 776f 726c 640a hello world.
Octets display in pairs with an ASCII text listing on the right. All
configurable. I can run this in reverse (-r
), recovering the original
input:
$ xxd -r dump
hello world
The tool reads the offset before the colon, the hexadecimal octets, and
ignores the text column. By editing dump
with a text editor, I can
change the raw octets of the original input. From this point of view, the
hexdump is actually a program of two alternating instructions: seek and
write. xxd seeks to the offset, writes the octets, then repeats. It
also doesn’t truncate the output file, so a hexdump can express binary
patches as a seek/write program.
$ echo hello world >hello
$ echo 6: 65766572796f6e650a | xxd -r - hello
$ cat hello
hello everyone
That seeks to offset 0x6
, then writes the 9 octets. The xxd parser is
flexible, and I did not need to follow the default format. It figured out
the format on its own, and rexxd further improves on this. We can use it
to create large files out of thin air, too:
$ echo 3fffffff: 00 | xxd -r - >1G
This command creates an all-zero, 1GiB file, 1G
, by seeking to just
before 1GiB then writing a zero. I used >1G
so that the shell would
truncate the file before starting xxd
— in case it was larger or
contained non-zeros.
This is a “smart seek” of course, and its not literally seeking on every
line. The tool tracks its file position and only seeks when necessary. If
seeking fails, it simulates the seek using a write if possible. When would
it not be possible? Lines need not be in order, of course, and so it may
need to seek backwards. Lines can also overlap in contents. If it weren’t
for buffering — or if rexxd had a unified buffer cache — then by
using the same file for input and output an “xxd program” could write new
instructions for itself and accidentally become Turing-complete.
The other common mode, -i
, looks like this:
$ echo hello world >hello
$ xxd -i hello hello.c
Which produces this hello.c
:
unsigned char hello[] = {
0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
unsigned int hello_len = 12;
Note how it converted the file name into variable names. Characters
disallowed in variable names become underscores _
. When reading from
standard input, xxd only emits the octets. Unless the new-ish -n
name
option is given, in which case that becomes the variable name. This
remains popular because, #embed
notwithstanding, as of this
writing all major toolchains remain stubborn about embedding data on their
own.
The case for replacement
The idea of replacing it began with backporting the -n
name option to
Vim 9.0 xxd. The feature did not appear in a release until a year ago, 28
years after -i
, despite its obviousness. I’ve also felt that xxd
is
slower than it could be, and a momentary examination reveals it’s buggier
than it ought to be. As expected, a few seconds of fuzz testing
xxd -r
reveals bugs, and it doesn’t even require writing a single line
of code:
$ afl-gcc -fsanitize=address,undefined xxd.c
$ mkdir inputs
$ echo >inputs/sample
$ afl-fuzz -i inputs/ -o fuzzout/ ./a.out -r
The Windows port is lacking in the usual ways, unable to handle Unicode
paths. The new Vim 9.1 xxd -R
color feature broke the Windows port, and
if w64devkit included Vim 9.1 then I’d need to patch out the new bugs. As
demonstrated above, at least it’s trivial to compile! It’s a single source
file, xxd.c
, and requires no configuration. I love that.
The more I looked, the more problems I found. It’s not doing anything
terribly complex, so I expected it wouldn’t be difficult to rewrite it
with a better foundation. So I did. Ignoring tests and documentation, my
rewrite is about twice as long. In exchange, it’s substantially faster:
$ dd if=/dev/urandom of=bigfile bs=1M count=64
$ time orig-xxd bigfile dump
real 0m 4.40s
user 0m 2.89s
sys 0m 1.46s
$ time rexxd bigfile dump
real 0m 0.31s
user 0m 0.07s
sys 0m 0.21s
Same in reverse:
$ time orig-xxd -r dump nul
real 0m 5.81s
user 0m 5.67s
sys 0m 0.07s
$ time rexxd -r dump nul
real 0m 0.33s
user 0m 0.23s
sys 0m 0.09s
Or embedding data with rexxd:
$ time orig-xxd -i bigfile bigfile.c
real 0m 10.32s
user 0m 9.85s
sys 0m 0.37s
$ time rexxd -i bigfile bigfile.c
real 0m 0.40s
user 0m 0.07s
sys 0m 0.34s
I wanted to keep it portable and simple, so that’s without fancy SIMD
processing. Just SWAR parsing, branch avoidance,
no division on hot paths, and sound architecture. I also optimized for the
typical case at the cost of the atypical case. It’s a little unfair to
compare it to a program probably first written on a 16-bit machine, but
there was time for it to pick up these techniques over the decades, too.
Unicode support works well:
$ cat π
3.14159265358979323846264338327950288419716939937510582097494
$ rexxd -i π π.c
Producing this source with Unicode variables:
unsigned char π[] = {
0x33, 0x2e, 0x31, 0x34, 0x31, 0x35, 0x39, 0x32, 0x36, 0x35, 0x33, 0x35,
// ...
0x34, 0x0a
};
unsigned int π_len = 62;
Whereas the original xxd on Windows has the usual CRT problems:
$ orig-xxd -i π
orig-xxd: p: No such file or directory
It also struggles with 64-bit offsets, particularly on 32-bit hosts and
LLP64 hosts like Windows. In contrast, I designed rexxd to robustly
process file offsets as 64-bit on all hosts. Its tests operate on a
virtual file system with virtual files at those sizes, so those paths
really have been tested, too.
The original xxd only uses static allocation, which places small range
limits on the configuration:
$ orig-xxd -c 1000
orig-xxd: invalid number of columns (max. 256).
In rexxd everything is arena allocated of course, and options are
limited only by the available memory, so the above, and more, would work.
The arena helps make the SWAR tricks possible, too, providing a fast
runway to load more data at a time.
While reverse engineering the original, I documented bugs I discovered and
noted them with a BUG:
comment if you wanted to see more. I’m not aiming
for bug compatibility, so these are not present in rexxd.
The xxd man page suggests using strace to examine the execution of
-r
reverse. That is, to monitor the seeks and writes of a binary patch
in order to debug it. That’s so insightful that I decided to build that as
a new -x
option (think sh -x
). That is, rexxd has a built-in strace
on all platforms! The trace is expressed in terms of unix system calls,
even on Windows:
$ printf '00:41 \n02:42 \n04:43' | rexxd -x -r - data.bin
open("data.bin", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 19
write(1, "A", 1) = 1
lseek(1, 2, SEEK_SET) = 2
read(0, ..., 4096) = 0
write(1, "B", 1) = 1
lseek(1, 4, SEEK_SET) = 4
write(1, "C", 1) = 1
exit(0) = ?
Is this doing some kind of self-ptrace debugger voodoo? Nope. Like
u-config, it has a platform layer, and it simply logs the platform
layer calls — except for the trace printout itself of course. While the
intention is to debug binary patches, it was also quite insightful in
examining rexxd itself. It helped me spot that rexxd flushed more often
than strictly necessary.
To port rexxd to any system, define Plt
as needed, implement these five
plt_
functions, then call xxd
. The five functions mostly have the
expected unix-like semantics:
typedef struct Plt Plt;
b32 plt_open(Plt *, i32 fd, u8 *path, b32 trunc, Arena *);
i64 plt_seek(Plt *, i32 fd, i64 off, i32 whence);
i32 plt_read(Plt *, u8 *buf, i32 len);
b32 plt_write(Plt *, i32 fd, u8 *buf, i32 len);
void plt_exit(Plt *, i32);
i32 xxd(i32 argc, u8 **argv, Plt *, byte *heap, iz heapsize);
If the platform wants these functions to be “virtual” then it can put
function pointers in the Plt
struct. Otherwise it stores anything it
might need in Plt
. Global variables are never necessary. The application
layer doesn’t use the standard library except (indirectly) memset
and
memcpy
, and it allocates everything it uses from the provided heap
parameter.
plt_open
is a little unusual in that it picks the file descriptor: 0 to
replace standard input, or 1 to replace standard output. All platforms
currently use a virtual file descriptor table, and these do not map onto
the real process file descriptors. But they could! Calls are straced in
the application layer, so they log virtual file descriptors as seen by
rexxd. The arena parameter offers scratch space for the Windows platform
layer to convert paths from narrow to wide for CreateFileW
, so it can
handle long path names with ease.
plt_read
doesn’t accept a file descriptor because there’s only one from
which to read, 0. plt_write
on the other hand allows writing to standard
error, 2.
plt_exit
doesn’t return, of course. In tests it longjmps back
to the top level, as though returning from xxd
with a status. This lets
me skip allocation null pointer checks, with OOM unwinding safely back to
the top level. Since rexxd allocates everything from the arena, it’s all
automatically deallocated, so it’s a clean exit.
On Windows, plt_seek
calls SetFilePointerEx
. I learned the
hard way that the behavior of calling it on a non-file is undefined, not
an error, so at least one GetFileType
call is mandatory. I also learned
that Windows will successfully seek all the way to INT64_MAX
. If the
file system doesn’t support that offset, it’s a write failure later. For
correct operation, rexxd must take care not to overflow its own internal
file position tracking near these offsets with Windows allowing seeks to
operate at the edge until the first flush. Tests run on a virtual file
system thanks to the platform layer, and some tests permit huge seeks and
simulate impossibly enormous files in order to probe behavior at the
extremes.
This is in contrast to Linux, where seeks beyond the underlying file
system’s supported file size is a seek error. For example, on ext4 with
the default configuration:
$ echo ffffffff000: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040320, SEEK_SET) = 17592186040320
read(0, ..., 4096) = 0
write(1, "\0", 1) = -1
exit(3) = ?
We can see the seek succeeded then the write failed because it went one
byte beyond the file system limit. While seeking one byte further will
cause the seek to fail (22 EINVAL
), and rexxd falls back on write until
it fills the storage and runs out of space:
$ echo ffffffff001: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040321, SEEK_SET) = -1
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
...
Mostly for fun, I wrote a libc-free platform layer using raw Linux system
calls, and it maps almost perfectly onto the kernel interface:
struct Plt { int fds[3]; };
b32 plt_open(Plt *plt, i32 fd, u8 *path, b32 trunc, Arena *)
{
i32 mode = fd ? O_CREAT|O_WRONLY : 0;
mode |= trunc ? O_TRUNC : 0;
plt->fds[fd] = (i32)syscall3(SYS_open, (uz)path, mode, 0666);
return plt->fds[fd] >= 0;
}
i64 plt_seek(Plt *plt, i32 fd, i64 off, i32 whence)
{
return syscall3(SYS_lseek, plt->fds[fd], off, whence);
}
i32 plt_read(Plt *plt, u8 *buf, i32 len)
{
return (i32)syscall3(SYS_read, plt->fds[0], (uz)buf, len);
}
b32 plt_write(Plt *plt, i32 fd, u8 *buf, i32 len)
{
return len == syscall3(SYS_write, plt->fds[fd], (uz)buf, len);
}
void plt_exit(Plt *, i32 r)
{
syscall3(SYS_exit, r, 0, 0);
}
On Windows I use the artisanal function prototypes of which I’ve
grown so fond. It’s also my first time using w64devkit’s -lmemory
in a
serious application. I’m using -lchkstk
in the “xxd as a DLL”
platform layer, too, but that one’s just a toy. In that one I use alloca
to allocate an arena, which is a rather novel combination, and the large
stack frame requires a stack probe. Otherwise none of rexxd requires stack
probes.
w64devkit’s new xxd.exe
is delightfully tidy as viewed by peports:
$ du -h xxd.exe
28.0K xxd.exe
$ peports xxd.exe
KERNEL32.dll
0 CreateFileW
0 ExitProcess
0 GetCommandLineW
0 GetFileType
0 GetStdHandle
0 MultiByteToWideChar
0 ReadFile
0 SetFilePointerEx
0 VirtualAlloc
0 WideCharToMultiByte
0 WriteFile
SHELL32.dll
0 CommandLineToArgvW
Other notes
Buffered output and buffered input is custom tailored for rexxd.
When parsing line-oriented input, like -r
, it attempts to parse from of
a view of the input buffer, no copying. The view is the usual string
representation:
typedef struct {
u8 *data;
iz len;
} Str;
Does it fail if the line is longer than the buffer? If it straddles reads,
does that hurt efficiency? The answer to both is “no” due to the spillover
arena. Input
is the buffered input struct, and here’s the interface to
get the next line:
Str nextline(Input *, Arena *);
If the line isn’t entirely contained in the input buffer, the complete
line is concatenated in the arena. So it comfortably handles
huge lines while no-copy optimizing for typical short, non-straddling
lines. With a per-iteration arena, any arena-backed line is automatically
freed at the end of the iteration, so it’s all transparent:
for (;;) {
Arena scratch = perm;
Str line = nextline(b, &scratch);
// ... line may point into an Input or scratch ...
}
If the line doesn’t fit in the arena, it triggers OOM handling. That is,
it calls plt_exit
and something platform-appropriate happens without
returning. Beats the pants off old getline
!
I came up with a maxof
macro that evaluates the maximum of any integral
type, signed or unsigned. It appears in overflow checks and more,
I really like how it turned out. For example:
if (pos > maxof(i64) - off) {
// overflow
}
pos += off;
Or:
i32 trunc32(iz n)
{
return n>maxof(i32) ? maxof(i32) : (i32)n;
}
Now that I have -lmemory
and generally solved string function issues for
myself, I leaned into __builtin_memset
and __builtin_memcpy
for this
project. Despite restrict
, it’s surprisingly difficult to get compilers
to optimize loops into semantically equivalent string function calls. An
explicit built-in solves that. It also produces faster debug builds, which
is what I run while I work. At -O0
, rexxd is about half the speed of a
release build.
Other than -x
, I don’t plan on inventing new features. I’d like to
maintain compatibility with the xxd
found everywhere else, and I don’t
expect adoption beyond w64devkit. Overall the project took about twice as
long as I anticipated — two weekends instead of one — but it turned out
better than I expected and I’m very pleased with the results.