nullprogram.com/blog/2023/10/08/
This article was discussed on Hacker News and on reddit.
This has been a ground-breaking year for my C skills, and paradigm shifts
in my technique has provoked me to reconsider my habits and coding style.
It’s been my largest personal style change in years, so I’ve decided to
take a snapshot of its current state and my reasoning. These changes have
produced significant productive and organizational benefits, so while most
is certainly subjective, it likely includes a few objective improvements.
I’m not saying everyone should write C this way, and when I contribute
code to a project I follow their local style. This is about what works
well for me.
Primitive types
Starting with the fundamentals, I’ve been using short names for primitive
types. The resulting clarity was more than I had expected, and it’s made
my code more enjoyable to review. These names appear frequently throughout
a program, so conciseness pays. Also, now that I’ve gone without, _t
suffixes are more visually distracting than I had realized.
typedef uint8_t u8;
typedef char16_t c16;
typedef int32_t b32;
typedef int32_t i32;
typedef uint32_t u32;
typedef uint64_t u64;
typedef float f32;
typedef double f64;
typedef uintptr_t uptr;
typedef char byte;
typedef ptrdiff_t size;
typedef size_t usize;
Some people prefer an s
prefix for signed types. I prefer i
, plus as
you’ll see, I have other designs for s
. For sizes, isize
would be more
consistent, and wouldn’t hog the identifier, but signed sizes are the
way and so I want them in a place of privilege. usize
is niche,
mainly for interacting with external interfaces where it might matter.
b32
is a “32-bit boolean” and communicates intent. I could use _Bool
,
but I’d rather stick to a natural word size and stay away from its weird
semantics. To beginners it might seem like “wasting memory” by using a
32-bit boolean, but in practice that’s never the case. It’s either in a
register (return value, local variable) or would be padded anyway (struct
field). When it actually matters, I pack booleans into a flags
variable,
and a 1-byte boolean rarely important.
While UTF-16 might seem niche, it’s a necessary evil when dealing with
Win32, so c16
(“16-bit character”) has made a frequent appearance. I
could have based it on uint16_t
, but putting the name char16_t
in its
“type hierarchy” communicates to debuggers, particularly GDB, that for
display purposes these variables hold character data. Officially Win32
uses a type named wchar_t
, but I like being explicit about UTF-16.
u8
is for octets, usually UTF-8 data. It’s distinct from byte
, which
represents raw memory and is a special aliasing type. In theory these
can be distinct types with differing semantics, though I’m not aware of
any implementation that does so (yet?). For now it’s about intent.
What about systems that don’t support fixed width types? That’s academic,
and far too much time has been wasted worrying about it. That includes
time wasted on typing out int_fast32_t
and similar nonsense. Virtually
no existing software would actually work correctly on such systems — I’m
certain nobody’s testing it after all — so it seems nobody else cares
either.
I don’t intend to use these names in isolation, such as in code snippets
(outside of this article). If I did, examples would require the typedefs
to give readers the complete context. That’s not worth extra explanation.
Even in the most recent articles I’ve used ptrdiff_t
instead of size
.
Macros
Next, some “standard” macros:
#define countof(a) (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s) (countof(s) - 1)
#define new(a, t, n) (t *)alloc(a, sizeof(t), _Alignof(t), n)
While I still prefer ALL_CAPS
for constants, I’ve adopted lowercase for
function-like macros because it’s nicer to read. They don’t have the same
namespace problems as other macro definitions: I can have a macro named
new()
and also variables and fields named new
because they don’t look
like function calls.
For GCC and Clang, my favorite assert
macro now looks like this:
#define assert(c) while (!(c)) __builtin_unreachable()
It has useful properties beyond the usual benefits:
-
It does not require separate definitions for debug and release builds.
Instead it’s controlled by the presence of Undefined Behavior Sanitizer
(UBSan), which is already present/absent in these circumstances. That
includes fuzz testing.
-
libubsan
provides a diagnostic printout with a file and line number.
-
In release builds it turns into a practical optimization hint.
To enable assertions in release builds, put UBSan in trap mode with
-fsanitize-trap
and then enable at least -fsanitize=unreachable
. In
theory this can also be done with -funreachable-traps
, but as of this
writing it’s been broken for the past few GCC releases.
Parameters and functions
No const
. It serves no practical role in optimization, and I cannot
recall an instance where it caught, or would have caught, a mistake. I
held out for awhile as prototype documentation, but on reflection I found
that good parameter names were sufficient. Dropping const
has made me
noticeably more productive by reducing cognitive load and eliminating
visual clutter. I now believe its inclusion in C was a costly mistake.
(One small exception: I still like it as a hint to place static tables in
read-only memory closer to the code. I’ll cast away the const
if needed.
This is only of minor importance.)
Literal 0
for null pointers. Short and sweet. This is not new, but a
style I’ve used for about 7 years now, and has appeared all over my
writing since. There are some theoretical edge cases where it may cause
defects, and lots of ink has been spilled on the subject, but
after a couple 100K lines of code I’ve yet to see it happen.
restrict
when necessary, but better to organize code so that it’s not,
e.g. don’t write to “out” parameters in loops, or don’t use out parameters
at all (more on that momentarily). I don’t bother with inline
because I
compile everything as one translation unit anyway.
typedef
all structures. I used to shy away from it, but eliminating the
struct
keyword makes code easier to read. If it’s a recursive structure,
use a forward declaration immediately above so that such fields can use
the short name:
typedef struct map map;
struct map {
map *child[4];
// ...
};
Declare all functions static
except for entry points. Again, with
everything compiled as a single translation unit there’s no reason to do
otherwise. It was probably a mistake for C not to default to static
,
though I don’t have a strong opinion on the matter. With the clutter
eliminated through short types, no const
, no struct
, etc. functions
fit comfortably on the same line as their return type. I used to break
them apart so that the function name began on its own line, but that’s no
longer necessary.
In my writing I sometimes omit static
to simplify, and because outside
the context of a complete program it’s mostly irrelevant. However, I will
use it below to emphasize this style.
For awhile I capitalized type names as that effectively put them in a kind
of namespace apart from variables and functions, but I eventually stopped.
I may try this idea in different way in the future.
Strings
One of my most productive changes this year has been the total rejection
of null terminated strings — another of those terrible mistakes — and the
embrace of this basic string type:
#define s8(s) (s8){(u8 *)s, lengthof(s)}
typedef struct {
u8 *data;
size len;
} s8;
I’ve used a few names for it, but this is my favorite. The s
is for
string, and the 8
is for UTF-8 or u8
. The s8
macro (sometimes just
spelled S
) wraps a C string literal, making a s8
string out of it. A
s8
is handled like a fat pointer, passed and returned by copy.
s8
makes for a great function prefix, unlike str
, all of which are
reserved. Some examples:
static s8 s8span(u8 *, u8 *);
static b32 s8equals(s8, s8);
static size s8compare(s8, s8);
static u64 s8hash(s8);
static s8 s8trim(s8);
static s8 s8clone(s8, arena *);
Then when combined with the macro:
if (s8equals(tagname, s8("body"))) {
// ...
}
You might be tempted to use a flexible array member to pack the size and
array together as one allocation. Tried it. Its inflexibility is totally
not worth whatever benefits it might have. Consider, for instance, how
you’d create such a string out of a literal, and how it would be used.
A few times I’ve thought, “This program is simple enough that I don’t need
a string type for this data.” That thought is nearly always wrong. Having
it available helps me think more clearly, and makes for simpler programs.
(C++ got it only a few years ago with std::string_view
and std::span
.)
It has a natural UTF-16 counterpart, s16
:
#define s16(s) (s16){u##s, lengthof(u##s)}
typedef struct {
c16 *data;
size len;
} s16;
I’m not entirely sold on gluing u
to the literal in the macro, versus
writing it out on the string literal.
More structures
Another change has been preferring structure returns instead of out
parameters. It’s effectively a multiple value return, though without
destructuring. A great organizational change. For example, this function
returns two values, a parse result and a status:
typedef struct {
i32 value;
b32 ok;
} i32parsed;
static i32parsed i32parse(s8);
Worried about the “extra copying?” Have no fear, because in practice
calling conventions turn this into a hidden, restrict
-qualified out
parameter — if it’s not inlined such that any return value overhead would
be irrelevant anyway. With this return style I’m less tempted to use
in-band signals like special null returns to indicate errors, which is
less clear.
It’s also led to a style of defining a zero-initialized return value at
the top of the function, i.e. ok
is false, and then use it for all
return
statements. On error, it can bail out with an immediate return.
The success path sets ok
to true before the return.
static i32parsed i32parse(s8 s)
{
i32parsed r = {0};
for (size i = 0; i < s.len; i++) {
u8 digit = s.data[i] - '0';
// ...
if (overflow) {
return r;
}
r.value = r.value*10 + digit;
}
r.ok = 1;
return r;
}
Aside from static data, I’ve also moved away from initializers except the
conventional zero initializer. (Notable exception: s8
and s16
macros.)
This includes designated initializers. Instead I’ve been initializing with
assignments. For example, this buffered output “constructor”:
typedef struct {
u8 *buf;
i32 len;
i32 cap;
i32 fd;
b32 err;
} u8buf;
static u8buf newu8buf(arena *perm, i32 cap, i32 fd)
{
u8buf r = {0};
r.buf = new(perm, u8, cap);
r.cap = cap;
r.fd = fd;
return r;
}
I like how this reads, but it also eliminates a cognitive burden: The
assignments are separated by sequence points, giving them an explicit
order. It doesn’t matter here, but in other cases it does:
example e = {
.name = randname(&rng),
.age = randage(&rng),
.seat = randseat(&rng),
};
There are 6 possible values for e
from the same seed. I like no longer
thinking about these possibilities.
Odds and ends
Prefer __attribute
to __attribute__
. The __
suffix is excessive and
unnecessary.
__attribute((malloc, warn_unused_result))
For Win32 systems programming, which typically only requires a modest
number of declarations and definitions, rather than include windows.h
,
write the prototypes out by hand using custom types. It reduces
build times, declutters namespaces, and interfaces more cleanly with the
program (no more DWORD
/BOOL
/ULONG_PTR
, but u32
/b32
/uptr
).
#define W32(r) __declspec(dllimport) r __stdcall
W32(void) ExitProcess(u32);
W32(i32) GetStdHandle(u32);
W32(byte *) VirtualAlloc(byte *, usize, u32, u32);
W32(b32) WriteConsoleA(uptr, u8 *, u32, u32 *, void *);
W32(b32) WriteConsoleW(uptr, c16 *, u32, u32 *, void *);
For inline assembly, treat the outer parentheses like braces, put a space
before the opening parenthesis, just like if
, and start each constraint
line with its colon.
static u64 rdtscp(void)
{
u32 hi, lo;
asm volatile (
"rdtscp"
: "=d"(hi), "=a"(lo)
:
: "cx", "memory"
);
return (u64)hi<<32 | lo;
}
There’s surely a lot more to my style than this, but unlike the above,
those details haven’t changed this year. To see most of the mentioned
items in action in a small program, see wordhist.c
, one of my
testing grounds for hash-tries, or for a slightly larger program,
asmint.c
, a mini programming language implementation.