I’ve worked on many programs that frequently require small,
short-lived buffers for use as a temporary workspace, perhaps to
construct a string or array. In C this is often accomplished with
arrays of automatic storage duration (i.e. allocated on the
stack). This is dirt cheap — much cheaper than a heap allocation —
and, unlike a typical general-purpose allocator, involves no thread
contention. However, the catch is that there may be no hard bound on the
buffer's size. For correctness, the scratch space must scale appropriately to
match its input, so whatever arbitrary buffer size I pick may be too small.
A widespread extension to C is the alloca() pseudo-function. It’s like
malloc(), but allocates memory on the stack, just like an automatic
variable. The allocation is automatically freed when the function (not
its scope!) exits, even with a longjmp() or other non-local exit.
void *alloca(size_t size);
Besides its portability issues, the most dangerous property is the
complete lack of error detection. If size is too large, the program
simply crashes, or worse.
For example, suppose I have an intern() function that finds or creates
the canonical representation/storage for a particular string. My
program needs to intern a string composed of multiple values, and will
construct a temporary string to do so.
const char *intern(const char *);
const char *
intern_identifier(const char *prefix, long id)
{
    size_t size = strlen(prefix) + 32;
    char *buffer = alloca(size);
    sprintf(buffer, "%s%ld", prefix, id);
    return intern(buffer);
}
I expect the vast majority of these prefix strings to be very small,
perhaps on the order of 10 to 80 bytes, and this function will handle
them extremely efficiently. But should this function get passed a huge
prefix, perhaps by a malicious actor, the program will misbehave
without warning.
A portable alternative to alloca() is variable-length arrays (VLA),
introduced in C99. Arrays with automatic storage duration need not
have a fixed, compile-time size. It’s just like alloca(), having
exactly the same dangers, but at least it’s properly scoped. It
was rejected for inclusion in C++11 due to this danger.
const char *
intern_identifier(const char *prefix, long id)
{
    char buffer[strlen(prefix) + 32];
    sprintf(buffer, "%s%ld", prefix, id);
    return intern(buffer);
}
There’s a middle-ground to this, using neither VLAs nor alloca().
Suppose the function always allocates a small, fixed size buffer —
essentially a free operation — but only uses this buffer if it’s large
enough for the job. If it’s not, a normal heap allocation is made with
malloc().
const char *
intern_identifier(const char *prefix, long id)
{
    char temp[256];
    char *buffer = temp;
    size_t size = strlen(prefix) + 32;
    if (size > sizeof(temp))
        if (!(buffer = malloc(size)))
            return NULL;
    sprintf(buffer, "%s%ld", prefix, id);
    const char *result = intern(buffer);
    if (buffer != temp)
        free(buffer);
    return result;
}
Since the function can now detect allocation errors, this version has
an error condition. Though, intern() itself would presumably return
NULL for its own allocation errors, so this is probably transparent to
the caller.
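To illustrate, here’s a hypothetical caller (register_user() is made up
for this example): a single NULL check covers an allocation failure in
either intern_identifier()’s temporary buffer or in intern() itself.

int
register_user(const char *prefix, long id)
{
    /* Hypothetical caller: one NULL check covers both failure modes. */
    const char *name = intern_identifier(prefix, id);
    if (!name)
        return 0; /* propagate the allocation failure */
    /* ... use the interned name ... */
    return 1;
}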
We’ve now entered the realm of small-size optimization. The vast
majority of cases are small and will therefore be very fast, but we
haven’t given up on the odd large case either. In fact, the large case
has been made a little bit worse (its small stack allocation goes
unused), trading that slight cost for speed in the common case. That’s
sound engineering.
Visual Studio has a pair of functions that nearly automate this
solution: _malloca() and _freea(). It’s like alloca(), but
allocations beyond a certain threshold go on the heap. This allocation
is freed with _freea(), which does nothing in the case of a stack
allocation.
void *_malloca(size_t);
void _freea(void *);
I said “nearly” because Microsoft screwed it up: instead of returning
NULL on failure, it generates a stack overflow structured exception
(for a heap allocation failure).
I haven’t tried it yet, but I bet something similar to _malloca() /
_freea() could be implemented using a couple of macros.
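For instance, something like this might work. It’s only a sketch under
a few assumptions: the MALLOCA() and FREEA() names are my own
invention, a fixed threshold picks between stack and heap, the caller
must pass the same size to both macros, and the size argument is
evaluated more than once, so keep it a simple expression.

#include <stdlib.h>
#include <alloca.h>   /* may be <malloc.h> on some platforms */

#define MALLOCA_THRESHOLD 1024

/* Small requests come from the stack and live until the calling
 * function returns; larger requests fall through to malloc(). */
#define MALLOCA(size) \
    ((size) <= MALLOCA_THRESHOLD ? alloca(size) : malloc(size))

/* Free only what came from the heap. */
#define FREEA(ptr, size) \
    do { \
        if ((size) > MALLOCA_THRESHOLD) \
            free(ptr); \
    } while (0)

Unlike _malloca(), an oversized request here simply falls back to
malloc() and can return NULL, which the caller checks like any other
allocation.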
Toward Structured Small-Size Optimization
CppCon 2016 was a couple weeks ago, and I’ve begun catching up on the
talks. I don’t like developing in C++, but I always learn new,
interesting concepts from this conference, many of which apply
directly to C. I look forward to Chandler Carruth’s talks the most,
having learned so much from his past talks, and I recommend them all.
After writing this article, I saw Nicholas Ormrod’s talk, The strange
details of std::string at Facebook, which is also highly
relevant.
Chandler’s talk this year was the one on hybrid data structures. I’d
already been mulling over small-size optimization for months, and the
first 5–10 minutes of his talk showed me I was on the right track. In
it he describes LLVM’s SmallVector class (among others), which is
basically a small-size-optimized version of std::vector; std::vector
itself can’t be small-size optimized because of the constraints
std::move() semantics place on its iterators.
I picked up a new trick from this talk, which I’ll explain in C’s
terms. Suppose I have a dynamically growing buffer “vector” of long
values. I can keep pushing values into the buffer, doubling the
storage in size each time it fills. I’ll call this one “simple.”
struct vec_simple {
    size_t size;
    size_t count;
    long *values;
};
Initialization is obvious. Though for easy overflow checks, and for
another reason I’ll explain later, I’m going to require the starting
size, hint, to be a power of two. It returns 1 on success and 0 on
error.
int
vec_simple_init(struct vec_simple *v, size_t hint)
{
    assert(hint && (hint & (hint - 1)) == 0); // power of 2
    v->size = hint;
    v->count = 0;
    v->values = malloc(sizeof(v->values[0]) * v->size);
    return !!v->values;
}
Pushing is straightforward, using realloc() when the buffer fills,
returning 0 for integer overflow or allocation failure.
int
vec_simple_push(struct vec_simple *v, long x)
{
    if (v->count == v->size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = v->size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow
        void *new_values = realloc(v->values, new_size * value_size);
        if (!new_values)
            return 0; // out of memory
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}
And finally, cleaning up. I hadn’t thought about this before, but if
the compiler manages to inline vec_simple_free(), that NULL pointer
assignment will probably get optimized out, possibly even in the face
of a use-after-free bug.
void
vec_simple_free(struct vec_simple *v)
{
    free(v->values);
    v->values = 0; // trap use-after-free bugs
}
And here’s an example of its use (without checking for errors).
long
example(long (*f)(void *), void *arg)
{
    struct vec_simple v;
    vec_simple_init(&v, 16);
    long n;
    while ((n = f(arg)) > 0)
        vec_simple_push(&v, n);
    long result = 0;
    // ... process vector into result ...
    vec_simple_free(&v);
    return result;
}
If the common case is only a handful of long values, and this function
is called frequently, we’re doing a lot of heap allocation that could
be avoided. Wouldn’t it be nice to put all that on the stack?
Applying Small-Size Optimization
Modify the struct to add this temp field. It’s probably obvious what
I’m getting at here. This is essentially the technique in SmallVector.
struct vec_small {
    size_t size;
    size_t count;
    long *values;
    long temp[16];
};
The values field is initially pointed at the small buffer. Notice that
unlike the “simple” vector above, this initialization function cannot
fail. It’s one less thing for the caller to check. It also doesn’t
take a hint since the buffer size is fixed.
void
vec_small_init(struct vec_small *v)
{
    v->size = sizeof(v->temp) / sizeof(v->temp[0]);
    v->count = 0;
    v->values = v->temp;
}
Pushing gets a little more complicated. If it’s the first time the
buffer has grown, the realloc() has to be done “manually” with
malloc() and memcpy().
int
vec_small_push(struct vec_small *v, long x)
{
    if (v->count == v->size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = v->size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow
        void *new_values;
        if (v->temp == v->values) {
            /* First time heap allocation. */
            new_values = malloc(new_size * value_size);
            if (new_values)
                memcpy(new_values, v->temp, sizeof(v->temp));
        } else {
            new_values = realloc(v->values, new_size * value_size);
        }
        if (!new_values)
            return 0; // out of memory
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}
Finally, only call free() if the buffer was actually allocated on the
heap.
void
vec_small_free(struct vec_small *v)
{
    if (v->values != v->temp)
        free(v->values);
    v->values = 0;
}
If 99% of these vectors never exceed 16 elements, then 99% of the time
the heap isn’t touched. That’s much better than before. The 1% case
is still covered, too, at what is probably an insignificant cost.
An important difference from SmallVector is that it parameterizes the
small buffer’s size through a template argument. In C we’re stuck with
fixed sizes or macro hacks. Or are we?
Using a Caller-Provided Buffer
This time remove the temporary buffer, making it look like the simple
vector from before.
struct vec_flex {
    size_t size;
    size_t count;
    long *values;
};
The user will provide the initial buffer, which will presumably be an
adjacent, stack-allocated array, but whose size is under the user’s
control.
void
vec_flex_init(struct vec_flex *v, long *init, size_t nmemb)
{
    assert(nmemb > 1); // we need that low bit!
    assert(nmemb && (nmemb & (nmemb - 1)) == 0); // power of 2
    v->size = nmemb | 1;
    v->count = 0;
    v->values = init;
}
The power-of-two size, greater than one, means the size will always be
an even number. Why is this important? There’s one piece of
information missing from the struct: is the buffer currently heap
allocated or not? That’s just one bit of information, but adding one
more bit to the struct would typically pad it out by another 31 or 63
bits. What a waste! Since I’m not using the lowest bit of the size
(it’s always an even number), I can smuggle the flag in there. Hence
the nmemb | 1, the 1 indicating that the buffer is not heap allocated.
When pushing, the actual_size is extracted by clearing the bottom bit
(size & ~1), and the indicator bit is extracted with a 1-bit mask
(size & 1). Once the vector moves to the heap, the bit stays cleared
simply by virtue of never being set again.
int
vec_flex_push(struct vec_flex *v, long x)
{
    size_t actual_size = v->size & ~(size_t)1; // clear bottom bit
    if (v->count == actual_size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = actual_size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; /* overflow */
        void *new_values;
        if (v->size & 1) {
            /* First time heap allocation. */
            new_values = malloc(new_size * value_size);
            if (new_values)
                memcpy(new_values, v->values, actual_size * value_size);
        } else {
            new_values = realloc(v->values, new_size * value_size);
        }
        if (!new_values)
            return 0; /* out of memory */
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}
Only free() when it’s been allocated, like before.
void
vec_flex_free(struct vec_flex *v)
{
    if (!(v->size & 1))
        free(v->values);
    v->values = 0;
}
And here’s what it looks like in action.
long
example(long (*f)(void *), void *arg)
{
    struct vec_flex v;
    long buffer[16];
    vec_flex_init(&v, buffer, sizeof(buffer) / sizeof(buffer[0]));
    long n;
    while ((n = f(arg)) > 0)
        vec_flex_push(&v, n);
    long result = 0;
    // ... process vector into result ...
    vec_flex_free(&v);
    return result;
}
If you were to log all vector sizes as part of profiling, and the
assumption about their typically small element counts holds, you could
easily tune the array size in each case to eliminate the vast majority
of vector heap allocations.
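For example, here’s one hypothetical way to collect those numbers. The
histogram and vec_flex_profile() are inventions for illustration, not
part of the vector code above: call it on each vector just before
freeing it, then dump the counts at exit.

/* Hypothetical instrumentation: bucket each vector's final count by
 * power of two so the stack buffer size can be tuned from real data. */
#define NBUCKETS 32
static unsigned long histogram[NBUCKETS];

static void
vec_flex_profile(const struct vec_flex *v)
{
    int bucket = 0;
    for (size_t n = v->count; n > 1; n >>= 1)
        bucket++;
    if (bucket >= NBUCKETS)
        bucket = NBUCKETS - 1;
    histogram[bucket]++;
}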
Now that I’ve learned this optimization trick, I’ll be looking out for
good places to apply it. It’s also a good reason for me to stop
abusing VLAs.