You Can't Always Hash Pointers in C

May 30, 2016

nullprogram.com/blog/2016/05/30/

Occasionally I’ve needed to key a hash table with C pointers. I don’t care about the contents of the object itself — especially if it might change — just its pointer identity. For example, suppose I’m using null-terminated strings as keys and I know these strings will always be interned in a common table. These strings can be compared directly by their pointer values (str_a == str_b) rather than, more slowly, by their contents (strcmp(str_a, str_b) == 0). The intern table ensures that these expressions both have the same result.
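As a rough sketch of what interning buys, here's a toy intern table. The names `intern`, `intern_table`, and `INTERN_MAX` are made up for illustration, and the linear search and fixed capacity are simplifications. A real table would hash the contents and probably copy the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INTERN_MAX 64  /* hypothetical capacity, for illustration only */

static const char *intern_table[INTERN_MAX];
static size_t intern_count;

/* Return the canonical pointer for a string with these contents.
 * This sketch stores the pointer it is given rather than copying,
 * so the caller must keep the string alive. Returns NULL when full. */
const char *
intern(const char *s)
{
    for (size_t i = 0; i < intern_count; i++)
        if (strcmp(intern_table[i], s) == 0)
            return intern_table[i];
    if (intern_count == INTERN_MAX)
        return NULL;
    intern_table[intern_count] = s;
    return intern_table[intern_count++];
}

/* Two separately stored copies of the same text intern to one pointer,
 * so pointer equality agrees with strcmp() afterwards. */
int
intern_demo(void)
{
    static const char a[] = "hello";
    static const char b[] = "hello";
    static const char c[] = "world";
    const char *ia = intern(a);
    const char *ib = intern(b);
    const char *ic = intern(c);
    assert(ia != NULL && ib != NULL && ic != NULL);
    return ia == ib && ia != ic;
}
```

Once every key has passed through `intern`, comparing `ia == ib` answers the same question as `strcmp(ia, ib) == 0`, only faster.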

To key a hash table, or any other efficient map/dictionary data structure, on pointers, I’ll need to turn those pointers into numerical values. However, C pointers aren’t integers. Following certain rules it’s permitted to cast pointers to integers and back, but doing so will reduce the program’s portability. The most important consideration is that the integer form isn’t guaranteed to have any meaningful or stable value. In other words, even in a conforming implementation, the same pointer might cast to two different integer values. This would break any algorithm that isn’t keenly aware of the implementation details.
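For reference, the usual non-portable approach looks something like the sketch below. The names `mix64` and `hash_ptr` are hypothetical, and the mixing steps are borrowed from the MurmurHash3 64-bit finalizer. It assumes the implementation provides `uintptr_t` and that equal pointers always convert to equal integers, which is exactly the part the standard doesn't promise.

```c
#include <stdint.h>

/* Scramble the bits of a 64-bit integer (MurmurHash3 finalizer). */
uint64_t
mix64(uint64_t x)
{
    x ^= x >> 33;
    x *= UINT64_C(0xff51afd7ed558ccd);
    x ^= x >> 33;
    x *= UINT64_C(0xc4ceb9fe1a85ec53);
    x ^= x >> 33;
    return x;
}

/* Hash a pointer by its integer form.
 * ASSUMPTION: uintptr_t exists, and the same pointer always converts
 * to the same integer on this implementation. */
uint64_t
hash_ptr(const void *p)
{
    return mix64((uint64_t)(uintptr_t)p);
}
```

On mainstream desktop and server platforms this behaves as expected. It simply won't compile where `uintptr_t` is absent.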

To show why this is, I’m going to be citing the relevant parts of the C99 standard (ISO/IEC 9899:1999). The draft for C99 is freely available (and what I use myself since I’m a cheapass). My purpose is not to discourage you from casting pointers to integers and using the result. The vast majority of the time this works fine and as you would expect. I just think it’s an interesting part of the language, and C/C++ programmers should be aware of the potential trade-offs.

Integer to pointer casts

What does the standard have to say about casting integers to pointers? §6.3.2.3¶5:

An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.

It also includes a footnote:

The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.

Casting an integer to a pointer depends entirely on the implementation. This is intended for things like memory-mapped hardware. The programmer may need to access memory at a specific physical address, which would be encoded in the source as an integer constant and cast to a pointer of the appropriate type.

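/* Read a memory-mapped sensor register at a fixed address. The
 * volatile qualifier forces a real load from the hardware each call. */
int
read_sensor_voltage(void)
{
    return *(volatile int *)0x1ffc;
}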

It may also be used by a loader and dynamic linker to compute the virtual address of various functions and variables, then cast to a pointer before use.

Both cases are already dependent on implementation defined behavior, so there’s nothing lost in relying on these casts.

An integer constant expression with the value 0 is a special case. It casts to a null pointer in all implementations (§6.3.2.3¶3). However, a null pointer doesn’t necessarily point to address zero, nor is it necessarily a zero bit pattern (i.e. beware memset and calloc on memory with pointers). It’s just guaranteed never to compare equal to a pointer to any valid object, and dereferencing it is undefined behavior.
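To make the memset/calloc caveat concrete, here's a small sketch that sets pointer members by assignment instead of relying on all-bits-zero memory. The `struct node` type and `node_new` function are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

struct node {
    struct node *next;
    const char *key;
};

/* Allocate a node and assign a null pointer to each pointer member.
 * Assignment from NULL always yields a null pointer, whatever its
 * bit pattern is; zero-filled memory from calloc() or memset() only
 * does so on implementations where null happens to be all zero bits. */
struct node *
node_new(void)
{
    struct node *n = malloc(sizeof(*n));
    if (n != NULL) {
        n->next = NULL;
        n->key = NULL;
    }
    return n;
}
```

On nearly every implementation in use today the null pointer is all zero bits, so this is about strict portability rather than a bug you're likely to hit.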

Pointer to integer casts

What about the other way around? §6.3.2.3¶6:

Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.

Like before, it’s implementation defined. However, the negatives are a little stronger: the cast itself may be undefined behavior. I speculate this is tied to integer overflow. The last part makes pointer to integer casts optional for an implementation. This is one way that the hash table above would be less portable.

When the cast is always possible, an implementation can provide an integer type wide enough to hold any pointer value. §7.18.1.4¶1:

The following type designates a signed integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:

intptr_t

The following type designates an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:

uintptr_t

These types are optional.
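Since the types are optional, a program can at least detect their absence at compile time. §7.18¶4 says `<stdint.h>` must not define the associated macros for a type the implementation doesn't provide, so `UINTPTR_MAX` doubles as a feature test. In this sketch `HAVE_UINTPTR`, `ptr_bits`, and `ptr_roundtrip_ok` are names I've made up.

```c
#include <stdint.h>

/* UINTPTR_MAX is defined only when uintptr_t exists (C99 §7.18¶4). */
#ifdef UINTPTR_MAX
#  define HAVE_UINTPTR 1
typedef uintptr_t ptr_bits;
#else
#  define HAVE_UINTPTR 0
#endif

/* Return 1 if the pointer survives a round trip through an integer.
 * Where uintptr_t is missing there's nothing to test, so return 0. */
int
ptr_roundtrip_ok(void *p)
{
#if HAVE_UINTPTR
    ptr_bits bits = (ptr_bits)p;
    return (void *)bits == p;
#else
    (void)p;
    return 0;
#endif
}
```

The round trip is the only property being checked here, because it's the only one the standard guarantees for a valid pointer.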

The takeaway is that the integer has no meaningful value. The only guarantee is that the integer can be cast back into a void pointer that will compare equal to the original. It would be perfectly legal for an implementation to pass these assertions (and still sometimes fail them).

#include <assert.h>
#include <stdint.h>

void
example(void *ptr_a, void *ptr_b)
{
    if (ptr_a == ptr_b) {
        uintptr_t int_a = (uintptr_t)ptr_a;
        uintptr_t int_b = (uintptr_t)ptr_b;
        assert(int_a != int_b);  /* integer forms may differ */
        assert((void *)int_a == (void *)int_b);  /* round trip holds */
    }
}

Since the bits don’t have any particular meaning, arithmetic operations involving them will also have no meaning. When a pointer might map to two different integers, the hash values might not match up, breaking hash tables that rely on them. Even with uintptr_t provided, casting pointers to integers isn’t useful without also relying on implementation defined properties of the result.

Reasons for this pointer insanity

What purpose could such strange pointer-to-integer casts serve?

A security-conscious implementation may choose to annotate pointers with additional information by setting unused bits. It might be for baggy bounds checks or, someday, in an undefined behavior sanitizer. Before dereferencing annotated pointers, the metadata bits would be checked for validity, and cleared/set before use as an address. Or it may map the same object at multiple virtual addresses to avoid setting/clearing the metadata bits, providing interoperability with code unaware of the annotations. When pointers are compared, these bits would be ignored.

When these annotated pointers are cast to integers, the metadata bits will be present, but a program using the integer wouldn’t know their meaning without tying itself closely to that implementation. Completely unused bits may even be filled with random garbage when cast. It’s allowed.
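Here's a toy model of such an implementation, working purely on integers so it runs anywhere. The layout is entirely hypothetical: I'm assuming the low 48 bits hold the address and the high 16 bits hold metadata. The names `ADDR_MASK`, `sim_ptr_equal`, `sim_hash_raw`, and `sim_hash_masked` are invented for this sketch.

```c
#include <stdint.h>

/* HYPOTHETICAL layout: low 48 bits are the address, high 16 bits are
 * metadata that pointer comparison ignores. */
#define ADDR_MASK UINT64_C(0x0000ffffffffffff)

/* How the imagined implementation would compare two pointers. */
int
sim_ptr_equal(uint64_t a, uint64_t b)
{
    return (a & ADDR_MASK) == (b & ADDR_MASK);
}

/* A naive hash over the full integer form, metadata included.
 * Multiplying by an odd constant maps distinct inputs to distinct
 * outputs, so differing metadata always changes the hash. */
uint64_t
sim_hash_raw(uint64_t bits)
{
    return bits * UINT64_C(0x9e3779b97f4a7c15);
}

/* A hash that strips the metadata first. Writing this requires
 * knowing the layout, which ties the program to the implementation. */
uint64_t
sim_hash_masked(uint64_t bits)
{
    return sim_hash_raw(bits & ADDR_MASK);
}
```

Take `a = 0x00007f0012345678` and `b = a | (0xbeef << 48)` as the integer forms of one pointer with two different annotations. `sim_ptr_equal(a, b)` holds, yet `sim_hash_raw` sends them to different buckets. Only `sim_hash_masked` agrees with pointer equality.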

You may have been thinking before about using a union or char * to bypass the cast and access the raw pointer bytes, but you’d run into the same problems on the same implementations.

Conforming programs

The standard makes a distinction between strictly conforming programs (§4¶5) and conforming programs (§4¶7). A strictly conforming program must not produce output dependent on implementation-defined behavior, nor exceed minimum implementation limits. Very few programs fit in this category, and any program using uintptr_t is excluded since the type is optional. Here are more examples of code that isn’t strictly conforming:

    printf("%zu", sizeof(int)); // §6.5.3.4
    printf("%d", -1 >> 1);      // §6.5¶4
    printf("%d", INT_MAX);      // §5.2.4.2.1

On the other hand, a conforming program is allowed to depend on implementation defined behavior. Relying on meaningful, stable values for pointers cast to uintptr_t/intptr_t is conforming even if your program may exhibit bugs on some implementations.
