nullprogram.com/blog/2022/03/07/
Wordle is all the rage, resulting in an explosion of hobbyist clones,
with new ones appearing every day. At the current rate, I estimate that
by the end of 2022, 99% of all new software releases will be Wordle clones.
That’s no surprise since the rules are simple, it’s more fun to implement
and study than to actually play, and the hard part is building a decent
user interface. Such implementations go back at least 30 years.
Implementers get to decide on a platform, language, and the particular
subject of this article: how to handle the word list. Is it a separate
file/database or embedded in the program? If embedded, is it
worth compressing? In this article I’ll present a simple, tailored Wordle
list compression strategy that beats general purpose compressors.
Last week one particular QuickBASIC clone, WorDOSle, caught my
eye. It embeds its word list despite the dire constraints of its 16-bit
platform. The original Wordle list (1, 2) has 12,972 words which,
naively stored, would consume 77,832 bytes (5 letters, plus newline).
Sadly this exceeds a 16-bit address space. Eliminating the redundant
newline delimiter brings it down to 64,860 bytes — just small enough to
fit in an 8086 segment, but probably still difficult to manage from
QuickBASIC.
The author made a trade-off, reducing the word list to a more manageable,
if meager, 2,318 words, wisely excluding delimiters. Otherwise no further
effort was made to reduce the size. The list is sorted, and the program
cleverly tests words against the list in place using a binary search.
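To make that concrete, here's a rough sketch of the idea in Python (not
WorDOSle's actual code, just the general shape): a binary search over a
sorted string of concatenated 5-letter words, no delimiters needed.

def contains(packed, guess):
    # packed: sorted words concatenated back-to-back, 5 letters each
    # guess:  a 5-letter lowercase string
    lo, hi = 0, len(packed)//5
    while lo < hi:
        mid = (lo + hi)//2
        word = packed[mid*5:mid*5+5]
        if word == guess:
            return True
        elif word < guess:
            lo = mid + 1
        else:
            hi = mid
    return False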
Compaction baseline
Before getting into any real compression technologies, there’s low-hanging
fruit to investigate. Words are exactly five, case-insensitive, English
language letters: a–z. To illustrate, here are the first 100 5-letter
words from a short Wordle word list.
abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
In ASCII/UTF-8 form it’s 8 bits per letter, 5 bytes per word, but I only
need 5 bits per letter, or more specifically, ~4.7 bits (log2(26)) per
letter. If I instead treat each word as a base-26 number, I can pack each
word into 3 bytes (26**5 is ~23.5 bits). A 40% savings just by using a
smarter representation.
With 12,972 words, that’s 38,916 bytes for the whole list. Any
compression I apply must at least beat this size in order to be worth
using.
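To illustrate, here's a quick sketch of the 3-byte representation
(pack/unpack are just illustrative helpers, not something I'll use later):
treat the word as a base-26 integer, then split it into three bytes.

def pack(word):
    n = 0
    for c in word:
        n = n*26 + ord(c) - ord('a')  # accumulate base-26 digits
    return bytes((n & 0xff, n>>8 & 0xff, n>>16 & 0xff))

def unpack(b):
    n = b[0] | b[1]<<8 | b[2]<<16
    word = ""
    for _ in range(5):
        word = chr(ord('a') + n%26) + word
        n //= 26
    return word

The round trip works because 26**5 - 1 is 11,881,375, comfortably below
2**24.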
Letter frequency
Not all letters occur at the same frequency. Here’s the letter frequency
for the original Wordle word list:
a:5990 e:6662 i:3759 m:1976 q: 112 u:2511 y:2074
b:1627 f:1115 j: 291 n:2952 r:4158 v: 694 z: 434
c:2028 g:1644 k:1505 o:4438 s:6665 w:1039
d:2453 h:1760 l:3371 p:2019 t:3295 x: 288
When encoding a word, I can save space by spending fewer bits on frequent
letters like e at the cost of spending more bits on infrequent letters
like q. There are multiple approaches, but the simplest is Huffman coding.
It’s not the most efficient, but it’s so easy I can almost code it in my
sleep.
While my ultimate target is C, I did the frequency analysis, explored the
problem space, and implemented my compressors in Python. I don’t normally
like to use Python, but it is good for one-shot, disposable data
science-y stuff like this. The decompressor will be implemented in C,
partially via meta-programming: Python code generating my C code. Here’s
my letter histogram code:
import collections, heapq, itertools, sys  # heapq is used further below

words = [line[:5] for line in sys.stdin]
hist = collections.defaultdict(int)
for c in itertools.chain(*words):
    hist[c] += 1
To build a Huffman coding tree, I’ll need a min-heap (priority queue)
initially filled with nodes representing each letter and its frequency.
While the heap has more than one element, I pop off the two lowest
frequency nodes, create a new parent node with the sum of their
frequencies, and push it into the heap. When the heap has one element, the
remaining element is the root of the Huffman coding tree.
def huffman(hist):
    heap = [(n, c) for c, n in hist.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        node = a[0]+b[0], (a[1], b[1])
        heapq.heappush(heap, node)
    return heap[0][1]
tree = huffman(hist)
(By the way, I love that heapq operates directly on a plain list rather
than being its own data structure.) This produces the following Huffman
coding tree (via pprint):
((('e', 's'),
(('t', 'l'), (('g', ('v', 'w')), ('h', 'm')))),
((('i', ('p', 'c')),
('r', ('y', ('f', ('z', ('j', ('q', 'x'))))))),
(('o', ('d', 'u')), ('a', ('n', ('k', 'b'))))))
It would be more useful to actually see the encodings.
def flatten(tree, prefix=""):
    if isinstance(tree, tuple):
        return flatten(tree[0], prefix+"0") + \
               flatten(tree[1], prefix+"1")
    else:
        return [(tree, prefix)]
I used isinstance to distinguish leaves (str) from internal nodes
(tuple). With sorted(flatten(tree)), I get something like Morse Code:
[('a', '1110'), ('j', '10111110'), ('s', '001'),
('b', '111111'), ('k', '111110'), ('t', '0100'),
('c', '10011'), ('l', '0101'), ('u', '11011'),
('d', '11010'), ('m', '01111'), ('v', '011010'),
('e', '000'), ('n', '11110'), ('w', '011011'),
('f', '101110'), ('o', '1100'), ('x', '101111111'),
('g', '01100'), ('p', '10010'), ('y', '10110'),
('h', '01110'), ('q', '101111110'), ('z', '1011110'),
('i', '1000'), ('r', '1010')]
In terms of encoded bit length, what is the shortest and longest?
codes = dict(flatten(tree))
lengths = [(sum(len(codes[c]) for c in w), w) for w in words]
min(lengths) is “esses” at 15 bits, and max(lengths) is “qajaq” at 34
bits. In other words, the worst case is worse than the compact, 24-bit
representation! However, the total is better: sum(w[0] for w in lengths)
reports 281,956 bits, or 35,245 bytes. Packed appropriately, that shaves
off ~3.5kB, though it comes at the cost of losing random access, and
therefore binary search.
Speaking of bit packing, I’m ready to compress the entire word list into a
bit stream:
bits = "".join("".join(codes[c] for c in w) for w in words)
Where bits begins with:
11101110011100001101011101110010110001000111011101...
On the C side I’ll pack these into 32-bit integers, least significant bit
first. I abused textwrap to dice it up, and I also need to reverse each
set of bits before converting to an integer.
u32 = [int(b[::-1], 2) for b in textwrap.wrap(bits, width=32)]
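A quick sanity check on the bit order (not strictly necessary, but cheap):
every bit of the stream should be readable back out exactly the way the C
decoder will read it, i.e. word n>>5, bit position n&31, least significant
bit first.

assert all(int(bits[n]) == (u32[n >> 5] >> (n & 31)) & 1
           for n in range(len(bits)))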
I now have my compressed data as a sequence of 32-bit integers. Next, some
meta-programming:
print(f"static const uint32_t words[{len(u32)}] =", "{", end="")
for i, u in enumerate(u32):
if i%6 == 0:
print("\n ", end="")
print(f"0x{u:08x},", end="")
print("\n};")
That produces a C table, the beginnings of my decompressor. The array
length isn’t necessary since the C compiler can figure it out, but being
explicit allows human readers to know the size at a glance, too. Observe
how the final 32-bit integer isn’t entirely filled.
static const uint32_t words[8812] = {
0x4eeb0e77,0xb8caee23,0xffb892bb,0x397fddf2,0xddfcbfee,0x5ff7997f,
// ...
0x7b4e66bd,0x35ebcccd,0x8f9af60f,0x0000000c,
};
Now, how to go about building the rest of the decompressor? I have a
Huffman coding tree, which is an awful lot like a state machine,
eh? I can even have Python generate a state transition table from the
Huffman tree:
def transitions(tree, states, state):
    if isinstance(tree, tuple):
        child = len(states)
        states[state] = -child
        states.extend((None, None))
        transitions(tree[0], states, child+0)
        transitions(tree[1], states, child+1)
    else:
        states[state] = ord(tree)
    return states

states = transitions(tree, [None], 0)
The central idea: positive entries are leaves, and negative entries are
internal nodes. The negated value is the index of the left child, with the
right child immediately following. In transitions, the caller reserves
space in the state table for callees, hence starting with [None]. I’ll
show the actual table in C form after some more meta-programming:
print(f"static const int8_t states[{len(states)}] =", "{", end="")
for i, s in enumerate(states):
if i%12 == 0:
print("\n ", end="")
print(f"{s:4},", end="")
print("\n};")
I chose int8_t since I know these values will all fit in an octet, and it
must be signed because of the negatives. The result:
static const int8_t states[51] = {
-1, -3, -19, -5, -7, 101, 115, -9, -11, 116, 108, -13,
-17, 103, -15, 118, 119, 104, 109, -21, -39, -23, -27, 105,
-25, 112, 99, 114, -29, 121, -31, 102, -33, 122, -35, 106,
-37, 113, 120, -41, -45, 111, -43, 100, 117, 97, -47, 110,
-49, 107, 98,
};
The first node is -1, meaning if you read a 0 bit then transition to
state 1, else state 2 (i.e. the state immediately following 1). The
decompressor reads one bit at a time, walking the state table until it
hits a positive value, which is an ASCII code. I’ve decided on this
function prototype:
int32_t next(char word[5], int32_t n);
The n is the bit index, which starts at zero. The function decodes the
word at the given index, then returns the bit index for the next word.
Callers can iterate the entire word list without decompressing the whole
list at once. Finally the decompressor code:
int32_t next(char word[5], int32_t n)
{
    for (int i = 0; i < 5; i++) {
        int state = 0;
        for (; states[state] < 0; n++) {
            int b = words[n>>5]>>(n&31) & 1; // next bit
            state = b - states[state];
        }
        word[i] = states[state];
    }
    return n;
}
When compiled, this is about 80 bytes of instructions on both x86-64 and
ARM64. This, along with the 51 bytes for the state table, should be
counted against the compression size. That’s roughly 35,376 bytes total.
Trying it out, this program indeed reproduces the original word list:
int main(void)
{
    int32_t state = 0;
    char word[] = ".....\n";
    for (int i = 0; i < 12972; i++) {
        state = next(word, state);
        fwrite(word, 6, 1, stdout);
    }
}
Searching 12,972 words linearly isn’t too bad, even for an old 16-bit
machine. However, if you really need to speed it up, you could build a
little run-time index to track various bit positions in the list. For
example, the first word starting with b is at bit offset 15,743. If the
word I’m looking up begins with b, then I can start there and stop at the
first c, decompressing just 909 words.
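Sketching that index on the Python side (illustrative only; the tables
above don't include it): since the list is sorted, the first occurrence of
each starting letter marks where its block begins in the bit stream.

index, offset = {}, 0
for w in words:
    index.setdefault(w[0], offset)  # bit offset of first word with this initial
    offset += sum(len(codes[c]) for c in w)

These 26 offsets could be emitted as one more small C table, and
index['b'] should reproduce the 15,743 figure above.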
Taking it to the next level: run-length encoding
Here’s the 100-word word list sample again. The sorting is deliberate:
abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
If I look at words column-wise, I see a long run of a, then a long run of
b, etc. Even the second column has long runs. I should really exploit
this somehow. The first scheme would have worked equally well on a
shuffled list as on a sorted list, which is an indication that it’s storing
unnecessary information, namely the word list order. (Rule of thumb:
Compression should work better on sorted inputs.)
For this second scheme, I’ll pivot the whole list so that I can encode it
in column-order. (This is roughly how one part of bzip2 works, by the
way.) I’ll use run-length encoding (RLE) to communicate “91 ‘a’, 135 ‘b’,
etc.”, then I’ll encode these RLE tokens using Huffman coding, per the
first scheme, since there will be lots of repeated tokens.
First, pivot the word list:
pivot = "".join("".join(w[i] for w in words) for i in range(5))
Next compute the RLE token stream. The stream works in pairs, first
indicating a letter (1–26), then the run length.
tokens = []
offset = 0
while offset < len(pivot):
    c = pivot[offset]
    start = offset
    while offset < len(pivot) and pivot[offset] == c:
        offset += 1
    tokens.append(ord(c) - ord('a') + 1)
    tokens.append(offset - start)
I’ve biased the letter representation by 1 — i.e. 1–26 instead of 0–25 —
since I’m going to encode all the tokens using the same Huffman tree.
(Exercise for the reader: Does compression improve with two distinct
Huffman trees, one for letters and the other for runs?) There are no
zero-length runs, and I want there to be as few unique tokens as possible.
tokens looks like so (e.g. 737 ‘a’, 909 ‘b’, …):
[1, 737, 2, 909, 3, 922, 4, 685, 5, 303, 6, 598, ...]
The original Wordle list results in 139 unique tokens. A few tokens appear
many times, but most appear only once. Reusing my Huffman coding tree
builder from before:
tree = huffman(collections.Counter(tokens))
This makes for a more complex and interesting tree:
(1,
((((18, 20), (25, (((10, 24), (26, 22)), 8))),
(5,
((11,
((23,
((17,
(((35, (46, 76)), ((82, 93), (104, 111))),
(((165, 168), 27), (28, (((30, 39), 31), 38))))),
((((((40, 41), ((44, 48), 45)),
((53, (54, 56)), 55)),
((((57, 59), 58), ((60, 61), (62, 63))),
((64, (65, 66)), ((67, 70), 68)))),
(((((71, 75), 74), (77, (78, 79))),
(((80, 85), 87), 81)),
((((90, 91), (92, 97)), (96, (99, 100))),
(((101, 103), 102),
((105, 106), (109, 110)))))),
((((((113, 114), 117), ((120, 121), (125, 129))),
(((130, 133), (137, 139)), (138, (140, 142)))),
((((144, 145), (147, 153)), (148, (166, 175))),
(((181, 183), (187, 189)),
((193, 202), (220, 242))))),
(((((262, 303), (325, 376)),
((413, 489), (577, 598))),
(((628, 638), (685, 693)),
((737, 815), (859, 909)))),
((((922, 1565), 29), 32), (34, (33, 43)))))))),
6)),
3))),
((19, 2),
((4, (15, (21, 16))), ((14, 9), (12, (13, 7)))))))
Peeking at the first 21 elements of sorted(flatten(tree)), which chops off
the long tail of large-valued, single-occurrence tokens:
[(1, '0'), (8, '100111'), (15, '111010'),
(2, '1101'), (9, '111101'), (16, '1110111'),
(3, '10111'), (10, '10011000'), (17, '1011010100'),
(4, '11100'), (11, '101100'), (18, '10000'),
(5, '1010'), (12, '111110'), (19, '1100'),
(6, '1011011'), (13, '1111110'), (20, '10001'),
(7, '1111111'), (14, '111100'), (21, '1110110')]
Huffman-encoding the RLE stream is more straightforward:
codes = dict(flatten(tree))
bits = "".join(codes[token] for token in tokens)
This time len(bits) is 164,958, or 20,620 bytes! A huge difference,
around 40% additional savings!
Slicing and dicing 32-bit integers and printing the table works the same
as before. However, this time the state table has larger values (e.g. that
run of 909), and so the state table will be int16_t. I copy-pasted the
original meta-programming code and made the appropriate adjustments:
static const int16_t states[277] = {
-1, 1, -3, -5,-257, -7, -21, -9, -11, 18, 20, 25,
-13, -15, 8, -17, -19, 10, 24, 26, 22, 5, -23, -25,
3, 11, -27, -29, 6, 23, -31, -33, -63, 17, -35, -37,
-49, -39, -43, 35, -41, 46, 76, -45, -47, 82, 93, 104,
111, -51, -55, -53, 27, 165, 168, 28, -57, -59, 38, -61,
31, 30, 39, -65,-155, -67,-109, -69, -85, -71, -79, -73,
-75, 40, 41, -77, 45, 44, 48, -81, 55, 53, -83, 54,
56, -87, -99, -89, -93, -91, 58, 57, 59, -95, -97, 60,
61, 62, 63,-101,-105, 64,-103, 65, 66,-107, 68, 67,
70,-111,-129,-113,-123,-115,-119,-117, 74, 71, 75, 77,
-121, 78, 79,-125, 81,-127, 87, 80, 85,-131,-143,-133,
-139,-135,-137, 90, 91, 92, 97, 96,-141, 99, 100,-145,
-149,-147, 102, 101, 103,-151,-153, 105, 106, 109, 110,-157,
-213,-159,-185,-161,-173,-163,-167,-165, 117, 113, 114,-169,
-171, 120, 121, 125, 129,-175,-181,-177,-179, 130, 133, 137,
139, 138,-183, 140, 142,-187,-199,-189,-195,-191,-193, 144,
145, 147, 153, 148,-197, 166, 175,-201,-207,-203,-205, 181,
183, 187, 189,-209,-211, 193, 202, 220, 242,-215,-245,-217,
-231,-219,-225,-221,-223, 262, 303, 325, 376,-227,-229, 413,
489, 577, 598,-233,-239,-235,-237, 628, 638, 685, 693,-241,
-243, 737, 815, 859, 909,-247,-253,-249, 32,-251, 29, 922,
1565, 34,-255, 33, 43,-259,-261, 19, 2,-263,-269, 4,
-265, 15,-267, 21, 16,-271,-273, 14, 9, 12,-275, 13,
7,
};
(Since 277 is prime it will never wrap to a nice rectangle no matter what
width I plug in. Ugh.)
With column-wise compression it’s not possible to iterate a word at a
time. The entire list must be decompressed at once. The interface now
looks like so, where the caller supplies a 12972*5-byte buffer to be
filled:

void decompress(char *buf);
Exercise for the reader: Modify this to decompress into the 24-bit compact
form, so the caller only needs a 12972*3-byte buffer.
Here’s my decoder, much like before:
void decompress(char *buf)
{
    for (int32_t x = 0, y = 0, i = 0; i < 164958;) {
        // Decode letter
        int state = 0;
        for (; states[state] < 0; i++) {
            int b = words[i>>5]>>(i&31) & 1;
            state = b - states[state];
        }
        int c = states[state] + 96;
        // Decode run-length
        state = 0;
        for (; states[state] < 0; i++) {
            int b = words[i>>5]>>(i&31) & 1;
            state = b - states[state];
        }
        int len = states[state];
        // Fill columns
        for (int n = 0; n < len; n++, y++) {
            buf[y*5+x] = c;
        }
        if (y == 12972) {
            y = 0;
            x++;
        }
    }
}
And my new test exactly reproduces the original list:
int main(void)
{
    char buf[12972*5L];
    decompress(buf);
    char word[] = ".....\n";
    for (int i = 0; i < 12972; i++) {
        memcpy(word, buf+i*5, 5);
        fwrite(word, 6, 1, stdout);
    }
}
Totalling it up:
- Compressed data is 20,620 bytes
- State table is 554 bytes
- Decompressor is about 200 bytes
That’s a total of 21,374 bytes. Surprisingly this beats general purpose
compressors!
PROGRAM      VERSION  SIZE
bzip2 -9     1.0.8    33,752
gzip -9      1.10     30,338
zstd -19     1.4.8    27,098
brotli -9    1.0.9    26,031
xz -9e       5.2.5    16,656
lzip -9      1.22     16,608
Only xz and lzip come out ahead on the raw compressed data, but lose if
accounting for an embedded decompressor (on the order of 10kB). Clearly
there’s an advantage to customizing compression to a particular dataset.
Update: Johannes Rudolph pointed out a compression scheme from a Game Boy
Wordle clone last month that gets it down to 17,871 bytes and supports
iteration. I improved on this scheme to further reduce it to 16,659 bytes.