Articles tagged compression at null program

QOI is now my favorite asset format

2022-12-18T03:45:44Z

This article was discussed on Hacker News.

The Quite OK Image (QOI) format was announced late last year and finalized into a specification a month later. I was initially dismissive, but a revisit has left me impressed. The format hits a sweet spot in the trade-off space between complexity, speed, and compression ratio. Considering its alpha channel support as well, QOI has become my default choice for embedded image assets. It’s not perfect, but at the very least it’s a solid foundation.

Since I’m now working with QOI images, I need a good QOI viewer, and so I added support to my ill-named pbmview tool, which I wrote to serve the same purpose for Netpbm. I will continue to use Netpbm as an output format, especially for raw video output, but no longer will I use it for an embedded asset (nor re-invent yet another RLE over Netpbm).

I was dismissive because the website claimed, and still claims today, QOI images are “a similar size” to PNG. However, for the typical images where I would use PNG, QOI is around 3x larger, and some outliers are far worse. The 745 PNGs on my blog — a perfect test corpus for my own needs — convert to QOIs 2.8x larger on average. The official QOI benchmark has much better results, 1.3x larger, but that’s because it includes a lot of photography where PNG and QOI both do poorly, making QOI seem more comparable.

However, as I said, QOI’s strength is its trade-off sweet spot. The specification is one page, and an experienced developer can write a complete implementation from scratch in a single sitting. My own implementation is about 100 lines of libc-free C for each of the encoder and decoder. With error checking removed, my decoder is ~600 bytes of x86 object code — a great story for embedding alongside assets. It’s more complex than Netpbm or farbfeld, but it’s far simpler than BMP. I’ve already begun experimenting with converting assets to QOI, and the results have so far exceeded my expectations.

To my surprise, the encoder was easier to write than the decoder. The format is so straightforward that two different encoders will produce identical files. There’s little room for specialized optimization, and no meaningful “compression level” knob.

Criticism

There are a lot of dimensions on which QOI could be improved, but most cases involve trade-offs, e.g. more complexity for better compression. The areas where QOI could have been strictly better, the dimensions on which it is not on the Pareto frontier, are more meaningful criticisms — missed opportunities. My criticisms of this kind:

Big endian fields are an odd choice for a 2020s file format. Little endian dominates the industry, and it would have made for a slightly smaller decoder footprint on typical machines today if QOI used little endian.
The header has two flags and spends an entire byte on each. It should have instead had a flag byte, with two bits assigned to these flags. One flag indicates if the alpha channel is important, and the other selects between two color spaces (sRGB, linear). Both flags are only advisory.
The 4-channel encoded pixel format is ABGR (or RGBA), placing the alpha channel next to the blue channel. This is somewhat unconventional. A decoder is likely to use a single load into 32-bit integer, and ideally it’s already in the desired format or close to it. A few times already I’ve had to shuffle the RGB bytes within the 32-bit sample to be compatible with some other format. QOI channel ordering is arbitrary, and I would have chosen ARGB (when viewed as little endian).
The QOI hash function operates on channels individually, with individual overflow, making it slower and larger than necessary. The hash function should have been over a packed 32-bit sample. I would have used a multiplication by a carefully-chosen 32-bit integer, then a right shift using the highest 6 bits of the result for the index.
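To make the hash criticism concrete, here is a rough Python sketch of both approaches. The first function follows the index hash given in the QOI specification. The second is a multiplicative hash over the packed sample along the lines described above; the multiplier 0x9E3779B1 is only an illustrative assumption, not a constant taken from this article or from QOI.

```python
MASK32 = 0xFFFFFFFF

def qoi_hash(r, g, b, a):
    """Index hash from the QOI specification: one multiply per channel."""
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64

def mul_hash(abgr, k=0x9E3779B1):
    """Multiplicative hash of a packed 32-bit sample.

    k is a hypothetical odd multiplier chosen for illustration only.
    The top 6 bits of the 32-bit product select one of 64 slots.
    """
    return ((abgr * k) & MASK32) >> 26

# Opaque black, as separate channels and as a packed ABGR sample.
print(qoi_hash(0, 0, 0, 255))   # 53
print(mul_hash(0xFF000000))     # some slot in 0..63
```

The packed version needs one multiply and one shift per pixel instead of four multiplies and three adds, which is where the size and speed savings would come from.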

More subjective criticisms that might count as having trade-offs:

Given a “flag byte” (mentioned above) it would have been free to assign another flag bit indicating pre-multiplied alpha, also still advisory. You want to use pre-multiplied alpha for your assets, and the option to store them this way would help.
There’s an 8-byte end-of-stream marker (a bit excessive), deliberately an invalid encoding so that reads past the end of the image will result in a decoding error. I probably would have chosen a dead simple 32-bit checksum of packed 32-bit image samples, even if literally a sum.

Of course, you’re not obligated to follow QOI exactly to spec for your own assets, so you could always use a modified QOI with one or more of these tweaks. That’s what I meant about it being a solid foundation: You don’t have to start from scratch with some custom RLE. Since the format is so simple, you can easily build your own tools — as I’ve already begun doing myself — so you don’t need to rely on tools supporting your QOI fork.

Minimalist API

I’m really happy with my QOI implementation, particularly since it’s another example of a minimalist C API: no allocating, no input or output, and no standard library use. As usual, the expectation is that it’s in the same translation unit where it’s used, so it’s likely inlined into callers.

The encoder is streaming — it accepts and returns only a little bit of input and output at a time. It has three functions and one struct with no “public” fields:

struct qoiencoder qoiencoder(void *buf, int w, int h, const char *flags);
int qoiencode(struct qoiencoder *, void *buf, unsigned color);
int qoifinish(struct qoiencoder *, void *buf);

The first function initializes an encoder and writes a fixed-length header into the QOI buffer. The flags field is a mode string, like fopen. I would normally use bit flags, but this is a little experiment. The second function encodes a single pixel into the QOI buffer, returning the number of bytes written (possibly zero). The last flushes any encoding state and writes the end-of-stream marker. There are no errors. My typical use so far looks like:

char buf[16];
struct qoiencoder q = qoiencoder(buf, width, height, "a");
fwrite(buf, QOIHDRLEN, 1, file);
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        // ... compute 32-bit ABGR sample at (x, y) ...
        fwrite(buf, qoiencode(&q, buf, abgr), 1, file);
    }
}
fwrite(buf, qoifinish(&q, buf), 1, file);
fflush(file);
return ferror(file);

This appends encoder outputs to a buffered stream, but it could just as well accumulate directly into a larger buffer, advancing the write pointer a little after each call.

The decoder is two functions, but its struct has some “public” fields.

struct qoidecoder {
    int width, height;
    _Bool alpha, srgb, error;
    // ...
};
struct qoidecoder qoidecoder(const void *buf, int len);
static unsigned qoidecode(struct qoidecoder *);

The input is not streamed and the entire buffer must be loaded into memory at once — not too bad since it’s compressed, and perhaps even already loaded as part of the executable image — but the output is streamed, delivering one packed 32-bit ABGR sample per call. The decoder makes no assumptions about the output format, and the caller unpacks samples and stores them in whatever format is appropriate (shader texture, etc.).

To make it easier to use, my decoder range checks to guarantee that width and height can be multiplied without overflow. Unlike encoding, there may be errors due to invalid input, including that failed range check. The decoder error flag is “sticky” and the decoder returns zero samples when in an error state, so callers can wait to check for errors until the end. (Though if you’re only decoding embedded assets, then there are no practical errors, and checks can be removed/ignored.)

Example usage, copied almost verbatim from a real program:

int loadimage(Image *image, const uint8_t *qoi, int len)
{
    struct qoidecoder q = qoidecoder(qoi, len);
    if (/* image dimensions too large */) {
        return 0;
    }
    image->width  = q.width;
    image->height = q.height;
    int count = q.width * q.height;
    for (int i = 0; i < count; i++) {
        unsigned abgr = qoidecode(&q);
        image->data[4*i+0] = abgr >> 16;
        image->data[4*i+1] = abgr >>  8;
        image->data[4*i+2] = abgr >>  0;
        image->data[4*i+3] = abgr >> 24;
    }
    return !q.error;
}

Note the aforementioned awkward RGB shuffle.

It’s safe to say that I’m excited about QOI, and that it now has a permanent slot on my developer toolbelt.

Compressing and embedding a Wordle word list

2022-03-07T03:22:41Z

Wordle is all the rage, resulting in an explosion of hobbyist clones, with new ones appearing every day. At the current rate I estimate by the end of 2022 that 99% of all new software releases will be Wordle clones. That’s no surprise since the rules are simple, it’s more fun to implement and study than to actually play, and the hard part is building a decent user interface. Such implementations go back at least 30 years. Implementers get to decide on a platform, language, and the particular subject of this article: how to handle the word list. Is it a separate file/database or embedded in the program? If embedded, is it worth compressing? In this article I’ll present a simple, tailored Wordle list compression strategy that beats general purpose compressors.

Last week one particular QuickBASIC clone, WorDOSle, caught my eye. It embeds its word list despite the dire constraints of its 16-bit platform. The original Wordle list (1, 2) has 12,972 words which, naively stored, would consume 77,832 bytes (5 letters, plus newline). Sadly this exceeds a 16-bit address space. Eliminating the redundant newline delimiter brings it down to 64,860 bytes — just small enough to fit in an 8086 segment, but probably still difficult to manage from QuickBASIC.

The author made a trade-off, reducing the word list to a more manageable, if meager, 2,318 words, wisely excluding delimiters. Otherwise no further effort was made towards reducing the size. The list is sorted, and the program cleverly tests words against the list in place using a binary search.

Compaction baseline

Before getting into any real compression technologies, there’s low hanging fruit to investigate. Words are exactly five, case-insensitive, English language letters: a–z. To illustrate, here are the first 100 5-letter words from a short Wordle word list.

abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands

In ASCII/UTF-8 form it’s 8 bits per letter, 5 bytes per word, but I only need 5 bits per letter, or more specifically, ~4.7 bits (log2(26)) per letter. If I instead treat each word as a base-26 number, I can pack each word into 3 bytes (26**5 is ~23.5 bits). A 40% savings just by using a smarter representation.

With 12,972 words, that’s 38,916 bytes for the whole list. Any compression I apply must at least beat this size in order to be worth using.
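The base-26 packing can be sketched in a few lines of Python. The function names and the big-endian byte order are assumptions made for this illustration; the article does not specify a layout.

```python
def pack(word):
    """Pack a 5-letter lowercase word into 3 bytes as a base-26 number."""
    n = 0
    for c in word:
        n = n * 26 + (ord(c) - ord("a"))
    return n.to_bytes(3, "big")

def unpack(packed):
    """Inverse of pack(): recover the 5-letter word from 3 bytes."""
    n = int.from_bytes(packed, "big")
    letters = []
    for _ in range(5):
        n, d = divmod(n, 26)
        letters.append(chr(d + ord("a")))
    return "".join(reversed(letters))

assert 26**5 <= 2**24   # every 5-letter word fits in 24 bits
print(pack("abbey").hex(), unpack(pack("abbey")))
```

With the first letter as the most significant digit and big-endian bytes, packed entries sort in the same order as the words themselves, so a binary search over the 3-byte records still works.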

Letter frequency

Not all letters occur at the same frequency. Here’s the letter frequency for the original Wordle word list:

a:5990  e:6662  i:3759  m:1976  q: 112  u:2511  y:2074
b:1627  f:1115  j: 291  n:2952  r:4158  v: 694  z: 434
c:2028  g:1644  k:1505  o:4438  s:6665  w:1039
d:2453  h:1760  l:3371  p:2019  t:3295  x: 288

When encoding a word, I can save space by spending fewer bits on frequent letters like e at the cost of spending more bits on infrequent letters like q. There are multiple approaches, but the simplest is Huffman coding. It’s not the most efficient, but it’s so easy I can almost code it in my sleep.

While my ultimate target is C, I did the frequency analysis, explored the problem space, and implemented my compressors in Python. I don’t normally like to use Python, but it is good for one-shot, disposable data science-y stuff like this. The decompressor will be implemented in C, partially via meta-programming: Python code generating my C code. Here’s my letter histogram code:

import collections
import itertools
import sys

words = [line[:5] for line in sys.stdin]
hist = collections.defaultdict(int)
for c in itertools.chain(*words):
    hist[c] += 1

To build a Huffman coding tree, I’ll need a min-heap (priority queue) initially filled with nodes representing each letter and its frequency. While the heap has more than one element, I pop off the two lowest frequency nodes, create a new parent node with the sum of their frequencies, and push it into the heap. When the heap has one element, the remaining element is the root of the Huffman coding tree.

import heapq

def huffman(hist):
    heap = [(n, c) for c, n in hist.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        node = a[0]+b[0], (a[1], b[1])
        heapq.heappush(heap, node)
    return heap[0][1]

tree = huffman(hist)

(By the way, I love that heapq operates directly on a plain list rather than being its own data structure.) This produces the following Huffman coding tree (via pprint):

((('e', 's'),
  (('t', 'l'), (('g', ('v', 'w')), ('h', 'm')))),
 ((('i', ('p', 'c')),
   ('r', ('y', ('f', ('z', ('j', ('q', 'x'))))))),
  (('o', ('d', 'u')), ('a', ('n', ('k', 'b'))))))

It would be more useful to actually see the encodings.

def flatten(tree, prefix=""):
    if isinstance(tree, tuple):
        return flatten(tree[0], prefix+"0") + \
               flatten(tree[1], prefix+"1")
    else:
        return [(tree, prefix)]

I used isinstance to distinguish leaves (str) from internal nodes (tuple). With sorted(flatten(tree)), I get something like Morse Code:

[('a', '1110'),       ('j', '10111110'),   ('s', '001'),
 ('b', '111111'),     ('k', '111110'),     ('t', '0100'),
 ('c', '10011'),      ('l', '0101'),       ('u', '11011'),
 ('d', '11010'),      ('m', '01111'),      ('v', '011010'),
 ('e', '000'),        ('n', '11110'),      ('w', '011011'),
 ('f', '101110'),     ('o', '1100'),       ('x', '101111111'),
 ('g', '01100'),      ('p', '10010'),      ('y', '10110'),
 ('h', '01110'),      ('q', '101111110'),  ('z', '1011110')]
 ('i', '1000'),       ('r', '1010'),

In terms of encoded bit length, what is the shortest and longest?

codes = dict(flatten(tree))
lengths = [(sum(len(codes[c]) for c in w), w) for w in words]

min(lengths) is “esses” at 15 bits, and max(lengths) is “qajaq” at 34 bits. In other words, the worst case is worse than the compact, 24-bit representation! However, the total is better: sum(w[0] for w in lengths) reports 281,956 bits, or 35,245 bytes. Packed appropriately, that shaves off ~3.5kB, though it comes at the cost of losing random access, and therefore binary search.
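The total can be cross-checked without the word list at all, using only the letter counts and the code lengths from the two tables above. The sketch below also computes the order-0 entropy of the letter distribution, which is a lower bound for any code that treats letters independently; that comparison is an addition for illustration, not a figure from the article.

```python
import math

# Letter counts and Huffman code lengths copied from the tables above.
hist = {
    "a": 5990, "b": 1627, "c": 2028, "d": 2453, "e": 6662, "f": 1115,
    "g": 1644, "h": 1760, "i": 3759, "j": 291,  "k": 1505, "l": 3371,
    "m": 1976, "n": 2952, "o": 4438, "p": 2019, "q": 112,  "r": 4158,
    "s": 6665, "t": 3295, "u": 2511, "v": 694,  "w": 1039, "x": 288,
    "y": 2074, "z": 434,
}
length = {
    "a": 4, "b": 6, "c": 5, "d": 5, "e": 3, "f": 6, "g": 5, "h": 5,
    "i": 4, "j": 8, "k": 6, "l": 4, "m": 5, "n": 5, "o": 4, "p": 5,
    "q": 9, "r": 4, "s": 3, "t": 4, "u": 5, "v": 6, "w": 6, "x": 9,
    "y": 5, "z": 7,
}

total_letters = sum(hist.values())
huffman_bits = sum(hist[c] * length[c] for c in hist)
entropy_bits = -sum(n * math.log2(n / total_letters) for n in hist.values())

print(total_letters)                 # 64860 (12972 words * 5 letters)
print(huffman_bits)                  # 281956
print(math.ceil(huffman_bits / 8))   # 35245
print(round(entropy_bits))           # order-0 entropy bound, in bits
```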

Speaking of bit packing, I’m ready to compress the entire word list into a bit stream:

bits = "".join("".join(codes[c] for c in w) for w in words)

Where bits begins with:

11101110011100001101011101110010110001000111011101...

On the C side I’ll pack these into 32-bit integers, least significant bit first. I abused textwrap to dice it up, and I also need to reverse each set of bits before converting to an integer.

u32 = [int(b[::-1], 2) for b in textwrap.wrap(bits, width=32)]
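To see why each chunk gets reversed, here is a small sketch that packs the first 32 bits shown above and reads them back one at a time. It uses plain slicing instead of textwrap, and the helper names are made up for this example.

```python
def pack_lsb_first(bits):
    """Pack a string of '0'/'1' into 32-bit integers, earliest bit lowest."""
    out = []
    for i in range(0, len(bits), 32):
        chunk = bits[i:i+32]
        out.append(int(chunk[::-1], 2))   # reverse so bit 0 is the first bit
    return out

def get_bit(words, n):
    """Fetch bit n: pick the 32-bit word, then shift the bit down."""
    return words[n >> 5] >> (n & 31) & 1

head = "11101110011100001101011101110010"   # first 32 bits of the stream
packed = pack_lsb_first(head)
print(hex(packed[0]))   # 0x4eeb0e77

# Reading the bits back in order reproduces the original stream.
assert "".join(str(get_bit(packed, n)) for n in range(32)) == head
```

Because the earliest bit lands in the lowest position, a reader only needs a shift and a mask to fetch bit n, with no per-word reversal at run time.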

I now have my compressed data as a sequence of 32-bit integers. Next, some meta-programming:

print(f"static const uint32_t words[{len(u32)}] =", "{", end="")
for i, u in enumerate(u32):
    if i%6 == 0:
        print("\n    ", end="")
    print(f"0x{u:08x},", end="")
print("\n};")

That produces a C table, the beginnings of my decompressor. The array length isn’t necessary since the C compiler can figure it out, but being explicit allows human readers to know the size at a glance, too. Observe how the final 32-bit integer isn’t entirely filled.

static const uint32_t words[8812] = {
    0x4eeb0e77,0xb8caee23,0xffb892bb,0x397fddf2,0xddfcbfee,0x5ff7997f,
    // ...
    0x7b4e66bd,0x35ebcccd,0x8f9af60f,0x0000000c,
};

Now, how to go about building the rest of the decompressor? I have a Huffman coding tree, which is an awful lot like a state machine, eh? I can even have Python generate a state transition table from the Huffman tree:

def transitions(tree, states, state):
    if isinstance(tree, tuple):
        child = len(states)
        states[state] = -child
        states.extend((None, None))
        transitions(tree[0], states, child+0)
        transitions(tree[1], states, child+1)
    else:
        states[state] = ord(tree)
    return states

states = transitions(tree, [None], 0)

The central idea: positive entries are leaves, and negative entries are internal nodes. The negated value is the index of the left child, with the right child immediately following. In transitions, the caller reserves space in the state table for callees, hence starting with [None]. I’ll show the actual table in C form after some more meta-programming:

print(f"static const int8_t states[{len(states)}] =", "{", end="")
for i, s in enumerate(states):
    if i%12 == 0:
        print("\n    ", end="")
    print(f"{s:4},", end="")
print("\n};")

I chose int8_t since I know these values will all fit in an octet, and it must be signed because of the negatives. The result:

static const int8_t states[51] = {
      -1,  -3, -19,  -5,  -7, 101, 115,  -9, -11, 116, 108, -13,
     -17, 103, -15, 118, 119, 104, 109, -21, -39, -23, -27, 105,
     -25, 112,  99, 114, -29, 121, -31, 102, -33, 122, -35, 106,
     -37, 113, 120, -41, -45, 111, -43, 100, 117,  97, -47, 110,
     -49, 107,  98,
};

The first node is -1, meaning if you read a 0 bit then transition to state 1, else state 2 (i.e. the entry immediately following state 1). The decompressor reads one bit at a time, walking the state table until it hits a positive value, which is an ASCII code. I’ve decided on this function prototype:

int32_t next(char word[5], int32_t n);

The n is the bit index, which starts at zero. The function decodes the word at the given index, then returns the bit index for the next word. Callers can iterate the entire word list without decompressing the whole list at once. Finally the decompressor code:

int32_t next(char word[5], int32_t n)
{
    for (int i = 0; i < 5; i++) {
        int state = 0;
        for (; states[state] < 0; n++) {
            int b = words[n>>5]>>(n&31) & 1;  // next bit
            state = b - states[state];
        }
        word[i] = states[state];
    }
    return n;
}

When compiled, this is about 80 bytes of instructions, both x86-64 and ARM64. This, along with the 51 bytes for the state table, should be counted against the compression size. With the 8,812-entry table at 35,248 bytes, that’s 35,379 bytes total.
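The table walk is easy to trace by hand. Below is a Python port of the same loop, offered as a sketch for experimentation rather than part of the real decompressor. It uses the state table from above and only the first 32-bit entry of the word table, which is enough to decode one word.

```python
STATES = [
     -1,  -3, -19,  -5,  -7, 101, 115,  -9, -11, 116, 108, -13,
    -17, 103, -15, 118, 119, 104, 109, -21, -39, -23, -27, 105,
    -25, 112,  99, 114, -29, 121, -31, 102, -33, 122, -35, 106,
    -37, 113, 120, -41, -45, 111, -43, 100, 117,  97, -47, 110,
    -49, 107,  98,
]
WORDS = [0x4eeb0e77]   # first table entry only: covers bits 0..31

def next_word(n):
    """Decode the 5-letter word starting at bit index n."""
    word = []
    for _ in range(5):
        state = 0
        while STATES[state] < 0:                 # negative: internal node
            b = WORDS[n >> 5] >> (n & 31) & 1    # next bit, LSB first
            state = b - STATES[state]            # left child, or left+1
            n += 1
        word.append(chr(STATES[state]))          # positive: ASCII letter
    return "".join(word), n

print(next_word(0))   # ('aahed', 21)
```

The first word costs 21 bits: 4 each for the two a's, 5 for h, 3 for e, and 5 for d, matching the code table.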

Trying it out, this program indeed reproduces the original word list:

int main(void)
{
    int32_t state = 0;
    char word[] = ".....\n";
    for (int i = 0; i < 12972; i++) {
        state = next(word, state);
        fwrite(word, 6, 1, stdout);
    }
}

Searching 12,972 words linearly isn’t too bad, even for an old 16-bit machine. However, if you really need to speed it up, you could build a little run time index to track various bit positions in the list. For example, the first word starting with b is at bit offset 15,743. If the word I’m looking up begins with b then I can start there and stop at the first c, decompressing just 909 words.
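Such an index could look like the following sketch. For brevity it derives the offsets from code lengths rather than from a decoding pass; at run time the same numbers fall out of iterating the list once and noting the bit index whenever the first letter changes. The three-letter code and word list here are toy stand-ins, not the real tables.

```python
def build_index(words, codes):
    """Map each initial letter to the bit offset of its first word."""
    index = {}
    offset = 0
    for w in words:                          # words must already be sorted
        index.setdefault(w[0], offset)       # record first occurrence only
        offset += sum(len(codes[c]) for c in w)
    return index

# Hypothetical prefix code and word list, for illustration only.
codes = {"a": "0", "b": "10", "c": "11"}
words = ["aaaaa", "abcab", "baaaa", "bcbcb", "caaaa"]
print(build_index(words, codes))   # {'a': 0, 'b': 13, 'c': 29}
```

A lookup for a word starting with b would begin decoding at bit 13 and stop at bit 29, skipping everything else.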

Taking it to the next level: run-length encoding

Here’s the 100-word word list sample again. The sorting is deliberate:

abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands

If I look at words column-wise, I see a long run of a, then a long run of b, etc. Even the second column has long runs. I should really exploit this somehow. The first scheme would have worked equally well on a shuffled list as on a sorted list, which is an indication that it’s storing unnecessary information, namely the word list order. (Rule of thumb: Compression should work better on sorted inputs.)

For this second scheme, I’ll pivot the whole list so that I can encode it in column-order. (This is roughly how one part of bzip2 works, by the way.) I’ll use run-length encoding (RLE) to communicate “91 ‘a’, 135 ‘b’, etc.”, then I’ll encode these RLE tokens using Huffman coding, per the first scheme, since there will be lots of repeated tokens.

First, pivot the word list:

pivot = "".join("".join(w[i] for w in words) for i in range(5))

Next compute the RLE token stream. The stream works in pairs, first indicating a letter (1–26), then the run length.

tokens = []
offset = 0
while offset < len(pivot):
    c = pivot[offset]
    start = offset
    while offset < len(pivot) and pivot[offset] == c:
        offset += 1
    tokens.append(ord(c) - ord('a') + 1)
    tokens.append(offset - start)

I’ve biased the letter representation by 1 — i.e. 1–26 instead of 0–25 — since I’m going to encode all the tokens using the same Huffman tree. (Exercise for the reader: Does compression improve with two distinct Huffman trees, one for letters and the other for runs?) There are no zero-length runs, and I want there to be as few unique tokens as possible.

tokens looks like so (e.g. 737 ‘a’, 909 ‘b’, …):

[1, 737, 2, 909, 3, 922, 4, 685, 5, 303, 6, 598, ...]

The original Wordle list results in 139 unique tokens. A few tokens appear many times, but most appear only once. Reusing my Huffman coding tree builder from before:

tree = huffman(collections.Counter(tokens))

This makes for a more complex and interesting tree:

(1,
 ((((18, 20), (25, (((10, 24), (26, 22)), 8))),
   (5,
    ((11,
      ((23,
        ((17,
          (((35, (46, 76)), ((82, 93), (104, 111))),
           (((165, 168), 27), (28, (((30, 39), 31), 38))))),
         ((((((40, 41), ((44, 48), 45)),
             ((53, (54, 56)), 55)),
            ((((57, 59), 58), ((60, 61), (62, 63))),
             ((64, (65, 66)), ((67, 70), 68)))),
           (((((71, 75), 74), (77, (78, 79))),
             (((80, 85), 87), 81)),
            ((((90, 91), (92, 97)), (96, (99, 100))),
             (((101, 103), 102),
              ((105, 106), (109, 110)))))),
          ((((((113, 114), 117), ((120, 121), (125, 129))),
             (((130, 133), (137, 139)), (138, (140, 142)))),
            ((((144, 145), (147, 153)), (148, (166, 175))),
             (((181, 183), (187, 189)),
              ((193, 202), (220, 242))))),
           (((((262, 303), (325, 376)),
              ((413, 489), (577, 598))),
             (((628, 638), (685, 693)),
              ((737, 815), (859, 909)))),
            ((((922, 1565), 29), 32), (34, (33, 43)))))))),
       6)),
     3))),
  ((19, 2),
   ((4, (15, (21, 16))), ((14, 9), (12, (13, 7)))))))

Peeking at the first 21 elements of sorted(flatten(tree)), which chops off the long tail of large-valued, single-occurrence tokens:

[(1, '0'),            (8, '100111'),       (15, '111010'),
 (2, '1101'),         (9, '111101'),       (16, '1110111'),
 (3, '10111'),        (10, '10011000'),    (17, '1011010100'),
 (4, '11100'),        (11, '101100'),      (18, '10000'),
 (5, '1010'),         (12, '111110'),      (19, '1100'),
 (6, '1011011'),      (13, '1111110'),     (20, '10001'),
 (7, '1111111'),      (14, '111100'),      (21, '1110110')]

Huffman-encoding the RLE stream is more straightforward:

codes = dict(flatten(tree))
bits = "".join(codes[token] for token in tokens)

This time len(bits) is 164,958, or 20,620 bytes! A huge difference, around 40% additional savings!

Slicing and dicing 32-bit integers and printing the table works the same as before. However, this time the state table has larger values (e.g. that run of 909), and so the state table will be int16_t. I copy-pasted the original meta-programming code and made the appropriate adjustments:

static const int16_t states[277] = {
      -1,   1,  -3,  -5,-257,  -7, -21,  -9, -11,  18,  20,  25,
     -13, -15,   8, -17, -19,  10,  24,  26,  22,   5, -23, -25,
       3,  11, -27, -29,   6,  23, -31, -33, -63,  17, -35, -37,
     -49, -39, -43,  35, -41,  46,  76, -45, -47,  82,  93, 104,
     111, -51, -55, -53,  27, 165, 168,  28, -57, -59,  38, -61,
      31,  30,  39, -65,-155, -67,-109, -69, -85, -71, -79, -73,
     -75,  40,  41, -77,  45,  44,  48, -81,  55,  53, -83,  54,
      56, -87, -99, -89, -93, -91,  58,  57,  59, -95, -97,  60,
      61,  62,  63,-101,-105,  64,-103,  65,  66,-107,  68,  67,
      70,-111,-129,-113,-123,-115,-119,-117,  74,  71,  75,  77,
    -121,  78,  79,-125,  81,-127,  87,  80,  85,-131,-143,-133,
    -139,-135,-137,  90,  91,  92,  97,  96,-141,  99, 100,-145,
    -149,-147, 102, 101, 103,-151,-153, 105, 106, 109, 110,-157,
    -213,-159,-185,-161,-173,-163,-167,-165, 117, 113, 114,-169,
    -171, 120, 121, 125, 129,-175,-181,-177,-179, 130, 133, 137,
     139, 138,-183, 140, 142,-187,-199,-189,-195,-191,-193, 144,
     145, 147, 153, 148,-197, 166, 175,-201,-207,-203,-205, 181,
     183, 187, 189,-209,-211, 193, 202, 220, 242,-215,-245,-217,
    -231,-219,-225,-221,-223, 262, 303, 325, 376,-227,-229, 413,
     489, 577, 598,-233,-239,-235,-237, 628, 638, 685, 693,-241,
    -243, 737, 815, 859, 909,-247,-253,-249,  32,-251,  29, 922,
    1565,  34,-255,  33,  43,-259,-261,  19,   2,-263,-269,   4,
    -265,  15,-267,  21,  16,-271,-273,  14,   9,  12,-275,  13,
       7,
};

(Since 277 is prime it will never wrap to a nice rectangle no matter what width I plug in. Ugh.)

With column-wise compression it’s not possible to iterate a word at a time. The entire list must be decompressed at once. The interface now looks like so, where the caller supplies a 12972*5-byte buffer to be filled:

void decompress(char *);

Exercise for the reader: Modify this to decompress into the 24-bit compact form, so the caller only needs a 12972*3-byte buffer.

Here’s my decoder, much like before:

void decompress(char *buf)
{
    for (int32_t x = 0, y = 0, i = 0; i < 164958;) {
        // Decode letter
        int state = 0;
        for (; states[state] < 0; i++) {
            int b = words[i>>5]>>(i&31) & 1;
            state = b - states[state];
        }
        int c = states[state] + 96;

        // Decode run-length
        state = 0;
        for (; states[state] < 0; i++) {
            int b = words[i>>5]>>(i&31) & 1;
            state = b - states[state];
        }
        int len = states[state];

        // Fill columns
        for (int n = 0; n < len; n++, y++) {
            buf[y*5+x] = c;
        }
        if (y == 12972) {
            y = 0;
            x++;
        }
    }
}

And my new test exactly reproduces the original list:

int main(void)
{
    char buf[12972*5L];
    decompress(buf);

    char word[] = ".....\n";
    for (int i = 0; i < 12972; i++) {
        memcpy(word, buf+i*5, 5);
        fwrite(word, 6, 1, stdout);
    }
}

Totalling it up:

Compressed data is 20,620 bytes
State table is 554 bytes
Decompressor is about 200 bytes

That’s a total of 21,374 bytes. Surprisingly this beats general purpose compressors!

PROGRAM     VERSION   SIZE
bzip2 -9    1.0.8     33,752
gzip -9     1.10      30,338
zstd -19    1.4.8     27,098
brotli -9   1.0.9     26,031
xz -9e      5.2.5     16,656
lzip -9     1.22      16,608

Only xz and lzip come out ahead on the raw compressed data, but lose if accounting for an embedded decompressor (on the order of 10kB). Clearly there’s an advantage to customizing compression to a particular dataset.
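A similar comparison can be roughly approximated with Python's standard library, as in the sketch below. It will not reproduce the table exactly: the library bindings use different container framing than the command line tools, zstd, brotli, and lzip are not covered, and the sample input is a hypothetical stand-in for the real word list.

```python
import bz2
import lzma
import zlib

def compressed_sizes(data):
    """Return the compressed size in bytes for a few stdlib compressors."""
    return {
        "zlib -9": len(zlib.compress(data, 9)),
        "bz2 -9": len(bz2.compress(data, compresslevel=9)),
        "xz -9e": len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)),
    }

# Stand-in input; substitute the real newline-delimited word list.
sample = b"".join(w + b"\n" for w in [b"abbey", b"abide", b"about", b"above"])
for name, size in compressed_sizes(sample).items():
    print(f"{name:8} {size}")
```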

Update: Johannes Rudolph has pointed out a compression scheme for a Game Boy Wordle clone last month that gets it down to 17,871 bytes, and supports iteration. I improved on this scheme to further reduce it to 16,659 bytes.

Modifying the Middle of a zlib Stream

2016-09-09T03:37:03Z

I recently ran into a problem where I needed to modify bytes at the beginning of an existing zlib stream. My program creates a file in a format I do not control, and the file format has a header indicating the total, uncompressed data size, followed immediately by the data. The tricky part is that the header and data are zlib compressed together, and I don’t know how much data there will be until I’ve collected it all. Sometimes it’s many gigabytes. I don’t know how to fill out the header when I start, and I can’t rewrite it when I’m done since it’s compressed in the zlib stream … or so I thought.

My original solution was not to compress anything until it gathered the entirety of the data. The input would get concatenated into a huge buffer, then finally compressed and written out. It’s not ideal, because the program uses a lot more memory than it theoretically could, especially if the data is highly compressible. It would be far better to compress the data as it arrives and somehow update the header later.

My first thought was to ask zlib to leave the header uncompressed, then enable compression (deflateParams()) for the data. I’d work out the magic offset and overwrite the uncompressed header bytes once I’m done. There are two major issues with this, and I’ll address each:

zlib includes a checksum (adler32) at the end of the data, and editing the stream would cause a mismatch. This is fairly easy to correct thanks to adler32’s properties.
zlib is an LZ77-family compressor and compression comes from back-references into past (and sometimes future) bytes of decompressed output. Up to 32kB of data following the header could reference bytes in the header as a dictionary. I would need to ask zlib not to reference these bytes. Fortunately the zlib API is intentionally designed for this, though for different purposes.

Fixing the checksum

Ignoring the second problem for a moment, I could fix the checksum by computing it myself. When I overwrite my uncompressed header bytes, I could also overwrite the checksum at the end of the compressed stream. For illustration, here’s a simple example implementation of adler32 (from Wikipedia):

#define MOD_ADLER 65521

uint32_t
example_adler32(uint8_t *data, size_t len)
{
    uint32_t a = 1;
    uint32_t b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

If you think about this for a moment, you may notice this puts me back at square one. If I don’t know the header, then I don’t know the checksum value at the end of the header, going into the data buffer. I’d need to buffer all the data to compute the checksum. Fortunately adler32 has the nice property that two checksums can be concatenated as if they were one long stream. In a malicious context this is known as a length extension attack, but it’s a real benefit here.

It’s like the zlib authors anticipated my needs, because the zlib library has a function exactly for this:

uint32_t adler32_combine(uint32_t adler1, uint32_t adler2, size_t len2);

I just have to keep track of the data checksum adler2 and I can compute the proper checksum later.

uint64_t total = 0;
uint32_t data_adler = adler32(0, 0, 0); // initial value
while (processing_input) {
    // ...
    data_adler = adler32(data_adler, data, size);
    total += size;
}
// ...
uint32_t header_adler = adler32(0, 0, 0);
header_adler = adler32(header_adler, header, header_size);
uint32_t adler = adler32_combine(header_adler, data_adler, total);
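To make the concatenation property concrete, here is a minimal sketch of how a combine step could work without zlib. The sketch_ names are hypothetical, and this is not zlib’s actual implementation; it relies only on the adler32 definition shown earlier. The low half of the combined checksum is the sum of the two low halves minus the extra initial 1, and the high half picks up len2 copies of the first stream’s running sum.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MOD_ADLER 65521u

// Continue an adler32 checksum from a previous value (1 starts a new one).
static uint32_t
sketch_adler32(uint32_t adler, const uint8_t *data, size_t len)
{
    uint32_t a = adler & 0xffff;
    uint32_t b = adler >> 16;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % SKETCH_MOD_ADLER;
        b = (b + a) % SKETCH_MOD_ADLER;
    }
    return (b << 16) | a;
}

// Checksum of the concatenation of two streams, given only the checksum of
// each and the length of the second. Neither stream is re-read.
static uint32_t
sketch_adler32_combine(uint32_t adler1, uint32_t adler2, uint64_t len2)
{
    uint64_t a1 = adler1 & 0xffff, b1 = adler1 >> 16;
    uint64_t a2 = adler2 & 0xffff, b2 = adler2 >> 16;
    uint64_t rem = len2 % SKETCH_MOD_ADLER;
    // (a1 - 1) mod 65521, computed without going negative
    uint64_t a1m = (a1 + SKETCH_MOD_ADLER - 1) % SKETCH_MOD_ADLER;
    uint64_t a = (a1m + a2) % SKETCH_MOD_ADLER;
    uint64_t b = (b1 + b2 + rem * a1m) % SKETCH_MOD_ADLER;
    return (uint32_t)((b << 16) | a);
}
```

Checksumming "Wiki" and "pedia" separately and combining them this way should give the same value as checksumming "Wikipedia" in one pass.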

Preventing back-references

This part is more complicated and it helps to have some familiarity with zlib. Every time zlib is asked to compress data, it’s given a flush parameter. Under normal operation, this value is always Z_NO_FLUSH until the end of the stream, in which case it’s finalized with Z_FINISH. Other flushing options force it to emit data sooner at the cost of reduced compression ratio. This would primarily be used to eliminate output latency on interactive streams (e.g. compressed SSH sessions).

The necessary flush option for this situation is Z_FULL_FLUSH. It forces out all output data and resets the dictionary: a fence. Future inputs cannot reference anything before a full flush. Since the header is uncompressed, it will not reference itself either. Ignoring the checksum problem, I can safely modify these bytes.

Putting it all together

To fully demonstrate all of this, I’ve put together an example using one of my favorite image formats, Netpbm P6.

https://github.com/skeeto/zlib-mutate-demo

In the P6 format, the image header is an ASCII description of the image’s dimensions followed immediately by raw pixel data.

P6
width height
depth
[RGB bytes]

It’s a bit contrived, but it’s the project I used to work it all out. The demo reads arbitrary raw byte data on standard input and uses it to produce a zlib-compressed PPM file on standard output. It doesn’t know the size of the input ahead of time, nor does it naively buffer it all. There’s no dynamic allocation (except for what zlib does internally), but the program can process arbitrarily large input. The only requirement is that standard output is seekable. Using the technique described above, it patches the header within the zlib stream with the final image dimensions after the input has been exhausted.

If you’re on a Debian system, you can use zlib-flate to decompress raw zlib streams (gzip uses the same underlying DEFLATE compression, but can’t decompress raw zlib streams). Alternatively your system’s openssl program may have zlib support. Here it is running on itself as input. Remember, you can’t pipe it into zlib-flate because the output needs to be seekable in order to write the header.

$ ./zppm < zppm > out.ppmz
$ zlib-flate -uncompress < out.ppmz > out.ppm

Unfortunately due to the efficiency-mindedness of zlib, its use requires careful bookkeeping that’s easy to get wrong. It’s a little machine that at each step needs to be either fed more input or its output buffer cleared. Even with all the error checking stripped away, it’s still too much to go over in full here, but I’ll summarize the parts.

First I process an empty buffer with compression disabled. The output buffer will be discarded, so the input buffer could be left uninitialized, but I don’t want to upset anyone. All I need is the output size, which I use to seek over the to-be-written header. I use Z_FULL_FLUSH as described, and there’s no loop because I presume my output buffer is easily big enough for this.

char bufin[4096];
char bufout[4096];

z_stream z = {
    .next_in = (void *)bufin,
    .avail_in = HEADER_SIZE,
    .next_out = (void *)bufout,
    .avail_out = sizeof(bufout),
};
deflateInit(&z, Z_NO_COMPRESSION);
memset(bufin, 0, HEADER_SIZE);
deflate(&z, Z_FULL_FLUSH);
fseek(stdout, sizeof(bufout) - z.avail_out, SEEK_SET);

Next I enable compression and reset the checksum. This makes zlib track the checksum for all of the non-header input. Otherwise I’d be throwing away all its checksum work and repeating it myself.

deflateParams(&z, Z_BEST_COMPRESSION, Z_DEFAULT_STRATEGY);
z.adler = adler32(0, 0, 0);

I won’t include it in this article, but what follows is a standard zlib compression loop, consuming all the input data. There’s one key difference compared to a normal zlib compression loop: when the input is exhausted, instead of Z_FINISH I use Z_SYNC_FLUSH to force everything out. The problem with Z_FINISH is that it will write the checksum, but we’re not ready for that.

With all the input processed, it’s time to go back to rewrite the header. Rather than mess around with magic byte offsets, I start a second, temporary zlib stream and do the Z_FULL_FLUSH like before, but this time with the real header. In deciding the header size, I reserved 6 characters for the width and 10 characters for the height.

sprintf(bufin, "P6\n%-6lu\n%-10lu\n255\n", width, height);
uint32_t adler = adler32(0, 0, 0);
adler = adler32(adler, (void *)bufin, HEADER_SIZE);

z_stream zh = {
    .next_in = (void *)bufin,
    .avail_in = HEADER_SIZE,
    .next_out = (void *)bufout,
    .avail_out = sizeof(bufout),
};
deflateInit(&zh, Z_NO_COMPRESSION);
deflate(&zh, Z_FULL_FLUSH);
fseek(stdout, 0, SEEK_SET);
fwrite(bufout, 1, sizeof(bufout) - zh.avail_out, stdout);
fseek(stdout, 0, SEEK_END);
deflateEnd(&zh);
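The reason for reserving fixed field widths is that the placeholder header and the real header must be exactly the same length, whatever the final dimensions turn out to be. As a small sketch of that idea, assuming the same format string as above (the sketch_ names and the value 25 are my own illustration, not taken from the demo):

```c
#include <stddef.h>
#include <stdio.h>

// "P6\n" (3) + width (6) + "\n" (1) + height (10) + "\n" (1) + "255\n" (4)
#define SKETCH_HEADER_SIZE 25

// Format a P6 header with space-padded, left-justified dimensions so the
// byte count is the same for every width up to 6 digits and every height
// up to 10 digits. Netpbm accepts the extra whitespace between tokens.
static int
sketch_ppm_header(char *buf, size_t cap, unsigned long width,
                  unsigned long height)
{
    return snprintf(buf, cap, "P6\n%-6lu\n%-10lu\n255\n", width, height);
}
```

Dimensions with more digits than the reserved width would make the header longer, so a real program would need to reject them.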

The header is now complete, so I can go back to finish the original compression stream. Again, I assume the output buffer is big enough for these final bytes.

z.adler = adler32_combine(adler, z.adler, z.total_in - HEADER_SIZE);
z.next_out = (void *)bufout;
z.avail_out = sizeof(bufout);
deflate(&z, Z_FINISH);
fwrite(bufout, 1, sizeof(bufout) - z.avail_out, stdout);
deflateEnd(&z);

It’s a lot more code than I expected, but it wasn’t too hard to work out. If you want to get into the nitty gritty and really hack a zlib stream, check out RFC 1950 and RFC 1951.

LZSS Quine Puzzle

2014-11-22T05:29:18Z

When I was a kid I spent some time playing a top-down, 2D, puzzle/action, 1993, MS-DOS game called God of Thunder. It came on a shareware CD, now long lost, called Games People Play. A couple of decades later I was reminded of the game and decided to dig it up and play it again. It’s not quite as exciting as I remember it (nostalgia really warps perception), but it’s still an interesting game nonetheless.

That got me thinking about how difficult it might be to modify (“mod”) the game to add my own levels and puzzles. It’s a tiny game, so there aren’t many assets to reverse engineer. Unpacked, the game just barely fits on a 1.44 MB high density floppy disk. That was probably one of the game’s primary design constraints. It also means it’s almost certainly employing some sort of home-brew compression algorithm in order to fit more content. I find these sorts of things absolutely interesting and delightful.

You see, back in those old days, compression wasn’t really a “solved” problem like it is today. They had to design and implement their own algorithms, with varying degrees of success. Today if you need compression for a project, you just grab zlib. Released in 1995, it implements the most widely used compression algorithm today, DEFLATE, with a tidy, in-memory API. zlib is well-tested, thoroughly optimized, and sits in a nearly-perfect sweet spot between compression ratio and performance. There’s even an embeddable version. Since spinning platters are so slow compared to CPUs, compression is likely to speed up an application simply because fewer bytes need to go to and from the disk. Today it’s less about saving storage space and more about reducing input/output demands.

Fortunately for me, someone has already reverse engineered most of the God of Thunder assets. It uses its own flavor of Lempel-Ziv-Storer-Szymanski (LZSS), which itself is derived from LZ77, one of the algorithms used in DEFLATE. The original LZSS paper focuses purely on the math, describing the algorithm in terms of symbols with no concern for how it’s actually serialized into bits. Those specific details were decided by the game’s developers, and that’s what I’ll be describing below.

As an adult I’m finding the God of Thunder asset formats to be more interesting than the game itself. It’s a better puzzle! I really enjoy studying the file formats of various applications, especially older ones that didn’t have modern standards to build on. Usually lots of thought and engineering goes into the design of these formats (and, too often, not enough thought goes into it). The format’s specifics reveal insights into the internal workings of the application, sometimes exposing unanticipated failure modes. Prying apart odd, proprietary formats (i.e. “data reduction”) is probably my favorite kind of work at my day job, and it comes up fairly often.

God of Thunder LZSS Definition

An LZSS compression stream is made up of two kinds of chunks: literals and back references. A literal chunk is passed through to the output buffer unchanged. A reference chunk is a pair of numbers: a length and an offset backwards into the output buffer. Only a single bit is needed for each chunk to identify its type.

To avoid any sort of complicated and slow bit wrangling, the God of Thunder developers (or whoever inspired them) came up with the smart idea to stage 8 of these bits up at once as a single byte, a “control” byte. Since literal chunks are 1 byte and reference chunks are 2 bytes, everything falls onto clean byte boundaries. Every group of 8 chunks is prefixed with one of these control bytes, and so every LZSS compression stream begins with a control byte. The least significant bit controls the 1st chunk in the group and the most significant bit controls the 8th chunk. A 1 denotes a literal and a 0 denotes a reference.

So, for example, a control byte of 0xff means to pass through unchanged the next 8 bytes of the compression stream. This would be the least efficient compression scenario, because the “compressed” stream is 112.5% (9/8) the size of the uncompressed stream. Gains come entirely from the back references.

A back reference is two bytes, little endian (this was MS-DOS running on x86): the lower 12 bits are the offset and the upper 4 bits are the length, minus 2. That is, you read the 4 length bits and add 2. This is because it doesn’t make any sense to reference a length shorter than 2: a literal chunk would be shorter. The offset doesn’t have anything added to it. This was a design mistake since an offset of 0 doesn’t make any sense. It refers to a byte just outside the output buffer. It should have been stored as the offset minus 1.

A 12-bit offset means up to a 4kB sliding window of output may be referenced at any time. A 4-bit length, plus two, means up to 17 bytes may be copied in a single back reference. Compared to other compression algorithms, this is rather short.

It’s important to note that the length is allowed to extend beyond the output buffer (offset < length). The bytes are, in effect, copied one at a time into the output and may potentially be reused within the same operation (like the opposite of memmove). An offset of 1 and a length of 10 means “repeat the last output byte 10 times.”

That’s the entire format! It’s extremely simple but reasonably effective for the game’s assets.
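As a sketch of the format described above, here is a small decoder. It’s written from the description in this article, not from the game’s code, and the name sketch_lzss_decode is hypothetical. It assumes the compressed length is known and simply stops when the input runs out, ignoring any unused control bits.

```c
#include <stddef.h>
#include <stdint.h>

// Decode src[0..srclen) into dst[0..dstcap). Returns the number of bytes
// written, or -1 if the stream is malformed or the output doesn't fit.
static long
sketch_lzss_decode(const uint8_t *src, size_t srclen,
                   uint8_t *dst, size_t dstcap)
{
    size_t in = 0, out = 0;
    while (in < srclen) {
        unsigned control = src[in++];
        for (int bit = 0; bit < 8 && in < srclen; bit++) {
            if ((control >> bit) & 1) {
                // Literal: copy one byte straight through.
                if (out == dstcap) return -1;
                dst[out++] = src[in++];
            } else {
                // Reference: 16 bits little endian, low 12 bits are the
                // offset, high 4 bits are the length minus 2.
                if (srclen - in < 2) return -1;
                unsigned ref = src[in] | ((unsigned)src[in + 1] << 8);
                in += 2;
                size_t offset = ref & 0xfff;
                size_t length = (ref >> 12) + 2;
                if (offset == 0 || offset > out) return -1;
                if (length > dstcap - out) return -1;
                // Byte-at-a-time copy so that a reference may overlap
                // the bytes it is producing (offset < length).
                for (size_t i = 0; i < length; i++, out++) {
                    dst[out] = dst[out - offset];
                }
            }
        }
    }
    return (long)out;
}
```

For example, the four bytes 0x01, 'a', 0x01, 0x80 are one literal 'a' followed by a reference with offset 1 and length 10, which decodes to eleven copies of 'a'.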

Worst Case and Best Case

In the worst case, such as compressing random data, the compression stream will be at most 112.5% (9/8) the size of the uncompressed stream.

In the best case, such as a long string of zeros, the compressed stream will be, at minimum, 12.5% (1/8) the size of the decompressed stream. Think about it this way: imagine every chunk is a reference of maximum length. That’s 1 control byte plus 16 (8 * 2) reference bytes, for a total of 17 compressed bytes. This emits 17 * 8 decompressed bytes, 17 being the maximum length from 8 chunks. Conveniently those two 17s cancel, leaving a factor of 8 for the best case.
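The worst-case bound can be illustrated with a trivial “encoder” that never searches for matches and emits every byte as a literal. This is only a sketch with hypothetical names, not a useful compressor. It assumes the final group may be short and leaves its unused control bits as 0.

```c
#include <stddef.h>
#include <stdint.h>

// Size of an all-literal stream: one control byte per group of up to
// 8 literals, plus the literals themselves.
static size_t
sketch_lzss_worst_case(size_t len)
{
    return len + (len + 7) / 8;
}

// Emit src as literal chunks only. dst must have room for
// sketch_lzss_worst_case(len) bytes. Returns the bytes written.
static size_t
sketch_lzss_store(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i += 8) {
        size_t n = len - i < 8 ? len - i : 8;
        dst[out++] = (uint8_t)((1u << n) - 1);  // n low bits set: n literals
        for (size_t j = 0; j < n; j++) {
            dst[out++] = src[i + j];
        }
    }
    return out;
}
```

For 9 input bytes this produces 11 output bytes: a 0xff control byte, 8 literals, a 0x01 control byte, and the final literal.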

LZSS End of Stream

If you’re paying really close attention, you may have noticed that by grouping 8 control bits at a time, the length of the input stream is, strictly speaking, constrained to certain lengths. What if, during compression, the input stream comes up short of exactly those 8 chunks? As is, there’s no way to communicate a premature end to the stream. There are three ways around this using a small amount of metadata, each differing in robustness.

Keep track of the size of the decompressed data. When that many bytes have been emitted, halt. This is how God of Thunder handles it. A small validation check could be performed here. The output stream should always end between chunks, not in the middle of a chunk (i.e. in the middle of copying a back reference). Some of the bits in the control byte may contain arbitrary data that doesn’t affect the output, which is a concern when hashing compressed data. My suggestion: require the unused control bits to be 0, which allows for an additional validation check. The output stream should never end just short of a literal chunk.
Keep track of the size of the compressed data. Halt when no more chunks are encountered. A similar, weaker validation check can be performed here: the input stream should never stop between two bytes of a reference. It’s weaker because it’s less sensitive to corruption, making it harder to detect. The same unused control bit padding situation applies here.
Use an out-of-band end marker (EOF). This is very similar to keeping track of the input size (the filesystem is doing it), but has the weakest validation of all. The stream could be accidentally truncated at any point between chunks, which is undetectable. This makes it the least sensitive to corruption.
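The checks from the first option could be sketched as a validator that walks the chunks without producing any output. The name sketch_lzss_validate is hypothetical, and the zero-padding rule is the suggestion made above, not something the game’s format is known to require. It also assumes the compressed length is known so that trailing bytes can be rejected.

```c
#include <stddef.h>
#include <stdint.h>

// Returns 1 if src[0..srclen) decodes to exactly outlen bytes, ending on a
// chunk boundary with all unused control bits zero. Returns 0 otherwise.
static int
sketch_lzss_validate(const uint8_t *src, size_t srclen, size_t outlen)
{
    size_t in = 0, out = 0;
    while (out < outlen) {
        if (in == srclen) return 0;  // input ended before the output did
        unsigned control = src[in++];
        int bit = 0;
        for (; bit < 8 && out < outlen; bit++) {
            if ((control >> bit) & 1) {
                if (in == srclen) return 0;  // missing literal byte
                in += 1;
                out += 1;
            } else {
                if (srclen - in < 2) return 0;  // truncated reference
                unsigned ref = src[in] | ((unsigned)src[in + 1] << 8);
                in += 2;
                size_t offset = ref & 0xfff;
                size_t length = (ref >> 12) + 2;
                if (offset == 0 || offset > out) return 0;
                if (length > outlen - out) return 0;  // ends mid-reference
                out += length;
            }
        }
        // Any control bits beyond the last chunk must be zero.
        if (out == outlen && (control >> bit) != 0) return 0;
    }
    return in == srclen;  // no trailing bytes
}
```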

An LZSS Quine

After spending some time playing around with this format, I thought about what it would take to make an LZSS quine. That is, find an LZSS compressed stream that decompresses to itself. It’s been done for DEFLATE, which I imagine is a much harder problem. There are zip files containing exact copies of themselves, recursively. I’m pretty confident it’s never been done for this exact compression format, simply because it’s so specific to this old MS-DOS game.

I haven’t figured it out yet, so you won’t find the solution here. This, dear readers, is my challenge to you! Using the format described above, craft an LZSS quine. LZSS doesn’t have no-op chunks (i.e. length = 0), which makes this harder than it would otherwise be. It may not even be possible, which, in that case, your challenge is to prove it!

So far I’ve determined that it begins with at least 4kB of 0xff. Why is this? First, as I mentioned, all compression streams begin with a control byte. Second, no references can be made until at least one literal byte has been passed, so the first bit (LSB) of the first byte is a 1, and the second byte is exactly the same as the first byte. So the first two bytes are xxxxxxx1, with the x being “don’t care (yet).”

If the next chunk is a back reference, those first two bytes become xxxxxx01. It could only reference that one byte (so offset = 1), and the length would need to be at least two, ensuring at least the first three bytes of output all have that same pattern. However, on the most significant byte of the reference chunk, this conflicts with having an offset of 1 because the 9th bit of the offset is set to 1, forcing the offset to an invalid 257 bytes. Therefore, the second chunk must be a literal.

This pattern continues until the first eight chunks are all literals, which means the quine begins with at least 9 0xff bytes. Going on, this also means the first back reference is going to be 0xffff (offset = 4095, length = 17), so the sliding window needs to be filled enough to make that offset valid. References would then be used to “catch up” with the compression stream, then some magic is needed to finish off the stream.

That’s where I’m stuck.

Lossless Optimizers

2009-08-23T00:00:00Z

I've been using lossless optimizers for a while now for PNGs, but more recently I have found some for other formats. Here are the ones I know about. These are all intended to be lossless, so running them should result in no information loss (well, except the SVG one).

For PNG, there are a number of choices, but my favorite is OptiPNG. It adjusts the PNG parameters and recompresses to find the optimal parameters. I run it on almost all my images around here, and I tend to get around 10% to 30% reduction for images fresh off Gimp, Kolourpaint, and ImageMagick.

For JPEG, I use jpegoptim. It works by optimizing the Huffman tables (the lossless part of JPEG compression). I only found this one recently, but I will be using it all the time, such as on the thousands of new photos from our wedding reception.

For PDF, I found something called QPDF. It's designed more for other PDF transformations, but without any other parameters it will simply losslessly optimize a PDF. From what I've seen so far it cuts PDFs down by about a third.

For SVG, Scour is a young project, only a few months old. I've been looking for an SVG optimizer for some time, so this was exciting to find. Due to the type of file it's working with, it's not quite entirely lossless. Visually, it is lossless, but it will toss all metadata (comments, etc.), which may be important. If you hand-crafted your SVG, you won't want to use this tool. It's good for removing Inkscape and Illustrator cruft, though.

I have yet to find a good (Free) GIF optimizer. Animated GIFs, with lots of redundancy between frames, have a lot of potential for optimization too. A video optimizer (for, say, Theora) would be interesting; I imagine it might work similarly to jpegoptim. Audio files (like Vorbis, FLAC, or MP3) probably don't have any room for optimization. I could be wrong. For XHTML there is tidy if you want to count that. All the other XML formats (ODF, RSS, etc.) could have their own too. Or optimizers for archives, like zip and tar. For tar it might rearrange things to better suit gzip, bzip2, or lzma. Executable optimizers? Postscript optimizers? It goes on and on.

If you know about any more, especially for other file formats, let me know.

Avoid Zip Archives

2009-03-22T00:00:00Z

In a previous post about the LZMA compression algorithm, I made a negative comment about zip archives and moved on. I would like to go into more detail about it now.

A zip archive serves three functions all-in-one: compression, archive, and encryption. On a unix-like system, these functions would normally be provided by three separate tools, like tar, gzip/bzip2, and GnuPG. The unix philosophy says to "write programs that do one thing and do it well".

So in the case of zip archives, we are doing three things poorly when, instead, we should be using three separate tools that each do one thing well.

When we use three different tools, our encrypted archive is a lot like an onion. On the outside we have encryption. After we peel that off by decrypting it, we have compression, and after removing that layer, finally the archive. This is reflected in the filename: .tar.gz.gpg. As a side note, if GPG didn't already support it, we could add base-64 encoding if needed as another layer on the onion: .tar.gz.gpg.b64.

By using separate tools, we can also swap different tools in and out without breaking any spec. Previously I mentioned using LZMA, which could be used in place of gzip or bzip2. Instead of .tar.gz.gpg you can have .tar.lzma.gpg. Or you can swap out GPG for encryption and use, say, CipherSaber as .tar.lzma.cs2. If we use a single one-size-fits-all format, we are limited by the spec.

Compression

Both zip and gzip basically use the same compression algorithm. The zip spec actually allows for a variety of other compression algorithms, but you cannot rely on other tools to support them.

Zip archives are also inside out. Instead of solid compression, which is what happens in tarballs, each file is compressed individually. Redundancy between different files cannot be exploited. The equivalent would be an inside out tarball: .gz.tar. This would be produced by first individually gzipping each file in a directory tree, then archiving them with tar. This results in larger archive sizes.

However, there is an advantage to inside out archives: random access. We can access a file in the middle of the archive without having to take the whole thing apart. In general use, this sort of thing isn't really needed, and solid compression would be more useful.

Encryption

Encryption is where zip has been awful in the past. The original spec's encryption algorithm had serious flaws and no one should even consider using it today.

Since then, AES encryption has been worked into the standard and implemented differently by different tools. Unless the same zip tool is used on each end, you can't be sure AES encryption will work.

By placing encryption as part of the file spec, each tool has to implement its own encryption, probably leaving out considerations like using secure memory. These tools are concentrating on archiving and compression, and so encryption will likely not be given a solid effort.

In the implementations I know of, the archive index isn't encrypted, so someone could open it up and see lots of file metadata, including filenames.

When you encrypt a tarball with GnuPG, you have all the flexibility of PGP available. Asymmetric encryption, web of trust, multiple strong encryption algorithms, digital signatures, strong key management, etc. It would be unreasonable for an archive format to have this kind of thing built in.

Conclusion

You are almost always better off using a tarball rather than a zip archive. Unfortunately the receiver of an archive will often be unable to open anything else, so you may have no choice.

LZMA Tarballs Are Coming

2009-03-16T00:00:00Z

Any developer that uses a non-toy operating system will be familiar with gzip and bzip2 tarballs (.tar.gz, .tgz, and .tar.bz2). Most places will provide both versions so that the user can use his preferred decompresser.

Both types are useful because they make tradeoffs at different points: gzip is very fast with low memory requirements and bzip2 has much better compression ratios at the cost of more memory and CPU time. Users of older hardware will prefer gzip, because the benefits of bzip2 are negated by the long decompression times, around 6 times longer. This is why OpenBSD prefers gzip.

But there is a new compression algorithm in town. Well, it has been around for about 10 years now, but, if I understand correctly, was patent encumbered (aka useless) for a while. It is called the Lempel-Ziv-Markov chain algorithm (LZMA). It is still maturing and different software that uses LZMA still can't quite talk to each other. 7-zip and LZMA Utils are a couple of examples.

GNU tar added an --lzma option just last April, and finally gave it a short option, -J, this past December. I take this as a sign that LZMA tarballs (.tar.lzma) are going to become common over the next several years. It also would seem that the GNU project has officially blessed LZMA.

And not only that, I think LZMA tarballs will supplant bzip2 tarballs. The reason is that LZMA is even more asymmetric than bzip2.

According to the LZMA Utils page, LZMA compression ratios are 15% better than those of bzip2, but at the cost of being 4 to 12 times slower on compression. In many applications, including tarball distribution, this is completely acceptable because decompression is faster than bzip2! There is an extreme asymmetry here that can readily be exploited.

So, when a developer has a new release he tells his version control system, or maybe his build system, to make a tar archive and compress it with LZMA. If he has a computer from this millennium, it won't take a lifetime to do, but it will still take some time. Since it could take as much as two orders of magnitude longer to make than a gzip tarball, he could make a gzip tarball first and put it up for distribution. When the LZMA tarball is done, it will be about 30% smaller and decompress almost as fast as the gzip tarball (but while using a large amount of memory).

At this point, why would someone download a bzip2 archive? It's bigger and slower. Right now possible reasons may be a lack of an LZMA decompresser and/or lack of familiarity. Over time, these will both be remedied.

Don't get me wrong. I don't hate bzip2. It is a very interesting algorithm. In fact, I was breathless when I first understood the Burrows-Wheeler transform, which bzip2 uses at one stage. I would argue that bzip2 is more elegant than gzip and LZMA because it is less arbitrary. But I do think it will become obsolete.

Unfortunately, the confused zip archive is here to stay for now because it is the only compression tool that a certain popular, but inferior, operating system ships with. I say "confused" because it makes the mistake of combining three tools into one: archive, compression, and encryption. As a result, instead of doing one thing well it does three things poorly. Cell phone designers also make the same mistake. Fortunately I don't have to touch zip archives often.

Finally, don't forget that LZMA is mostly useful where the asymmetry can be exploited: data is compressed once and decompressed many times. Take the gitweb interface, which provides access to a git repository through a browser. It will provide a gzip tarball of any commit on the fly. It doesn't do this by having all these tarballs lying around, but creates them on demand. Data is compressed once and decompressed once. Because of this, gzip is, and will remain, the best option for this setting.

In conclusion, consider creating LZMA tarballs next time, and don't be afraid to use them when you come across them.