LZSS Quine Puzzle

November 22, 2014

nullprogram.com/blog/2014/11/22/

When I was a kid I spent some time playing a top-down, 2D, puzzle/action, 1993, MS-DOS game called God of Thunder. It came on a shareware CD, now long lost, called Games People Play. A couple decades later I was recently reminded of the game and decided to dig it up and play it again. It’s not quite as exciting as I remember it — nostalgia really warps perception — but it’s still an interesting game nonetheless.

That got me thinking about how difficult it might be to modify (“mod”) the game to add my own levels and puzzles. It’s a tiny game, so there aren’t many assets to reverse engineer. Unpacked, the game just barely fits on a 1.44 MB high density floppy disk. That was probably one of the game’s primary design constraints. It also means it’s almost certainly employing some sort of home-brew compression algorithm in order to fit more content. I find these sorts of things absolutely interesting and delightful.

You see, back in those old days, compression wasn’t really a “solved” problem like it is today. They had to design and implement their own algorithms, with varying degrees of success. Today if you need compression for a project, you just grab zlib. Released in 1995, it implements the most widely used compression algorithm today, DEFLATE, with a tidy, in-memory API. zlib is well-tested, thoroughly optimized, and sits in a nearly-perfect sweet spot between compression ratio and performance. There’s even an embeddable version. Since spinning platters are so slow compared to CPUs, compression is likely to speed up an application simply because fewer bytes need to go to and from the disk. Today it’s less about saving storage space and more about reducing input/output demands.

Fortunately for me, someone has already reversed engineered most of the God of Thunder assets. It uses its own flavor of Lempel-Ziv-Storer-Szymanski (LZSS), which itself is derived from LZ77, one of the algorithms used in DEFLATE. The original LZSS paper focuses purely on the math, describing the algorithm in terms of symbols with no concern for how it’s actually serialized into bits. Those specific details were decided by the game’s developers, and that’s what I’ll be describing below.

As an adult I’m finding the God of Thunder asset formats to be more interesting than the game itself. It’s a better puzzle! I really enjoy studying the file formats of various applications, especially older ones that didn’t have modern standards to build on. Usually lots of thought and engineering goes into the design these formats — and, too often, not enough thought goes into it. The format’s specifics reveal insights into the internal workings of the application, sometimes exposing unanticipated failure modes. Prying apart odd, proprietary formats (i.e. “data reduction”) is probably my favorite kind of work at my day job, and it comes up fairly often.

God of Thunder LZSS Definition

An LZSS compression stream is made up of two kinds of chunks: literals and back references. A literal chunk is passed through to the output buffer unchanged. A reference chunk is a pair of numbers: a length and an offset backwards into the output buffer. Only a single bit is needed for each chunk to identify its type.

To avoid any sort of complicated and slow bit wrangling, the God of Thunder developers (or whoever inspired them) came up with the smart idea to stage 8 of these bits up at once as a single byte, a “control” byte. Since literal chunks are 1 byte and reference chunks are 2 bytes, everything falls onto clean byte boundaries. Every group of 8 chunks is prefixed with one of these control bytes, and so every LZSS compression stream begins with a control byte. The least significant bit controls the 1st chunk in the group and the most significant bit controls the 8th chunk. A 1 denotes a literal and a 0 denotes a reference.

So, for example, a control byte of 0xff means to pass through unchanged the next 8 bytes of the compression stream. This would be the least efficient compression scenario, because the “compressed” stream is 112.5% (9/8) bigger than the uncompressed stream. Gains come entirely from the back references.

A back reference is two bytes little endian (this was in MS-DOS running on x86), the lower 12 bits are the offset and the upper 4 bits are the length, minus 2. That is, you read the 4 length bits and add 2. This is because it doesn’t make any sense to reference a length shorter than 2: a literal chunk would be shorter. The offset doesn’t have anything added to it. This was a design mistake since an offset of 0 doesn’t make any sense. It refers to a byte just outside the output buffer. It should have been stored as the offset minus 1.

A 12-bit offset means up to a 4kB sliding window of output may be referenced at any time. A 4-bit length, plus two, means up to 17 bytes may be copied in a single back reference. Compared to other compression algorithms, this is rather short.

It’s important to note that the length is allowed to extend beyond the output buffer (offset < length). The bytes are, in effect, copied one at a time into the output and may potentially be reused within the same operation (like the opposite of memmove). An offset of 1 and a length of 10 means “repeat the last output byte 10 times.”

That’s the entire format! It’s extremely simple but reasonably effective for the game’s assets.

Worst Case and Best Case

In the worst case, such as compressing random data, the compression stream will be at most 112.5% (9/8) bigger than the uncompressed stream.

In the best case, such as a long string of zeros, the compressed stream will be, at minimum, 12.5% (1/8) the size of the decompressed stream. Think about it this way: imagine every chunk is a reference of maximum length. That’s 1 control byte plus 16 (8 * 2) reference bytes, for a total of 17 compressed bytes. This emits 17 * 8 decompressed bytes, 17 being the maximum length from 8 chunks. Conveniently those two 17s cancel, leaving a factor of 8 for the best case.

LZSS End of Stream

If you’re paying really close attention, you may have noticed that by grouping 8 control bits at a time, the length of the input stream is, strictly speaking, constrained to certain lengths. What if, during compression, the input stream stream comes up short of exactly those 8 chunks? As is, there’s no way to communicate a premature end to the stream. There are three ways around this using a small amount of metadata, each differing in robustness.

Keep track of the size of the decompressed data. When that many bytes have been emitted, halt. This is how God of Thunder handles it. A small validation check could be performed here. The output stream should always end between chunks, not in the middle of a chunk (i.e. in the middle of copying a back reference). Some of the bits in the control byte may contain arbitrary data that doesn’t effect the output, which is a concern when hashing compressed data. My suggestion: require the unused control bits to be 0, which allows for an additional validation check. The output stream should never end just short of a literal chunk.
Keep track of the size of the compressed data. Halt when no more chunks are encountered. A similar, weaker validation check can be performed here: the input stream should never stop between two bytes of a reference. It’s weaker because it’s less sensitive to corruption, making it harder to detect. The same unused control bit padding situation applies here.
Use an out-of-band end marker (EOF). This is very similar to keeping track of the input size (the filesystem is doing it), but has the weakest validation of all. The stream could be accidentally truncated at any point between chunks, which is undetectable. This makes it the least sensitive to corruption.

An LZSS Quine

After spending some time playing around with this format, I thought about what it would take to make an LZSS quine. That is, find an LZSS compressed stream that decompresses to itself. It’s been done for DEFLATE, which I imagine is a much harder problem. There are zip files containing exact copies of themselves, recursively. I’m pretty confident it’s never been done for this exact compression format, simply because it’s so specific to this old MS-DOS game.

I haven’t figured it out yet, so you won’t find the solution here. This, dear readers, is my challenge to you! Using the format described above, craft an LZSS quine. LZSS doesn’t have no-op chunks (i.e. length = 0), which makes this harder than it would otherwise be. It may not even be possible, which, in that case, your challenge is to prove it!

So far I’ve determined that it begins with at least 4kB of 0xff. Why is this? First, as I mentioned, all compression streams begin with a control byte. Second, no references can be made until at least one literal byte has been passed, so the first bit (LSB) of the first byte is a 1, and the second byte is exactly the same as the first byte. So the first two bytes are xxxxxx1, with the x being “don’t care (yet).”

If the next chunk is a back reference, those first two bytes become xxxxxx01. It could only reference that one byte (so offset = 1), and the length would need to be at least two, ensuring at least the first three bytes of output all have that same pattern. However, on the most significant byte of the reference chunk, this conflicts with having an offset of 1 because the 9th bit of the offset is set to 1, forcing the offset to an invalid 257 bytes. Therefore, the second chunk must be a literal.

This pattern continues until the first eight chunks are all literals, which means the quine begins with at least 9 0xff bytes. Going on, this also means the first back reference is going to be 0xffff (offset = 4095, length = 17), so the sliding window needs to be filled enough to make that a offset valid. References would then be used to “catch up” with the compression stream, then some magic is needed to finish off the stream.

That’s where I’m stuck.

compression

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht [mailing list etiquette] , or see existing discussions.