<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged compression at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/compression/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/compression/feed/"/>
  <updated>2026-04-07T03:24:16Z</updated>
  <id>urn:uuid:1c53dce4-4b3d-4f09-8857-46c6a45c2c7b</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>QOI is now my favorite asset format</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/12/18/"/>
    <id>urn:uuid:184bb5f6-3c31-4faf-9a15-3a693b8f4c7d</id>
    <updated>2022-12-18T03:45:44Z</updated>
    <category term="c"/><category term="compression"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=34035024">on Hacker News</a>.</em></p>

<p>The <a href="https://qoiformat.org/">Quite OK Image (QOI) format</a> was announced late last year and
finalized into a specification a month later. I was initially dismissive,
but a revisit has left me impressed. The format hits a sweet spot
in the trade-off space between complexity, speed, and compression ratio.
Considering also its alpha channel support, QOI has become my default
choice for embedded image assets. It’s not perfect, but at the very least
it’s a solid foundation.</p>

<!--more-->

<p>Since I’m now working with QOI images, I need a good QOI viewer, and so I
added support to my ill-named <a href="https://github.com/skeeto/scratch/tree/master/pbmview">pbmview</a> tool, which I wrote to
serve the same purpose for <a href="https://netpbm.sourceforge.net/doc/ppm.html">Netpbm</a>. I will <a href="/blog/2020/06/29/">continue to use Netpbm
as an output format</a>, especially for raw video output, but no
longer will I use it for an embedded asset (nor re-invent yet another
<a href="https://en.wikipedia.org/wiki/Run-length_encoding">RLE</a> over Netpbm).</p>

<p>I was dismissive because the website claimed, and still claims today, that QOI
images are “a similar size” to PNG. However, for the typical images where
I would use PNG, QOI is around 3x larger, and some outliers are far worse.
The 745 PNGs on my blog — a perfect test corpus for my own needs — convert
to QOIs 2.8x larger on average. The official QOI benchmark has much better
results, 1.3x larger, but that’s because it includes a lot of photography
where PNG and QOI both do poorly, making QOI seem more comparable.</p>

<p>However, as I said, QOI’s strength is its trade-off sweet spot. The
<a href="https://qoiformat.org/qoi-specification.pdf">specification is one page</a>, and an experienced developer can write
a complete implementation from scratch in a single sitting. <a href="https://github.com/skeeto/scratch/blob/master/parsers/qoi.c">My own
implementation is about 100 lines of libc-free C</a> for each of the
encoder and decoder. With error checking removed, my decoder is ~600 bytes
of x86 object code — a great story for embedding alongside assets. It’s
more complex than Netpbm or <a href="https://tools.suckless.org/farbfeld/">farbfeld</a>, but it’s far simpler than BMP.
I’ve already begun <a href="https://github.com/skeeto/chess/commit/5c123b3">experimenting with converting assets to QOI</a>,
and the results have so far exceeded my expectations.</p>

<p>To my surprise, the encoder was easier to write than the decoder. The
format is so straightforward that two different encoders will produce
identical files. There’s little room for specialized optimization, and
no meaningful “compression level” knob.</p>

<h3 id="criticism">Criticism</h3>

<p>There are a <a href="https://github.com/nigeltao/qoi2-bikeshed">lot of dimensions</a> on which QOI could be improved,
but most cases involve trade-offs, e.g. more complexity for better
compression. The areas where QOI could have been strictly better, the
dimensions on which it is not on the Pareto frontier, are more meaningful
criticisms — missed opportunities. My criticisms of this kind:</p>

<ul>
  <li>
    <p>Big endian fields are an odd choice for a 2020s file format. Little
endian dominates the industry, and it would have made for a slightly
smaller decoder footprint on typical machines today if QOI used little
endian.</p>
  </li>
  <li>
    <p>The header has two flags and spends an entire byte on each. It should
have instead had a flag byte, with two bits assigned to these flags. One
flag indicates if the alpha channel is important, and the other selects
between two color spaces (sRGB, linear). Both flags are only advisory.</p>
  </li>
  <li>
    <p>The 4-channel encoded pixel format is ABGR (or RGBA), placing the alpha
channel next to the blue channel. This is somewhat unconventional. A
decoder is likely to use a single load into a 32-bit integer, and ideally
it’s already in the desired format or close to it. A few times already
I’ve had to shuffle the RGB bytes within the 32-bit sample to be
compatible with some other format. QOI channel ordering is arbitrary,
and I would have chosen ARGB (when viewed as little endian).</p>
  </li>
  <li>
    <p>The QOI hash function operates on channels individually, with individual
overflow, making it slower and larger than necessary. The hash function
should have been <a href="/blog/2018/07/31/">over a packed 32-bit sample</a>. I would have used
<a href="/blog/2022/08/08/#hash-functions">a multiplication</a> by a carefully-chosen 32-bit integer, then a
right shift using the highest 6 bits of the result for the index.</p>
  </li>
</ul>
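
<p>To make the hash criticism concrete, here is a rough Python sketch (not the C from my implementation) that puts the index hash from the QOI specification next to a multiply-shift hash over a packed sample. The function names and the multiplier <code class="language-plaintext highlighter-rouge">0x9E3779B1</code> are assumptions picked for illustration; any carefully-chosen odd 32-bit constant would play the same role.</p>

```python
def qoi_hash(r, g, b, a):
    """Index hash as given in the QOI specification (table of 64 slots)."""
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64


def packed_hash(abgr, multiplier=0x9E3779B1):
    """Hypothetical alternative: one multiply over the packed 32-bit sample,
    then a right shift that keeps the highest 6 bits as the index."""
    return ((abgr * multiplier) & 0xFFFFFFFF) >> 26


opaque_black = 0xFF000000  # A=255, B=G=R=0 when packed as ABGR
print(qoi_hash(0, 0, 0, 255), packed_hash(opaque_black))
```

<p>Both functions map a pixel to one of 64 table slots. The packed variant needs a single multiply and a shift, where the specified hash needs four multiplies, three additions, and a mask.</p>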

<p>More subjective criticisms that might count as having trade-offs:</p>

<ul>
  <li>
    <p>Given a “flag byte” (mentioned above) it would have been free to assign
another flag bit indicating pre-multiplied alpha, also still advisory.
<a href="https://www.adriancourreges.com/blog/2017/05/09/beware-of-transparent-pixels/">You want to use pre-multiplied alpha</a> for your assets, and the
option to store them this way would help.</p>
  </li>
  <li>
    <p>There’s an 8-byte end-of-stream marker — a bit excessive — deliberately
an invalid encoding so that reads past the end of the image will result
in a decoding error. I probably would have chosen a dead simple 32-bit
checksum of packed 32-bit image samples, even if literally a sum.</p>
  </li>
</ul>
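
<p>As a sketch of what those two tweaks might look like in a private QOI fork, here is some illustrative Python. Only the 8-byte end marker comes from the QOI specification; the flag bit assignments, the <code class="language-plaintext highlighter-rouge">trailer</code> helper, and the little-endian checksum layout are all hypothetical.</p>

```python
# Hypothetical flag-byte layout; none of these bit assignments exist in QOI.
FLAG_ALPHA = 1 << 0          # alpha channel is meaningful
FLAG_LINEAR = 1 << 1         # linear color space instead of sRGB
FLAG_PREMULTIPLIED = 1 << 2  # alpha is pre-multiplied

# The real end-of-stream marker from the QOI specification:
# seven 0x00 bytes followed by a single 0x01 byte.
QOI_END_MARKER = bytes(7) + b"\x01"


def sum_checksum(samples):
    """Literal sum of packed 32-bit samples, truncated to 32 bits."""
    return sum(samples) & 0xFFFFFFFF


def trailer(samples):
    """Hypothetical 4-byte little-endian trailer replacing the 8-byte marker."""
    return sum_checksum(samples).to_bytes(4, "little")


flags = FLAG_ALPHA | FLAG_PREMULTIPLIED
samples = [0xFF0000FF, 0xFF00FF00, 0x00000001]
print(flags, trailer(samples).hex())
```

<p>A decoder for such a fork would accumulate the same sum while emitting samples and compare it against the trailer at the end.</p>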

<p>Of course, you’re not obligated to follow QOI exactly to spec for your own
assets, so you could always use a modified QOI with one or more of these
tweaks. That’s what I meant about it being a solid foundation: You don’t
have to start from scratch with some custom RLE. Since the format is so
simple, you can easily build your own tools — as I’ve already begun doing
myself — so you don’t need to rely on tools supporting your QOI fork.</p>

<h3 id="minimalist-api">Minimalist API</h3>

<p>I’m really happy with my QOI implementation, particularly since it’s
another example of <a href="/blog/2018/06/10/">a minimalist C API</a>: no allocating, no input or
output, and no standard library use. As usual, the expectation is that
it’s in the same translation unit where it’s used, so it’s likely inlined
into callers.</p>

<p>The encoder is streaming — it accepts and returns only a little bit of
input and output at a time. It has three functions and one struct with no
“public” fields:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">qoiencoder</span> <span class="nf">qoiencoder</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">w</span><span class="p">,</span> <span class="kt">int</span> <span class="n">h</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">flags</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">qoiencode</span><span class="p">(</span><span class="k">struct</span> <span class="n">qoiencoder</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">color</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">qoifinish</span><span class="p">(</span><span class="k">struct</span> <span class="n">qoiencoder</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>The first function initializes an encoder and writes a fixed-length header
into the QOI buffer. The <code class="language-plaintext highlighter-rouge">flags</code> parameter is a mode string, like <code class="language-plaintext highlighter-rouge">fopen</code>. I
would normally use bit flags, but this is <a href="https://flak.tedunangst.com/post/string-interfaces">a little experiment</a>. The
second function encodes a single pixel into the QOI buffer, returning the
number of bytes written (possibly zero). The last flushes any encoding
state and writes the end-of-stream marker. There are no errors. My typical
use so far looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
<span class="k">struct</span> <span class="n">qoiencoder</span> <span class="n">q</span> <span class="o">=</span> <span class="n">qoiencoder</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">,</span> <span class="s">"a"</span><span class="p">);</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">QOIHDRLEN</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ... compute 32-bit ABGR sample at (x, y) ...</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">qoiencode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">abgr</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">qoifinish</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">buf</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
<span class="n">fflush</span><span class="p">(</span><span class="n">file</span><span class="p">);</span>
<span class="k">return</span> <span class="nf">ferror</span><span class="p">(</span><span class="n">file</span><span class="p">);</span>
</code></pre></div></div>

<p>This appends encoder outputs to a buffered stream, but it could just as
well accumulate directly into a larger buffer, advancing the write pointer
a little after each call.</p>
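
<p>For readers who want to see the shape of a streaming encoder without reading C, here is a minimal Python sketch written from the QOI specification. It is not a port of my implementation: the class name <code class="language-plaintext highlighter-rouge">QoiEncoder</code>, its methods, and the use of <code class="language-plaintext highlighter-rouge">(r, g, b, a)</code> tuples instead of packed samples are assumptions made for illustration. It accumulates output directly into a larger buffer rather than a stream.</p>

```python
import struct


class QoiEncoder:
    """Streaming QOI encoder sketch: feed one pixel, get zero or more bytes."""

    def __init__(self, width, height, channels=4, colorspace=0):
        # 14-byte header: magic, big-endian width and height, two flag bytes.
        self.header = struct.pack(">4sIIBB", b"qoif", width, height,
                                  channels, colorspace)
        self.prev = (0, 0, 0, 255)        # starting pixel defined by the spec
        self.table = [(0, 0, 0, 0)] * 64  # recently seen pixels
        self.run = 0

    def _flush_run(self):
        out = bytes([0xC0 | (self.run - 1)]) if self.run else b""
        self.run = 0
        return out

    def encode(self, px):
        r, g, b, a = px
        if px == self.prev:
            self.run += 1                 # QOI_OP_RUN holds at most 62
            return self._flush_run() if self.run == 62 else b""
        out = self._flush_run()
        i = (r * 3 + g * 5 + b * 7 + a * 11) % 64
        pr, pg, pb, pa = self.prev
        self.prev = px
        if self.table[i] == px:
            return out + bytes([i])                      # QOI_OP_INDEX
        self.table[i] = px
        if a != pa:
            return out + bytes([0xFF, r, g, b, a])       # QOI_OP_RGBA
        # Channel differences wrap around like 8-bit signed values.
        dr, dg, db = (((c - p + 128) & 0xFF) - 128
                      for c, p in zip((r, g, b), (pr, pg, pb)))
        if all(-2 <= d <= 1 for d in (dr, dg, db)):      # QOI_OP_DIFF
            return out + bytes([0x40 | (dr + 2) << 4 | (dg + 2) << 2 | (db + 2)])
        if -32 <= dg <= 31 and all(-8 <= d - dg <= 7 for d in (dr, db)):
            return out + bytes([0x80 | (dg + 32),        # QOI_OP_LUMA
                                (dr - dg + 8) << 4 | (db - dg + 8)])
        return out + bytes([0xFE, r, g, b])              # QOI_OP_RGB

    def finish(self):
        # Flush any pending run, then write the end-of-stream marker.
        return self._flush_run() + bytes(7) + b"\x01"


pixels = [(0, 0, 0, 255)] * 3 + [(255, 0, 0, 255)]
enc = QoiEncoder(width=4, height=1)
stream = bytearray(enc.header)
for px in pixels:
    stream += enc.encode(px)
stream += enc.finish()
```

<p>Every branch is forced by the input, which is why there is no “compression level” to tune.</p>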

<p>The decoder is two functions, but its struct has some “public” fields.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">qoidecoder</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">srgb</span><span class="p">,</span> <span class="n">error</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">qoidecoder</span> <span class="nf">qoidecoder</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">unsigned</span> <span class="nf">qoidecode</span><span class="p">(</span><span class="k">struct</span> <span class="n">qoidecoder</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The input is not streamed and the entire buffer must be loaded into memory
at once — not too bad since it’s compressed, and perhaps even already
loaded as part of the executable image — but the output <em>is</em> streamed,
delivering one packed 32-bit ABGR sample per call. The decoder makes no
assumptions about the output format, and the caller unpacks samples and
stores them in whatever format is appropriate (shader texture, etc.).</p>

<p>To make it easier to use, my decoder range checks to guarantee that width
and height <a href="/blog/2017/07/19/">can be multiplied without overflow</a>. Unlike encoding,
there may be errors due to invalid input, including that failed range
check. The decoder error flag is “sticky” and the decoder returns zero
samples when in an error state, so callers can wait to check for errors
until the end. (Though if you’re only decoding embedded assets, then there
are no practical errors, and checks can be removed/ignored.)</p>

<p>Example usage, copied almost verbatim from a real program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">loadimage</span><span class="p">(</span><span class="n">Image</span> <span class="o">*</span><span class="n">image</span><span class="p">,</span> <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">qoi</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">qoidecoder</span> <span class="n">q</span> <span class="o">=</span> <span class="n">qoidecoder</span><span class="p">(</span><span class="n">qoi</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="cm">/* image dimensions too large */</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">image</span><span class="o">-&gt;</span><span class="n">width</span>  <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">width</span><span class="p">;</span>
    <span class="n">image</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">height</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">count</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">width</span> <span class="o">*</span> <span class="n">q</span><span class="p">.</span><span class="n">height</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="n">abgr</span> <span class="o">=</span> <span class="n">qoidecode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">);</span>
        <span class="n">image</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">abgr</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
        <span class="n">image</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">abgr</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
        <span class="n">image</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">abgr</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
        <span class="n">image</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">abgr</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">q</span><span class="p">.</span><span class="n">error</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
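
<p>The decoding loop itself is short. Below is a Python sketch written from the QOI specification, with hypothetical names (<code class="language-plaintext highlighter-rouge">qoi_decode</code>, <code class="language-plaintext highlighter-rouge">to_bgra</code>). It assumes well-formed input and does not model the sticky error flag or the dimension range check described above; it only illustrates how samples stream out one at a time and how the caller shuffles them into the byte order it wants.</p>

```python
import struct


def qoi_decode(data):
    """Yield (r, g, b, a) tuples from a QOI byte string (assumes valid input)."""
    magic, width, height, _channels, _colorspace = struct.unpack_from(">4sIIBB", data)
    if magic != b"qoif":
        raise ValueError("not a QOI image")
    table = [(0, 0, 0, 0)] * 64
    r, g, b, a = 0, 0, 0, 255             # starting pixel defined by the spec
    pos, run = 14, 0
    for _ in range(width * height):
        if run:
            run -= 1                      # still inside a run: repeat pixel
        else:
            op = data[pos]
            pos += 1
            if op == 0xFE:                # QOI_OP_RGB
                r, g, b = data[pos:pos + 3]
                pos += 3
            elif op == 0xFF:              # QOI_OP_RGBA
                r, g, b, a = data[pos:pos + 4]
                pos += 4
            elif op >> 6 == 0:            # QOI_OP_INDEX
                r, g, b, a = table[op]
            elif op >> 6 == 1:            # QOI_OP_DIFF, 2 bits per channel
                r = (r + (op >> 4 & 3) - 2) & 0xFF
                g = (g + (op >> 2 & 3) - 2) & 0xFF
                b = (b + (op & 3) - 2) & 0xFF
            elif op >> 6 == 2:            # QOI_OP_LUMA, green-relative deltas
                low = data[pos]
                pos += 1
                dg = (op & 0x3F) - 32
                r = (r + dg + (low >> 4) - 8) & 0xFF
                g = (g + dg) & 0xFF
                b = (b + dg + (low & 0xF) - 8) & 0xFF
            else:                         # QOI_OP_RUN stores length minus one
                run = op & 0x3F
            table[(r * 3 + g * 5 + b * 7 + a * 11) % 64] = (r, g, b, a)
        yield r, g, b, a


def to_bgra(pixels):
    """Caller-side shuffle into B, G, R, A byte order."""
    out = bytearray()
    for r, g, b, a in pixels:
        out += bytes((b, g, r, a))
    return bytes(out)
```

<p>Because the decoder is a generator, the caller decides where each sample lands, just as with the C version.</p>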

<p>Note the aforementioned awkward RGB shuffle.</p>

<p>It’s safe to say that I’m excited about QOI, and that it now has a
permanent slot on my developer toolbelt.</p>

]]>
    </content>
  </entry>

  <entry>
    <title>Compressing and embedding a Wordle word list</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/03/07/"/>
    <id>urn:uuid:95e1a2c2-c1b6-4472-9954-7bc76b4bab10</id>
    <updated>2022-03-07T03:22:41Z</updated>
    <category term="c"/><category term="python"/><category term="compression"/>
    <content type="html">
      <![CDATA[<p><a href="https://en.wikipedia.org/wiki/Wordle">Wordle</a> is all the rage, resulting in an explosion of hobbyist clones,
with new ones appearing every day. At the current rate I estimate by the
end of 2022 that 99% of all new software releases will be Wordle clones.
That’s no surprise since the rules are simple, it’s more fun to implement
and study than to actually play, and the hard part is building a decent
user interface. Such implementations go back <a href="https://www.youtube.com/watch?v=Yi2mTMWC4BM&amp;t=1270s">at least 30 years</a>.
Implementers get to decide on a platform, language, and the particular
subject of this article: how to handle the word list. Is it a separate
file/database or <a href="/blog/2016/11/15/">embedded in the program</a>? If embedded, is it
worth compressing? In this article I’ll present a simple, tailored Wordle
list compression strategy that beats general purpose compressors.</p>

<p>Last week one particular <a href="/blog/2020/11/17/">QuickBASIC</a> clone, <a href="http://grahamdowney.com/software/WorDOSle/WorDOSle.htm">WorDOSle</a>, caught my
eye. It embeds its word list despite the dire constraints of its 16-bit
platform. The original Wordle list (<a href="https://gist.github.com/cfreshman/cdcdf777450c5b5301e439061d29694c">1</a>, <a href="https://gist.github.com/cfreshman/a03ef2cba789d8cf00c08f767e0fad7b">2</a>) has 12,972 words which,
naively stored, would consume 77,832 bytes (5 letters, plus newline).
Sadly this exceeds a 16-bit address space. Eliminating the redundant
newline delimiter brings it down to 64,860 bytes — just small enough to
fit in an 8086 segment, but probably still difficult to manage from
QuickBASIC.</p>

<p>The author made a trade-off, reducing the word list to a more manageable,
if meager, 2,318 words, wisely excluding delimiters. Otherwise no further
effort was made toward reducing the size. The list is sorted, and the program
cleverly tests words against the list in place using a binary search.</p>
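
<p>WorDOSle is written in QuickBASIC, but the in-place search is easy to sketch in Python. The function name <code class="language-plaintext highlighter-rouge">contains</code> and the tiny five-word list are hypothetical; the point is that fixed-width, delimiter-free records can be indexed directly.</p>

```python
def contains(packed, word):
    """Binary search over sorted 5-byte records stored back to back."""
    target = word.encode("ascii")
    lo, hi = 0, len(packed) // 5
    while lo < hi:
        mid = (lo + hi) // 2
        record = packed[5 * mid:5 * mid + 5]
        if record == target:
            return True
        if record < target:
            lo = mid + 1
        else:
            hi = mid
    return False


words = ["abbey", "acute", "agile", "album", "alloy"]
packed = b"".join(w.encode("ascii") for w in sorted(words))
print(contains(packed, "agile"), contains(packed, "zzzzz"))
```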

<h3 id="compaction-baseline">Compaction baseline</h3>

<p>Before getting into any real compression technologies, there’s low-hanging
fruit to investigate. Each word is exactly five case-insensitive English
letters, a–z. To illustrate, here are the first 100 5-letter
words from a short Wordle word list.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
</code></pre></div></div>

<p>In ASCII/UTF-8 form it’s 8 bits per letter, 5 bytes per word, but I only
need 5 bits per letter, or more specifically, ~4.7 bits (<code class="language-plaintext highlighter-rouge">log2(26)</code>) per
letter. If I instead treat each word as a base-26 number, I can pack each
word into 3 bytes (<code class="language-plaintext highlighter-rouge">log2(26**5)</code> is ~23.5 bits). That’s a 40% savings just
by using a smarter representation.</p>

<p>With 12,972 words, that’s <strong>38,916 bytes</strong> for the whole list. Any
compression I apply must at least beat this size in order to be worth
using.</p>
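
<p>Here is a minimal sketch of that base-26 packing. The helper names <code class="language-plaintext highlighter-rouge">pack</code> and <code class="language-plaintext highlighter-rouge">unpack</code> are hypothetical, and big-endian byte order is my choice here so that packed records sort the same way the words do, which keeps binary search available.</p>

```python
def pack(word):
    """Pack a 5-letter lowercase word into 3 bytes as a base-26 number."""
    n = 0
    for c in word:
        n = n * 26 + (ord(c) - ord("a"))
    return n.to_bytes(3, "big")  # 26**5 fits comfortably below 2**24


def unpack(triple):
    """Inverse of pack: recover the 5-letter word from 3 bytes."""
    n = int.from_bytes(triple, "big")
    letters = []
    for _ in range(5):
        n, digit = divmod(n, 26)
        letters.append(chr(ord("a") + digit))
    return "".join(reversed(letters))


print(pack("abbey").hex(), unpack(pack("abbey")))
```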

<h3 id="letter-frequency">Letter frequency</h3>

<p>Not all letters occur at the same frequency. Here’s the letter frequency
for the original Wordle word list:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a:5990  e:6662  i:3759  m:1976  q: 112  u:2511  y:2074
b:1627  f:1115  j: 291  n:2952  r:4158  v: 694  z: 434
c:2028  g:1644  k:1505  o:4438  s:6665  w:1039
d:2453  h:1760  l:3371  p:2019  t:3295  x: 288
</code></pre></div></div>

<p>When encoding a word, I can save space by spending fewer bits on frequent
letters like <code class="language-plaintext highlighter-rouge">e</code> at the cost of spending more bits on infrequent letters
like <code class="language-plaintext highlighter-rouge">q</code>. There are multiple approaches, but the simplest is <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman
coding</a>. It’s not the most efficient, but it’s so easy I can
almost code it in my sleep.</p>

<p>While my ultimate target is C, I did the frequency analysis, explored the
problem space, and implemented my compressors in Python. I don’t normally
like to use Python, but it <em>is</em> good for one-shot, disposable data
science-y stuff like this. The decompressor will be implemented in C,
partially via meta-programming: Python code generating my C code. Here’s
my letter histogram code:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">collections</span>
<span class="kn">import</span> <span class="nn">itertools</span>
<span class="kn">import</span> <span class="nn">sys</span>

<span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">sys</span><span class="p">.</span><span class="n">stdin</span><span class="p">]</span>
<span class="n">hist</span> <span class="o">=</span> <span class="n">collections</span><span class="p">.</span><span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">itertools</span><span class="p">.</span><span class="n">chain</span><span class="p">(</span><span class="o">*</span><span class="n">words</span><span class="p">):</span>
    <span class="n">hist</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>

<p>To build a Huffman coding tree, I’ll need a min-heap (priority queue)
initially filled with nodes representing each letter and its frequency.
While the heap has more than one element, I pop off the two lowest
frequency nodes, create a new parent node with the sum of their
frequencies, and push it into the heap. When the heap has one element, the
remaining element is the root of the Huffman coding tree.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">heapq</span>

<span class="k">def</span> <span class="nf">huffman</span><span class="p">(</span><span class="n">hist</span><span class="p">):</span>
    <span class="n">heap</span> <span class="o">=</span> <span class="p">[(</span><span class="n">n</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">hist</span><span class="p">.</span><span class="n">items</span><span class="p">()]</span>
    <span class="n">heapq</span><span class="p">.</span><span class="n">heapify</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span>
    <span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappop</span><span class="p">(</span><span class="n">heap</span><span class="p">),</span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappop</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span>
        <span class="n">node</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
        <span class="n">heapq</span><span class="p">.</span><span class="n">heappush</span><span class="p">(</span><span class="n">heap</span><span class="p">,</span> <span class="n">node</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">heap</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>

<span class="n">tree</span> <span class="o">=</span> <span class="n">huffman</span><span class="p">(</span><span class="n">hist</span><span class="p">)</span>
</code></pre></div></div>

<p>(By the way, I love that <code class="language-plaintext highlighter-rouge">heapq</code> operates directly on a plain <code class="language-plaintext highlighter-rouge">list</code>
rather than being its own data structure.) This produces the following
Huffman coding tree (via <code class="language-plaintext highlighter-rouge">pprint</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((('e', 's'),
  (('t', 'l'), (('g', ('v', 'w')), ('h', 'm')))),
 ((('i', ('p', 'c')),
   ('r', ('y', ('f', ('z', ('j', ('q', 'x'))))))),
  (('o', ('d', 'u')), ('a', ('n', ('k', 'b'))))))
</code></pre></div></div>

<p>It would be more useful to actually see the encodings.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="s">""</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">prefix</span><span class="o">+</span><span class="s">"0"</span><span class="p">)</span> <span class="o">+</span> \
               <span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">prefix</span><span class="o">+</span><span class="s">"1"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">[(</span><span class="n">tree</span><span class="p">,</span> <span class="n">prefix</span><span class="p">)]</span>
</code></pre></div></div>

<p>I used <code class="language-plaintext highlighter-rouge">isinstance</code> to distinguish leaves (<code class="language-plaintext highlighter-rouge">str</code>) from internal nodes
(<code class="language-plaintext highlighter-rouge">tuple</code>). With <code class="language-plaintext highlighter-rouge">sorted(flatten(tree))</code>, I get something like Morse Code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[('a', '1110'),       ('j', '10111110'),   ('s', '001'),
 ('b', '111111'),     ('k', '111110'),     ('t', '0100'),
 ('c', '10011'),      ('l', '0101'),       ('u', '11011'),
 ('d', '11010'),      ('m', '01111'),      ('v', '011010'),
 ('e', '000'),        ('n', '11110'),      ('w', '011011'),
 ('f', '101110'),     ('o', '1100'),       ('x', '101111111'),
 ('g', '01100'),      ('p', '10010'),      ('y', '10110'),
 ('h', '01110'),      ('q', '101111110'),  ('z', '1011110')]
 ('i', '1000'),       ('r', '1010'),
</code></pre></div></div>
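
<p>As a quick sanity check, here’s a self-contained run of <code class="language-plaintext highlighter-rouge">flatten</code> on a
tiny tree. The three-leaf tree and the <code class="language-plaintext highlighter-rouge">is_prefix_free</code> helper are made up
for illustration and aren’t derived from the Wordle list. The point is
that no code is a prefix of another, which is what lets the bit stream be
decoded without separators.</p>

```python
def flatten(tree, prefix=""):
    # Internal nodes are 2-tuples; anything else is a leaf symbol.
    if isinstance(tree, tuple):
        return flatten(tree[0], prefix + "0") + \
               flatten(tree[1], prefix + "1")
    return [(tree, prefix)]

def is_prefix_free(codes):
    # True when no code is the beginning of a different code.
    values = list(codes.values())
    return not any(a != b and b.startswith(a) for a in values for b in values)

# Hypothetical toy tree: "e" is the most common symbol, so it sits highest.
toy = ("e", ("s", "a"))
toy_codes = dict(flatten(toy))
print(sorted(toy_codes.items()))  # [('a', '11'), ('e', '0'), ('s', '10')]
print(is_prefix_free(toy_codes))  # True
```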

<p>In terms of encoded bit length, which words are the shortest and longest?</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">codes</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">))</span>
<span class="n">lengths</span> <span class="o">=</span> <span class="p">[(</span><span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">w</span><span class="p">),</span> <span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">]</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">min(lengths)</code> is “esses” at 15 bits, and <code class="language-plaintext highlighter-rouge">max(lengths)</code> is “qajaq” at 34
bits. In other words, the worst case is worse than the compact, 24-bit
representation! However, the total is better: <code class="language-plaintext highlighter-rouge">sum(w[0] for w in lengths)</code>
reports 281,956 bits, or 35,245 bytes. Packed appropriately, that shaves
off ~3.5kB, though it comes at the cost of losing random access, and
therefore binary search.</p>

<p>Speaking of bit packing, I’m ready to compress the entire word list into a
bit stream:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bits</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">)</span>
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">bits</code> begins with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11101110011100001101011101110010110001000111011101...
</code></pre></div></div>

<p>On the C side I’ll pack these into 32-bit integers, least significant bit
first. I abused <code class="language-plaintext highlighter-rouge">textwrap</code> to dice it up, and I also need to reverse each
set of bits before converting to an integer.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u32</span> <span class="o">=</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">b</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">2</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">textwrap</span><span class="p">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">bits</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">32</span><span class="p">)]</span>
</code></pre></div></div>
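
<p>The reversal is easy to get backwards, so a round-trip check seems
worthwhile. This sketch packs the first 40 bits shown above the same way,
then reads each bit back with the shift-and-mask indexing I plan to use on
the C side. The helper names <code class="language-plaintext highlighter-rouge">pack</code> and <code class="language-plaintext highlighter-rouge">bit</code> are just for this example.</p>

```python
import textwrap

def pack(bits):
    # Chop into 32-character groups, then reverse each group so the first
    # bit of the stream lands in the least significant position.
    return [int(b[::-1], 2) for b in textwrap.wrap(bits, width=32)]

def bit(u32, n):
    # Bit n of the stream: integer n/32, bit position n%32.
    return (u32[n >> 5] >> (n & 31)) & 1

# 40 bits, so the second integer is only partly filled.
sample = "1110111001110000110101110111001011000100"
u32 = pack(sample)
unpacked = "".join(str(bit(u32, n)) for n in range(len(sample)))

print([f"0x{u:08x}" for u in u32])  # ['0x4eeb0e77', '0x00000023']
print(unpacked == sample)           # True
```

<p>Any unused high bits in the last integer come out as zeros, so the
decoder needs to know where to stop by some other means, such as the word
count.</p>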

<p>I now have my compressed data as a sequence of 32-bit integers. Next, some
meta-programming:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"static const uint32_t words[</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">u32</span><span class="p">)</span><span class="si">}</span><span class="s">] ="</span><span class="p">,</span> <span class="s">"{"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">u</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">u32</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">i</span><span class="o">%</span><span class="mi">6</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">    "</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"0x</span><span class="si">{</span><span class="n">u</span><span class="si">:</span><span class="mi">08</span><span class="n">x</span><span class="si">}</span><span class="s">,"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">};"</span><span class="p">)</span>
</code></pre></div></div>

<p>That produces a C table, the beginnings of my decompressor. The array
length isn’t necessary since the C compiler can figure it out, but being
explicit allows human readers to know the size at a glance, too. Observe
how the final 32-bit integer isn’t entirely filled.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">words</span><span class="p">[</span><span class="mi">8812</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mh">0x4eeb0e77</span><span class="p">,</span><span class="mh">0xb8caee23</span><span class="p">,</span><span class="mh">0xffb892bb</span><span class="p">,</span><span class="mh">0x397fddf2</span><span class="p">,</span><span class="mh">0xddfcbfee</span><span class="p">,</span><span class="mh">0x5ff7997f</span><span class="p">,</span>
    <span class="c1">// ...</span>
    <span class="mh">0x7b4e66bd</span><span class="p">,</span><span class="mh">0x35ebcccd</span><span class="p">,</span><span class="mh">0x8f9af60f</span><span class="p">,</span><span class="mh">0x0000000c</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Now, how to go about building the rest of the decompressor? I have a
Huffman coding tree, which is <em>an awful lot</em> <a href="/blog/2020/12/31/">like a state machine</a>,
eh? I can even have Python generate a state transition table from the
Huffman tree:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">states</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">):</span>
        <span class="n">child</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="n">child</span>
        <span class="n">states</span><span class="p">.</span><span class="n">extend</span><span class="p">((</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">))</span>
        <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">states</span><span class="p">,</span> <span class="n">child</span><span class="o">+</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">states</span><span class="p">,</span> <span class="n">child</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="nb">ord</span><span class="p">(</span><span class="n">tree</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">states</span>

<span class="n">states</span> <span class="o">=</span> <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="p">[</span><span class="bp">None</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>The central idea: positive entries are leaves, and negative entries are
internal nodes. The negated value is the index of the left child, with the
right child immediately following. In <code class="language-plaintext highlighter-rouge">transitions</code>, the caller reserves
space in the state table for callees, hence starting with <code class="language-plaintext highlighter-rouge">[None]</code>. I’ll
show the actual table in C form after some more meta-programming:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"static const int8_t states[</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">)</span><span class="si">}</span><span class="s">] ="</span><span class="p">,</span> <span class="s">"{"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">states</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">i</span><span class="o">%</span><span class="mi">12</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">    "</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">s</span><span class="si">:</span><span class="mi">4</span><span class="si">}</span><span class="s">,"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">};"</span><span class="p">)</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">int8_t</code> since I know these values will all fit in an octet, and
it must be signed because of the negatives. The result:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int8_t</span> <span class="n">states</span><span class="p">[</span><span class="mi">51</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
      <span class="o">-</span><span class="mi">1</span><span class="p">,</span>  <span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">19</span><span class="p">,</span>  <span class="o">-</span><span class="mi">5</span><span class="p">,</span>  <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="mi">101</span><span class="p">,</span> <span class="mi">115</span><span class="p">,</span>  <span class="o">-</span><span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">11</span><span class="p">,</span> <span class="mi">116</span><span class="p">,</span> <span class="mi">108</span><span class="p">,</span> <span class="o">-</span><span class="mi">13</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">17</span><span class="p">,</span> <span class="mi">103</span><span class="p">,</span> <span class="o">-</span><span class="mi">15</span><span class="p">,</span> <span class="mi">118</span><span class="p">,</span> <span class="mi">119</span><span class="p">,</span> <span class="mi">104</span><span class="p">,</span> <span class="mi">109</span><span class="p">,</span> <span class="o">-</span><span class="mi">21</span><span class="p">,</span> <span class="o">-</span><span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">27</span><span class="p">,</span> <span class="mi">105</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">25</span><span class="p">,</span> <span class="mi">112</span><span class="p">,</span>  <span class="mi">99</span><span class="p">,</span> <span class="mi">114</span><span class="p">,</span> <span class="o">-</span><span class="mi">29</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="o">-</span><span class="mi">31</span><span class="p">,</span> <span class="mi">102</span><span class="p">,</span> <span class="o">-</span><span class="mi">33</span><span class="p">,</span> <span class="mi">122</span><span class="p">,</span> <span class="o">-</span><span class="mi">35</span><span class="p">,</span> <span class="mi">106</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">37</span><span class="p">,</span> <span class="mi">113</span><span class="p">,</span> <span class="mi">120</span><span class="p">,</span> <span class="o">-</span><span class="mi">41</span><span class="p">,</span> <span class="o">-</span><span class="mi">45</span><span class="p">,</span> <span class="mi">111</span><span class="p">,</span> <span class="o">-</span><span class="mi">43</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">117</span><span class="p">,</span>  <span class="mi">97</span><span class="p">,</span> <span class="o">-</span><span class="mi">47</span><span class="p">,</span> <span class="mi">110</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">49</span><span class="p">,</span> <span class="mi">107</span><span class="p">,</span>  <span class="mi">98</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The first node is -1, meaning if you read a 0 bit then transition to state
1, else state 2 (i.e. immediately following 1). The decompressor reads one
bit at a time, walking the state table until it hits a positive value,
which is an ASCII code. I’ve decided on this function prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">next</span><span class="p">(</span><span class="kt">char</span> <span class="n">word</span><span class="p">[</span><span class="mi">5</span><span class="p">],</span> <span class="kt">int32_t</span> <span class="n">n</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">n</code> is the bit index, which starts at zero. The function decodes the
word at the given index, then returns the bit index for the next word.
Callers can iterate the entire word list without decompressing the whole
list at once. Finally the decompressor code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">next</span><span class="p">(</span><span class="kt">char</span> <span class="n">word</span><span class="p">[</span><span class="mi">5</span><span class="p">],</span> <span class="kt">int32_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">5</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">n</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">n</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// next bit</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="n">word</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">n</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
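
<p>Before compiling anything, the same walk can be modeled in Python against
the generated table. This is only a sketch: <code class="language-plaintext highlighter-rouge">decode</code> is a hypothetical
helper that reads the bit string directly rather than the packed integers,
and the table is copied from the C listing above.</p>

```python
# State table copied from the generated C array above.
states = [
     -1,  -3, -19,  -5,  -7, 101, 115,  -9, -11, 116, 108, -13,
    -17, 103, -15, 118, 119, 104, 109, -21, -39, -23, -27, 105,
    -25, 112,  99, 114, -29, 121, -31, 102, -33, 122, -35, 106,
    -37, 113, 120, -41, -45, 111, -43, 100, 117,  97, -47, 110,
    -49, 107,  98,
]

def decode(bits, n):
    # Mirrors next(): decode five letters starting at bit index n, then
    # return the word and the bit index where the following word begins.
    word = ""
    for _ in range(5):
        state = 0
        while states[state] < 0:        # negative entry: internal node
            state = int(bits[n]) - states[state]
            n += 1
        word += chr(states[state])      # positive entry: ASCII code
    return word, n

# The first 50 bits of the stream, as printed earlier.
bits = "11101110011100001101011101110010110001000111011101"
first, n = decode(bits, 0)
second, n = decode(bits, n)
print(first, second, n)  # aahed aalii 41
```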

<p>When compiled, this is about 80 bytes of instructions, both x86-64 and
ARM64. This, along with the 51 bytes for the state table, should be
counted against the compression size. With the 35,248-byte <code class="language-plaintext highlighter-rouge">words</code>
table, that’s about 35,379 bytes total.</p>

<p>Trying it out, this program indeed reproduces the original word list:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">word</span><span class="p">[]</span> <span class="o">=</span> <span class="s">".....</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">12972</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Searching 12,972 words linearly isn’t too bad, even for an old 16-bit
machine. However, if you really need to speed it up, you could build a
little run time index to track various bit positions in the list. For
example, the first word starting with <code class="language-plaintext highlighter-rouge">b</code> is at bit offset 15,743. If the
word I’m looking up begins with <code class="language-plaintext highlighter-rouge">b</code> then I can start there and stop at the
first <code class="language-plaintext highlighter-rouge">c</code>, decompressing just 909 words.</p>
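
<p>Such an index is cheap to compute from the code table. Here’s a minimal
sketch, assuming a sorted word list. <code class="language-plaintext highlighter-rouge">build_index</code> is a hypothetical helper,
and the miniature word list and codes are invented so the result is easy
to check by hand. The same offsets could just as well be emitted as
another C table by the meta-program.</p>

```python
def build_index(words, codes):
    # Map each initial letter to the bit offset of the first word that
    # starts with it. Assumes words is sorted.
    index = {}
    offset = 0
    for w in words:
        index.setdefault(w[0], offset)
        offset += sum(len(codes[c]) for c in w)
    return index

# Hypothetical miniature code table and word list, for illustration only.
toy_codes = {"a": "0", "b": "10", "c": "11"}
toy_words = ["aabca", "abbba", "baaac", "bcccb", "cabba"]
toy_index = build_index(toy_words, toy_codes)
print(toy_index)  # {'a': 0, 'b': 15, 'c': 32}
```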

<h3 id="taking-it-to-the-next-level-run-length-encoding">Taking it to the next level: run-length encoding</h3>

<p>Here’s the 100-word word list sample again. The sorting is deliberate:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
</code></pre></div></div>

<p>If I look at words column-wise, I see a long run of <code class="language-plaintext highlighter-rouge">a</code>, then a long run
of <code class="language-plaintext highlighter-rouge">b</code>, etc. Even the second column has long runs. I should really exploit
this somehow. The first scheme would have worked equally well on a
shuffled list as a sorted list, which is an indication that it’s storing
unnecessary information, namely the word list order. (Rule of thumb:
Compression should work better on sorted inputs.)</p>

<p>For this second scheme, I’ll pivot the whole list so that I can encode it
in column-order. (This is roughly how one part of bzip2 works, by the
way.) I’ll use run-length encoding (RLE) to communicate “91 ‘a’, 135 ‘b’,
etc.”, then I’ll encode these RLE tokens using Huffman coding, per the
first scheme, since there will be lots of repeated tokens.</p>

<p>First, pivot the word list:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pivot</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">))</span>
</code></pre></div></div>

<p>Next compute the RLE token stream. The stream works in pairs, first
indicating a letter (1–26), then the run length.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokens</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">offset</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot</span><span class="p">):</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">pivot</span><span class="p">[</span><span class="n">offset</span><span class="p">]</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">offset</span>
    <span class="k">while</span> <span class="n">offset</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot</span><span class="p">)</span> <span class="ow">and</span> <span class="n">pivot</span><span class="p">[</span><span class="n">offset</span><span class="p">]</span> <span class="o">==</span> <span class="n">c</span><span class="p">:</span>
        <span class="n">offset</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">-</span> <span class="nb">ord</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">offset</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
</code></pre></div></div>

<p>I’ve biased the letter representation by 1 (i.e. 1–26 instead of 0–25)
since I’m going to encode all the tokens using the same Huffman tree.
There are no zero-length runs, so a letter token of 0 could never double
as a run length, and I want there to be as few unique tokens as possible.
(Exercise for the reader: Does compression improve with two distinct
Huffman trees, one for letters and the other for runs?)</p>
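
<p>For anyone attempting that exercise, one way to measure it without
building the trees is sketched below. It relies on the fact that the total
Huffman-encoded size equals the sum of the merged weights during tree
construction. The names <code class="language-plaintext highlighter-rouge">huffman_bits</code> and <code class="language-plaintext highlighter-rouge">compare</code> are hypothetical, and
the sketch ignores the cost of a second state table.</p>

```python
import collections
import heapq

def huffman_bits(symbols):
    # Total encoded size in bits. Each merge of the two rarest subtrees
    # adds one bit to every symbol beneath them, i.e. their combined count.
    heap = list(collections.Counter(symbols).values())
    if len(heap) == 1:
        return len(symbols)  # degenerate tree: assume one bit per symbol
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def compare(tokens):
    # Returns (bits with one shared tree, bits with two separate trees).
    letters, runs = tokens[0::2], tokens[1::2]
    return huffman_bits(tokens), huffman_bits(letters) + huffman_bits(runs)

# Made-up token stream: letter, run, letter, run, ...
print(compare([1, 3, 2, 3, 1, 3]))  # (9, 6)
```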

<p><code class="language-plaintext highlighter-rouge">tokens</code> looks like so (e.g. 737 ‘a’, 909 ‘b’, …):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1, 737, 2, 909, 3, 922, 4, 685, 5, 303, 6, 598, ...]
</code></pre></div></div>
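
<p>Since the order of operations matters here, it’s worth confirming the
transform is reversible before compressing anything. The sketch below
round-trips the first four words of the sample list. <code class="language-plaintext highlighter-rouge">rle_encode</code> is the
same tokenizer written with <code class="language-plaintext highlighter-rouge">itertools.groupby</code>, while <code class="language-plaintext highlighter-rouge">rle_decode</code> and
<code class="language-plaintext highlighter-rouge">unpivot</code> are hypothetical helpers for this check.</p>

```python
import itertools

def rle_encode(pivot):
    # Pairs of (letter biased to 1..26, run length), flattened into one list.
    tokens = []
    for c, run in itertools.groupby(pivot):
        tokens += [ord(c) - ord("a") + 1, len(list(run))]
    return tokens

def rle_decode(tokens):
    # Even positions are letters, odd positions are run lengths.
    return "".join(chr(t + ord("a") - 1) * n
                   for t, n in zip(tokens[0::2], tokens[1::2]))

def unpivot(pivot, nwords):
    # Column i occupies pivot[i*nwords:(i+1)*nwords]; word j takes the
    # j-th letter of each column.
    return ["".join(pivot[i*nwords + j] for i in range(5))
            for j in range(nwords)]

words = ["abbey", "abide", "about", "above"]
pivot = "".join("".join(w[i] for w in words) for i in range(5))
tokens = rle_encode(pivot)
restored = unpivot(rle_decode(tokens), len(words))

print(pivot)              # aaaabbbbbiooeduvyete
print(tokens[:4])         # [1, 4, 2, 5]
print(restored == words)  # True
```

<p>Note the run of five <code class="language-plaintext highlighter-rouge">b</code>: runs are free to cross column boundaries, so
the decoder must know the word count in order to split the columns
again.</p>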

<p>The original Wordle list results in 139 unique tokens. A few tokens appear
many times, but most appear only once. Reusing my Huffman coding tree
builder from before:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tree</span> <span class="o">=</span> <span class="n">huffman</span><span class="p">(</span><span class="n">collections</span><span class="p">.</span><span class="n">Counter</span><span class="p">(</span><span class="n">tokens</span><span class="p">))</span>
</code></pre></div></div>

<p>This makes for a more complex and interesting tree:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1,
 ((((18, 20), (25, (((10, 24), (26, 22)), 8))),
   (5,
    ((11,
      ((23,
        ((17,
          (((35, (46, 76)), ((82, 93), (104, 111))),
           (((165, 168), 27), (28, (((30, 39), 31), 38))))),
         ((((((40, 41), ((44, 48), 45)),
             ((53, (54, 56)), 55)),
            ((((57, 59), 58), ((60, 61), (62, 63))),
             ((64, (65, 66)), ((67, 70), 68)))),
           (((((71, 75), 74), (77, (78, 79))),
             (((80, 85), 87), 81)),
            ((((90, 91), (92, 97)), (96, (99, 100))),
             (((101, 103), 102),
              ((105, 106), (109, 110)))))),
          ((((((113, 114), 117), ((120, 121), (125, 129))),
             (((130, 133), (137, 139)), (138, (140, 142)))),
            ((((144, 145), (147, 153)), (148, (166, 175))),
             (((181, 183), (187, 189)),
              ((193, 202), (220, 242))))),
           (((((262, 303), (325, 376)),
              ((413, 489), (577, 598))),
             (((628, 638), (685, 693)),
              ((737, 815), (859, 909)))),
            ((((922, 1565), 29), 32), (34, (33, 43)))))))),
       6)),
     3))),
  ((19, 2),
   ((4, (15, (21, 16))), ((14, 9), (12, (13, 7)))))))
</code></pre></div></div>

<p>Peeking at the first 21 elements of <code class="language-plaintext highlighter-rouge">sorted(flatten(tree))</code>, which chops
off the long tail of large-valued, single-occurrence tokens:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(1, '0'),            (8, '100111'),       (15, '111010'),
 (2, '1101'),         (9, '111101'),       (16, '1110111'),
 (3, '10111'),        (10, '10011000'),    (17, '1011010100'),
 (4, '11100'),        (11, '101100'),      (18, '10000'),
 (5, '1010'),         (12, '111110'),      (19, '1100'),
 (6, '1011011'),      (13, '1111110'),     (20, '10001'),
 (7, '1111111'),      (14, '111100'),      (21, '1110110')]
</code></pre></div></div>

<p>Huffman-encoding the RLE stream is more straightforward:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">codes</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">))</span>
<span class="n">bits</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">tokens</span><span class="p">)</span>
</code></pre></div></div>

<p>This time <code class="language-plaintext highlighter-rouge">len(bits)</code> is 164,958, or 20,620 bytes! A huge difference,
around 40% additional savings!</p>
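
<p>The arithmetic behind that comparison, using only the figures quoted so
far, is below. It counts the bit streams alone, so neither result includes
the decoder or its state table.</p>

```python
WORDS = 12972
packed_bytes = WORDS * 3           # 24 bits per word
huffman_bytes = -(-281956 // 8)    # first scheme, rounded up to whole bytes
rle_bytes = -(-164958 // 8)        # second scheme, rounded up to whole bytes

vs_huffman = 100 * (1 - rle_bytes / huffman_bytes)
vs_packed = 100 * (1 - rle_bytes / packed_bytes)
print(f"{rle_bytes} bytes: {vs_huffman:.1f}% smaller than the first scheme, "
      f"{vs_packed:.1f}% smaller than 24-bit packing")
```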

<p>Slicing and dicing 32-bit integers and printing the table works the same
as before. However, this time the state table has larger values (e.g. that
run of 909), and so the state table will be <code class="language-plaintext highlighter-rouge">int16_t</code>. I copy-pasted the
original meta-programming code and made the appropriate adjustments:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int16_t</span> <span class="n">states</span><span class="p">[</span><span class="mi">277</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
      <span class="o">-</span><span class="mi">1</span><span class="p">,</span>   <span class="mi">1</span><span class="p">,</span>  <span class="o">-</span><span class="mi">3</span><span class="p">,</span>  <span class="o">-</span><span class="mi">5</span><span class="p">,</span><span class="o">-</span><span class="mi">257</span><span class="p">,</span>  <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="o">-</span><span class="mi">21</span><span class="p">,</span>  <span class="o">-</span><span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">11</span><span class="p">,</span>  <span class="mi">18</span><span class="p">,</span>  <span class="mi">20</span><span class="p">,</span>  <span class="mi">25</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">13</span><span class="p">,</span> <span class="o">-</span><span class="mi">15</span><span class="p">,</span>   <span class="mi">8</span><span class="p">,</span> <span class="o">-</span><span class="mi">17</span><span class="p">,</span> <span class="o">-</span><span class="mi">19</span><span class="p">,</span>  <span class="mi">10</span><span class="p">,</span>  <span class="mi">24</span><span class="p">,</span>  <span class="mi">26</span><span class="p">,</span>  <span class="mi">22</span><span class="p">,</span>   <span class="mi">5</span><span class="p">,</span> <span class="o">-</span><span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">25</span><span class="p">,</span>
       <span class="mi">3</span><span class="p">,</span>  <span class="mi">11</span><span class="p">,</span> <span class="o">-</span><span class="mi">27</span><span class="p">,</span> <span class="o">-</span><span class="mi">29</span><span class="p">,</span>   <span class="mi">6</span><span class="p">,</span>  <span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">31</span><span class="p">,</span> <span class="o">-</span><span class="mi">33</span><span class="p">,</span> <span class="o">-</span><span class="mi">63</span><span class="p">,</span>  <span class="mi">17</span><span class="p">,</span> <span class="o">-</span><span class="mi">35</span><span class="p">,</span> <span class="o">-</span><span class="mi">37</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">49</span><span class="p">,</span> <span class="o">-</span><span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">43</span><span class="p">,</span>  <span class="mi">35</span><span class="p">,</span> <span class="o">-</span><span class="mi">41</span><span class="p">,</span>  <span class="mi">46</span><span class="p">,</span>  <span class="mi">76</span><span class="p">,</span> <span class="o">-</span><span class="mi">45</span><span class="p">,</span> <span class="o">-</span><span class="mi">47</span><span class="p">,</span>  <span class="mi">82</span><span class="p">,</span>  <span class="mi">93</span><span class="p">,</span> <span class="mi">104</span><span class="p">,</span>
     <span class="mi">111</span><span class="p">,</span> <span class="o">-</span><span class="mi">51</span><span class="p">,</span> <span class="o">-</span><span class="mi">55</span><span class="p">,</span> <span class="o">-</span><span class="mi">53</span><span class="p">,</span>  <span class="mi">27</span><span class="p">,</span> <span class="mi">165</span><span class="p">,</span> <span class="mi">168</span><span class="p">,</span>  <span class="mi">28</span><span class="p">,</span> <span class="o">-</span><span class="mi">57</span><span class="p">,</span> <span class="o">-</span><span class="mi">59</span><span class="p">,</span>  <span class="mi">38</span><span class="p">,</span> <span class="o">-</span><span class="mi">61</span><span class="p">,</span>
      <span class="mi">31</span><span class="p">,</span>  <span class="mi">30</span><span class="p">,</span>  <span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">65</span><span class="p">,</span><span class="o">-</span><span class="mi">155</span><span class="p">,</span> <span class="o">-</span><span class="mi">67</span><span class="p">,</span><span class="o">-</span><span class="mi">109</span><span class="p">,</span> <span class="o">-</span><span class="mi">69</span><span class="p">,</span> <span class="o">-</span><span class="mi">85</span><span class="p">,</span> <span class="o">-</span><span class="mi">71</span><span class="p">,</span> <span class="o">-</span><span class="mi">79</span><span class="p">,</span> <span class="o">-</span><span class="mi">73</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">75</span><span class="p">,</span>  <span class="mi">40</span><span class="p">,</span>  <span class="mi">41</span><span class="p">,</span> <span class="o">-</span><span class="mi">77</span><span class="p">,</span>  <span class="mi">45</span><span class="p">,</span>  <span class="mi">44</span><span class="p">,</span>  <span class="mi">48</span><span class="p">,</span> <span class="o">-</span><span class="mi">81</span><span class="p">,</span>  <span class="mi">55</span><span class="p">,</span>  <span class="mi">53</span><span class="p">,</span> <span class="o">-</span><span class="mi">83</span><span class="p">,</span>  <span class="mi">54</span><span class="p">,</span>
      <span class="mi">56</span><span class="p">,</span> <span class="o">-</span><span class="mi">87</span><span class="p">,</span> <span class="o">-</span><span class="mi">99</span><span class="p">,</span> <span class="o">-</span><span class="mi">89</span><span class="p">,</span> <span class="o">-</span><span class="mi">93</span><span class="p">,</span> <span class="o">-</span><span class="mi">91</span><span class="p">,</span>  <span class="mi">58</span><span class="p">,</span>  <span class="mi">57</span><span class="p">,</span>  <span class="mi">59</span><span class="p">,</span> <span class="o">-</span><span class="mi">95</span><span class="p">,</span> <span class="o">-</span><span class="mi">97</span><span class="p">,</span>  <span class="mi">60</span><span class="p">,</span>
      <span class="mi">61</span><span class="p">,</span>  <span class="mi">62</span><span class="p">,</span>  <span class="mi">63</span><span class="p">,</span><span class="o">-</span><span class="mi">101</span><span class="p">,</span><span class="o">-</span><span class="mi">105</span><span class="p">,</span>  <span class="mi">64</span><span class="p">,</span><span class="o">-</span><span class="mi">103</span><span class="p">,</span>  <span class="mi">65</span><span class="p">,</span>  <span class="mi">66</span><span class="p">,</span><span class="o">-</span><span class="mi">107</span><span class="p">,</span>  <span class="mi">68</span><span class="p">,</span>  <span class="mi">67</span><span class="p">,</span>
      <span class="mi">70</span><span class="p">,</span><span class="o">-</span><span class="mi">111</span><span class="p">,</span><span class="o">-</span><span class="mi">129</span><span class="p">,</span><span class="o">-</span><span class="mi">113</span><span class="p">,</span><span class="o">-</span><span class="mi">123</span><span class="p">,</span><span class="o">-</span><span class="mi">115</span><span class="p">,</span><span class="o">-</span><span class="mi">119</span><span class="p">,</span><span class="o">-</span><span class="mi">117</span><span class="p">,</span>  <span class="mi">74</span><span class="p">,</span>  <span class="mi">71</span><span class="p">,</span>  <span class="mi">75</span><span class="p">,</span>  <span class="mi">77</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">121</span><span class="p">,</span>  <span class="mi">78</span><span class="p">,</span>  <span class="mi">79</span><span class="p">,</span><span class="o">-</span><span class="mi">125</span><span class="p">,</span>  <span class="mi">81</span><span class="p">,</span><span class="o">-</span><span class="mi">127</span><span class="p">,</span>  <span class="mi">87</span><span class="p">,</span>  <span class="mi">80</span><span class="p">,</span>  <span class="mi">85</span><span class="p">,</span><span class="o">-</span><span class="mi">131</span><span class="p">,</span><span class="o">-</span><span class="mi">143</span><span class="p">,</span><span class="o">-</span><span class="mi">133</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">139</span><span class="p">,</span><span class="o">-</span><span class="mi">135</span><span class="p">,</span><span class="o">-</span><span class="mi">137</span><span class="p">,</span>  <span class="mi">90</span><span class="p">,</span>  <span class="mi">91</span><span class="p">,</span>  <span class="mi">92</span><span class="p">,</span>  <span class="mi">97</span><span class="p">,</span>  <span class="mi">96</span><span class="p">,</span><span class="o">-</span><span class="mi">141</span><span class="p">,</span>  <span class="mi">99</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span><span class="o">-</span><span class="mi">145</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">149</span><span class="p">,</span><span class="o">-</span><span class="mi">147</span><span class="p">,</span> <span class="mi">102</span><span class="p">,</span> <span class="mi">101</span><span class="p">,</span> <span class="mi">103</span><span class="p">,</span><span class="o">-</span><span class="mi">151</span><span class="p">,</span><span class="o">-</span><span class="mi">153</span><span class="p">,</span> <span class="mi">105</span><span class="p">,</span> <span class="mi">106</span><span class="p">,</span> <span class="mi">109</span><span class="p">,</span> <span class="mi">110</span><span class="p">,</span><span class="o">-</span><span class="mi">157</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">213</span><span class="p">,</span><span class="o">-</span><span class="mi">159</span><span class="p">,</span><span class="o">-</span><span class="mi">185</span><span class="p">,</span><span class="o">-</span><span class="mi">161</span><span class="p">,</span><span class="o">-</span><span class="mi">173</span><span class="p">,</span><span class="o">-</span><span class="mi">163</span><span class="p">,</span><span class="o">-</span><span class="mi">167</span><span class="p">,</span><span class="o">-</span><span class="mi">165</span><span class="p">,</span> <span class="mi">117</span><span class="p">,</span> <span class="mi">113</span><span class="p">,</span> <span class="mi">114</span><span class="p">,</span><span class="o">-</span><span class="mi">169</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">171</span><span class="p">,</span> <span class="mi">120</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="mi">125</span><span class="p">,</span> <span class="mi">129</span><span class="p">,</span><span class="o">-</span><span class="mi">175</span><span class="p">,</span><span class="o">-</span><span class="mi">181</span><span class="p">,</span><span class="o">-</span><span class="mi">177</span><span class="p">,</span><span class="o">-</span><span class="mi">179</span><span class="p">,</span> <span class="mi">130</span><span class="p">,</span> <span class="mi">133</span><span class="p">,</span> <span class="mi">137</span><span class="p">,</span>
     <span class="mi">139</span><span class="p">,</span> <span class="mi">138</span><span class="p">,</span><span class="o">-</span><span class="mi">183</span><span class="p">,</span> <span class="mi">140</span><span class="p">,</span> <span class="mi">142</span><span class="p">,</span><span class="o">-</span><span class="mi">187</span><span class="p">,</span><span class="o">-</span><span class="mi">199</span><span class="p">,</span><span class="o">-</span><span class="mi">189</span><span class="p">,</span><span class="o">-</span><span class="mi">195</span><span class="p">,</span><span class="o">-</span><span class="mi">191</span><span class="p">,</span><span class="o">-</span><span class="mi">193</span><span class="p">,</span> <span class="mi">144</span><span class="p">,</span>
     <span class="mi">145</span><span class="p">,</span> <span class="mi">147</span><span class="p">,</span> <span class="mi">153</span><span class="p">,</span> <span class="mi">148</span><span class="p">,</span><span class="o">-</span><span class="mi">197</span><span class="p">,</span> <span class="mi">166</span><span class="p">,</span> <span class="mi">175</span><span class="p">,</span><span class="o">-</span><span class="mi">201</span><span class="p">,</span><span class="o">-</span><span class="mi">207</span><span class="p">,</span><span class="o">-</span><span class="mi">203</span><span class="p">,</span><span class="o">-</span><span class="mi">205</span><span class="p">,</span> <span class="mi">181</span><span class="p">,</span>
     <span class="mi">183</span><span class="p">,</span> <span class="mi">187</span><span class="p">,</span> <span class="mi">189</span><span class="p">,</span><span class="o">-</span><span class="mi">209</span><span class="p">,</span><span class="o">-</span><span class="mi">211</span><span class="p">,</span> <span class="mi">193</span><span class="p">,</span> <span class="mi">202</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="mi">242</span><span class="p">,</span><span class="o">-</span><span class="mi">215</span><span class="p">,</span><span class="o">-</span><span class="mi">245</span><span class="p">,</span><span class="o">-</span><span class="mi">217</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">231</span><span class="p">,</span><span class="o">-</span><span class="mi">219</span><span class="p">,</span><span class="o">-</span><span class="mi">225</span><span class="p">,</span><span class="o">-</span><span class="mi">221</span><span class="p">,</span><span class="o">-</span><span class="mi">223</span><span class="p">,</span> <span class="mi">262</span><span class="p">,</span> <span class="mi">303</span><span class="p">,</span> <span class="mi">325</span><span class="p">,</span> <span class="mi">376</span><span class="p">,</span><span class="o">-</span><span class="mi">227</span><span class="p">,</span><span class="o">-</span><span class="mi">229</span><span class="p">,</span> <span class="mi">413</span><span class="p">,</span>
     <span class="mi">489</span><span class="p">,</span> <span class="mi">577</span><span class="p">,</span> <span class="mi">598</span><span class="p">,</span><span class="o">-</span><span class="mi">233</span><span class="p">,</span><span class="o">-</span><span class="mi">239</span><span class="p">,</span><span class="o">-</span><span class="mi">235</span><span class="p">,</span><span class="o">-</span><span class="mi">237</span><span class="p">,</span> <span class="mi">628</span><span class="p">,</span> <span class="mi">638</span><span class="p">,</span> <span class="mi">685</span><span class="p">,</span> <span class="mi">693</span><span class="p">,</span><span class="o">-</span><span class="mi">241</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">243</span><span class="p">,</span> <span class="mi">737</span><span class="p">,</span> <span class="mi">815</span><span class="p">,</span> <span class="mi">859</span><span class="p">,</span> <span class="mi">909</span><span class="p">,</span><span class="o">-</span><span class="mi">247</span><span class="p">,</span><span class="o">-</span><span class="mi">253</span><span class="p">,</span><span class="o">-</span><span class="mi">249</span><span class="p">,</span>  <span class="mi">32</span><span class="p">,</span><span class="o">-</span><span class="mi">251</span><span class="p">,</span>  <span class="mi">29</span><span class="p">,</span> <span class="mi">922</span><span class="p">,</span>
    <span class="mi">1565</span><span class="p">,</span>  <span class="mi">34</span><span class="p">,</span><span class="o">-</span><span class="mi">255</span><span class="p">,</span>  <span class="mi">33</span><span class="p">,</span>  <span class="mi">43</span><span class="p">,</span><span class="o">-</span><span class="mi">259</span><span class="p">,</span><span class="o">-</span><span class="mi">261</span><span class="p">,</span>  <span class="mi">19</span><span class="p">,</span>   <span class="mi">2</span><span class="p">,</span><span class="o">-</span><span class="mi">263</span><span class="p">,</span><span class="o">-</span><span class="mi">269</span><span class="p">,</span>   <span class="mi">4</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">265</span><span class="p">,</span>  <span class="mi">15</span><span class="p">,</span><span class="o">-</span><span class="mi">267</span><span class="p">,</span>  <span class="mi">21</span><span class="p">,</span>  <span class="mi">16</span><span class="p">,</span><span class="o">-</span><span class="mi">271</span><span class="p">,</span><span class="o">-</span><span class="mi">273</span><span class="p">,</span>  <span class="mi">14</span><span class="p">,</span>   <span class="mi">9</span><span class="p">,</span>  <span class="mi">12</span><span class="p">,</span><span class="o">-</span><span class="mi">275</span><span class="p">,</span>  <span class="mi">13</span><span class="p">,</span>
       <span class="mi">7</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>(Since 277 is prime it will never wrap to a nice rectangle no matter what
width I plug in. Ugh.)</p>

<p>With column-wise compression it’s not possible to iterate a word at a
time. The entire list must be decompressed at once. The interface now
looks like so, where the caller supplies a <code class="language-plaintext highlighter-rouge">12972*5</code>-byte buffer to be
filled:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">decompress</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Exercise for the reader: Modify this to decompress into the 24-bit compact
form, so the caller only needs a <code class="language-plaintext highlighter-rouge">12972*3</code>-byte buffer.</p>
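
<p>One way to approach this exercise, sketched under the assumption that the
compact form is a word’s base-26 value (26<sup>5</sup> = 11,881,376, which
fits in 24 bits) stored in three little-endian bytes. The helper names
<code>load24</code>, <code>store24</code>, <code>push_letter</code>, and
<code>unpack5</code> are made up for this illustration. Since letters arrive
one column at a time, each word’s slot is built up incrementally:</p>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helpers for a 3-byte, little-endian slot per word.
static uint32_t load24(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1]<<8 | (uint32_t)p[2]<<16;
}

static void store24(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >>  0);
    p[1] = (unsigned char)(v >>  8);
    p[2] = (unsigned char)(v >> 16);
}

// Append one letter to a word's slot: v = v*26 + letter. Calling this once
// per column, in column order, builds the base-26 value Horner-style, so
// every slot must start out zeroed.
static void push_letter(unsigned char *slot, int c)
{
    assert(c >= 'a' && c <= 'z');
    store24(slot, load24(slot)*26 + (uint32_t)(c - 'a'));
}

// Expand a slot back into five letters (no null terminator is written).
static void unpack5(char *word, const unsigned char *slot)
{
    uint32_t v = load24(slot);
    for (int i = 4; i >= 0; i--) {
        word[i] = (char)('a' + v%26);
        v /= 26;
    }
}
```

<p>With something like this, a column-wise fill loop would call
<code>push_letter(buf + y*3, c)</code> on a zeroed buffer instead of storing
<code>c</code> directly.</p>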

<p>Here’s my decoder, much like before:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">decompress</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">164958</span><span class="p">;)</span> <span class="p">{</span>
        <span class="c1">// Decode letter</span>
        <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">i</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">+</span> <span class="mi">96</span><span class="p">;</span>

        <span class="c1">// Decode run-length</span>
        <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">i</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>

        <span class="c1">// Fill columns</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">,</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">buf</span><span class="p">[</span><span class="n">y</span><span class="o">*</span><span class="mi">5</span><span class="o">+</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">y</span> <span class="o">==</span> <span class="mi">12972</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
            <span class="n">x</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
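
<p>To see the table walk in isolation, here is a toy version with a made-up
five-entry table and a hand-packed bit stream. The names
<code>toy_states</code>, <code>toy_words</code>, and <code>toy_decode</code>
are illustrative only and not part of the real program. A negative entry is an
internal node whose children sit at index <code>-entry</code> (bit 0) and
<code>-entry + 1</code> (bit 1), and a non-negative entry is a decoded
value:</p>

```c
#include <stdint.h>

// Toy tree: the codes are "0" -> 7, "10" -> 8, "11" -> 9.
static const int16_t toy_states[5] = {-1, 7, -3, 8, 9};

// Bits are consumed least-significant first. The codes "0", "10", "11",
// "0" are packed as bit0=0, bit1=1, bit2=0, bit3=1, bit4=1, bit5=0.
static const uint32_t toy_words[1] = {0x1a};

// Decode one value starting at bit *i, advancing *i past its code.
static int toy_decode(int32_t *i)
{
    int state = 0;
    for (; toy_states[state] < 0; (*i)++) {
        int b = toy_words[*i>>5]>>(*i&31) & 1;
        state = b - toy_states[state];
    }
    return toy_states[state];
}
```

<p>Starting from bit 0, four successive calls yield 7, 8, 9, and 7 while
consuming 1, 2, 2, and 1 bits.</p>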

<p>And my new test exactly reproduces the original list:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">12972</span><span class="o">*</span><span class="mi">5L</span><span class="p">];</span>
    <span class="n">decompress</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>

    <span class="kt">char</span> <span class="n">word</span><span class="p">[]</span> <span class="o">=</span> <span class="s">".....</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">12972</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">buf</span><span class="o">+</span><span class="n">i</span><span class="o">*</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">);</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Totalling it up:</p>

<ul>
  <li>Compressed data is 20,620 bytes</li>
  <li>State table is 554 bytes</li>
  <li>Decompressor is about 200 bytes</li>
</ul>

<p>That’s a total of 21,374 bytes. Surprisingly this beats general purpose
compressors!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PROGRAM     VERSION   SIZE
bzip2 -9    1.0.8     33,752
gzip -9     1.10      30,338
zstd -19    1.4.8     27,098
brotli -9   1.0.9     26,031
xz -9e      5.2.5     16,656
lzip -9     1.22      16,608
</code></pre></div></div>

<p>Only <code class="language-plaintext highlighter-rouge">xz</code> and <code class="language-plaintext highlighter-rouge">lzip</code> come out ahead on the raw compressed data, but lose
if accounting for an embedded decompressor (on the order of 10kB). Clearly
there’s an advantage to customizing compression to a particular dataset.</p>

<p><em>Update</em>: <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCAKF7Hnc4nVKS%3D2adUjyiRb5yBZUdw5z0K_Fb9kFbaW5S6i7POw%40mail.gmail.com%3E">Johannes Rudolph has pointed out</a> a compression scheme for
a Game Boy Wordle clone last month that gets it <a href="http://alexanderpruss.blogspot.com/2022/02/game-boy-wordle-how-to-compress-12972.html">down to 17,871 bytes,
<em>and</em> supports iteration</a>. I improved on this scheme to <a href="https://github.com/skeeto/scratch/blob/master/misc/wordle.c">further
reduce it to 16,659 bytes</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Modifying the Middle of a zlib Stream</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/09/"/>
    <id>urn:uuid:804f0de4-93c7-3a70-d0d5-2b3b7192491f</id>
    <updated>2016-09-09T03:37:03Z</updated>
    <category term="c"/><category term="compression"/>
    <content type="html">
      <![CDATA[<p>I recently ran into a problem where I needed to modify bytes at the
beginning of an existing <a href="http://www.zlib.net/">zlib</a> stream. My program creates a
file in a format I do not control, and the file format has a header
indicating the total, uncompressed data size, followed immediately by
the data. The tricky part is that the <strong>header and data are zlib
compressed together</strong>, and I don’t know how much data there will be
until I’ve collected it all. Sometimes it’s many gigabytes. I don’t
know how to fill out the header when I start, and I can’t rewrite it
when I’m done since it’s compressed in the zlib stream … <em>or so I
thought</em>.</p>

<svg version="1.1" height="50" width="600">
  <rect fill="#dfd" width="149" height="48" x="1" y="1" stroke="black" stroke-width="2" />
  <rect fill="#ddf" width="449" height="48" x="150" y="1" stroke="black" stroke-width="2" />
  <text x="75" y="25" text-anchor="middle" dominant-baseline="central" font-size="22px" font-family="sans-serif">
    nelem
  </text>
  <text x="170" y="25" text-anchor="start" dominant-baseline="central" font-size="22px" font-family="sans-serif">
    samples[nelem]
  </text>
</svg>

<p>My original solution was not to compress anything until it gathered
the entirety of the data. The input would get concatenated into a huge
buffer, then finally compressed and written out. It’s not ideal,
because the program uses a lot more memory than it theoretically could,
especially if the data is highly compressible. It would be far better
to compress the data as it arrives and somehow update the header
later.</p>

<p>My first thought was to ask zlib to leave the header uncompressed,
then enable compression (<code class="language-plaintext highlighter-rouge">deflateParams()</code>) for the data. I’d work out
the magic offset and overwrite the uncompressed header bytes once I’m
done. There are two major issues with this, and I’ll address each:</p>

<ul>
  <li>
    <p>zlib includes a checksum (<a href="https://en.wikipedia.org/wiki/Adler-32">adler32</a>) at the end of the
data, and editing the stream would cause a mismatch. This is fairly
easy to correct thanks to adler32’s properties.</p>
  </li>
  <li>
    <p>zlib is an LZ77-family compressor and <a href="/blog/2014/11/22/">compression comes from
back-references</a> into past (and sometimes future) bytes of
decompressed output. Up to 32kB of data following the header could
reference bytes in the header as a dictionary. I would need to ask
zlib not to reference these bytes. Fortunately the zlib API is
intentionally designed for this, though for different purposes.</p>
  </li>
</ul>

<h3 id="fixing-the-checksum">Fixing the checksum</h3>

<p>Ignoring the second problem for a moment, I could fix the checksum by
computing it myself. When I overwrite my uncompressed header bytes, I
could also overwrite the checksum at the end of the compressed stream.
For illustration, here’s a simple example implementation of adler32
(from Wikipedia):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define MOD_ADLER 65521
</span>
<span class="kt">uint32_t</span>
<span class="nf">example_adler32</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">%</span> <span class="n">MOD_ADLER</span><span class="p">;</span>
        <span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="n">b</span> <span class="o">+</span> <span class="n">a</span><span class="p">)</span> <span class="o">%</span> <span class="n">MOD_ADLER</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">|</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you think about this for a moment, you may notice this puts me back
at square one. If I don’t know the header, then I don’t know the
checksum value at the end of the header, going into the data buffer.
I’d need to buffer all the data to compute the checksum. Fortunately
adler32 has the nice property that <strong>two checksums can be concatenated
as if they were one long stream</strong>. In a malicious context this is
known as a <a href="https://en.wikipedia.org/wiki/Length_extension_attack">length extension attack</a>, but it’s a real benefit
here.</p>

<p>It’s like the zlib authors anticipated my needs, because the zlib
library has a function <em>exactly</em> for this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">adler32_combine</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">adler1</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">adler2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len2</span><span class="p">);</span>
</code></pre></div></div>

<p>I just have to keep track of the data checksum <code class="language-plaintext highlighter-rouge">adler2</code> and I can
compute the proper checksum later.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">data_adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// initial value</span>
<span class="k">while</span> <span class="p">(</span><span class="n">processing_input</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">data_adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="n">data_adler</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
    <span class="n">total</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// ...</span>
<span class="kt">uint32_t</span> <span class="n">header_adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">header_adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="n">header_adler</span><span class="p">,</span> <span class="n">header</span><span class="p">,</span> <span class="n">header_size</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="n">adler</span> <span class="o">=</span> <span class="n">adler32_combine</span><span class="p">(</span><span class="n">header_adler</span><span class="p">,</span> <span class="n">data_adler</span><span class="p">,</span> <span class="n">total</span><span class="p">);</span>
</code></pre></div></div>
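
<p>The combining step is plain modular arithmetic on the two 16-bit halves
of each checksum. Here’s a minimal, self-contained sketch of that
arithmetic, assuming the adler32 definition from the example above. The
names <code class="language-plaintext highlighter-rouge">sketch_adler32</code> and <code class="language-plaintext highlighter-rouge">sketch_adler32_combine</code> are hypothetical
stand-ins so that it builds without zlib. Real code should call zlib’s own
functions.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define MOD_ADLER 65521u

// Continue a running adler32 over more data. Start with adler = 1.
uint32_t
sketch_adler32(uint32_t adler, const uint8_t *data, size_t len)
{
    uint32_t a = adler & 0xffff;
    uint32_t b = adler >> 16;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

// Combine the checksum of a first stream (adler1) with the checksum of
// a second stream (adler2, computed from the initial value 1) that is
// len2 bytes long. Both sums started at a = 1, so one of those 1s is
// subtracted from a, and it is subtracted len2 times from b.
uint32_t
sketch_adler32_combine(uint32_t adler1, uint32_t adler2, uint64_t len2)
{
    uint64_t a1 = adler1 & 0xffff, b1 = adler1 >> 16;
    uint64_t a2 = adler2 & 0xffff, b2 = adler2 >> 16;
    uint64_t rem = len2 % MOD_ADLER;
    uint64_t a = (a1 + a2 + MOD_ADLER - 1) % MOD_ADLER;
    uint64_t b = (b1 + b2 + rem * (a1 + MOD_ADLER - 1)) % MOD_ADLER;
    return (uint32_t)((b << 16) | a);
}
```

<p>Checksumming “Wiki” and “pedia” separately, each from the initial
value 1, and combining them with a length of 5 should give the same
result as checksumming “Wikipedia” in one pass.</p>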

<h3 id="preventing-back-references">Preventing back-references</h3>

<p>This part is more complicated and it helps to have some familiarity
with zlib. Every time zlib is asked to compress data, it’s given a
<a href="http://www.bolet.org/~pornin/deflate-flush.html">flush parameter</a>. Under normal operation, this value is always
<code class="language-plaintext highlighter-rouge">Z_NO_FLUSH</code> until the end of the stream, in which case it’s finalized
with <code class="language-plaintext highlighter-rouge">Z_FINISH</code>. Other flushing options force it to emit data sooner
at the cost of reduced compression ratio. This would primarily be used
to eliminate output latency on interactive streams (e.g. compressed
SSH sessions).</p>

<p>The necessary flush option for this situation is <code class="language-plaintext highlighter-rouge">Z_FULL_FLUSH</code>. It
forces out all output data and resets the dictionary: a fence.
<strong>Future inputs cannot reference anything before a full flush.</strong> Since
the header is uncompressed, it will not reference itself either.
Ignoring the checksum problem, I can safely modify these bytes.</p>
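
<p>To see why uncompressed bytes can be patched in place, it helps to look
at how DEFLATE stores them. Per RFC 1951, a “stored” block is a header
byte, a 16-bit little endian length, the one’s complement of that length,
and then the raw bytes. Here’s a rough sketch of that layout, assuming the
block starts on a byte boundary. The function name is hypothetical, and
this is not necessarily byte-for-byte what zlib emits.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Write one non-final stored block holding len raw bytes.
// Returns the number of bytes written, which is always 5 + len.
size_t
sketch_stored_block(uint8_t *out, const uint8_t *data, uint16_t len)
{
    uint16_t nlen = (uint16_t)~len;   // one's complement of LEN
    out[0] = 0x00;                    // BFINAL=0, BTYPE=00, padded to a byte
    out[1] = (uint8_t)(len & 0xff);   // LEN, least significant byte first
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)(nlen & 0xff);  // NLEN
    out[4] = (uint8_t)(nlen >> 8);
    memcpy(out + 5, data, len);       // payload is stored verbatim
    return 5 + (size_t)len;
}
```

<p>The payload sits verbatim at a fixed offset, so overwriting it with the
same number of bytes leaves the block structure intact. Only the trailing
checksum is affected.</p>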

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>To fully demonstrate all of this, I’ve put together an example using
one of my favorite image formats, <a href="https://en.wikipedia.org/wiki/Netpbm_format">Netpbm P6</a>.</p>

<ul>
  <li><a href="https://github.com/skeeto/zlib-mutate-demo">https://github.com/skeeto/zlib-mutate-demo</a></li>
</ul>

<p>In the P6 format, the image header is an ASCII description of the
image’s dimensions followed immediately by raw pixel data.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P6
width height
depth
[RGB bytes]
</code></pre></div></div>

<p>It’s a bit contrived, but it’s the project I used to work it all out.
The demo reads arbitrary raw byte data on standard input and uses it
to produce a zlib-compressed PPM file on standard output. It doesn’t
know the size of the input ahead of time, nor does it naively buffer
it all. There’s no dynamic allocation (except for what zlib does
internally), but the program can process arbitrarily large input. The
only requirement is that <strong>standard output is seekable</strong>. Using the
technique described above, it patches the header within the zlib
stream with the final image dimensions after the input has been
exhausted.</p>

<p>If you’re on a Debian system, you can use <code class="language-plaintext highlighter-rouge">zlib-flate</code> to decompress
raw zlib streams (gzip uses the same DEFLATE compression, but can’t read raw zlib). Alternatively
your system’s <code class="language-plaintext highlighter-rouge">openssl</code> program may have zlib support. Here’s running
it on itself as input. Remember, you can’t pipe it into zlib-flate
because the output needs to be seekable in order to write the header.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./zppm &lt; zppm &gt; out.ppmz
$ zlib-flate -uncompress &lt; out.ppmz &gt; out.ppm
</code></pre></div></div>

<p><img src="/img/zppm.png" alt="" /></p>

<p>Unfortunately due to the efficiency-mindedness of zlib, its use
requires careful bookkeeping that’s easy to get wrong. It’s a little
machine that at each step needs to be either fed more input or its
output buffer cleared. Even with all the error checking stripped away,
it’s still too much to go over in full here, but I’ll summarize the
parts.</p>

<p>First I process an empty buffer with compression disabled. The output
buffer will be discarded, so the input buffer could be left uninitialized,
but I don’t want to <a href="http://valgrind.org/">upset anyone</a>. All I need is the output
size, which I use to seek over the to-be-written header. I use
<code class="language-plaintext highlighter-rouge">Z_FULL_FLUSH</code> as described, and there’s no loop because I presume my
output buffer is easily big enough for this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">bufin</span><span class="p">[</span><span class="mi">4096</span><span class="p">];</span>
<span class="kt">char</span> <span class="n">bufout</span><span class="p">[</span><span class="mi">4096</span><span class="p">];</span>

<span class="n">z_stream</span> <span class="n">z</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">next_in</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufin</span><span class="p">,</span>
    <span class="p">.</span><span class="n">avail_in</span> <span class="o">=</span> <span class="n">HEADER_SIZE</span><span class="p">,</span>
    <span class="p">.</span><span class="n">next_out</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufout</span><span class="p">,</span>
    <span class="p">.</span><span class="n">avail_out</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">),</span>
<span class="p">};</span>
<span class="n">deflateInit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">,</span> <span class="n">Z_NO_COMPRESSION</span><span class="p">);</span>
<span class="n">memset</span><span class="p">(</span><span class="n">bufin</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">HEADER_SIZE</span><span class="p">);</span>
<span class="n">deflate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">,</span> <span class="n">Z_FULL_FLUSH</span><span class="p">);</span>
<span class="n">fseek</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">)</span> <span class="o">-</span> <span class="n">z</span><span class="p">.</span><span class="n">avail_out</span><span class="p">,</span> <span class="n">SEEK_SET</span><span class="p">);</span>
</code></pre></div></div>

<p>Next I enable compression and reset the checksum. This makes zlib
track the checksum for all of the non-header input. Otherwise I’d be
throwing away all its checksum work and repeating it myself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">deflateParams</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">,</span> <span class="n">Z_BEST_COMPRESSION</span><span class="p">,</span> <span class="n">Z_DEFAULT_STRATEGY</span><span class="p">);</span>
<span class="n">z</span><span class="p">.</span><span class="n">adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>I won’t include it in this article, but what follows is a standard
zlib compression loop, consuming all the input data. There’s one key
difference compared to a normal zlib compression loop: when the input
is exhausted, instead of <code class="language-plaintext highlighter-rouge">Z_FINISH</code> I use <code class="language-plaintext highlighter-rouge">Z_SYNC_FLUSH</code> to force
everything out. The problem with <code class="language-plaintext highlighter-rouge">Z_FINISH</code> is that it will write the
checksum, but we’re not ready for that.</p>

<p>With all the input processed, it’s time to go back to rewrite the
header. Rather than mess around with magic byte offsets, I start a
second, temporary zlib stream and do the <code class="language-plaintext highlighter-rouge">Z_FULL_FLUSH</code> like before,
but this time with the real header. In deciding the header size, I
reserved 6 characters for the width and 10 characters for the height.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sprintf</span><span class="p">(</span><span class="n">bufin</span><span class="p">,</span> <span class="s">"P6</span><span class="se">\n</span><span class="s">%-6lu</span><span class="se">\n</span><span class="s">%-10lu</span><span class="se">\n</span><span class="s">255</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="n">adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="n">adler</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufin</span><span class="p">,</span> <span class="n">HEADER_SIZE</span><span class="p">);</span>

<span class="n">z_stream</span> <span class="n">zh</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">next_in</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufin</span><span class="p">,</span>
    <span class="p">.</span><span class="n">avail_in</span> <span class="o">=</span> <span class="n">HEADER_SIZE</span><span class="p">,</span>
    <span class="p">.</span><span class="n">next_out</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufout</span><span class="p">,</span>
    <span class="p">.</span><span class="n">avail_out</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">),</span>
<span class="p">};</span>
<span class="n">deflateInit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">zh</span><span class="p">,</span> <span class="n">Z_NO_COMPRESSION</span><span class="p">);</span>
<span class="n">deflate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">zh</span><span class="p">,</span> <span class="n">Z_FULL_FLUSH</span><span class="p">);</span>
<span class="n">fseek</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">SEEK_SET</span><span class="p">);</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">bufout</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">)</span> <span class="o">-</span> <span class="n">zh</span><span class="p">.</span><span class="n">avail_out</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
<span class="n">fseek</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">SEEK_END</span><span class="p">);</span>
<span class="n">deflateEnd</span><span class="p">(</span><span class="o">&amp;</span><span class="n">zh</span><span class="p">);</span>
</code></pre></div></div>
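
<p>The trick only works if the placeholder and the real header are the same
length, which is what the left-justified, space-padded fields are for.
Here’s a small sketch of just the formatting step, using the same format
string as above. The helper name is hypothetical, and the value of
<code class="language-plaintext highlighter-rouge">HEADER_SIZE</code> isn’t shown in this article. By my count this format
string produces 25 bytes, provided the width fits in 6 digits and the
height in 10.</p>

```c
#include <stdio.h>

// Format a fixed-width P6 header into buf. The padding is ordinary
// whitespace between header fields, so a PPM reader skips it.
// Returns the number of bytes written, not counting the terminator.
int
sketch_ppm_header(char *buf, size_t cap,
                  unsigned long width, unsigned long height)
{
    return snprintf(buf, cap, "P6\n%-6lu\n%-10lu\n255\n", width, height);
}
```

<p>Any width and height within those digit limits gives a header of
identical size, so the seek distance computed from the placeholder is
still correct when the real header is written.</p>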

<p>The header is now complete, so I can go back to finish the original
compression stream. Again, I assume the output buffer is big enough
for these final bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">z</span><span class="p">.</span><span class="n">adler</span> <span class="o">=</span> <span class="n">adler32_combine</span><span class="p">(</span><span class="n">adler</span><span class="p">,</span> <span class="n">z</span><span class="p">.</span><span class="n">adler</span><span class="p">,</span> <span class="n">z</span><span class="p">.</span><span class="n">total_in</span> <span class="o">-</span> <span class="n">HEADER_SIZE</span><span class="p">);</span>
<span class="n">z</span><span class="p">.</span><span class="n">next_out</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufout</span><span class="p">;</span>
<span class="n">z</span><span class="p">.</span><span class="n">avail_out</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">);</span>
<span class="n">deflate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">,</span> <span class="n">Z_FINISH</span><span class="p">);</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">bufout</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">)</span> <span class="o">-</span> <span class="n">z</span><span class="p">.</span><span class="n">avail_out</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
<span class="n">deflateEnd</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s a lot more code than I expected, but it wasn’t too hard to work
out. If you want to get into the nitty gritty and <em>really</em> hack a zlib
stream, check out <a href="https://tools.ietf.org/html/rfc1950">RFC 1950</a> and <a href="https://tools.ietf.org/html/rfc1951">RFC 1951</a>.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>LZSS Quine Puzzle</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/11/22/"/>
    <id>urn:uuid:7ec019e5-7b35-3f41-3dff-fd51c0d752bb</id>
    <updated>2014-11-22T05:29:18Z</updated>
    <category term="compression"/>
    <content type="html">
      <![CDATA[<p>When I was a kid I spent some time playing a top-down, 2D,
puzzle/action, 1993, MS-DOS game called <a href="http://en.wikipedia.org/wiki/God_of_Thunder_(video_game)"><em>God of Thunder</em></a>. It
came on a shareware CD, now long lost, called <em>Games People Play</em>. A
couple of decades later I was reminded of the game and decided
<a href="http://www.adeptsoftware.com/got/">to dig it up</a> and play it again. It’s not quite as exciting
as I remember it — nostalgia really warps perception — but it’s
still an interesting game nonetheless.</p>

<p><img src="/img/screenshot/god-of-thunder.jpg" alt="" /></p>

<p>That got me thinking about how difficult it might be to modify (“mod”)
the game to add my own levels and puzzles. It’s a tiny game, so there
aren’t many assets to reverse engineer. Unpacked, the game just
<em>barely</em> fits on a 1.44 MB high density floppy disk. That was probably
one of the game’s primary design constraints. It also means it’s
almost certainly employing some sort of home-brew compression
algorithm in order to fit more content. I find these sorts of things
absolutely interesting and delightful.</p>

<p>You see, back in those old days, compression wasn’t really a “solved”
problem like it is today. They had to <a href="http://www.gamecrafters.com/gamecrafters/maddog/docs/history.html">design and implement their own
algorithms</a>, with varying degrees of success. Today if you
need compression for a project, you just grab <a href="http://www.zlib.net/">zlib</a>. Released
in 1995, it implements the most widely used compression algorithm
today, DEFLATE, with a tidy, in-memory API. zlib is well-tested,
thoroughly optimized, and sits in a nearly-perfect sweet spot between
compression ratio and performance. There’s even an <a href="https://code.google.com/p/miniz/">embeddable
version</a>. Since spinning platters are so slow compared to CPUs,
compression is likely to speed up an application simply because fewer
bytes need to go to and from the disk. Today it’s less about saving
storage space and more about reducing input/output demands.</p>

<p>Fortunately for me, someone has <a href="http://www.shikadi.net/moddingwiki/DAT_Format_%28God_of_Thunder%29">already reverse engineered</a>
most of the <em>God of Thunder</em> assets. It uses its own flavor of
<a href="http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Storer%E2%80%93Szymanski">Lempel-Ziv-Storer-Szymanski</a> (LZSS), which itself is derived
from LZ77, one of the algorithms used in DEFLATE. The original LZSS
paper focuses purely on the math, describing the algorithm in terms of
symbols with no concern for how it’s actually serialized into bits.
Those specific details were decided by the game’s developers, and
that’s what I’ll be describing below.</p>

<p>As an adult I’m finding the <em>God of Thunder</em> asset formats to be more
interesting than the game itself. It’s a better puzzle! I <a href="https://github.com/skeeto/binitools">really
enjoy</a> studying the file formats of various applications,
especially older ones that didn’t have modern standards to build on.
Usually lots of thought and engineering goes into the design of these
formats — and, too often, not enough thought goes into it. The
format’s specifics reveal insights into the internal workings of the
application, sometimes exposing unanticipated failure modes. Prying
apart odd, proprietary formats (i.e. “data reduction”) is probably my
favorite kind of work at my day job, and it comes up fairly often.</p>

<h3 id="god-of-thunder-lzss-definition">God of Thunder LZSS Definition</h3>

<p>An LZSS compression stream is made up of two kinds of chunks: literals
and back references. A literal chunk is passed through to the output
buffer unchanged. A reference chunk is a pair of numbers: a length and
an offset backwards into the output buffer. Only a single bit is
needed for each chunk to identify its type.</p>

<p>To avoid any sort of complicated and slow bit wrangling, the <em>God of
Thunder</em> developers (or whoever inspired them) came up with the smart
idea to stage 8 of these bits up at once as a single byte, a “control”
byte. Since literal chunks are 1 byte and reference chunks are 2
bytes, everything falls onto clean byte boundaries. Every group of 8
chunks is prefixed with one of these control bytes, and so every LZSS
compression stream begins with a control byte. The least significant
bit controls the 1st chunk in the group and the most significant bit
controls the 8th chunk. A 1 denotes a literal and a 0 denotes a
reference.</p>

<p>So, for example, a control byte of <code class="language-plaintext highlighter-rouge">0xff</code> means to pass through
unchanged the next 8 bytes of the compression stream. This would be
the least efficient compression scenario, because the “compressed”
stream is 112.5% (9/8) the size of the uncompressed stream. Gains come
entirely from the back references.</p>

<p>A back reference is two bytes little endian (this was in MS-DOS
running on x86), the lower 12 bits are the offset and the upper 4 bits
are the length, minus 2. That is, you read the 4 length bits and
add 2. This is because it doesn’t make any sense to reference a length
shorter than 2: a literal chunk would be shorter. The offset doesn’t
have anything added to it. This was a design mistake since an offset
of 0 doesn’t make any sense. It refers to a byte just outside the
output buffer. It should have been stored as the offset minus 1.</p>

<p><img src="/img/diagram/lzss-reference.png" alt="" /></p>
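
<p>Unpacking a back reference takes a couple of shifts and masks. Here’s a
minimal sketch that follows the layout described above. The struct and
function names are my own for illustration and aren’t from the game.</p>

```c
#include <stdint.h>

struct sketch_ref {
    int offset;  // distance back into the output, 0..4095 (0 is invalid)
    int length;  // number of bytes to copy, 2..17
};

// Decode the two bytes of a reference chunk, low byte first.
struct sketch_ref
sketch_decode_ref(uint8_t lo, uint8_t hi)
{
    unsigned v = lo | (unsigned)hi << 8;  // little endian
    struct sketch_ref r;
    r.offset = (int)(v & 0xfff);          // lower 12 bits
    r.length = (int)(v >> 12) + 2;        // upper 4 bits, plus 2
    return r;
}
```

<p>For example, the bytes <code class="language-plaintext highlighter-rouge">0xff 0xff</code> decode to offset 4095 and length
17, the largest values the format can express.</p>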

<p>A 12-bit offset means up to a 4kB sliding window of output may be
referenced at any time. A 4-bit length, plus two, means up to 17 bytes
may be copied in a single back reference. Compared to other
compression algorithms, this is rather short.</p>

<p>It’s important to note that the length is allowed to extend beyond the
output buffer (offset &lt; length). The bytes are, in effect, copied one
at a time into the output and may potentially be reused within the
same operation (like the opposite of <a href="http://man7.org/linux/man-pages/man3/memmove.3.html">memmove</a>). An offset of
1 and a length of 10 means “repeat the last output byte 10 times.”</p>

<p>That’s the entire format! It’s extremely simple but reasonably
effective for the game’s assets.</p>
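
<p>Putting the pieces together, a decoder fits in a couple dozen lines.
This is a sketch written from the description above, not the game’s
actual code, and the function name is hypothetical. It assumes the
decompressed size is known ahead of time and stops once that many bytes
have been produced.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Decode src into dst until dstlen bytes have been produced. Returns 1
// on success, or 0 on truncated input, an invalid offset, or a
// reference that would run past dstlen.
int
sketch_lzss_decode(uint8_t *dst, size_t dstlen,
                   const uint8_t *src, size_t srclen)
{
    size_t in = 0, out = 0;
    while (out < dstlen) {
        if (in >= srclen) return 0;
        unsigned control = src[in++];
        // Least significant bit controls the first chunk of the group.
        for (int bit = 0; bit < 8 && out < dstlen; bit++) {
            if (control >> bit & 1) {
                // Literal: pass one byte through unchanged.
                if (in >= srclen) return 0;
                dst[out++] = src[in++];
            } else {
                // Reference: 12-bit offset, 4-bit length minus 2.
                if (srclen - in < 2) return 0;
                unsigned v = src[in] | (unsigned)src[in + 1] << 8;
                in += 2;
                size_t offset = v & 0xfff;
                size_t length = (v >> 12) + 2;
                if (offset == 0 || offset > out) return 0;
                if (length > dstlen - out) return 0;
                // Copy one byte at a time so that the copy may overlap
                // the bytes it is producing (offset < length).
                for (size_t i = 0; i < length; i++, out++) {
                    dst[out] = dst[out - offset];
                }
            }
        }
    }
    return 1;
}
```

<p>As a quick check, the four bytes <code class="language-plaintext highlighter-rouge">0x01 'A' 0x01 0x80</code> are one literal
followed by a reference with offset 1 and length 10, so they decode to
eleven copies of “A”.</p>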

<h4 id="worst-case-and-best-case">Worst Case and Best Case</h4>

<p>In the worst case, such as compressing random data, the compression
stream will be at most 112.5% (9/8) the size of the uncompressed
stream.</p>

<p>In the best case, such as a long string of zeros, the compressed
stream will be, at minimum, 12.5% (1/8) the size of the decompressed
stream. Think about it this way: imagine every chunk is a reference of
maximum length. That’s 1 control byte plus 16 (<code class="language-plaintext highlighter-rouge">8 * 2</code>) reference
bytes, for a total of 17 compressed bytes. This emits <code class="language-plaintext highlighter-rouge">17 * 8</code>
decompressed bytes, 17 being the maximum length of each of the 8 chunks.
Conveniently those two 17s cancel, leaving a factor of 8 for the best
case.</p>
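
<p>Both bounds are handy when sizing buffers, so here they are as code. The
function names are hypothetical. The best-case figure is only a lower
bound, since a real stream has to start with at least one literal and
can’t quite reach it.</p>

```c
#include <stddef.h>

// Upper bound on the compressed size of n bytes: every byte stored as
// a literal, plus one control byte per group of 8 chunks. Ignores
// overflow for n near SIZE_MAX.
size_t
sketch_lzss_worst_case(size_t n)
{
    return n + (n + 7) / 8;
}

// Lower bound on the compressed size of n bytes: every chunk a
// maximum-length reference, 17 compressed bytes per 136 output bytes.
size_t
sketch_lzss_best_case(size_t n)
{
    return (n + 7) / 8;
}
```

<p>For 8 input bytes the worst case is 9 bytes, and for 136 input bytes the
best case is 17 bytes, matching the 9/8 and 1/8 ratios above.</p>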

<h4 id="lzss-end-of-stream">LZSS End of Stream</h4>

<p>If you’re paying <em>really</em> close attention, you may have noticed that
by grouping 8 control bits at a time, the length of the input stream
is, strictly speaking, constrained to certain lengths. What if, during
compression, the input stream comes up short of exactly those 8
chunks? As is, there’s no way to communicate a premature end to the
stream. There are three ways around this using a small amount of
metadata, each differing in robustness.</p>

<ol>
  <li>
    <p>Keep track of the size of the decompressed data. When that many
bytes have been emitted, halt. This is how <em>God of Thunder</em> handles
it. A small validation check could be performed here. The output
stream should always end <em>between</em> chunks, not in the middle of a
chunk (i.e. in the middle of copying a back reference). Some of the
bits in the control byte may contain arbitrary data that doesn’t
affect the output, which is a concern when hashing compressed data.
My suggestion: require the unused control bits to be 0, which
allows for an additional validation check. The output stream should
never end just short of a literal chunk.</p>
  </li>
  <li>
    <p>Keep track of the size of the compressed data. Halt when no more
chunks are encountered. A similar, weaker validation check can be
performed here: the input stream should never stop between two
bytes of a reference. It’s weaker because it’s less sensitive to
corruption, making it harder to detect. The same unused control bit
padding situation applies here.</p>
  </li>
  <li>
    <p>Use an out-of-band end marker (EOF). This is very similar to
keeping track of the input size (the filesystem is doing it), but
has the weakest validation of all. The stream could be accidentally
truncated at any point between chunks, which is undetectable. This
makes it the least sensitive to corruption.</p>
  </li>
</ol>
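
<p>The padding check suggested in the first option costs a single shift.
Here’s a sketch, assuming the decoder knows how many chunks of the final
group it consumed before the output was complete. The function name is
hypothetical.</p>

```c
#include <stdint.h>

// control: the final control byte of the stream.
// used: number of chunks consumed from that final group, 1..8.
// Returns 1 if every unused control bit is 0, otherwise 0.
int
sketch_padding_ok(uint8_t control, int used)
{
    return (control >> used) == 0;
}
```

<p>With this rule an unused bit can never be 1, so a stream that appears
to end just short of a literal chunk gets rejected.</p>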

<h3 id="an-lzss-quine">An LZSS Quine</h3>

<p>After spending some time playing around with this format, I thought
about what it would take to make an LZSS quine. That is, <strong>find an
LZSS compressed stream that decompresses to itself.</strong> It’s been done
for DEFLATE, which I imagine is a much harder problem. There are <a href="http://steike.com/code/useless/zip-file-quine/">zip
files containing exact copies of themselves</a>, recursively. I’m
pretty confident it’s never been done for this exact compression
format, simply because it’s so specific to this old MS-DOS game.</p>

<p>I haven’t figured it out yet, so you won’t find the solution here.
This, dear readers, is my challenge to you! <strong>Using the format
described above, craft an LZSS quine.</strong> LZSS doesn’t have no-op chunks
(i.e. length = 0), which makes this harder than it would otherwise be.
It may not even be possible, in which case your challenge is to
prove it!</p>

<p>So far I’ve determined that it begins with at least 4kB of <code class="language-plaintext highlighter-rouge">0xff</code>. Why
is this? First, as I mentioned, all compression streams begin with a
control byte. Second, no references can be made until at least one
literal byte has been passed, so the first bit (LSB) of the first byte
is a 1, and the second byte is exactly the same as the first byte. So
the first two bytes are xxxxxxx1, with the x being “don’t care (yet).”</p>

<p>If the next chunk is a back reference, those first two bytes become
xxxxxx01. It could only reference that one byte (so offset = 1), and
the length would need to be at least two, ensuring at least the first
three bytes of output all have that same pattern. However, on the most
significant byte of the reference chunk, this conflicts with having an
offset of 1 because the 9th bit of the offset is set to 1, forcing the
offset to an invalid 257 bytes. Therefore, the second chunk must be a
literal.</p>
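
<p>That argument is easy to check by brute force. In a quine, both bytes of
this hypothetical reference chunk would have to equal the repeated first
byte, so the sketch below tries every byte value ending in bits 01 and
reports the smallest offset that results. The function name is my own.</p>

```c
#include <stdint.h>

// Smallest offset obtainable when both bytes of a reference chunk are
// the same value c, and c has the bit pattern xxxxxx01.
int
sketch_min_second_chunk_offset(void)
{
    int min = 4096;
    for (int c = 0; c < 256; c++) {
        if ((c & 3) != 1) continue;         // need low bits 01
        int offset = (c | c << 8) & 0xfff;  // lower 12 bits of (c, c)
        if (offset < min) min = offset;
    }
    return min;
}
```

<p>It returns 257, so no such byte can encode the required offset of 1.</p>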

<p>This pattern continues until the first eight chunks are all literals,
which means the quine begins with at least 9 <code class="language-plaintext highlighter-rouge">0xff</code> bytes. Going on,
this also means the first back reference is going to be <code class="language-plaintext highlighter-rouge">0xffff</code>
(offset = 4095, length = 17), so the sliding window needs to be filled
enough to make that offset valid. References would then be used to
“catch up” with the compression stream, then some magic is needed to
finish off the stream.</p>

<p>That’s where I’m stuck.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Lossless Optimizers</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2009/08/23/"/>
    <id>urn:uuid:644ecd27-0f43-358b-e405-9553f79ae7e1</id>
    <updated>2009-08-23T00:00:00Z</updated>
    <category term="compression"/>
    <content type="html">
      <![CDATA[<!-- 23 August 2009 -->
<p>
I've been using lossless optimizers for a while now for PNGs, but more
recently I have found some for other formats. Here are the ones I know
about. These are all intended to be lossless, so running them should
result in no information loss (well, except the SVG one).
</p>
<p>
For <b>PNG</b>, there are a number of choices, but my favorite is <a
href="http://optipng.sourceforge.net/">OptiPNG</a>. It adjusts the PNG
parameters and recompresses to find the optimal parameters. I run it
on almost all my images around here, and I tend to get around 10% to
30% reduction for images fresh off <a href="http://www.gimp.org/">
Gimp</a>, <a href="http://kolourpaint.sourceforge.net/">
Kolourpaint</a>, and <a href="http://imagemagick.org/">
ImageMagick</a>.
</p>
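<p>
The "recompress and keep the best" idea can be sketched with Python's
zlib module, since PNG image data is a deflate stream. This is only an
illustration of a brute-force parameter search, not OptiPNG's actual
algorithm (which also varies PNG-specific settings such as filters).
The function name and the sample data are made up for the example.
</p>

```python
import zlib

def smallest_deflate(data):
    """Try several zlib settings and return (size, level, strategy, blob)."""
    strategies = {
        "default": zlib.Z_DEFAULT_STRATEGY,
        "filtered": zlib.Z_FILTERED,
        "huffman_only": zlib.Z_HUFFMAN_ONLY,
    }
    best = None
    for level in range(1, 10):
        for name, strategy in strategies.items():
            comp = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS,
                                    zlib.DEF_MEM_LEVEL, strategy)
            blob = comp.compress(data) + comp.flush()
            # Keep whichever combination produced the fewest bytes so far.
            if best is None or len(blob) < best[0]:
                best = (len(blob), level, name, blob)
    return best

# Hypothetical sample standing in for raw image data.
sample = bytes(range(256)) * 64 + b"\x00" * 4096
size, level, strategy, blob = smallest_deflate(sample)
assert zlib.decompress(blob) == sample  # lossless: input recovered exactly
print(f"best: {size} bytes at level {level}, strategy {strategy}")
```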
<p>
For <b>JPEG</b>, I use <a
href="http://freshmeat.net/projects/jpegoptim/"> jpegoptim</a>. It
works by optimizing the Huffman tables (the lossless part of JPEG
compression). I only found this one recently, but I will be using it
all the time, such as on our thousands of new wedding reception photos.
</p>
<p>
For <b>PDF</b>, I found something called <a
href="http://qpdf.sourceforge.net/"> QPDF</a>. It's designed more for
other PDF transformations, but without any other parameters it will
simply losslessly optimize a PDF. From what I've seen so far it cuts
PDFs down by about a third.
</p>
<p>
For <b>SVG</b>, <a href="http://codedread.com/scour/">Scour</a> is a
young project, only a few months old. I've been looking for an SVG
optimizer for some time, so this was exciting to find. Due to the type
of file it's working with, it's not quite entirely lossless. Visually,
it is lossless, but it will toss all metadata (comments, etc.), which
may be important. If you hand-crafted your SVG, you won't want to use
this tool. It's good for removing <a href="http://inkscape.org/">
Inkscape</a> and Illustrator cruft, though.
</p>
<p>
I have yet to find a good (Free) GIF optimizer. Animated GIFs, with
lots of redundancy between frames, have a lot of potential for
optimization too. A video optimizer (for, say, Theora) would be
interesting; I imagine it might work similarly to jpegoptim. Audio
files (like Vorbis, FLAC, or MP3) probably don't have any room for
optimization. I could be wrong. For XHTML there is <a
href="http://www.w3.org/People/Raggett/tidy/">tidy</a> if you want to
count that. All the other XML formats (ODF, RSS, etc.) could have
their own too. Or optimizers for archives, like zip and tar. For tar
it might rearrange things to better suit gzip, bzip2, or
lzma. Executable optimizers? Postscript optimizers? It goes on and on.
</p>
<p>
If you know about any more, especially for other file formats, let me
know.
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>Avoid Zip Archives</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2009/03/22/"/>
    <id>urn:uuid:d8e0047c-9ec9-3553-f6ac-5e8528aa82ca</id>
    <updated>2009-03-22T00:00:00Z</updated>
    <category term="rant"/><category term="compression"/><category term="crypto"/>
    <content type="html">
      <![CDATA[<!-- 22 March 2009 -->
<p>
<img src="/img/misc/onion.jpg" class="right"
     title="Onion on Lettuce by swatjester, cc-by-sa 2.0"/>

In a <a href="/blog/2009/03/16"> previous post</a> about the LZMA
compression algorithm, I made a negative comment about zip archives
and moved on. I would like to go into more detail about it now.
</p>
<p>
A zip archive serves three functions all-in-one: compression, archive,
and encryption. On a unix-like system, these functions would normally
be provided by three separate tools, like tar, gzip/bzip2, and GnuPG. The
unix philosophy says to "write programs that do one thing and do it
well".
</p>
<p>
So in the case of zip archives, we are doing three things poorly when,
instead, we should be using three separate tools that each do one
thing well.
</p>
<p>
When we use three different tools, our encrypted archive is a lot like
an onion. On the outside we have encryption. After we peel that off by
decrypting it, we have compression, and after removing that layer,
finally the archive. This is reflected in the filename:
<code>.tar.gz.gpg</code>. As a side note, if GPG didn't already
support it, we could add base-64 encoding if needed as another layer
on the onion: <code>.tar.gz.gpg.b64</code>.
</p>
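<p>
The layering can be sketched with Python's standard library. This
example builds <code>.tar.gz.b64</code> and peels it apart again; the
encryption layer is left out because the standard library has no GnuPG
binding, and the function names and file contents are made up for
illustration.
</p>

```python
import base64
import gzip
import io
import tarfile

def wrap(files):
    """Build the layers from the inside out: archive, compress, encode."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:    # .tar
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    compressed = gzip.compress(buf.getvalue())          # .tar.gz
    # An encryption layer (.gpg) would go here; it is omitted in this sketch.
    return base64.b64encode(compressed)                 # .tar.gz.b64

def unwrap(blob):
    """Peel the layers off in the opposite order."""
    archive = gzip.decompress(base64.b64decode(blob))
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}

files = {"hello.txt": b"hello\n", "world.txt": b"world\n"}
assert unwrap(wrap(files)) == files
```

<p>
Each layer only sees bytes from the layer below it, which is why any
one of them can be swapped without touching the others.
</p>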
<p>
By using separate tools, we can also swap different tools in and out
without breaking any spec. Previously I mentioned using LZMA, which
could be used in place of gzip or bzip2. Instead of
<code>.tar.gz.gpg</code> you can have <code>.tar.lzma.gpg</code>. Or
you can swap out GPG for encryption and use, say, <a
href="http://ciphersaber.gurus.org/">CipherSaber</a> as
<code>.tar.lzma.cs2</code>. If we use a single one-size-fits-all
format, we are limited by the spec.
</p>
<h4>Compression</h4>
<p>
Both zip and gzip basically use the same compression algorithm. The
zip spec actually allows for a variety of other compression
algorithms, but you cannot rely on other tools to support them.
</p>
<p>
Zip archives are also inside out. Instead of <a
href="http://en.wikipedia.org/wiki/Solid_archive"> solid
compression</a>, which is what happens in tarballs, each file is
compressed individually. Redundancy between different files cannot be
exploited. The equivalent would be an inside out tarball:
<code>.gz.tar</code>. This would be produced by first individually
gzipping each file in a directory tree, then archiving them with
tar. This results in larger archive sizes.
</p>
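<p>
A rough way to see the difference is to compress a set of similar
files both ways with Python's zlib module. The file contents here are
invented, and real archives add per-file headers that this sketch
ignores, so treat the numbers as illustrative only.
</p>

```python
import zlib

# Hypothetical files that share a lot of content, as source trees often do.
header = b"/* common license header and boilerplate */\n" * 40
files = [header + f"int value_{i} = {i};\n".encode() for i in range(50)]

# Zip-style ("inside out"): each file is compressed on its own.
per_file = sum(len(zlib.compress(f)) for f in files)

# Tarball-style (solid): the files are concatenated, then compressed once.
solid = len(zlib.compress(b"".join(files)))

print(f"per-file: {per_file} bytes, solid: {solid} bytes")
```

<p>
The per-file total is larger because every file has to encode the
shared header again from scratch.
</p>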
<p>
However, there is an advantage to inside out archives: random
access. We can access a file in the middle of the archive without
having to take the whole thing apart. In general use, this sort of
thing isn't really needed, and solid compression would be more useful.
</p>
<h4>Archive</h4>
<p>
In a zip archive, timestamp resolution is limited to 2 seconds, which
is based on the old FAT filesystem time resolution. If your system
supports finer timestamps, you will lose information. But really, this
isn't a big deal.
</p>
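<p>
The 2 second limit falls out of the 16-bit MS-DOS time field, which
has only 5 bits for seconds. A small Python sketch of the packing (the
helper names are made up for this example):
</p>

```python
def pack_dos_time(hour, minute, second):
    """Pack a time of day into the 16-bit MS-DOS format used by zip."""
    # 5 bits of hours, 6 bits of minutes, 5 bits of seconds divided by two.
    return (hour << 11) | (minute << 5) | (second // 2)

def unpack_dos_time(value):
    """Recover (hour, minute, second) from a packed 16-bit value."""
    return (value >> 11, (value >> 5) & 0x3F, (value & 0x1F) * 2)

# An odd second cannot be represented, so it comes back rounded down.
print(unpack_dos_time(pack_dos_time(12, 34, 57)))  # (12, 34, 56)
```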
<p>
It also does not store file ownership information, but this is also
not a big deal. It may even be desirable as a privacy measure.
</p>
<p>
Actually, the archive part of zip seems to be pretty reasonable, and
better than I thought it was. There don't seem to be any really
annoying problems with it.
</p>
<p>
Tar still has advantages over zip. Zip doesn't quite allow the same
range of filenames as unix-like systems do, but it does allow
characters like * and ?. What happens when you extract files with
names containing these characters on an inferior operating system that
forbids them will depend on the tool.
</p>
<h4>Encryption</h4>
<p>
Encryption is where zip has been awful in the past. The original
spec's encryption algorithm had serious flaws and no one should even
consider using it today.
</p>
<p>
Since then, AES encryption has been worked into the standard and
implemented differently by different tools. Unless the same zip tool
is used on each end, you can't be sure AES encryption will work.
</p>
<p>
By placing encryption as part of the file spec, each tool has to
implement its own encryption, probably leaving out considerations like
using secure memory. These tools are concentrating on archiving and
compression, and so encryption will likely not be given a solid
effort.
</p>
<p>
In the implementations I know of, the archive index isn't encrypted,
so someone could open it up and see lots of file metadata, including
filenames.
</p>
<p>
When you encrypt a tarball with GnuPG, you have all the flexibility of
PGP available. Asymmetric encryption, web of trust, multiple strong
encryption algorithms, digital signatures, strong key management,
etc. It would be unreasonable for an archive format to have this kind
of thing built in.
</p>
<h4>Conclusion</h4>
<p>
You are almost always better off using a tarball rather than a zip
archive. Unfortunately the receiver of an archive will often be unable
to open anything else, so you may have no choice.
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>LZMA Tarballs Are Coming</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2009/03/16/"/>
    <id>urn:uuid:8abde298-3c30-3a06-a978-679230b0b3dc</id>
    <updated>2009-03-16T00:00:00Z</updated>
    <category term="compression"/>
    <content type="html">
      <![CDATA[<!-- 16 March 2009 -->
<p>
<img src="/img/diagram/compress.png" class="right"/>

Any developer that uses a non-toy operating system will be familiar
with <a href="http://www.gzip.org/">gzip</a> and <a
href="http://www.bzip.org/">bzip2</a> tarballs (.tar.gz, .tgz, and
.tar.bz2). Most places will provide both versions so that the user can
use his preferred decompresser.
</p>
<p>
Both types are useful because they make tradeoffs at different points:
gzip is very fast with low memory requirements and bzip2 has much
better compression ratios at the cost of more memory and CPU
time. Users of older hardware will prefer gzip, because the benefits
of bzip2 are negated by the long decompression times, around 6 times
longer. This is why <a
href="http://www.openbsd.org/faq/faq1.html#HowAbout"> OpenBSD prefers
gzip</a>.
</p>
<p>
But there is a new compression algorithm in town. Well, it has been
around for about 10 years now, but, if I understand correctly, was
patent encumbered (aka useless) for a while. It is called the <a
href="http://en.wikipedia.org/wiki/LZMA">Lempel-Ziv-Markov chain
algorithm</a> (LZMA). It is still maturing and different software that
uses LZMA still can't quite talk to each other. <a
href="http://www.7-zip.org/">7-zip</a> and <a
href="http://tukaani.org/lzma/">LZMA Utils</a> are a couple of examples.
</p>
<p>
GNU tar <a href="http://www.gnu.org/software/tar/">added an
<code>--lzma</code> option</a> just last April, and finally gave it a
short option, <code>-J</code>, this past December. I take this as a
sign that LZMA tarballs (.tar.lzma) are going to become common over
the next several years. It also would seem that the GNU project has
officially blessed LZMA.
</p>
<p>
And not only that, I think LZMA tarballs will supplant bzip2
tarballs. The reason is that it is even more asymmetric than bzip2.
</p>
<p>
According to the LZMA Utils page, LZMA compression ratios are 15%
better than those of bzip2, but at the cost of being 4 to 12 times
slower on compression. In many applications, including tarball
distribution, this is completely acceptable because <i>decompression
is faster than bzip2</i>! There is an extreme asymmetry here that can
readily be exploited.
</p>
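<p>
Anyone who wants to check the asymmetry on their own data can do so
with a short Python script using the standard zlib, bz2, and lzma
modules. This is a measurement sketch, not a benchmark: the sample
input is synthetic, the <code>measure</code> helper is made up for the
example, and the timings will vary by machine and input.
<code>FORMAT_ALONE</code> selects the legacy .lzma container.
</p>

```python
import bz2
import lzma
import time
import zlib

def measure(name, compress, decompress, data):
    """Return (name, compressed size, compress seconds, decompress seconds)."""
    start = time.perf_counter()
    packed = compress(data)
    mid = time.perf_counter()
    unpacked = decompress(packed)
    end = time.perf_counter()
    assert unpacked == data  # every codec here is lossless
    return name, len(packed), mid - start, end - mid

def lzma_alone(data):
    """Compress into the legacy .lzma container rather than .xz."""
    return lzma.compress(data, format=lzma.FORMAT_ALONE)

# Hypothetical sample input; substitute a real tar archive for useful numbers.
data = b"".join(f"line {i}: the quick brown fox\n".encode()
                for i in range(50_000))

results = [
    measure("gzip (zlib)", zlib.compress, zlib.decompress, data),
    measure("bzip2", bz2.compress, bz2.decompress, data),
    measure("lzma", lzma_alone, lzma.decompress, data),
]
for name, size, c_time, d_time in results:
    print(f"{name:12} {size:>9} bytes  "
          f"compress {c_time:.3f}s  decompress {d_time:.3f}s")
```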
<p>
So, when a developer has a new release he tells his version control
system, or maybe his build system, to make a tar archive and compress
it with LZMA. If he has a computer from this millennium, it won't take
a lifetime to do, but it will still take some time. Since it could
take as much as two orders of magnitude longer to make than a gzip
tarball, he could make a gzip tarball first and put it up for
distribution. When the LZMA tarball is done, it will be about 30%
smaller and decompress almost as fast as the gzip tarball (but while
using a large amount of memory).
</p>
<p>
At this point, why would someone download a bzip2 archive? It's bigger
<i>and</i> slower. Right now possible reasons may be a lack of an LZMA
decompresser and/or lack of familiarity. Over time, these will both be
remedied.
</p>
<p>
Don't get me wrong. I don't hate bzip2. It is a very interesting
algorithm. In fact, I was breathless when I first understood the <a
href="http://en.wikipedia.org/wiki/Burrows-Wheeler_transform">
Burrows-Wheeler transform</a>, which bzip2 uses at one stage. I would
argue that bzip2 is more elegant than gzip and LZMA because it is less
arbitrary. But I do think it will become obsolete.
</p>
<p>
Unfortunately, the confused zip archive is here to stay for now
because it is the only compression tool that a certain popular, but
inferior, operating system ships with. I say "confused" because it
makes the mistake of combining three tools into one: archive,
compression, and encryption. As a result, instead of doing one thing
well it does three things poorly. Cell phone designers also make the
same mistake. Fortunately I don't have to touch zip archives often.
</p>
<p>
Finally, don't forget that LZMA is mostly useful where the asymmetry
can be exploited: data is compressed once and decompressed many
times. Take the gitweb interface, which provides access to a git
repository through a browser. It will provide a gzip tarball of any
commit on the fly. It doesn't do this by having all these tarballs
lying around, but creates them on demand. Data is compressed once and
decompressed once. Because of this, gzip is, and will remain, the best
option for this setting.
</p>
<p>
In conclusion, consider creating LZMA tarballs next time, and don't be
afraid to use them when you come across them.
</p>
]]>
    </content>
  </entry>

</feed>
