null program

An Entropy-Efficient Virtual Dice Rolling Algorithm

This is related to a project I am working on and will post here soon. I imagine that, with a little more effort, this algorithm could turn into a short amateur paper.

Suppose you want to use a computer to simulate the roll of two six-sided dice (notated 2d6). The simplest approach would be to replicate the results the same way you would roll dice: independently and randomly generate two numbers between 1 and 6 inclusively. We easily can do this for any number of dice, we just iterate and roll each die. Like this recursive function,

(defun roll (n sides)
  "Roll n dice with given number of sides."
  (if (zerop n) ()
    (cons (1+ (random sides))
          (roll (1- n) sides))))

However, generating a number between 1 and 6 wastes small amounts of entropy. A six-sided die only takes about 2.58 bits of entropy to generate. Since we can only use bits discretely we have to spend 3 bits, throwing out 0.42 bits. On top of that, when we pull out 3 bits and they are out of range (0 or 7) we have to throw them out and try again.

Let's say we wanted to roll 10 dice, or 100 dice, or 1000 dice? Do we really need to generate that many numbers individually? That's a lot of wasted entropy adding up, entropy which can be expensive to gather. Well, we could instead use the probability distribution of the roll so that only a single number needs to be generated.

For a 2d6 roll, there are 36 unique possible outcomes (6^2). We could select a number between 0 and 35, then choose that specific roll. This roll can be calculated with a series of division and modulus operations (u for a number from a uniform distribution) (also, note that the division is integer division),

(defun roll-perm (n s u)
  "Get uth permutation of n s-sided dice."
  (if (zerop n) ()
    (let ((perms (expt s (1- n))))
      (cons (1+ (/ u perms))
            (roll-perm (1- n) s (mod u perms))))))

If we're only interested in the sum, we could save memory by making this tail recursive -- or iterative -- and summing the dice as we calculate them. Ignoring the exponent, this is O(n), not better than the simple algorithm in terms of growth rate. This algorithm is more efficient when it comes to entropy, though.

Consider 3d6, with 216 possible outcomes, ideally with the simple algorithm takes 3 3-bit rolls, consuming 9 bits. About 1.25 bits was not actually used (0.42 * 3). In the entropy-efficient algorithm we need about 7.75 bits, so it only consumes 8 bits of entropy. We saved a bit. That gap only gets larger with more dice. For 100d6 the simple algorithm uses 41 more bits than necessary.

The efficient roll is basically defragmenting the individual rolls on the entropy stream.

In non-ideal world, though, some cases don't work out well. In 12d6, almost half the numbers (compared to 25% in the case of 1d6) from the uniform distribution will be out of range and a lot more bits would be needed. On average, rolling dice individually (or only some of them individually) for 12d6 will be more efficient.

The efficient algorithm is only more efficient above a point near where mod(log2(s), 1) < mod(log2(n^s), 1).

And, all of this doesn't come without a cost. You must pay the piper, and this algorithm is paid with CPU and memory. Notice that exponent there? That has to be done to exact precision (no floating point), and it grows very quickly. If you want to roll more than a handful of dice, you will be crunching some large numbers. Rolling just 100d6 means you have to work with a 78 digit integer. 10000d6 is a 7782 digit integer. These can't be done in floating point because the resolution of floating point is too low: some rolls would not be possible.

The exponent could be memoized to trade some of that CPU time for more memory usage. Still, pretty costly. If you don't value your entropy, the tradeoff might not be worth it.

I can't see a way around performing that calculation. We need to know that big number exactly. Perhaps a mathematician might be able to manipulate the formulas such that it's not so expensive.

If you're rolling lots of dice and you want to preserve binary entropy, try it out. If you want to be really efficient queue up rolls -- or generate them ahead of time -- so that the number of outcomes is just below a power of two. In the case of d6, some good number of dice to roll are 17 (~43.94 bits), 29 (~74.96), 41 (~105.983), 94 (~242.986 bits), 200 (~516.993), 253 (~653.995 bits), 306 (~791.99853 bits), and 971 (~2509.99859 bits). (Notice these get closer and closer to an integer number of bits.)


Lossless Optimizers

I've been using lossless optimizers for awhile now for PNGs, but more recently I have found some for other formats. Here's the ones I know about. These are all intended to be lossless, so running them should result in no information loss (well, except the SVG one).

For PNG, there are a number of choices, but my favorite is OptiPNG. It adjust the PNG parameters and recompresses to find the optimal parameters. I run it on almost all my images around here, and I tend to get around 10% to 30% reduction for images fresh off Gimp, Kolourpaint, and ImageMagick.

For JPEG, I use jpegoptim. It works by optimizing the Huffman tables (the lossless part of JPEG compression). I only found this one recently, but I will be using it all the time, like on our new thousands of wedding reception photos.

For PDF, I found something called QPDF. It's designed more for other PDF transformations, but without any other parameters it will simply losslessly optimize a PDF. From what I've seen so far it cuts PDFs down by about a third.

For SVG, Scour is a young project, only a few months old. I've been looking for an SVG optimizer for some time, so this was exciting to find. Due to the type of file it's working with, it's not quite entirely lossless. Visually, it is lossless, but it will toss all metadata (comments, etc.), which may be important. If you hand-crafted your SVG, you won't want to use this tool. It's good for removing Inkscape and Illustrator cruft, though.

I have yet to find a good (Free) GIF optimizer. Animated GIFs, with lots of redundancy between frames, have a lot of potential for optimization too. A video optimizer (for, say, Theora) would be interesting; I imagine it might work similarly to jpegoptim. Audio files (like Vorbis, FLAC, or MP3) probably don't have any room for optimization. I could be wrong. For XHTML there is tidy if you want to count that. All the other XML formats (ODF, RSS, etc.) could have their own too. Or optimizers for archives, like zip and tar. For tar it might rearrange things to better suit gzip, bzip2, or lzma. Executable optimizers? Postscript optimizers? It goes on and on.

If you know about any more, especially for other file formats, let me know.


IRC Random Number Generator

A year ago while I was reading Applied Cryptography I got an idea for using IRC as a random number generator. The book mentioned using a rolling car wheel to generate random numbers, by measuring its period. So why not IRC? Grab the code,

git clone http://git.nullprogram.com/ircrng.git

It's based entirely on event timing: it logs in and sits on several channels, then measures the time between events. When a new event occurs it compares the time passed between this event and the last event, and the time between the two events before it. If it's greater, emit a 1, if less emit a 0. Simple.

For skew removal I used transition mappings (rfc1750), invented by von Neumann. It looks at pairs of bits. If they differ, pass the first bit, otherwise toss both bits. So if it comes across "11" or "00", it tosses them. If it comes across "10" it emits "1", and "01" it emits 0.

Here's an analysis by ent of 476 output bytes I generated overnight: irc.random.bytes.

Entropy = 7.515132 bits per byte.

Optimum compression would reduce the size
of this 476 byte file by 6 percent.

Chi square distribution for 476 samples is 274.79, and randomly
would exceed this value 25.00 percent of the times.

Arithmetic mean value of data bytes is 127.7248 (127.5 = random).
Monte Carlo value for Pi is 2.886075949 (error 8.13 percent).
Serial correlation coefficient is -0.062745 (totally uncorrelated = 0.0).

Eh, not awful, but not too great. It got a good score on the Chi-square test, which I attribute to the skew filter. This generator is also extremely slow, generating only a few bytes per hour. At best, each event only generates a half of a bit, after skew correction. It would probably be more profitable to take the hash of the actual event with a timestamp and use that as the RNG.

Also, someone else running this generator on the same channels would generate very similar numbers. Worse, someone eavesdropping on your network connection knows mostly what numbers you generated. Worse yet, someone actively modifying your connection could control your generator and force whatever output is desired.

So overall, I would say it's not very useful.


Flying Spaghetti Monster in the HTML 5 Spec

I was reading through the HTML 5 specification earlier tonight and I was delighted to come across an obvious reference to the Flying Spaghetti Monster, complete with notes about His "noodly appendages". I'm guessing that someone silently slipped it in there and the people that might be offended won't get the reference anyway.

It's in an example for the img tag,

In the following example, a picture representing the flying spaghetti monster emblem, with each of the left noodly appendages and the right noodly appendages in different images, so that the user can pick the left side or the right side in an adventure.

For those who are unaware, it is the deity in a parody religion Pastafarianism. It's often used to demonstrate how silly religion is.


Web Pages Are Liquids

Web pages aren't a static medium, like books, brochures, or pamphlets. The web is not print. Accordingly, the layout of web pages should not be locked to some static width, but instead flow to fill the width of the browser like a liquid. Web pages should normally have a liquid layout.

One of the most obvious problems with the fixed layout occurs when the browser window is stretched wider than the designer had intended. The page looks like this,

Image showing wasted space.

I, as a user, have little control my viewing of the website. I'm stuck reading through a keyhole. It gets much worse if the browser isn't as wide as the designer intended: a horizontal scrollbar appears and navigation becomes very difficult. My laptop runs at a resolution of 1024x768, and I frequently come across pages where this is an issue. And according to Jakob Nielsen, in 2006 77% of user's screens were 1024 pixels wide or less.

See the liquid for yourself right here: adjust the width of your browser and watch this text flow to fill the screen. You can also bring it in pretty far before you clip an image and the horizontal scrollbar appears. The exact width depends only on the widest image being displayed. This also comes into play if you adjust the font size.

Using a liquid layout allows the page to work well with a wide variety of screen widths, and most importantly, gives users lots of control over how they view the site. It's very unfortunate that (in my experience) most websites employ a poor, fixed layout. Even web design "expert" websites will ironically hand out web design tips from within these annoying confines. One of the biggest culprits driving this is Wordpress, which has this flawed layout by default.

The very worst offenders tend to be websites with little actual content, like corporate websites or "artist" portfolios. The less usable the page, the less I wanted to be there anyway.

So please drop the fancy, low-usability web designs for something with much better usability. Your users will probably appreciate it.


Ad-blocking and the Regrettable URL Format

I use Adblock Plus to block advertisements and, more importantly, invisible privacy-breaking trackers (most people aren't even aware of these). I think ad-blocking is actually easier than ever, because ads are served from a relatively small number of domains, rather than from the websites themselves. Instead of patterns matching parts of a path, I can just block domains.

Adblock Plus emphasizes this by providing, by default, a pattern matching the server root. Example,

http://ads.example.com/*

But sometimes advertising websites are trickier, and their sub-domain is a fairly unique string,

http://ldp38fm.example.com/*

That pattern isn't very useful. I want something more like,

http://*.example.com/*

Unfortunately Adblock Plus doesn't provide this pattern automatically yet, so I have to do it manually. I think this pattern is less obvious because the URL format is actually broken. Notice have have two matching globs (*) rather than just one, even though I am simply blocking everything under a certain level.

Tim Berners-Lee regrets the format of the URL, and I agree with him. This is what URLs like http://ads.example.com/spy-tracker.js should look like,

http://com/example/ads/spy-tracker.js

It's a single coherent hierarchy with each level in order. This makes so much more sense! If I wanted to block example.com and all it's sub-domains, the pattern is much simpler and less error prone,

http://com/example/*

To anyone who ever reinvents the web: please get it right next time.

Update: There is significant further discussion in the comments.


Don't stop here! This isn't everything. Check out the archives (on the left) for more posts. Or just have a look at the index.