null program

Magick Thumbnails

For a long time I couldn't figure out how to make decent thumbnails with ImageMagick. Specifically, I wanted to create uniform sized thumbnails from arbitrary images. Over the weekend I came across the ImageMagick Examples page, which shows exactly how to do this. Here's the command for a 150x150 thumbnail,

convert orig.jpg -thumbnail 150x150^ -gravity center \
        -extent 150x150 thumb.jpg

It cuts out the largest square possible from the center of the image and resizes that to 150x150. This capability has actually only been available for 2 years now! It wasn't there last time I needed it.

I can think of one way to improve it: instead of selecting the center, it selects the area with the highest information density. This could be measured by edge detection, corner detection, or some other statistical method. It would be selected by changing the gravity option to, say, "entropy".

I'm listing this here mostly for my own future reference. :-)


Game of Life in Java

Sources:

git clone http://git.nullprogram.com/GameOfLife.git

Compiled (2009-12-13): GoL.jar (6.7kb)

Since I recently got back into Java recently, I threw together this little Game of Life implementation in Java. It looks a lot like my maze generator/solver on the inside, reflecting the way I think about these things. Gavin wrote a competing version of the game in Processing which we were partially discussing the other night, so I made my own.

The individual cells are actually objects themselves, so you could inherit the abstract Cell class and drop in your own rules. I bet you could even write a Cell that does the iterated prisoner's dilemma cellular automata. The Cell objects are wired together into a sort of mesh network. Instead of growing it wraps around on the sides.

It takes up to four arguments right now, with three types of cells, basic, implementing the basic Conway's Game of Life, growth, which is a cell that matures over time, and random which mixes both types together (seen in the screenshot). The arguments work as follows,

java -jar GoL.jar [<cell type> [<width> <height> [<cell pixel width>]]]

I may look into extending this to do some things beyond simple cellular automata.


Tweaking Emacs for Ant and Java

Developing C in Emacs is a real joy, and it's mostly thanks to the compile command. Once you have your Makefile -- or SConstruct or whatever build system you like -- setup and you want to compile your latest changes, just run M-x compile, which will run your build system in a buffer. You can then step through the errors and warnings with C-x `, and Emacs will take you to them. It's a very nice way to write code.

I use the compile command so much that I bound it to C-x C-k (C-k tends to be part of compile key bindings),

(global-set-key "\C-x\C-k" 'compile)

Until recently, I didn't have as nice of a setup for Java. Since they generally force offensive IDEs onto me at work this wasn't something I needed yet anyway, but I get to choose my environment on a new project this time. If you're using Makefiles for some reason when building your Java project, it still works out fairly well because they're usually called recursively. It gets more complicated with Ant, where there is only one top-level build file. Emacs' compile command only runs the build command in the buffer's current directory.

I know three solutions to this problem. One is to provide the build file's absolute path when compile asks for the command with the -buildfile (-f) option. You only need to type it once per Emacs session, so that's not too bad.

ant -emacs -buildfile /path/to/build.xml

It's not well documented, but there is a -find option that can be given to Ant that will cause it to search for the build file itself. This is even nicer than the previous solution. Just remember to place it last, unless you give it the build filename too. For example, if you wanted to run the clean target,

ant -emacs clean -find

To keep the actual call as simple as possible, I wrote a wrapper for compile, and put a hook in java-mode to change the local binding. The wrapper, ant-compile, searches for the build file the same way -find would do.

(defun ant-compile ()
  "Traveling up the path, find build.xml file and run compile."
  (interactive)
  (with-temp-buffer
    (while (and (not (file-exists-p "build.xml"))
                (not (equal "/" default-directory)))
      (cd ".."))
    (call-interactively 'compile)))

So I can transparently keep using my muscle memory compile binding, I set up the key binding in a hook,

(add-hook 'java-mode-hook
          (lambda () (local-set-key "\C-x\C-k" 'ant-compile)))

Voila! Java works looks a little bit more like C.


First Maryland Snow of 2009

Yesterday was a slightly chilly day, in the mid 50s (Fahrenheit), so when my officemate said we were getting snow I thought she was kidding. Well, today we woke up to this winter wonderland,

My wife leapt for her camera and filmed that as soon as she realized what was going on. We had much more than that by late afternoon. According to the Baltimore Sun it has snowed on December 5th for six of the last eight years, and we don't get much snow around here. My only wish would be for this to have happened on a weekday. I work for a university, and educational organizations tend to fold and give snow days really easily. This would probably have given me a nice paid day off!


Emacs Web Servlets

Remember that Emacs web server I wrote back in May? Well, I got an e-mail last night from Chunye Wang containing a patch with a variant of my dynamic lisp idea, called "servlets" (not to be confused with Java servlets). Chunye had similar concept for an Emacs web server for a long time, but never implemented because Emacs lacked network functionality until recently (Specifically, make-network-process in Emacs 22.1, June 2007). This led Chunye to find my implementation.

Again, you can clone/view the code here. I turned the patch into a series of commits,

git clone http://git.nullprogram.com/emacs-httpd.git

This is some cool stuff here.

The servlets are simply functions installed under an "httpd/" namespace, where the trailing slash represents the server root. So, the function httpd/example-servlet will be executed when "/example-servlet" is requested from the server. The servlet runs on a temporary buffer, whose contents are served when the servlet function returns.

To assist in HTML generation, Chunye also wrote a function to turn an S-expression into HTML, similar to the one I described in the web server previous post. Symbols are converted into strings, alists are attributes, and the elisp symbol indicates code to be executed, and the results used to generate HTML. For a simple hello word,

(html (head (title "hello world")) (body "hello world"))

And for some dynamic content, a die roller,

(defun httpd/roll-die (uri-query req uri-path)
  "Rolls a die with the requested number of sides (default 6)."
  (let ((sides
         (1- (string-to-number (or (cadr (assoc "sides" uri-query)) "6")))))
    (httpd-generate-html
     '(html
       (head
        (title "Die Roll Servlet"))
       (body
        (h1 "Die Roll Servlet")
        "You rolled a "
        (b
         (elisp (list (number-to-string (1+ (random sides)))))))))))

That one would be accessed from the browser with with "/roll-die" or "/roll-die?sides=100".

Chunye provided some sample servlets that list the buffers, with links that serve them up. There is also another servlet that will switch the current buffer, which I find compelling. All of Emacs' functionality is available to the servlet.

Now, to write a servlet that runs the Emacs psychiatrist ...


Your BitTorrent Client is Probably Defective by Design

Your BitTorrent client probably has DRM in it, even if it's Free software. Torrent files (.torrent) may contain a special, but unofficial, "private" flag. When it does, clients are supposed to disable decentralized tracking features, regardless of the user's desires. This is a direct analog to the copy-prevention flag that PDFs may have set, which tell PDF viewers to cripple themselves, except that your Free PDF reader is actually more likely to ignore it.

It's impossible to simply open the torrent file and turn off the flag. The client has to be modified, fixing the purposeful defect, to ignore it. Note, simpler clients that don't have these features in the first place don't have this problem, since they don't have any features to disable.

The private flag exists because modern BitTorrent trackers can function without a central tracker. If the central tracker is down, or if the user doesn't want to use it, the client can fetch a list of peers in the torrent from a worldwide distributed hash table. It's one big decentralized BitTorrent tracker (though any arbitrary data can be inserted into it). Clients also have the ability to tell each other about peers when they are doing their normal data exchange. Thanks to this, clients can transcend central trackers and join the larger global torrent of peers. It makes for healthier torrents.

Anyone who knows a few peers involved with a torrent can join in, regardless of their ability to talk to the central tracker. But private tracker sites don't want their torrents to be available outside to those outside of their control, so they proposed an addition to the BitTorrent spec for a "private" flag. Clients with decentralized capabilities are advised cripple that ability when the flag is on, so no peer lists will leak outside the private tracker. This flag was never accepted into the official spec, and I hope it never is.

Unfortunately the private trackers set an ultimatum: obey the private flag or find your client banned. The client developers fell in line and, and as far as I am aware, no publicly available clients will use decentralized tracking while the flag is on. At one point, the BitComet client ignored the flag and was banned for some time until it was "fixed".

The private flag wasn't placed in front with the rest of the metadata where it belonged. It's intentionally placed at the end of the torrent file inside of the info section. This means that the flag is part of the info_hash property of the torrent, which is the global identifier for the torrent. Unset or remove the private flag and the hash changes, creating a whole new torrent without any seeds.

This is DRM, an artificial restriction imposed on the user. It's insulting. Users should be the ones that control what happens with their computers. The reasonable approach to a private flag is that, when the private flag is enabled, decentralized tracking is turned off by default, but can be re-enabled by the user should they desire. That way the desired behavior is indicated but the user has the final say, not some unrelated website operator.

I rarely use private trackers, since they are nearly pointless, but I still find this private flag set on public torrents, probably from someone simply reposting the torrent file from a private site. It's annoying to run into. It makes the torrents weaker.

Debian, which is my distribution of choice, is generally good about removing DRM from the software it distributes. For example, the PDF readers in the repositories have their DRM disabled (i.e. xpdf). So why not do the same thing for all the intentionally defective BitTorrent clients?

I went on the Debian IRC channel on Freenode and brought up the issue only to find out that everyone thought a little DRM was reasonable. So then I filed a bug report on it, which was simply closed citing that the DRM is a beneficial "feature" and that removing the intentional defect would make the clients "poorer". They also insisted that it's part of the spec when it's not. I'm really disappointed in Debian now.

Now, I could modify a client to ignore the flag, but it's not useful if I am the only one not running DRM. It takes two to tango. A client used by many people would have to be fixed before it becomes beneficial.

So when someone asks for an example of Free Software or Open Source software with DRM in it, you can point to BitTorrent clients.


Comments Upgrade with Avatars

I started getting Asian spam in my comments in the last couple days. If you are subscribed to the comments feed you probably noticed this. The spammer was manually filling out captchas, so this wasn't a bot but rather a patient human being. Getting tired of removing these, I set up some filters to silently drop messages that fit certain criteria. By "silently" I mean the server tells the client everything went fine but the comment never actually gets written to the database.

The spammer gained nothing except annoying us because all links in comments get a rel="nofollow" attribute, which tells search engines to ignore it. That, plus small readership and captcha-solving gives little incentive for spamming.

Well anyway, while I had my sleeves rolled up and my hands on the code I decided to make some upgrades I have been wanting to do. The e-mail address is no longer displayed (stupid idea in the first place) but instead used for a Gravatar image. You can also specify a home URL, which will be linked from your name. This makes my comment system work very similarly to what you find around the web, except that I don't require anything from you but a captcha and a comment.

I also fixed a small usability issue: when you preview a comment now it takes you right down to the form rather than leaving you at the top of the page. The back-end database was also adjusted from the original pollxn design to scale better as the website grows.

Now, Gravatar is a neat concept but I have two complaints. One, I don't like centralized systems. It's a single point of failure and a single point of control. It has privacy issues. It's anti-web. It's inelegant. Decentralized systems built around self-enforcing protocols are more robust and democratic.

Luckily, a decentralized version does exist! It's a specification called Pavatar. The avatar is tied to a URL rather than an e-mail address. However, it's a bit less flexible, since it needs to remain simple on the server side. It's harder to set up and I doubt 99% of the users on the web would be capable of doing it. What would help Pavatar gain a wider audience would be Pavatar provider services. Hmmm...

So, I think might switch it to Pavatar sometime, with a possible fall-back to Gravatar. That takes significantly more work to set up than Gravatar does, so it's a future project. And, well, no one uses it yet either. I actually thought the project was dead until just now because their website was down the first time I visited it a couple months ago.

My second complaint is that Gravatar incorrectly assumes e-mail addresses are not case-sensitive. The domain part is, but the alias part is not. These two addresses could technically arrive to two different e-mail inboxes,

chris@example.com
Chris@example.com

Pretty much every e-mail server will treat these as the same address as a convenience, because treating these differently would just be confusing, but it's not necessarily the case. Gravatar specifically says to hash the e-mail in lowercase form, so the unique address Chris@example.com can't be used with the service.

So, go ahead and play in the comments a bit.


Unorderable Sets

Under Gavin's suggestion, I've been watching The Prisoner, a 1960's British television show. The main character is an ex-spy held prisoner in "the Village", an Orwellian, isolated, enclosed town. No one in the Village has a name, but is instead assigned a number. The main character's number is 6.

As far as I can tell, after number 2 the order of the numbers is not important. Number 56 is no more important than number 12. By using numbers to name things there is an implied ordering, even if the the ordering is insignificant. It could be misleading to a newcomer.

Is there an unordered set could be used to name things? More specifically, is there a set that cannot be ordered? If it is unorderable then there is no implicit ordering to cause confusion. It's easy to have an unorderable set in theory, but I think it is difficult to have in practice.

Using letters is obviously out, as the alphabet has an order. Words and names made of letters can be sorted according to the alphabet. However, the ability to order words is almost never used outside of indexing. If words are used to name things, a newcomer is unlikely to assume relationships based on ordering. No one will assume Alan is more important than Bob.

Large numbers also tend to lack an assumed order. I don't think anyone assumes a larger or smaller social security number has meaning, or a larger or smaller phone number. However, these values are also known to be handed out in some semi-random way.

But can we do better? For at least English speakers, is it possible to create an unorderable set? If the items in the set have a vocal pronunciation, then they can probably be ordered by their phonetics. That could be avoided by using non-standard phonetic components, like clicks and pops, which won't have a standard ordering (in English, anyway).

A set has an order if there is a total, transitive, relational operator for the set. If such an operator does not exist then the set isn't linearly ordered. I want a set that can't easily have such an operator.

If a set of symbols was created, how might they be presented as to show no ordering. The order of the symbols in the original presentation might be considered the ordering, like how the alphabet is always presented in order. A circle could be used, but this is circularly ordered. I think there is also the issue of memorization. A human will have a much better time memorizing the symbols if memorized in some order. For example, try naming all the letters of the alphabet at random, without repeats. Or US states.

Thanks to modern day technology, with dynamic content, the set could be displayed in a random order each time it is viewed. For a web page, the server could select a random order, or a JavaScript program could reorder the images at random.

There could be partially ordered sets, like hierarchies and DAGs. The ordering in The Prisoner is one of these. There is number 1, then number 2, then everyone else. Is there a partially ordered set in use that has unique names at the same level?

The penalties incurred by intentionally prohibiting order would likely outweigh the benefit of the set. If it's not orderable, we can't index it, and it's difficult to deal with. I expect it's much easer to just use numbers and tell people that the order isn't important, or just use an obviously unordered set.


SumoBots Programming Game

In the summer of 2007 my friend, Michael Abraham, and I wrote a robot programming game in Matlab. The idea came out of a little demo he wrote where one circle was chasing another circle. Eventually the circles could be run by their own functions, then they were shooting at each other, then dropping mines, and so on. It was dubbed "Matbots".

It turned into a little programming competition among several of us at work. Someone would invent a new feature, like cooperative bots or bullet dodging, that would allow it to dominate for awhile. Then someone else would build on that and come up with some other killer feature. At one point we even hooked up a joystick so we could control our own bot live against our creations.

I never shared that here. Someday I will.

Anyway, Mike took a little bit of the Matbots code and design and came up with a new, similar game called SumoBots. Programmable bots are in a frictionless circular playing area with the ability to accelerate in any direction. Collisions between bots are governed by a spring force dynamics. Bots are out of the game when they leave the circular area. The goal is to knock out all the other bots. Here's a video,

I like how smooth and natural that looks.

He got a number of people at his lab to code up some bots to compete. I threw the code up over at bitbucket, a Mercurial host, which you can checkout with,

hg clone http://bitbucket.org/skeeto/sumobots/

(I haven't convinced anyone else over there to use a version control repository, let alone this one, but here's hoping!)

You can use either Matlab or Octave to program your bot. It works in both, with Octave being on the slow side. For Octave, you'll need to define a cla function as a call to clf (they don't define it for some reason). To start coding a bot, just look at the samplebot bot for information on programming a bot. Edit the players in player_list.txt, then launch it by running engine (engine.m).

I wrote one bot so far called bully. It charges at the nearest bot, being careful to cancel out its own velocity. In the video above this bot is red.

The other bots so far is kyleBot, blue in the video, which passively runs circles around the outside of the surface hoping to trick aggressive bots, like mine, into running out. The bot2fun bot, black in the video, is like bully, but tries to keep safely near the center of the playing area. The brick bot, green in the video, is a dummy, practice bot that just gets knocked around.

The tricky part in writing a bot is managing the frictionless surface. Acceleration can't just be in the direction you want to travel, but also against the current velocity. You might want to set up some kind of process control for this.

Right now I think there is a balance issue. Passive bots have an advantage, but the game won't progress if everyone takes this advantage and writes passive bots. It's like camping in a first-person shooter. As a partial solution to these stalemates, Mike has proposed that the playing surface should shrink over time.

If you want to join in, grab the code and start coding.


null program Turns Two Years Old

This website/blog turns two years old today. That's right, the first post was on September 1, 2007. This also happens to be the 100th post! Woo! Which means I have nearly kept up with my original one post per week planned average (104 would put me on schedule).

The site has evolved a bit in that time. Two years ago I was using vanilla blosxom (I've since hacked it up), no comments, no portrait (portraits increase reader trust), simpler logo, no color, no RSS (it was there, but I wasn't linking it), crappy index, no features but this blog and the "projects" section, no Creative Commons license (but I've always had the other simple copyleft statement), no lisp, and only one permalink per post. Since I started this site I lived in three different US states, had two different jobs, bought two computers, bought a car, adopted a cat, learned several programming languages, graduated with a bachelor's degree, had major surgery, and got married.

I've tried to make all my posts more than a trivial "hey, check this link out", and instead have some real content for each one. Thanks to this, if you concatenate all the posts together it's long enough to be a short book (about 170 pages). And, this blog is 100% Free Culture. All text and images are public domain or under a Free license (and none of that non-commercial-only crap). So I -- or anyone -- could easily publish it as a book. (Ha! As if anyone would buy it!) And I do maintain another completely separate blog, with a (currently) higher post frequency, under a pseudonym (so I don't get myself into trouble), which is why I won't tell you what it is. :-)

I'm really happy I kept kept this "log" of my activities. I've had some real misconceptions about my previous self, which the log has cleared up. I often think I came up with ideas or had viewpoints much more recently than I really did ("Oh, learning Scheme really was that long ago."). I wish it went back further, because, for example, I really am not sure at which point I became an atheist (early 2003?) or a libertarian (summer 2003?). I tend to think of my previous self as more stupid than I really was, which gives me this false impression of intellectual progress, when really I am as dumb as I ever was! I know one one thing for sure, my writing hasn't really improved!

So, thanks for reading! If you don't already have a blog, start one today. Keep up with it, and at two years I'm sure you'll be really glad you have it.

Here's to another two years!


An Entropy-Efficient Virtual Dice Rolling Algorithm

This is related to a project I am working on and will post here soon. I imagine that, with a little more effort, this algorithm could turn into a short amateur paper.

Suppose you want to use a computer to simulate the roll of two six-sided dice (notated 2d6). The simplest approach would be to replicate the results the same way you would roll dice: independently and randomly generate two numbers between 1 and 6 inclusively. We easily can do this for any number of dice, we just iterate and roll each die. Like this recursive function,

(defun roll (n sides)
  "Roll n dice with given number of sides."
  (if (zerop n) ()
    (cons (1+ (random sides))
          (roll (1- n) sides))))

However, generating a number between 1 and 6 wastes small amounts of entropy. A six-sided die only takes about 2.58 bits of entropy to generate. Since we can only use bits discretely we have to spend 3 bits, throwing out 0.42 bits. On top of that, when we pull out 3 bits and they are out of range (0 or 7) we have to throw them out and try again.

Let's say we wanted to roll 10 dice, or 100 dice, or 1000 dice? Do we really need to generate that many numbers individually? That's a lot of wasted entropy adding up, entropy which can be expensive to gather. Well, we could instead use the probability distribution of the roll so that only a single number needs to be generated.

For a 2d6 roll, there are 36 unique possible outcomes (6^2). We could select a number between 0 and 35, then choose that specific roll. This roll can be calculated with a series of division and modulus operations (u for a number from a uniform distribution) (also, note that the division is integer division),

(defun roll-perm (n s u)
  "Get uth permutation of n s-sided dice."
  (if (zerop n) ()
    (let ((perms (expt s (1- n))))
      (cons (1+ (/ u perms))
            (roll-perm (1- n) s (mod u perms))))))

If we're only interested in the sum, we could save memory by making this tail recursive -- or iterative -- and summing the dice as we calculate them. Ignoring the exponent, this is O(n), not better than the simple algorithm in terms of growth rate. This algorithm is more efficient when it comes to entropy, though.

Consider 3d6, with 216 possible outcomes, ideally with the simple algorithm takes 3 3-bit rolls, consuming 9 bits. About 1.25 bits was not actually used (0.42 * 3). In the entropy-efficient algorithm we need about 7.75 bits, so it only consumes 8 bits of entropy. We saved a bit. That gap only gets larger with more dice. For 100d6 the simple algorithm uses 41 more bits than necessary.

The efficient roll is basically defragmenting the individual rolls on the entropy stream.

In non-ideal world, though, some cases don't work out well. In 12d6, almost half the numbers (compared to 25% in the case of 1d6) from the uniform distribution will be out of range and a lot more bits would be needed. On average, rolling dice individually (or only some of them individually) for 12d6 will be more efficient.

The efficient algorithm is only more efficient above a point near where mod(log2(s), 1) < mod(log2(n^s), 1).

And, all of this doesn't come without a cost. You must pay the piper, and this algorithm is paid with CPU and memory. Notice that exponent there? That has to be done to exact precision (no floating point), and it grows very quickly. If you want to roll more than a handful of dice, you will be crunching some large numbers. Rolling just 100d6 means you have to work with a 78 digit integer. 10000d6 is a 7782 digit integer. These can't be done in floating point because the resolution of floating point is too low: some rolls would not be possible.

The exponent could be memoized to trade some of that CPU time for more memory usage. Still, pretty costly. If you don't value your entropy, the tradeoff might not be worth it.

I can't see a way around performing that calculation. We need to know that big number exactly. Perhaps a mathematician might be able to manipulate the formulas such that it's not so expensive.

If you're rolling lots of dice and you want to preserve binary entropy, try it out. If you want to be really efficient queue up rolls -- or generate them ahead of time -- so that the number of outcomes is just below a power of two. In the case of d6, some good number of dice to roll are 17 (~43.94 bits), 29 (~74.96), 41 (~105.983), 94 (~242.986 bits), 200 (~516.993), 253 (~653.995 bits), 306 (~791.99853 bits), and 971 (~2509.99859 bits). (Notice these get closer and closer to an integer number of bits.)


Lossless Optimizers

I've been using lossless optimizers for awhile now for PNGs, but more recently I have found some for other formats. Here's the ones I know about. These are all intended to be lossless, so running them should result in no information loss (well, except the SVG one).

For PNG, there are a number of choices, but my favorite is OptiPNG. It adjust the PNG parameters and recompresses to find the optimal parameters. I run it on almost all my images around here, and I tend to get around 10% to 30% reduction for images fresh off Gimp, Kolourpaint, and ImageMagick.

For JPEG, I use jpegoptim. It works by optimizing the Huffman tables (the lossless part of JPEG compression). I only found this one recently, but I will be using it all the time, like on our new thousands of wedding reception photos.

For PDF, I found something called QPDF. It's designed more for other PDF transformations, but without any other parameters it will simply losslessly optimize a PDF. From what I've seen so far it cuts PDFs down by about a third.

For SVG, Scour is a young project, only a few months old. I've been looking for an SVG optimizer for some time, so this was exciting to find. Due to the type of file it's working with, it's not quite entirely lossless. Visually, it is lossless, but it will toss all metadata (comments, etc.), which may be important. If you hand-crafted your SVG, you won't want to use this tool. It's good for removing Inkscape and Illustrator cruft, though.

I have yet to find a good (Free) GIF optimizer. Animated GIFs, with lots of redundancy between frames, have a lot of potential for optimization too. A video optimizer (for, say, Theora) would be interesting; I imagine it might work similarly to jpegoptim. Audio files (like Vorbis, FLAC, or MP3) probably don't have any room for optimization. I could be wrong. For XHTML there is tidy if you want to count that. All the other XML formats (ODF, RSS, etc.) could have their own too. Or optimizers for archives, like zip and tar. For tar it might rearrange things to better suit gzip, bzip2, or lzma. Executable optimizers? Postscript optimizers? It goes on and on.

If you know about any more, especially for other file formats, let me know.


IRC Random Number Generator

A year ago while I was reading Applied Cryptography I got an idea for using IRC as a random number generator. The book mentioned using a rolling car wheel to generate random numbers, by measuring its period. So why not IRC? Grab the code,

git clone http://git.nullprogram.com/ircrng.git

It's based entirely on event timing: it logs in and sits on several channels, then measures the time between events. When a new event occurs it compares the time passed between this event and the last event, and the time between the two events before it. If it's greater, emit a 1, if less emit a 0. Simple.

For skew removal I used transition mappings (rfc1750), invented by von Neumann. It looks at pairs of bits. If they differ, pass the first bit, otherwise toss both bits. So if it comes across "11" or "00", it tosses them. If it comes across "10" it emits "1", and "01" it emits 0.

Here's an analysis by ent of 476 output bytes I generated overnight: irc.random.bytes.

Entropy = 7.515132 bits per byte.

Optimum compression would reduce the size
of this 476 byte file by 6 percent.

Chi square distribution for 476 samples is 274.79, and randomly
would exceed this value 25.00 percent of the times.

Arithmetic mean value of data bytes is 127.7248 (127.5 = random).
Monte Carlo value for Pi is 2.886075949 (error 8.13 percent).
Serial correlation coefficient is -0.062745 (totally uncorrelated = 0.0).

Eh, not awful, but not too great. It got a good score on the Chi-square test, which I attribute to the skew filter. This generator is also extremely slow, generating only a few bytes per hour. At best, each event only generates a half of a bit, after skew correction. It would probably be more profitable to take the hash of the actual event with a timestamp and use that as the RNG.

Also, someone else running this generator on the same channels would generate very similar numbers. Worse, someone eavesdropping on your network connection knows mostly what numbers you generated. Worse yet, someone actively modifying your connection could control your generator and force whatever output is desired.

So overall, I would say it's not very useful.


Flying Spaghetti Monster in the HTML 5 Spec

I was reading through the HTML 5 specification earlier tonight and I was delighted to come across an obvious reference to the Flying Spaghetti Monster, complete with notes about His "noodly appendages". I'm guessing that someone silently slipped it in there and the people that might be offended won't get the reference anyway.

It's in an example for the img tag,

In the following example, a picture representing the flying spaghetti monster emblem, with each of the left noodly appendages and the right noodly appendages in different images, so that the user can pick the left side or the right side in an adventure.

For those who are unaware, it is the deity in a parody religion Pastafarianism. It's often used to demonstrate how silly religion is.


Web Pages Are Liquids

Web pages aren't a static medium, like books, brochures, or pamphlets. The web is not print. Accordingly, the layout of web pages should not be locked to some static width, but instead flow to fill the width of the browser like a liquid. Web pages should normally have a liquid layout.

One of the most obvious problems with the fixed layout occurs when the browser window is stretched wider than the designer had intended. The page looks like this,

Image showing wasted space.

I, as a user, have little control my viewing of the website. I'm stuck reading through a keyhole. It gets much worse if the browser isn't as wide as the designer intended: a horizontal scrollbar appears and navigation becomes very difficult. My laptop runs at a resolution of 1024x768, and I frequently come across pages where this is an issue. And according to Jakob Nielsen, in 2006 77% of user's screens were 1024 pixels wide or less.

See the liquid for yourself right here: adjust the width of your browser and watch this text flow to fill the screen. You can also bring it in pretty far before you clip an image and the horizontal scrollbar appears. The exact width depends only on the widest image being displayed. This also comes into play if you adjust the font size.

Using a liquid layout allows the page to work well with a wide variety of screen widths, and most importantly, gives users lots of control over how they view the site. It's very unfortunate that (in my experience) most websites employ a poor, fixed layout. Even web design "expert" websites will ironically hand out web design tips from within these annoying confines. One of the biggest culprits driving this is Wordpress, which has this flawed layout by default.

The very worst offenders tend to be websites with little actual content, like corporate websites or "artist" portfolios. The less usable the page, the less I wanted to be there anyway.

So please drop the fancy, low-usability web designs for something with much better usability. Your users will probably appreciate it.


Ad-blocking and the Regrettable URL Format

I use Adblock Plus to block advertisements and, more importantly, invisible privacy-breaking trackers (most people aren't even aware of these). I think ad-blocking is actually easier than ever, because ads are served from a relatively small number of domains, rather than from the websites themselves. Instead of patterns matching parts of a path, I can just block domains.

Adblock Plus emphasizes this by providing, by default, a pattern matching the server root. Example,

http://ads.example.com/*

But sometimes advertising websites are trickier, and their sub-domain is a fairly unique string,

http://ldp38fm.example.com/*

That pattern isn't very useful. I want something more like,

http://*.example.com/*

Unfortunately Adblock Plus doesn't provide this pattern automatically yet, so I have to do it manually. I think this pattern is less obvious because the URL format is actually broken. Notice have have two matching globs (*) rather than just one, even though I am simply blocking everything under a certain level.

Tim Berners-Lee regrets the format of the URL, and I agree with him. This is what URLs like http://ads.example.com/spy-tracker.js should look like,

http://com/example/ads/spy-tracker.js

It's a single coherent hierarchy with each level in order. This makes so much more sense! If I wanted to block example.com and all it's sub-domains, the pattern is much simpler and less error prone,

http://com/example/*

To anyone who ever reinvents the web: please get it right next time.

Update: There is significant further discussion in the comments.


Three Rivers Stadium Implosion

As noted elsewhere here, I grew up in Pittsburgh, Pennsylvania. Back in February 2001 I got to witness Pittsburgh's Three Rivers Stadium demolition. My dad brought our old 1980's video camera to film it, and afterwards we dumped the recording to the computer. And so I present another low-quality video,


Original video dump

I put this up on YouTube in September 2006, and as of this writing it has gotten about 28,000 views. Now that I finally put it here, it's clearly available under any of the licenses at the bottom of this page.

We were standing nearly underneath the Fort Pitt Bridge (seen at the top of the video) facing directly north, next to some railroad tracks. You can see all three rivers: Ohio river to the left, Monongahela river to the right, and the Allegheny in the top middle behind the Point.

That thick dust cloud you see at the end rose up and engulfed the whole city, overtaking us as we headed back to the car. So I got a little bit of history in my lungs. If you're some distant-future archaeologist reading this maybe you can retrieve some particles from my decomposed rib cage. Put them in a museum or something.


Television Commercials

Old television First, let me note that I don't watch television. At least not in the sense of sitting on the couch, turning it on, and flipping through the stations. I can't stand the compressed audio, the constant, loud commercial interruptions, and general lack of control over my viewing. VCRs, and more recently PVRs, have mitigated these last two points, but not enough to grab my interest.

The way I see it, there are four ways to access television. Here is the matrix,

2x2 table thingy

For an "acceptable" situation we have cost-free television, but with advertising, in broadcast and streaming television. And in the opposite "acceptable" situation we have ad-free television, but with a monthly fee, in premium television. I think these two are acceptable compromises. Someone else can foot the bill, or you can foot the bill.

In a few cases, such as viewer-supported television like PBS, it's both cost-free and ad-free. This is pretty nice. You can have your cake and eat it too.

However, most television is only legitimately available in the worst case situation! Not only do you have to pay to access it, but one-third of it is annoying, unwanted advertising. This is awful, and it is one reason why I choose not to participate.

Luckily, there is another "best case" option which provides quick access to most television shows of the world: peer-to-peer file-sharing. Unfortunately, it doesn't include live television, and it's usually not quite legal. We have the technology to distribute large amounts of data to huge numbers of people at practically no cost, but a bunch of old, out-of-date laws stand in the way. It's a shame. I think this quote by "muuh-gnu" sums it up well,

We have 2009. Everybody and their dog has a computer, which is designed to copy stuff. Also we have broadband which is, again, designed to ... move stuff around the world. So is what you're actually pointlessly advocating is that we collectively should ... actually what? Abstain from using a common technology in order to make absurdly archaic 50's business models of "manufacturing and selling single copies" viable in day and age when everybody can manufacture and distribute those copies themselves?

It's a good thing some bad laws don't get in the way of progress too much.


Dry Ice Potato Gun

This was almost 6 years ago now. In my freshman year of college, three of us (Matt Stine and Matt Takach) designed and assembled a dry ice powered potato gun. We got the idea for using dry ice after purchasing some from the famed Penn State Creamery. They sold it for 50 cents per pound, no questions asked. The purpose was for keeping ice cream cold over a long car trip, but they'd sell it to you even if you didn't buy ice cream.

Here is a low-quality video of it in action,


Original video source.

I don't recommend doing this yourself unless you are going to take extreme caution. Like, more caution than we did. We weren't too careful ourselves and were really lucky no one got hurt. The dry ice "bombs" can be very powerful, enough to shatter PVC tubing. We did bring safety glasses with us (but they went mostly unused). We also did this in cold weather, which made things a bit safer.

On to the details, here is a diagram I drew up at the time,

Diagram of the potato gun.

It's just glued PVC tubes and a nail. The bottom of the "blast chamber" had a screw-on cap, which kept getting blown to smithereens. We bought several of those. The original plan also had some wires, connected to some kind of high-resistance wire, so we could control when the gun would fire. That didn't work out.

The "power cell" was a soda bottle with water and dry ice in it. Some brands worked better than others. We had to smash the dry ice into little pieces to get them in the bottle. Dry ice goes in first, then the water, then the cap goes on. After that it's a ticking time bomb, and you want to load it into the gun ASAP. In warm weather you have, maybe, a few seconds. In cold weather it could be 30 minutes.

Yeah, you don't want that in your hand when it explodes.

Once the bottle explodes, there is a lot of pressure behind that potato, so it gets quickly pushed down and out of the barrel. As I said, in that cold weather this could take about 30 minutes. In the video above we were adding warm water to speed up the process.

Gun construction

When we fired a golf ball out of the gun straight up in the air, it took about 10 seconds to hit the ground. This put it at maybe 50 m/s (110 mph) barrel exit speed. When launching into the field, we found the potatoes between 150 and 250 meters.

At night we stuck glow sticks in the potatoes so we could watch their entire trajectory. Pretty cool.

Some campus "security" person stopped us on our way home, since carrying a muddy pipe looks suspicious. We told her it was just a potato "cannon", which she told us "wasn't allowed on campus" (yeah, yeah). That's time to just walk away.

Here are some pictures of the construction, and my excessive beard,

Gun construction Gun construction Gun construction Gun construction Gun construction

Extra, less interesting video,


Lisp Fantasy Name Generator

Earlier this year I implemented the RinkWorks fantasy name generator in Perl. I think lisp lends itself even better for that, and so I have a partial elisp implementation for you.

What stands out for me is that the patterns can easily be represented as a S-expression. We represent substitutions with symbols, literals with strings, and groups with lists. For example, this pattern,

s(ith|<'C>)V

can be represented in code as,

(s ("ith" ("'" C)) V)

I want a function I can apply to this to generate a name. First, I set up an association list with symbols and its replacements,

(defvar namegen-subs
  '((s ach ack ad age ald ale an ang ar ard as ash at ath augh
       aw ban bel bur cer cha che dan dar del den dra dyn
       ech eld elm em en end eng enth er ess est et gar gha
       hat hin hon ia ight ild im ina ine ing ir is iss it
       kal kel kim kin ler lor lye mor mos nal ny nys old om
       on or orm os ough per pol qua que rad rak ran ray ril
       ris rod roth ryn sam say ser shy skel sul tai tan tas
       ther tia tin ton tor tur um und unt urn usk ust ver
       ves vor war wor yer)
    (v a e i o u y)
    ...
    (d elch idiot ob og ok olph olt omph ong onk oo oob oof oog
       ook ooz org ork orm oron ub uck ug ulf ult um umb ump umph
       un unb ung unk unph unt uzz))
  "Substitutions for the name generator.")

Since we will need this in a couple places, make a function to randomly select an element from a list,

(defun randth (lst)
  "Select random element from the given list."
  (nth (random (length lst)) lst))

A function for replacing a symbol,

(defun namegen-select (sym)
  "Select a replacement for the given symbol."
  (if (null (assoc sym namegen-subs))
      (throw 'bad-symbol
             (concat "Invalid substitution symbol: " (format "%s" sym)))
    (symbol-name (randth (cdr (assoc sym namegen-subs))))))

And finally, the generator. Find a string, pass it through, find a symbol, substitute it, find a list, pick one element and recurse on it.

(defun namegen (sexp)
  "Generate a name from the given sexp generator."
  (cond
   ((null sexp) "")
   ((stringp sexp) sexp)
   ((symbolp sexp) (namegen-select sexp))
   ((listp sexp)
    (concat (if (listp (car sexp)) (namegen (randth (car sexp)))
              (namegen (car sexp)))
            (namegen (cdr sexp))))))

That's it! We can apply it to the expression above,

(namegen '(s ("ith" ("'" C)) V))
-> "rynithi"

But that's really the easy part. The hard part would be converting the original pattern into the S-expression, which I don't plan on doing right now.

Something else to note: this is thousands of times faster than the Perl version I wrote earlier.

I threw the code in with the rest of my name generation code (namegen.el),

git clone http://git.nullprogram.com/fantasyname.git

S-expressions are handy anywhere.


Converted to HTML 5

I converted this website to HTML 5 from its original XHTML 1.0 Transitional. Because I was already valid, it was actually pretty easy. Well, I first converted it to XHTML 1.0 Strict by shifting some things into CSS that should have been that way in the first place (results of laziness). Then when that validated I just changed the DOCTYPE to HTML 5, added a new meta tag, checked that validation and that was it. Really simple.

One of the main reasons was so that I could use the new video tag. From now on any posts with videos will display them as in-line videos (Ogg Vorbis), provided you are using a browser with HTML 5 support. In fact, I already did it with two older posts: Knight's Tour and the Mandelbrot set zoom videos. Here, check it out,

HTML 5 is still a moving target, so I may have to do some keeping up over time. Probably not.

Video is finally a first class web citizen and it is quite exciting.


Browser URL Mangling

Public Domain angry face. All but one of the major browsers do special URL mangling as a "feature", but I think of it as bad behavior. Specifically, Firefox gets it right while IE, Opera, Safari, and Chrome all do it wrong. The last three are likely replicating the broken behavior of IE. I would rather not do this.

I ran into this specific issue work. An IT person had sent out a URL to an important internal website, which contained links to other resources. However, this person, despite being in IT, wasn't very tech-savvy and wrote all of the links with backslashes instead of slashes. When I followed with Firefox I got a 404 Not Found because Firefox asked for a filename with a backslash in it -- proper behavior.

As first I thought the page was out of date with dead links, but it was obviously a brand new page. That's when I noticed that the links had the wrong slashes. A quick check with IE showed the links as working, which is why the IT person didn't notice his error. IE was automatically, and silently, converting the backslashes to slashes.

According to RFC 1738, all URLs must encode their backslashes because of buggy transport agents (and, years later, because of buggy browsers like IE). So, if you want to use a backslash in a URL you are supposed to encode it anyway, but if you don't most major browsers will silently mangle the URL. I really, really don't like this. I know what I am doing, Mr. Browser.

It's done to replicate IE behavior, and IE did it because it's a common error for Windows users to use backslashes when they shouldn't (Windows directory separators), so it is anticipating user error (and assuming it knows best!). Let's not cripple our software to bring it down the same level as the major shortcomings of Windows, the toy OS.

I wrote a little test suite for testing if your browser does this annoying mangling. You can try it with your browser below,

URL Mangling Test

Other than the browsers listed above, the only other browser I know that fails is Links (not Lynx). Note that Opera passes the "automatic" test, but fails the "manual" test, because it doesn't mangle the http-equiv URLs used in the automatic tests.

So beware using the browsers I listed at the beginning. Not only may they be sending you to the wrong URL, but they are crippling the web for the sake of Microsoft's errors.


The Emacs Calculator

Did you know that Emacs comes with a calculator? Woop-dee-doo! Call the presses! Wow, a whole calculator! Sounds a bit lame, right?

Actually, it's much more than just a simple calculator. It's a computer algebra system! It is officially called a calculator, which isn't fair. It's an understatement, and I am sure has caused many people to overlook it. I finally ran into it during a thorough (re)reading of the Emacs manuals and almost skipped over it myself.

Ever see that demonstration by Will Wright for the game Spore several years ago? The player starts as a single-cell organism and evolves into a civilization with interstellar presence. When he started the demo he showed a cell through what looked like a microscope. No one had any idea yet what the game was about, so every time he increased the scope, from bacteria to animal, animal to civilization, civilization to space travel, interplanetary travel to interstellar travel, there was a huge reaction from the audience. It was like those infomercials: "But that's not all!!!"

As I made my way through the Emacs calc manual I was continually amazed by its power, with a similar constant increase in scope. Each new page was almost saying, "But that's not all!!!" Like an infomercial I'm going to run through some of its features. See the calc manual for a real thorough introduction. It has practice exercises that shows some gotchas and interesting feature interactions.

Fire it up with C-x * c or M-x calc. There will be two new windows (Emacs windows, that is), one with the calculator and the other with usage history (the "trail").

First of all, the calculator operates on a stack and so its basic use is done with RPN. The stack builds vertically, downwards. Type in numbers and hit enter to push them onto the stack. Operators can be typed right after the number, so no need to hit enter all the time. Because negative (-) is reserved for subtraction an underscore _ is used to type a negative number. An example stack with 3, 4, and 10,

3:  3
2:  4
1:  10
    .

10 is at the "top" of the stack (indicated by the "1:"), so if we type a * the top two elements are multiplied. Like so,

2:  3
1:  40
    .

The calculator has no limitations on the size of integers, so you work with large numbers without losing precision. For example, we'll take 2^200.

2:  2
1:  200
    .

Apply the ^ operator,

1:  1606938044258990275541962092341162602522202993782792835301376
    .

But that's not all!!! It has a complex number type, which is entered in pairs (real, imaginary) with parenthesis. They can be operated on like any other number. Take -1 + 2i minus 4 + 2i,

2:  (-1, 2)
1:  (4, 2)
    .

Subtract with -,

1:  -5
    .

Then take the square root of that using Q, the square root function.

1:  (0., 2.2360679775)
    .

We can set the calculator's precision with p. The default is 12 places, showing here 1 / 7.

1:  0.142857142857
    .

If we adjust the precision to 50 and do it again,

2:  0.142857142857
1:  0.14285714285714285714285714285714285714285714285714
    .

Numbers can be displayed in various notations, too, like fixed-point, scientific notation, and engineering notation. It will switch between these without losing any information (the stored form is separate from the displayed form).

But that's not all!!! We can represent rational numbers precisely with ratios. These are entered with a :. Push on 1/7, 3/14, and 17/29,

3:  1:7
2:  3:13
1:  17:29
    .

And multiply them all together, which displays in the lowest form,

1:  51:2842
    .

There is a mode for working in these automatically.

But that's not all!!! We can change the radix. To enter a number with a different radix, which prefix it with the radix and a #. Here is how we enter 29 in base-2,

2#11101

We can change the display radix with d r. With 29 on the stack, here's base-4,

1:  4#131
    .

Base-16,

1:  16#1D
    .

Base-36,

1:  36#T
    .

But that's not all!!! We can enter algebraic expressions onto the stack with apostrophe, '. Symbols can be entered as part of the expression. Note: these expressions are not entered in RPN.

1:  a^3 + a^2 b / c d - a / b
    .

There is a "big" mode (d B) for easier reading,

          2
     3   a  b   a
1:  a  + ---- - -
         c d    b

    .

We can assign values to variables to have the expression evaluated. If we assign a to 10 and use the "evaluates-to" operator,

          2
     3   a  b   a             100 b   10
1:  a  + ---- - -  =>  1000 + ----- - --
         c d    b              c d    b

    .

But that's not all!!! There is a vector type for working with vectors and matrices and doing linear algebra. They are entered with brackets, [].

2:  [4, 1, 5]
1:  [ [ 1, 2, 3 ]
      [ 4, 5, 6 ]
      [ 6, 7, 8 ] ]
    .

And take the dot product, then take cross product of this vector and matrix,

2:  [38, 48, 58]
1:  [ [ -14, -18, -22 ]
      [ -19, -18, -17 ]
      [ 15,  18,  21  ] ]
    .

Any matrix and vector operator you could probably think of is available, including map and reduce (and you can define your own expression to apply).

We can use this to solve a linear system. Find x and y in terms of a and b,

x + a y = 6
x + b y = 10

Enter it (note we are using symbols),

2:  [6, 10]
1:  [ [ 1, a ]
      [ 1, b ] ]
    .

And divide,

          4 a     4
1:  [6 + -----, -----]
         a - b  b - a

    .

But that's not all!!! We can create graphs if GNU Plot is installed. We can give it two vectors, or an algebraic expression. This plot of sin(x) and x cos(x) was made with just a few keystrokes,

Sample plot of given functions.

But that's not all!!! There is an HMS type for handling times and angles. For 2 hours, 30 minutes, and 4 seconds, and some others,

3:  2@ 30' 4"
2:  4@ 22' 13"
1:  1@ 2' 56"
    .

Of course, the normal operators work as expected. We can add them all up,

1:  7@ 55' 13"
    .

We can convert between this and radians, and degrees, and so on.

But that's not all!!! The calculator also has a date type, entered inside angled brackets, <> (in algebra entry mode). It is really flexible on input dates. We can insert the current date with t N.

1:  <6:59:34pm Tue Jun 23, 2009>
    .

If we add numbers they are treated as days. Add 4,

1:  <6:59:34pm Sat Jun 27, 2009>
    .

It works with the HMS format from before too. Subtract 2@ 3' 15".

1:  <4:56:32pm Sat Jun 27, 2009>
    .

But that's not all!!! There is a modulo form for performing modulo arithmetic. For example, 17 mod 24,

1:  17 mod 24
    .

Add 10,

1:  3 mod 24
    .

This is most useful for forms such as n^p mod M, which this will handle efficiently. For example, 3^100000 mod 24. The naive way would be to find 3^100000 first, then take the modulus. This involves a computationally expensive middle step of calculating 3^100000, a huge number. The modulo form does it smarter.

But that's not all!!! The calculator can do unit conversions. The version of Emacs (22.3.1) I am typing in right now knows about 159 different units. For example, I push 65 mph onto the stack,

1:  65 mph
    .

Convert to meters per second with u c,

1:  29.0576 m / s
    .

It is flexible about mixing type of units. For example, I enter 3 cubic meters,

       3
1:  3 m

    .

I can convert to gallons,

1:  792.516157074 gal
    .

I work in a lab without Internet access during the day, so when I need to do various conversions Emacs is indispensable.

The speed of light is also a unit. I can enter 1 c and convert to meters per second,

1:  299792458 m / s
    .

But that's not all!!! As I said, it's a computer algebra system so it understands symbolic math. Remember those algebraic expressions from before? I can operate on those. Let's push some expressions onto the stack,

3:  ln(x)

       2   a x
2:  a x  + --- + c
            b

1:  y + c

    .

Multiply the top two, then add the third,

                2   a x
1:  ln(x) + (a x  + --- + c) (y + c)
                     b

    .

Expand with a x, then simplify with a s,

                 2   a x y              2   a c x    2
1:  ln(x) + a y x  + ----- + c y + a c x  + ----- + c
                       b                      b

    .

Now, one of the coolest features: calculus. Differentiate with respect to x, with a d,

    1             a y             a c
1:  - + 2 a y x + --- + 2 a c x + ---
    x              b               b

    .

Or undo that and integrate it,

                       3      2                  3        2
                  a y x    a x  y           a c x    a c x       2
1:  x ln(x) - x + ------ + ------ + c x y + ------ + ------ + x c
                    3       2 b               3       2 b

    .

That's just awesome! That's a text editor ... doing calculus!

So, that was most of the main features. It was kind of exhausting going through all of that, and I am only scratching the surface of what the calculator can do.

Naturally, it can be extended with some elisp. It provides a defmath macro specifically for this.

I bet (hope?) someday it will have a functions for doing Laplace and Fourier transforms.


United States Hamiltonian Paths

Awhile ago I wanted to find every Hamiltonian path in the contiguous 48 states. That is, trips that visit each state exactly once. Writing a program to search for Hamiltonian paths is easy ( I did this already). The most time consuming part was actually putting together the data that specified the graph to be searched. I hope someone somewhere finds it useful. Here is a map for reference,

Map us the lower 48 United States

It took me several passes before I stopped finding errors. I think I have it all right now, but there could still be some mistakes. If you see one, leave a comment and I'll fix it here. Here is the graph as an S-expression alist; the car (first) element in each list is a state, and the cdr (rest) is the unordered list of states that can be reached from it.

((me nh)
 (nh vt ma me)
 (vt ny ma nh)
 (ma ri ct ny nh vt)
 (ny pa nj ma ct vt)
 (ri ma ct)
 (ct ri ma ny)
 (nj pa ny de)
 (de md pa nj)
 (pa nj ny de md wv oh)
 (md pa de va wv)
 (va md wv ky tn nc)
 (nc va tn ga sc)
 (sc nc ga)
 (ga fl sc al nc tn)
 (al ms fl ga tn)
 (ms la ar tn al)
 (tn ms al ga nc va ky mo ar)
 (ky wv va tn mo il in oh)
 (wv md pa oh ky va)
 (oh pa wv ky in mi)
 (fl al ga)
 (mi wi oh in)
 (wi mn ia il mi)
 (il in ky mo ia wi)
 (in oh ky il mi)
 (mo il ky tn ar ok ks ne ia)
 (ar mo tn ms la tx ok)
 (la ms ar tx)
 (tx ok nm ar la)
 (ok ks mo ar tx nm co)
 (ks ok co ne mo)
 (ne sd ia mo ks co wy)
 (sd nd mn ia ne wy mt)
 (nd mt sd mn)
 (ia ne mo il wi mn sd)
 (mn wi ia sd nd)
 (mt id wy sd nd)
 (wy id ut co ne sd mt)
 (co ne ks ok nm ut wy)
 (nm co ok tx az)
 (az nm ut ca nv)
 (ut nv id wy co az)
 (id mt wy ut nv or wa)
 (wa or id)
 (or wa id nv ca)
 (nv or id ut az ca)
 (ca az nv or))

Note that all paths must start or end in Maine because it connects to only one other state.


Javascript Distributed Computing

Diagram of Javascript/browser computing grid. I'm not the first to come across this idea: the browser could be used as part of a distributed computing system. A web server hands out Javascript and the browser runs the script and reports the results back to the server. The browser is a portable, widely available platform so just about anyone can easily contribute, possibly without even knowing.

Browsers aren't really expecting this sort of thing. They will complain if a script is running for too long. If you tell Firefox to continue running the script anyway, it will lock up until the script is done (or when it complains again). This can be worked around by writing a simple scheduler with setTimeout().

This could also potentially be used as an alternative to advertising. Instead of selling advertising space, a website operator could sell visitor's CPU time by including a little snippet of code. This may be more successful, because most visitors would be unaware of it, making it less intrusive. It will be less likely to be blocked. Of course, there ethical issues about this. In fact, there is already a company doing this with secret Java applets.

There are two serious constraints on using Javascript in a browser as a distributed computing platform:

Low bandwidth. There isn't a lot of opportunity to transfer data between the server and the node, and nodes can't talk to each other. The data needed by a node must be small. The results data must also be small.

Short computational units. The Javascript in the browser has no way to store its running state between browser sessions, so it must rely on the server for this. This means that the units of work must be able to be completed within a short period of time. A few minutes at the most on a normal computer.

A lot of problems won't fit inside these limitations. One that I thought might was a Mersenne prime search. A Mersenne prime is a prime of the form,

2n - 1

So even though the largest known Mersenne prime has nearly 13 million digits (about 5 MB just to store the entire number), it can be described by it's exponent, 43,112,609, which is small enough to fit in a 4-byte integer. The result of a calculation, a probabilistic primality test, is a "yes" or "no". One bit. It fits the first constraint very well.

However, the smallest amount of work a node can do is an entire primality test. If we break it down any further, the prime will have been expanded and we will not fit the first constraint. There will be too much data. To see how possible it might be, I implemented it, which you can try out here,

/download/mersenne/ (sloppy code warning!)

I modified an existing Javascript bigint library which allowed me to get it up running quickly. After you receive the page, your browser will run the Miller-Rabin primality test on 2^9941 - 1. You can edit the source HTML to try a different exponent. Running it on several of my computers on different browsers it took anywhere from an hour to 8 hours. And that's only with an exponent of 9941. It's an unsuitable problem.

It would be neat to see a browser computing grid in action, but I can't think of a problem to solve with it.


E-mail Obfuscater Perl One-liner

If you look at the page sources around here you might notice that there are no bare e-mail addresses around. This is because I obfuscate them into a series of HTML entities. So far this has been pretty effective at hiding from address-scraping, web-crawling spam bots. They don't seem to try very hard at decoding HTML entities.

When I added the comment system, I needed to obfuscate addresses automatically. I quickly realized that this is yet another perl one-liner (and implemented as one line in the comment system). It can be used on the command line to obfuscate a file/pipe containing a list of addresses.

perl -lpe '$_ = join "", map {"&#" . ord() . ";"} split //'

All of the spaces are really only there for us humans,

perl -lpe '$_=join"",map{"&#".ord.";"}split//'

I keep running into these one-liners.


Elisp Wishlist

I've been using elisp a lot lately, but unfortunately it's missing a lot of features that one would find in a more standard lisp. The following are some features I wish elisp had. Many of these could be fit into a generic "be more like Scheme or Common Lisp". Some of these features would break the existing mountain of elisp code out there, requiring a massive rewrite, which is likely the main reason they are being held back.

Closures, and maybe continuations. Closures are one of the features I miss the most when writing elisp. They would allow the implementation of Scheme-style lazy evaluation with delay and force, among other neat tools. Continuations would just be a neat thing to have, though they come with a performance penalty.

Closures would also pretty much require Emacs switch to lexical scoping.

Arbitrary precision. Really, any higher order language's numbers should be bignums. Emacs 22 does come with the Calc package which provides arbitrary precision via defmath. Perl does something like this with the bignum module.

Packages/namespaces. Without namespaces all of the Emacs packages prefix their functions and variables with its name (i.e. dired-). Some real namespaces would be useful for large projects.

C interface. This is something GNU Emacs will never have because Richard Stallman considers Emacs shared libraries support to be a GPL threat. If Emacs could be dynamically extended some useful libraries could be linked in and exposed to elisp.

Concurrency. If some elisp is being executed Emacs will lock up. This is a particular problem for Gnus. Again, Emacs would really need to switch to lexical scoping before this could happen. Threading would be nice.

Speed. Emacs lisp is pretty slow, even when compiled. Lexical scoping would help with performance (compile time vs. run time binding).

Regex type. I mention this last because I think this would be really cool, and I am not aware of any other lisps that do it. Emacs does regular expressions with strings, which is silly and cumbersome. Backslashes need extra escaping, for example. Instead, I would rather have a regex type like Perl and Javascript have. So instead of,

(string-match "\\w[0-9]+" "foo525")

we have,

(string-match /\w[0-9]+/ "foo525")

Naturally there would be a regexpp predicate for checking its type. There could also be a function for compiling a regexp from a string into a regexp object. As a bonus, I would also like to use it directly as a function,

(/\w[0-9]+/ "foo525")

I think a regexp price would really give elisp an edge, and would be entirely appropriate for a text editor. It could also be done without breaking anything (keep string-style regexp support).

There is more commentary over at EmacsWiki: Why Does Elisp Suck.


Elisp Running Time Macro

I wanted an elisp macro that could measure the running time of a block of code. Specifically, I wanted it to work like this,

(measure-time
  ...
  body
  ...)

And it would return the running time as seconds in floating point. Well, here's a macro that does it!

(defmacro measure-time (&rest body)
  "Measure and return the running time of the code block."
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    `(let* ((,start (float-time)))
       ,@body
       (- (float-time) ,start))))

It's only good for up to around 18 hours, then the time integer overflows. If only Emacs had arbitrary precision numbers. Here it is in action using my binomial function from last week.

(measure-time
  (nck 20 10)
  (nck 30 7))

Which, just now, returned 3.643713 seconds when executed.


Knight's Tour

An Ugly  Chess Knight A Knight's Tour is a path on a chess board such that a knight moving according to the rules of chess will visit each square exactly once. Here is a program that solves the Knight's Tour for various board sizes.

git clone http://git.nullprogram.com/knight.git

I wrote it purely for speed, and so in cases where the algorithm doesn't blow up (taking possibly thousands of years to complete) it only takes a few milliseconds to complete. The algorithm is recursive and goes like so,

  1. Given a board marked with visited spaces and a position create a list of valid moves.
  2. If the list is empty, return failure.
  3. Sort of the list ascendingly according to the opportunities of the target space. That is, sort by the number of valid moves that are available from the target space.
  4. Mark the current space as visited, then recurse from the new position with the new board starting with the first target in the list.
  5. If the recursion returns success, return success.
  6. If the recursion returns failure, recurse with the next target in the list, and so on. If all the targets on the list have been tried, return failure.

The purpose of going towards targets with few opportunities is so that we don't leave any stranded spaces that will not be detected until much later. When it does run into a dead end, it usually only has to back out a short ways and try again.

The algorithm breaks down on certain board sizes. For example, try a 120 by 120 board. Instead of nearly instantaneous solution like 120 by 119 would get, it takes enormous amounts of time. I don't yet know why.

Also included in the repository is a Perl script for creating frames for a video. Here are two videos, one for an 8x8 board, one for a 80x80 board. They are in Ogg Theora. Update: if your browser supports HTML 5 you can now watch these videos inline thanks to the video tag.

It's interesting how on the large one the behavior seems to be random at some places, but all it's doing is following that simple algorithm.


Doing Comment Previews the Right Way

Many comment/discussion systems get previews wrong. This even includes major sites like Boing Boing and Slashdot. Sometimes they feed back a different comment in the textarea, so repeated previews slowly degrade the comment. Other times the comment preview isn't the same thing as the final result. A comment actually has four states,

Diagram showing the four states oand their transforms.

The raw comment is the unfiltered string of bytes from the user. This is not safe to give directly back to the user, as it could be exploited to feed an arbitrary page to an innocent user.

The escaped comment is created from the raw comment by filtering it through the escapeHTML() function. This function creates HTML entities out of some of the characters, like < and >. A browser will interpret the escaped comment as a simple string, and is safe to give back to the user. This function is actually provided by perl's CGI module, so perl programmers need not implement this.

Note that escapeHTML() is reversible, though the server side won't need to reverse it. The browser does.

The stripped comment is created from the raw comment by filtering it through stripHTML(), which removes non-whitelisted HTML tags. It also strips non-whitelisted attributes from allowed tags. It should probably add a rel="nofollow" to links. It also runs escapeHTML() on attribute values and content outside tags. This is safe to give back to the user because only safe tags are left.

If your comments use markup other than HTML, like BBCode, this function should strip all HTML (your whitelist is empty) and do the conversion from your markup to HTML.

It might also be a good idea for it to produce well-formed HTML. This will allow your comments/discussion pages to be XHTML compliant.

stripHTML() is irreversible because it dumps information.

The stored comment is the encoding of the comment in the system. This depends entirely on the storage system. In some cases it may be identical to the stripped comment (and store is the identity function). If the comment is going through SQL into database, some characters may need to be escaped as to not cause problems. It could even be a base 64 encoding.

store() must be unambiguously reversible, and the server should have an unstore() to do this. It should probably also be able to convert any arbitrary string of characters into a safe encoding for storage.

There should only be one version of all these functions for both previews and final posting of comments.

When doing a comment preview both the escaped comment and the stripped comment are given back to the user. The stripped comment is dropped in as HTML, and the escaped comment is put into the textarea of the form. It would probably be convenient for the user if you give them back any other form information, including the same captcha and their answer to it (or not charge them with a captcha for that comment anymore).

You may be tempted to store the raw comments (safely with store()) and do HTML stripping on the fly. This would allow you to upgrade your HTML stripping function in the future to "better" handle user input. I don't recommend it. That's extra processing for each page request, but worse, it breaks the concept of the preview, because the comment formatting is subject to change in the future.

The hardest function to implement is probably stripHTML() because it needs to be able to handle poorly formed HTML. If you are using perl, you will probably want to use the HTML::Parser module, which is what I did. This does everything noted above and also auto-links anything that looks like a URL, forces proper comment nesting, automatically makes paragraphs from blank-line-separated chunks, and almost produces well-formed HTML.

htmlclean.pm

The documentation is basically non-existent, but if you want to whitelist more tags add them to @allowed_tags. Use it, abuse it.

I use this code in my comment system, so you can play around with it by using my preview function.


Unquoted Let

The lisp macro/form let quotes variable's symbols. This is almost always more useful than not quoting it. Otherwise, simple use would look like,

(let (('var 100))
  body)

instead of,

(let ((var 100))
  body)

In elisp, this is analogous to setq, which quotes its first argument, and set, which doesn't. It could be used like so in elisp,

(setq foo 'bar)
(setq bar 25)

(uq-let ((foo 10))
  (+ 10 bar))

The unquoted let form evalues to 20, not 35, because foo evaluates to the symbol bar, which is bound to 10. The unquoted version is used to select variable names dynamically.

This isn't very useful in a lexically scoped lisp because the variables that can access it are decided, well, lexically. With dynamic scoping we can use it to temporarily mask variables and select what variables to mask dynamically. I found a use for it in one of my projects.

I had a recursive function that searched a graph made of symbols. The symbols globally stored the state of the node. The current node was passed into the function, which would change the state of that node, and recurse. Here's how I was doing it,

(defun search (graph node)
  ...
  (set node 'visited)
  recurse
  (set node 'unvisited))

That looks a lot like it is emulating a let form. If we use an unquoted let,

(defun search (graph node)
  ...
  (uq-let ((node 'visited))
    recurse))

There, that's a lot more lispy. Here is an elisp macro for uq-let,

(defmacro uq-let (vars &rest body)
  "Unquoted let."
  (declare (indent defun))
  `(let ,(mapcar 
          (lambda (a)
            (cons (eval (car a)) (cdr a))) vars)
     ,@body))

From this, making a uq-let* is trivial. The documentation string could use some work too.

Lisp macros are powerful.


Getting Lisp

I've been using lisp on and off for the past few years. I read some lisp books, went through The Little Schemer and some of SICP. But I could never really think in lisp. When I needed to write some code, I would prefer another language first. I was writing imperative code for 10 years before I saw lisp, and I was used to it.

Recently, I have found myself wanting to use lists (including alists, plists, etc.) as data structures for everything, even when I'm not even writing lisp. I think I finally got lisp and now I want to use lisp for everything.

For example, take this little problem from the other day on f(t). Katie Nowak gives us a math problem,

There is a 75% chance of rain on any given day in the next week. What is the probability that it will rain on at least 5 of the 7 days?

Rain cloud in parenthesis The purpose of the question was to point out a neat coincidence in the problem (explained at the link). I used a program to solve the problem and there are two reasons for this.

First, I wanted to use the program to explore the problem and find the "special" property. With a program, I could quickly try different parameters, which would take longer, and be more error prone, by hand.

Second, which is similar to the first, I hate evaluating a large expression by hand. It's slow and error prone. Writing a program to do the same work is faster and mistakes are easier to catch. Also, I can quickly try different parameters to make sure my program's output is reasonable. In this case, for any reasonable input, the output, a probability, shouldn't be greater than 1.

Well, let's see, this is a Bernoulli experiment: each day is independent, so it is like flipping a coin seven times and counting the heads. That means we need the choose function.

My first thought was Octave, as this is a simple program and Octave already provides nchoosek() for me.

# Rain at least n days
function sum = rain (n)
  sum = 0;
  for i = n:7
    sum += nchoosek(7,i) * 0.75 ^ i * 0.25 ^ (7 - i);
  end
end

Simple, but I actually made a couple little mistakes working it out, and it took me a little longer than it should have. If you asked someone to write this program in any imperative language, it would probably look a lot like this.

I then made a lisp version (elisp), but I first needed to define the binomial coefficient function since there wasn't one provided.

(defun nck (n k)
  "The binomial coefficient."
  (if (or (= k 0) (= k n)) 1
    (+ (nck (- n 1) (- k 1)) (nck (- n 1) k))))

This is the recursive version, based on Pascal's rule, so it doesn't need factorials.

In lisp, recursion is preferred to iteration, so that's the way I approached the program.

(defun p-rain (n)
  "Probability of rain on exactly n days."
  (* (nck 7 n) (expt 0.75 n) (expt 0.25 (- 7 n))))

(defun rain-at-least (n)
  "Probability it will rain on at least n days."
  (if (> n 7) 0
    (+ (p-rain n) (rain-at-least (+ 1 n)))))

I like the recursive version, and this code, much better. It presents the solution in a more straight forward way. And I got it right the first time, too. Yes, it was written after the Octave version, but I still think it counts for something.

The lisp version is also less complex. The Octave version has to use some temporary variables, i and sum, which is extra conceptual overhead. Sure, it could also be written recursively, but this is not really the way Octave is meant to be written.

Eh, not a great example, really. Lisp has so many powerful features, like macros (the powerful lisp kind), symbols, low-level access into the interpreter, and the REPL, that allow the programmer to do some really cool things. Many of these features are unique to lisps.

I'm really liking lisp.


Another Perl One-liner: Byte Order

At work right now I am using two different machines. One is a 32-bit SPARC and the other is a very powerful SMP x86-64. Sometimes data are generated on one machine and used in a simulation on the other. There is a problem of byte-order, though. The SPARC is big-endian and the other is little-endian, and the programs on both sides don't pay attention to that small detail.

Luckily, the data are all 4-byte aligned. That's perfect for a Perl one-liner byte order conversion,

perl -e 'print scalar reverse while read STDIN, $_, 4' < in.le > out.be

Perl is really great for concise hacks. I really like how this one-liner almost reads like a natural language sentence. Is there any other language that can do powerful one-liners like Perl?


Emacs Web Server

GNU head As part of my quest of developing solid knowledge of GNU Emacs lisp, I have implemented a pseudo-HTTP/1.0 web server within Emacs. Behold,

git clone http://git.nullprogram.com/emacs-httpd.git

To all other non-emacsen text editors, can your text editor do that?! Ha! Even though elisp is a slow, closure-less, dynamically scoped, ugly cousin of more popular lisps, it's still a lot of fun to write.

To fire it up, load it into Emacs and run the extended command (M-x) httpd-start. By default it will serve files from "~/public_html". To change this, change the variable httpd-root to the desired web root. You can stop the server with httpd-stop.

It's about 200 lines of code and can serve static websites made of small, static files. I say small files because it serves files from buffers, meaning it has to read the entire file in first.

For a simple, text editor based server it can hold up to a pretty decent load. At one point I hit it with 8 wget instances all making rapid recursive downloads and my manual navigation wasn't slowed down noticeably. Despite running in the slow elisp interpreter, I think it can have much better performance by caching commonly served files in buffers.

It should run, unmodified, anywhere a modern Emacs can run, so I expect that it's already very portable. I can imagine it being useful in a situation where someone needs to temporarily host some files but there isn't a web server on the machine. Just grab this script and throw it at Emacs.

Well, it only does IPv4 right now, though I expect IPv6 only requires changing one number (namely, 4 to 6). I don't have any IPv6 systems to test it on.

When writing it I also had security in mind so, as far as I know, it should be safe to use. It cleans up the GET from the client so that no files underneath the serving root can be accessed.

The server log is lisp itself. Here is an example log starting the server, serving one request, and halting,

'(log
  (start "Wed May 13 23:33:34 2009")
  (connection
   (date "Wed May 13 23:36:25 2009")
   (address "192.168.0.3")
   (get "/0001.html")
   (req
    ("Referer" "http://192.168.0.2:8080/")
    ("Connection" "keep-alive")
    ("Keep-Alive" "300")
    ("Accept-Charset" "ISO-8859-1,utf-8;q=0.7,*;q=0.7")
    ("Accept-Encoding" "gzip,deflate")
    ("Accept-Language" "en-us,en;q=0.5")
    ("Accept" "image/png,image/*;q=0.8,*/*;q=0.5")
    ("User-Agent" "Mozilla/5.0 [...] Iceweasel/3.0.9 (Debian-3.0.9-1)")
    ("Host" "192.168.0.2:8080")
    ("GET" "/0001.html" "HTTP/1.1"))
   (path "~/public_html/0001.html")
   (status 200))
  (stop "Wed May 13 23:38:17 2009"))

The log is alists of alists, making a hierarchical tree structure that can be explored with some simple lisp functions. Normally this sort of thing is done with XML, but lisp already has its own structured format: lists!

When GET is a directory, it looks for "index.html" and serves that if it exists. More indexes can be added to the variable httpd-indexes. This can actually be done in a special ".htaccess.el" file.

If a ".htaccess.el" exists in the directory from which a file is being served, Emacs will first load/execute it. You see, it's just a lisp program. If you wanted to add a new index file name, the hypertext access file could contain this,

(add-to-list 'httpd-indexes "0001.html")

It's a bit like a .emacs file.

But I think one of the coolest things about having a lisp-based server is that the server can be modified in place without disrupting or restarting it. In my Emacs web server, the only change that requires a restart is changing the server port. In fact, I wrote most of it while the server was running and tested my changes from a browser right as I made them -- all on the same instance of the server.

If you want to look into the AI side of this, the server could modify its own code in response to its use.

I also had the idea of creating dynamic websites with elisp, in the same way PHP or Perl does. If a .el file (or .elc) is accessed, the server would pass the GET/POST arguments as an alist to a function in the elisp file. The server would also provide some nifty HTML generation macros. A dynamic script might look like this,

(defun script (get)
  (html
   (head
    (title "My Script"))
   (body
    (h1 "Your Query")
    (p (concat "Your query was "
               (html-sanitize (cdr (assoc "q" get)) "."))))))

However, this is not (yet?) implemented. Just an idea.

I will continue to work on it, though I don't expect to add much more to it. I will mostly improve the code and documentation.


I Finally Have Comments

I finally have a comment system, thanks to pollxn, a blosxom comment system that actually works. There is a link to it, indicating the number of comments, in the bottom of each post. Try it out and say hello.

Unfortunately, pollxn doesn't have any sort of anti-spam or CAPTCHA system. If you look around the Interwebs where other people are using pollxn, you will see everyone has their own little CAPTCHA thing. Well, I am not different. I hacked together my own to keep away automated spammers.

It selects words from the dictionary (of 40,000 words in this case) and encrypts them with Blowfish in CBC mode, with a unique IV each time. This is to passed to the user, who passes it to an image generator which decrypts the word and uses GD in Perl to render it, apply some transforms, and drop a line randomly over it. The user submits the guess of the image along with the encrypted version (hidden field), which is decrypted and compared on the other end. The same encrypted ID cannot be used twice, but thanks to the IV the same word can be used twice.

Here are some samples. If you hit refresh, they will render differently.

CAPTCHA sample: ormolus CAPTCHA sample: morons CAPTCHA sample: softer CAPTCHA sample: zanucks CAPTCHA sample: grumble CAPTCHA sample: nozzle

It's not a great CAPTCHA, but it should be good enough for the low volume of traffic I see here. As I inevitably collect small amounts of spam (by spammers manually passing the CAPTCHA), I will gradually create the needed tools to combat it. I can also easily update the CAPTCHA image algorithm without disrupting the functioning of the website.

I'm sure I will be making improvements to the comment system over time as well. I should make it obfuscate e-mail addresses, for one. Maybe add a preview. And better blosxom integration.

So say hello below! I am excited to finally have a real blog.


Greasemonkey User Scripts

Old manuscript I have recently been playing around with Greasemonkey, a great Firefox add-on that gives users a lot of control over how a website is displayed. It works by having the user providing a bit of Javascript that runs when a website is rendered. The user doesn't actually have to write the script, but can find them in various script repositories.

When I looked around, I couldn't find scripts to do some things I wanted, so I started writing my own. Now that I have started that habit I see uses for user scripts all over the place. Suddenly I can fix anything I find annoying on the web. It's very empowering.

Of course, Firefox add-ons can do anything a user script can do. But Greasemonkey user scripts are lightweight, more secure (due to being less powerful), easier to write, and don't require a browser restart to install and uninstall.

I posted my scripts on userscripts.org so that people could find them easily, but I always like to host these things locally, too. You can find it under /userscripts here. Don't forget to review the source before you install it! That is, unless you automatically trust me and my website's security. I, or an infiltrator, could slip something sneaky in there.

I actually first used Greasemonkey back in 2005, but they had some very serious security issues back in those days. It was bad enough that I just uninstalled it, which was actually recommended by the Greasemonkey people themselves. So, four years later I am back to check it out.

The first user script I wrote was in response to a "feature" on TV Tropes. In addition to the overall cruddiness of the website, they started adding "folders" to the information on long pages. It folds up all of the information behind little clickable widgets. It uses CSS to hide the information and Javascript to reveal it by adjusting those styles on the fly. If you have a browser with CSS support but not Javascript (or have it disabled like I did), you won't ever be able to see the information in the browser. As of this writing the Jumping the Shark article uses this. Take a look.

This is an awful idea! It critically breaks the usability of the page. And what's the point? We already have a vertical scrollbar to control the display. Unfortunately, a lot of clueless people seem to like this sort of behavior -- because it's flashy -- so we will probably only see more of it on the web in the future.

My TV Tropes user script is simple: it scrapes off some of the CSS. Specifically, the "folder" CSS. That's it! Someone who already knows how to make user scripts could probably put this together in less than 15 minutes.

If you want to tackle web annoyances, learn Greasemonkey!


Wikipedia Flu Time-lapse

Here's something interesting I saw on Wikipedia. There is a map of the US with states colored according to the spread of the H1N1 flu epidemic. Take a look: H1N1_USA_Map.svg. The interesting part is actually at the bottom of that image page.

Wikipedia, being a wiki, versions everything. It has to. No one actually changes a page, they just add a new version that is a derivative of the "current" version. Because of this, Wikipedia has incidentally created a time-lapse version of the map in its version control system. In case you can't see it when you are reading this, here's what part of it looks like,

Wikipedia screenshot

That's the timestamp and the state of the spread at that time, presented in an extremely useful way. And this was created by accident. Pretty cool, eh?


Clay Klein Bottle

A few years ago I made my wife -- girlfriend at the time -- a Klein bottle (well, the three-dimensional projection of one) out of clay. Since I hadn't used clay before I used some assistance from my dad. Here's how it was done,

Bottle diagram Bottom diagram image

As you can see, it's not quite the same as the generally depicted Klein bottle. The form you see here was easier to make with clay. After it was done, we baked it in a kiln. It's a bad idea to put sealed items in a kiln because they will burst as they heat. It took some time to convince the staff that our Klein bottle was actually unsealed.

Here are some pictures,

Front Side Bottom Top


Arcfour and CipherSaber in Emacs Lisp

I have previously talked about arcfour and CipherSaber and provided implementations in C. For fun, I made an implementation of arcfour in Emacs lisp (elisp), and built upon that to make a CipherSaber implementation in elisp. Check it out with Git,

git clone http://git.nullprogram.com/arcfour.git

If you don't have Git (yet), you can follow that link to use the web interface. The relevant files are arcfour.el and ciphersaber.el. There are some test vectors in there that I won't show here. Here is the arcfour implementation,

(defun rc4-init-state ()
  "Initialize the arcfour state vector."
  (interactive)
  (setq rc4-state (make-vector 256 0))
  (setq rc4-i 0)
  (setq rc4-j 0)
  (let (i)
    (dotimes (i 256 rc4-state)
      (aset rc4-state i i))))

(defun rc4-swap (i j)
  "Swap two elements in the state vector."
  (let ((temp (aref rc4-state i)))
    (aset rc4-state i (aref rc4-state j))
    (aset rc4-state j temp)))

(defun rc4-key-sched (key)
  "Arcfour key-scheduler: initialize state from key."
  (interactive "sEnter key: ")
  (let ((j 0) i)
    (dotimes (i 256 rc4-state)
      (setq j (% (+ j 
                    (aref rc4-state i) 
                    (aref key (% i (length key)))) 256))
      (rc4-swap i j))))

(defun rc4-gen-byte ()
  "Generate a single byte."
  (interactive)
  (setq rc4-i (% (1+ rc4-i) 256))
  (setq rc4-j (% (+ rc4-j (aref rc4-state rc4-i)) 256))
  (rc4-swap rc4-i rc4-j)
  (aref rc4-state (% (+ (aref rc4-state rc4-i) 
                        (aref rc4-state rc4-j)) 256)))

For the sake of simplicity it uses some global variables to store the cipher state. It would be better if the state was returned as a list, continuation, or closure. That way we could run a bunch of different ciphers at the same time.

And this provides interactive functions so that they can be called by a user right on the buffer being used, with M-x rc4-buffer.

(defun rc4-region (start end key)
  "Encrypt/decrypt region with arcfour using given key."
  (interactive "r\nsEnter key: ")
  (rc4-init-state)
  (rc4-key-sched key)
  (save-excursion
    (let (c)
      (goto-char start)
      (while (< (point) end)
        (setq c (char-after))
        (delete-char 1)
        (insert-char (logxor c (rc4-gen-byte)) 1)))))

(defun rc4-buffer (key)
  "Encrypt/decrypt entire buffer with arcfour."
  (interactive "sEnter key: ")
  (rc4-region (point-min) (point-max) key))

You may run into encoding issues with encrypted regions. The CipherSaber implementation below gets around this problem by turning off multi-byte encoding with set-buffer-multibyte.

In ciphersaber.el we apply these functions. The CipherSaber functions can be called with M-x cs-encrypt-buffer and M-x cs-decrypt-buffer. Note that this is only CipherSaber-1, and I leave CipherSaber-2 as an exercise for the reader :-P.

(defun cs-encrypt-buffer (key)
  "Encrypt buffer with CipherSaber-1."
  (interactive "sEnter key: ")
  (set-buffer-multibyte nil)
  (let* ((iv (cs-generate-iv))
         (cs-key (concat key iv)))
    (rc4-buffer cs-key)
    (beginning-of-buffer)
    (insert iv)))

(defun cs-decrypt-buffer (key)
  "Decrypt buffer with CipherSaber-1."
  (interactive "sEnter key: ")
  (let* ((iv (cs-retrieve-iv))
         (cs-key (concat key iv)))
    (rc4-buffer cs-key)))

(defun cs-generate-iv ()
  "Generate a 10-byte initialization vector."
  (let ((iv "") i)
    (dotimes (i 10 iv)
      (setq iv (concat iv (char-to-string (random 256)))))))

(defun cs-retrieve-iv ()
  "Remove initialization vector from buffer and return it."
  (beginning-of-buffer)
  (let ((iv "") i)
    (dotimes (i 10 iv)
      (setq iv (concat iv (char-to-string (char-after))))
      (delete-char 1))))

I didn't bother with save-excursion because I didn't think it would be important where the point is in the middle of an encrypted file. Feel free to add it, though!

These functions could be used to make a minor mode to transparently encrypt and decrypt CipherSaber files. It could also be modified to take advantage of Emacs' batch mode to handle CipherSaber processing right from the shell. But those are other projects!


CipherSaber

If you are a crypto-anarchist type like me, you should definitely take a look at CipherSaber. It is an extremely simple encryption protocol that even beginner programmers can implement. The protocol can also easily be memorized and quickly implemented from memory on the fly. In the case that cryptography was completely outlawed, CipherSaber would be a useful tool in allowing its users to continue to communicate privately.

I think the name is just perfect and captures everything CipherSaber is about. Here is the description right from the CipherSaber page,

In George Lucas' Star Wars trilogy, Jedi Knights were expected to make their own light sabers. The message was clear: a warrior confronted by a powerful empire bent on totalitarian control must be self-reliant.

CipherSaber is based on the arcfour stream cipher, but goes beyond it by defining the use of an initialization vector (IV) and how it is stored with the ciphertext. There are actually two versions: CipherSaber-1 and CipherSaber-2. The second one exists because of vulnerabilities in the first. The difference between them is small.

You want to make sure you generate a long enough passphrase for your encryption key. A normal password isn't good enough because an adversary will be able to throw all his available processing power at your ciphertext. Using Diceware would be a good idea here.

Here is the protocol.

CipherSaber diagram

Generate a 10-byte random IV. This need not be done using a very strong random number generator. It is only important that the same IV is not used more than once. Concatenate a secret user selected key (i.e. passphrase) with the IV and use that concatenation as the key for an arcfour cipher. Encrypt the message using the cipher. Concatenate the IV and the arcfour ciphertext to create the CipherSaber ciphertext.

To decipher, remove the first ten bytes of the ciphertext and use it as an IV. Concatenate the secret passphrase with the IV, and use it as the key for an arcfour cipher. Decrypt the remaining ciphertext with the arcfour cipher.

Because of vulnerabilities in the arcfour cipher, CipherSaber-2 is an updated version that runs the arcfour key scheduler at least 20 times. The exact number of times is a secret that the sender and receiver must agree on. Notice that CipherSaber-1 is CipherSaber-2 with only 1 key schedule iteration.

Using a large number of iterations could be considered a form of key strengthening. An adversary who is making a brute force attack on the ciphertext has that much more work to do for each passphrase trial.

You should really implement your own, but here is one of my implementations, written in C. I put it in with the rest of my arcfour stuff. Get it with git,

git clone http://git.nullprogram.com/arcfour.git

You can use it as a reference to make sure your first implementation is correct. You can use these two ciphertexts to test your implementation as well,

ciphersaber.png.cs
ciphersaber.png.cs2

This is the diagram image above (ciphersaber.png) encrypted with the key "nullprogram". The first one is CipherSaber-1 and the second is CipherSaber-2 with 20 key schedule iterations.


Emacs Htmlize and Highlighted Source Code

In the past I have always posted my code plainly without any sort of syntax highlighting. Boring! I finally got around to using the Emacs htmlize tool written by Hrvoje Niksic. So from now on I will try to provide syntax highlighting for inline code.

Htmlize leverages the syntax highlighting of Emacs by converting the faces into HTML. It's more or less what I see in my Emacs buffers. The exact output format can be in HTML + CSS, HTML with inline CSS, or HTML with old font tags. It can provide highlighting for any language that Emacs supports, which is just about everything.

I threw the CSS stuff in my main stylesheet so the XHTML of my posts remains simple. Here are some examples in action. Perl,

## Find the maximum total from top to bottom of the triangle.

use warnings;
use strict;
use List::Util qw/reduce max/;

my @triangle;
open my $datafile, "data67.txt";
while (<$datafile>) { push @triangle, [split/\s+/] }

my $result = reduce {
    for my $i (0..$#{$b}) {
        @$b[$i] += max(@$a[$i], @$a[$i+1]);
    }
    $b
} reverse @triangle;

print @{$result}[0] . "\n";

Scheme,

(define (sqrt x)
  (sqrt-iter 0 1.0 x))

(define (sqrt-iter last-guess guess x)
  (if (good-enough? last-guess guess)
      guess
      (sqrt-iter guess (improve guess x)
                 x)))

(define (good-enough? last-guess guess)
  (< (/ (abs (- last-guess guess)) guess) 0.001))

Octave,

# npoly()
function a = npoly (x, y)  
  X = repmat (x', 1, length(x));
  
  for i = 1:length(x)
    X(:,i) = X(:,i) .^ (i - 1);
  end
  
  a = X \ y';
end

C,

#define SWAP(a, b) if (a ^ b) {a ^= b; b ^= a; a ^= b;}

/* Initialize a keystream.  */
void init_keystream (keystream *k, int n)
{
  k->i = 0;
  k->j = 0;

  int i;
  for (i = 0; i < 256; i++)
    k->S[i] = i;

  int s;
  byte j = 0;
  for (s = 0; s < n; s++)
    for (i = 0; i < 256; i++)
      {
        j += k->S[i] + k->key[i % k->keylen];
        SWAP(k->S[i], k->S[j]);
      }
}

elisp,

(defun count-words-buffer ()
  "Print a message the number of words in the current buffer."
  (interactive)
  (save-excursion
    (let ((count 0))
      (goto-char (point-min))
      (while (< (point) (point-max))
        (forward-word 1)
        (setq count (1+ count)))
      (message "%d words." count))))

And even HTML,

<html>
<head>
<title>Sample HTML</title>
</head>
<body>
<h1>Hello World!</h1>
<!-- This is a comment. -->
<p>
  This is my HTML sample.
</p>
</body>

Custom Webcomic RSS Feeds

HTML to RSS I actually didn't start using RSS feeds until recently, and I wish I had started earlier. However, when I added my favorite webcomics to my feed I noticed most of them didn't actually include the comic itself. Personally, I would rather read the comic right in the feed.

Well, I later came across an old post on a blog I frequent: Terminally Incoherent. Luke Maciak posted some code to generate an RSS feed by scraping the comic's website. He did it in order to provide feeds for webcomics that don't have one. I took it, added a web front-end to it, and made the feeds available to anyone. You can find them under /feeds/ or at the link in the main left bar.

The scraper works by searching the comic's "latest comic" page for image sources matching a regular expression. So to teach the scraper about a new comic, all that needs to be provided is a URL and a regular expression matching only the image of the comic.

My front-end has a database of URL/regex pairs and takes care of caching results. This way it only queries the comic's website once every few hours at the most.

Here's where you can find the code,

feeds.pl
grab.pl

I also use an .htaccess file to write the links to look much cleaner,

RewriteEngine on
RewriteRule ^([\w\/]+)$ feeds.pl?q=$1
RewriteRule ^$ list

There are some comics that are immune. Specifically, some comics don't allow hot-linking, which includes links from an RSS feed. There is nothing I can really do about these.

If you enjoy these comics, go ahead and add these feeds if you want to read them right in your RSS reader.


A Not So Stupid C Mistake

I was reading through a website of " computer stupidities" today when I came across this,

if (a)
  {
    /* do something */
    return x;
  }
else if (!a)
  {
    /* do something else */
    return y;
  }
else
  {
    /* do something entirely different */
    return z;
  }

This was quickly dismissed as being an obvious beginner mistake. I don't think this can be dismissed so quickly without thinking it through for a moment. Yes, in the example above we will never reach the last condition where we return z, but consider the following,

if (a < b)
  printf ("foo\n");
else if (a > b)
  printf ("bar\n");
else if (a == b)
  printf ("baz\n");
else
  printf ("faz\n");

The same quick dismissal might drop the last "faz" print statement as being an impossible condition. Can you think of a situation where the program would print "faz"?

Our final condition will be reached if a or b is equal to NAN, which is defined by the IEEE floating-point standard. It is available in C99 from math.h. A NAN in any of the comparisons above will evaluate to false.

So don't be so quick to dismiss code like this.


SWF Decompression Perl One-liner

Magnifying glass Flash seems to be the popular way of playing videos online. This is a bit better than the bad old days of online video where a user had to select from a few buggy media player plug-ins. Things have improved.

However, if you don't use Flash, or if you want to watch the videos in your own media player, you are stuck. A download link for the video is almost never provided. The video is always somewhere, though, to be fetched via http. I mentioned this before for downloading YouTube videos using youtube-dl.

The trick is finding the URL. Sometimes you can derive it from the HTML code, sometimes you have to dig a little deeper by inspecting the Flash player itself. strings can be invaluable here.

There could be an extra layer of stuff to work out, which is explained below. My main reason for posting this is so I can refer back to it later when I need to do it again.

So, the other day I ran into a Flash video player that contained part of the URL of its video. I began by studying the embed tag in the HTML, which gave me some information about where to find the video (the video ID number). I downloaded the Flash player SWF file for the purpose of running strings on it.

I ran into a problem here. I wasn't finding any non-garbage strings inside the file. file told me it was compressed.

$ file player.swf
player.swf: Macromedia Flash data (compressed), version 9

Searching online quickly revealed that a compressed Flash file is just zlib compression after an 8-byte header. Decompression can actually be done with a Perl one-liner,

perl -MCompress::Zlib -0777 -e \
      'print uncompress substr <>, 8;' player.swf > player

I ran strings and greped for "http", revealing the location of the video. That was it!

I actually came across a Java program that does the same thing. It is 115 lines of code. Java programs always seem to be bloated like this.

I hope you find this useful!


URL Shortening

Short URL diagram There has been a lot of talk online about the fragility of URL shortening services, particularly in relation to Twitter and its 140 character limit on posts (based on SMS limits). These services create a single point of failure and break mechanisms of the web that we rely on. Several solutions have been proposed, so over the next couple years we get to see which ones end up getting adopted.

There are many different URL shortening services out there. They take a large URL, generate a short URL, and store the pair in a database. Several of these services have already shut down in response to abuse by spammers who hide fraudulent URLs behind shortened ones. If these services ever went down all at once, these shortened URLs would rot, destroying many of the connections that make up the world wide web. This is called the rot link apocalypse, and it has some people worried.

I am not very worried about this, though. I don't use Twitter, or any other service that puts such ridiculous restrictions on message sizes. Nor do I think information on Twitter is very important. Also, this mass link rot will occur gradually, slow enough to be dealt with.

In any case, short URLs may be useful sometimes, especially if a URL needs to be memorized or if the URL is extremely long. Or, it could be used to get around a design flaw in an inferior browser.

One idea that I have not yet seen implemented is simple data compression. When a short URL is needed, a user can apply a compression algorithm to the URL. The original URL can be recovered from this alone, so we don't have to rely on third parties to store any data.

I have doubts this would work in practice, though. Generic compression algorithms cannot compress such a small amount of data because their overhead is too large in relation. Go ahead, try pushing a URL through gzip. It will only get longer. We would need a special URL compression algorithm.

For example, I could harvest a large number of URLs from around the web, probably sticking to a single language, and use it to make a Huffman coding frequency table. Then I use this to break URLs into symbols to encode. The ".com/" symbol would likely be mapped to one or two bits. Finally, this compressed URL is encoded in base 64 for use. The client, who already has the same URL frequency table, would use it to decode the URL.

URLs don't seem to have too many common bits, so I doubt this would work well. I should give it a shot to see how well it works.

We probably need to stick with lookup tables mapping short strings to long strings. Instead of using a third party, which can disappear with the valuable data, we do the URL shortening at the same location as the data. If the URL shortening mechanism disappears, so did the data. The URL shortening loss wouldn't matter thanks to this coupling. Getting the shortened URL to users can be tricky, though.

One proposal wants to set the rev attribute of the link tag to "canonical" and point to the short URL.

<link rev="canonical" href="http://example.com/FbVT">

To understand this one must first understand the rel attribute. rel defines how the linked URL is related to the current document. rev is the opposite, describing how the current page is related to the linked page. To say rev="canonical" means "I am the canonical URL for this page".

However, I don't think this will get far. Several search engines, including Google, have already adopted a rel="canonical" for regular use. It's meant to be placed with the short URL and will cause search engines to treat it as if it was a 301 redirect. This won't help someone find the short URL from the long URL, though. It is also likely to be confused with the rev attribute by webmasters.

The rev attribute is also considered too difficult to understand, which is why it was removed from HTML5.

Another idea rests in just using the rel attribute by setting it to various values: "short", "shorter", "shortlink", "alternate shorter", "shorturi", "shortcut", "short_url". This website does a good job of describing why they are all not very good (misleading, ugly, or wrong), and it goes on to recommend "shorturl".

I went with this last one and added a "short permalink" link in all of my posts. This points to a 28 letter link that will 301 direct to the canonical post URL. In order to avoid trashing my root namespace, all of the short URLs begin with an asterisk. The 4 letter short code is derived from the post's internal name.

I also took the time to make a long version of the URL that is more descriptive. It contains the title of the post in the URL so a user has an idea of the destination topic before following through. The title is actually complete fluff and simply ignored. Naturally this link's rel attribute is set to "longurl".

Keep your eyes open to see where this URL shortening stuff ends up going.


Hashapass Password Management

Lock The author of a tool named Hashapass contacted me some time ago to bring his tool to attention. It is a way to mitigate the problem of having to memorize and generate many different passwords.

Good security practice is for users to have a different password with each web site and system they use. Should one of them be compromised, your other accounts will still be safe. The problem is that passwords tend to both be hard to remember and difficult to generate.

Hashapass allows a user to have just one password (ideally, passphrase) that is used to generate many different passwords. Provide the master passphrase and the name of the website (parameter) needing a password and Hashapass generates an 8-character password worth 48 bits.

The website works entirely in Javascript, so you don't have to worry about transmitting your password or master passphrase. This also makes it easy to see how the hashing is done. If this was a secret, I wouldn't recommend using it.

It works by applying HMAC, with the SHA-1 hash, to the the parameter and passphrase as to stir them together into a hash. Then it outputs the 48 most significant bits in base-64 as the password.

I mentioned before that you should really use a master passphrase instead of a master password, because a compromised hash password can be brute forced to reveal the master password. Unfortunately, the Hashapass website says "password" instead of "passphrase".

I made a Hashapass password cracker to test how practical this attack would be. You can grab it with Git,

git clone http://git.nullprogram.com/hashapass.git

The idea is that if a malicious website operator peeked at your password, knew you used Hashapass, and properly guessed the parameter (which isn't a secret), he could use a tool like this to brute force attack the password to retrieve the master passphrase. A short master password could easily be discovered.

Running on one machine with one instance of the program, my tool can break any password with five or less characters in a matter of hours. A 6-character password could take a month or two. A 7-character password would take a decade. Each character in the password increases the search time by a factor of 100.

If multiple computers/cores/processors are put to use on the attack, these times can be shortened: 2 computers would halve the time, for example. The attack is easy to parallelize.

My tool assumes a strong, but short, master password was chosen, as it checks against all printable ASCII characters. If a weaker password was used, and the attacker assumed this, the above time table would be much shorter.

So, for the master passphrase, use at least 8 characters generated using a strong random number generator. I recommend generating the passphrase with Diceware using 5 words.


Brainfuck Halting Problem

Stop sign On my brainfuck compiler project, I proposed pre-calculation as an optimization technique. The idea can work, but it has an issue that will always be unsolvable: how do you know that the pre-calculation will halt? This is called the halting problem and it has been proven impossible to solve.

The idea was that the compiler would run the brainfuck program up until the first input operation -- if there even was one. It would record all output and the final state of the memory. Instead of compiling the code was was run, it would compile code that would print all of the output and set the memory at the final state.

I has mistakenly assumed that it would be possible to detect a non-halting program and avoid doing pre-calculation on it. I described how it would be done and left it at that. Recently, someone kindly sent me an email containing only 5 letters:

+[--]

This defeated my ill-conceived idea.

Because brainfuck is Turing complete, it is actually impossible to determine whether or not an arbitrary brainfuck loop will halt. A computer can't do it. A human brain (a fancy computer) can't do it either. It cannot be done, at least not in this universe.

So, if implemented, this pre-calculation measure will always be flawed.


The Lazy Fibonacci List

GFDL licensed: found at /fdl-1.3.txt In a project I am working on, I want to implement a large list using lazy evaluation in Scheme. The list is large enough to be too unwieldy to store entirely in memory, but I still want to represent it in my program as if it was. The solution is lazy evaluation.

One use of lazy evaluation is allowing a program to have infinitely sized data structures without going into the impossible task of actually creating them. Instead, the structure is created on the fly as needed. As a prototype for getting it right, I made an infinitely long list in Scheme that contains the entire Fibonacci series.

This function, given two numbers from the series, returns the lazy list. It uses delay to delay evaluation of the list.

(define (fib f)
  (cons (cadr f)
        (delay (fib (list (cadr f)
                          (apply + f))))))

Notice the recursion here as no base case, so without lazy evaluation it would continue along forever without halting. Now run it,

> (fib '(0 1))
(1 . #<promise>)

The rest of the list is stored as a promise, which will later be teased out using force. This forces evaluation of the promise. Here is a function to traverse the list to the nth element and return it. Notice, this does have a base case.

(define (nth-fib f n)
  (if (= n 1) (car f)
      (nth-fib (force (cdr f)) (- n 1))))

Here it is in action. It is retrieving the 30th element.

> (define f (fib '(0 1)))
> f
(1 . #<promise>)
> (nth-fib f 30)
832040

If you examine f, it contains the first 30 numbers until running into an unevaluated promise. This behavior is very similar to memoization, as calculated values are stored instead of being recalculated later.

These two functions are also behaving as coroutines. When nth-fib reaches a promise, it yields to fib, which continues its non-halting definition. After producing a new value in f, it yields back to nth-fib.

The way I called these functions above, however, can lead to problems. We are storing all the calculated values in f, which can take up a lot of memory. For example, this probably won't work,

> (nth-fib f 1000000)

We will run out of memory before it halts. Instead, we can do this,

> (nth-fib (fib '(0 1)) 1000000)

Because nth-fib uses tail recursion as it traverses the list, unneeded calculated values are tossed (which the garbage collector will handle) and no additional function stack is used. All Scheme implementations optimize tail recursion in this way. This will continue along until it hits the millionth Fibonacci number, all while using a constant amount of memory.

It turns out that Scheme calls this type of data structure a stream, and some implementations have functions and macros defined so that they are ready to use.

So there you go: memoization, lazy evaluation, and coroutines all packed into one example.


Apartment Balcony Gardening

My wife was interested in growing a garden this summer. She has never done it before and wanted to learn. However, we live in an apartment so we don't exactly have a yard. Instead, we got some pots and started growing a balcony garden.

I thought we were being clever, but it turns out this is a common practice called container gardening. Searches online for "balcony gardening", "apartment gardening", and "container gardening" will bring up lots of useful information.

I think it's more convenient than regular gardening, as the plants are practically in the apartment. In fact, it's about 4 feet from our bed, through the balcony door. We can move the plants around to the other balcony, or even inside, if needed.

Gardening is a lot of fun, really.

This year, we are growing (or trying to grow) carrots, peppers, strawberries, catnip, and dactylis (cat grass). We'll see how things turn out in a couple months. Here's how it looks right now,

Balcony garden Top balcony garden Calvin inspection


Vimperator Firefox Add-on

Mouse crossed out I recently learned about an excellent Firefox add-on called Vimperator, which I have been using for a few days now. It creates an extremely efficient Vim-like interface to Firefox. One of the main functions is to be able to browse completely mouseless.

Why mouseless? Because the mouse is a bad input device for many uses of a computer. It's a good choice for many games, like first-person shooters, or graphic design, like Inkscape or GIMP. But for tasks like text editing, word processing, and data entry, the mouse one of the worst kinds of input device. The less you touch the mouse, the better.

Vimperator's argument is that the browser is better as a pure keyboard interface.

I am an Emacs person myself, which I use for text editing, file management, and IRC, but I appreciate the vi/Vim interface and accept it as being almost as good. Most of my vi experience actually comes from NetHack and Less. My main use for vi is editing my Debian sources.list so I can install Emacs.

Vimperator removes your toolbar, menu bar, and address bar. Then it transforms the status bar into the standard Vim status lines. This is because you don't need any of this stuff anymore with the Vim interface. It's traded for more browser real estate. This also creates the fun situation of watching your friends try to use your browser. At first, it really is pretty disorienting.

There is handy built-in documentation, found by pressing F1 or calling the :help command. You'll want to read through these before trying to do anything.

My problem right now is breaking my old Firefox keyboard muscle memory. Before Vimperator, my browsing was already fairly mouseless. I used keyboard shortcuts for everything. I had the Mouseless Browsing add-on installed, and occasionally used. When not using Mouseless Browsing, it worked out well because my right hand did the mouse, while most of the keyboard shortcuts could easily be performed with my left hand (C-tab, C-S-tab, C-t, C-w).

I think Vimperator has the potential to be even more efficient than that.

Probably one of the biggest adjustments is following links without a mouse. Like the Mouseless Browsing add-on, Vimperator assigns numbers to the links to be typed out. It is less intrusive though, because it doesn't reformat the page to show the numbers. It has a "hint" mode you go into for that. This mode displays the numbers over the links as red markers.

But even better than that, you don't generally even need those numbers. You can enter hint mode and begin typing the type of the link out. As soon as you reach a unique string prefix, it follows the link. This is the primary way I follow links, and I started doing this completely by accident. I wasn't even aware this was possible until I did it. Vimperator was completely natural in this respect.

Probably my favorite feature so far is automatic page advancement. I use these all the time now. One set of commands is C-a and C-x. These increment and decrement the last number in a URL, handy for those annoying multi-page articles. If they number the pages in the URL, this should handle it automatically. The other form of page turning is [[ and ]]. These search for links labeled "next", ">", "prev", "previous", and "<" and follow them. This works in Google searches and many web comics.

A potential use for macros is quick data scraping. You can write a macro to automatically perform a series of actions, like save the current page and move the next one, and have them repeat a number of times. It could also help in rapidly filling out the same form over and over, leaving only the CAPTCHA for manual input, if you were up to something mischievous.

For example, here is a macro to open in a new tab the first result of a Google search on the current page, then move to the next page. If you repeat it, it will open the first result on page 1, then the first result on page 2, and so on.

q s F 2 8 ] ] q

Note, the "28" may be different for you. To open the first result on the next 15 search result pages,

1 5 @ s

It is pretty cool watching it work away.

It's not perfect, though. Like Vim, you can prefix commands with numbers to repeat them, but this won't work with many commands, such as the page turning one above. You can get around it sometimes by placing the command in a macro.

Also, Vimperator still requires a mouse for many actions, like saving images. The worst part about it is these actions cannot be used as part of a macro. Hopefully Vimperator will improve in the future and fix this.

Give it a shot sometime. Like learning a good text editor for the first time, after you are set up, move your mouse out of reach so you are forced to use the keyboard. It slows you down in the short run, but you will be very fast later on down the road.


Avoid Zip Archives

An onion In a previous post about the LZMA compression algorithm, I made a negative comment about zip archives and moved on. I would like to go into more detail about it now.

A zip archive serves three functions all-in-one: compression, archive, and encryption. On a unix-like system, these functions would normally provided by three separate tools, like tar, gzip/bzip2, and GnuPG. The unix philosophy says to "write programs that do one thing and do it well".

So in the case of zip archives, we are doing three things poorly when, instead, we should be using three separate tools that each do one thing well.

When we use three different tools, our encrypted archive is a lot like an onion. On the outside we have encryption. After we peel that off by decrypting it, we have compression, and after removing that lair, finally the archive. This is reflected in the filename: .tar.gz.gpg. As a side note, if GPG didn't already support it, we could add base-64 encoding if needed as another layer on the onion: .tar.gz.gpg.b64.

By using separate tools, we can also swap different tools in and out without breaking any spec. Previously I mentioned using LZMA, which could be used in place of gzip or bzip2. Instead of .tar.gz.gpg you can have .tar.lzma.gpg. Or you can swap out GPG for encryption and use, say, CipherSaber as .tar.lzma.cs2. If we use a single one-size-fits-all format, we are limited by the spec.

Compression

Both zip and gzip basically use the same compression algorithm. The zip spec actually allows for a variety of other compression algorithms, but you cannot rely on other tools to support them.

Zip archives are also inside out. Instead of solid compression, which is what happens in tarballs, each file is compressed individually. Redundancy between different files cannot be exploited. The equivalent would be an inside out tarball: .gz.tar. This would be produced by first individually gzipping each file in a directory tree, then archiving them with tar. This results in larger archive sizes.

However, there is an advantage to inside out archives: random access. We can access a file in the middle of the archive without having to take the whole thing apart. In general use, this sort of thing isn't really needed, and solid compression would be more useful.

Archive

In a zip archive, timestamp resolution is limited to 2 seconds, which is based on the old FAT filesystem time resolution. If your system supports finer timestamps, you will lose information. But really, this isn't a big deal.

It also does not store file ownership information, but this is also not a big deal. It may even be desirable as a privacy measure.

Actually, the archive part of zip seems to be pretty reasonable, and better than I thought it was. There don't seem to be any really annoying problems with it.

Tar is still has advantages over zip. Zip doesn't quite allow the same range of filenames as unix-like systems do, but it does allow characters like * and ?. What happens when you extract files with names containing these characters on an inferior operating system that forbids them will depend on the tool.

Encryption

Encryption is where zip has been awful in the past. The original spec's encryption algorithm had serious flaws and no one should even consider using them today.

Since then, AES encryption has been worked into the standard and implemented differently by different tools. Unless the same zip tool is used on each end, you can't be sure AES encryption will work.

By placing encryption as part of the file spec, each tool has to implement its own encryption, probably leaving out considerations like using secure memory. These tools are concentrating on archiving and compression, and so encryption will likely not be given a solid effort.

In the implementations I know of, the archive index isn't encrypted, so someone could open it up and see lots of file metadata, including filenames.

When you encrypt a tarball with GnuPG, you have all the flexibility of PGP available. Asymmetric encryption, web of trust, multiple strong encryption algorithms, digital signatures, strong key management, etc. It would be unreasonable for an archive format to have this kind of thing built in.

Conclusion

You are almost always better off using a tarball rather than a zip archive. Unfortunately the receiver of an archive will often be unable to open anything else, so you may have no choice.


LZMA Tarballs Are Coming

Compression Any developer that uses a non-toy operating system will be familiar with gzip and bzip2 tarballs (.tar.gz, .tgz, and .tar.bz2). Most places will provide both versions so that the user can use his preferred decompresser.

Both types are useful because they make tradeoffs at different points: gzip is very fast with low memory requirements and bzip2 has much better compression ratios at the cost of more memory and CPU time. Users of older hardware will prefer gzip, because the benefits of bzip2 are negated by the long decompression times, around 6 times longer. This is why OpenBSD prefers gzip.

But there is a new compression algorithm in town. Well, it has been around for about 10 years now, but, if I understand correctly, was patent encumbered (aka useless) for awhile. It is called the Lempel-Ziv-Markov chain algorithm (LZMA). It is still maturing and different software that uses LZMA still can't quite talk to each other. 7-zip and LZMA Utils are a couple examples.

GNU tar added an --lzma option just last April, and finally gave it a short option, -J, this past December. I take this as a sign that LZMA tarballs (.tar.lzma) are going to become common over the next several years. It also would seem that the GNU project has officially blessed LZMA.

And not only that, I think LZMA tarballs will supplant bzip2 tarballs. The reason is because it is even more asymmetric than bzip2.

According to the LZMA Utils page, LZMA compression ratios are 15% better than those of bzip2, but at the cost of being 4 to 12 times slower on compression. In many applications, including tarball distribution, this is completely acceptable because decompression is faster than bzip2! There is an extreme asymmetry here that can readily be exploited.

So, when a developer has a new release he tells his version control system, or maybe his build system, to make a tar archive and compress it with LZMA. If he has a computer from this millennium, it won't take a lifetime to do, but it will still take some time. Since it could take as much as two orders of magnitude longer to make than a gzip tarball, he could make a gzip tarball first and put it up for distribution. When the LZMA tarball is done, it will be about 30% smaller and decompress almost as fast as the gzip tarball (but while using a large amount of memory).

At this point, why would someone download a bzip2 archive? It's bigger and slower. Right now possible reasons may be a lack of an LZMA decompresser and/or lack of familiarity. Over time, these will both be remedied.

Don't get me wrong. I don't hate bzip2. It is a very interesting algorithm. In fact, I was breathless when I first understood the Burrows-Wheeler transform, which bzip2 uses at one stage. I would argue that bzip2 is more elegant than gzip and LZMA because it is less arbitrary. But I do think it will become obsolete.

Unfortunately, the confused zip archive is here to stay for now because it is the only compression tool that a certain popular, but inferior, operating system ships with. I say "confused" because it makes the mistake of combining three tools into one: archive, compression, and encryption. As a result, instead of doing one thing well it does three things poorly. Cell phone designers also make the same mistake. Fortunately I don't have to touch zip archives often.

Finally, don't forget that LZMA is mostly useful where the asymmetry can be exploited: data is compressed once and decompressed many times. Take the gitweb interface, which provides access to a git repository through a browser. I run one myself. It will provide a gzip tarball of any commit on the fly. It doesn't do this by having all these tarballs lying around, but creates them on demand. Data is compressed once and decompressed once. Because of this, gzip is, and will remain, the best option for this setting.

In conclusion, consider creating LZMA tarballs next time, and don't be afraid to use them when you come across them.


GNU Screen

Byobu Another useful program I use every week is GNU Screen. It provides virtual terminals at a single terminal. It's a bit like a window manager for a text terminal. If you are a command line junkie (and if you are at all serious about computing, you should be), this is an essential piece of software.

The main reason I use screen is for its persistence. If I am running a long-running job on a remote machine (i.e. over ssh), like a large apt-get upgrade, I'll put it in a screen session. This way I can log out and, later, log in from anywhere and check on it. I have even used it to persist nethack sessions, though this isn't really necessary.

The only annoying part is that all of it's mappings are underneath C-a (ctrl+a), which is a very common Emacs/bash command, which I use a lot. To get the effect of C-a inside screen, you have to do it twice in a row because screen captures the first one.

If you don't already use it, try it out sometime.


Readline Wrap

I came across a very interesting tool the other day called rlwrap. It wraps the readline library over just about any interactive text input program. The readline library provides basic editing and history. It's handy for those programs that don't provide their own line editing facilities.

It tries to be as transparent as possible, detecting yes/no prompts and passwords, so it should still be reasonable under those conditions.

If you can't think of anything to try it with, try it with cat. Instant line editor!

rlwrap cat > some-file.txt

Or with Festival.

rlwrap festival --tts

It will also turn incorrectly compiled shells on your system into something usable. On my system (Debian GNU/Linux), csh isn't usable without rlwrap.

rlwrap csh

Distributed Issue Tracking

Oublic domain tacks image In a previous post I discussed decentralized version control systems, Git in particular. Because decentralized version control is becoming so popular, we now have an exciting new area of development: distributed issue tracking.

Decentralized issue tracking seems to have popped into existance in the last year or so. A number of projects have appeared (cil, ticgit, ditz, to name some), but the one that really stands out for me is ditz. Keep an eye on that one. It's fairly active and mostly usable.

Decentralized trackers generally work by storing the issue tracking database within the repository itself. One possibility is to have it sit in its own branch, which I think is the Wrong Way. A second possibility is to have it sit right next to the code in its own directory. Yet another possibility is to put the issue tracker in its own repository. Git could even include this repository as a submodule (this is a lot like the Wrong Way, though).

First of all, everyone gets their own copy of the issue tracker database and its history. Second of all, it has history. It's tracked the same way the code is. And, in the second case usage, one of the coolest advantages is that issues follow the code very closely.

When a branch is created, it takes its own copy of the issue tracking database with it. If a bug is fixed in the main branch, the issue tracking database in the main branch is updated. The bug will remain in the side branch and the issue will still be open in the side branch reflecting this. If a merge occurs later, the issue tracking database also gets merged automatically. I think that's damn cool.

There are some issues that still need to be hammered out. How does a non-developer enter a ticker? They would need to work the version control system to do this, then be able to share that change. That's a pretty large barrier.

Perhaps a web interface could be set up for setting up issues. Developers could then cherry-pick/pull the issues from that repository and push ticket updates back out.

Then there is overhead incurred by moving tickets around with code. How bad is this overhead? How can this be dealt with in the most transparent way? This all needs to be tested still.

Could the issue database get too big? People like to attach screenshots to issues. Having many screenshots would make the repository very big. How do we deal with this?

It's an exciting, new realm to explore.


Git is Better

I finally finished dumping the rest of my lingering Subversion repositories. I have converted them all to Git repositories. If you manage a Subversion (or CVS, or Perforce, etc) repository, you should consider doing the same. Git became my version control system (VCS) of choice in June and I haven't looked back since.

Why? Because Git is better.

Yes, it really is. Much better.

Git is faster, smaller, more secure, and more powerful. This is a virtue of decentralized version control systems. Subversion is Blub.

It all starts with to Source Code Control System (SCCS) and Revision Control System (RCS). These systems could only track single files and created headaches for projects with multiple files being worked on by multiple people.

Then came Concurrent Versions System (CVS), which improved things slightly, but still sucked. It still really only tracks individual files.

Now, CVS did anonymous reads, allowing anyone to access the repository and see code history. OpenBSD was the first code base to take advantage of this. These days, coming across projects that don't give public read access to their repository seems backwards.

Not using any of these systems would probably be better than using them. Their flaws are obvious as soon as you start using them.

Finally in the year 2000, Subversion arrives. It's a huge step up from CVS, fixing many of its problems. It has a much better interface and uses atomic commits -- finally tracking more than one file at a time. We still need to talk to some server every time we want to do something. Branching and merging also sucks so much no one wants to use it. But branching is overrated, right? Wrong. I use branches all the time, now that they are easy.

The reason branching sucks in Subversion can be explained with a famous quote by Albert Einstein,

Make everything as simple as possible, but not simpler.

Instead of implementing tagging, branching, and merging, the Subversion guys just implemented "cheap copy". It's a pretty clever idea, but in practice it doesn't work out well. It's too simple.

CVS solves the wrong problem, and Subversion solves the right problem wrongly.

Since Subversion, a number of decentralized VCSs have arisen. We have GNU arch (2001), monotone (2002), darcs (2003), Bazaar (2005), Git (2005), Mercurial (2005), and fossil (2007). I played around with all of these when looking for a distributed VCS (except fossil) and none struck me the same way that Git did. I would recommend most of them over Subversion.

Distributed VCS has gotten a lot of attention in the past couple years, which had much to do with the Linux kernel switching to one (Git). In fact, Git and Mercurial were written precisely for this event. Since then, some major projects have been switching to Git, or at least to some sort of distributed VCS: Perl, Ruby on Rails, Android, WINE, Fedora, X.org, and VLC to name a few.

You can also see the chatter on the Internet about Git. It's is really popular with fresh, innovative projects, like the Arc programming language. It's pretty easy to accidentally run into various Git tutorials on the web. It has a real presence.

No Authority. But why distributed VCS? Why are they better?

First of all, when you "checkout" a distributed VCS, it's really a "clone" operation, which is what most of them call it. You get everything. After that, the only reason you need to talk upstream, which really isn't "up" anymore, is if either end has updates to the code they wish to share. The only way one clone might more important than another is human politics. Technically they are equals.

Small. But won't this be huge? A Subversion repository can easily be several gigabytes. That would be a lot to transfer on the initial clone.

Actually, distributed VCSs are extremely efficient. A Git clone will usually be smaller than a Subversion checkout. For example, I once cloned Freeciv's Subversion repository using Git (converting it to Git). It was about 15000 revisions. The bare version of the Git repository, containing all ~15000 commits, was half the size of the Subversion repository, which contained only a singly commit! The non-bare version was still smaller by a few megabytes. I can't even imagine how much space the server was using.

I would have some numbers on this example, but, alas, that clone was lost on a failed hard drive and it took me a week to make. Note, Git clones of Git repositories aren't that slow: Subversion isn't optimized for cloning, and the Freeciv Subversion server is extremely overloaded.

Update: I managed to get another clone, and it only took me a couple hours. The Freeciv Subversion checkout at revision 15574 is 281MB. Remember, this contains just one single revision. The Git clone after a repack and garbage collection, which contains all 15574 revisions, is 225MB. It's 56MB smaller! If I told it to leave out the Subversion metadata it would be even smaller than that. On the server side, the Subversion repository likely takes up gigabytes. And finally, to add insult to injury, the Git "bare" clone is 144MB.

Someone does have an example over here: Git's Major Features Over Subversion.

The Mozilla project's CVS repository is about 3 GB; it's about 12 GB in Subversion's fsfs format. In Git it's around 300 MB.

Git's packing format is fairly simple, yet so effective.

Fast. Well, duh. With everything being local, operations that work on multiple revisions will be fast. Beyond this, decentralized VCS is generally faster on all operations, except the initial clone.

Reduced politics. With a central repository, someone or some group has to decide who has write access and who doesn't. Developers without write access are basically stuck without version control, unless they hack in their own. In the decentralized model, everyone has write access to their own personal repository, and others can choose, on their own, to pull revisions from it.

Secure. A centralized VCS has a central, single point of failure. If that single point is compromised, the server needs to be restored from backups. Or worse, the compromise goes unnoticed and the repository history is modified without anyone ever being able to tell.

In a distributed model, each revision (and in Git's case, the files themselves) is referenced by a hash (SHA-1 in Git's case) of it's contents, a content-addressable storage system. Thanks to this, a file, no matter where it is in the tree or in history, is stored only once. The main purpose is to avoid collisions between revision identifiers on parallel lines of development. It also happens to make the repository tamper-proof.

If you know the revision ID of your HEAD no one will be able to change any of its history. This is because each revision contains the ID of its immediate ancestor, all the way back to the initial commit. If a previous commit changed, it would change the ID of every following commit. An attacker would have to find a desired collision for each one: simply impossible.

The hash addressing also provides integrity, as corruption in the repository is easily detected.

Another security gain, related to the reduced politics note, is the web of trust. This is the same way PGP handles key authentication. In a large project, a single developer may only trust a handful of people to be competent programmers, and therefore only pull from these developer's repositories. Those developers they pull from also have their own set of people they trust. In this way, revisions can safely be pulled from distant strange repositories through the web of trust.

The only reason to interact with a Subversion repository is for legacy reasons. Luckily, you don't have to use Subversion to use Subversion.

That wasn't a typo. Git has a Subversion/Git interfaced called git-svn. I used it to convert my Subversion repositories to Git, but it can be used as a fully functional Subversion client. It can clone the Subversion repository and continue to pull changes from it as it updates.

On your end, you can make commits to your local repository, use cheap branches, and so on,all of which stay local. Changes can be pushed back upstream to Subversion with the dcommit command, which would be done after rebasing any changes on top of the current Subversion HEAD. This provides most of the advantages of Git without worrying about having the central repository change.

One of the major complaints about Git is that it once lacked a plethora of GUIs, like CVS and Subversion have. Git does have GUIs. I looked at a couple of them out of curiosity, so I am not sure how good they are by comparison. I also have barely use any other VCS GUIs. The ones I have used I find incredibly annoying.

I don't understand why people insist on using them anyway. It's like using training wheels on a bike and claiming that it's better that way. No, those training wheels just get in the way.

To be brutally honest, if you don't want to use Git because you are afraid of the command line, what are you doing coding in the first place?

One topic left is issue tracking. In a centralized VCS, you have a centralized tracker. Subversion has Trac, for example. Well, what about distributed VCSs? They should have distributed issue tracking right?

I will go into this in my next post.


Creating Simple Dice with GIMP

Final image.

In my previous article I drew those red dice myself, using GIMP. Since I really enjoyed figuring out how to do it, and actually doing it, here is a little tutorial.

The numbers and sizes are arbitrary, so feel free to adjust things if you think they look better. I am no artist. I am sure someone could take this further to make it look better, perhaps by making the pips look indented, or adding some transparency effect so the dice look clear. I am a GIMP newbie.

Step 1: Make a Face

In GIMP, create a new 300x300 image and fill it with a dark red. I used c30808 for this. This will be the base color of the dice, so if you want differently colored dice, choose whatever color you like.

Plain 300x300 image.

Step 2: Make Pips

Next we use the ellipse selection tool (e) to make pips. In the settings, set the ellipses to a fixed size of 75x75.

Ellipse size fix.

Create the ellipse and move it to the upper left-hand corner. Use the arrow keys to nudge it to position (5, 5) -- or just type in these values.

Pip selection.

Use bucket fill (shift+b) to fill the selected area with white, or whatever color you want your pips to be. Keep doing this to make pips in each corner. The positions should be (5, 220), (220, 5), and (220, 220). This makes the 4 face. Put a fifth pip in the middle, (112, 112), to turn it into a 5 face.

5 pips

Step 3: Make More Faces

You now have one face of your die. The 1, 2, 3, and 4 faces are the same as the 5 face, but fewer pips. In the layers dialog name the current layer "5". Now, duplicate the layer (shift+ctrl+d) and name this new layer 4. Use either the paintbrush tool (p) to paint your base color over the middle pip, or use the selection tools to remove it.

Keep duplicating layers and removing pips until you have 5 faces: 1, 2, 3, 4, and 5.

Layers dialog showing 5 faces.

Duplicate the "4" layer and create two more pips to make the 6 face. You now have 6 layers, each containing a single face. Here is my .xcf when I was done: dice-faces.xcf.

Step 4: Map it to a Cube

Now comes the fun part, the real guts of the drawing. You are going to map these layers onto a cube. Go to Filters -> Map -> Map Object. Map to "Box" and select Transparent background and Create new image.

The mapping interface.

Under the Orientation tab adjust the rotation. For the first die, try something like (20, 40, -5). If you enable Show preview wireframe you can see your adjustments live. Just don't make these values too high or it will make the next step more difficult.

Under the Box tab set the Front, Top, and Left faces to different layers. Note that the opposite sides of a die always add up to 7. That is, 1 is opposite to 6, 2 is opposite 5, and 3 is opposite 4. Here is how a typical die looks.

Die faces.

If you are really picky, you might want to pay attention to the orientation of the 3's, 2's, and 6's and flip those layers accordingly.

Hit Preview! to see your work. If you are happy, click OK. Autocrop the new image with Image -> Autocrop Image.

A single die.

Step 5: Make More Dice

Do this a few more times with different faces at different orientations. I will make just one more for the example.

Create a new 640x480 image with transparent background. Copy and paste your dice into this image. After each paste, make a new layer (shift+ctrl+n), so each die gets its own layer. Use the Move tool (m) to adjust the dice into a sort-of mid-roll. Whatever looks good.

Dice are positioned.

Step 5: Make Shadows

The last part left is the shadow. First, merge the visible layers (shift+m), then duplicate the remaining layer. Call this new layer "Shadow".

Go to Colors -> Brightness-Contrast. Set contrast to -127. This will be the shadow. If you want a darker or lighter shadow, open the same dialog again and adjust the brightness. Next scale the shadow layer vertically by 50%. You want the width to remain the same.

Select the Sheer tool (shift+s) and sheer the layer in the X direction -100 pixels. Move the shadow layer to the bottom. Now use the Move tool (m) to move the shadow into an appropriate position.

You can add a penumbra by applying a Gaussian blur to the shadow layer: Filters -> Blur -> Gaussian Blur. I blurred mine by 5 pixels.

Finally, you might want to autocrop the layers, then fit the image canvas to the layers, which will get rid of the excess border.

Final step.


Diceware Passphrases

Casino dice I GIMPed Diceware is a method of easy-to-remember, easy-to-type, secure passphrase and password generation. It works completely off-line and requires no computer whatsoever, apart from retrieving the Diceware list. By taking the passphrase generation off-line there is less room for mistakes to be made.

The reason these password are easy to remember is that they are simply a series of words in your native language. This also tends makes them easier to type without lots of practice as you should already be used to typing words.

Because the official Diceware website is frequently down or unusable, I have mirrored the original page here,

http://nullprogram.com/download/diceware/diceware.html

You can grab the word lists directly,

Diceware Word List
Beale's Diceware Word List

The lists are cryptographically signed by Arnold G. Reinhold so you can verify that I have not tampered with it. I must also note that, unfortunately, the author requires that this list only be distributed non-commercially, which limits its usefulness but allows me to distribute it here.

I also came across another list called DialDice, which I have mirrored here and signed with my own key,

DialDice Word List

Diceware works by rolling five 6-sided dice (or rolling one 6-sided die five times, etc.) and using the result to look up the word in one of the above lists of 6^5, or 7776, words. Each word is worth about 12.9 bits of entropy,

log2(7776) =~ 12.9248

So if you want a password worth about 40 bits -- which is about 7 letters of a random alphanumeric password -- you would generate three Diceware words. They can be concatenated in any order and in any fashion. When I use Diceware, I just mash them together, like "lancealertgrow". Note that the space bar makes a distinctive sound when pressed, so if you put spaces between your words a listener will be able to tell how many words you use.

If you don't like what you rolled the first time, DO NOT generate a new one as an attempt to get someting "better". If you do this, you will greatly weaken your passwords because you are selecting passwords from a much smaller pool of possibilities (a very small pool that contains only passwords you like).

The number of possible three-word passwords is 470,184,984,576. That's right: 470 billion. Because you are selecting passwords with dice, each password is as equally likely as the next. Even if an attacker knew you used Diceware and knew what word list you used, that still leaves a handful of guesses out of that 470 billion possibilities.

At first it may be confusing, but it actually doesn't matter what the words are or how long they are. It doesn't matter that there are no capitals or special characters. It is the simple fact that there are 7776 words, and one was selected three times.

7776 * 7776 * 7776 = 470184984576

The Diceware website goes into a bit more detail on this.

If the computer system you use annoyingly requires passwords to contain special characters (which is done to increase entropy in passwords that actually are poor), Diceware also provides a method of adding some of those, which adds a couple more bits worth of entropy. If you don't care about those extra bits, you can throw your own in.

For passphrases, Diceware recommends 6 words, about 77.5 bits, which it claims should be out of range of brute-force attacks from anyone for at least the next 20 years. If you really think you need more than 6 words, you should consider hiring guards for all your computer equipment.

I was working on my own Diceware word list to release into the public domain. The purpose was to provide a word list without any distribution restrictions, unlike all the Diceware lists I have found. But making word lists is hard! I wrote a number a little filters -- word length, no sub-words, spell checking, no special characters, etc -- to pull out good words from a large word list, then used some sample English text from Wikipedia to get some frequency information so that I am was selecting common words over less common words. I still need to go over the final list by hand to make sure it all looks good. This is a long tedious process. Carefully examining 7776 words is quite a lot of work. Someday I will finish it.

In the event that you don't have any dice available, or you want to be able to generate Diceware passwords on the fly automatically, I have written a little program that will use /dev/random, assuming there is a true RNG behind it, to roll virtual dice. It can use a Diceware word list or your local dictionary word list which contains many more words (usually found at /usr/share/dict/words). Grab it here,

passgen.pl

To get help, just run it with --help. So, to generate a two word password using the local dictionary,

$ ./passgen.pl -w2
Bits per word: 14.4421011596755
Key length: 28.8842023193511
grisly cog

It also tells you how many bits the password is worth.

I highly recommend Diceware for your password and passphrase generation.


Shamus Young's Twenty Sided Tale

Dungeons and Dragons When I played a Dungeons and Dragons campaign back in college, I thought it would be interesting to take notes and chronicled the campaign as a short story. I have shared a piece of it already, but I can't remember too many more details. It would have been a lot of work to do it all.

Well, a couple of years ago, Shamus Young put in the huge effort and did document a campaign that he ran. It is called The Twenty Sided Tale. If you enjoy tabletop games, or are interested in them, I strongly recommend reading through it. Hell, even if you want to read a great story, I recommend it.

Sprinkled in between the story's paragraphs are DM notes where Shamus gives some in-game context to the events. It provides some information about what the players know and reasoning about their decisions. The most interesting are his insights as a DM. It is really quite inspiring. Shamus knows his games.

The story is really amazing as well. He actually had two unrelated stories going that the players managed to merge together perfectly all by accident. The end was so fantastic that it gave me chills the first time I read it.

If you're still here, go click that link and start reading.


AES Random Number Generator

Doubling Die I came across the Ultra High Security Password Generator the other day, which uses a very high quality pseudo-random number generator to generate passwords and keys. The idea is not to use the full 63 characters as a password, but rather a contiguous subset, such as the first 8 characters.

The website is served securely, so no middle man can sniff at it. If you trust the maintainer not to store the generated numbers somewhere, which he claims not to be doing, then you can use it as a nice password generation service. If you are nervous about matching password possibilities to your IP, then grab a batch through the Tor network.

In terms of brute-force attacks, each letter from hexadecimal characters is worth 4 bits, from random printable ASCII characters is worth about 6.6 bits, and alpha-numeric is worth about 6 bits.

If you go with the set of printable ASCII characters, 6 to 8 characters (39 to 52 bits) should be good enough for an account password, where an attacker must guess through an attempt-limiting authentication system. 8 to 12 characters (53 to 79 bits) should be plenty for a passphrase used as an encryption key, where an attacker can brute force passphrase attempts at his leisure.

I wouldn't use this website for any serious encryption, though. If he is logging generated passwords, he will have a nice short list of possible passphrases to try against your encryption. So don't use it for your GPG passphrase. Generate those locally.

This is where the purpose of my post comes in! He describes the pseudo-random number generator he uses to generate the random numbers at the top of the page. It's the AES block cipher in cipher-block chaining (CBC) mode encrypting a 128-bit counter. A picture (well, diagram) is worth a thousand words,

AES RNG diagram

Note that the diagram is actually being explicit about CBC mode, so the AES cipher in the diagram is in electronic codebook (ECB) mode. I missed this myself when initially interpreting the diagram and writing my implementation. Here is the same diagram with the AES cipher already in CBC mode,

AES (CBC) RNG diagram

I don't know if he designed this himself or not. I implemented it in C to study it a bit. You can grab it here with git (or follow the link and get a snapshot),

git clone http://git.nullprogram.com/aesrng.git

To build it, you will need libgcrypt installed (with headers).

Here is about 100 megabytes analyzed with ent, a pseudo-random number sequence test program.

Entropy = 7.999998 bits per byte.

Optimum compression would reduce the size
of this 100000000 byte file by 0 percent.

Chi square distribution for 100000000 samples is 275.58, and randomly
would exceed this value 25.00 percent of the times.

Arithmetic mean value of data bytes is 127.5095 (127.5 = random).
Monte Carlo value for Pi is 3.141182766 (error 0.01 percent).
Serial correlation coefficient is 0.000005 (totally uncorrelated = 0.0).

Wow! These results are great! This is exactly what you would see if you ran 100 megabytes of /dev/random, a true random number generator, through ent. It is also pretty fast, generating that 100 megabytes on my laptop in about 7 seconds. That's much faster than Linux's /dev/urandom (over a minute here), whose ent results aren't quite as good, either.

Note: Before you go using this somewhere important you should make sure this PRNG has been peer reviewed and carefully studied by professionals with cryptanalysis. It may have fundamentals flaws. I only found this on a website somewhere!

Still, that's a pretty damn cool pseudo-random number generator.

The generator is only useful if you want to generate more than 512 bits worth of numbers, because it takes 512 bits of randomly generated numbers to initialize the generator. If you want to generate a single password if the same strength, give this a shot,

head -c 50 /dev/random | tr -cd "A-Za-z0-9@#\!\$%^&*()_+=-~;,.<>/[]{}|?:'\\\`" && echo

It uses a true random number generator and selects from 94 printable ASCII characters (space not included).


Fantasy Name Generator : Request for Patterns

Try it out right now if you don't feel like reading: my fantasy name generator. I need your help designing some name patterns.

Medieval writing desk Whether choosing a name for my character in a fantasy game or populating a world which I pretend to myself that I will one day DM, I have always gone to the RinkWorks Fantasy Name Generator. The author of this tool, Samuel Stoddard, gives some history on how he came to design and develop it.

It works by using pattern to select sets of letters to put together into a name. There is a thorough, long description on the website. Unfortunately, he didn't share his source code and so we can see how he did it.

Therefore, I used his description to duplicate his generator.

You can grab a copy here with git,

git clone http://git.nullprogram.com/fantasyname.git

It includes a command line interface as well as a web interface, which I am running and linked to at the beginning of this post for you to use. The code is available under the same license as Perl itself.

I used Perl and the Parse::RecDescent parser generator. Thanks to this module, it essentially comes down to about 40 lines of code. The name pattern is executed, just like a computer program, to generate a name. Here is the BNF grammar I came up with,

LITERAL ::= /[^|()<>]+/

TEMPLATE ::= /[-svVcBCimMDd']/ 

literal_set ::= LITERAL | group

template_set ::= TEMPLATE | group

literal_exp ::= literal_set literal_exp | literal_set

template_exp ::= template_set template_exp | template_set

literal_list ::= literal_exp "|" literal_list | literal_exp "|" | literal_exp

template_list ::= template_exp "|" template_list | template_exp "|" | template_exp

group ::= "<" template_list ">" | "(" literal_list ")"

name ::= template_list | group

The program is just that, decorated with some bits of Perl. Since I came up with it, I have found that it is slightly different than Mr. Stoddard's generator, in that his allows empty sets anywhere. Mine only allows them at the end of lists. For example, this is valid for his generator,

<|B|C>(ikk)

But to work in mine, the empty item must be moved to the end,

<B|C|>(ikk)

This can be adjusted my making the proper changes to the grammar, which I haven't figured out yet.

Another problem with mine is that Parse::RecDescent is slooooowwwww. Ridiculously slow. Maybe I designed the grammar poorly? This is probably the biggest problem. Even simple patterns can take several seconds to generate names, specifically with deeply nested patterns. For example, this can take minutes,

<<<<<<<s>>>>>>>

Before you go thinking you are going to tank my server, I have written the web interface so that it limits the running time of the generator. If you want to do something fancy, use your own hardware. ;-)

There is also a problem that it will silently drop invalid pattern characters at the end of the pattern. This has to do with me not quite understanding how to apply Parse::RecDescent yet.

And this is where I need your help. I have had some trouble coming up with good patterns. I don't even have a good default, generic fantasy name pattern. Here are some of mine,

<s|B|Bv|v><V|s|'|V><s|V|C> # default
<i|Cd>D<d|i> # idiot
<V|B><V|vs|Vs> # short

None of which I am very satisfied.

You can design patterns for Nordic names, Gallic names, Tolkienesque Middle Earth names, orc names, idiot names, dragon names, dwarf names, elf names, Wheel of Time names, and so on. There is so much potential available with this tool.

You can see what patterns people have been trying in the log file. To suggest one to me, e-mail me some patterns, or even better, clone my git repository and add one to it yourself (then ask me to pull from you). This way your credit will stay directly attached to it with a commit.

Good luck!


Don't stop here! This isn't everything. Check out the archives (on the left) for more posts. Or just have a look at the index.