null program

Java is Death By A Thousand Paper Cuts

At least it is for me. This past week at work I've been furiously rushing work on a project written in Java. It's completely my fault that I'm using Java since I'm the one who picked the language. I wanted to use Java the platform, but I don't know any other JVM languages well enough to use them in place of Java the language for a project at work.

It's all sorts of little things that make writing Java so exhausting for me. At the end of the day it I just feel irritated. I hate the absolute lack of functional programming and that I have to specify everything at such a low level. The whole reason we program in high-level languages is so we can express algorithms more concisely, but Java fails at this.

Here's an example of what I'm talking about, something I basically did a few times today. Let's say you have an array of floats, nums and you want to sum them and return the result (or maybe use it in another expression). In Lisp it's very straightforward.

(reduce '+ nums)

"Reduce the sequence nums by addition." Notice that it's more about saying what I want to do rather than how to do it. I don't have to introduce any temporary storage or iterators. To do the same thing in Java it will look something like this.

int sum = 0;
for (double num : nums) {
    sum += num;
}
return sum;

If you're using an older Java without the enhanced looping construct it gets even uglier. I had to introduce a variable for accumulation and a second variable for iteration. This sort of thing has to be done all over the place in Java, and it greatly increases the cognitive overload when reading Java code.

This instruction is more about telling the computer the how rather than my overall intention. One problem with telling it the how is that I've unnecessarily locked in a specific algorithm and ordering. The literal instruction says that the numbers must be added in order, sequentially. My Lisp instruction doesn't do that.

It gets even worse when you complicate it slightly by adding a second array and, say, multiplying it pairwise with the first.

for (int i = 0; i < nums1.length; i++) {
    sum += nums1[i] * nums2[i];
}
return sum;

Now the loop gets more complex. I have to tell it how to increment the iterator. I have to tell it to check the bounds of the array. The iterator is a misdirection because the actual number stored in it isn't what's important. Again, the Lisp method is much more concise.

(reduce '+ (map 'list '* nums1 nums2))

"Map the two sequences by multiplication into a list, then reduce it by addition." Unfortunately we start to leak a little bit into the how here. I am telling it that the intermediate structure should be a list, because map forces me to pick a representation. Besides that, I am only describing my overall intention and not the obvious details.

So with Java my days become filled with the tedious low-level algorithm descriptions that I have to hammer out over and over and over. Death by a thousand paper cuts.

Lisp isn't the only language that has a (generally) much better approach; it's just my favorite. :-) Most languages with at least some decent functional facilities will also do the above concisely.


Distributed Computing with Emacs

I got an Elisp idea today and even went as far as implementing a proof of concept for it: distributed computing with Emacs Lisp. As usual for me the idea takes advantage of Lisp features to make the task pretty simple, very specifically Elisp's implementation. In this case it's the Lisp reader, printer, and the fact that Elisp functions have a printed representation, both byte-compiled and not.

Here's the proof of concept code: dist-emacs.el

A central server listens for TCP connections. Clients offering their CPU for use connect to the server and await instructions. The server sends a single, no-argument, anonymous function to the client. The client calls the function, returning the resulting form back to the server. In order to transmit the function it's encoded into a string using the Lisp printer, and the client turns it back into an executable function with the Lisp reader.

For some simple security there is a shared password between the client and server. When the server sends a function it includes a signature, and the client only runs code that matches the signature. To create a signature the string encoded version of the function is appended with the password (both strings) and hashed with a secure hashing algorithm. Only someone who knows the password -- including other clients -- can create the signature.

(defun sign-sexp (password sexp)
  "Return signature of the given s-exp."
  (sha1 (format "%s%s" password sexp))

To make it easy for the client to read in both the signature and the function we just cons them together before encoding them as text.

(defun encode (password sexp)
  "Encode a s-exp for transmission to client."
  (prin1-to-string (cons (sign-sexp password sexp) sexp)))

The client calls the Lisp reader on the string, then checks the signature in the car cell against the s-expression in the cdr cell. This will return the function if it's legitimate, otherwise nil.

(defun decode (password str)
  "Decode string into s-exp, checking the signature in the process."
  (let* ((cons (read str))
         (sig  (car cons))
         (sexp (cdr cons)))
    (if (equal sig (sign-sexp password sexp))
        sexp
      nil)))

And that's the core of it. It just needs some network code to move the string between computers. That part can be found in the linked source above.

To demo this, I'll use the whiten function from my previous post. I'll run it with three different strings on three different computers. Assume we started the dist-emacs server (dist-start) and connected three clients (dist-connect) from three computers to it. The clients were fired up from scratch so there's no whiten function on them yet, but there is one defined on the server. First we'll send the function definition to the clients. The dist-dist function takes a list of functions and passes each one to a client. Ideally I'd want this function to be more intelligent, managing a work queue so that an arbitrary length list of functions will be fed one at a time to each client. That's not the case here.

(dist-dist (mapcar (lambda (p)
                          `(lambda ()
                             (fset 'whiten ,(symbol-function 'whiten))))
                      dist-clients))

Also like in the previous post, this is an abstraction leak with the Emacs implementation. But I like this trick so I'm going to use it anyway. :-) Next we call it on each client with a different string.

(dist-dist (list (lambda () (whiten "good"))
                 (lambda () (whiten "news"))
                 (lambda () (whiten "everyone"))))

The way I have it set up for my proof of concept the results are just spit back into the server's *Messages* buffer. If we watch that buffer we can see each results come back in one at a time as each machine finishes. I can watch Emacs saturate the CPU on every client machine simultaneously as it works.

"2577343027adf7817185db876032d8ed"
"46a65dac2c0040afde175adf1e9a81fd"
"f39baf9e74475dd5be7d5495a025fe84"

This isn't the same order as the clients, but the order in which the jobs were completed.

As for the practicality, I doubt there really is one. It's really only a neat concept (or maybe not even that). For almost the exact same reasons as my distributed JavaScript idea, this is a solution looking for a problem. The problem needs to be able to be broken into small computation units, because Emacs has no threading, and it has to be low bandwidth, because it has to be parsed all at once from a string. If you want to pass large data sets it needs to be done out-of-band, which probably defeats the purpose. There seem to be few to no problems that fit these limitations.


Elisp Memoize

Memoization is something I think should be packaged as a standard function for just about every language. That's not generally the case, but luckily this is easy to fix in Lisps. I needed memoization recently for an Elisp project I'm working on. I could have hand-written one but a generic memoization function would have worked just fine. Since I didn't find any generic Elisp memoization on-line I wrote my own.

Download: memoize.el

Just put it in your path and (require 'memoize) it. Here's the core function.

;; ID: 83bae208-da65-3e26-2ecb-4941fb310848
(defun memoize-wrap (func)
  "Return the memoized version of the given function."
  (let ((table-sym (gensym))
        (val-sym (gensym))
        (args-sym (gensym)))
    (set table-sym (make-hash-table :test 'equal))
    `(lambda (&rest ,args-sym)
       ,(concat (documentation func) "\n(memoized function)")
       (let ((,val-sym (gethash ,args-sym ,table-sym)))
          (if ,val-sym
              ,val-sym
            (puthash ,args-sym (apply ,func ,args-sym) ,table-sym))))))

Normally the hash table would be stored inside of a closure, but Elisp doesn't have closures. Instead, the hash table is stored in an uninterned symbol, which will only (generally) be accessible to the memoization wrapper (in order for external code to access the table it would need to directly inspect the memoized function object). It's not quite a scope but it's close enough.

Note that it keeps the original function documentation intact. I want the memoization wrapper to be an unobtrusive as possible.

This is a bit of an abstraction leak. It's creating new code at run time taking advantage of the specific low-level way Emacs executes s-expressions. If Emacs ever stopped directly evaluating s-expressions then this code would break. A more proper way to do this would be at compile time with a macro, where this exact sort of code generation is expected. Rather than defun we would use a macro called something like defmemoized.

Here's a demo of it in action. This whiten function is computationally expensive: it performs key whitening. It repeats a hash function thousands of times to produce an expensive value. This isn't something you generally want to memoize, but stick with me.

(defun whiten (key)
  "Perform key whitening with the md5 hash function."
  (dotimes (i 100000 key)
    (setq key (md5 key))))

(whiten "password")   ; takes a couple of seconds

On my laptop that takes a couple of seconds to run. Increase that counter if it's quick on your computer. My memoize package provides a memoize function which will create a new function that wraps the original, then installs the new function in place of the old one if we give it the function symbol.

(memoize 'whiten)

The first time you run it after memoization it will be slow, but after that the memoization kicks in for a quick return.

There are two Elisp specific issues at hand. First is that memoizing an interactive function will produce a non-interactive function. It would be easy to fix this problem when it comes to non-byte-compiled functions, but recovering the interactive definition from a byte-compiled function is more complex than I care to deal with. Besides, interactive functions are always used for their side effects so there's no reason to memoize them.

Second is a limitation of Elisp hash tables. There's no way to distinguish a nil value and no value. The hash table returns nil for both. This means you cannot memoize nil returns. But a computationally expensive function shouldn't be returning nil anyway.

Hopefully someone else will find good use for this as well.


GIMP Painting

I drew the magic space elevator a few days back after some practice. Here's my very first attempt at this art style,


Full Size Video

And the image itself,


Pen and Paper RPG Wishlist

As I get more involved with tabletop RPGs, specifically Dungeons and Dragons, I find there are some related attributes that I wish these game systems had. While I'm sure there are systems do have some of these, I wish whatever I happen to be using had all of them.

Print friendly. The source material tends to be very colorful and graphical. While this can be a good thing, especially when illustrating monsters (Show, not tell!), it's bad if you want to print out your own materials. I want the crucial information available in a crisp, clean monochrome form of some sort. Not only could I reproduce material for use in notes and handouts, but I could create my own condensed sets of information by composing these crisp forms.

For example, in the D&D monster manual each monster has a nice concise block containing all the information -- defenses, health, abilities, etc -- needed to use that monster. This is great, but it's on a brownish background, in a red-ish box. So close to being what I want. But even then, do I have legal permission to reproduce this information? And so ...

Licensing. The closest thing tabletop gaming has to a Free Software license would be the Open Game License (OGL), which is still pretty restrictive. I would love for the source materials to be licensed at least loosely enough that I could print out my own copies for cheap (assuming they are print friendly, per above). Have some new players sitting down at the table? To get them started, give them that stapled-together player handbook you printed out. There's RPG evangelism for you.

The Fudge role-playing game system has both these attributes down pretty well. The Fudge manual is very print friendly PDF with explicit permission to share it with your friends. However, just as yacc is a compiler compiler, Fudge is really a game system system, a system for creating game systems, so it's only part of what is needed to play a game.

Useful software tools. One specific example is character creation software. Creating a new character can be burdensome, especially for a new player. Software that allows a player to select some basic options from a menu and produce a printable, error-free character sheet can save a lot of time.

Fourth edition Dungeons and Dragons has a character builder, but it is a humongous piece of junk. It's proprietary, Windows-only, bulky, and slow. For a program that merely generates printouts based on a few user selections from some simple menus, it has some extremely excessive system requirements (much higher than the ones they claim). And it requires a reboot to install too. A human can produce the same results by hand inside of a half hour, so for a computer there is virtually no computation involved. So what is it doing? Worse of all, the fourth edition license expressly forbids competing character creation software, so no one can legally produce a reasonable one. All this thing should be is a database of available character abilities, some character sheet logic, and a postscript printer.

Fortunately there are some decent, generic world generation tools for GMs out there, such as random inn generators, random dungeon generators, and so on. And another one. I've mentioned this before.

If you know any systems that fit the above descriptions well, go ahead and link them in the comments!


GIMP Space Elevator Drawing

I've been looking for a nearby tabletop gaming group for awhile now. I asked people I knew. I asked around at work. I just couldn't find anyone. Luckily, a new thing that Wizards of the Coast has been doing is D&D Encounters where prepared adventures are run every week openly at local gaming stores. The purpose is for casual players or beginners to be able to freely to play some D&D without needing any commitment, preparation, or equipment. Characters are pre-generated, so no spending a half hour creating some new player characters every week.

I hopped into it for season two, which just started 6 weeks ago. Each weekly session is a 1.5 to 2 hour combat encounter. I've been having fun, but honestly it's not all that exciting compared to what a real campaign can bring. There is no role-playing, practically no NPC interaction, no puzzles, and no exploration. It also doesn't help that the adventures and characters are riddled with mistakes and very unbalanced. For an example of unbalanced, the character I've been playing, Barcan (or Barqan depending on where you are in the character sheet), could be killed -- and I'm not talking about unconscious dying but negative bloodied value dead -- by a monster critical strike in just about every encounter so far. Every time a monster attacked me there was a 1 in 20 chance, even at full health, that I might be done playing for the week.

But the great part is that it got me connected to other players in my area, which I think is the most valuable part of Encounters. One of my fellow players was just starting a regular gaming group and invited me to come along, so we've been playing on weekends now, with the intention of taking turns as the DM among those who are interested. And for a little irony, everyone except one person in the group also works at the lab. I guess I didn't ask around enough!

So I'm going to be DMing a 4e Dungeons and Dragons campaign sometime in the near future, and I'm quite excited about it!

Anyway, what does that have to do with drawing a space elevator in the GIMP? First of all, if you're one of the players in my group who found their way here stop reading now. You'll find out all this in the first session, so come back after that. So, I've had this campaign idea in my head for a couple of years now, and it involves a skyhook of sorts constructed by a combination of careful engineering and powerful arcane magic. It leads somewhere, not space, but somewhere. Since that somewhere is part of the mystery I won't reveal where that is, but if the campaign goes well I'll write about it more in the future.

When I run the first session I want to illustrate the skyhook to the players. Show, not tell they say. I like some of the work people do imitating Bob Ross in the GIMP. I've done a few of these to try it out, and it's surprising how well it can turn out even from a beginner. I employed this new art education to draw my skyhook for my game. The GIMP undo history reveals how I did it,


Full Size Video

And the result,

In retrospect I should have drawn the skyhook right after I finished those first clouds, since it's behind everything else in the scene. Oh, and that thing on the bottom left is a twisted scar left behind from a previous attempt at building the skyhook, but it collapsed. It's a dangerous place to be.


Emacs Byte Compilation

A feature unique to some Lisps is the ability to compile functions individually at any time. This could be to a bytecode or native code, depending on the dialect and implementation. In a Lisp implementations where compilation matters (such as CLISP), there are typically two forms in which code can be evaluated: a slower, unoptimized uncompiled form and a fast, efficient compiled form. The uncompiled form would have some sort of advantage, even if it's merely not having to spend time on compilation.

In Emacs Lisp, the uncompiled form of a function is just a lambda s-expression. The only thing that gives it a name is the symbol it's stored in. The compiled form is a (special) vector, with the actual byte codes stored in a string as the second element. Constants, the docstring, and other things are stored in this function vector as well. The Elisp function to compile functions is byte-compile. It can be given a lambda function or a symbol. In the case of a symbol, the compiled function is installed over top of the s-expression form.

(byte-compile (lambda (x) (* 2 x)))
  => #[(x) "^H\301_\207" [x 2] 2]

The compiler will not only convert the function to bytecode and expand macros, but also perform optimizations such as removing dead code, evaluating safe constant forms, and inline functions. This provides a nice performance boost (testing using my measure-time macro),

(defun fib (n)
  "Fibonacci sequence."
  (if (<= n 2) 1
    (+ (fib (- n 1)) (fib (- n 2)))))

(measure-time
 (fib 30))
  => 1.0508708953857422

(byte-compile 'fib)

(measure-time
 (fib 30))
  => 0.4302399158477783

Most of the installed functions in a typical Emacs instance are already compiled, since they are loaded already compiled. But a number of them aren't compiled. So, I thought, why not spend a few seconds to do this?

In Common Lisp, there is a predicate for testing whether a function has been compiled or not: compiled-function-p. For whatever reason, there is no equivalent predefined in Elisp, so I wrote one,

(defun byte-compiled-p (func)
  "Return t if function is byte compiled."
  (cond
   ((symbolp   func) (byte-compiled-p (symbol-function func)))
   ((functionp func) (not (sequencep func)))
   (t nil)))

My idea was to iterate over every interned symbol and, if the function slot contains an uncompiled function, using the test above, I would call byte-compile on it. Well, it turns out that byte-compile is very flexible and will ignore symbols with no function and symbols with already compiled functions.

So next, how do we iterate over every interned symbol? There is a mapatoms function for this. Provide it a function and it calls it on every interned symbol. Well, that's simple and anticlimactic.

(mapatoms 'byte-compile)

That's it! It will take only a few seconds and spew a lot of warnings. I haven't found a way to disable those warnings, so this isn't something you'd want to have run automatically, unless you like having an extra window thrown in your face. I've only discovered this recently, so I'm not sure what sort of bad things this may do to your Emacs session. Not every function was written with compilation in mind. There are interactions with macros to consider.

I doubt there will be a noticeable performance difference. Like I said before, most everything is already compiled, and those are the functions that get used the most. There's just something nice about knowing all your functions are compiled and optimized.


Emacs ParEdit and IELM

ParEdit is a powerful extension to Emacs that I've just begun using recently. It's a minor mode that forces all parenthesis, square brackets, and quotes to be balanced at all times. While it's useful for any programming language it's especially suited for Lisps, because it's designed for manipulating nested parenthesis -- i.e. s-expressions. It's not currently part of Emacs so you have to drop the script in your load-path somewhere.

I've frequently thought that a Lisp-based shell would be an interesting and powerful tool, much like a normal Lisp REPL. Programs would be treated like Lisp functions. For example,

wellons@luna:~$ (ls -l .emacs)
-rw------- 1 wellons wellons 4859 2010-06-10 23:20 .emacs
wellons@luna:~$

But typing all those parenthesis all the time would be quite the nuisance. I know this from experience typing at Lisp REPLs. I imagined something that works exactly like ParEdit would be needed to make all that work go away. To save even more time each prompt would begin with a nested pair, with the cursor placed between them. Then typing a quick command is no different than a normal shell.

wellons@luna:~$ ()

Well, in Emacs we have both ParEdit and REPLs, so we can compose these features together with just a little advice. Here's how to do it with the Interactive Emacs-Lisp Mode (IELM) REPL. First tell IELM to use ParEdit,

(add-hook 'ielm-mode-hook (lambda () (paredit-mode 1)))

The function in IELM that spits out the next prompt is ielm-eval-input, so we give it the advice to call the ParEdit function afterwards to insert a parenthesis pair.

(defadvice ielm-eval-input (after ielm-paredit activate)
  "Begin each IELM prompt with a ParEdit parenthesis pair."
  (paredit-open-round))

And that's it! Note that the first IELM prompt is not placed by this function so it won't appear until the second prompt.

*** Welcome to IELM ***  Type (describe-mode) for help.
ELISP> 
ELISP> ()

If you want to enter a single atom and don't need parenthesis, just hit backspace once. This is much less common so it gets the extra keystroke.

This can be done for inferior-lisp and SLIME to enhance those REPLs as well. You just have to figure out which defun to advise.


Elisp Printed Hash Tables

A printed hash table representation is pretty new to Elisp, and a bit late. As far as I know Elisp didn't come with a way to print, and read back in, a hash table without rolling your own (like Jared Dilettante was doing with a Data::Dumper style output), until 23.1 in July 2009. This is when json.el was first included with Emacs, for dumping to and reading from JSON.

(require 'json)

(setq hash (make-hash-table))
(puthash "key1" "data1" hash)
(puthash "key2" "data2" hash)

(insert "\n;; " (json-encode hash))
;; {"key2":"data2", "key1":"data1"}

Just a month ago Emacs 23.2 came out, very silently including a new printed representation for hash tables with a #s hash notation.

#s(hash-table data ("key1" "data1" "key2" "data2"))

With this hash tables can be printed and read as part of normal s-expressions with the standard lisp reader and printer functions. It seems heavy, having to write out "hash-table" in there, but I think it's because the #s notation will be used to create printed forms of other lisp objects that currently do not have one.


Identifying Files

At work I spend about a third of my time doing data reduction, and it's one of my favorite things to do at the laboratory (I've done it on my own too). Data come in from various organizations and sponsors in all sorts of strange formats. We have a bunch of fancy analysis tools to work on the data, but they aren't any good if they can't read the format. So I'm tasked with writing tools to convert incoming data into a more useful format.

If the source file is a text-based file it's usually just a matter of writing a parser -- possibly including a grammar -- after carefully studying the textual structure. Binary files are trickier. Fortunately, there are a few tools that come in handy for identifying the format of a strange binary file.

The first is the standard utility found on any unix-like system: file . I have no idea if it has an official website because it's a term that's impossible to search for. It tries to identify a file based on the magic numbers and other tests, none based on the actual file name. I've never been to lucky to have file recognize a strange format at work. But silence speaks volumes: it means the data are not packed into something common, like a simple zip archive.

Next, I take a look at the file with ent, a pseudo-random number sequence test program. This will reveal how compressed (or even encrypted) data are. If ent says the data are very dense, say 7 bits per byte or more, the format is employing a good compression algorithm. The next step would be tackling that so I can start over on the uncompressed contents. If it's something like 4 bits per byte there's no compression. If it's in between then it might be employing a weak, custom compression algorithm. I've always seen the latter two.

Next I dive in with a hex editor. I use a combination of Emacs' hexl-mode and the standard BSD tool hexdump (for something more static). One of the first things I like to identify is byte order, and in a hex dump it's often obvious.

In general, better designed formats use big endian, also known as network order. That's the standard ordering used in communication, regardless of the native byte ordering of the network clients. The amateur, home-brew formats are generally less thoughtful and dump out whatever the native format is, usually little endian because that's what x86 is. Worse, they'll also generate data on architectures that are big endian, so you can get it both ways without any warning. In that case your conversion tool has to be sensitive to byte order and find some way to identify which ordering a file is using. A time-stamp field is very useful here, because a 64-bit time-stamp read with the wrong byte order will give a very unreasonable date.

For example, here's something I see often.

eb 03 00 00 35 00 00 00 66 1e 00 00

That's most likely 3 4-byte values, in little endian byte order. The zeros make the integers stand out.

eb 03 00 00 35 00 00 00 66 1e 00 00

We can tell it's little endian because the non-zero digits are on the left. This information will be useful in identifying more bytes in the file.

Next I'd look for headers, common strings of bytes, so that I can identify larger structures in the data. I've never had to reverse engineer a format ... yet. I'm not sure if I could. Once I got this far I've always been able to research the format further and find either source code or documentation, revealing everything to me.

If the file contains strings I'll dump them out with strings. I haven't found this too useful at work, but it's been useful at home.

And there's something still useful beyond these. Something I made myself at home for a completely different purpose, but I've exploited its side effects: my PNG Archiver. The original purpose of the tool is to store a file in an image, as images are easier to share with others. The side effect is that by viewing the image I get to see the structure of the file. For example, here's my laptop's /bin/ls, very roughly labeled.

The different segments of the ELF binary are easily visible.

It's easy to spot the different segments of the ELF format. Higher entropy sections are more brightly colored. Strings, being composed of ASCII-like text, have their MSB's unset, which is why they're darker. Any non-compressed format will have an interesting profile like this. Here's a Word doc, an infamously horrible format,

And here's some Emacs bytecode. You can tell the code vectors apart from the constants section below it.

If you find yourself having to inspect strange files, keep these tools around to make the job easier.


Don't stop here! This isn't everything. Check out the archives (on the left) for more posts. Or just have a look at the index.