null program

Elisp Wishlist

Update: It looks like all these wishes, except the last one, may actually be coming true! Guile can run Elisp better than Emacs! The idea is that the Elisp engine is replaced with Guile -- the GNU project's Scheme implementation designed to be used as an extension language -- and written in Scheme is an Elisp compiler that targets Guile's VM. The extension language of Emacs then becomes Scheme, but Emacs is still able to run all the old Elisp code. At the same time Elisp itself, which I'm sure many people will continue to use, gets an upgrade of arbitrary precision, closures, and better performance.

I've been using elisp a lot lately, but unfortunately it's missing a lot of features that one would find in a more standard lisp. The following are some features I wish elisp had. Many of these could be fit into a generic "be more like Scheme or Common Lisp". Some of these features would break the existing mountain of elisp code out there, requiring a massive rewrite, which is likely the main reason they are being held back.

Closures, and maybe continuations. Closures are one of the features I miss the most when writing elisp. They would allow the implementation of Scheme-style lazy evaluation with delay and force, among other neat tools. Continuations would just be a neat thing to have, though they come with a performance penalty.

Closures would also pretty much require Emacs switch to lexical scoping.

Arbitrary precision. Really, any higher order language's numbers should be bignums. Emacs 22 does come with the Calc package which provides arbitrary precision via defmath. Perl does something like this with the bignum module.

Packages/namespaces. Without namespaces all of the Emacs packages prefix their functions and variables with its name (i.e. dired-). Some real namespaces would be useful for large projects.

C interface. This is something GNU Emacs will never have because Richard Stallman considers Emacs shared libraries support to be a GPL threat. If Emacs could be dynamically extended some useful libraries could be linked in and exposed to elisp.

Concurrency. If some elisp is being executed Emacs will lock up. This is a particular problem for Gnus. Again, Emacs would really need to switch to lexical scoping before this could happen. Threading would be nice.

Speed. Emacs lisp is pretty slow, even when compiled. Lexical scoping would help with performance (compile time vs. run time binding).

Regex type. I mention this last because I think this would be really cool, and I am not aware of any other lisps that do it. Emacs does regular expressions with strings, which is silly and cumbersome. Backslashes need extra escaping, for example. Instead, I would rather have a regex type like Perl and Javascript have. So instead of,

(string-match "\\w[0-9]+" "foo525")

we have,

(string-match /\w[0-9]+/ "foo525")

Naturally there would be a regexpp predicate for checking its type. There could also be a function for compiling a regexp from a string into a regexp object. As a bonus, I would also like to use it directly as a function,

(/\w[0-9]+/ "foo525")

I think a regexp price would really give elisp an edge, and would be entirely appropriate for a text editor. It could also be done without breaking anything (keep string-style regexp support).

There is more commentary over at EmacsWiki: Why Does Elisp Suck.


Elisp Running Time Macro

I wanted an elisp macro that could measure the running time of a block of code. Specifically, I wanted it to work like this,

(measure-time
  ...
  body
  ...)

And it would return the running time as seconds in floating point. Well, here's a macro that does it!

;; ID: 6a3f3d99-f0da-329a-c01c-bb6b868f3239
(defmacro measure-time (&rest body)
  "Measure and return the running time of the code block."
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    `(let ((,start (float-time)))
       ,@body
       (- (float-time) ,start))))

It's only good for up to around 18 hours, then the time integer overflows. If only Emacs had arbitrary precision numbers. Here it is in action using my binomial function from last week.

(measure-time
  (nck 20 10)
  (nck 30 7))

Which, just now, returned 3.643713 seconds when executed.


Knight's Tour

An Ugly  Chess Knight A Knight's Tour is a path on a chess board such that a knight moving according to the rules of chess will visit each square exactly once. Here is a program that solves the Knight's Tour for various board sizes.

git clone http://git.nullprogram.com/knight.git

I wrote it purely for speed, and so in cases where the algorithm doesn't blow up (taking possibly thousands of years to complete) it only takes a few milliseconds to complete. The algorithm is recursive and goes like so,

  1. Given a board marked with visited spaces and a position create a list of valid moves.
  2. If the list is empty, return failure.
  3. Sort of the list ascendingly according to the opportunities of the target space. That is, sort by the number of valid moves that are available from the target space.
  4. Mark the current space as visited, then recurse from the new position with the new board starting with the first target in the list.
  5. If the recursion returns success, return success.
  6. If the recursion returns failure, recurse with the next target in the list, and so on. If all the targets on the list have been tried, return failure.

The purpose of going towards targets with few opportunities is so that we don't leave any stranded spaces that will not be detected until much later. When it does run into a dead end, it usually only has to back out a short ways and try again.

The algorithm breaks down on certain board sizes. For example, try a 120 by 120 board. Instead of nearly instantaneous solution like 120 by 119 would get, it takes enormous amounts of time. I don't yet know why.

Also included in the repository is a Perl script for creating frames for a video. Here are two videos, one for an 8x8 board, one for a 80x80 board. They are in Ogg Theora. Update: if your browser supports HTML 5 you can now watch these videos inline thanks to the video tag.

It's interesting how on the large one the behavior seems to be random at some places, but all it's doing is following that simple algorithm.


Doing Comment Previews the Right Way

Many comment/discussion systems get previews wrong. This even includes major sites like Boing Boing and Slashdot. Sometimes they feed back a different comment in the textarea, so repeated previews slowly degrade the comment. Other times the comment preview isn't the same thing as the final result. A comment actually has four states,

Diagram showing the four states oand their transforms.

The raw comment is the unfiltered string of bytes from the user. This is not safe to give directly back to the user, as it could be exploited to feed an arbitrary page to an innocent user.

The escaped comment is created from the raw comment by filtering it through the escapeHTML() function. This function creates HTML entities out of some of the characters, like < and >. A browser will interpret the escaped comment as a simple string, and is safe to give back to the user. This function is actually provided by perl's CGI module, so perl programmers need not implement this.

Note that escapeHTML() is reversible, though the server side won't need to reverse it. The browser does.

The stripped comment is created from the raw comment by filtering it through stripHTML(), which removes non-whitelisted HTML tags. It also strips non-whitelisted attributes from allowed tags. It should probably add a rel="nofollow" to links. It also runs escapeHTML() on attribute values and content outside tags. This is safe to give back to the user because only safe tags are left.

If your comments use markup other than HTML, like BBCode, this function should strip all HTML (your whitelist is empty) and do the conversion from your markup to HTML.

It might also be a good idea for it to produce well-formed HTML. This will allow your comments/discussion pages to be XHTML compliant.

stripHTML() is irreversible because it dumps information.

The stored comment is the encoding of the comment in the system. This depends entirely on the storage system. In some cases it may be identical to the stripped comment (and store is the identity function). If the comment is going through SQL into database, some characters may need to be escaped as to not cause problems. It could even be a base 64 encoding.

store() must be unambiguously reversible, and the server should have an unstore() to do this. It should probably also be able to convert any arbitrary string of characters into a safe encoding for storage.

There should only be one version of all these functions for both previews and final posting of comments.

When doing a comment preview both the escaped comment and the stripped comment are given back to the user. The stripped comment is dropped in as HTML, and the escaped comment is put into the textarea of the form. It would probably be convenient for the user if you give them back any other form information, including the same captcha and their answer to it (or not charge them with a captcha for that comment anymore).

You may be tempted to store the raw comments (safely with store()) and do HTML stripping on the fly. This would allow you to upgrade your HTML stripping function in the future to "better" handle user input. I don't recommend it. That's extra processing for each page request, but worse, it breaks the concept of the preview, because the comment formatting is subject to change in the future.

The hardest function to implement is probably stripHTML() because it needs to be able to handle poorly formed HTML. If you are using perl, you will probably want to use the HTML::Parser module, which is what I did. This does everything noted above and also auto-links anything that looks like a URL, forces proper comment nesting, automatically makes paragraphs from blank-line-separated chunks, and almost produces well-formed HTML.

htmlclean.pm

The documentation is basically non-existent, but if you want to whitelist more tags add them to @allowed_tags. Use it, abuse it.

I use this code in my comment system, so you can play around with it by using my preview function.


Unquoted Let

Update: While the basic idea is sound, my implementation below is definitely not. At the time I obviously did not fully understand Lisp macros. Using eval is usually a stupid idea, and it's even more stupid to try to use it inside of a macro where it will most likely be used, causing an error, at compile time. I doubt this would work well with byte-compiled code. I'll leave this post here for historical purposes.

The lisp macro/form let quotes variable's symbols. This is almost always more useful than not quoting it. Otherwise, simple use would look like,

(let (('var 100))
  body)

instead of,

(let ((var 100))
  body)

In elisp, this is analogous to setq, which quotes its first argument, and set, which doesn't. It could be used like so in elisp,

(setq foo 'bar)
(setq bar 25)

(uq-let ((foo 10))
  (+ 10 bar))

The unquoted let form evalues to 20, not 35, because foo evaluates to the symbol bar, which is bound to 10. The unquoted version is used to select variable names dynamically.

This isn't very useful in a lexically scoped lisp because the variables that can access it are decided, well, lexically. With dynamic scoping we can use it to temporarily mask variables and select what variables to mask dynamically. I found a use for it in one of my projects.

I had a recursive function that searched a graph made of symbols. The symbols globally stored the state of the node. The current node was passed into the function, which would change the state of that node, and recurse. Here's how I was doing it,

(defun search (graph node)
  ...
  (set node 'visited)
  recurse
  (set node 'unvisited))

That looks a lot like it is emulating a let form. If we use an unquoted let,

(defun search (graph node)
  ...
  (uq-let ((node 'visited))
    recurse))

There, that's a lot more lispy. Here is an elisp macro for uq-let,

(defmacro uq-let (vars &rest body)
  "Unquoted let."
  (declare (indent defun))
  `(let ,(mapcar 
          (lambda (a)
            (cons (eval (car a)) (cdr a))) vars)
     ,@body))

From this, making a uq-let* is trivial. The documentation string could use some work too.

Lisp macros are powerful.


Getting Lisp

I've been using lisp on and off for the past few years. I read some lisp books, went through The Little Schemer and some of SICP. But I could never really think in lisp. When I needed to write some code, I would prefer another language first. I was writing imperative code for 10 years before I saw lisp, and I was used to it.

Recently, I have found myself wanting to use lists (including alists, plists, etc.) as data structures for everything, even when I'm not even writing lisp. I think I finally got lisp and now I want to use lisp for everything.

For example, take this little problem from the other day on f(t). Katie Nowak gives us a math problem,

There is a 75% chance of rain on any given day in the next week. What is the probability that it will rain on at least 5 of the 7 days?

Rain cloud in parenthesis The purpose of the question was to point out a neat coincidence in the problem (explained at the link). I used a program to solve the problem and there are two reasons for this.

First, I wanted to use the program to explore the problem and find the "special" property. With a program, I could quickly try different parameters, which would take longer, and be more error prone, by hand.

Second, which is similar to the first, I hate evaluating a large expression by hand. It's slow and error prone. Writing a program to do the same work is faster and mistakes are easier to catch. Also, I can quickly try different parameters to make sure my program's output is reasonable. In this case, for any reasonable input, the output, a probability, shouldn't be greater than 1.

Well, let's see, this is a Bernoulli experiment: each day is independent, so it is like flipping a coin seven times and counting the heads. That means we need the choose function.

My first thought was Octave, as this is a simple program and Octave already provides nchoosek() for me.

# Rain at least n days
function sum = rain (n)
  sum = 0;
  for i = n:7
    sum += nchoosek(7,i) * 0.75 ^ i * 0.25 ^ (7 - i);
  end
end

Simple, but I actually made a couple little mistakes working it out, and it took me a little longer than it should have. If you asked someone to write this program in any imperative language, it would probably look a lot like this.

I then made a lisp version (elisp), but I first needed to define the binomial coefficient function since there wasn't one provided.

(defun nck (n k)
  "The binomial coefficient."
  (if (or (= k 0) (= k n)) 1
    (+ (nck (- n 1) (- k 1)) (nck (- n 1) k))))

This is the recursive version, based on Pascal's rule, so it doesn't need factorials.

In lisp, recursion is preferred to iteration, so that's the way I approached the program.

(defun p-rain (n)
  "Probability of rain on exactly n days."
  (* (nck 7 n) (expt 0.75 n) (expt 0.25 (- 7 n))))

(defun rain-at-least (n)
  "Probability it will rain on at least n days."
  (if (> n 7) 0
    (+ (p-rain n) (rain-at-least (+ 1 n)))))

I like the recursive version, and this code, much better. It presents the solution in a more straight forward way. And I got it right the first time, too. Yes, it was written after the Octave version, but I still think it counts for something.

The lisp version is also less complex. The Octave version has to use some temporary variables, i and sum, which is extra conceptual overhead. Sure, it could also be written recursively, but this is not really the way Octave is meant to be written.

Eh, not a great example, really. Lisp has so many powerful features, like macros (the powerful lisp kind), symbols, low-level access into the interpreter, and the REPL, that allow the programmer to do some really cool things. Many of these features are unique to lisps.

I'm really liking lisp.


Another Perl One-liner: Byte Order

At work right now I am using two different machines. One is a 32-bit SPARC and the other is a very powerful SMP x86-64. Sometimes data are generated on one machine and used in a simulation on the other. There is a problem of byte-order, though. The SPARC is big-endian and the other is little-endian, and the programs on both sides don't pay attention to that small detail.

Luckily, the data are all 4-byte aligned. That's perfect for a Perl one-liner byte order conversion,

perl -e 'print scalar reverse while read STDIN, $_, 4' < in.le > out.be

Perl is really great for concise hacks. I really like how this one-liner almost reads like a natural language sentence. Is there any other language that can do powerful one-liners like Perl?


Emacs Web Server

GNU head As part of my quest of developing solid knowledge of GNU Emacs lisp, I have implemented a pseudo-HTTP/1.0 web server within Emacs. Behold,

git clone http://git.nullprogram.com/emacs-httpd.git

To all other non-emacsen text editors, can your text editor do that?! Ha! Even though elisp is a slow, closure-less, dynamically scoped, ugly cousin of more popular lisps, it's still a lot of fun to write.

To fire it up, load it into Emacs and run the extended command (M-x) httpd-start. By default it will serve files from "~/public_html". To change this, change the variable httpd-root to the desired web root. You can stop the server with httpd-stop.

It's about 200 lines of code and can serve static websites made of small, static files. I say small files because it serves files from buffers, meaning it has to read the entire file in first.

For a simple, text editor based server it can hold up to a pretty decent load. At one point I hit it with 8 wget instances all making rapid recursive downloads and my manual navigation wasn't slowed down noticeably. Despite running in the slow elisp interpreter, I think it can have much better performance by caching commonly served files in buffers.

It should run, unmodified, anywhere a modern Emacs can run, so I expect that it's already very portable. I can imagine it being useful in a situation where someone needs to temporarily host some files but there isn't a web server on the machine. Just grab this script and throw it at Emacs.

Well, it only does IPv4 right now, though I expect IPv6 only requires changing one number (namely, 4 to 6). I don't have any IPv6 systems to test it on.

When writing it I also had security in mind so, as far as I know, it should be safe to use. It cleans up the GET from the client so that no files underneath the serving root can be accessed.

The server log is lisp itself. Here is an example log starting the server, serving one request, and halting,

'(log
  (start "Wed May 13 23:33:34 2009")
  (connection
   (date "Wed May 13 23:36:25 2009")
   (address "192.168.0.3")
   (get "/0001.html")
   (req
    ("Referer" "http://192.168.0.2:8080/")
    ("Connection" "keep-alive")
    ("Keep-Alive" "300")
    ("Accept-Charset" "ISO-8859-1,utf-8;q=0.7,*;q=0.7")
    ("Accept-Encoding" "gzip,deflate")
    ("Accept-Language" "en-us,en;q=0.5")
    ("Accept" "image/png,image/*;q=0.8,*/*;q=0.5")
    ("User-Agent" "Mozilla/5.0 [...] Iceweasel/3.0.9 (Debian-3.0.9-1)")
    ("Host" "192.168.0.2:8080")
    ("GET" "/0001.html" "HTTP/1.1"))
   (path "~/public_html/0001.html")
   (status 200))
  (stop "Wed May 13 23:38:17 2009"))

The log is alists of alists, making a hierarchical tree structure that can be explored with some simple lisp functions. Normally this sort of thing is done with XML, but lisp already has its own structured format: lists!

When GET is a directory, it looks for "index.html" and serves that if it exists. More indexes can be added to the variable httpd-indexes. This can actually be done in a special ".htaccess.el" file.

If a ".htaccess.el" exists in the directory from which a file is being served, Emacs will first load/execute it. You see, it's just a lisp program. If you wanted to add a new index file name, the hypertext access file could contain this,

(add-to-list 'httpd-indexes "0001.html")

It's a bit like a .emacs file.

But I think one of the coolest things about having a lisp-based server is that the server can be modified in place without disrupting or restarting it. In my Emacs web server, the only change that requires a restart is changing the server port. In fact, I wrote most of it while the server was running and tested my changes from a browser right as I made them -- all on the same instance of the server.

If you want to look into the AI side of this, the server could modify its own code in response to its use.

I also had the idea of creating dynamic websites with elisp, in the same way PHP or Perl does. If a .el file (or .elc) is accessed, the server would pass the GET/POST arguments as an alist to a function in the elisp file. The server would also provide some nifty HTML generation macros. A dynamic script might look like this,

(defun script (get)
  (html
   (head
    (title "My Script"))
   (body
    (h1 "Your Query")
    (p (concat "Your query was "
               (html-sanitize (cdr (assoc "q" get)) "."))))))

However, this is not (yet?) implemented. Just an idea.

I will continue to work on it, though I don't expect to add much more to it. I will mostly improve the code and documentation.


I Finally Have Comments

I finally have a comment system, thanks to pollxn, a blosxom comment system that actually works. There is a link to it, indicating the number of comments, in the bottom of each post. Try it out and say hello.

Unfortunately, pollxn doesn't have any sort of anti-spam or CAPTCHA system. If you look around the Interwebs where other people are using pollxn, you will see everyone has their own little CAPTCHA thing. Well, I am not different. I hacked together my own to keep away automated spammers.

It selects words from the dictionary (of 40,000 words in this case) and encrypts them with Blowfish in CBC mode, with a unique IV each time. This is to passed to the user, who passes it to an image generator which decrypts the word and uses GD in Perl to render it, apply some transforms, and drop a line randomly over it. The user submits the guess of the image along with the encrypted version (hidden field), which is decrypted and compared on the other end. The same encrypted ID cannot be used twice, but thanks to the IV the same word can be used twice.

Here are some samples. If you hit refresh, they will render differently.

CAPTCHA sample: ormolus CAPTCHA sample: morons CAPTCHA sample: softer CAPTCHA sample: zanucks CAPTCHA sample: grumble CAPTCHA sample: nozzle

It's not a great CAPTCHA, but it should be good enough for the low volume of traffic I see here. As I inevitably collect small amounts of spam (by spammers manually passing the CAPTCHA), I will gradually create the needed tools to combat it. I can also easily update the CAPTCHA image algorithm without disrupting the functioning of the website.

I'm sure I will be making improvements to the comment system over time as well. I should make it obfuscate e-mail addresses, for one. Maybe add a preview. And better blosxom integration.

So say hello below! I am excited to finally have a real blog.


Greasemonkey User Scripts

Old manuscript I have recently been playing around with Greasemonkey, a great Firefox add-on that gives users a lot of control over how a website is displayed. It works by having the user providing a bit of Javascript that runs when a website is rendered. The user doesn't actually have to write the script, but can find them in various script repositories.

When I looked around, I couldn't find scripts to do some things I wanted, so I started writing my own. Now that I have started that habit I see uses for user scripts all over the place. Suddenly I can fix anything I find annoying on the web. It's very empowering.

Of course, Firefox add-ons can do anything a user script can do. But Greasemonkey user scripts are lightweight, more secure (due to being less powerful), easier to write, and don't require a browser restart to install and uninstall.

I posted my scripts on userscripts.org so that people could find them easily, but I always like to host these things locally, too. You can find it under /userscripts here. Don't forget to review the source before you install it! That is, unless you automatically trust me and my website's security. I, or an infiltrator, could slip something sneaky in there.

I actually first used Greasemonkey back in 2005, but they had some very serious security issues back in those days. It was bad enough that I just uninstalled it, which was actually recommended by the Greasemonkey people themselves. So, four years later I am back to check it out.

The first user script I wrote was in response to a "feature" on TV Tropes. In addition to the overall cruddiness of the website, they started adding "folders" to the information on long pages. It folds up all of the information behind little clickable widgets. It uses CSS to hide the information and Javascript to reveal it by adjusting those styles on the fly. If you have a browser with CSS support but not Javascript (or have it disabled like I did), you won't ever be able to see the information in the browser. As of this writing the Jumping the Shark article uses this. Take a look.

This is an awful idea! It critically breaks the usability of the page. And what's the point? We already have a vertical scrollbar to control the display. Unfortunately, a lot of clueless people seem to like this sort of behavior -- because it's flashy -- so we will probably only see more of it on the web in the future.

My TV Tropes user script is simple: it scrapes off some of the CSS. Specifically, the "folder" CSS. That's it! Someone who already knows how to make user scripts could probably put this together in less than 15 minutes.

If you want to tackle web annoyances, learn Greasemonkey!


Wikipedia Flu Time-lapse

Here's something interesting I saw on Wikipedia. There is a map of the US with states colored according to the spread of the H1N1 flu epidemic. Take a look: H1N1_USA_Map.svg. The interesting part is actually at the bottom of that image page.

Wikipedia, being a wiki, versions everything. It has to. No one actually changes a page, they just add a new version that is a derivative of the "current" version. Because of this, Wikipedia has incidentally created a time-lapse version of the map in its version control system. In case you can't see it when you are reading this, here's what part of it looks like,

Wikipedia screenshot

That's the timestamp and the state of the spread at that time, presented in an extremely useful way. And this was created by accident. Pretty cool, eh?


Don't stop here! This isn't everything. Check out the archives (on the left) for more posts. Or just have a look at the index.