<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged reddit at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/reddit/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/reddit/feed/"/>
  <updated>2026-04-09T13:25:45Z</updated>
  <id>urn:uuid:1cbcfbba-f648-40f5-92fa-de92dd9f262b</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>A Showerthoughts Fortune File</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/01/"/>
    <id>urn:uuid:0a266c4d-a224-3399-a851-848f71b47dc3</id>
    <updated>2016-12-01T23:58:15Z</updated>
    <category term="reddit"/><category term="linux"/><category term="emacs"/>
    <content type="html">
      <![CDATA[<p>I have created a <a href="https://en.wikipedia.org/wiki/Fortune_(Unix)"><code class="language-plaintext highlighter-rouge">fortune</code> file</a> for the all-time top 10,000
<a href="https://old.reddit.com/r/Showerthoughts/">/r/Showerthoughts</a> posts, as of October 2016. As a word of
warning: Many of these entries are adult humor and may not be
appropriate for your work computer. These fortunes would be
categorized as “offensive” (<code class="language-plaintext highlighter-rouge">fortune -o</code>).</p>

<p>Download: <a href="https://skeeto.s3.amazonaws.com/share/showerthoughts" class="download">showerthoughts</a> (1.3 MB)</p>

<p>The copyright status of this file rests with each of its thousands
of authors. Since it’s not practical to contact many of these authors,
some of whom may no longer be alive, it’s obviously never going to be
under an open source license (Creative Commons, etc.). Moreover, some
quotes are probably from comedians and such, rather than by the
redditor who made the post. I distribute it only for fun.</p>

<h3 id="installation">Installation</h3>

<p>To install this into your <code class="language-plaintext highlighter-rouge">fortune</code> database, first process it with
<code class="language-plaintext highlighter-rouge">strfile</code> to create a random-access index, showerthoughts.dat, then
copy both files into the directory with the rest.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strfile showerthoughts
"showerthoughts.dat" created
There were 10000 strings
Longest string: 343 bytes
Shortest string: 39 bytes

$ cp showerthoughts* /usr/share/games/fortunes/
</code></pre></div></div>

<p>Alternatively, <code class="language-plaintext highlighter-rouge">fortune</code> can be told to use this file directly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ fortune showerthoughts
Not once in my life have I stepped into somebody's house and
thought, "I sure hope I get an apology for 'the mess'."
        ―AndItsDeepToo, Aug 2016
</code></pre></div></div>

<p>If you didn’t already know, <code class="language-plaintext highlighter-rouge">fortune</code> is an old unix utility that
displays a random quotation from a quotation database — a digital
<em>fortune cookie</em>. I use it as an interactive login shell greeting on
my <a href="http://www.hardkernel.com/main/products/prdt_info.php">ODROID-C2</a> server:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if </span><span class="nb">shopt</span> <span class="nt">-q</span> login_shell<span class="p">;</span> <span class="k">then
    </span>fortune ~/.fortunes
<span class="k">fi</span>
</code></pre></div></div>

<h3 id="how-was-it-made">How was it made?</h3>

<p>Fortunately I didn’t have to do something crazy like scrape reddit for
weeks on end. Instead, I downloaded <a href="http://files.pushshift.io/reddit/">the pushshift.io submission
archives</a>, which is currently around 70 GB compressed. Each file
contains one month’s worth of JSON data, one object per submission,
one submission per line, all compressed with bzip2.</p>

<p>Unlike so many other datasets, especially those made up of arbitrary
inputs from millions of people, the /r/Showerthoughts posts are
surprisingly clean and require virtually no touching up. It’s some
really fantastic data.</p>

<p>A nice feature of bzip2 is that concatenating compressed files also
concatenates the uncompressed files. Additionally, it’s easy to
parallelize bzip2 compression and decompression, which gives it <a href="/blog/2009/03/16/">an
edge over xz</a>. I strongly recommend using <a href="http://lbzip2.org/">lbzip2</a> to
decompress this data, should you want to process it yourself.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat </span>RS_<span class="k">*</span>.bz2 | lbunzip2 <span class="o">&gt;</span> everything.json
</code></pre></div></div>

<p><a href="https://stedolan.github.io/jq/">jq</a> is my favorite command line tool for processing JSON (and
<a href="/blog/2016/09/15/">rendering fractals</a>). To filter all the /r/Showerthoughts posts,
it’s a simple <code class="language-plaintext highlighter-rouge">select</code> expression. Just mind the capitalization of the
subreddit’s name. The <code class="language-plaintext highlighter-rouge">-c</code> tells <code class="language-plaintext highlighter-rouge">jq</code> to keep the output to one object per line.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat </span>RS_<span class="k">*</span>.bz2 | <span class="se">\</span>
    lbunzip2 | <span class="se">\</span>
    jq <span class="nt">-c</span> <span class="s1">'select(.subreddit == "Showerthoughts")'</span> <span class="se">\</span>
    <span class="o">&gt;</span> showerthoughts.json
</code></pre></div></div>

<p>However, you’ll quickly find that jq is the bottleneck, parsing all
that JSON, and lbzip2 won’t be able to keep your cores busy. So
I throw <code class="language-plaintext highlighter-rouge">grep</code> in front to dramatically decrease the workload for
<code class="language-plaintext highlighter-rouge">jq</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="k">*</span>.bz2 | <span class="se">\</span>
    lbunzip2 | <span class="se">\</span>
    <span class="nb">grep</span> <span class="nt">-a</span> Showerthoughts | <span class="se">\</span>
    jq <span class="nt">-c</span> <span class="s1">'select(.subreddit == "Showerthoughts")'</span> <span class="se">\</span>
    <span class="o">&gt;</span> showerthoughts.json
</code></pre></div></div>

<p>This will let some extra things through, but it’s a superset. The <code class="language-plaintext highlighter-rouge">-a</code>
option is necessary because the data contains some null bytes. Without
it, <code class="language-plaintext highlighter-rouge">grep</code> switches into binary mode and breaks everything. This is
incredibly frustrating when you’ve already waited half an hour for
results.</p>
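<p>The binary-mode behavior is easy to demonstrate with a synthetic file containing a null byte (not the reddit data itself):</p>

```shell
#!/bin/sh
# A null byte makes GNU grep treat the input as binary: instead of the
# matching line it prints only a binary-match notice. The -a option
# forces text mode so the line itself comes through.
printf 'Showerthoughts\000more\n' > sample.txt
grep Showerthoughts sample.txt || true   # prints a binary-match notice
grep -a -c Showerthoughts sample.txt     # counts the match normally: 1
```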

<p>To reduce the workload further down the pipeline, I take
advantage of the fact that only four fields will be needed: <code class="language-plaintext highlighter-rouge">title</code>,
<code class="language-plaintext highlighter-rouge">score</code>, <code class="language-plaintext highlighter-rouge">author</code>, and <code class="language-plaintext highlighter-rouge">created_utc</code>. The rest can — and should, for
efficiency’s sake — be thrown away where it’s cheap to do so.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="k">*</span>.bz2 | <span class="se">\</span>
    lbunzip2 | <span class="se">\</span>
    <span class="nb">grep</span> <span class="nt">-a</span> Showerthoughts | <span class="se">\</span>
    jq <span class="nt">-c</span> <span class="s1">'select(.subreddit == "Showerthoughts") |
               {title, score, author, created_utc}'</span> <span class="se">\</span>
    <span class="o">&gt;</span> showerthoughts.json
</code></pre></div></div>

<p>This gathers all 1,199,499 submissions into a 185 MB JSON file (as of
this writing). Most of these submissions are terrible, so the next
step is narrowing it to the small set of good submissions and putting
them into the <code class="language-plaintext highlighter-rouge">fortune</code> database format.</p>

<p><strong>It turns out reddit already has a method for finding the best
submissions: a voting system.</strong> Just pick the highest scoring posts.
Through experimentation I arrived at 10,000 as the magic cut-off
number. After this the quality really starts to drop off. Over time
this should probably be scaled up with the total number of
submissions.</p>

<p>I did both steps at the same time using a bit of Emacs Lisp, which is
particularly well-suited to the task:</p>

<ul>
  <li><a href="https://github.com/skeeto/showerthoughts">https://github.com/skeeto/showerthoughts</a></li>
</ul>

<p>This Elisp program reads one JSON object at a time and sticks each
into an AVL tree sorted by score (descending), then timestamp
(ascending), then title (ascending). The AVL tree is limited to 10,000
items, with the lowest items being dropped. This was a lot faster than
the more obvious approach: collecting everything into a big list,
sorting it, and keeping the top 10,000 items.</p>
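<p>For comparison, the obvious approach is a one-liner in shell. This is a hypothetical sketch on toy data; the real input would be the score field pulled out of showerthoughts.json:</p>

```shell
#!/bin/sh
# Hypothetical sketch of the "obvious approach": sort everything by
# score, descending, then keep the top N. On toy data this is fine,
# but on ~1.2 million records a bounded structure like the AVL tree
# avoids sorting the entire dataset.
N=2
printf '%s\n' \
    '10 low-scoring thought' \
    '9001 top thought' \
    '500 middling thought' |
    sort -k1,1nr | head -n "$N"
```

This prints the two highest-scored lines, top score first.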

<h4 id="formatting">Formatting</h4>

<p>The most complicated part is actually paragraph wrapping the
submissions. Most are too long for a single line, and letting the
terminal hard wrap them is visually unpleasing. The submissions are
encoded in UTF-8, some with characters beyond simple ASCII. Proper
wrapping requires not just Unicode awareness, but also some degree of
Unicode <em>rendering</em>. The algorithm needs to recognize grapheme
clusters and know the size of the rendered text. This is not so
trivial! Most paragraph wrapping tools and libraries get this wrong,
some counting width by bytes, others counting width by codepoints.</p>

<p>Emacs’ <code class="language-plaintext highlighter-rouge">M-x fill-paragraph</code> knows how to do all these things — only
for a monospace font, which is all I needed — and I decided to
leverage it when generating the <code class="language-plaintext highlighter-rouge">fortune</code> file. Here’s an example that
paragraph-wraps a string:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">string-fill-paragraph</span> <span class="p">(</span><span class="nv">s</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">with-temp-buffer</span>
    <span class="p">(</span><span class="nv">insert</span> <span class="nv">s</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">fill-paragraph</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)))</span>
</code></pre></div></div>

<p>For the file format, items are delimited by a <code class="language-plaintext highlighter-rouge">%</code> on a line by itself.
I put the wrapped content, followed by a <a href="http://www.fileformat.info/info/unicode/char/2015/index.htm">quotation dash</a>, the
author, and the date. A surprising number of these submissions have
date-sensitive content (“on this day X years ago”), so I found it was
important to include a date.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>April Fool's Day is the one day of the year when people critically
evaluate news articles before accepting them as true.
        ―kellenbrent, Apr 2015
%
Of all the bodily functions that could be contagious, thank god
it's the yawn.
        ―MKLV, Aug 2015
%
</code></pre></div></div>
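<p>Generating the rest of an entry is straightforward. Here is a rough shell sketch, not the actual Elisp; note that <code class="language-plaintext highlighter-rouge">fold</code> counts bytes rather than graphemes, so it mishandles exactly the Unicode cases described above:</p>

```shell
#!/bin/sh
# Rough sketch: emit one fortune entry from a title, author, and Unix
# timestamp. fold(1) wraps by bytes, so this only approximates the
# Unicode-aware wrapping done in Emacs.
format_entry() {
    title=$1 author=$2 created_utc=$3
    when=$(date -u -d "@$created_utc" '+%b %Y')   # GNU date
    printf '%s\n' "$title" | fold -s -w 70
    # \342\200\225 is UTF-8 for the U+2015 quotation dash
    printf '        \342\200\225%s, %s\n%%\n' "$author" "$when"
}

format_entry 'Example thought.' nobody 1470009600
```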

<p>There’s the potential that a submission itself could end with a lone
<code class="language-plaintext highlighter-rouge">%</code> and, with a bit of bad luck, it happens to wrap that onto its own
line. Fortunately this hasn’t happened yet. But, now that I’ve
advertised it, someone could make such a submission, popular enough
for the top 10,000, with the intent to personally trip me up in a
future update. I accept this, though it’s unlikely, and it would be
fairly easy to work around if it happened.</p>

<p>The <code class="language-plaintext highlighter-rouge">strfile</code> program looks for the <code class="language-plaintext highlighter-rouge">%</code> delimiters and fills out a
table of file offsets. The header of the <code class="language-plaintext highlighter-rouge">.dat</code> file indicates the
number of strings along with some other metadata. What follows is a table
of 32-bit file offsets.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">str_version</span><span class="p">;</span>  <span class="cm">/* version number */</span>
    <span class="kt">uint32_t</span> <span class="n">str_numstr</span><span class="p">;</span>   <span class="cm">/* # of strings in the file */</span>
    <span class="kt">uint32_t</span> <span class="n">str_longlen</span><span class="p">;</span>  <span class="cm">/* length of longest string */</span>
    <span class="kt">uint32_t</span> <span class="n">str_shortlen</span><span class="p">;</span> <span class="cm">/* shortest string length */</span>
    <span class="kt">uint32_t</span> <span class="n">str_flags</span><span class="p">;</span>    <span class="cm">/* bit field for flags */</span>
    <span class="kt">char</span> <span class="n">str_delim</span><span class="p">;</span>        <span class="cm">/* delimiting character */</span>
<span class="p">};</span>
</code></pre></div></div>
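<p>The header is written in network byte order (big-endian). A quick way to peek at it is <code class="language-plaintext highlighter-rouge">od</code>. Here is a sketch that decodes the first two fields from a synthesized header, since the exact values depend on your <code class="language-plaintext highlighter-rouge">strfile</code> version:</p>

```shell
#!/bin/sh
# Decode the first two 32-bit big-endian fields (str_version,
# str_numstr) of a strfile .dat header. The header bytes here are
# synthesized with printf rather than produced by strfile itself.
printf '\000\000\000\002\000\000\047\020' > header.dat  # version 2, 10000 strings
od -An --endian=big -tu4 header.dat   # GNU od; decodes to 2 and 10000
```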

<p>Note that the table doesn’t necessarily need to list the strings in
the same order as they appear in the original file. In fact, recent
versions of <code class="language-plaintext highlighter-rouge">strfile</code> can sort the strings by sorting the table, all
without touching the original file. Though none of this is important to
<code class="language-plaintext highlighter-rouge">fortune</code>.</p>

<p>Now that you know how it all works, you can build your own <code class="language-plaintext highlighter-rouge">fortune</code>
file from your own inputs!</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Emacs Lisp Reddit API Wrapper</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2013/12/16/"/>
    <id>urn:uuid:3362934d-9762-3f58-e05c-4d8b28175367</id>
    <updated>2013-12-16T23:27:23Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="reddit"/><category term="web"/>
    <content type="html">
      <![CDATA[<p>A couple of months ago I wrote an Emacs Lisp wrapper for the
<a href="http://old.reddit.com/dev/api">reddit API</a>. I didn’t put it in MELPA,
not yet anyway. If anyone is finding it useful I’ll see about getting
that done. My intention was to give it some exercise and testing,
locking down the API, before putting it out there for people to use. You can
find it here,</p>

<ul>
  <li><a href="https://github.com/skeeto/emacs-reddit-api">https://github.com/skeeto/emacs-reddit-api</a></li>
</ul>

<p>Except for logging in, the library is agnostic about the actual API
endpoints themselves. It just knows how to translate between Elisp and
the reddit API protocol. This makes the library dead simple to use. I
had considered supporting <a href="http://blog.jenkster.com/2013/10/an-oauth2-in-emacs-example.html">OAuth2 authentication</a> rather than
password authentication, but reddit’s OAuth2 support is pretty rough
around the edges.</p>

<h3 id="library-usage">Library Usage</h3>

<p>The reddit API has two kinds of endpoints, GET and POST, so there are
really only three functions to concern yourself with.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">reddit-login</code></li>
  <li><code class="language-plaintext highlighter-rouge">reddit-get</code></li>
  <li><code class="language-plaintext highlighter-rouge">reddit-post</code></li>
</ul>

<p>And one variable,</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">reddit-session</code></li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">reddit-login</code> function is really just a special case of
<code class="language-plaintext highlighter-rouge">reddit-post</code>. It returns a session value (cookie/modhash tuple) that
is used by the other two functions for authenticating the user. Like
almost all Elisp data structures (more so than in perhaps <em>any</em>
other popular programming language), it can be serialized with the
printer and reader, allowing a reddit session to be maintained across
Emacs sessions.</p>

<p>The return value of <code class="language-plaintext highlighter-rouge">reddit-login</code> generally doesn’t need to be
captured. It automatically sets the dynamic variable <code class="language-plaintext highlighter-rouge">reddit-session</code>,
which is what the other functions access for authentication. This can
be bound with <code class="language-plaintext highlighter-rouge">let</code> to other session values in order to switch between
different users.</p>

<p>Both <code class="language-plaintext highlighter-rouge">reddit-get</code> and <code class="language-plaintext highlighter-rouge">reddit-post</code> take an endpoint name and a list
of key-value pairs in the form of a property list (plist). (The
<code class="language-plaintext highlighter-rouge">api-type</code> key is automatically supplied.) They each return the JSON
response from the server in association list (alist) form. The actual
shape of this data matches the response from reddit, which,
unfortunately, is inconsistent and unspecified, so writing any sort of
program to operate on the API requires lots of trial and error. If the
API responded with an error, these functions signal a <code class="language-plaintext highlighter-rouge">reddit-error</code>.</p>

<p>Typical usage looks like so. Notice that values need not be only
strings; they just need to print to something reasonable.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; Login first</span>
<span class="p">(</span><span class="nv">reddit-login</span> <span class="s">"your-username"</span> <span class="s">"your-password"</span><span class="p">)</span>

<span class="c1">;; Subscribe to a subreddit</span>
<span class="p">(</span><span class="nv">reddit-post</span> <span class="s">"/api/subscribe"</span> <span class="o">'</span><span class="p">(</span><span class="ss">:sr</span> <span class="s">"t5_2s49f"</span> <span class="ss">:action</span> <span class="nv">sub</span><span class="p">))</span>

<span class="c1">;; Post a comment</span>
<span class="p">(</span><span class="nv">reddit-post</span> <span class="s">"/api/comment/"</span> <span class="o">'</span><span class="p">(</span><span class="ss">:text</span> <span class="s">"Hello world."</span> <span class="ss">:thing_id</span> <span class="s">"t1_cd3ar7y"</span><span class="p">))</span>
</code></pre></div></div>

<p>For plists keys I considered automatically converting between dashes
and underscores so that the keywords could have Lisp-style names. But
the reddit API is inconsistent, using both, so there’s no correct way
to do this.</p>

<p>To further refine the API it might be worth defining a function for
each of the reddit endpoints, forming a facade for the wrapper
library, hiding away the plist arguments and complicated responses.
That would eliminate the trial and error of using the API.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">reddit-api-comment</span> <span class="p">(</span><span class="nv">parent</span> <span class="nv">comment</span><span class="p">)</span>
  <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">null</span> <span class="nv">reddit-session</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">error</span> <span class="s">"Not logged in."</span><span class="p">)</span>
    <span class="c1">;; TODO: reduce the return value into a thing/struct</span>
    <span class="p">(</span><span class="nv">reddit-post</span> <span class="s">"/api/comment/"</span> <span class="p">(</span><span class="nb">list</span> <span class="ss">:thing_id</span> <span class="nv">parent</span> <span class="ss">:text</span> <span class="nv">comment</span><span class="p">))))</span>
</code></pre></div></div>

<p>Furthermore there could be defstructs for comments, posts, subreddits,
etc. so that the “thing” ID stuff is hidden away. This is basically
what was already done for sessions out of necessity. I might add these
structs and functions someday, but I don’t currently have a need for
them.</p>

<p>It would be neat to use this API to create an interface to reddit from
within Emacs. I imagine it might look like one of the Emacs mail
clients, or <a href="/blog/2013/09/04/">like Elfeed</a>. Almost everything, including
viewing image posts within Emacs, should be possible.</p>

<h3 id="background">Background</h3>

<p>For the last 3.5 years I’ve been a moderator of <a href="http://old.reddit.com/r/civ">/r/civ</a>,
<a href="http://old.reddit.com/r/civ/comments/clxj4/lets_tidy_rciv_up_a_bit/">starting back when it had about 100 subscribers</a>. As of this
writing it’s just short of 60k subscribers and we’re now up to 9
moderators.</p>

<p>A few months ago we decided to institute a self-post-only Sunday. All
day Sunday, midnight to midnight Eastern time, only self-posts are
allowed in the subreddit. One of the other moderators was turning this
on and off manually, so I offered to write a bot to do the job. There
<a href="https://github.com/reddit/reddit/wiki/API-Wrappers">weren’t any Lisp wrappers yet</a> (though raw4j could be used
with Clojure), so I decided to write one.</p>

<p>As mentioned before, the reddit API leaves <em>a lot</em> to be desired. It
randomly returns errors, so a correct program needs to be prepared to
retry requests after a short delay, depending on the error. My
particular annoyance is that the <code class="language-plaintext highlighter-rouge">/api/site_admin</code> endpoint requires
that most of its keys are supplied, and it’s not documented which ones
are required. Even worse, there’s no single endpoint to get all of the
required values, the key names between endpoints are inconsistent, and
even the values themselves can’t be returned as-is, requiring
<a href="http://old.reddit.com/r/bugs/comments/1t162o/">massaging/fixing before returning them back to the API</a>.</p>

<p>I hope other people find this library useful!</p>

]]>
    </content>
  </entry>
  <entry>
    <title>My Grading Process</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2013/10/13/"/>
    <id>urn:uuid:8c5aecd0-c7f0-314d-d4a8-88014a195e08</id>
    <updated>2013-10-13T02:56:31Z</updated>
    <category term="java"/><category term="rant"/><category term="reddit"/>
    <content type="html">
      <![CDATA[<p>My GitHub activity, including this blog, has really slowed down for
the past month because I’ve spent a lot of free time grading homework
for a <a href="http://apps.ep.jhu.edu/courses/605/707">design patterns class</a>, taught by a colleague at the
<a href="http://engineering.jhu.edu/">Whiting School of Engineering</a>. Conveniently for me, all of my
interaction with the students is through e-mail. It’s been a great
exercise of <a href="/blog/2013/09/03/">my new e-mail setup</a>, which itself has definitely
made this job easier. It’s kept me very organized through the whole
process.</p>

<p><img src="/img/screenshot/github-dropoff.png" alt="" /></p>

<p>Each assignment involves applying two or three design patterns to a
crude (in my opinion) XML parsing library. Students are given a
tarball containing the source code for the library, in both Java and
C++. They pick a language, modify the code to use the specified
patterns, zip/archive up the result, and e-mail me their
zipfile/tarball.</p>

<p>It took me the first couple of weeks to work out an efficient grading
workflow, and, at this point, I can accurately work my way through
most new homework submissions rapidly. On my end I already know the
original code base. All I really care about is the student’s changes.
In software development this sort of thing is expressed as a <em>diff</em>,
preferably in the <a href="http://en.wikipedia.org/wiki/Diff#Unified_format"><em>unified diff</em></a> format. This is called a
<em>patch</em>. It describes precisely what was added and removed, and
provides a bit of context around each change. The context greatly
increases the readability of the patch and, as a bonus, allows it to
be applied to a slightly different source. Here’s a part of a patch
recently submitted to Elfeed:</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">diff --git a/tests/elfeed-tests.el b/tests/elfeed-tests.el
index 31d5ad2..fbb78dd 100644
</span><span class="gd">--- a/tests/elfeed-tests.el
</span><span class="gi">+++ b/tests/elfeed-tests.el
</span><span class="p">@@ -144,15 +144,15 @@</span>
   (with-temp-buffer
     (insert elfeed-test-rss)
     (goto-char (point-min))
<span class="gd">-    (should (eq (elfeed-feed-type (xml-parse-region)) :rss)))
</span><span class="gi">+    (should (eq (elfeed-feed-type (elfeed-xml-parse-region)) :rss)))
</span>   (with-temp-buffer
     (insert elfeed-test-atom)
     (goto-char (point-min))
<span class="gd">-    (should (eq (elfeed-feed-type (xml-parse-region)) :atom)))
</span><span class="gi">+    (should (eq (elfeed-feed-type (elfeed-xml-parse-region)) :atom)))
</span>   (with-temp-buffer
     (insert elfeed-test-rss1.0)
     (goto-char (point-min))
<span class="gd">-    (should (eq (elfeed-feed-type (xml-parse-region)) :rss1.0))))
</span><span class="gi">+    (should (eq (elfeed-feed-type (elfeed-xml-parse-region)) :rss1.0))))
</span>
 (ert-deftest elfeed-entries-from-x ()
   (with-elfeed-test
</code></pre></div></div>

<p>I’d <em>really</em> prefer to receive patches like this as homework
submissions but this is probably too sophisticated for most students.
Instead, the first thing I do is create a patch for them from their
submission. Most students work off of their previous submission, so I
just run <code class="language-plaintext highlighter-rouge">diff</code> between their last submission and the current one.
While I’ve got a lot of the rest of the process automated with
scripts, I unfortunately cannot script patch generation. Each
student’s submission follows a unique format for that particular
student and some students are not even consistent between their own
assignments. About half the students also include generated files
alongside the source so I need to clean this up too. Generating the
patch is by far the messiest part of the whole process.</p>
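<p>Once a submission is unpacked next to the previous one, the patch itself is a single command. A minimal sketch with hypothetical file names:</p>

```shell
#!/bin/sh
# Generate a unified diff between a student's previous and current
# submissions: -u for unified format, -r to recurse into directories,
# -N to include files that appear in only one tree.
mkdir -p prev curr
echo 'class Parser {}'                > prev/Parser.java
echo 'class Parser { /* changed */ }' > curr/Parser.java
diff -urN prev curr > student.patch || true  # diff exits 1 when trees differ
cat student.patch
```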

<p>I grade almost entirely from the patch. 100% correct submissions are
usually only a few hundred lines of patch and I can spot all of the
required parts within a few minutes. Very easy. It’s the incorrect
submissions that consume most of my time. I have to figure out what
they’re doing, determine what they <em>meant</em> to do, and distill that
down into discrete discussion items along with point losses. In either
case I’ll also add some of my own opinions on their choice of style,
though this has no effect on the final grade.</p>

<p>For each student’s submission, I commit to a private Git repository
the raw, submitted archive file, the generated patch, and a grade
report written in Markdown. After the due date and once all the
submitted assignments are graded, I reply to each student with their
grade report. On a few occasions there’s been a back and forth
clarification dialog that has resulted in the student getting a higher
score. (That’s a hint to any students who happen to read this!)</p>
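<p>The bookkeeping boils down to a few Git commands per submission (a sketch with hypothetical paths and file names):</p>

```shell
#!/bin/sh
# Commit one student's raw archive, generated patch, and Markdown grade
# report together in a private repository. Paths are hypothetical.
git init -q grading
cd grading
git config user.name  Grader
git config user.email grader@example.com
mkdir -p student1/hw3
: > student1/hw3/submission.zip   # raw archive as submitted
: > student1/hw3/hw3.patch        # generated unified diff
echo '# Grade: 95/100' > student1/hw3/report.md
git add student1/hw3
git commit -qm 'student1: homework 3'
```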

<p>Even ignoring the time it takes to generate a patch, there are still
disadvantages to not having students submit patches. One is the size:
about 60% of my current e-mail storage, which goes all the way back to
2006, comes from this class alone, all within the past month. It’s been a
lot of bulky attachments. I’ll delete all of the attachments once the
semester is over.</p>

<p>Another is that the students are unaware of how many changes they
make. Some of these patches contain a significant number of trivial
changes — breaking long lines in the original source, changing
whitespace within lines, etc. If students focused on crafting a tidy
patch they might try to avoid including these types of changes in
their submissions. I like to imagine this process being similar to
submitting a patch to an open source project. Patches should describe
a concise set of changes, and messy patches are rejected outright. The
Git staging area is all about crafting clean patches like this.</p>

<p>If there was something else I could change it would be to severely
clean up the original code base. When compiler warnings are turned on,
compiling it emits a giant list of warnings. The students are already
starting at an unnecessary disadvantage, missing out on a very
valuable feature: because of all the existing noise they can’t
effectively use compiler warnings themselves. Any new warnings would
be lost in the noise. This has also led to many of those
trivial/unrelated changes: some students are spending time fixing the
warnings.</p>

<p>I want to go a lot further than warnings, though. I’d make sure the
original code base had absolutely no issues listed by <a href="http://pmd.sourceforge.net/">PMD</a>,
<a href="http://findbugs.sourceforge.net/">FindBugs</a>, or <a href="http://checkstyle.sourceforge.net/">Checkstyle</a> (for the Java
version, that is). Then I could use all of these static analysis tools
on students’ submissions to quickly spot issues. It’s as simple as
<a href="https://github.com/skeeto/sample-java-project/blob/master/build.xml">using my starter build configuration</a>. In fact, I’ve used
these tools a number of times in the past to perform detailed code
reviews for free (<a href="http://old.reddit.com/r/javahelp/comments/1inzs7/_/cb6ojr2">1</a>, <a href="http://old.reddit.com/r/reviewmycode/comments/1a2fty/_/c8tpme2">2</a>, <a href="http://old.reddit.com/r/javahelp/comments/1balsp/_/c958num">3</a>). Providing an
extensive code analysis for each student for each assignment would
become a realistic goal.</p>

<p>I’ve expressed all these ideas to the class’s instructor, my
colleague, so maybe some things will change in future semesters. If
I’m offered the opportunity again — assuming I didn’t screw this
semester up already — I’m still unsure whether I would accept. It’s
a lot of work for, optimistically, what amounts to
the same pay rate I received as an engineering intern in college. This
first experience at grading has been very educational, making me
appreciate those who graded my own sloppy assignments in college, and
that’s provided value beyond the monetary compensation. Next time
around wouldn’t be as educational, so my time could probably be better
spent on other activities, even if it’s writing open source software
for free.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Web Distributed Computing Revisited</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2013/01/26/"/>
    <id>urn:uuid:ab83f362-cc7f-308f-309b-5f3af5ae9be9</id>
    <updated>2013-01-26T00:00:00Z</updated>
    <category term="javascript"/><category term="web"/><category term="lisp"/><category term="reddit"/>
    <content type="html">
      <![CDATA[<p>Four years ago I investigated the idea of using
<a href="/blog/2009/06/09/">browsers as nodes for distributed computing</a>. I concluded
that due to the platform’s constraints there were few problems that it
was suited to solve. However, the situation has since changed quite a
bit! In fact, this weekend I made practical use of web browsers across
a number of geographically separated computers to solve a
computational problem.</p>

<h3 id="what-changed">What changed?</h3>

<p><a href="http://en.wikipedia.org/wiki/Web_worker">Web workers</a> came into existence, not just as a specification
but as implementations across all the major browsers. They allow
JavaScript to run in an isolated, dedicated background thread. This
eliminates the <code class="language-plaintext highlighter-rouge">setTimeout()</code> requirement from before, which not only
caused a performance penalty but really hampered running any sort of
lively interface alongside the computation. The interface and
computation were competing for time on the same thread.</p>

<p>The worker isn’t <em>entirely</em> isolated; otherwise it would be useless
for anything but wasting resources. Through message events, it can pass
<a href="https://developer.mozilla.org/en-US/docs/DOM/The_structured_clone_algorithm">structured clones</a> to and from the main thread running in the
page. Other than this, it has no access to the DOM or other data on
the page.</p>

<p>The interface is a bit unfriendly to <a href="/blog/2012/10/31/">live development</a>, but
it’s manageable. It’s invoked by passing the URL of a script to the
constructor. This script is the code that runs in the dedicated thread.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">worker</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">Worker</span><span class="p">(</span><span class="dl">'</span><span class="s1">script/worker.js</span><span class="dl">'</span><span class="p">);</span>
</code></pre></div></div>

<p>The sort of interface that would have been more convenient for live
interaction would be something like what is found on most
multi-threaded platforms: a thread constructor that accepts a function
as an argument.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* This doesn't work! */</span>
<span class="kd">var</span> <span class="nx">worker</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">Worker</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">});</span>
</code></pre></div></div>

<p>I completely understand why this isn’t the case. The worker thread
needs to be totally isolated and the above example is insufficient.
I’m passing a closure to the constructor, which means I would be
sharing bindings, and therefore data, with the worker thread. This
interface could be faked using a <a href="http://en.wikipedia.org/wiki/Data_URI_scheme">data URI</a> and taking
advantage of the fact that most browsers return function source code
from <code class="language-plaintext highlighter-rouge">toString()</code>.</p>
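
<p>For illustration, here is roughly how that fake could look. The helper
name <code class="language-plaintext highlighter-rouge">workerURI</code> is my own invention, and it assumes the browser accepts a
<code class="language-plaintext highlighter-rouge">data:</code> URI in the <code class="language-plaintext highlighter-rouge">Worker</code> constructor:</p>

```javascript
// Sketch of the data URI trick (workerURI is a hypothetical helper).
// It relies on Function.prototype.toString() returning usable source,
// and the function must not close over anything outside itself.
function workerURI(f) {
    var source = '(' + f.toString() + ')();';
    return 'data:application/javascript,' + encodeURIComponent(source);
}

// In a browser: new Worker(workerURI(function() { postMessage(42); }));
```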

<p><s>Another difficulty is libraries. Ignoring the stupid idea of
passing code through the event API and evaling it, that single URL
must contain <em>all</em> the source code the worker will use as one
script. This means if you want to use any libraries you’ll need to
concatenate them with your script. That complicates things slightly,
but I imagine many people will be minifying their worker JavaScript
anyway.</s></p>

<p>Libraries can be loaded by the worker with the <code class="language-plaintext highlighter-rouge">importScripts()</code>
function, so not everything needs to be packed into one
script. Furthermore, workers can make HTTP requests with
XMLHttpRequest, so data doesn’t need to be embedded either. Note
that it’s probably worth making these requests synchronously (third
argument <code class="language-plaintext highlighter-rouge">false</code>), because blocking isn’t an issue in workers.</p>

<p>The other big change was the effect Google Chrome, especially its V8
JavaScript engine, had on the browser market. Browser JavaScript is
probably about two orders of magnitude faster than it was when I wrote
my previous post. It’s
<a href="http://youtu.be/UJPdhx5zTaw">incredible what the V8 team has accomplished</a>. Carefully written
JavaScript on V8 can beat out most other languages in performance.</p>

<p>Finally, I also now have much, much better knowledge of JavaScript
than I did four years ago. I’m not fumbling around like I was before.</p>

<h3 id="applying-these-changes">Applying these Changes</h3>

<p><a href="http://redd.it/178vsz">This weekend’s Daily Programmer challenge</a> was to find a “key” —
a permutation of the alphabet — that when applied to a small
dictionary results in the maximum number of words with their letters
in alphabetical order. That’s a keyspace of 26!, or
403,291,461,126,605,635,584,000,000.</p>
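
<p>As a sanity check on that figure, the keyspace size is easy to recompute
with arbitrary-precision integers (BigInt here, a feature much newer than
this post):</p>

```javascript
// Compute 26! with BigInt to confirm the size of the keyspace.
function factorial(n) {
    let result = 1n;
    for (let i = 2n; i <= n; i++) {
        result *= i;
    }
    return result;
}

console.log(factorial(26n).toString());
// 403291461126605635584000000
```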

<p>When I’m developing, I use both a laptop and a desktop simultaneously,
and I really wanted to put them both to work searching that huge space
for good solutions. Initially I was going to accomplish this by
writing my program in Clojure and running it on each machine. But what
about involving my wife’s computer, too? I wasn’t going to bother her
with setting up an environment to run my stuff. Writing it in
JavaScript as a web application would be the way to go. To coordinate
this work I’d use <a href="/blog/2012/08/20/">simple-httpd</a>. And so it was born,</p>

<ul>
  <li><a href="https://github.com/skeeto/key-collab">https://github.com/skeeto/key-collab</a></li>
</ul>

<p>Here’s what it looks like in action. Each tab open consumes one CPU
core, allowing users to control their commitment by choosing how many
tabs to keep open. All of those numbers update about twice per second,
so users can get a concrete idea of what’s going on. I think it’s fun
to watch.</p>

<p><a href="/img/screenshot/key-collab.png"><img src="/img/screenshot/key-collab-thumb.png" alt="" /></a></p>

<p>(I’m obviously a fan of blues and greens on my web pages. I don’t know why.)</p>

<p>I posted the server’s URL on reddit in the challenge thread, so
various reddit users from around the world joined in on the
computation.</p>

<h3 id="strict-mode">Strict Mode</h3>

<p>I had an accidental discovery with <a href="https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Functions_and_function_scope/Strict_mode">strict mode</a> and
Chrome. I’ve always figured using strict mode had an effect on the
performance of code, but had no idea how much. From the beginning, I
had intended to use it in my worker script. Being isolated already,
there are absolutely no downsides.</p>
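
<p>One concrete example of those promises: in strict mode, assigning to an
undeclared variable throws a <code class="language-plaintext highlighter-rouge">ReferenceError</code> instead of silently creating
a global, removing a whole class of dynamic behavior the engine would
otherwise have to accommodate:</p>

```javascript
'use strict';

// Strict mode turns a silent global leak into a ReferenceError.
let caught = null;
try {
    undeclaredVariable = 1;  // never declared anywhere
} catch (e) {
    caught = e;
}
console.log(caught instanceof ReferenceError);  // true
```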

<p>However, while I was developing and experimenting I accidentally
turned it off and left it off. It was left turned off for a short time
in the version I distributed to the clients, so I got to see how
things were going without it. When I noticed the mistake and
uncommented the <code class="language-plaintext highlighter-rouge">"use strict"</code> line, <strong>I saw a 6-fold speed boost in
Chrome</strong>. Wow! Just making those few promises to Chrome allowed it to
make some massive performance optimizations.</p>

<p>With Chrome moving at full speed, it was able to inspect 560 keys per
second on <a href="http://www.50ply.com/">Brian’s</a> laptop. I was getting about 300 keys per
second on my own (less-capable) computers. I haven’t been able to get
anything close to these speeds in any other language/platform (but I
didn’t try in C yet).</p>

<p>Furthermore, I got a noticeable speed boost in Chrome by using proper
object oriented programming, versus a loose collection of functions
and ad-hoc structures. I think it’s because it made me construct my
data structures consistently, allowing V8’s hidden classes to work
their magic. It also probably helped the compiler predict type
information. I’ll need to investigate this further.</p>

<p>Use strict mode whenever possible, folks!</p>

<h3 id="what-made-this-problem-work">What made this problem work?</h3>

<p>Having web workers available was a big help. However, this problem met
the original constraints fairly well.</p>

<ul>
  <li>
    <p>It was <strong>low bandwidth</strong>. No special per-client instructions were
required. The client only needed to report back a 26-character
string.</p>
  </li>
  <li>
    <p>There was <strong>no state</strong> to worry about. The original version of my
script tried keys at random. The later version used a hill-climbing
algorithm, so there was <em>some</em> state but it was only needed for a
few seconds at a time. It wasn’t worth holding onto.</p>
  </li>
</ul>
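
<p>For a concrete picture of the work each tab was doing: the fitness
function amounts to counting the words whose letters appear in
nondecreasing order under the key. This is a sketch of the idea, not the
actual key-collab code:</p>

```javascript
// Count dictionary words whose letters come out in nondecreasing order
// under the key, a 26-letter permutation of the alphabet.
// (Illustrative sketch only, not the code from the project.)
function score(key, words) {
    var rank = {};
    for (var i = 0; i < key.length; i++) rank[key[i]] = i;
    var count = 0;
    for (var j = 0; j < words.length; j++) {
        var word = words[j], ordered = true;
        for (var k = 1; k < word.length; k++) {
            if (rank[word[k]] < rank[word[k - 1]]) {
                ordered = false;
                break;
            }
        }
        if (ordered) count++;
    }
    return count;
}
```

<p>Under the identity key <code class="language-plaintext highlighter-rouge">abcdefghijklmnopqrstuvwxyz</code>, for example,
“ace” counts but “cab” does not.</p>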

<p>This project was a lot of fun so I hope I get another opportunity to
do it again in the future, hopefully with a lot more nodes
participating.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Moving to Openbox</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/06/25/"/>
    <id>urn:uuid:8e15a68e-5ad4-356b-7d5b-5c854e4c5302</id>
    <updated>2012-06-25T00:00:00Z</updated>
    <category term="rant"/><category term="git"/><category term="debian"/><category term="reddit"/>
    <content type="html">
      <![CDATA[<p>With <a href="/blog/2012/06/23/">my dotfiles repository established</a> I now
have a common configuration and environment for Bash, Git, Emacs
(separate repository), and even Firefox! This wouldn’t normally be
possible because Firefox doesn’t have tidy dotfiles by default, but
the wonderful <a href="/blog/2009/04/03/">Pentadactyl</a> made it possible. My
script sets up keybindings, bookmark keywords, and quickmarks so that
my browser feels identical across all my computers. Now that it’s easy
to add tweaks, I’m sure I’ll be putting more in there in the future.</p>

<p>However, one major application remained and I was really itching to
capture its configuration too, since even my web browser is part of
the experience. I could drop my dotfiles into a new computer within
minutes and be ready to start hacking, except for my desktop
environment. This was still a tedious, manual step, plagued by the
configuration propagation issue. I wouldn’t get too fancy with
keybindings since I couldn’t rely on them being everywhere.</p>

<p>The problem was I was using KDE at the time and KDE’s configuration
isn’t really version-friendly. Some of it is binary, making it
unmergeable; it doesn’t play well between different versions; and it’s
unclear what needs to be captured and what can be ignored.</p>

<p>I wasn’t exactly a <em>happy</em> KDE user and really felt no attachment to
it. I had only been using it a few months. I’ve used a number of
desktops since 2004, the main ones being Xfce (couple years), IceWM
(couple years), xmonad (8 months), and Gnome 2 (the rest of the
time). Gnome 2 was my fallback, the familiar environment where I could
feel at home and secure — that is, until Gnome 3 / Unity. The coming
of Gnome 3 marked the death of Gnome 2. It became harder and harder to
obtain version 2 and I lost my fallback.</p>

<p>I gave Gnome 3 and Unity each a couple of weeks but I just couldn’t
stand them. Unremovable mouse hotspots, all new alt-tab behavior,
regular crashing (after restoring old alt-tab behavior), and extreme
unconfigurability even with a third-party tweak tool. I jumped for KDE
4, hoping to establish a comfortable fallback for myself.</p>

<p>KDE is pretty and configurable enough for me to get work done. There’s
a lot of bloat (“activities” and widgets), but I can safely ignore
it. The areas where it’s lacking didn’t bother me much, like the
inability/non-triviality of custom application launchers.</p>

<p>My short time with Gnome 3 and now with KDE 4 did herald a new, good
change to my habits: keyboard application launching. I got used to
using the application menu to type my application name and launch
it. I <em>did</em> use dmenu during my xmonad trial, but I didn’t quite make
a habit out of it. It was also on a slower computer, slow enough for
dmenu to be a problem. For years I was just launching things from a
terminal. However, the Gnome and KDE menus both have a big common
annoyance. If you want to add a custom item, you need to write a
special desktop file and save it to the right location. Bleh! dmenu
works right off your <code class="language-plaintext highlighter-rouge">PATH</code> — the way it <em>should</em> work — so no
special work needed.</p>

<p>Gnome 2 <em>has</em> been revived with a fork called MATE, but with the lack
of a modern application launcher, I’m now too spoiled to be
interested. Plus I wanted to find a suitable environment that I could
integrate with my dotfiles repository.</p>

<p>After being a little embarrassed at
<a href="http://www.terminally-incoherent.com/blog/2012/05/18/show-me-your-desktop-4/">Luke’s latest <em>Show Me Your Desktop</em></a>
(what kind of self-respecting Linux geek uses a heavyweight desktop?!)
I shopped around for a clean desktop environment with a configuration
that would version properly. Perhaps I might find that perfect desktop
environment I’ve been looking for all these years, if it even
exists. It wasn’t too long before I ended up with Openbox. I’m pleased
to report that I’m exceptionally happy with it.</p>

<p>Its configuration is two XML files and a shell script. The XML can be
generated by a GUI configuration editor and/or edited by hand. The GUI
was nice for quickly seeing what Openbox could do when I first logged
into it, so I <em>did</em> use it once and found it useful. The configuration
is very flexible too! I created keyboard bindings to slosh windows
around the screen, resize them, move them across desktops, maximize in
only one direction, change focus in a direction, and launch specific
applications (for example super-n launches a new terminal
window). It’s like the perfect combination of tiling and stacking
window managers. Not only is it more configurable than KDE, but it’s
done cleanly.</p>

<p>Openbox is pretty close to the perfect environment I want. There are
still some annoying little bugs, mostly related to window positioning,
but they’ve mostly been fixed. The problem is that they haven’t made
an official release for a year and a half, so these fixes aren’t yet
available. I might normally think to myself, “Why haven’t I been using
Openbox for years?” but I know better than that. Versions of Openbox
from just two years ago, like the one in Debian Squeeze (the current
stable), <em>aren’t very good</em>. So I haven’t actually been missing out on
anything. This is something really new.</p>

<p>I’m not using a desktop environment on top of Openbox, so there are no
panels or any of the normal stuff. This is perfectly fine for me; I
have better things to spend that real estate on. I <em>am</em> using a window
composite manager called <code class="language-plaintext highlighter-rouge">xcompmgr</code> to make things pretty through
proper transparency and subtle drop shadows. Without panels, there
were a couple problems to deal with. I was used to my desktop
environment performing removable drive mounting and wireless network
management for me, so I needed to find standalone applications to do
the job.</p>

<p>Removable filesystems can be mounted the old-fashioned way, where I
create a mount point, find the device name, then mount the device on
the mount point as root. This is annoying and unacceptable after
experiencing automounting for years. I found two applications to do
this: Thunar, Xfce’s file manager; and <code class="language-plaintext highlighter-rouge">pmount</code>, a somewhat-buggy
command-line tool.</p>

<p>I chose Wicd to do network management. It has both a GTK client and an
ncurses client, so I can easily manage my wireless network
connectivity with and without a graphical environment — something I
could have used for years now (goodbye <code class="language-plaintext highlighter-rouge">iwconfig</code>)! Unfortunately Wicd
is rigidly inflexible, allowing only one network interface to be up at
a time. This is a problem when I want to be on both a wired and
wireless network at the same time. For example, sometimes I use my
laptop as a gateway between a wired and wireless network. In these
cases I need to shut down Wicd and go back to manual networking for
awhile.</p>

<p>The next issue was wallpapers. I’ve always liked having
<a href="http://reddit.com/r/EarthPorn">natural landscape wallpapers</a>. So far,
I could move onto a new computer and have everything functionally
working, but I’d have a blank gray background. KDE 4 got me used to
slideshow wallpaper, changing the landscape image to a new one every
10-ish minutes. For a few years now, I’ve made a habit of creating a
<code class="language-plaintext highlighter-rouge">.wallpapers</code> directory in my home directory and dumping interesting
wallpapers in there as I come across them. When picking a new
wallpaper, or telling KDE where to look for random wallpapers, I’d
grab one from there. I’ve decided to continue this with my dotfiles
repository.</p>

<p>I wrote a shell script that uses <code class="language-plaintext highlighter-rouge">feh</code> to randomly set the root
(wallpaper) image every 10 minutes. It gets installed in <code class="language-plaintext highlighter-rouge">.wallpapers</code>
from the dotfiles repository. Openbox runs this script in the
background when it starts. I don’t actually store the hundreds of
images in my repository. There’s a <code class="language-plaintext highlighter-rouge">fetch.sh</code> that grabs them all from
Amazon S3 automatically. This is just another small step I take after
running the dotfiles install script. Any new images I throw in
<code class="language-plaintext highlighter-rouge">.wallpapers</code> get put into the rotation, but only for that computer.</p>

<p>I’ve now got all this encoded into my configuration files and checked
into my dotfiles repository. It’s <em>incredibly</em> satisfying to have this
in common across each of my computers and to have it instantly
available on any new installs. I’m that much closer to having <em>the</em>
ideal (and ultimately unattainable) computing experience!</p>
]]>
    </content>
  </entry>
  <entry>
    <title>Making Your Own GIF Image Macros</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/04/10/"/>
    <id>urn:uuid:dc4ca81c-6c35-33f6-58c5-a77a645f3fbf</id>
    <updated>2012-04-10T00:00:00Z</updated>
    <category term="media"/><category term="video"/><category term="tutorial"/><category term="reddit"/>
    <content type="html">
      <![CDATA[<p>This tutorial is very similar to my <a href="/blog/2011/11/28/">video editing tutorial</a>.
That’s because the process is the same up until the encoding stage,
where I encode to GIF rather than WebM.</p>

<p>So you want to make your own animated GIFs from a video clip? Well,
it’s a pretty easy process that can be done almost entirely from the
command line. I’m going to show you how to turn the clip into a GIF
and add an image macro overlay. Like this,</p>

<p><img src="https://s3.amazonaws.com/nullprogram/calvin/calvin-macro.gif" alt="" /></p>

<p>The key tool here is going to be Gifsicle, a very excellent
command-line tool for creating and manipulating GIF images. So, the
full list of tools is,</p>

<ul>
  <li><a href="http://www.mplayerhq.hu/">MPlayer</a></li>
  <li><a href="http://www.imagemagick.org/">ImageMagick</a></li>
  <li><a href="http://www.gimp.org/">GIMP</a></li>
  <li><a href="http://www.lcdf.org/gifsicle/">Gifsicle</a></li>
</ul>

<p>Here’s the source video for the tutorial. It’s an awkward video my
wife took of our confused cats, Calvin and Rocc.</p>

<video src="https://s3.amazonaws.com/nullprogram/calvin/calvin-dummy.webm" width="480" height="360" controls="controls">
</video>

<p>My goal is to cut after Calvin looks at the camera, before he looks
away. From roughly 3 seconds to 23 seconds. I’ll have mplayer give me
the frames as JPEG images.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mplayer -vo jpeg -ss 3 -endpos 23 -benchmark calvin-dummy.webm
</code></pre></div></div>

<p>This tells mplayer to output JPEG frames between 3 and 23 seconds,
doing it as fast as it can (<code class="language-plaintext highlighter-rouge">-benchmark</code>). This output almost 800
images. Next I look through the frames and delete the extra images at
the beginning and end that I don’t want to keep. I’m also going to
throw away the even-numbered frames, since GIFs can’t have such a high
framerate in practice.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rm *[02468].jpg
</code></pre></div></div>

<p>There’s also dead space around the cats in the image that I want to
crop. Looking at one of the frames in GIMP, I’ve determined this is a
450 by 340 box, with the top-left corner at (136, 70). We’ll need
this information for ImageMagick.</p>

<p>Gifsicle only knows how to work with GIFs, so we need to batch convert
these frames with ImageMagick’s <code class="language-plaintext highlighter-rouge">convert</code>. This is where we need the
crop dimensions from above, which is given in ImageMagick’s notation.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ls *.jpg | xargs -I{} -P4 \
    convert {} -crop 450x340+136+70 +repage -resize 300 {}.gif
</code></pre></div></div>

<p>This will do four images at a time in parallel. The <code class="language-plaintext highlighter-rouge">+repage</code> is
necessary because ImageMagick keeps track of the original image
“canvas”, and it will simply drop the section of the image we don’t
want rather than completely crop it away. The repage forces it to
resize the canvas as well. I’m also scaling it down slightly to save
on the final file size.</p>

<p>We have our GIF frames, so we’re almost there! Next, we ask Gifsicle
to compile an animated GIF.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gifsicle --loop --delay 5 --dither --colors 32 -O2 *.gif &gt; ../out.gif
</code></pre></div></div>

<p>I’ve found that using 32 colors and dithering the image gives very
nice results at a reasonable file size. Dithering adds noise to the
image to remove the banding that occurs with small color palettes.
I’ve also instructed it to optimize the GIF as fully as it can
(<code class="language-plaintext highlighter-rouge">-O2</code>). If you’re just experimenting and want Gifsicle to go faster,
turning off dithering goes a long way, followed by disabling
optimization.</p>

<p>The delay of 5 is measured in hundredths of a second per frame, close
to the 15-ish frames per second we want, since we cut half the frames
from a 30 frames-per-second source video. We also want to loop
indefinitely.</p>

<p><img src="https://s3.amazonaws.com/nullprogram/calvin/calvin-dummy.gif" alt="" /></p>

<p>The result is this 6.7 MB GIF. A little large, but good enough. It’s
basically what I was going for. Next we add some macro text.</p>

<p>In GIMP, make a new image with the same dimensions of the GIF frames,
with a transparent background.</p>

<p><img src="/img/gif-tutorial/blank.png" alt="" /></p>

<p>Add your macro text in white, in the Impact Condensed font.</p>

<p><img src="/img/gif-tutorial/text1.png" alt="" /></p>

<p>Right click the text layer and select “Alpha to Selection,” then under
Select, grow the selection by a few pixels — 3 in this case.</p>

<p><img src="/img/gif-tutorial/text2.png" alt="" /></p>

<p>Select the background layer and fill the selection with black, giving
a black border to the text.</p>

<p><img src="/img/gif-tutorial/text3.png" alt="" /></p>

<p>Save this image as text.png, for our text overlay.</p>

<p><img src="/img/gif-tutorial/text.png" alt="" /></p>

<p>Time to go back and redo the frames, overlaying the text this
time. This is called compositing and ImageMagick can do it without
breaking a sweat. To composite two images is simple.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>convert base.png top.png -composite out.png
</code></pre></div></div>

<p>List the image to go on top, then use the <code class="language-plaintext highlighter-rouge">-composite</code> flag, and it’s
placed over top of the base image. In my case, I actually don’t want
the text to appear until Calvin, the orange cat, faces the camera.
This happens quite conveniently at just about frame 500, so I’m only
going to redo those frames.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ls 000005*.jpg | xargs -I{} -P4 \
    convert {} -crop 450x340+136+70 +repage \
               -resize 300 text.png -composite {}.gif
</code></pre></div></div>

<p>Run Gifsicle again and this 6.2 MB image is the result. The text
overlay compresses better, so it’s a tiny bit smaller.</p>

<p><img src="https://s3.amazonaws.com/nullprogram/calvin/calvin-macro.gif" alt="" /></p>

<p>Now it’s time to <a href="http://old.reddit.com/r/funny/comments/s481d/">post it on reddit</a> and
<a href="http://old.reddit.com/r/lolcats/comments/s47qa/">reap that tasty, tasty karma</a>.
(<a href="http://imgur.com/2WhBf">Over 400,000 views!</a>)</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Rumor Simulation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/03/09/"/>
    <id>urn:uuid:9fee2022-d273-34d6-0970-546b5e875460</id>
    <updated>2012-03-09T00:00:00Z</updated>
    <category term="java"/><category term="math"/><category term="media"/><category term="video"/><category term="reddit"/>
    <content type="html">
      <![CDATA[<p>A couple months ago someone posted
<a href="http://old.reddit.com/r/javahelp/comments/ngvp4/">an interesting programming homework problem</a> on reddit,
asking for help. Help had already been provided before I got there,
but I thought the problem was an interesting one.</p>

<blockquote>
  <p>Write a program that simulates the spreading of a rumor among a group
of people. At any given time, each person in the group is in one of
three categories:</p>

  <ul>
    <li>IGNORANT - the person has not yet heard the rumor</li>
    <li>SPREADER - the person has heard the rumor and is eager to spread it</li>
    <li>STIFLER - the person has heard the rumor but considers it old news
and will not spread it</li>
  </ul>

  <p>At the very beginning, there is one spreader; everyone else is
ignorant. Then people begin to encounter each other.</p>

  <p>So the encounters go like this:</p>

  <ul>
    <li>If a SPREADER and an IGNORANT meet, IGNORANT becomes a SPREADER.</li>
    <li>If a SPREADER and a STIFLER meet, the SPREADER becomes a STIFLER.</li>
    <li>If a SPREADER and a SPREADER meet, they both become STIFLERS.</li>
    <li>In all other encounters nothing changes.</li>
  </ul>

  <p>Your program should simulate this by repeatedly selecting two people
randomly and having them “meet.”</p>

  <p>There are three questions we want to answer:</p>

  <ul>
    <li>Will everyone eventually hear the rumor, or will it die out before
everyone hears it?</li>
    <li>If it does die out, what percentage of the population hears it?</li>
    <li>How long does it take? i.e. How many encounters occur before the
rumor dies out?</li>
  </ul>
</blockquote>
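
<p>The rules above are compact enough to sketch directly. This is only an
illustrative JavaScript version, not code from my project:</p>

```javascript
// Illustrative sketch of the rumor model: people meet uniformly at
// random until no spreaders remain. (Not the code from the project.)
function simulate(n) {
    var IGNORANT = 0, SPREADER = 1, STIFLER = 2;
    var people = new Array(n).fill(IGNORANT);
    people[0] = SPREADER;  // one spreader at the very beginning
    var spreaders = 1, meetups = 0;
    while (spreaders > 0) {
        var a = Math.floor(Math.random() * n);
        var b = Math.floor(Math.random() * n);
        if (a === b) continue;  // a person cannot meet themselves
        meetups++;
        if (people[a] === SPREADER && people[b] === IGNORANT) {
            people[b] = SPREADER; spreaders++;
        } else if (people[b] === SPREADER && people[a] === IGNORANT) {
            people[a] = SPREADER; spreaders++;
        } else if (people[a] === SPREADER && people[b] === SPREADER) {
            people[a] = people[b] = STIFLER; spreaders -= 2;
        } else if (people[a] === SPREADER && people[b] === STIFLER) {
            people[a] = STIFLER; spreaders--;
        } else if (people[b] === SPREADER && people[a] === STIFLER) {
            people[b] = STIFLER; spreaders--;
        }  // in all other encounters nothing changes
    }
    var heard = people.filter(function (p) { return p !== IGNORANT; });
    return { n: n, meetups: meetups, knowing: heard.length / n };
}

console.log(simulate(10000));  // knowing lands near 0.8
```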

<p>I wrote a very thorough version to <a href="/blog/2011/11/28/">produce videos</a> of the
simulation in action.</p>

<ul>
  <li><a href="https://github.com/skeeto/rumor-sim">https://github.com/skeeto/rumor-sim</a></li>
</ul>

<p>It accepts some command line arguments, so you don’t need to edit any
code just to try out some simple things.</p>

<p>And here are a couple of videos. Each individual is a cell in a 2D
grid. IGNORANT is black, SPREADER is red, and STIFLER is white. Note
that this is <em>not</em> a cellular automaton, because cell neighborhood does
not come into play.</p>

<video src="https://s3.amazonaws.com/nullprogram/rumor/rumor-small.webm" controls="controls" width="400" height="250">
</video>

<video src="https://s3.amazonaws.com/nullprogram/rumor/rumor.webm" controls="controls" width="400" height="400">
</video>

<p>Here are the statistics for ten different rumors.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Rumor(n=10000, meetups=132380, knowing=0.789)
Rumor(n=10000, meetups=123944, knowing=0.7911)
Rumor(n=10000, meetups=117459, knowing=0.7985)
Rumor(n=10000, meetups=127063, knowing=0.79)
Rumor(n=10000, meetups=124116, knowing=0.8025)
Rumor(n=10000, meetups=115903, knowing=0.7952)
Rumor(n=10000, meetups=137222, knowing=0.7927)
Rumor(n=10000, meetups=134354, knowing=0.797)
Rumor(n=10000, meetups=113887, knowing=0.8025)
Rumor(n=10000, meetups=139534, knowing=0.7938)
</code></pre></div></div>

<p>Except for very small populations, the simulation always terminates
very close to 80% rumor coverage. I don’t understand (yet) why this
is, but I find it very interesting.</p>
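<p>
As an aside: this is a known result for rumor models of this family
(the Daley–Kendall and Maki–Thompson models), where the limiting
coverage θ satisfies θ = 1 − e<sup>−2θ</sup>, whose nonzero root is
roughly 0.797 — in line with the runs above. A quick numeric check,
via fixed-point iteration (a sketch, assuming that analysis applies
here):
</p>

```python
import math

# Fixed-point iteration on theta = 1 - exp(-2*theta).
# The map is a contraction near the root, so iteration converges.
theta = 0.5
for _ in range(100):
    theta = 1 - math.exp(-2 * theta)
print(round(theta, 4))  # ~0.7968
```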

]]>
    </content>
  </entry>
  <entry>
    <title>Common Lisp Quick Reference</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2010/02/06/"/>
    <id>urn:uuid:d74ba1a7-25bb-37f4-7b2e-46d2d56bfe36</id>
    <updated>2010-02-06T00:00:00Z</updated>
    <category term="lisp"/><category term="link"/><category term="reddit"/>
    <content type="html">
      <![CDATA[<!-- 6 February 2010 -->
<p>
I found this <a href="http://clqr.berlios.de/">Common Lisp Quick
Reference</a> the other day
from <a href="http://old.reddit.com/r/lisp/">r/lisp</a>, and I think
it's <i>fantastic</i>. It's a comprehensive, libre booklet of the
symbols defined by the Common Lisp ANSI standard. Very slick!
</p>
<p>
The main version is meant to be printed out and nested with a vertical
fold, and it works quite well. If I ever get a chance to use Common
Lisp at work (a man can dream), probably at a location without
Internet access, this could come in handy. So I printed out one for
myself:
</p>
<p class="center">
<a href="/img/clqr/front.jpg"><img src="/img/clqr/front-thumb.jpg" alt=""/></a>
<a href="/img/clqr/open.jpg"><img src="/img/clqr/open-thumb.jpg" alt=""/></a>
</p>
]]>
    </content>
  </entry>
</feed>
