null program

The Elfeed Database

The design of Elfeed's database took some experimentation before any part of it was settled. A major design constraint was Emacs' very limited file input/output. There's no random access and, without the aid of an external program, files must always be read and written wholesale. That's not database-friendly at all! In the end I settled on a design that minimized the size of the frequently rewritten parts, an index with two different data models, by storing immutable data in a loose-file, content-addressable database.

At the moment there really aren't any pure-Elisp database solutions for Emacs. This is almost certainly due to the aforementioned I/O limitations. I ran into this same problem last year when I created an Emacs pastebin server. I attempted, and failed, to interface with a SQLite database through it's command line program. Nic Ferrier has published a generic database interface, but it lacks concrete implementations.

As a bit of good news, as far as I know Emacs does properly handle atomic file updates across all platforms, so a pure-Elisp database developer would never have to worry about only writing half the database. It's always a safe operation. Worst case scenario you're left with an old version of data rather than no data at all.

A real possibility for a database would be connecting to an established database server via TCP with an Emacs network process. If the server has a specified wire protocol Elisp could talk to it efficiently. In fact, there's exists pg.el that does exactly this for PostgreSQL. Unfortunately I was not able to get this working with my pastebin, nor is this solution appropriate for Elfeed. It would be unreasonable to require users to first set up a PostgreSQL server just to read web feeds!

Ultimately it would seem that any efficient Emacs database requires the help of an external program. The notmuch mail client, which inspired Elfeed, does this. To access the notmuch database a command line program is run once for each request. A query is passed as a program argument and the output of the program is parsed into the result.

The Early Database

For the first few days of its existence Elfeed only had an in-memory database. Closing Emacs would lose everything. For my personal usage patterns, where I read, or at least address, all entries that arrive -- and especially because I use Elfeed on a couple of different computers -- I don't really need to track things long term. I could easily mark everything after a certain date as read and forget about them. However, it would be nice to have and, more importantly, many people wouldn't use Elfeed without persistence between Emacs sessions.

So, for the first database I did what I always do: dumped the data structure to a file using the printer and parsed it back in later using the reader. This is dead simple in Lisp, it's very fast, and it even works for circular data structures. It's something I missed so much with the much-less-capable JSON format earlier this year that I wrote a JavaScript library to do it.

(defun save-data (file data)
  (with-temp-file file
    (let ((standard-output (current-buffer))
          (print-circle t))  ; Allow circular data
      (prin1 data))))

(defun load-data (file)
  (with-temp-buffer
    (insert-file-contents file)
    (read (current-buffer))))

(save-data "demo.dat" '(a b c ["1" 2 3]))
(load-data "demo.dat")
;; => (a b c ["1" 2 3])

Anything with a printed representation can be serialized and stored this way, including symbols, string, numbers, lists, vectors (structs, objects), hash tables, and even compiled functions (.elc files). Basically every Emacs library that stores data on disk uses this technique.

Unfortunately, this is where I hit another serious database constraint: print-circle is broken in Emacs 24.3, the current stable release. This means Elfeed cannot take advantage of this useful feature, at least not for a long time, as I had been counting on. The final database is slightly slower and larger than strictly required as a result.

The Content Database

After breaking the circular references of the in-memory database I finally had persistence for the first time. With the naive printer/reader approach it was slow, almost 1 second to write just a few thousand entries on my 6-year-old laptop (my minimum requirements target machine). I wanted Elfeed to support hundreds of thousands of entries, if not millions, so this was much too slow.

The big slowdown was writing out all the entry content each time the database is saved. These large strings containing HTML that rarely change. There's no reason to write these out every time, nor is there a reason to even keep them in memory all the time, as it's rarely accessed. The solution is a loose-file, content-addressable database, very similar to an unpacked Git object database.

The content database stores immutable sequences of characters -- not just raw bytes, but rather multibyte strings -- using an unspecified coding system (right now it's UTF-8 for all platforms). The filename for the content is the content hashed with SHA-1 ("content-addressable"). To limit the number of files per directory, these files are stored in subdirectories named by the first hex-encoded byte of the hash (just like Git). A database of 4 items might look like this:

data/
   18/
      18ff6f11945b1e9f3e3c4cae8b5275d36b9944e1
      184c06a83f0bc73a8345c6d886f9043bcae095f8
   6b/
      6b59ae257f2bea24703d8adf5747049c138dfc82
   cc/
      cc47d53872ae2a9186151ef1a68392a94e1f091f

Something really neat about the content database is that it's completely agnostic about Elfeed. If it weren't for Elfeed's garbage collector, anyone could use it to store arbitrary content. The function elfeed-ref accepts a string and returns a reference into the database. Because of the hash, providing the same string in the future will return the same reference without actually performing a write. References are dereferenced with elfeed-deref.

(setf ref (elfeed-ref "Hello, world!"))
;; => [cl-struct-elfeed-ref "943a702d06f34599aee1f8da8ef9f7296031d699"]

(elfeed-deref ref)
;; => "Hello, world"

With content stored elsewhere, entries are a struct containing only some small metadata: title, link, date, and a content database reference. Writing out many of them at once is much, much faster.

I don't expect it happens often, but this also means content is de-duplicated. If two entries happen to have the same content they'll share content database storage. A small savings.

At this point it's really tempting to get fancier and really put this content database to use. The core index itself could be stored as raw content, and the root to accessing the database would be a single SHA-1 hash referencing it -- again, very similar to Git. If an index stores a reference to the previously written index, then the the Elfeed database would be an immutable structure tracking its entire history. Such a change would cost virtually nothing in performance, just disk space.

Multiple Representations

With all the content out of the way, the database is now just a lean index. At this point it's a hash table mapping feed IDs to feeds. Feeds contain a list of its entries. To build the entry listing for the elfeed-search buffer, Elfeed needs to visit each feed in the hash table, gather its entries into one giant list, then finally sort that list by date. At around O(n log n), that sort operation is a real performance killer. Completely unacceptable. To fix this we need to think about how the data is updated and used.

First, entries are always viewed in date order, no exceptions. From my experience of using web feeds for the last six years I never had a reason to list feed entries by any other order. The vast majority of the time, newer entries are most relevant, and if I need to look for something specific I can search for it.

We definitely want to store entries in date-order so we can create entry listings without performing a sort: something around O(n) or so. Inserting new entries into this structure should also be efficient.

Second, entries are never removed from the database. This isn't e-mail. Even if a user doesn't want to see an entry again, we have to keep track of it. Otherwise it will show up as new if it's discovered in a feed again, which is likely. Things are added to the database and never removed. In Elfeed, I use a junk tag to completely hide entries I don't want to see, and I always have a -junk element in my filter.

There's an important caveat to this one that I had missed until after the public release: entry dates can change! When a previously discovered entry is read from a feed, Elfeed updates (read: mutates) the entry struct to reflect the new state. This includes the date. It's very likely that a date-sorted representation won't tolerate date changes underneath it since it's keying off of them. Either we refuse to update the entry date, or we remove the entry, update the date, and then re-insert it (how it currently works).

Third, entries are generally added with a recent date. After the database is initially populated, it's only picking up new items. We should prefer adding recently-dated entries be faster than adding older entries. I didn't get a chance to take advantage of this, but it's something to keep in mind.

Fourth, entries need to be keyed by an ID string. Each entry has a unique, unchanging identifier string, either provided by the feed itself (RSS's guid or Atom's id) or generated intelligently by Elfeed. Especially because of the print-circle bug, we need to be able to talk about feeds in terms of their ID -- an indirect pointer.

(Actually, even when RSS guid tags are present, they're permalinks by default. So, unfortunately, RSS IDs are not at all resistant to collisions across feeds. To work around this, entry identifiers are a pair of strings: feed ID and entry ID. Atom doesn't have this problem, but we're stuck with the lowest common denominator.)

A date-oriented representation would be unable to efficiently look up an entry by its ID, so it needs to be supplemented by an ID-oriented representation. This means we need two representations in our database: date-oriented and ID-oriented.

So what do we use? Well, for keeping entries sorted by date we want some sort of balanced tree. A B-tree is probably a good choice. Rather than write one I went with an AVL tree since Emacs comes with a library for it (avl-tree). It's already debugged and optimized! The bad news is that the internal structure is unspecified, so there are no guarantees that it can be serialized. A future update to the library may break the Elfeed database. I also had to hack into it to work around a security issue. The comparison function is embedded in the tree. After deserializing the database, Elfeed needs to ensure that no one stuck a malicious function in there.

The choice for an ID database was super-easy: a hash table. Due to the print-circle bug, this is actually the main representation. The AVL tree only stores IDs and it has to reach into the hash table to do any date comparisons. If print-circle was working I could store the same exact entry objects in the AVL tree as the hash table, so mutating them would update them in all representations. However, with print-circle off, on deserialization these would become unique objects and updates would break.

The Future

That's where the database is today. I put in a few extra fields that aren't actually used yet, so that there's room to make a few changes without breaking the database. Perhaps someday I'll work out a whole new database structure, or maybe a proper database library will come into existence, and this post will simply document the old database.

tags: [ emacs elfeed ]
blog comments powered by Disqus
Fork me on GitHub