Atom vs. RSS
From working on Elfeed, I’ve recently become fairly intimate with the Atom and RSS specifications. I needed to write a parser for each that would properly handle valid feeds but would also reasonably handle all sorts of broken feeds that it would come across. At this point I’m quite confident in saying that Atom is by far the better specification and I really wish RSS didn’t exist. This isn’t surprising: Atom was created specifically in response to RSS’s flawed and ambiguous specification.
One consequence of this realization is that I’ve added an Atom feed to this blog and made it the primary feed. Because so many people are still using the RSS feed, it will continue to be supported even though there are no longer links to it (Ha, try to find it now!). You may have noticed that I also started including the full post body in my feed entries. Now that my feed usage habits have changed, I felt that truncating content was actually rather rude. There’s still the issue that it contains relative URLs, but I’m not aware of any way to fix this with Jekyll. I also got a lot more precise with dates. Until recently, all posts occurred at midnight PST on the post date.
For reference, here are the specifications. Just these two documents cover about 99% of the web feeds out there.
Not that it matters too much, but it’s unfortunate that RSS has sort of “won” this format war. Of the feeds that I follow, about 75% are RSS and 25% are Atom. That’s still a significant number of web feeds and Atom is well-supported by all the clients that I’m aware of, so it’s in no danger of falling out of use. The broken (but still valid) RSS feeds I’ve come across probably wouldn’t be broken if they were originally created as Atom feeds. Atom is a stricter standard and, therefore, would have guided these authors to create their feeds correctly from the start. RSS encourages authors to do the wrong thing.
The Flaws of RSS
For reference, here’s a typical, friendly RSS 2.0 feed.
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example RSS Feed</title>
    <item>
      <title>Example Item</title>
      <description>A summary.</description>
      <link>http://www.example.com/foo</link>
      <guid>http://www.example.com/foo</guid>
      <pubDate>Mon, 23 Sep 2013 03:00:05 GMT</pubDate>
    </item>
  </channel>
</rss>
guid, the misnomer
Two of the biggest RSS flaws, flaws that forced me to make a major
design compromise when writing Elfeed, have to do with the guid
tag. That’s GUID, as in Globally Unique Identifier. Not only did it not
appear until RSS 2.0, but the guid tag is not required. In
practice an RSS client will be rereading the same feed items over and
over, so it’s critical that it’s able to identify what items it’s seen
before.
Without a guid tag it’s up to the client to guess what items have been
seen already, and there’s no guidance in the specification for doing
so. Without a guid tag, some clients use contents of the link tag as
an identifier (Elfeed, The Old Reader). In practice it’s very unlikely
for two unique items to have the same link. Other clients track the
entire contents of the item, so when any part changes, such as the
description, it’s treated as a brand new item (Liferea). Some
guid-less feeds regularly change parts of their items, so they’re not
handled well by the latter clients. It’s a mess.
In contrast, Atom’s id element is required. If someone doesn’t have
one you can send them angry e-mails for having an invalid feed.
The bigger flaw of the guid tag is that, by default, guid tag
content is not actually a GUID! This was a huge oversight by the
specification’s authors. By default, the content of the guid tag
must be a permanent URL. Only if the isPermaLink attribute is set
to false can it actually be a GUID (but even that’s unlikely). If two
different feeds contain items that link to content with the same
permalink then that “GUID” is obviously no longer unique. Two unique
items have the same “unique” ID. Doh! Even if the guid tag were
required, I still couldn’t rely on it in Elfeed.
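To make the default concrete, here’s a rough Python sketch (not Elfeed’s actual code) of how a client might read the guid tag. The function name classify_guid is made up for illustration. The only rule it encodes is the one above: the content counts as a permalink unless isPermaLink is explicitly set to false.

```python
import xml.etree.ElementTree as ET

def classify_guid(item):
    """Return (text, is_permalink) for an RSS 2.0 item, or None if it has no guid."""
    guid = item.find("guid")
    if guid is None or not (guid.text or "").strip():
        return None
    # When the attribute is absent, the guid is treated as a permalink.
    is_permalink = guid.get("isPermaLink", "true").strip().lower() == "true"
    return guid.text.strip(), is_permalink

item = ET.fromstring("<item><guid>http://www.example.com/foo</guid></item>")
print(classify_guid(item))  # ('http://www.example.com/foo', True)
```

Either way, the client only learns whether the value is a URL. It learns nothing about whether another feed reuses the same value.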
In contrast, Atom’s id element must contain an Internationalized
Resource Identifier (IRI). The specification requires this identifier
to be permanent and universally unique.
Unlike Atom, RSS feeds themselves also don’t have identifiers. Due to RSS guids never actually being GUIDs, in order to uniquely identify feed entries in Elfeed I have to use a tuple of the feed URL and whatever identifier I can gather from the entry itself. It’s a lot messier than it should be.
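Here’s a minimal Python sketch of that tuple, under a few assumptions. The name entry_key is hypothetical, and the final fallback (hashing the whole item) is my own addition for illustration rather than something Elfeed necessarily does. The guid-then-link order follows the behavior described earlier.

```python
import hashlib
import xml.etree.ElementTree as ET

def entry_key(feed_url, item):
    """Build a (feed URL, entry identifier) key for an RSS item."""
    for tag in ("guid", "link"):
        text = (item.findtext(tag) or "").strip()
        if text:
            return (feed_url, text)
    # Assumed last resort: hash the serialized item. Any edit to the
    # item changes this key, so the entry would show up as new.
    return (feed_url, hashlib.sha1(ET.tostring(item)).hexdigest())

item = ET.fromstring(
    "<item><title>Example Item</title>"
    "<link>http://www.example.com/foo</link></item>"
)
key = entry_key("http://www.example.com/feed.xml", item)
print(key)  # ('http://www.example.com/feed.xml', 'http://www.example.com/foo')
```

Because the feed URL is part of the key, the same item served from two different URLs ends up as two separate entries.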
In a purely Atom world, the GUID alone would be enough to identify an entry and the feed URL wouldn’t matter for identification: I wouldn’t care where the feed came from, just what it’s called. If the same feed was hosted at two different URLs, a user could list both, the second appearance acting as a backup mirror, and Elfeed would merge them effortlessly.
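As a sketch of what that merging could look like, assuming every entry carries a proper id, the Python below keys entries on the Atom id alone. The function merge_entries and the urn:example identifiers are invented for the example.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def merge_entries(*feed_documents):
    """Merge entries from several Atom documents, keyed by atom:id alone."""
    merged = {}
    for document in feed_documents:
        root = ET.fromstring(document)
        for entry in root.iter(ATOM + "entry"):
            entry_id = (entry.findtext(ATOM + "id") or "").strip()
            if entry_id:
                # The first copy seen wins; a mirror only fills in gaps.
                merged.setdefault(entry_id, entry)
    return merged

primary = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<id>urn:example:feed</id>"
    "<entry><id>urn:example:a</id><title>A</title></entry>"
    "</feed>"
)
mirror = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<id>urn:example:feed</id>"
    "<entry><id>urn:example:a</id><title>A</title></entry>"
    "<entry><id>urn:example:b</id><title>B</title></entry>"
    "</feed>"
)
merged = merge_entries(primary, mirror)
print(sorted(merged))  # ['urn:example:a', 'urn:example:b']
```

Notice the feed URL never appears anywhere in the key.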
pubDate, the incorrectly specified
RSS didn’t have any sort of date tag until version 2.0! A standard specifically oriented around syndication sure took a long time to have date information. Before 2.0 the workaround was to pull in a date tag from another XML namespace, such as Dublin Core.
In contrast, Atom has always had published and updated tags for
communicating date information.
Finally, in RSS 2.0, dates arrived in the form of the pubDate tag.
For some reason the name “date” wasn’t good enough so they went with
this ugly camel-case name. Despite all the extra time, they still
screwed this part up. The specification says that dates must conform
to the outdated RFC 822, then provides examples that
aren’t RFC 822 dates! Doh! The examples use 4-digit years, while
RFC 822 only allows for 2-digit years, so no one should be using it
anymore. The RSS authors unwittingly created yet another date
specification, a mash-up of RFC 822 and its successor. In practice
everyone just pretends RSS uses RFC 2822, which superseded RFC 822.
In contrast, Atom consistently uses RFC 3339 dates, along with a couple of additional restrictions. These dates are much simpler to parse than RFC 2822, which is complex because it attempts to be backwards compatible with RFC 822.
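To show the difference in parsing effort, here’s a small Python sketch using only the standard library. The function names are made up. It assumes the RSS date is really RFC 2822 (the usual pretense) and that the Atom date follows Atom’s extra rules of an uppercase T and an uppercase Z. The Z is rewritten as a numeric offset because older Python versions don’t accept it in fromisoformat, and they are also pickier about fractional seconds.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_rss_date(text):
    """Parse an RSS pubDate, treating it as an RFC 2822 date."""
    return parsedate_to_datetime(text.strip())

def parse_atom_date(text):
    """Parse an Atom (RFC 3339) timestamp."""
    text = text.strip()
    if text.endswith("Z"):
        # "Z" means UTC; spell it as an offset for older Pythons.
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)

rss = parse_rss_date("Mon, 23 Sep 2013 03:00:05 GMT")
atom = parse_atom_date("2013-09-23T03:00:05Z")
print(rss == atom)  # True
```

The RSS side leans on a parser written for e-mail headers. The Atom side is close to a plain ISO 8601 parse.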
RSS 1.0, the problem child
RSS changed a lot between versions. There was the 0.9x series, several of which were withdrawn. Later on there was version 1.0 (2000) and 2.0 (2002). The big problem here is that RSS 1.0 has very little in common with 0.9x and 2.0. It’s practically a whole different format. In order to officially support RSS, a client has to be able to parse all of these different formats. In fact, in Elfeed I have an entirely separate parser for RSS 1.0.
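A client has to work out which of these formats it’s looking at before it can pick a parser. One way to sketch that in Python is to look at the root element. The function name and the returned labels are invented here, and treating every rdf:RDF root as RSS 1.0 is a simplification.

```python
import xml.etree.ElementTree as ET

RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"
ATOM_ROOT = "{http://www.w3.org/2005/Atom}feed"

def detect_format(document):
    """Guess the feed format from the document's root element."""
    root = ET.fromstring(document)
    if root.tag == ATOM_ROOT:
        return "atom"
    if root.tag == RDF_ROOT:
        return "rss-1.0"
    if root.tag == "rss":
        # The 0.9x series and 2.0 share this root and differ by version.
        return "rss-" + root.get("version", "unknown")
    return "unknown"

print(detect_format('<rss version="2.0"><channel/></rss>'))  # rss-2.0
```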
What’s so weird about RSS 1.0? If you thought the name “pubDate” was
ugly you might want to skip this part. In practice it’s namespace
hell. For example, look at this Gmane RSS 1.0 feed. Unlike the
other RSS versions, the top level element is rdf:RDF. That’s not a
typo.
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <channel>
    <title>RSS 1.0 Example</title>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.com/foo"/>
      </rdf:Seq>
    </items>
  </channel>
  <item>
    <title>Example Item</title>
    <description>A summary.</description>
    <link>http://www.example.com/foo</link>
  </item>
</rdf:RDF>
Remember, if you want dates you’ll need to import another namespace.
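Here’s a rough Python sketch of what reading such a feed involves once a Dublin Core date is added. The sample document and the function name rss1_items are made up for illustration. The namespace URIs are the standard ones for RSS 1.0, RDF, and Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {
    "rss": "http://purl.org/rss/1.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

SAMPLE = """\
<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>RSS 1.0 Example</title>
  </channel>
  <item>
    <title>Example Item</title>
    <link>http://www.example.com/foo</link>
    <dc:date>2013-09-23T03:00:05Z</dc:date>
  </item>
</rdf:RDF>
"""

def rss1_items(document):
    """Collect title, link, and dc:date from each RSS 1.0 item."""
    root = ET.fromstring(document)
    items = []
    # Items are direct children of rdf:RDF, not of channel.
    for item in root.findall("rss:item", NS):
        items.append({
            "title": item.findtext("rss:title", default="", namespaces=NS),
            "link": item.findtext("rss:link", default="", namespaces=NS),
            "date": item.findtext("dc:date", default=None, namespaces=NS),
        })
    return items

print(rss1_items(SAMPLE)[0]["date"])  # 2013-09-23T03:00:05Z
```

Every lookup has to be namespace-qualified, even for plain tags like title and link.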
Notice the completely redundant items tag. It’s not like you’re
going to download a partial feed and use the items tag to avoid
grabbing full content. It’s just noise.
Even more important: notice that the items are outside the
channel tag! Why would they completely restructure everything in
1.0? It’s madness. Fortunately everything here was dumped in RSS 2.0
and, except for a very small number of feeds, it’s almost just a bad
memory.
channel, the vestigial tag
Notice in the example RSS feed it goes rss, then channel, then item.
The channel tag suggests a single feed can have a number of
different channels. Nope! Only one channel is allowed, meaning the
channel tag serves absolutely no purpose. It’s just more noise. Why
was this ever added?
The good news is that RSS has a category tag which serves this
purpose much better anyway. Tagging is preferable to hierarchies:
an item could only belong to one channel but it could belong to
multiple categories.
Atom is a much cleaner specification, with much clearer intent, and without all the mistakes and ambiguities. It’s also more general, designed for the syndication of many types and shapes of content. This is what made it popular for use with podcasts. Everything I listed above I discovered myself while writing Elfeed. There are surely many other problems with RSS I haven’t noticed yet.
If I only had to support Atom, things would have been significantly simpler. At the moment I have no complaints about Atom. It’s given me no trouble.
Someday if you’re going to create a new feed for some content, please do the web a favor and choose Atom! You’re much more likely to get things right the first time and you’ll make someone else’s job a lot easier. As the author of a web feed client you can take my word for it.