Doing Comment Previews the Right Way

Many comment and discussion systems get previews wrong, including major sites like Boing Boing and Slashdot. Sometimes they feed a different comment back into the textarea, so repeated previews slowly degrade the comment. Other times the preview doesn't match the final result. A comment actually has four states: raw, escaped, stripped, and stored.

The raw comment is the unfiltered string of bytes from the user. This is not safe to give directly back to the user, as it could be exploited to feed an arbitrary page to an innocent user.

The escaped comment is created from the raw comment by filtering it through the escapeHTML() function. This function converts some characters, like < and >, into HTML entities. A browser will interpret the escaped comment as a simple string, so it is safe to give back to the user. This function is provided by Perl's CGI module, so Perl programmers need not implement it.

Note that escapeHTML() is reversible, though the server side won't need to reverse it. The browser does that when it displays the text.
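As a rough illustration of escaping and its reversibility, here is a minimal sketch in Python rather than Perl. It assumes the standard library's html.escape() as a stand-in for escapeHTML(); the name escape_html is made up for this example.

```python
import html

def escape_html(raw: str) -> str:
    """Stand-in for escapeHTML(): turn markup characters into entities."""
    return html.escape(raw, quote=True)

raw = '<b>Tom & "Jerry"</b>'
escaped = escape_html(raw)
print(escaped)  # &lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;

# Reversing the escape (what the browser effectively does when it
# displays the text) recovers the raw comment exactly.
assert html.unescape(escaped) == raw
```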

The stripped comment is created from the raw comment by filtering it through stripHTML(), which removes non-whitelisted HTML tags. It also strips non-whitelisted attributes from allowed tags. It should probably add a rel="nofollow" to links. It also runs escapeHTML() on attribute values and content outside tags. This is safe to give back to the user because only safe tags are left.

If your comments use markup other than HTML, like BBCode, this function should strip all HTML (your whitelist is empty) and do the conversion from your markup to HTML.

It might also be a good idea for it to produce well-formed HTML. This will allow your comments/discussion pages to be XHTML compliant.

stripHTML() is irreversible because it dumps information.
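The whitelist filtering described above could be sketched as follows, using Python's html.parser in place of Perl. The whitelist contents and all names here are assumptions made for illustration, not the author's code. The check on link schemes is an extra precaution that the article does not mention. Unlike the ideal described above, this sketch does not balance unclosed tags, so its output is not guaranteed to be well-formed.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
ALLOWED = {"a": {"href"}, "b": set(), "i": set(), "em": set(),
           "strong": set(), "code": set(), "blockquote": set(), "p": set()}

def _safe_url(value):
    # Extra precaution (not from the article): reject javascript: and
    # other unexpected schemes in link targets.
    return (value or "").strip().lower().startswith(
        ("http://", "https://", "mailto:"))

class _Stripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text content is still kept
        kept = [(k, v or "") for k, v in attrs
                if k in ALLOWED[tag] and (k != "href" or _safe_url(v))]
        if tag == "a":
            kept.append(("rel", "nofollow"))
        rendered = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Content outside tags is escaped, as the article requires.
        self.out.append(escape(data, quote=False))

def strip_html(raw: str) -> str:
    """Stand-in for stripHTML(): keep only whitelisted tags and attributes."""
    parser = _Stripper()
    parser.feed(raw)
    parser.close()
    return "".join(parser.out)
```

Because the dropped tags and attributes are simply discarded, there is no way to get the raw comment back from the result, which is why the preview also needs the escaped comment.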

The stored comment is the encoding of the comment in the system. This depends entirely on the storage system. In some cases it may be identical to the stripped comment (and store() is the identity function). If the comment is going through SQL into a database, some characters may need to be escaped so as not to cause problems. It could even be a Base64 encoding.

store() must be unambiguously reversible, and the server should have an unstore() to do this. It should probably also be able to convert any arbitrary string of characters into a safe encoding for storage.
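To make the reversibility requirement concrete, here is a minimal sketch of a store()/unstore() pair, assuming the Base64 encoding mentioned above. It is one possible encoding chosen for illustration, not the one the author's system uses.

```python
import base64

def store(comment: str) -> str:
    """Encode any string into a plain-ASCII form that is safe to store."""
    return base64.b64encode(comment.encode("utf-8")).decode("ascii")

def unstore(stored: str) -> str:
    """Exact inverse of store()."""
    return base64.b64decode(stored.encode("ascii")).decode("utf-8")

# Quotes, newlines, and non-ASCII text all survive the round trip.
comment = "O'Reilly said:\n<b>caf\u00e9</b>; -- done"
assert unstore(store(comment)) == comment
```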

There should only be one version of all these functions for both previews and final posting of comments.

When doing a comment preview, both the escaped comment and the stripped comment are given back to the user. The stripped comment is dropped in as HTML, and the escaped comment is put into the textarea of the form. It would probably be convenient for the user if you also give them back any other form information, including the same captcha and their answer to it (or don't challenge them with a captcha again for that comment).
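The preview step might look something like the following sketch. The function name render_preview, the form markup, and the strip_html parameter are all hypothetical. The demonstration passes a stand-in stripper that escapes everything, as if the whitelist were empty.

```python
import html

def render_preview(raw: str, strip_html) -> str:
    """Build a preview page fragment from the raw comment.

    strip_html is whatever whitelist filter the site uses; the same
    function must also be used when the comment is finally posted.
    """
    escaped = html.escape(raw, quote=True)   # goes into the textarea
    stripped = strip_html(raw)               # goes into the page as HTML
    return (
        '<div class="preview">' + stripped + "</div>\n"
        '<form method="post">\n'
        '<textarea name="comment">' + escaped + "</textarea>\n"
        "</form>"
    )

# Stand-in stripper for the demonstration: treats every tag as text.
raw = '<i>a</i> & "b"'
page = render_preview(raw, lambda s: html.escape(s, quote=False))
print(page)
```

Both outputs are derived from the raw comment, never from each other. The browser shows the textarea contents unescaped, so the user resubmits exactly what they typed and repeated previews do not degrade the comment.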

You may be tempted to store the raw comments (safely with store()) and do HTML stripping on the fly. This would allow you to upgrade your HTML stripping function in the future to "better" handle user input. I don't recommend it. That's extra processing for each page request, but worse, it breaks the concept of the preview, because the comment formatting is subject to change in the future.

The hardest function to implement is probably stripHTML() because it needs to be able to handle poorly formed HTML. If you are using Perl, you will probably want to use the HTML::Parser module, which is what I did. My implementation does everything noted above and also auto-links anything that looks like a URL, forces proper comment nesting, automatically makes paragraphs from blank-line-separated chunks, and almost produces well-formed HTML.
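Of the extras listed above, paragraph creation is the easiest to illustrate. The following is a rough Python sketch of the idea only, not a port of the Perl module; make_paragraphs is a hypothetical name, and it assumes the input is plain text with no tags to preserve.

```python
import html
import re

def make_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated chunk of plain text in <p> tags."""
    chunks = re.split(r"\n\s*\n", text.strip())
    return "\n".join(
        "<p>" + html.escape(chunk.strip(), quote=False) + "</p>"
        for chunk in chunks
        if chunk.strip()
    )

print(make_paragraphs("one\n\ntwo\n\n\nthree < four"))
```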

htmlclean.pm

The documentation is basically non-existent, but if you want to whitelist more tags add them to @allowed_tags. Use it, abuse it.

I use this code in my comment system, so you can play around with it by using my preview function.

null program

Chris Wellons