nullprogram.com/blog/2009/05/25/
Many comment/discussion systems get previews wrong. This even includes
major sites like Boing Boing and Slashdot. Sometimes they feed back a
different comment in the textarea, so repeated previews slowly degrade
the comment. Other times the comment preview isn't the same thing as
the final result. A comment actually has four states:
The raw comment is the unfiltered string of bytes from the
user. This is not safe to give directly back to the user, as it could
be exploited to feed an arbitrary page to an innocent user.
The escaped comment is created from the raw comment by
filtering it through the escapeHTML() function. This
function creates HTML entities out of some of the characters, like
< and >. A browser will interpret the escaped comment as a
simple string, so it is safe to give back to the user. This function is
actually provided by Perl's CGI module, so Perl programmers need not
implement it.
Note that escapeHTML() is reversible, though the server
side won't need to reverse it. The browser does.
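Here is a minimal sketch of the escaping step, written in Python rather than the Perl this post uses. It assumes Python's standard `html.escape()` as a stand-in for escapeHTML(), and `html.unescape()` as a stand-in for what the browser does when it renders the textarea. The name `escape_html` is made up for illustration.

```python
import html

def escape_html(raw: str) -> str:
    """Turn HTML-significant characters into entities so a browser shows them as text."""
    return html.escape(raw, quote=True)

raw = '<script>alert("hi")</script> & more'
escaped = escape_html(raw)
print(escaped)
# prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt; &amp; more

# Escaping loses nothing: undoing it gives back the exact raw comment.
assert html.unescape(escaped) == raw
```

Because the round trip is exact, the escaped comment can go into the preview's textarea without the text changing between previews.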
The stripped comment is created from the raw comment by
filtering it through stripHTML(), which removes
non-whitelisted HTML tags. It also strips non-whitelisted attributes
from allowed tags. It should probably add a rel="nofollow"
attribute to links. It also runs escapeHTML()
on attribute values and content outside tags. This is safe to give
back to the user because only safe tags are left.
If your comments use markup other than HTML, like BBCode, this
function should strip all HTML (your whitelist is empty) and do
the conversion from your markup to HTML.
It might also be a good idea for it to produce well-formed HTML. This
will allow your comments/discussion pages to be XHTML compliant.
stripHTML() is irreversible because it discards information.
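To make the stripping step concrete, here is a rough sketch in Python (not the Perl module this post is about) built on the standard `html.parser.HTMLParser`. The whitelist, the names `ALLOWED` and `strip_html`, and the extra rule that only http(s) links keep their href are all assumptions for illustration. It has not been hardened against real-world attacks.

```python
import html
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
# No void elements (like <br>) are listed, so every kept tag needs a close.
ALLOWED = {"a": {"href"}, "b": set(), "i": set(), "em": set(), "code": set()}

class _Stripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of whitelisted tags that are still open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text content is still kept
        kept = [(k, v or "") for k, v in attrs if k in ALLOWED[tag]]
        if tag == "a":
            # Assumption: only plain web links are allowed through.
            kept = [(k, v) for k, v in kept
                    if v.lower().startswith(("http://", "https://"))]
            kept.append(("rel", "nofollow"))
        attr_text = "".join(f' {k}="{html.escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or non-whitelisted close tag
        # Close inner tags first so the output stays properly nested.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))

def strip_html(raw: str) -> str:
    parser = _Stripper()
    parser.feed(raw)
    parser.close()
    while parser.open_tags:  # close anything the user left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)

print(strip_html('<a href="http://example.com" style="x">link</a>'))
# prints: <a href="http://example.com" rel="nofollow">link</a>

# Two different raw comments collapse to the same output, so there is no way back.
assert strip_html("<blink>hi</blink>") == strip_html("hi")
```

The last assertion shows the irreversibility: once the `<blink>` tag is gone, nothing in the output records that it was ever there.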
The stored comment is the encoding of the comment in the
system. This depends entirely on the storage system. In some cases it
may be identical to the stripped comment (and store()
is the identity function). If the comment is going through SQL into a
database, some characters may need to be escaped so as not to cause
problems. It could even be a base64 encoding.
store() must be unambiguously reversible, and the server
should have an unstore() to do this. It should probably
also be able to convert any arbitrary string of characters into a safe
encoding for storage.
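As one possible illustration, here is a Python sketch of the base64 option (Python rather than Perl; the function names simply mirror the ones used above). It assumes comments are text that can be encoded as UTF-8.

```python
import base64

def store(stripped: str) -> str:
    """Encode any string into plain ASCII letters, digits, '+', '/' and '='."""
    return base64.b64encode(stripped.encode("utf-8")).decode("ascii")

def unstore(stored: str) -> str:
    """Exact inverse of store()."""
    return base64.b64decode(stored.encode("ascii")).decode("utf-8")

comment = "<b>O'Reilly</b> said: \"50% off\"; DROP TABLE comments; --"
encoded = store(comment)

# Quotes and semicolons never reach the storage layer as-is...
assert "'" not in encoded and ";" not in encoded
# ...and the original comes back unchanged.
assert unstore(encoded) == comment
```

The cost is that the stored form is about a third larger and not readable or searchable in the database, so escaping for the specific storage system may be the better choice when that matters.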
There should only be one version of all these functions for both
previews and final posting of comments.
When doing a comment preview both the escaped comment and the stripped
comment are given back to the user. The stripped comment is dropped in
as HTML, and the escaped comment is put into the textarea of the
form. It would probably be convenient for the user if you give them
back any other form information, including the same captcha and their
answer to it (or not charge them with a captcha for that comment
anymore).
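The preview round trip can be sketched like this, again in Python rather than Perl. `build_preview` and `toy_strip_html` are hypothetical names. The toy stripper treats the whole comment as plain text (an empty whitelist) just to keep the example short, and `html.unescape()` stands in for the browser submitting the textarea contents again.

```python
import html

def toy_strip_html(raw: str) -> str:
    # Hypothetical stand-in for stripHTML(): no tags allowed, everything is text.
    return html.escape(raw, quote=False)

def build_preview(raw: str, strip_html) -> dict:
    """Return the two pieces of a preview page, both derived from the raw comment."""
    return {
        "rendered": strip_html(raw),               # dropped into the page as HTML
        "textarea": html.escape(raw, quote=True),  # placed inside the <textarea>
    }

original = "I <3 <b>bold</b> &amp; entities"
raw = original
for _ in range(3):
    page = build_preview(raw, toy_strip_html)
    # What the browser would submit if the user hit "preview" again.
    raw = html.unescape(page["textarea"])

# Repeated previews leave the comment untouched.
assert raw == original
```

The point of the design is that the textarea is always filled from the raw comment, never from the stripped one, so the user gets back exactly what they typed no matter how many times they preview.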
You may be tempted to store the raw comments (safely with
store()) and do HTML stripping on the fly. This would
allow you to upgrade your HTML stripping function in the future to
"better" handle user input. I don't recommend it. That's extra
processing for each page request, but worse, it breaks the concept of
the preview, because the comment formatting is subject to change in
the future.
The hardest function to implement is probably stripHTML()
because it needs to be able to handle poorly formed HTML. If you are
using Perl, you will probably want to use the HTML::Parser module,
which is what I did. My implementation does everything noted above and also
auto-links anything that looks like a URL, forces proper comment
nesting, automatically makes paragraphs from blank-line-separated
chunks, and almost produces well-formed HTML.
htmlclean.pm
The documentation is basically non-existent, but if you want to
whitelist more tags, add them to @allowed_tags. Use it,
abuse it.
I use this code in my comment system, so you can play around with it
by using my preview function.