As is my wont, I’m in the middle of porting jerakeen.org to another back-end. This time, I’m porting it back to the Django-based Python version (it’s been written in Rails for a few months now). It’s grown a few more features, and one of them is somewhat smarter comment parsing.
This being a vaguely technical blog, I have vaguely technical people leaving comments. And most of them want to be able to use HTML. I’ve seen blogs that allow markdown in comments, but I hate that - unless you know you’re writing it, it’s too easy for markdown to do things like eat random underscores and italicise the rest of the sentence by accident. But at the same time, I need to let people who just want to type text leave comments.
The trick then is to turn plain text into HTML, but also allow some HTML through. Because the world is a nasty place, this means whitelisting based on tags and attributes, rather than removing known-to-be-nasty things. Glossing over the ‘turn plain text into HTML’ part, because it’s easy, here’s how I use BeautifulSoup to sanitise HTML comments, permitting only a subset of allowed tags and attributes:
It’s based on an Hpricot HTML sanitizer that I’ve used in a few things.
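To make the whitelisting idea concrete, here’s a minimal sketch of the general approach. It is not the code this site runs: to keep it dependency-free it uses Python’s standard-library `html.parser` rather than BeautifulSoup, and the `ALLOWED` table, `SAFE_SCHEMES` and the `sanitise` function are names made up for illustration. Anything not on the whitelist loses its tag but keeps its text, and unclosed tags get closed at the end.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED = {
    "a": {"href"},
    "b": set(), "i": set(), "em": set(), "strong": set(),
    "code": set(), "pre": set(), "blockquote": set(), "p": set(),
}
# Hypothetical list of URL schemes permitted in href values.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class Sanitiser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []   # pieces of sanitised output
        self.open = []  # whitelisted tags currently open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text content still comes through
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # e.g. javascript: URLs
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))
        self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray or disallowed closing tag
        # Close everything opened since the matching start tag.
        while self.open:
            top = self.open.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        # Text is always re-escaped, so it can never turn back into markup.
        self.out.append(escape(data, quote=False))


def sanitise(html):
    parser = Sanitiser()
    parser.feed(html)
    parser.close()
    # Close any whitelisted tags the commenter left open.
    while parser.open:
        parser.out.append("</%s>" % parser.open.pop())
    return "".join(parser.out)
```

Calling `sanitise('<b onclick="x()">hi</b>')` keeps the `<b>` but drops the `onclick`, and a `javascript:` link comes out as a bare `<a>`. Note that the contents of a `<script>` element survive as plain escaped text; if you’d rather drop them entirely, that needs an extra check.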