Disqus comments

I used to think that if I did my own thing to handle comment spam then I wouldn’t be low-hanging fruit and wouldn’t have a problem. And this worked for quite a long time. But a couple of weeks ago it stopped working, and last week I turned off comments on jerakeen.org just so I didn’t have to delete 30 spam comments every day. And I’m far too lazy to do anything properly to solve this.

Fortunately, Disqus have appeared recently, and they’re great. I can just embed someone else’s commenting framework and let them deal with the problem. The obvious downside is that the comments aren’t ‘really’ on my page, so I won’t get any google juice from them. But on the other hand, the comments aren’t really on my page, so no one else will get any google juice from them. Maybe this’ll make them less appealing as a spam target in the first place.

The other downside is that there’s no import capability for my old comments. I’ve settled for just displaying the old comments in-place, with Disqus comments under them. I’ve also taken the opportunity to make it slightly clearer when I’m syndicating comments from flickr rather than allowing them on the local site. I haven’t managed to combine the count of old and new comments yet, but I’m sure I’ll get it soon. Until then, pages with old-style comments will just have two figures for the comment count. You’ll live.

Sanitising comments with Python

As is my wont, I’m in the middle of porting jerakeen.org to another back-end. This time, I’m porting it back to the Django-based Python version (it’s been written in Rails for a few months now). It’s grown a few more features, and one of them is somewhat smarter comment parsing.

This being a vaguely technical blog, I have vaguely technical people leaving comments. And most of them want to be able to use HTML. I’ve seen blogs that allow markdown in comments, but I hate that – unless you know you’re writing it, it’s too easy for markdown to do things like eat random underscores and italicise the rest of the sentence by accident. But at the same time, I need to let people who just want to type text leave comments.

The trick, then, is to turn plain text into HTML, but also allow some HTML through. Because the world is a nasty place, this means whitelisting tags and attributes, rather than trying to remove known-to-be-nasty things. Glossing over the ‘turn plain text into HTML’ part because it’s easy (there’s a sketch of it after the code), here’s how I use BeautifulSoup to sanitise HTML comments, permitting only a whitelisted subset of tags and attributes:

import re

from BeautifulSoup import BeautifulSoup

# Assume some evil HTML is in 'evil_html'

# allow these tags. Anything else is turned into a bare <span>, but its children survive
whitelist = ['a', 'b', 'blockquote', 'br', 'code', 'em', 'i', 'img', 'p', 'pre', 'strong', 'u']

# allow only these attributes on these tags. No other tags are allowed any attributes.
attr_whitelist = { 'a':['href','title','hreflang'], 'img':['src', 'width', 'height', 'alt', 'title'] }

# remove these tags, complete with contents.
blacklist = [ 'script', 'style' ]

attributes_with_urls = [ 'href', 'src' ]

# BeautifulSoup catches out-of-order and unclosed tags, so markup
# can't leak out of a comment and break the rest of the page.
soup = BeautifulSoup(evil_html)

# now strip HTML we don't like.
for tag in soup.findAll():
    if tag.name.lower() in blacklist:
        # blacklisted tags are removed in their entirety
        tag.extract()
    elif tag.name.lower() in whitelist:
        # tag is allowed. Make sure all the attributes are allowed.
        # iterate over a copy of the list, because we may remove attributes as we go
        for attr in tag.attrs[:]:
            # allowed attributes are whitelisted per-tag
            if tag.name.lower() in attr_whitelist and attr[0].lower() in attr_whitelist[ tag.name.lower() ]:
                # some attributes contain urls..
                if attr[0].lower() in attributes_with_urls:
                    # ..make sure they're nice urls
                    if not re.match(r'(https?|ftp)://', attr[1].lower()):
                        tag.attrs.remove( attr )

                # ok, then
                pass
            else:
                # not a whitelisted attribute. Remove it.
                tag.attrs.remove( attr )
    else:
        # not a whitelisted tag. I'd like to remove it from the tree
        # and replace it with its children. But that's hard. It's much
        # easier to just replace it with an empty span tag.
        tag.name = "span"
        tag.attrs = []

# stringify back again
safe_html = unicode(soup)

# HTML comments can contain executable scripts, depending on the browser, so we'll
# be paranoid and just get rid of all of them
# e.g. <!--[if lt IE 7]><script type="text/javascript">h4x0r();</script><![endif]-->
# TODO - I rather suspect that this is the weakest part of the operation..
safe_html = re.sub(r'(?s)<!--.*?-->', '', safe_html)

It’s based on an Hpricot HTML sanitizer that I’ve used in a few things.
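
For completeness, the ‘turn plain text into HTML’ step I glossed over is roughly this sort of thing – a minimal sketch of the plain-text-only case, not the exact code I run:

import cgi

def plain_text_to_html(text):
    # sketch only: escape &, < and >, wrap blank-line-separated chunks in
    # <p> tags, and turn any remaining single newlines into <br /> tags
    escaped = cgi.escape(text)
    paragraphs = [p.strip() for p in escaped.split('\n\n') if p.strip()]
    return '\n'.join('<p>%s</p>' % p.replace('\n', '<br />\n') for p in paragraphs)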

Update 2008-05-23: My thanks to Paul Hammond and Mark Fowler, who pointed me at all manner of nasty things (such as javascript: URLs) that I didn’t handle very well. I now also whitelist allowed URIs. I should also point out the test suite I use – all code needs tests!
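
The tests are nothing fancy – feed nasty markup through the sanitiser and check the nastiness is gone. Something like this, assuming the code above has been wrapped up in a hypothetical sanitize_html() function:

import unittest

# 'sanitize_html' is assumed to be the code above, wrapped in a function that
# takes evil HTML and returns the cleaned string
from sanitizer import sanitize_html

class SanitizerTests(unittest.TestCase):

    def test_script_tags_are_removed_entirely(self):
        self.assertEqual(sanitize_html('<script>h4x0r();</script>hello'), 'hello')

    def test_javascript_urls_are_stripped(self):
        safe = sanitize_html('<a href="javascript:h4x0r()">click</a>')
        self.assertTrue('javascript:' not in safe)

if __name__ == '__main__':
    unittest.main()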

On blog comment spam

Blog comment spam, the scourge of the internet. Having written yet another CMS to power jerakeen.org, I wanted comments on pages again. Django rocks hard – adding commenting was easy. And a day later, I have comment spam. Bugger.

From a purely abstract point of view, I find this interesting. There must be a spider looking for forms that look like things that can take comments. And the robots must be reasonably flexible – it’s not like my CMS is an off-the-shelf one. But from a more concrete, ‘spam bad’ point of view, it’s bloody annoying.

So begins my personal battle against spam. Others have fought this battle, but of course the downside of rolling your own site is that you can’t use anything off the shelf. My plan was to forget about trying to recognise and filter spam; ideally I don’t want to have to moderate anything – I don’t want the spam to be submitted at all. And this really can’t be that hard. Unless there’s a human surfing for blogs and typing in the spam themselves, this should really just be a measure of my ability to write a Turing test. Right?

My first plan was to require a form value in the comment submission, but to not include that field in the form itself – instead, I added it with client-side JavaScript. This should stop simplistic robots, at the cost of requiring JS to be turned on in the client, which is something I’m willing to live with, frankly. Alas, it didn’t work. Clearly too simple – either there’s a human typing spam into the box, or the robot doing the work is using something like Mozilla::Mechanize that’ll do the JavaScript. Or maybe they just handle some obvious cases. After all, my ‘clever’ code was merely document.write("<input name=......

Or perhaps they figure it out once, and use a replay attack to hit every page? Not really a good assumption with hindsight, but never mind. I added prefixes to the form fields that were generated from the current time, and checked at submit time that the fields weren’t more than an hour old. This saves me from having to store state anywhere, and gains me forms that expire after a while, unless you reverse-engineer the timestamp format. But I’m premising the existence of some automated tool, perhaps with a little human interaction. I don’t need to be perfect, I merely need to be not as bad as everyone else… But no, this failed too.
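
For the record, the prefix scheme was nothing clever. A sketch of the idea – the field names and prefix format here are made up, not the real ones:

import time

MAX_FORM_AGE = 60 * 60  # forms older than an hour are rejected

def field_prefix():
    # stamp the render time into the field names when the form is generated
    return 'f%x_' % int(time.time())

def prefix_is_fresh(prefix):
    # at submit time, recover the timestamp from the field names and make
    # sure the form isn't stale (or from the future)
    try:
        rendered_at = int(prefix[1:].rstrip('_'), 16)
    except ValueError:
        return False
    return 0 <= time.time() - rendered_at <= MAX_FORM_AGE

So the comment field ends up being called something like f4832a1b0_comment, and any submission with a prefix that doesn’t decode, or that decodes to more than an hour ago, gets dropped without my having to store any state on the server.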

OK, so the JavaScript is too obvious. I split it up into sections, wrote the wrong value into the form, and changed it to the right one later using a regular expression (BWHAHAHAH). At the same time (and I suspect this is the important bit) I changed the names of the fields completely. Calling them ‘name’, ‘email’ and ‘comment’ is a bit of a giveaway, really. ‘foo’, ‘bar’ and ‘baz’ they are, then. Now it should be practically impossible for an automated tool to even figure out that I accept comments. Sure, you could probably think ‘hmm, two small input fields, and a textarea, on a page that has an RSS feed’, but I’m assuming that, for 90% of the blogs out there, this isn’t needed, so no-one does it.

And yes, I’ve received no blog spam comments since I did this. On the other hand, I’ve received no normal comments either. Hope I haven’t raised the barrier too high. If the situation stays good, I may remove the client-side JavaScript requirement. Or figure out a noscript fall-back solution for people using lynx. Poor souls..