Tom Insam

On blog comment spam

Blog comment spam, the scourge of the internet. Having written yet another CMS to power jerakeen.org, I wanted comments on pages again. Django rocks hard - adding commenting was easy. And a day later, I have comment spam. Bugger.

From a purely abstract point of view, I find this interesting. There must be a spider out there looking for forms that look like they can take comments. And the robots must be reasonably flexible - it's not like my CMS is an off-the-shelf one. But from a more concrete, 'spam bad' point of view, it's bloody annoying.

So begins my personal battle against spam. Others have fought this battle before me, but of course the downside of rolling your own site is that you can't use anything off the shelf. My plan was to forget about trying to recognise and filter spam, and preferably to avoid moderating anything - I don't want the spam to be submitted at all. And this can't be that hard. Unless there's a human surfing for blogs and typing in the spam themselves, it should really just be a measure of my ability to write a Turing test. Right?

My first plan was to require a form value in the comment submission, but not to include that field in the form itself - instead, I added it with client-side JavaScript. This should stop simplistic robots, at the cost of requiring JS to be turned on in the client, which is something I'm willing to live with, frankly. Alas, it didn't work. Clearly too simple - either there's a human typing spam into the box, or the robot doing the work is using something like Mozilla::Mechanize that'll run the JavaScript. Or maybe they just handle some obvious cases. After all, my 'clever' code was merely document.write("<input name=...
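
For illustration, the server-side half of that first attempt might look something like this in Django - the field name, the expected value and the save_comment helper are placeholders invented for this sketch, not lifted from the real code:

    # Rough sketch of the idea: the template never renders this field; a line of
    # client-side JavaScript writes it into the form, so only a browser that ran
    # the script can produce a submission the view will accept.
    from django.http import HttpResponseForbidden, HttpResponseRedirect

    JS_ONLY_FIELD = "commenter_token"   # hypothetical name of the JS-injected field
    EXPECTED_VALUE = "not-a-robot"      # hypothetical value the JavaScript writes in

    def save_comment(post_data):
        ...  # stand-in for the real comment-saving code

    def post_comment(request):
        if request.POST.get(JS_ONLY_FIELD) != EXPECTED_VALUE:
            # Field missing or wrong: the client never ran the JavaScript.
            return HttpResponseForbidden("comments require JavaScript")
        save_comment(request.POST)
        return HttpResponseRedirect(request.path)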

Or perhaps they figure it out once, and use a replay attack to hit every page? Not really a good assumption with hindsight, but never mind. I added prefixes to the form fields, generated from the current time, and checked at submit time that the fields weren't more than an hour old. This saves me from having to store state anywhere, and gains me forms that expire after a while, unless you reverse-engineer the timestamp format. But I'm presuming the existence of some automated tool, perhaps with a little human interaction. I don't need to be perfect, I merely need to be not as bad as everyone else... But no, this failed too.
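
In outline, the stateless expiry check is something like the sketch below - the prefix format here is plain hex, deliberately more readable than whatever a real deployment would use:

    # Sketch of the expiring-forms trick: the field-name prefix encodes the time
    # the form was rendered, and at submit time we just check that it's less than
    # an hour old. No state is stored on the server.
    import time

    MAX_FORM_AGE = 60 * 60  # one hour, in seconds

    def field_prefix(now=None):
        """Return a prefix to prepend to every field name, e.g. 'f499602d2_'."""
        now = int(now if now is not None else time.time())
        return "f%x_" % now

    def prefix_is_fresh(prefix, now=None):
        """True if the prefix decodes to a render time less than an hour ago."""
        now = now if now is not None else time.time()
        try:
            rendered_at = int(prefix[1:-1], 16)  # strip the leading 'f' and trailing '_'
        except ValueError:
            return False
        return 0 <= now - rendered_at <= MAX_FORM_AGE

The form renders its fields with that prefix on every name, and the submit view recovers the prefix from the field names and refuses anything that fails the freshness check.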

Ok, so the JavaScript is too obvious. I split it up into sections, write the wrong value into the form, and then change it to the right one using a regular expression (BWHAHAHAH). At the same time (and I suspect this is the important bit) I changed the names of the fields completely. Calling them 'name', 'email' and 'comment' is a bit of a giveaway, really. 'foo', 'bar' and 'baz' they are, then. Now it should be practically impossible for an automated tool to even figure out that I accept comments. Sure, you could probably think 'hmm, two small input fields and a textarea, on a page that has an RSS feed', but I'm assuming that, for 90% of the blogs out there, that level of effort isn't needed, so no-one bothers.
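
The server side of the renaming is trivial - something like this, with decode_comment_post being a made-up helper rather than the actual code:

    # Sketch of the renamed-fields trick on the server side: the form presents
    # opaque field names, and the view translates them back before saving.
    OPAQUE_TO_REAL = {
        "foo": "name",
        "bar": "email",
        "baz": "comment",
    }

    def decode_comment_post(post_data):
        """Map the obfuscated field names in a submitted form back to real ones."""
        return {real: post_data.get(opaque, "")
                for opaque, real in OPAQUE_TO_REAL.items()}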

And yes, I've received no spam comments since I did this. On the other hand, I've received no normal comments either. Hope I haven't raised the barrier too high. If the situation stays good, I may remove the client-side JavaScript requirement. Or figure out a noscript fallback for people using lynx. Poor souls...