Tom Insam

I tend to reimplement the CMS that drives jerakeen.org more often than I add content to it, but the current Django-based incarnation seems to have decent sticking power. A lot of this is down to Django’s magic admin interface. When I add, say, a tagging engine to the site, I only need to worry about the object model and presenting it on the site itself. All the boring, much-harder-to-write admin pages to add and remove tags just write themselves. But the other reason I’m staying with it is that I’ve now added so many features to it (because it’s easy!) that a rewrite in another language would be a huge amount of effort.

This weekend, for instance, I’ve added an implementation of the MetaWeblog API to the site, using the excellent code on allyourpixel as a base. The main source of pain is the persistent weirdness of implementing the Movable Type extensions to the MetaWeblog extensions to the Blogger XML-RPC API. How can you call something a MetaWeblog API and not allow for post excerpts, for instance? So annoying.
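
To make the layering concrete, here is a rough sketch of the request a client sends for metaWeblog.newPost, with the excerpt carried in the Movable Type mt_excerpt member of the post struct. It is not the code running on this site: the helper names (escapeXml, structXml, newPostCall) and the example values are made up for illustration, and it only handles string members.

```javascript
// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap a string as an XML-RPC <value>.
function stringValue(text) {
  return "<value><string>" + escapeXml(text) + "</string></value>";
}

// Serialise a flat object of strings as an XML-RPC <struct>.
function structXml(fields) {
  var members = Object.keys(fields).map(function (key) {
    return "<member><name>" + escapeXml(key) + "</name>" +
      stringValue(fields[key]) + "</member>";
  });
  return "<struct>" + members.join("") + "</struct>";
}

// Build the body of a metaWeblog.newPost call (hypothetical helper).
// Parameter order: blog id, username, password, post struct, publish flag.
function newPostCall(blogId, username, password, post, publish) {
  var params = [
    stringValue(blogId),
    stringValue(username),
    stringValue(password),
    "<value>" + structXml(post) + "</value>",
    "<value><boolean>" + (publish ? 1 : 0) + "</boolean></value>"
  ];
  var paramXml = params.map(function (p) {
    return "<param>" + p + "</param>";
  }).join("");
  return '<?xml version="1.0"?><methodCall>' +
    "<methodName>metaWeblog.newPost</methodName>" +
    "<params>" + paramXml + "</params></methodCall>";
}

// Example values are placeholders, not real credentials.
var body = newPostCall("1", "tom", "secret", {
  title: "Hello",
  description: "<p>Body</p>",
  mt_excerpt: "Short & sweet" // Movable Type extension member
}, true);
console.log(body);
```

The title and description members come from MetaWeblog itself. The excerpt only exists because the server also understands the mt_ members.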

editing jerakeen.org using ecto

While implementing it, I found the Textpattern API reference to be far more useful than the official spec, mostly because it covers everything up to the Movable Type extensions, which you need if you want to edit page excerpts. The other problem I encountered was that Ecto won’t talk to an endpoint over HTTPS with a self-signed certificate unless the SSL cert is in the local machine’s X.509 database. The way it fails is incredibly unhelpful and annoying, too. The simplest way to fix it (assuming a recent Mac OS X) is to visit the endpoint in Safari. It’ll complain about the certificate - click the ‘always trust this site’ box, and it’ll stop.

Bot::BasicBot 0.7

  • Updates for new PoCo::IRC
  • No longer does 2 server connects on startup
  • The connect test doesn’t break itself by faking a connection first



I gave a talk on E4X. In a Just and Decent world, I wouldn’t have to write a blog entry on this, because there would be a nice front page to jerakeen.org that listed all the recent things I’ve done, with the option to subscribe to RSS (or whatever) feeds of various subsets. But I’ve been too lazy to write this so far, so I’ll just link to it here until I get Django to do what I want.

E4X is a lovely extension to JS (well, compared to messing with the DOM, and it’s in core, so embedded users get it too), despite its crazy inconsistent syntax and annoying brokenness in Firefox. Fortunately, I don’t have to care about web browser-based JS implementations, so I get to use it, and you don’t.

Having played around with the JavaScript string type some more, I think I understand why it acts as it does. I’m a Perl monkey normally, so I’m not used to the concept of immutable strings, but JavaScript strings are immutable. Playing with the === operator (approximately, ‘is this the same object’) gives:

js> "a" === "a";
true
js> "a" + "b" === "ab";
true
js> "ab".replace(/./, "c") === "cb";
true

but

js> new String("a") === new String("a");
false

If strings were to magically upgrade themselves to objects, they’d change behaviour - previously equivalent strings would suddenly not be equivalent. Likewise, suppose this worked:

var a = "string";
var b = "string";
a === b; // true
a.foo = 1;

Should a still be equivalent to b? If not, a clearly isn’t immutable, as we’ve changed it. But if it is, then we’ve changed b at a distance - it’s grown a foo attribute.

Still all very annoying, of course, but I understand why now.

Recently, I mentioned a peculiar difference between uneval and toSource. Specifically (using the SpiderMonkey JS console):

js> uneval("");
""
js> "".toSource();
(new String(""))

"" and new String("") are different types of objects. The first is the basic string type, and only really has a value. The second is a full Object, that happens to have a value. However, it turns out that if you treat a basic string type as an Object, say by putting ‘.’ after it in an expression, the SpiderMonkey runtime will implicitly promote the string to a String. Hence, "".toSource() promotes the string object, then calls toSource on the new String object.

Annoyingly, the String Object doesn’t hang around: it gets thrown away as soon as you’re done with it. This leads to the weird case that you can set attributes on a basic string type (because it’ll get promoted to an Object, and Objects have attributes) but they don’t stay set (because the Object you’ve set them on gets thrown away as soon as the set call finishes).
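
A small sketch of the disappearing attribute, written so it should run in any current JavaScript engine. I haven’t checked it against the SpiderMonkey build described here, so treat it as an illustration of the standard behaviour rather than a transcript. The variable names are just for the example.

```javascript
// A basic string and a String Object are different kinds of value.
var primitive = "abc";
var wrapper = new String("abc");
var kinds = [typeof primitive, typeof wrapper];

// An attribute set on a real String Object sticks.
wrapper.foo = 1;
var wrapperFoo = wrapper.foo;

// An attribute set on a basic string lands on a temporary String Object
// that is discarded immediately. Strict-mode code throws a TypeError on
// the assignment instead, so the attempt is guarded.
try {
  primitive.foo = 1;
} catch (e) {
  // strict mode: the assignment is refused outright
}
var primitiveFoo = primitive.foo; // read from a fresh temporary wrapper

console.log(kinds, wrapperFoo, primitiveFoo);
```

Either way, the attribute is gone by the time you read it back from the basic string, while the explicitly constructed String Object keeps it.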

By the way, all of this applies very specifically to the current CVS trunk SpiderMonkey. I don’t know what most web browser engines do with strings, so don’t assume this applies in, say, Internet Explorer. But I’d be interested if someone wants to find out and tell me…

On blog comment spam

Blog comment spam, the scourge of the internet. Having written yet another CMS to power jerakeen.org, I wanted comments on pages again. Django rocks hard - adding commenting was easy. And a day later, I have comment spam. Bugger.

From a purely abstract point of view, I find this interesting. There must be a spider looking for forms that look like things that can take comments. And the robots must be reasonably flexible - it’s not like my CMS is an off-the-shelf one. But from a more concrete, ‘spam bad’ point of view, it’s bloody annoying.

So begins my personal battle against spam. Others have fought this battle, but of course the downside of rolling your own site is that you can’t use anything off the shelf. My plan was to forget about trying to recognise and filter spam, and preferably to avoid moderating anything as well - I don’t want the spam to be submitted at all. And this really can’t be that hard. Unless there’s a human surfing for blogs and typing in the spam themselves, this should really just be a measure of my ability to write a Turing test. Right?

My first plan was to require a form value in the comment submission, but to not include that field in the form itself - instead, I added it with client-side JavaScript. This should stop simplistic robots, at the cost of requiring JS to be turned on in the client, which is something I’m willing to live with, frankly. Alas, it didn’t work. Clearly too simple - either there’s a human typing spam into the box, or the robot doing the work is using something like Mozilla::Mechanize that’ll do the JavaScript. Or maybe they just handle some obvious cases. After all, my ‘clever’ code was merely document.write("<input name=......

Or perhaps they figure it out once, and use a replay attack to hit every page? Not really a good assumption with hindsight, but never mind. I added prefixes to the form fields that were generated from the current time, and checked at submit time that the fields weren’t more than an hour old. This saves me from having to store state anywhere, and gains me forms that expire after a while, unless you reverse-engineer the timestamp format. But I’m premising the existence of some automated tool, perhaps with a little human interaction. I don’t need to be perfect, I merely need to be not as bad as everyone else… But no, this failed too.
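
The time-prefixed field names can be sketched roughly like this. The real check lives in the site’s server-side code and uses its own timestamp format, which isn’t shown here, so the base-36 encoding, the "t…_" prefix and the helper names stampedName and isFresh are all assumptions for the sake of the example.

```javascript
// One hour, the expiry window mentioned above.
var MAX_AGE_MS = 60 * 60 * 1000;

// Build a field name that carries its own issue time,
// e.g. "t" + base-36 milliseconds + "_comment" (format is made up).
function stampedName(field, nowMs) {
  return "t" + nowMs.toString(36) + "_" + field;
}

// At submit time, recover the issue time from the field name and
// accept it only if it is neither from the future nor over an hour old.
function isFresh(name, nowMs) {
  var match = /^t([0-9a-z]+)_/.exec(name);
  if (!match) {
    return false; // no timestamp prefix at all
  }
  var issuedMs = parseInt(match[1], 36);
  var ageMs = nowMs - issuedMs;
  return ageMs >= 0 && ageMs <= MAX_AGE_MS;
}

// Example: a form rendered now is accepted shortly afterwards,
// and rejected once more than an hour has passed.
var renderedAt = Date.now();
var fieldName = stampedName("comment", renderedAt);
console.log(fieldName, isFresh(fieldName, renderedAt + 5000),
  isFresh(fieldName, renderedAt + MAX_AGE_MS + 1));
```

Nothing is stored between rendering the form and receiving the submission, because the field name itself carries the only state. The flip side is the one already noted: anyone who works out the format can mint fresh names at will.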

Ok, so the JavaScript is too obvious. I split it up into sections, and also wrote the wrong value into the form, changing it to the right one using a regular expression later (BWHAHAHAH). At the same time (and I suspect this is the important bit) I changed the names of the fields completely. Calling them ‘name’, ‘email’ and ‘comment’ is a bit of a giveaway, really. ‘foo’, ‘bar’ and ‘baz’ they are, then. Now it should be practically impossible for an automated tool to even figure out that I accept comments. Sure, you could probably think ‘hmm, two small input fields, and a textarea, on a page that has an RSS feed’, but I’m assuming that, for 90% of the blogs out there, this isn’t needed, so no-one does it.

And yes, I’ve received no blog spam comments since I did this. On the other hand, I’ve received no normal comments either. Hope I haven’t raised the barrier too high. If the situation stays good, I may remove the client-side JavaScript requirement. Or figure out a noscript fall-back solution for people using lynx. Poor souls.

More playing with JSON and SpiderMonkey has revealed yet another incredibly annoying fact (I hate those guys). SpiderMonkey provides a lovely uneval() function that does the exact opposite of eval() - it turns JS objects into strings. It works on almost everything, and makes life very very nice. There’s also Object.toSource(), which does something similar (but not the same - try uneval("") vs "".toSource()).

But the strings that uneval produces are not valid JSON, as I had been assuming. I’d been getting steadily more worked up at all the JSON parsers in the world refusing to parse things that are clearly valid JavaScript, and eventually I went and looked at the spec, which fails to list ' as a valid string delimiter. And guess what delimiter uneval produces? Yay. So all the parsers are fine, and it’s just SpiderMonkey that’s broken.

Fortunately, MochiKit provides a nice serializeJSON() function.
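
For anyone who wants to see the delimiter problem without digging out a JSON library, here is a minimal sketch using the JSON object that current JavaScript engines ship with. That built-in is a stand-in, not something the SpiderMonkey build above provided, and isValidJSON is just a helper invented for the example.

```javascript
// Returns true if the text parses as JSON, false if the parser rejects it.
function isValidJSON(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (e) {
    return false;
  }
}

// Double-quoted strings are what the JSON spec allows.
var doubleQuoted = isValidJSON('{"name": "value"}');

// The same data with ' as the delimiter is fine as JavaScript source,
// but a conforming JSON parser refuses it.
var singleQuoted = isValidJSON("{'name': 'value'}");

// A JSON serialiser always emits double quotes.
var serialized = JSON.stringify({ name: "value" });

console.log(doubleQuoted, singleQuoted, serialized);
```

So the rule of thumb is to produce JSON with a JSON serialiser rather than with whatever happens to emit JavaScript source.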