Automatically activating a Python virtualenv

I have a lot of different python projects in a lot of different directories, and like them all to have their own virtualenv. Because I can’t even be bothered to type a single line of code to activate them, I’ve ended up with this (slightly insane) setup:

Every project folder has a ./venv/, which is where I keep the virtualenv. Then, in my .bash_profile, I have the following snippet:

__activate_venv() {
  if [ -f ./venv/bin/activate ]; then
    # this directory has a virtualenv in it: activate it
    . ./venv/bin/activate
    hash -r
  elif type deactivate >/dev/null 2>&1; then
    # no virtualenv here, but one is still active: deactivate it
    deactivate
    hash -r
  fi
}
export PROMPT_COMMAND="__activate_venv"

In short – if the current directory has a venv/bin/activate script, then run it. Otherwise, if there’s something called deactivate that I can call, then do so.

So whenever I cd into a folder that has a ./venv/, it activates, and whenever I leave, it deactivates. This is probably insane.

flickrgram

Make a simple photostream of your Flickr contacts’ photos

Instagram does two things. Firstly, and most obviously, it takes photos, sticks a filter on them, and uploads them. But it also has a social aspect to it; it gives you a single, very simple view of the photos of your friends – a flat list in reverse upload order.

When I first started using Instagram, I just turned on the ‘post to flickr’ option for all of my uploads and didn’t think about it again. But that stream has a certain fascination to it – photos are uploaded one at a time, in the expectation that people will look at them very soon, and they’re fleeting – there’s no homepage for your past photos. You link to them or they go. People only upload things that they think people want to look at.

Combine this with an app like Carousel so that I don’t have to keep waking up the phone to see things, and I’ve started to quite like the ambient pictures that it gives me.

I haven’t decided if any of this is Meaningful yet, however. These photos are interestingly ephemeral. I’m not ready to decide if transience is a useful property, but I keep coming back to Jason Scott:

[..] if someone gives you an amazing Moon Laser and the Moon Laser lets you put words on the side of the moon, the fact that the Moon Laser’s effects wear off after a day or so isn’t that big a deal, and really, whatever you probably put on the side of the Moon with your Moon Laser is probably pretty shallow stuff along the lines of “WOW THIS IS COOL” and “FUCK MARS”. (Again, to belabor, a historian or anthropologist might be into what people, given their Moon Laser, chose to write, but that’s not your problem).

To hedge against this, I still upload all of mine to Flickr (except when I forget to press the button grr defaults) and I have the phone set up to store all the pre-filtered high-quality versions of the photographs (there’s a minor issue here that instagrammed photos have crap exif – this may bite me later. We will see).

Anyway, there’s nothing inherent to Instagram-the-application about any of this. This is just the model that the software encourages. As an experiment, I threw together something that I’ve called (for now) flickrgram. It emulates a similar thing for your Flickr photostream – stuff your contacts have posted, in reverse upload order, with a tiny bit of metadata wrapped round it. It’s formatted for iPhone, because I like the portability, but scale it up a bit and it’s a perfectly decent desktop interface (I’m using high-resolution images for the retina display, so it still looks good).

And here’s the conclusion I, and some people I know, have come to — it doesn’t work as well, because people don’t use Flickr the same way as they use Instagram. This isn’t entirely unexpected — several people have mentioned that they put photos into Flickr more for archive and storage than for sharing. I know plenty of people who will upload massive batches of photos (hundreds at once), entirely swamping everyone else. (Luckily, the API call I’m using to populate flickrgram returns at most 5 photos per user, so I’m defended against this. But only by accident.)
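For the curious, the shape of the thing is roughly this. It’s a sketch rather than flickrgram’s actual code (I’m not claiming this exact call is the one with the five-photo cap), and API_KEY and USER_ID are placeholders:

import json, urllib

# flickr.photos.getContactsPublicPhotos returns recent public photos
# from a user's contacts. Fill in your own API key and NSID.
API_KEY = 'your-api-key'
USER_ID = 'your-user-nsid'

params = urllib.urlencode({
    'method': 'flickr.photos.getContactsPublicPhotos',
    'api_key': API_KEY,
    'user_id': USER_ID,
    'format': 'json',
    'nojsoncallback': 1,
})
data = json.load(urllib.urlopen('http://api.flickr.com/services/rest/?' + params))
for photo in data['photos']['photo']:
    print photo['owner'], '-', photo['title']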

The interesting thing to me is that these models — “shoeboxing” versus Instagram-style “lifestreaming” — are two entirely different usage models for a photo-sharing site. Flickr was built for the streaming case (the photostream is the main thing you see), but recently the shoeboxing is rather swamping the streaming, and the two models just can’t coexist in the same contacts list – the uploads of the shoeboxers will swamp the incoming streams of the people who just want to follow streamers. Instagram, on the other hand, by utterly ignoring the needs of shoeboxers, has been able to build a much better streaming experience.

It reminds me of Twitter, where the same thing has happened. The high-volume broadcast / at-reply people drown out the ambient “eating a sandwich” people whose updates I quite liked getting.

Two things interest me about this. Firstly, is this ‘streaming only’ interface convention something that’s going to hurt the Instagram streamers in the future? Are they going to realise in a few years that they’ve not built up any meaningful history in this service? When they want a photo they remember taking, and can’t get it, will there be pain? Or will no-one care?

The other question is: can you get any money out of streamers? Shoeboxers want reliably stored photos, safe URLs, lots of upload bandwidth – all things you can charge money for. They can’t easily drift between services, because they have all their data in this one. Streamers don’t care. They haven’t got any history, and as long as the streaming app will push into Facebook, who cares what the backend is? They’ll change apps just because the new one has a better filter.

Maybe the streaming experience that Flickr provides is as good as you can get it, because you have to pander to the people who want an archive, or you can’t make proper money.

Anyway. Go play with flickrgram.

Hosting toy Rails and Django apps using Passenger

I like writing small self-contained applications, and I like writing them using nice high-level application frameworks like Django and Rails. Alas, I also like being able to run these services for the foreseeable future, and that’s a lot harder than writing them. Running a single Rails or Django application consumes an appreciable chunk of the memory on my tiny colo, and I currently have about 5 projects I really want running all the time (this could easily grow to 50 if I had a sufficiently good way of hosting them). Ideally, I’d never stop hosting these things. Otherwise what’s the point?

It’s sometimes tempting to just write all my toys in PHP. I’m certain that PHP has the mind-share it does primarily because it’s so incredibly easy to deploy. Ease of development is utterly trumped by ease of deployment for anything not written for internal use at a large company. tar is easier to use than mongrel, so there are more deployed PHP apps than Rails apps. But I’m not that desperate. I like my nice frameworks.

I tried Heroku as an external host for my apps for a bit, and it’s great. Very easy to start things, very easy to leave them up, and the free hosting plan is perfectly adequate for your average web application. Alas, there are a couple of rough edges that only became apparent after using them for a few weeks. Firstly, they want to charge me for using custom domains, and I’m not willing to park my apps on domains that don’t belong to me. Secondly, their service goes through odd periods of 500 errors. That doesn’t bother me in itself – what does bother me is that there is no official reaction to any of the complaints about it on what seems to be the official mailing list. Finally, quite a lot of the things I do need cron scripts, for polling services and the like, and the Heroku crons (a) aren’t very reliable, in my experience, and (b) cost money. So I’m edging away from them. I’d still recommend them for prototyping; I’m not sure I’d want to host anything Real there just yet.

(An aside – I’m not unwilling to pay any money at all. I will happily pay money for things that matter. But these apps are toys. The average number of users they have is ‘1’. I’m not willing to pay a fiver a month per application to be able to host them on my domain rather than Heroku’s domain. A fiver a month for all of them at once? Sure. But the Heroku payment model assumes that you have a small number of apps that you care about, rather than a large number of apps that you don’t.)

Anyway, my current attempt at solving this problem is Phusion Passenger (via mattb), which does exactly what I want, for Rails apps. It’s an Apache 2 or nginx module, and it’s trivially easy to install – unless you’re using Debian, which I am. Short version? It was a lot easier to totally ignore the Debian packaging system except to install Ruby, then build rubygems and everything else I needed from source. Sigh. I understand there are horrible philosophical differences underlying this pain. But it’s still pain.

Once installed, you can just point your domain’s DocumentRoot at a Rails app’s ‘public’ folder, and the Right Thing happens – files in public are served directly, and other requests start a Rails process that serves your app. Enough idle time, and it’ll shut down again. Magic. My favourite part is that it starts the application server as the user who owns your application’s ‘environment.rb’ file, meaning that your app runs as your user, and can do things like write files into temp folders that don’t need sudo to delete again.
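So a minimal vhost for a Rails app is about this small (a sketch; the hostname and paths here are made up):

<VirtualHost *:80>
  ServerName toy.example.org
  # Point at the app's public folder. Passenger notices the Rails app
  # one level up and handles everything else.
  DocumentRoot /home/tomi/apps/toy/public
</VirtualHost>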

Not all of my projects are Rails apps, though. jerakeen.org is a Django app, for instance (this week, anyway). Unexpectedly, it turns out that Passenger will do the same thing for Django apps, though it’s not as well documented. I have a file called passenger_wsgi.py in the root of my Django application folder. It looks something like this (if you use this, you’ll need to change the settings module name):

import sys, os

# Make sure the directory containing this file is on the Python path,
# so that the settings module below can be imported.
current_dir = os.path.dirname( os.path.abspath( __file__ ) )
sys.path.append( current_dir )

# Change this to the name of your own settings module.
os.environ['DJANGO_SETTINGS_MODULE'] = 'mydjango.settings'

import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()

And in my Apache config file, I have this:

<VirtualHost *:80>
  ServerName jerakeen.org
  ...
  DocumentRoot /home/tomi/web/jerakeen.org
  PassengerAppRoot /home/tomi/svn/Projects/mydjango
</VirtualHost>

and thus are all my toy projects now brought up and down on demand. I’m happy again. Till next week, probably. ONWARDS.

Warcraft guild achievements as RSS

I play World of Warcraft. Oh, the shame. But I play it because I’m in a fun guild – we do science! Well, actually they do science. I’m still at the ‘cleaning the glassware afterwards’ stage, but a tauren can dream…

Anyway, I code. It’s what I do. So once WotLK came out and half the guild went completely insane and started chasing the really silly achievements, it was clear we were going to need an RSS feed of the things. So I built one. It’s based on the Armory, like most WoW tools, and is a complete kludge, like most of my tools. But here are my notes anyway.

The trick to scraping the Armory is pretending to be Firefox. If you visit as a normal web browser, they serve you a traditional HTML page with some Ajax, and it’s all quite normal and boring. If you visit the Armory in Firefox, they return an XML document with an XSL stylesheet referenced in the header that transforms the XML into a web page. Why do they do this? It must be a huge amount of work compared to just serving HTML; I don’t get it. Let’s ignore that. Fake a Firefox user agent, and you can fetch lovely XML documents that describe things! There’s no ‘guild achievement’ page, alas, so let’s start by fetching the page that lists the people in the guild. Using Python.

import urllib, urllib2

realm = "Nordrassil"           # your realm here
guild = "unassigned variable"  # your guild name here

opener = urllib2.build_opener()
# Pretend to be Firefox, so the Armory serves XML rather than HTML
opener.addheaders = [ ('user-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4') ]

url = "http://eu.wowarmory.com/guild-info.xml?r=%s&n=%s&p=1"%( urllib.quote(realm,''), urllib.quote(guild,'') )
req = urllib2.Request(url)
data = opener.open(req)

(This is the EU Armory, because that’s where I am.) The Armory is a really unreliable site, so in practice I put lots more error handling round this. Error handling makes for very hard-to-read example code, but the general shape is a small retry wrapper, something like this (a sketch, not what the real script does):
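import time

def fetch(url, tries=3):
    # The Armory falls over a lot; retry a few times before giving up.
    # Uses the Firefox-impersonating opener from the snippet above.
    for attempt in range(tries):
        try:
            return opener.open(urllib2.Request(url))
        except urllib2.URLError:
            if attempt == tries - 1:
                raise
            time.sleep(5)

Anyway. The XML looks like this: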

<page globalSearch="1" lang="en_us" requestUrl="/guild-info.xml">
  <guildKey factionId="1" name="unassigned variable" nameUrl="unassigned+variable" realm="Nordrassil" realmUrl="Nordrassil" url="r=Nordrassil&amp;n=unassigned+variable"/>
  <guildInfo>
    <guild>
      <members filterField="" filterValue="" maxPage="1" memberCount="66" page="1" sortDir="a">
        <character achPoints="2685" class="Hunter" classId="3" gender="Male" genderId="0" level="80" name="Munchausen" race="Tauren" raceId="6" rank="0" url="r=Nordrassil&amp;n=Munchausen"/>
        <character achPoints="1175" class="Paladin" classId="2" gender="Male" genderId="0" level="80" name="Jonadin" race="Blood Elf" raceId="10" rank="1" url="r=Nordrassil&amp;n=Jonadin"/>
        ...

I parse XML using xmltramp, because I’m very lazy and it works. I use xmltramp for all my XML parsing needs. It’s old, and there might be something better, but I don’t really care. This is a toy.

import xmltramp

# Parse the roster and pull out the list of <character/> elements.
xml = xmltramp.seed( data )
toons = xml['guildInfo']['guild']['members']['character':]

That gets us a list of people in the guild. The rendered web page has pagination, but the underlying XML seems to have all the characters in a single document, so there’s no messing around fetching multiple pages here. (I’ve tried this on a guild of 350ish people. Maybe it paginates beyond that. Don’t use this script on a guild that big; it won’t make you happy.)

Alas, the next thing we have to do is loop over every character and fetch their achievements page (that’s why you shouldn’t run this script over a large guild). This is extremely unpleasant and slow.

for character in toons:
    # One request per guild member – this is the slow part.
    char_url = "http://eu.wowarmory.com/character-achievements.xml?r=%s&n=%s"%( urllib.quote(realm,''), urllib.quote(character('name'),'') )
    char_req = urllib2.Request(char_url)
    char_data = opener.open(char_req)
    char_xml = xmltramp.seed( char_data )

The achievement XML looks like this:

...
<achievement categoryId="168" dateCompleted="2009-02-08+01:00" desc="Defeat Shade of Eranikus." icon="inv_misc_coin_07" id="641" points="10" title="Sunken Temple"/>
<achievement categoryId="168" dateCompleted="2009-01-31+01:00" desc="Defeat the bosses in Gundrak." icon="achievement_dungeon_gundrak_normal" id="484" points="10" title="Gundrak"/>
<achievement categoryId="155" dateCompleted="2009-01-31+01:00" desc="Receive a Coin of Ancestry." icon="inv_misc_elvencoins" id="605" points="10" title="A Coin of Ancestry"/>
...

My biggest annoyance here is that there’s no timestamp on these things better than ‘day’, so you don’t get very good ordering when you combine them later. I could solve this by storing some state myself, remembering the first time I see each new entry, etc, etc, but I’m trying to avoid keeping any state here, so I don’t do that. The XML also lists only 5 achievements per character, and getting more involves fetching a lot more pages, so the final feed includes only the 5 most recent achievements per character. Again, something I could solve with local storage.

Anyway, now I have a list of everyone in the guild, and their last 5 achievements. It’s pretty trivial to build a list of these and output Atom or something. I do it using ‘print’ statements, myself, because I’m inherently evil. You can’t deep-link to the achievement itself on the Armory, so I link to the wowhead page for individual achievements.
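Roughly like this, say. This is a sketch rather than the real script: assume entries is a list of (date, character name, achievement title, achievement id) tuples collected in the loop above, and double-check the wowhead URL format before trusting it:

import cgi

# Newest first; the dates are ISO-ish strings, so a plain sort works.
entries.sort(reverse=True)

print '<?xml version="1.0" encoding="utf-8"?>'
print '<feed xmlns="http://www.w3.org/2005/Atom">'
print '<title>Guild achievements</title>'
for date, name, title, ach_id in entries[:50]:
    print '<entry>'
    print '  <title>%s: %s</title>' % (cgi.escape(name), cgi.escape(title))
    # No deep links into the Armory, so point at wowhead instead.
    print '  <link href="http://www.wowhead.com/?achievement=%s"/>' % ach_id
    print '  <updated>%sT00:00:00Z</updated>' % date[:10]
    print '</entry>'
# (A real feed also wants <id> and <author> elements; omitted here.)
print '</feed>'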

Because the Armory is unreliable, and my script is slow, I don’t use this thing to generate the feed on demand. I have a crontab call the script once an hour, and if it doesn’t explode, it copies the result into a directory served by my web server. If it does explode, then meh, I’ll try again in an hour. The feed isn’t exactly timely, but we’re not controlling nuclear power stations here, we’re tracking a computer game. It’ll do.

The code I actually run to generate the feed can be found in my repository here, and the resulting feed (assuming you care, which you shouldn’t, you’re not in the guild..) is here. Feel free to steal the code and do your own guild feeds.

Things I learned at DJUGL

I went to DJUGL (pronounced ‘juggle’) yesterday, to watch tech talks and say hello to people. I learned the following things: (I know! Learning things! At a tech talk!)

  • IPython, an improved Python shell. Does tab-completion, amongst other things. The Django ‘shell’ command will use it automatically if it’s installed.

  • The SEND_BROKEN_LINK_EMAILS setting – sends mail to the addresses listed in the MANAGERS config variable whenever the Django server serves a 404. Not something I particularly want to turn on, but I liked it. I also like the way Django will send mail on every server error. The absolute fastest way to get live crash bugs fixed is to mail all the developers every time they happen. (There’s a sketch of the relevant settings after this list.)

  • There was some cool middleware that displayed profiling information. Must use it in something.

  • The django-tagging application bears looking into at some point.
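As promised: the broken-link mails from the second bullet want something like this in settings.py. The setting names are real, at least in Django of this vintage; the name and address are made up:

# settings.py
SEND_BROKEN_LINK_EMAILS = True

# 404 reports go to MANAGERS; server-error mails go to ADMINS.
MANAGERS = (
    ('Tom', 'tom@example.com'),
)
ADMINS = MANAGERS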

Simon has talk notes up.

Sanitising comments with Python

As is my wont, I’m in the middle of porting jerakeen.org to another back-end. This time, I’m porting it back to the Django-based Python version (it’s been written in Rails for a few months now). It’s grown a few more features, and one of them is somewhat smarter comment parsing.

This being a vaguely technical blog, I have vaguely technical people leaving comments. And most of them want to be able to use HTML. I’ve seen blogs that allow Markdown in comments, but I hate that – unless you know you’re writing it, it’s too easy for Markdown to do things like eat random underscores and italicise the rest of the sentence by accident. But at the same time, I need to let people who just want to type plain text leave comments.

The trick then is to turn plain text into HTML, but also allow some HTML through. Because the world is a nasty place, this means whitelisting based on tags and attributes, rather than removing known-to-be-nasty things. The ‘turn plain text into HTML’ part is easy; it’s roughly this:
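import cgi

# A sketch of the plain-text-to-HTML step: escape everything, then
# reinstate paragraph and line breaks. The real version is fancier.
def text_to_html(text):
    paragraphs = cgi.escape(text).split('\n\n')
    return '\n'.join('<p>%s</p>' % p.replace('\n', '<br/>') for p in paragraphs)

More interestingly, here’s how I use BeautifulSoup to sanitise HTML comments, permitting only a subset of allowed tags and attributes: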

import re
from BeautifulSoup import BeautifulSoup

# Assume some evil HTML is in 'evil_html'

# allow these tags. Other tags are removed, but their child elements remain
whitelist = ['blockquote', 'em', 'i', 'img', 'strong', 'u', 'a', 'b', "p", "br", "code", "pre" ]

# allow only these attributes on these tags. No other tags are allowed any attributes.
attr_whitelist = { 'a':['href','title','hreflang'], 'img':['src', 'width', 'height', 'alt', 'title'] }

# remove these tags, complete with contents.
blacklist = [ 'script', 'style' ]

attributes_with_urls = [ 'href', 'src' ]

# BeautifulSoup is catching out-of-order and unclosed tags, so markup
# can't leak out of comments and break the rest of the page.
soup = BeautifulSoup(evil_html)

# now strip HTML we don't like.
for tag in soup.findAll():
    if tag.name.lower() in blacklist:
        # blacklisted tags are removed in their entirety
        tag.extract()
    elif tag.name.lower() in whitelist:
        # tag is allowed. Make sure all the attributes are allowed.
        # iterate over a copy of the attribute list, because we
        # remove entries from the real one as we go
        for attr in list(tag.attrs):
            # allowed attributes are whitelisted per-tag
            if tag.name.lower() in attr_whitelist and attr[0].lower() in attr_whitelist[ tag.name.lower() ]:
                # some attributes contain urls..
                if attr[0].lower() in attributes_with_urls:
                    # ..make sure they're nice urls
                    if not re.match(r'(https?|ftp)://', attr[1].lower()):
                        tag.attrs.remove( attr )

            else:
                # not a whitelisted attribute. Remove it.
                tag.attrs.remove( attr )
    else:
        # not a whitelisted tag. I'd like to remove it from the tree
        # and replace it with its children. But that's hard. It's much
        # easier to just replace it with an empty span tag.
        tag.name = "span"
        tag.attrs = []

# stringify back again
safe_html = unicode(soup)

# HTML comments can contain executable scripts, depending on the browser, so we'll
# be paranoid and just get rid of all of them
# e.g. <!--[if lt IE 7]><script type="text/javascript">h4x0r();</script><![endif]-->
# TODO - I rather suspect that this is the weakest part of the operation..
safe_html = re.sub(r'(?s)<!--.*?-->', '', safe_html)
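To make the behaviour concrete, here’s the sort of thing it does to a nasty (made-up) comment:

evil_html = ('<p onmouseover="h4x()">hi<script>steal()</script>'
             '<a href="javascript:alert(1)">link</a></p>')

# After running the code above, safe_html comes out roughly as:
#   <p>hi<a>link</a></p>
# The script tag is gone entirely, onmouseover isn't a whitelisted
# attribute, and the javascript: href fails the URL check.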

It’s based on an Hpricot HTML sanitizer that I’ve used in a few things.

Update 2008-05-23: My thanks to Paul Hammond and Mark Fowler, who pointed me at all manner of nasty things (such as javascript: URLs) that I didn’t handle very well. I now also whitelist allowed URIs. I should also point out the test suite I use – all code needs tests!

A usable Shelf release

Right, Shelf has now reached version 0.0.6 – download it (there are newer versions out now – get those). It’s good enough that I’m running it full time now. Thanks to Mark Fowler, it can now pull clues from Firefox, which is a relief. I’ve also added Address Book and iChat support, although the iChat stuff is a little hokey – it assumes you’re not using tabbed chats, and that you speak English. Sorry. The iChat AppleScript dictionary is lousy.

Musings

It’s been suggested that I could work out Twitter feed and Flickr photostream URLs for people based on their name / nick / email. I’m currently shying away from deriving too many things about a person magically. For instance, I could work out (and cache, obviously) a Flickr username for a person from their email address. Quite apart from the horrible privacy implications of sending the email addresses of everyone you read mail from to Flickr, I just don’t like the approach. I’d much rather encourage a rich address book with lots of data in it. This has the side-effect that Shelf will also recognise my Flickr page as belonging to me.

DuckCall 0.0.3

DuckCall didn’t work under Leopard. No-one really noticed, so I assume no-one uses it. Which is probably a Good Thing. But if you were sitting on the edge of your seat, waiting for a compatibility release, you can now relax. DuckCall-0.0.3.zip is now available.

It’s also 80k zipped, as opposed to the 3 megs of version 0.0.2. Hurray for bundled PyObjC. This means that this version will only work under Leopard. But there are no other changes between it and 0.0.2, so all you laggards don’t need to feel left out.