I play World of Warcraft. Oh, the shame. But I play it because I’m in a fun guild – we do science!. Well, actually they do science. I’m still at the ‘cleaning the glassware afterwards’ stage, but a tauren can dream..
Anyway, I code. It’s what I do. So once WoLK came out and half the guild went completely insane and started chasing the really silly achievements, it was clear we were going to need an RSS feed of the things. So I built one. It’s based on the Armory, like most WoW tools, and is a complete kludge, like most of my tools. But here are my notes anyway.
The trick to scraping the Armoury is pretending to be Firefox. If you visit as a normal web browser, they serve you a traditional HTML page with some Ajax, and it’s all quite normal and boring. If you visit the armoury in firefox they return an XML document with an XSL stylesheet referenced in the header that transforms the XML into a web page. Why are they doing this? It must be a huge amount of work compared to just serving HTML, I don’t get it. Let’s ignore that. Fake a firefox user agent, and you can fetch lovely XML documents that describe things! There’s no ‘guild achievement’ page, alas, so let’s start by fetching the page that lists the people in the guild. Using Python.
import urllib, urllib2
opener = urllib2.build_opener()
# Pretend to be firefox
opener.addheaders = [ ('user-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:18.104.22.168) Gecko/20070515 Firefox/22.214.171.124') ]
url = "http://eu.wowarmory.com/guild-info.xml?r=%s&n=%s&p=1"%( urllib.quote(realm,''), urllib.quote(guild,'') )
req = urllib2.Request(url)
data = opener.open(req)
(This is the EU armoury, because that’s where I am). The armoury is a really unreliable site, so in practice I put lots more error handling round this. But error handling makes for very hard-to-read example code. The XML looks like this:
<page globalSearch="1" lang="en_us" requestUrl="/guild-info.xml">
<guildKey factionId="1" name="unassigned variable" nameUrl="unassigned+variable" realm="Nordrassil" realmUrl="Nordrassil" url="r=Nordrassil&n=unassigned+variable"/>
<members filterField="" filterValue="" maxPage="1" memberCount="66" page="1" sortDir="a">
<character achPoints="2685" class="Hunter" classId="3" gender="Male" genderId="0" level="80" name="Munchausen" race="Tauren" raceId="6" rank="0" url="r=Nordrassil&n=Munchausen"/>
<character achPoints="1175" class="Paladin" classId="2" gender="Male" genderId="0" level="80" name="Jonadin" race="Blood Elf" raceId="10" rank="1" url="r=Nordrassil&n=Jonadin"/>
I parse XML using xmltramp, because I’m very lazy and it works. I use xmltramp for all my XML parsing needs. It’s old, and there might be something better, but I don’t really care. This is a toy.
xml = xmltramp.seed( data )
toons = xml['guildInfo']['guild']['members']['character':]
That gets us a list of people in the guild. The rendered web page has pagination, but the underlying XML seems to have all characters in a single document, so no messing around fetching multiple pages here. (I’ve tried this on a guild of 350ish people. Maybe it paginates beyond that. Don’t use this script on a guild that big, it won’t make you happy.)
Alas, the next thing we have to do is loop over every character and fetch their achievements page (that’s why you shouldn’t run this script over a large guild). This is extremely unpleasant and slow.
for character in toons:
char_url = "http://eu.wowarmory.com/character-achievements.xml?r=%s&n=%s"%( urllib.quote(realm,''), urllib.quote(character('name'),'') )
char_req = urllib2.Request(char_url)
char_data = opener.open(char_req)
char_xml = xmltramp.seed( char_data )
The achievement XML looks like this:
<achievement categoryId="168" dateCompleted="2009-02-08+01:00" desc="Defeat Shade of Eranikus." icon="inv_misc_coin_07" id="641" points="10" title="Sunken Temple"/>
<achievement categoryId="168" dateCompleted="2009-01-31+01:00" desc="Defeat the bosses in Gundrak." icon="achievement_dungeon_gundrak_normal" id="484" points="10" title="Gundrak"/>
<achievement categoryId="155" dateCompleted="2009-01-31+01:00" desc="Receive a Coin of Ancestry." icon="inv_misc_elvencoins" id="605" points="10" title="A Coin of Ancestry"/>
My biggest annoyance here is that there’s no timestamp on these things better than ‘day’, so you don’t get very good ordering when you combine them later. I could solve this by storing some state myself, remembering the first time I see each new entry, etc, etc, but I’m trying to avoid keeping any state here, so I don’t do that. The XML also lists only 5 achievements per character, and getting more involves fetching a lot more pages, so the final feed includes only the 5 most recent achievements per character. Again, something I could solve with local storage.
Anyway, now I have a list of everyone in the guild, and their last 5 achievements. It’s pretty trivial building a list of these and outputting Atom or something. I do it using ‘print’ statements, myself, because I’m inherently evil. You can’t deep-link to the achievement itself on the Armoury, so I link to the wowhead page for individual achievements.
Because the Armoury is unreliable, and my script is slow, I don’t use this thing to generate the feed on demand. I have a crontab call the script once an hour, and if it doesn’t explode, it copies the result into a directory served by my web browser. If it does explode, then meh, I’ll try again in an hour. The feed isn’t exactly timely, but we’re not controlling nuclear power stations here, we’re tracking a computer game. It’ll do.
The code I actually run to generate the feed can be found in my repository here, and the resulting feed (assuming you care, which you shouldn’t, you’re not in the guild..) is here. feel free to steal the code and do your own guild feeds.