Stupid Feed Tricks

Ok, so RSS feeds turn out to be even more amazingly broken than I’d assumed that they were. I’d like to pull out some favourites, but they’re all amazing and you should read the whole list.

Brent does point out another horrible failure case, which is hotel / airport / etc gateways that hijack your HTTP requests and redirect you somewhere else. This is so annoying. Not that I have an alternative. Does lead to some nasty app failure cases, though.

When a feed reader gets a permanent redirect, it’s supposed to take that to mean: “Hey, the feed moved. It’s over here now. Save the new URL and use the new one from now on.”

And if you don’t do that in your reader, and your feed reader is popular enough, smart people who quite rightly care about proper behavior will call you out. You have to do that.

Google Reader never used to do that. It would drive me crazy, because I move my feed around a lot. (I’m crazy). I guess this is why.
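A rough sketch of how a reader might follow the permanent-redirect rule without getting burned by a hijacking gateway: only remember the new URL when the redirect is permanent and whatever came back still looks like a feed. The function names and the "sniff the first few hundred characters" heuristic are assumptions made for illustration, not how any particular reader works. (Python 3, standard library only.)

```python
from urllib.parse import urljoin

PERMANENT = {301, 308}   # "the feed moved, remember the new URL"


def looks_like_feed(body):
    """Crude sniff: does the start of the response look like RSS, Atom or RDF?"""
    head = body.lstrip()[:512].lower()
    return any(tag in head for tag in ("<rss", "<feed", "<rdf:rdf"))


def url_to_store(stored_url, status, location, target_body):
    """Return the URL the reader should remember after one fetch."""
    if status not in PERMANENT or not location:
        return stored_url                    # temporary redirect, or none at all
    if not looks_like_feed(target_body):
        return stored_url                    # probably a login page, not a feed
    return urljoin(stored_url, location)     # resolve a relative Location header
```

A temporary redirect (302, 303, 307) still gets followed for that one fetch; it just never changes what is saved.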

Most days, I’m mildly astonished that the internet actually works.

Warcraft guild achievements as RSS

I play World of Warcraft. Oh, the shame. But I play it because I’m in a fun guild – we do science! Well, actually they do science. I’m still at the ‘cleaning the glassware afterwards’ stage, but a tauren can dream.

Anyway, I code. It’s what I do. So once WotLK came out and half the guild went completely insane and started chasing the really silly achievements, it was clear we were going to need an RSS feed of the things. So I built one. It’s based on the Armoury, like most WoW tools, and is a complete kludge, like most of my tools. But here are my notes anyway.

The trick to scraping the Armoury is pretending to be Firefox. If you visit as a normal web browser, they serve you a traditional HTML page with some Ajax, and it’s all quite normal and boring. If you visit the Armoury in Firefox, they return an XML document with an XSL stylesheet referenced in the header that transforms the XML into a web page. Why are they doing this? It must be a huge amount of work compared to just serving HTML. I don’t get it. Let’s ignore that. Fake a Firefox user agent, and you can fetch lovely XML documents that describe things! There’s no ‘guild achievement’ page, alas, so let’s start by fetching the page that lists the people in the guild. Using Python.

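import urllib, urllib2

# Example values, taken from the sample XML further down
realm = "Nordrassil"
guild = "unassigned variable"

opener = urllib2.build_opener()
# Pretend to be Firefox
opener.addheaders = [ ('user-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4') ]
# Realm and guild names can contain spaces, so URL-quote them
url = "http://eu.wowarmory.com/guild-info.xml?r=%s&n=%s&p=1"%( urllib.quote(realm,''), urllib.quote(guild,'') )
req = urllib2.Request(url)
data = opener.open(req)  # file-like response object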

(This is the EU armoury, because that’s where I am). The armoury is a really unreliable site, so in practice I put lots more error handling round this. But error handling makes for very hard-to-read example code. The XML looks like this:

<page globalSearch="1" lang="en_us" requestUrl="/guild-info.xml">
  <guildKey factionId="1" name="unassigned variable" nameUrl="unassigned+variable" realm="Nordrassil" realmUrl="Nordrassil" url="r=Nordrassil&amp;n=unassigned+variable"/>
  <guildInfo>
    <guild>
      <members filterField="" filterValue="" maxPage="1" memberCount="66" page="1" sortDir="a">
        <character achPoints="2685" class="Hunter" classId="3" gender="Male" genderId="0" level="80" name="Munchausen" race="Tauren" raceId="6" rank="0" url="r=Nordrassil&amp;n=Munchausen"/>
        <character achPoints="1175" class="Paladin" classId="2" gender="Male" genderId="0" level="80" name="Jonadin" race="Blood Elf" raceId="10" rank="1" url="r=Nordrassil&amp;n=Jonadin"/>
        ...

I parse XML using xmltramp, because I’m very lazy and it works. I use xmltramp for all my XML parsing needs. It’s old, and there might be something better, but I don’t really care. This is a toy.

import xmltramp
xml = xmltramp.seed( data )
toons = xml['guildInfo']['guild']['members']['character':]

That gets us a list of people in the guild. The rendered web page has pagination, but the underlying XML seems to have all characters in a single document, so no messing around fetching multiple pages here. (I’ve tried this on a guild of 350ish people. Maybe it paginates beyond that. Don’t use this script on a guild that big, it won’t make you happy.)
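If you don't have xmltramp around, the same member list can be pulled out with the standard library. This is a sketch in Python 3 using ElementTree on a trimmed copy of the sample document above (most attributes dropped for brevity); `guild_members` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample guild-info document shown earlier
SAMPLE = """<page globalSearch="1" lang="en_us" requestUrl="/guild-info.xml">
  <guildInfo>
    <guild>
      <members maxPage="1" memberCount="66" page="1">
        <character achPoints="2685" class="Hunter" level="80" name="Munchausen"/>
        <character achPoints="1175" class="Paladin" level="80" name="Jonadin"/>
      </members>
    </guild>
  </guildInfo>
</page>"""


def guild_members(xml_text):
    """Return one dict of attributes per <character> element."""
    root = ET.fromstring(xml_text)   # root is the <page> element
    path = "./guildInfo/guild/members/character"
    return [dict(c.attrib) for c in root.findall(path)]


members = guild_members(SAMPLE)
names = [m["name"] for m in members]
```

All attribute values come back as strings, so `achPoints` and `level` need an `int()` if you want to sort on them.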

Alas, the next thing we have to do is loop over every character and fetch their achievements page (that’s why you shouldn’t run this script over a large guild). This is extremely unpleasant and slow.

for character in toons:
    char_url = "http://eu.wowarmory.com/character-achievements.xml?r=%s&n=%s"%( urllib.quote(realm,''), urllib.quote(character('name'),'') )
    char_req = urllib2.Request(char_url)
    char_data = opener.open(char_req)
    char_xml = xmltramp.seed( char_data )

The achievement XML looks like this:

...
<achievement categoryId="168" dateCompleted="2009-02-08+01:00" desc="Defeat Shade of Eranikus." icon="inv_misc_coin_07" id="641" points="10" title="Sunken Temple"/>
<achievement categoryId="168" dateCompleted="2009-01-31+01:00" desc="Defeat the bosses in Gundrak." icon="achievement_dungeon_gundrak_normal" id="484" points="10" title="Gundrak"/>
<achievement categoryId="155" dateCompleted="2009-01-31+01:00" desc="Receive a Coin of Ancestry." icon="inv_misc_elvencoins" id="605" points="10" title="A Coin of Ancestry"/>
...

My biggest annoyance here is that there’s no timestamp on these things better than ‘day’, so you don’t get very good ordering when you combine them later. I could solve this by storing some state myself, remembering the first time I see each new entry, etc, etc, but I’m trying to avoid keeping any state here, so I don’t do that. The XML also lists only 5 achievements per character, and getting more involves fetching a lot more pages, so the final feed includes only the 5 most recent achievements per character. Again, something I could solve with local storage.
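For the curious, the "remember when I first saw it" idea might look something like this if state were allowed. It is a sketch only, and not what the script does: the JSON file, the `character/id` key format and the function names are all assumptions. (Python 3.)

```python
import json


def stamp_first_seen(seen, achievements, now):
    """Record `now` for any (character, achievement id) pair not seen before.

    `seen` maps "character/id" keys to the time each was first noticed.
    Returns the achievements sorted newest-first by that first-seen time.
    """
    def key(ach):
        return "%s/%s" % (ach["character"], ach["id"])

    for ach in achievements:
        seen.setdefault(key(ach), now)   # keep the earliest time we noticed it
    return sorted(achievements, key=lambda a: seen[key(a)], reverse=True)


def load_seen(path):
    """Load the first-seen table, or start empty if it is missing or corrupt."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except (IOError, ValueError):
        return {}


def save_seen(path, seen):
    with open(path, "w") as fh:
        json.dump(seen, fh)
```

Run hourly, that gives roughly hour-level ordering instead of day-level, at the cost of a state file that has to survive between runs.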

Anyway, now I have a list of everyone in the guild, and their last 5 achievements. It’s pretty trivial building a list of these and outputting Atom or something. I do it using ‘print’ statements, myself, because I’m inherently evil. You can’t deep-link to the achievement itself on the Armoury, so I link to the wowhead page for individual achievements.
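For anyone less evil, here is a sketch of building one Atom entry with ElementTree instead of print statements, which at least gets the escaping right for free. Assumptions, loudly: the character name is a placeholder, the `tag:` id scheme is invented, and the link template is a stand-in because the exact wowhead URL format isn't given here. The day-only `dateCompleted` value is padded out to midnight, since Atom wants a full timestamp. (Python 3.)

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def atom_timestamp(date_completed):
    """Turn a day-only '2009-02-08+01:00' into an RFC 3339 date-time."""
    day, offset = date_completed[:10], date_completed[10:] or "Z"
    return "%sT00:00:00%s" % (day, offset)


def atom_entry(character, ach, link_template):
    """Build an <entry> element for one achievement of one character."""
    def child(name, text=None, **attrs):
        el = ET.SubElement(entry, "{%s}%s" % (ATOM_NS, name), **attrs)
        el.text = text
        return el

    entry = ET.Element("{%s}entry" % ATOM_NS)
    child("title", "%s: %s" % (character, ach["title"]))
    # Invented id scheme; any stable, unique URI would do
    child("id", "tag:example.invalid,2009:%s/%s" % (character, ach["id"]))
    child("updated", atom_timestamp(ach["dateCompleted"]))
    child("summary", ach["desc"])
    child("link", href=link_template % ach["id"])
    return entry


# One achievement from the sample XML above; character name is a placeholder
ach = {"id": "641", "title": "Sunken Temple",
       "desc": "Defeat Shade of Eranikus.", "dateCompleted": "2009-02-08+01:00"}
entry = atom_entry("Examplechar", ach, "http://achievements.example/%s")
xml_text = ET.tostring(entry, encoding="unicode")
```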

Because the Armoury is unreliable, and my script is slow, I don’t use this thing to generate the feed on demand. I have a crontab call the script once an hour, and if it doesn’t explode, it copies the result into a directory served by my web server. If it does explode, then meh, I’ll try again in an hour. The feed isn’t exactly timely, but we’re not controlling nuclear power stations here, we’re tracking a computer game. It’ll do.
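The "only publish if it didn't explode" step can be sketched like so: build the feed, write it to a temporary file next to the destination, then rename it into place so the web server never sees a half-written file. `publish` and `build_feed` are hypothetical names, and this is one way to do it rather than what the real script does. (Python 3.)

```python
import os
import tempfile


def publish(build_feed, dest_path):
    """Call build_feed(); replace dest_path only if it returned without raising."""
    try:
        feed_text = build_feed()
    except Exception:
        return False                      # leave last hour's feed in place
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temp file in the same directory, so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(feed_text)
    os.chmod(tmp_path, 0o644)             # mkstemp creates files as 0600
    os.replace(tmp_path, dest_path)       # swaps the new file in, in one step
    return True
```

Cron then just runs the script hourly; a failed run costs nothing but an hour of staleness.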

The code I actually run to generate the feed can be found in my repository here, and the resulting feed (assuming you care, which you shouldn’t, you’re not in the guild..) is here. Feel free to steal the code and do your own guild feeds.

Irritating RSS feed links

A side-effect of all this Google Social lunacy is that I’m seeing a lot of URLs for people that I wouldn’t normally have put in their Address Book entries. For instance, Simon Wistow’s Vox page links to his gestalt page which in turn links to his use.perl page, so I see all of these URLs in Shelf. It fetches the pages, and discovers that there’s a single RSS feed advertised on the use.perl page – http://use.perl.org/index.rss. But this RSS feed is nothing to do with Simon’s page – it’s the main use.perl article feed. Shelf doesn’t know this, of course, so Simon’s display in my Shelf window contains all recent use.perl articles.

The HTML spec seems to imply to me that rel=”alternate” links are for linking to the same content, but represented in a different way, not some completely unrelated content that happens to be hosted on the same domain. This is very annoying.
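To make the problem concrete, here is roughly what feed autodiscovery looks like from the consumer's side. It's a sketch, not Shelf's actual code: the page URL and the HTML snippet are made up, and only the feed URL comes from the example above. (Python 3.)

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        mime = (a.get("type") or "").lower()
        if "alternate" in rels and mime in FEED_TYPES and a.get("href"):
            self.feeds.append(urljoin(self.page_url, a["href"]))


def discover_feeds(page_url, html):
    finder = FeedLinkFinder(page_url)
    finder.feed(html)
    return finder.feeds


# Hypothetical user page that advertises the site-wide feed
page = '<html><head><link rel="alternate" type="application/rss+xml" href="/index.rss"></head></html>'
feeds = discover_feeds("http://use.perl.org/~someone/", page)
```

Nothing in the markup distinguishes "this person's feed" from "the whole site's feed", so the consumer has no way to tell them apart short of guessing.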

I’m picking on use.perl unreasonably here, of course. Lots of people do it. use.perl is just the first one I noticed. Followed by search.cpan.org (author modules pages have an RSS feed of the master module upload list). But there are others.

NNW subscriptions

So I wanted to see which of my NNW (NetNewsWire) subscriptions were dead. And I wanted to get the hang of AppleScript. Right.

set errorlog to ""

tell application "NetNewsWire"
  repeat with check in subscriptions
    set err to error string of check as string
    if length of err > 1 then
      set errorlog to errorlog & "Error for '" & ((display name of check) as string) & "' (" & (RSS URL of check as string) & "): '" & ((error string of check) as string) & "'" & return
    end if
  end repeat
end tell

tell application "BBEdit"
  make new text window with properties {contents:errorlog}
end tell

Pretty nifty. Course, you have to have BBEdit. But making it use TextEdit shouldn’t be hard.

referrer and agent mixup

The blogging/RSS community has discovered HTTP headers actually have a defined purpose. Amazing. It’s like when they discovered that HTTP actually allows you to see if a page has changed since you last downloaded it and not get the whole thing. That was fun, too.

Ok, that’s a little bit too bitter. But I can name one linux RSS reader that’s done the Right Thing here for months. </smug>
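For reference, the "has it changed since last time" trick is a conditional GET: send back the `ETag` and `Last-Modified` values from the previous fetch, and a well-behaved server answers 304 with no body. The sketch below also identifies the reader in `User-Agent`, which, going by the title, is where that belongs. The agent string and function names are placeholders. (Python 3.)

```python
import urllib.error
import urllib.request


def conditional_request(url, etag=None, last_modified=None,
                        agent="example-reader/0.1"):
    """Build a GET that asks the server to skip the body if nothing changed."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", agent)   # who we are goes here, not in Referer
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(req):
    """Return (body, etag, last_modified), or (None, None, None) on a 304."""
    try:
        with urllib.request.urlopen(req) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:               # not modified: keep the cached copy
            return None, None, None
        raise
```

The reader stores the two validators alongside the cached feed and passes them in on the next poll.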

sharpreader

sharpreader – a Windows RSS feed reader. Uses .NET, which is all the rage nowadays, apparently.

It’s beautiful, easily the best RSS reader I’ve ever seen, and that includes the one I wrote :-). Proper OPML export / import (it’s amazing how many readers get this wrong). The interface, although slightly hard to figure out, makes a lot of sense once you get the hang of it, and frankly usability and learning curves can go hang once I can use the thing.

The nicest feature, though, is the threading. It’ll notice which other blogs you read have linked to this one, and will do the little ‘+’ symbol thing so you can expand them and see all the interlinks. It’s niiiiiiiice. I’m suddenly tempted to go back to lectern and hack this in somehow, though it’ll be hard. Maybe I’ll write a mac one and steal the niche of NNW. Maybe I’ll write a bad alpha and get distracted by some other project. Yes, that seems to be the best idea.
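A guess at how that sort of threading could work, as a sketch: index every item you've fetched by its permalink, scan each item's HTML for links, and record which items point at which. This is an assumption about the general technique, not a description of what sharpreader does; the item dict shape (`link`, `html`) is invented. (Python 3.)

```python
from collections import defaultdict
from html.parser import HTMLParser


class _Hrefs(HTMLParser):
    """Collect the href of every <a> tag in a chunk of HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def thread_items(items):
    """Map each item's permalink to the permalinks of items that link to it."""
    known = {item["link"] for item in items}
    linked_from = defaultdict(list)
    for item in items:
        parser = _Hrefs()
        parser.feed(item["html"])
        for href in parser.hrefs:
            # Only links to other items we already hold count as a thread
            if href in known and href != item["link"]:
                linked_from[href].append(item["link"])
    return dict(linked_from)
```

The hard part in practice is probably URL matching (trailing slashes, tracking parameters, redirects), which this exact-string comparison ignores entirely.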

Software interfaces evolve like this, it’s wonderful to watch. Web browsers are another fairly immature tech that grow “tabs” and other interface things, and that’s nice to watch too, even if they’re stupid. Genuinely new types of apps are rare, I can’t think of many off the top of my head, although obviously once they’re pointed out, it’s obvious…