Tom Insam

Before yesterday, the Oculus Rift was technofetish gear. It ceased to be so in an instant. [..] I used the shitty, old Rift, and I thought I was underwater. Think of every corner they had to cut because they were trying to make this thing in the finite realm of men. Now imagine the corners restored, and the corner cutting machine in ruins.

“Spite-driven development,” declares Nicholas, the other two nodding immediately. I ask them to explain, and Daniel gives me an example. Let’s say he wants rabbits in the game – as the programmer it’s not really within his powers to make this happen. So, he says, he’ll use his poor artistic skills to draw something like a rabbit on the office whiteboard, take a photo, put it on his computer and crop it out, and put that square flat drawing into the game. On seeing this, says David with a look on his face that entirely confirms this isn’t hypothetical, he’ll be so horrified that he’ll be forced to draw a proper one to replace it.

Different locales also return different versions of the same symbol. The US locale (en_US) returns a full-width ¥ symbol, whereas the Japan locale (ja_JP) returns a regular ¥ symbol. Similarly, the French locale (fr_FR) will return a non-breaking space between the digits and the symbol, whereas the French Canadian locale (fr_CA), which formats numbers the same way (“15,00 $NZ”, like above), uses a regular space.

This is an utterly brilliant list of broken assumptions under Unicode from rjh. Perl-biased, but syntax aside, the majority of these are just generally true. A trimmed list of my personal favorites (but you should read the whole list):

  • Code that assumes it can open a text file without specifying the encoding is broken.
  • Code that assumes [any language] uses UTF‑8 internally is wrong.
  • Code that assumes [..] code points are limited to 0x10_FFFF is wrong.
  • Code that assumes roundtrip equality on casefolding [..] is completely broken and wrong. Consider that the uc(“σ”) and uc(“ς”) are both “Σ”, but lc(“Σ”) cannot possibly return both of those.
  • Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, “ª” is a lowercase letter with no uppercase; whereas both “ᵃ” and “ᴬ” are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
  • Code that assumes changing the case doesn’t change the length of the string is broken.
  • Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.
  • Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
  • Code that assumes characters like > always point to the right and < always point to the left is wrong, because they in fact do not.
  • Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don’t.
  • Code that assumes that ü has an umlaut is wrong.
  • Code that believes things like ₨ contain any letters in them is wrong.
  • Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
  • Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.
  • Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.
  • Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
  • Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
  • Code that believes that stuff like /s/i can only match “S” or “s” is broken and wrong. You’d be surprised.
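
A few of these are easy to check for yourself. What follows is only a minimal sketch in Python rather than the Perl of the original list, and the two helper names (roundtrips_through_upper, utf16_code_units) are made up for illustration. It leans on Python's built-in str case mapping and the standard unicodedata module, so the exact results depend on the Unicode tables your interpreter ships with.

```python
import unicodedata


def roundtrips_through_upper(s: str) -> bool:
    """True if lowercasing the uppercased string gives back the original."""
    return s.upper().lower() == s


def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


# Both sigmas share one uppercase form, so the round trip loses final sigma.
assert "\u03c3".upper() == "\u03c2".upper() == "\u03a3"  # σ, ς -> Σ
assert roundtrips_through_upper("\u03c3")       # σ survives
assert not roundtrips_through_upper("\u03c2")   # ς comes back as σ

# Changing case can change length: ß uppercases to "SS".
assert len("\u00df") == 1 and "\u00df".upper() == "SS"

# Non-letters have case too: small roman numeral eight is a number (Nl).
assert unicodedata.category("\u2177") == "Nl"
assert "\u2177".upper() == "\u2167"  # ⅷ -> Ⅷ

# The rupee sign is a currency symbol (Sc), not a pair of letters.
assert unicodedata.category("\u20a8") == "Sc"

# UTF-16 is variable-width: a code point outside the BMP needs two units.
assert utf16_code_units("a") == 1
assert utf16_code_units("\U0001F600") == 2

# Long s uppercases to a plain "S", which hints at why /s/i matches more
# than just "S" and "s".
assert "\u017f".upper() == "S"
```

None of this is a fix for anything, it just shows that the surprising claims hold in a language other than Perl. The characters are written as escapes so the file's own encoding can't get in the way.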

You can’t roll out a syrup-drenched waffle filled with bacon and eggs under the slogan “Live More”. You just can’t.