Tom Insam

Broken Unicode Assumptions

This is an utterly brilliant list of broken assumptions under Unicode from rjh. Perl-biased, but syntax aside, the majority of these are just generally true. A trimmed list of my personal favorites (but you should read the whole list):

  • Code that assumes it can open a text file without specifying the encoding is broken.
  • Code that assumes [any language] uses UTF‑8 internally is wrong.
  • Code that assumes [..] code points are limited to 0x10_FFFF is wrong.
  • Code that assumes roundtrip equality on casefolding [..] is completely broken and wrong. Consider that the uc(“σ”) and uc(“ς”) are both “Σ”, but lc(“Σ”) cannot possibly return both of those.
  • Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, “ª” is a lowercase letter with no uppercase; whereas both “ᵃ” and “ᴬ” are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
  • Code that assumes changing the case doesn’t change the length of the string is broken.
  • Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.
  • Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
  • Code that assumes characters like > always point to the right and < always point to the left is wrong, because in fact they do not.
  • Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don’t.
  • Code that assumes that ü has an umlaut is wrong.
  • Code that believes things like ₨ contain any letters in them is wrong.
  • Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
  • Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.
  • Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.
  • Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
  • Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
  • Code that believes that stuff like /s/i can only match “S” or “s” is broken and wrong. You’d be surprised.
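
A few of these are easy to check for yourself. The list is Perl-flavoured, but as a rough sketch, here is how some of them look in Python 3 instead (my choice of language, not the original author's). The variable names are made up for illustration, and the exact results depend on the Unicode tables that ship with your interpreter.

```python
import re
import unicodedata

# Case mapping is not a round trip: both sigmas uppercase to the same letter,
# so lowercasing again cannot give back both of them.
assert "σ".upper() == "ς".upper() == "Σ"
final_sigma_round_trip = "ς".upper().lower()

# Changing case can change the length of a string.
sharp_s_upper = "ß".upper()

# MODIFIER LETTER SMALL A is a letter with no uppercase mapping,
# and its general category is Lm (modifier letter), not Ll.
modifier_a = "\u1d43"
modifier_a_category = unicodedata.category(modifier_a)

# UTF-16 is variable width: count the 16-bit code units each character needs.
utf16_units = {ch: len(ch.encode("utf-16-le")) // 2 for ch in ("a", "😀")}

# A case-insensitive "s" matches more than "S" and "s":
# here it also matches LATIN SMALL LETTER LONG S (U+017F).
long_s_matches = re.fullmatch("s", "\u017f", re.IGNORECASE) is not None

print(final_sigma_round_trip, sharp_s_upper, modifier_a_category)
print(utf16_units, long_s_matches)
```

None of this is a fix for anything. It is only meant to show that the surprises in the list are reproducible in a couple of lines, and that they come from Unicode rather than from Perl.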