Code Points

A commonly touted disadvantage of UTF-8 is that string indexing is O(n). Because code points take up a variable number of bytes, you won’t know where the 5th codepoint is until you scan the string and look for it. UTF-32 doesn’t have this problem; it’s always 4 * index bytes away.

The problem here is that indexing by code point shouldn’t be an operation you ever need!

[..]

Unicode itself gives the term “character” multiple incompatible meanings, and as far as I know doesn’t use the term in any normative text.

– Let’s Stop Ascribing Meaning to Code Points
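
To make the point concrete (my own sketch, not from the article): a code point index doesn’t line up with anything a user would call a character anyway, so O(1) access to it buys you very little.

    # Python 3
    s = "e\u0301"                    # "é" as 'e' plus a combining acute accent
    print(len(s))                    # 2 code points for one visible character
    print(s[0])                      # "e" -- indexing by code point strips the accent

    flag = "\U0001F1EC\U0001F1E7"    # 🇬🇧 is two regional-indicator code points
    print(len(flag))                 # 2
    print(flag[0])                   # half a flag, which is not a useful thing to have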

Sorting

in the majority of Latin languages, ø sorts as an accented variant of o, meaning that most users would expect ø alongside o. However, a few languages, such as Norwegian and Danish, sort ø as a unique element after z. Sorting “Søren” after “Sylt” in a long list, as would be expected in Norwegian or Danish, will cause problems if the user expects ø as a variant of o.

Alphabetical order explained in a mere 27,817 words.
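
You can watch this happen from Python’s locale module. A rough sketch, assuming the en_US.UTF-8 and da_DK.UTF-8 locales are actually installed on your system:

    import locale

    names = ["Sylt", "Søren"]

    # English collation: ø is treated as a variant of o, so Søren sorts first
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    print(sorted(names, key=locale.strxfrm))

    # Danish collation: ø sorts after z, so Sylt sorts first
    locale.setlocale(locale.LC_COLLATE, "da_DK.UTF-8")
    print(sorted(names, key=locale.strxfrm))

The exact results depend on your system’s collation data, but the point stands: the order isn’t a property of the strings, it’s a property of the locale.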

Different locales also return different versions of the same symbol. The US locale (en_US) returns a full-width ￥ symbol, whereas the Japan locale (ja_JP) returns a regular ¥ symbol. Similarly, the French locale (fr_FR) will return a non-breaking space between the digits and the symbol, whereas the French Canadian locale (fr_CA), which formats numbers the same way (“15,00 $NZ”, like above), uses a regular space.

Android Currency Localisation Hell – Adam Speakman
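
The same sort of thing is easy to poke at from Python with the Babel library. That’s a different stack from the Android APIs the post is talking about, so don’t expect identical output; it’s just a sketch of how much the locale (rather than the currency) decides:

    from babel.numbers import format_currency   # assumes Babel is installed

    # Same amount, same currency, different locales; repr() makes any
    # non-breaking spaces and full-width symbols visible in the output.
    for loc in ("en_US", "ja_JP", "fr_FR", "fr_CA"):
        print(loc, repr(format_currency(15, "JPY", locale=loc)))
        print(loc, repr(format_currency(15, "CAD", locale=loc)))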

Broken Unicode Assumptions

This is an utterly brilliant list of broken assumptions under Unicode from rjh. Perl-biased, but syntax aside, the majority of these are just generally true. A trimmed list of my personal favorites (but you should read the whole list), a few of which are demonstrated below:

  • Code that assumes it can open a text file without specifying the encoding is broken.
  • Code that assumes [any language] uses UTF‑8 internally is wrong.
  • Code that assumes [..] code points are limited to 0x10_FFFF is wrong.
  • Code that assumes roundtrip equality on casefolding [..] is completely broken and wrong. Consider that the uc(“σ”) and uc(“ς”) are both “Σ”, but lc(“Σ”) cannot possibly return both of those.
  • Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, “ª” is a lowercase letter with no uppercase; whereas both “ᵃ” and “ᴬ” are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not p{Lowercase_Letter}, despite being both p{Letter} and p{Lowercase}.
  • Code that assumes changing the case doesn’t change the length of the string is broken.
  • Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a p{Mark} turning into a p{Letter}. It can also make it switch from one script to another.
  • Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
  • Code that assumes characters like > always point to the right and < always point to the left is wrong, because they in fact do not.
  • Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don’t.
  • Code that assumes that ü has an umlaut is wrong.
  • Code that believes things like ₨ contain any letters in them is wrong.
  • Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
  • Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.
  • Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.
  • Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
  • Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
  • Code that believes that stuff like /s/i can only match “S” or “s” is broken and wrong. You’d be surprised.
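
A few of these are trivial to reproduce from a Python 3 prompt (Python rather than Perl, but the underlying Unicode data is the same):

    # casefolding doesn't round-trip
    print("σ".upper(), "ς".upper())            # both Σ
    print("Σ".lower())                         # has to pick one: σ

    # changing case changes the length
    print(len("ß"), len("ß".upper()))          # 1 2 -- "ß".upper() is "SS"

    # UTF-16 is not a fixed-width encoding
    print(len("😀".encode("utf-16-le")) // 2)  # 2 code units for a single code point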

using utf-8 in irssi under screen

Firstly, tell your local terminal application that you want a utf-8 window. This is left to you, but under macos (which I use), right click the window, select ‘Window settings’, pick the ‘Display’ option from the drop-down, and pick utf-8 under ‘Character set encoding’.

Next, when you start the screen session, pass the ‘-U’ flag. This has to be passed to a new screen session – you can’t connect to an existing one this way.

screen -U

Alternatively, you can turn on the utf-8 flag for a single existing screen window by typing your hotkey (ctrl-a by default), then ‘:utf8 on’. This is good if you don’t want all of your windows to be utf-8 right now.

On the remote machine, make sure that the ‘LANG’ environment variable is set to something UTF-8-like. For instance, I use

export LANG=en_GB.UTF-8

in my .bashrc.
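
A quick way to check the remote side has actually picked that up (any scripting language will do, python just happens to be everywhere):

python -c "import locale; print(locale.getpreferredencoding())"

This should print UTF-8 (or similar) once the locale is set correctly.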

Finally, you need to tell irssi to use UTF-8. Start it up in your new utf-8 window, and type

/set term_type utf-8

Hopefully everything should work now.

python and unicode

I like python’s unicode handling. Instead of perl’s situation, where file handles are assumed, by default, to be latin-1, python file handles (including STDIN/OUT) are assumed, by default, to be ASCII. Forget nasty things like ‘☃’; in python, you can’t even print ‘é’ without explicitly telling it how. Lovely.
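
For the record, a sketch of the usual Python 2-era workaround, which is to do the telling once by wrapping stdout yourself:

    # Python 2: wrap stdout in a UTF-8 encoder before printing non-ASCII
    import codecs, sys
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout)
    print u"\u00e9 \u2603"    # é and the snowman both make it out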

More UTF8 pain

Does no-one in the world care about non-ASCII characters? It’s pathetic. I’m trying to make HTML form uploads work for files with non-ASCII characters in their names, and I’m hitting the stupidest problems.

The main bugbear is mozilla – you can’t upload files with wide characters in their names. At all. Piece of shit. Safari seems to be encoding the upload filenames with some made-up encoding that I can’t figure out, so that’s out of luck. At least safari sends the actual contents of the files.

The one browser I’ve tried that works flawlessly is Internet Explorer. Microsoft, at least, seem to care about the non-US market.

UTF8 Openguides

I foolishly offered to make OpenGuides UTF-8 safe. Because I don’t do that enough at work, or something. Anyway, it’s going quite well – because I did all the grunt work in CGI::Wiki a while ago, it’s just a matter of finding all the inputs and outputs and making sure they’re encoded properly. So far, the page contents and names are utf-8 safe, along with the cookie preferences, so your username is good. The search stuff looks scary, and there are various broken plugins, etc, etc, so there’s still stuff to do. I should also do the hooks properly – CGI::Wiki should offer nice functions for this stuff.

safari and password fields

Today I discovered that safari ‘magically’ downgrades latin-1 input in form password fields to their nearest ascii equivalents – typing ‘pásswörd’ into a password box actually submits ‘password’. But you can cut and paste non-ascii in and it works fine. I’m very confused.