Code Points

A commonly touted disadvantage of UTF-8 is that string indexing is O(n). Because code points take up a variable number of bytes, you won’t know where the 5th code point is until you scan the string and look for it. UTF-32 doesn’t have this problem; the code point at a given index is always 4 * index bytes from the start. The problem here is […]
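To make the indexing cost concrete, here is a minimal Python sketch. The helper name `byte_offset_of_code_point` and the sample string are made up for illustration; it only assumes well-formed UTF-8 input.

```python
def byte_offset_of_code_point(data: bytes, index: int) -> int:
    """Return the byte offset where code point number `index` (0-based) starts.

    Hypothetical helper: it has to walk the buffer from the start, which is
    the O(n) cost described above.
    """
    if index < 0:
        raise IndexError("negative index")
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


text = "S\u00f8ren \u2603 \u00e9"     # "Søren ☃ é", 9 code points
utf8 = text.encode("utf-8")            # 13 bytes, variable width
utf32 = text.encode("utf-32-le")       # 36 bytes, fixed 4 bytes per code point, no BOM

snowman_utf8 = byte_offset_of_code_point(utf8, 6)   # found by scanning
snowman_utf32 = 4 * 6                                # found by arithmetic
```

With UTF-32 the offset is a multiplication; with UTF-8 it is a scan, although the scan itself is just a bit test per byte.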

Sorting

In the majority of Latin-script languages, ø sorts as an accented variant of o, meaning that most users would expect ø alongside o. However, a few languages, such as Norwegian and Danish, sort ø as a unique element after z. Sorting “Søren” after “Sylt” in a long list, as would be expected in Norwegian or […]
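The two expectations can be sketched with a pair of toy sort keys in Python. Both `folded_key` and `nordic_key` are hypothetical helpers written for this illustration; real collation is multi-level and locale-driven (for example via `locale.strxfrm` or an ICU binding), so treat this as a simplification rather than a correct collator.

```python
# Toy alphabet for Danish/Norwegian ordering: æ, ø, å come after z.
NORDIC_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
NORDIC_RANK = {ch: rank for rank, ch in enumerate(NORDIC_ALPHABET)}

# Toy folding for languages that treat these letters as variants of base letters.
FOLD = str.maketrans({"ø": "o", "æ": "ae", "å": "a"})


def folded_key(word: str) -> str:
    """Sort key that files ø alongside o (assumed 'most languages' behaviour)."""
    return word.lower().translate(FOLD)


def nordic_key(word: str) -> list:
    """Sort key that ranks ø as its own letter after z."""
    # Characters outside the toy alphabet sort last.
    return [NORDIC_RANK.get(ch, len(NORDIC_ALPHABET)) for ch in word.lower()]


names = ["Sylt", "Søren", "Sara"]
by_folding = sorted(names, key=folded_key)   # ['Sara', 'Søren', 'Sylt']
by_nordic = sorted(names, key=nordic_key)    # ['Sara', 'Sylt', 'Søren']
```

The same list comes out in a different order depending on which key is used, which is the whole problem: there is no single correct sort without knowing the reader’s locale.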

Different locales also return different versions of the same symbol. The US locale (en_US) returns a full-width ¥ symbol, whereas the Japan locale (ja_JP) returns a regular ¥ symbol. Similarly, the French locale (fr_FR) will return a non-breaking space between the digits and the symbol, whereas the French Canadian locale (fr_CA) which formats […]

using utf-8 in irssi under screen

Firstly, tell your local terminal application that you want a UTF-8 window. This is left to you, but under macOS (which I use), right-click the window, select ‘Window settings’, pick the ‘Display’ option from the drop-down, and pick UTF-8 under ‘Character set encoding’. Next, when you start the screen session, pass the ‘-U’ flag. This […]

python and unicode

I like python’s unicode handling. Instead of perl’s situation, where file handles are assumed, by default, to be latin-1, python file handles (including STDIN/OUT) are assumed, by default, to be ASCII. Forget nasty things like ‘☃’; in python, you can’t even print ‘é’ without explicitly telling it how. Lovely.
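A small sketch of what “explicitly telling it how” looks like. This assumes Python 3 and uses an in-memory stream in place of a real file handle; `write_text` is a hypothetical helper, and the default encoding of your actual STDOUT depends on the interpreter version and environment.

```python
import io


def write_text(text: str, encoding: str) -> bytes:
    """Write `text` through a text stream with the given encoding; return the bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# An ASCII-only stream cannot represent 'é' and refuses to write it.
try:
    write_text("\u00e9", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming the encoding explicitly makes the same write succeed.
utf8_bytes = write_text("\u00e9", "utf-8")
```

For a real handle the equivalent is passing `encoding="utf-8"` to `open()`, or wrapping the underlying byte stream the same way as above.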

More UTF8 pain

Does no-one in the world care about non-ASCII characters? It’s pathetic. I’m trying to make HTML form uploads work for files with non-ASCII characters in their names, and I’m hitting the stupidest problems. The main bugbear is Mozilla – you can’t upload files with wide characters in their names. At all. Piece of shit. Safari seems […]

UTF8 OpenGuides

I foolishly offered to make OpenGuides UTF-8 safe. Because I don’t do that enough at work, or something. Anyway, it’s going quite well – because I did all the grunt work in CGI::Wiki a while ago, it’s just a matter of finding all the inputs and outputs and making sure they’re encoded properly. So far, […]