character encoding, Epsilon Clue

Site News

Character Encodings Are a PITA

Character encoding schemes (UTF-8, ASCII, ISO-8859-1/-15,
Windows-1252, etc.) are an incredible source of headaches. Stay away
from them.

(Oh, and if you tell me I mean “raw character encoding” or “codepoint
set” or some such, I’ll whack you upside the head with a thick Unicode
reference.)

In case you hadn’t noticed, I upgraded WordPress not too long ago.
Being the cautious sort, I did a dump of the back-end database before
doing so, as I’ve done every other time I upgraded. And, like every
other time, I noticed that some characters got mangled. This time
around, though, I decided to do something about it.

It turned out that when I originally set up the database, I told it to
use ISO-8859-1 as the default text encoding. But later, I told
WordPress to use UTF-8. And somewhere between dumping, restoring, and
WordPress’s upgrade of the schema, a lot of characters got mangled. For
the most part, ISO-8859-1 quotation marks got converted to UTF-8, then
interpreted as ISO-8859-1, and converted again. On top of which, some
commenters posted with broken software that insisted on using cp1252
or cp1258 (and I even saw something which might’ve been IBM-CP1133),
and that text also got converted to and from UTF-8 and ISO-8859-1
or -15.
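The round trip described above (UTF-8 bytes misread as ISO-8859-1, then encoded to UTF-8 again) is easy to reproduce, and often to reverse. What follows is a minimal Python sketch, not the script used on the actual database. The function names `mangle` and `unmangle` are made up for illustration, and it assumes the wrong guess was always ISO-8859-1 (`latin-1` in Python), not cp1252 or anything else.

```python
def mangle(text: str, rounds: int = 1) -> str:
    """Simulate the mix-up: encode as UTF-8, then misread the bytes as ISO-8859-1."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text


def unmangle(text: str, max_rounds: int = 3) -> str:
    """Undo up to max_rounds of the mix-up, stopping when a round no longer applies."""
    for _ in range(max_rounds):
        try:
            # Reverse one round: recover the original bytes, read them as UTF-8.
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # Not (or no longer) mangled in this particular way.
        if candidate == text:
            break  # Pure ASCII: nothing left to undo.
        text = candidate
    return text


if __name__ == "__main__":
    once = mangle("café")
    twice = mangle("café", rounds=2)
    print(repr(once), repr(twice), repr(unmangle(twice)))
```

The catch is that text which legitimately contains a sequence like “Ã©” would be “repaired” too, so a pass like this is better run on suspect snippets than blindly over a whole dump.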

Obviously, with 13 MB of data, I wasn’t going to correct it all by
hand; I needed to write a script. But that introduced additional
problems: a Perl script that’s basically “s/foo/bar/g” is
pretty simple, but when foo and bar are strings that
represent the same character using different encodings, things can get
hairy: what if bar is UTF-8, but Perl thinks that the file is
in ISO-8859-15?
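One way to sidestep the “which encoding does the script think this is?” question is to do the substitution on raw bytes and spell out every pattern as byte escapes, so nothing depends on how the source file or the data file is decoded. Here is a hedged sketch in Python rather than Perl. The `REPLACEMENTS` table and the `fix_bytes` name are hypothetical, and the two entries assume the text was double-encoded via ISO-8859-1.

```python
from typing import Mapping

# Hypothetical table: mangled byte sequence -> the UTF-8 bytes that were intended.
# Both entries assume UTF-8 was misread as ISO-8859-1 and then re-encoded as UTF-8.
REPLACEMENTS = {
    b"\xc3\x83\xc2\xa9": b"\xc3\xa9",              # doubly encoded U+00E9 (e with acute)
    b"\xc3\xa2\xc2\x80\xc2\x9c": b"\xe2\x80\x9c",  # doubly encoded U+201C (left curly quote)
}


def fix_bytes(data: bytes, replacements: Mapping[bytes, bytes]) -> bytes:
    """Replace each mangled byte sequence, longest patterns first."""
    ordered = sorted(replacements.items(), key=lambda kv: len(kv[0]), reverse=True)
    for bad, good in ordered:
        data = data.replace(bad, good)
    return data


if __name__ == "__main__":
    broken = b"caf\xc3\x83\xc2\xa9"
    print(fix_bytes(broken, REPLACEMENTS).decode("utf-8"))
```

Because the patterns are byte literals, it no longer matters what encoding the editor or the interpreter assumes for the script itself.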

On top of that, you have to keep track of which encoding Emacs is
using to show you any given file.

iconv turned out to be an invaluable forensic tool, but it has one
limitation: you can’t use it to simply decode UTF-8 (or if you can, I
wasn’t able to figure out how to do so). There were times when I
wanted to decode a snippet of text and look at it to see if I could
recognize the encoding. But iconv only allows you to convert from one
encoding to another; so if you try to convert from UTF-8 to
ISO-8859-1, and the resulting character isn’t defined in ISO-8859-1,
you get an error. Bleah.
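For the “just decode it and let me look” use case, a scripting language can stand in where iconv falls short. The sketch below is illustrative only (the `describe` helper is a made-up name): it decodes a byte snippet under a guessed encoding and lists each resulting code point with its Unicode name, with no target encoding to fail against.

```python
import unicodedata
from typing import List


def describe(raw: bytes, encoding: str = "utf-8") -> List[str]:
    """Decode raw under the guessed encoding and name each resulting character."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = raw.decode(encoding, errors="replace")
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    snippet = b"\xe2\x80\x9c"
    for guess in ("utf-8", "latin-1"):
        print(guess, describe(snippet, guess))
```

Trying a few guesses side by side usually makes the right one obvious: the correct encoding yields one sensible character, while the wrong ones yield accented letters and unnamed control characters.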

The moral of the story is, use UTF-8 for everything. If the software
you’re using doesn’t give you UTF-8 as an option, ditch it and use
another package.

Andrew Arensburger

Dec, Sat, 2008
