Removing Accents in Strings
I’ve been ripping and encoding a bunch of music. Since I’m a hacker, naturally I have scripts that take a file with artist, album title, and track titles, find the corresponding .wav or .aiff source files, encode them as MP3, and tag them.
A lot of the music I have is in French or German (and some Spanish and Russian), so there are accented letters in names and titles. My input files are in UTF-8 format, so that’s cool. But one problem is that of generating a filename for the MP3 files: if I want to play the song “Diogène série 87” by H.F. Thiéfaine on his album “Météo für nada”, I don’t want to have to figure out how to type those accents in the file and directory names. I want the script to pick filenames that use only ASCII characters.
In most cases, the Right Thing to do is simply to remove the accent, so that the song mentioned above would be stored as H.F. Thiefaine/Meteo fur nada/08 - Diogene serie 87.mp3 [1]. Fortunately, it turns out that there’s a way to automate this.
Unicode, of which UTF-8 is a representation, tries to be ridiculously complete. So there are separate representations for “e”, the acute accent (́ ), and “e with acute accent” (é). Naturally, this led to another problem: if one string contains “e”+<combining acute accent>, and another string contains <e with acute accent>, it ought to be possible to compare them and see that they’re in some sense the same string, but how can you tell?
The way they solved this is with so-called composition and decomposition forms. Basically, this means that if your string contains “e”+<combining acute accent> and you run it through the standard composition algorithm, you’ll get <e with acute accent> (and decomposition does the reverse). This allows you to normalize two strings so that you can then compare them byte by byte. Conveniently, Perl’s Unicode::Normalize module implements this.
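To make the comparison problem concrete, here is a minimal sketch of normalizing two spellings of “é” before comparing them. The helper name same_text is made up for illustration, and the strings are written with escapes so the example doesn’t depend on the encoding of the script file.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two spellings of the same visible text.
my $precomposed = "\x{E9}";      # U+00E9, e with acute accent
my $combining   = "e\x{301}";    # "e" followed by U+0301, combining acute accent

# Hypothetical helper: true if two strings are equal after NFC normalization.
sub same_text {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

print "raw comparison:        ", ($precomposed eq $combining ? "equal" : "different"), "\n";
print "normalized comparison: ", (same_text($precomposed, $combining) ? "equal" : "different"), "\n";
print "length after NFD:      ", length(NFD($precomposed)), "\n";
```

The raw comparison reports “different”, the normalized one reports “equal”, and the decomposed form of the precomposed character is two characters long: the base letter plus the combining accent.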
What this means in this case is that to strip the accent from a character, all you need to do is to decompose it (using Unicode::Normalize::NFD or Unicode::Normalize::NFKD), then take the first character of the decomposed version:
use utf8;    # Tell Perl that this file is written in UTF-8
use Unicode::Normalize;
$str = "é";
$decomposed = NFKD($str);    # Compatibility decomposition of $str
$decomposed =~ /^(.)/;       # The base letter comes first
print "First letter: $1\n";
Obviously, not every weird Unicode character is a letter with an accent, so you still need to do some sanity checking.
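Here is one way the sanity check might look when applied to a whole string rather than a single character. This is a sketch, not the script I actually use: the helper name ascii_name and the choice of “_” as a placeholder are assumptions. It keeps a character only if its decomposition is an ASCII character followed by nothing but combining marks.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: reduce a string to ASCII, one grapheme cluster
# (\X) at a time. A cluster is kept as its base character only when its
# compatibility decomposition is one ASCII character followed solely by
# combining marks; anything else becomes "_".
sub ascii_name {
    my ($text) = @_;
    my $out = "";
    for my $cluster ($text =~ /(\X)/g) {
        if (NFKD($cluster) =~ /^([\x00-\x7F])\p{M}*\z/s) {
            $out .= $1;     # plain ASCII, or ASCII base letter plus accents
        } else {
            $out .= "_";    # no ASCII base letter: flag it for manual review
        }
    }
    return $out;
}

# "Météo für nada" and "cœur", written with escapes.
print ascii_name("M\x{E9}t\x{E9}o f\x{FC}r nada"), "\n";   # Meteo fur nada
print ascii_name("c\x{153}ur"), "\n";                       # c_ur
```

The check is deliberately strict, so anything that decomposes into more than one base letter also ends up as “_” instead of silently losing a letter.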
Another problem is that not all ligatures can be decomposed: “ﬀ” (the “ff” ligature) can be decomposed into “f”+“f”, but for some reason “œ” (the “oe” ligature) does not decompose into “o”+“e”. That’s annoying, since the French word for “heart” is “cœur”, and you can imagine how often that comes up in French song titles.
One possible way of working around this might be to use Perl’s charnames module, which allows you to get the full name of a Unicode character:
use utf8;
use charnames ":full";
print charnames::viacode(ord("œ"));
prints “LATIN SMALL LIGATURE OE”, so it should be possible to search for /(SMALL|CAPITAL) LIGATURE (\S+)/ and break that down into its component letters. I haven’t done this yet, though.
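For what it’s worth, a rough sketch of that idea could look like the following. The helper name expand_ligature is hypothetical, and the pattern is anchored a bit more tightly than the one above, so that it only accepts names ending in a single run of letters.

```perl
use strict;
use warnings;
use charnames ":full";

# Hypothetical helper: expand a character whose Unicode name has the form
# "LATIN (SMALL|CAPITAL) LIGATURE XY" into the letters "xy" or "XY".
# Any other character is returned unchanged.
sub expand_ligature {
    my ($char) = @_;
    my $name = charnames::viacode(ord $char);
    return $char unless defined $name;
    if ($name =~ /^LATIN (SMALL|CAPITAL) LIGATURE ([A-Z]+)\z/) {
        return $1 eq "SMALL" ? lc($2) : $2;
    }
    return $char;
}

# Apply the helper to every non-ASCII character of "cœur" (œ is U+0153).
my $word = "c\x{153}ur";
(my $expanded = $word) =~ s/([^\x00-\x7F])/expand_ligature($1)/ge;
print "$expanded\n";   # coeur
```

Note that this only helps for characters whose name actually contains the word LIGATURE. A capital ligature comes out as two capitals (“OE”), which may or may not be what you want at the start of a title.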
One final problem with ID3 tags is that of finding the right representation for strings: ID3 v2.3.0 only supports ISO-8859-1 and UTF-16. I dislike UTF-16 because it uses at least two bytes per character, even for plain ASCII text, and its embedded zero bytes break anything that treats strings as NUL-terminated. That leaves me with either ISO-8859-1 (good for most European languages) or ID3 v2.4.0. ID3 v2.4.0 supports UTF-8, but unfortunately the tools I have available for tagging MP3 files only support ID3 v2.3.0. Fortunately, ISO-8859-1 is good enough for 99% of what I want, so the only problem is converting the UTF-8 in the source file to ISO-8859-1 when it comes time to tag the file. And for this, we can use Perl’s Encode module:
use utf8;
use Encode;
$str = "é";
print encode("iso-8859-1", $str);
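For the remaining 1% (the Russian titles, say), it helps to know when a string can’t be represented in ISO-8859-1 at all, since by default encode() quietly substitutes “?” for such characters. A minimal sketch, assuming the input arrives as raw UTF-8 bytes read from the file; the helper name utf8_to_latin1 is made up:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert UTF-8 bytes (as read from the input file)
# to ISO-8859-1 bytes, or return undef if some character has no
# ISO-8859-1 equivalent.
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    my $text = decode("UTF-8", $utf8_bytes);
    # FB_CROAK makes encode() die instead of substituting "?";
    # LEAVE_SRC keeps encode() from modifying $text in place.
    my $latin1 = eval {
        encode("iso-8859-1", $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $latin1;
}

my $acute = utf8_to_latin1("\xC3\xA9");       # UTF-8 bytes for é
printf "%d byte(s), value 0x%02X\n", length($acute), ord($acute);

my $cyrillic = utf8_to_latin1("\xD0\x96");    # UTF-8 bytes for Cyrillic Ж
print defined $cyrillic ? "converted\n" : "not representable in ISO-8859-1\n";
```

The first call yields the single byte 0xE9; the second returns undef, which the tagging script could use as a cue to fall back to the accent-stripped name or to skip that tag.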
1: Actually, it should be H.F. Thiefaine/Meteo fuer nada/08 - Diogene serie 87.mp3, since in German, “ü” is written “ue” when no accents are available. But I don’t feel like dealing with the problem of explaining to the script that there’s a German word in the middle of a French title, so in this case I just renamed the directory.
MMMM… geek.
You say that like it’s a bad thing.
Great post. It was exactly the info I was looking for to solve a similar problem. Thanks!
There is also a PEAR package for normalizing Unicode strings and stripping accents from characters: http://pear.php.net/package/I18N_UnicodeNormalizer/
Try:
$str = Unicode::Normalize::NFKD($str);
$str =~ s/\p{NonspacingMark}//g;
Bryce:
It took me a while to get that to work, but it did. Thanks! I should’ve known that the Unicode consortium would have included something like this.
Of course, this doesn’t quite work for everything; it might be nice if an AE ligature became “A”, “E”, but I can deal with that later.