The macOS filesystem stores Unicode characters in file names in their decomposed form. For example, it stores é as two code points: e (plain vanilla ASCII e) + ´ (combining acute accent). Kiran had written a blog post about this so I won’t go into details, but having recently discovered the wonderful charnames Perl pragma, I was curious to find out how it could help me ‘see’ which form (precomposed or decomposed) of a character an OS uses. Given two folders, café and müller, in /tmp/test/, the following Perl script:
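use strict;
use warnings;
use charnames ':full';

# Read every entry in the test directory.
opendir(my $dir, "/tmp/test") or die "Cannot open /tmp/test: $!";
my @stuff = readdir($dir);
closedir($dir);

foreach my $stuff (@stuff) {
    next if $stuff =~ /^\./;    # skip ., .. and other dotfiles
    # readdir returns raw bytes; decode them as UTF-8 to get characters.
    utf8::decode($stuff);
    # Print the Unicode name of each code point in the file name.
    foreach my $part (unpack("U*", $stuff)) {
        print charnames::viacode($part), "\n";
    }
    print "\n\n";
}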
gives:
LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E
COMBINING ACUTE ACCENT
LATIN SMALL LETTER M
LATIN SMALL LETTER U
COMBINING DIAERESIS
LATIN SMALL LETTER L
LATIN SMALL LETTER L
LATIN SMALL LETTER E
LATIN SMALL LETTER R
The same script when run on a Linux box results in:
LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LETTER M
LATIN SMALL LETTER U WITH DIAERESIS
LATIN SMALL LETTER L
LATIN SMALL LETTER L
LATIN SMALL LETTER E
LATIN SMALL LETTER R
Notice that é is stored as one code point (LATIN SMALL LETTER E WITH ACUTE). Ditto for ü (LATIN SMALL LETTER U WITH DIAERESIS, as opposed to LATIN SMALL LETTER U + COMBINING DIAERESIS on the Mac).
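To compare the two forms side by side without access to both filesystems, here is a minimal sketch. It assumes the core Unicode::Normalize module, which the script above does not use, and a hypothetical helper called names_of. It takes one string, produces its precomposed (NFC) and decomposed (NFD) forms, and lists the character names in each.

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFC NFD);    # assumption: not part of the original script

# Hypothetical helper: return the Unicode name of every code point in a string.
sub names_of {
    my ($string) = @_;
    return map { charnames::viacode(ord $_) } split //, $string;
}

my $precomposed = "caf\x{E9}";          # é as the single code point U+00E9
my $decomposed  = NFD($precomposed);    # e followed by U+0301

print "Precomposed (NFC):\n";
print "  $_\n" for names_of(NFC($decomposed));

print "Decomposed (NFD):\n";
print "  $_\n" for names_of($decomposed);
```

The same NFC and NFD functions could be applied to a file name after utf8::decode if you want to compare names that came from different systems.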
charnames can be a very useful tool in your toolbox. See perldoc for more on what it can do: http://perldoc.perl.org/charnames.html
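As a small taste of the other direction (name to character), here is a sketch using charnames::vianame and the \N{...} string escape, both of which are documented in perldoc charnames. The variable names are only for illustration.

```perl
use strict;
use warnings;
use charnames ':full';

# Name -> code point: vianame returns the ordinal for a full character name.
my $code = charnames::vianame("LATIN SMALL LETTER E WITH ACUTE");
printf "U+%04X\n", $code;    # prints U+00E9

# \N{...} puts a named character straight into a string literal.
my $decomposed = "e\N{COMBINING ACUTE ACCENT}";
printf "%d code points\n", length $decomposed;    # prints 2 code points
```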
p.s. Eric Sink mentions this (the difference between the way different filesystems store certain Unicode characters) in his wonderful cross-platform version control post too. See point #9.