Unicode characters on MacOS and Linux filesystems

The MacOS filesystem stores Unicode characters in their decomposed form. For example, it stores é as two code points: e (plain vanilla ASCII e) + ´ (combining acute accent). Kiran had written a blog post about this so I won’t go into details, but having recently discovered the wonderful charnames Perl pragma, I was curious to find out how it could help me ‘see’ which form (precomposed or decomposed) of a character an OS uses. Given two folders, café and müller, in /tmp/test/, here is what the following Perl script:

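use strict;
use warnings;
use charnames ':full';

# Read the directory entries; readdir returns the names as raw bytes.
opendir(my $dir, "/tmp/test") or die "Cannot open /tmp/test: $!";
my @entries = readdir($dir);
closedir($dir);

foreach my $entry (@entries) {
	next if $entry =~ /^\./;	# skip ., .. and hidden entries
	utf8::decode($entry);		# interpret the bytes as UTF-8
	# Print the Unicode name of each code point in the entry's name.
	foreach my $code (unpack("U*", $entry)) {
		print charnames::viacode($code);
		print "\n";
	}
	print "\n\n";
}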

gives:

LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E
COMBINING ACUTE ACCENT

LATIN SMALL LETTER M
LATIN SMALL LETTER U
COMBINING DIAERESIS
LATIN SMALL LETTER L
LATIN SMALL LETTER L
LATIN SMALL LETTER E
LATIN SMALL LETTER R

The same script, when run on a Linux box, results in:

LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E WITH ACUTE

LATIN SMALL LETTER M
LATIN SMALL LETTER U WITH DIAERESIS
LATIN SMALL LETTER L
LATIN SMALL LETTER L
LATIN SMALL LETTER E
LATIN SMALL LETTER R

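Notice that é is stored as one code point (latin small letter e with acute). Ditto for ü (latin small letter u with diaeresis, as opposed to latin small letter u + combining diaeresis on the Mac). Linux filesystems generally store a name's bytes as given, without normalizing them, so what the script shows here is the form in which the folder names were typed.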
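Since the two outputs above differ only in normalization form, a name copied from one system will not compare equal to the "same" name on the other unless both are normalized first. Here is a minimal sketch of that comparison, assuming the core Unicode::Normalize module is available; same_name is a hypothetical helper of mine, not something from charnames, and the two strings are built by hand rather than read from a real filesystem.

```perl
use strict;
use warnings;
use charnames ':full';
use Unicode::Normalize qw(NFC NFD);

# Two spellings of "café": precomposed (as seen on the Linux box)
# and decomposed (as seen on the Mac).
my $precomposed = "caf\N{LATIN SMALL LETTER E WITH ACUTE}";
my $decomposed  = "cafe\N{COMBINING ACUTE ACCENT}";

# Hypothetical helper: true if two names match once both are in NFC.
sub same_name {
	my ($left, $right) = @_;
	return NFC($left) eq NFC($right);
}

print "raw strings equal:  ", ($precomposed eq $decomposed ? "yes" : "no"), "\n";
print "equal after NFC:    ", (same_name($precomposed, $decomposed) ? "yes" : "no"), "\n";
print "code points in NFD: ", length(NFD($precomposed)), "\n";
```

The raw comparison reports "no", the normalized one reports "yes", and the NFD form of café is 5 code points long, matching the five lines the script printed on the Mac. Normalizing to NFC rather than NFD is an arbitrary choice here; either works as long as both sides use the same form.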

charnames can be a very useful tool in your toolbox. More on what it can do is on perldoc: http://perldoc.perl.org/charnames.html
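As a small taste, charnames also works in the other direction, from a name to a code point. The sketch below is only an illustration of calls I know to exist in the pragma (vianame, viacode and the \N{...} escape); the variable names are my own.

```perl
use strict;
use warnings;
use charnames ':full';

# Name -> code point: vianame returns the ordinal for a full character name.
my $code = charnames::vianame("LATIN SMALL LETTER E WITH ACUTE");
printf "U+%04X\n", $code;

# Code point -> name: the lookup used in the script above.
my $name = charnames::viacode(0x0301);
print "$name\n";

# \N{...} puts a character into a string by name; this is the
# precomposed é, so the string holds a single code point.
my $e_acute = "\N{LATIN SMALL LETTER E WITH ACUTE}";
print length($e_acute), "\n";
```

This prints U+00E9, then COMBINING ACUTE ACCENT, then 1.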

P.S. Eric Sink mentions this (the difference between the way different filesystems store certain Unicode characters) in his wonderful cross-platform version control post too. See point #9.