Ruby 1.9 vs MacRuby – string handling
Ruby 1.9, among other things, brings much-needed improvements to the way Unicode strings are handled. The String class now has an encoding method, which tells us the encoding of a given string. By default, a string literal's encoding is the same as the encoding of the source file, which in turn can be set with the coding comment. For example, to use UTF-8 as the source's encoding (and to be able to use UTF-8 characters in string literals), you'd put this on the first line of the file:
# -*- coding: utf-8 -*-
Let's look at some sample code and its output under Ruby 1.9.
Code:
# -*- coding: utf-8 -*-
str = "café"
puts "Encoding : #{str.encoding}"
puts "Length : #{str.length}"
puts "Byte Size : #{str.bytesize}"
puts "#{str} in upper case is: #{str.upcase}"
Output:
Encoding : UTF-8
Length : 4
Byte Size : 5
café in upper case is: CAFé
As is evident from the output above, Ruby 1.9 still doesn't handle casing beyond the ASCII range. Upper-casing café gave us CAFé rather than the correct CAFÉ.
Also, the byte size of the string is 5 because under the UTF-8 encoding é takes up two bytes: 0xC3, 0xA9.
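To check the byte count by hand, the individual bytes of the string can be printed. This is a minimal sketch for standard Ruby 1.9 or later; the variable names are mine, and the literal uses the \u00E9 escape instead of a typed é so that it does not depend on the coding comment.

```ruby
# "caf\u00E9" is the same text as "café"; a \u escape makes the literal UTF-8.
str = "caf\u00E9"

# Collect the raw bytes of the UTF-8 representation.
utf8_bytes = str.bytes.to_a

# Print each byte in hex: c, a and f take one byte each, é takes two.
puts utf8_bytes.map { |b| "0x%02X" % b }.join(" ")
# prints: 0x63 0x61 0x66 0xC3 0xA9
```

Four characters, five bytes: the three ASCII letters are one byte each and é accounts for the remaining two.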
MacRuby – to quote the project site – is a version of Ruby 1.9, ported to run directly on top of Mac OS X core technologies such as the Objective-C common runtime and garbage collector, and the CoreFoundation framework.
This means that the Ruby datatypes have been implemented on top of Mac "native" (Cocoa) datatypes, e.g. Ruby strings are implemented on top of NSString.
This introduces some differences in the way strings are handled by MacRuby. To start with, non-Unicode strings use the 'MACINTOSH' encoding (the Ruby 1.9 default is US-ASCII), while Unicode strings use UTF-16 (even if you've set the coding comment to utf-8). MacRuby also handles casing correctly.
So the same code snippet as above gives different output:
Output:
Encoding : UTF-16
Length : 4
Byte Size : 4
café in upper case is: CAFÉ
Note that casing is handled correctly by MacRuby.
The byte size of the string is reported as 4, which is surprising for UTF-16. Technically every character in this string should take up 2 bytes under UTF-16 (c should be 0x0063, é should be 0x00E9 and so on), for 8 bytes in total, but I guess the most significant byte is not used if it is 0x00.
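For comparison, here is what standard Ruby 1.9 reports when the same string is explicitly transcoded to UTF-16. This is only a sketch run under plain Ruby, not MacRuby, so it says nothing about how MacRuby stores strings internally. The choice of UTF-16BE (big-endian, no byte order mark) and of ISO-8859-1 as a one-byte-per-character comparison are my assumptions for illustration.

```ruby
str = "caf\u00E9"  # same text as "café"

# Transcode to big-endian UTF-16 without a byte order mark.
utf16 = str.encode("UTF-16BE")
utf16_bytes = utf16.bytes.to_a

puts utf16.bytesize
# prints: 8
puts utf16_bytes.map { |b| "%02X" % b }.join(" ")
# prints: 00 63 00 61 00 66 00 E9

# A single-byte encoding such as ISO-8859-1 stores é as the lone byte 0xE9.
latin1 = str.encode("ISO-8859-1")
puts latin1.bytesize
# prints: 4
```

A real UTF-16 representation of café is 8 bytes, so the 4 reported by MacRuby looks more like a one-byte-per-character count than a UTF-16 one.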
p.s. the versions of the products used in the examples above are:
1. Ruby – 1.9.1p376 (2009-12-07 revision 26041) [i386-darwin10] (installed via macports 1.8.2)
2. MacRuby version 0.5 (ruby 1.9.0) [universal-darwin10.0, x86_64] (binary distribution from the official MacRuby project site)