infundibulum

The Africa You Never See

June 11th, 2005

Here’s a link to a rather old article I ran across (from last April) from the Washington Post:

The Africa You Never See

Most of the time, Africa is simply not on the map. The continent’s booming stock markets are almost never mentioned in newspaper financial pages. How often is an African country– apart, perhaps, from South Africa or Egypt or Morocco — featured in a newspaper travel section? Even the listing of worldwide weather includes only a few African cities.

The result of this portrait is an Africa we can’t relate to. It seems so foreign to us, so different and incomprehensible. Since we can’t relate to it, we ignore it.

Blogs are one force chipping away at this unfamiliarity: there are more every day.

Ruby and Unicode

June 11th, 2005

I’ve been looking into Ruby’s Unicode support, since I’m working on a Rails project. I had to jump through some hoops to figure out how to get Ruby to handle UTF-8 — it’s not too well documented.

The short answer can be found here: How To Use Unicode Strings in Rails. Bottom line: prefix your code with:

$KCODE = 'u'
require 'jcode'

… and replace length with jlength. You don’t have to change anything else in your source, which is rather nice. (In Python, for instance, you have to label Unicode strings.) You can just put Unicode stuff right in your source files, and pretty much think of those strings as “letters” in an intuitive way. Pretty much.

That’s the way I think of Unicode: it allows you to think of letters as letters, and not as “a letter of n bytes.” (Remember letters, o children of the computer age?)

Behind the scenes, your content will (probably) be stored in UTF-8, which is a variable-length, multi-byte encoding. This means that when you see the following:

U+062A ARABIC LETTER TEH
ت

That single letter is actually two bytes behind the scenes (0xD8 and 0xAA).

However, when you see:

U+3041 HIRAGANA LETTER SMALL A
ぁ

…there are three bytes behind the scenes (0xE3, 0x81, and 0x81). ASCII letters are still one byte, as ASCII is a subset of Unicode UTF-8 [thanks, Alex].
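If you want to check those byte counts yourself, here is a minimal sketch. It assumes the source file is saved as UTF-8 and uses String#unpack with the 'C*' directive, which returns a string's raw bytes as integers. The helper name utf8_bytes is made up for illustration; it is not part of Ruby or jcode.

```ruby
# Hypothetical helper (not part of Ruby or jcode): list the raw bytes
# of a string as hex, assuming the source file is saved as UTF-8.
def utf8_bytes(s)
  s.unpack('C*').map { |b| format('0x%02X', b) }
end

puts utf8_bytes("ت").join(' ')   # U+062A ARABIC LETTER TEH        -> 0xD8 0xAA
puts utf8_bytes("ぁ").join(' ')  # U+3041 HIRAGANA LETTER SMALL A  -> 0xE3 0x81 0x81
puts utf8_bytes("a").join(' ')   # plain ASCII                     -> 0x61
```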

Unicode hides all that nonsense from you as a programmer. You can tell a program to count the number of letters in an Arabic word or a Japanese word and it will tell you what you really want to know: how many letters are in those words, not how many bytes. Who cares how many bytes happen to be used to encode a U+062A ARABIC LETTER TEH? It’s just a letter!

So yeah. End rant.

But there are still some rough patches in Ruby’s Unicode support (or in my understanding of it; a correction of either would be appreciated).

For instance… in Mike Clark’s intro to learning Ruby through unit testing, he suggests testing some of the string methods like this:

require 'test/unit'

class StringTest < Test::Unit::TestCase
  def test_length
    s = "Hello, World!"
    assert_equal(13, s.length)
  end
end

So let’s use the Greek translation of “Hello, World!”

Καλημέρα κόσμε!

That has 15 letters, including the space. Let’s test it:

require 'test/unit'
require 'jcode'
$KCODE = 'UTF8'

class StringTest < Test::Unit::TestCase
  def test_jlength
    s = "Καλημέρα κόσμε!"
    assert_equal(15, s.jlength)    # Note the 'j': counts letters
    assert_not_equal(15, s.length) # Normal, non-Unicode length counts bytes
    assert_equal(28, s.length)     # 13 Greek letters at two bytes each, plus the space and '!'
  end
end

All that works as expected. In fact, I went and looked in the Pickaxe book and there was an example just like this.

But I’ll leave you with a few tests that fail, and seem to me like they shouldn’t (or am I misunderstanding?).

def test_reverse
  # there are ways around this, but...
  s = "Καλημέρα κόσμε!"
  reversed = "!εμσόκ αρέμηλαΚ"
  srev = s.reverse
  assert_equal(reversed, srev) # fails
end

def test_index
  # String#index isn't Unicode-aware, it's counting bytes
  # there are ways around this, but...
  s = "Καλημέρα κόσμε!"
  assert_equal(0, s.index('Κ')) # passes
  assert_equal(1, s.index('α')) # fails!
  assert_equal(2, s.index('α')) # passes; byte offset 2, i.e. the 3rd byte!
end

Neither of those works.
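A sketch of what seems to be going on, assuming (as the comment in test_index suggests) that reverse and index operate on raw bytes rather than letters. The two helpers below are hypothetical and only mimic that byte-wise behaviour; they use String#unpack with 'C*' to get at the bytes.

```ruby
# Hypothetical helpers, only to illustrate byte-wise behaviour.

# Reverse a string one byte at a time and return the resulting bytes.
def reversed_bytes(s)
  s.unpack('C*').reverse
end

# Find the byte offset of sub within s, comparing raw bytes.
def byte_offset(s, sub)
  bytes  = s.unpack('C*')
  needle = sub.unpack('C*')
  (0..bytes.length - needle.length).find { |i| bytes[i, needle.length] == needle }
end

s = "Κα"                 # two letters, four bytes: CE 9A CE B1
p reversed_bytes(s)      # bytes of a byte-wise reverse
p "αΚ".unpack('C*')      # bytes of the reversal we actually wanted
p byte_offset(s, "α")    # a byte offset, not a letter position
```

The byte-wise reverse puts each letter's second byte in front of its first, so the result is no longer valid UTF-8, let alone the letters in reverse order. Likewise, the second letter sits at byte offset 2 because the first letter occupies bytes 0 and 1.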

As Mike mentioned, there are about a bazillion methods in String, so there’s a lot more testing that could be done. I guess one approach to problems like these would be to write jindex, jreverse, and so on. The approach I have in mind (converting strings to arrays) would probably be slow… these are the kind of functions that I imagine would best be implemented way down in C, where linguistics geeks like myself dare not tread.
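A rough sketch of that array-based approach might look like the following. The names jreverse and jindex simply follow the jlength pattern; they are not part of jcode, and they are written here as plain methods rather than additions to String. The sketch relies on String#unpack and Array#pack with the 'U*' directive, which treat the string as UTF-8 and convert to and from an array of code points. It also assumes one code point per letter, which combining accents would break.

```ruby
# Hypothetical helpers; the names follow jlength but are not part of jcode.

# Reverse by letters: split into code points, reverse, and repack as UTF-8.
def jreverse(s)
  s.unpack('U*').reverse.pack('U*')
end

# Letter position of the first occurrence of sub in s, or nil if absent.
def jindex(s, sub)
  chars  = s.unpack('U*')
  needle = sub.unpack('U*')
  (0..chars.length - needle.length).find { |i| chars[i, needle.length] == needle }
end

s = "Καλημέρα κόσμε!"
puts jreverse(s)       # the greeting, letter by letter, backwards
puts jindex(s, 'α')    # position counted in letters, not bytes
```

Building an array for every call is exactly the kind of overhead mentioned above, so treat this as a way to get the tests passing rather than something fast.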

Thanks to Chad Fowler for catching an error in an earlier version of this post… oops!

UPDATE
Why the Lucky Stiff has some interesting ideas about how to get around these limitations, at least until they’re fixed in Ruby: RedHanded » Closing in on Unicode with Jcode