Ruby and Unicode

I’ve been looking into Ruby’s Unicode support, since I’m working on a Rails project. I had to jump through some hoops to figure out how to get Ruby to handle UTF-8 — it’s not too well documented.

The short answer can be found here: How To Use Unicode Strings in Rails. Bottom line: prefix your code with:

$KCODE = 'u'
require 'jcode'

… and replace length with jlength. You don’t have to change anything else in your source, which is rather nice. (In Python, for instance, you have to label Unicode strings.) You can just put Unicode stuff right in your source files, and pretty much think of those strings as “letters” in an intuitive way. Pretty much.
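A quick sanity check of what that buys you (Ruby 1.8 here, where jcode applies; jlength counts characters, while the built-in length counts bytes):

$KCODE = 'u'
require 'jcode'

s = "κόσμε"  # five Greek letters, ten bytes in UTF-8
s.length     # => 10
s.jlength    # => 5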

That’s the way I think of Unicode: it allows you to think of letters as letters, and not as “a letter of n bytes.” (Remember letters, o children of the computer age?)

Behind the scenes, your content will (probably) be stored in UTF-8, which is a variable-length, multi-byte encoding. This means that when you see the following:

U+062A ARABIC LETTER TEH
ت

That single letter is actually two bytes behind the scenes (0xD8 and 0xAA).

However, when you see:

U+3041 HIRAGANA LETTER SMALL A
ぁ

…there are three bytes behind the scenes (0xE3, 0x81, and 0x81). ASCII characters are still one byte each, since UTF-8 is backward-compatible with ASCII.
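If you want to see those bytes for yourself, unpack the string (a small aside, not part of the how-to above: unpack('C*') returns a string’s raw bytes as integers):

"ت".unpack('C*').map { |b| '0x%02X' % b }  # => ["0xD8", "0xAA"]
"ぁ".unpack('C*').map { |b| '0x%02X' % b }  # => ["0xE3", "0x81", "0x81"]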

Unicode hides all that nonsense from you as a programmer. You can tell a program to count the number of letters in an Arabic word or a Japanese word and it will tell you what you really want to know: how many letters are in those words, not how many bytes. Who cares how many bytes happen to be used to encode a U+062A ARABIC LETTER TEH? It’s just a letter!

So yeah. End rant.

But there are still some rough patches in Ruby’s Unicode support (or in my understanding of it; a correction of either would be appreciated).

For instance… in Mike Clark’s intro to learning Ruby through unit testing, he suggests testing some of the string methods like this:

require 'test/unit'
 
class StringTest < Test::Unit::TestCase
  def test_length
    s = "Hello, World!"
    assert_equal(13, s.length)
  end
end

So let’s use the Greek translation of “Hello, World!”

Καλημέρα κόσμε!

That’s 15 characters, counting the space and the exclamation point. Let’s test it:

require 'test/unit'
$KCODE = 'UTF8'
require 'jcode'
 
class StringTest < Test::Unit::TestCase
  def test_jlength
    s = "Καλημέρα κόσμε!"
    assert_equal(15, s.jlength) # Note the 'j'
    assert_not_equal(15, s.length) # plain length counts bytes, not characters
    assert_equal(28, s.length) # each Greek letter takes two bytes in UTF-8
  end
end

All that works as expected. In fact, I went looking in the Pickaxe book and found an example just like this.

But I’ll leave you with a few tests that fail, and seem to me like they shouldn’t (or am I misunderstanding?).

  def test_reverse
    # String#reverse reverses bytes, not characters, so multi-byte
    # letters come out scrambled. There are ways around this, but...
    s = "Καλημέρα κόσμε!"
    reversed = "!εμσόκ αρέμηλαΚ"
    srev = s.reverse
    assert_equal(reversed, srev) # fails
  end

  def test_index
    # String#index isn't Unicode-aware either; it counts bytes.
    # There are ways around this, but...
    s = "Καλημέρα κόσμε!"
    assert_equal(0, s.index('Κ')) # passes
    assert_equal(1, s.index('α')) # fails!
    assert_equal(2, s.index('α')) # passes: byte offset 2, i.e. the third byte
  end

Neither of those tests passes.

As Mike mentioned, there are about a bazillion methods in String, so there’s a lot more testing that could be done. I guess one approach to problems like these would be to write jindex, jreverse, and so on. The approach I have in mind (converting strings to arrays) would probably be slow… these are the kind of functions that I imagine would best be implemented way down in C, where linguistics geeks like myself dare not tread.
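Here’s a rough sketch of that array idea in plain Ruby, just to show the shape of it. (The names jreverse and jindex are my own invention, not part of jcode; unpack('U*') decodes a UTF-8 string into an array of code points, and pack('U*') re-encodes one.)

class String
  # Reverse by code point rather than by byte.
  # (Still naive: combining sequences would get scrambled.)
  def jreverse
    unpack('U*').reverse.pack('U*')
  end

  # Character offset of the first occurrence of str, or nil if absent.
  def jindex(str)
    chars  = unpack('U*')
    needle = str.unpack('U*')
    (0..chars.length - needle.length).find do |i|
      chars[i, needle.length] == needle
    end
  end
end

"Καλημέρα κόσμε!".jreverse     # => "!εμσόκ αρέμηλαΚ"
"Καλημέρα κόσμε!".jindex('α')  # => 1, a character offset this time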

Thanks to Chad Fowler for catching an error in an earlier version of this post… oops!

UPDATE
Why the Lucky Stiff has some interesting ideas about how to get around these limitations, at least until they’re fixed in Ruby: RedHanded » Closing in on Unicode with Jcode

4 Comments »

  1. Oliver Said,

    July 12, 2005 @ 10:27 am

    It is not correct that Unicode hides “all that nonsense” from your programming, at least not the “nonsense” you mean. Unicode does not deal with letters. Unicode deals with code points, and in Unicode not every code point corresponds to a letter. For example, there exist combining marks that are not complete letters themselves but only change the preceding letter into a variant. For some of these combinations a precomposed form exists, but this complicates things even more: the letter ä can be encoded either as U+00E4 or as the sequence U+0061 U+0308. As you can see, you can neither count the letters naively nor simply compare two such strings. To deal with the latter case, Unicode provides a normalization algorithm to construct a canonical form from a given Unicode sequence.

    I like Unicode because it is a great step forward. But what I want to say is: doing real i18n is more than just plugging Unicode in. It is important that people understand that there are issues Unicode does not solve, because Unicode does not try to solve these things! Dealing with all this “nonsense” requires a higher-level library (for example, have a look at ICU http://icu.sf.net). If people do not see these issues, they simply mark all their strings as Unicode (e.g. with a preceding u when they are Python programmers :) ) and think “Wooow, I have a fully i18n’ized app,” but things are _not_ sooo easy!

  2. pat Said,

    July 12, 2005 @ 12:56 pm

    Hi Oliver,

    Thanks for the comment!

    You’re right on, of course. My phrase “all that nonsense” is sweeping an awful lot of stuff under the rug. ☺

    I was trying to emphasize that in UTF-8 text the number of bytes per code point is effectively abstracted away: i.e., the programmer doesn’t have to know that a Japanese character is encoded as 3 bytes, an Arabic letter as 2, or whatever. So, insofar as code points agree with people’s intuition of what a letter is, Unicode brings us closer to the abstract idea of “letters.”

    But of course, intuition of what a letter is is also fuzzy: is «é» its own letter in the context of an English document, for instance? Or a variant of «e»?

    And it seems to me that these are just the same sort of vagaries that underlie the precomposed vs. sequential issue that you describe. For instance, if one were to write a letter-counting program, one would have to make decisions about counting «é» vs. «e». But yes, it seems unlikely that one would want to count «ˊ» (U+02CA) as a letter of its own. And, again as you point out, that’s what normalization is for, and a decent character-counting utility should do normalization.
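    To make that concrete, here’s a quick Ruby sketch (1.8 again, with jcode loaded as in the post; pack('U*') builds a UTF-8 string from a list of code points):

    precomposed = [0x00E4].pack('U*')         # "ä" as the single code point U+00E4
    decomposed  = [0x0061, 0x0308].pack('U*') # "a" plus combining diaeresis U+0308

    precomposed == decomposed  # => false: same "letter", different byte sequences
    precomposed.jlength        # => 1
    decomposed.jlength         # => 2 code points for one perceived letter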

    My point is simply that these sorts of issues are at least in the realm of “language,” or “orthography” as most people think of them, and not “bytes,” which is the way it’s always been.

    As for your point that real i18n is more than just plugging Unicode in, I couldn’t agree more. Unicode is just a starting point. I think i18n is really best thought of as a process rather than something one can accomplish all at once, although libraries like the one you mention handle a lot of the burden.

    Thanks again for dropping by — by the way, do you have a blog? I’m always on the lookout for blogs that touch on i18n and Unicode in particular, especially when their authors have more experience than I do! ☺

  3. Lion Kimbro Said,

    July 23, 2005 @ 1:21 pm

    I recently learned something somewhat interesting: Unicode opposition is strongest in Japan. There was something called the Han Unification, and they didn’t like it. I always wondered if that rejection influenced Ruby’s support for Unicode.

  4. pat Said,

    July 23, 2005 @ 11:59 pm

    Hi Lion,

    That’s an interesting article, thanks for the link. It’s sort of surprising to see that the disagreement over Japanese and Chinese forms has proven more contentious than that between Simplified and Traditional Chinese.

    But you’re right, there seems to be an awful lot of inertia with Japanese content; even the blog of Matz, Ruby’s author, is encoded as EUC-JP.

    I think eventually, though, people will come around to Unicode because, practically speaking, there’s just no other encoding that allows you to mix any two (or more) languages in the same document. The problem isn’t so much those rare documents which include both Chinese and Japanese variants of a character; it’s the far more common documents that include Chinese or Japanese and any other language.

    Without Unicode, how can one have a bilingual document in Japanese and Persian, or Mandarin and Hindi, etc., etc.?
