Continuing Adventures in Ruby and Unicode

I’m like reeeeeally tired right now, but here goes anyway.

Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.

Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.

Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through this all before, remember? Counting utf-8 stuff in Ruby doesn’t work so hot, right?

Er, that’s what I thought. But behold, from the docs for that script:

distance(str1, str2): Calculate the Levenshtein distance between two strings str1 and str2. str1 and str2 should be ASCII or UTF-8.

¿Como say what?

Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):

        s = str1.unpack('U*')
        t = str2.unpack('U*')
        n = s.length
        m = t.length
        return m if (0 == n)
        return n if (0 == m)

And n and m now contain the length of str1 and str2.

That is, the “real” length.

The number of characters, not bytes.

Hmm, yeah. The jlength.

Except we didn’t even require 'jcode'.
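To make that concrete, here is a tiny sketch of the bytes-versus-characters difference. It assumes a modern Ruby (1.9 or later, where String#bytesize exists), and the sample string is just one I made up for illustration. It is not from levenshtein.rb.

```ruby
# A minimal sketch, assuming Ruby 1.9+ (String#bytesize does not exist in 1.8).
# The sample string is three kanji ("nihongo"), written with \u escapes
# so the example does not depend on the source file's encoding.
str = "\u65E5\u672C\u8A9E"

bytes      = str.bytesize       # size of the UTF-8 encoding, in bytes
codepoints = str.unpack('U*')   # array of integer codepoints, one per character

puts bytes                # 9
puts codepoints.length    # 3
p codepoints              # [26085, 26412, 35486]

# pack('U*') goes the other way: codepoints back to a UTF-8 string.
puts codepoints.pack('U*') == str   # true
```

So the array that comes out of unpack('U*') has one entry per character, and asking for its length gives the character count without jcode ever entering the picture.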

So just how does this String#unpack doohickey work, anyway…

Well, one could go look at the docs.

IT LOOKS LIKE C!!! *runs screaming*

Okay yeah. Well, that’s interesting. I’m going to have to read about that.
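In the meantime, here is roughly how I imagine the unpack trick playing out in a whole edit distance function. To be clear, this is my own sketch and not the code from levenshtein.rb: the method name codepoint_distance is made up, and the body is the plain textbook two-row dynamic programming version. Only the first few lines follow the excerpt above.

```ruby
# Hypothetical helper, NOT the implementation from levenshtein.rb.
# A textbook two-row edit distance that compares codepoints instead of bytes.
def codepoint_distance(str1, str2)
  s = str1.unpack('U*')   # one integer per character, not per byte
  t = str2.unpack('U*')
  return t.length if s.empty?
  return s.length if t.empty?

  # prev[j] holds the distance between the processed prefix of s
  # and the first j characters of t.
  prev = (0..t.length).to_a
  s.each_with_index do |sc, i|
    curr = [i + 1]
    t.each_with_index do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [prev[j + 1] + 1,      # deletion
               curr[j] + 1,          # insertion
               prev[j] + cost].min   # substitution (or a free match)
    end
    prev = curr
  end
  prev.last
end

puts codepoint_distance("kitten", "sitting")   # 3
puts codepoint_distance("caf\u00E9", "cafe")   # 1
```

The second call is the interesting one. The accented e takes two bytes in UTF-8, so a byte-by-byte comparison would count two edits there, while the codepoint version counts the single substitution you would expect.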

And maybe this, because that guy seems to know what the heck this jcount thing I found in /usr/lib/ruby/1.8/jcode.rb is all about, except that I’m going to have to use this because I couldn’t really tell you off hand what ここではまずはじめに jcode モジュールを呼び出します.つぎに,漢字コードをEUCに指定しています.その上で日本語用に追加された jlength により文字数を計算しています means.

Okay well specifically I don’t know what 指定する and 計算する mean.

It’s always the verbs that get you.

We’ll figure all that out tomorrow, mmkay?


4 Comments »

  1. Paul Battley Said,

    July 3, 2005 @ 2:44 pm

    Hehe - I have Unicode-fu? I like that.

    I’ve done a fair bit of multilingual text processing in Ruby, in UTF-8, with very few problems. Of course, a large part of ensuring an easy life is ensuring that all systems are speaking UTF-8 to start with, and to normalise any non-UTF-8 external input to UTF-8 before processing.

    Regarding the 'U*' format specifier in pack/unpack, this converts UTF-8 text from/to an array of Unicode codepoints (or is that Unicodepoints? Whatever). That means that my code above makes an Array of Fixnum from a UTF-8 string, where each element corresponds to one letter in the source. For the purposes of the Levenshtein algorithm, it doesn’t even matter what those are as long as there is one per character, so my implementation uses the Unicode codepoints as the basis for calculation.

    As I speak Japanese, I can also tell you that the text above says, “First, we call [require] the jcode module here. Next, we set the kanji code [$KCODE] to EUC-JP. After that, we use [the] jlength [method], which has been added for use with Japanese, to count the number of characters.”

  2. julik Said,

    July 10, 2005 @ 11:26 am

    It is not fair that you have to speak Japanese to be able to use Ruby. Gives me this warm fuzzy feeling of discrimination.

  3. Sean Bryant Said,

    July 25, 2005 @ 7:39 pm

    Speaking Japanese doesn’t really help you with Ruby. It does give more incentive to learn to use UTF-8 encoded strings though.

  4. pat Said,

    July 25, 2005 @ 7:46 pm

    Yeah, Japanese isn’t necessary (at all). Actually, I get the impression that most of the really useful documentation nowadays is in English.

    I just happened to take a few Japanese classes in college and I think it’s fun to try to decipher stuff.

    Looking back at this post, it’s one of the worst things I’ve written, haha. It’s totally all over the map.

    HOLD ME TO A HIGHER STANDARD HERE, PEOPLE!
    ☺☻☺☻☺☻☺☻☺☻
