infundibulum

Continuing Adventures in Ruby and Unicode

June 25th, 2005

I’m like reeeeeally tired right now, but here goes anyway.

Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.

Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.

Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through this all before, remember? Counting UTF-8 stuff in Ruby doesn’t work so hot, right?

Er, that’s what I thought. But behold, from the docs for that script:

distance(str1, str2): Calculate the Levenshtein distance between two strings str1 and str2. str1 and str2 should be ASCII or UTF-8.

¿Como say what?

Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):

        s = str1.unpack('U*')
        t = str2.unpack('U*')
        n = s.length
        m = t.length
        return m if (0 == n)
        return n if (0 == m)

And n and m now contain the length of str1 and str2.

That is, the “real” length.

The number of characters, not bytes.
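To make that concrete, here is a minimal sketch of the same trick on its own. The helper name codepoint_length is made up for illustration, and the byte count uses String#bytesize, which assumes a Ruby new enough to have that method (on the 1.8 of this post, plain String#length was the byte count).

```ruby
# Hypothetical helper, not part of levenshtein.rb: count UTF-8 characters
# by unpacking the string into an array of integer code points.
def codepoint_length(str)
  str.unpack('U*').length
end

word = "日本語"                   # three characters, three bytes each in UTF-8

puts word.unpack('U*').inspect    # => [26085, 26412, 35486]
puts codepoint_length(word)       # => 3
puts word.bytesize                # => 9
```

The 'U*' directive reads the string as UTF-8 and hands back one integer per code point, so the array length is the character count regardless of how many bytes each character takes.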

Hmm, yeah. The jlength.

Except we didn’t even require 'jcode'.
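And here is why it matters for the distance itself. This is a rough sketch of a two-row Levenshtein, not the actual levenshtein.rb source; the names edit_distance, codepoint_distance and byte_distance are invented for the comparison. The only difference between the two wrappers is the unpack directive: 'U*' for code points, 'C*' for raw bytes.

```ruby
# Sketch only: generic Levenshtein distance over two arrays of integers.
def edit_distance(s, t)
  prev = (0..t.length).to_a            # distances from the empty prefix of s
  s.each_with_index do |sc, i|
    curr = [i + 1]                     # cost of deleting the first i + 1 items of s
    t.each_with_index do |tc, j|
      cost = (sc == tc) ? 0 : 1
      curr << [prev[j + 1] + 1,        # deletion
               curr[j] + 1,            # insertion
               prev[j] + cost].min     # substitution (or match)
    end
    prev = curr
  end
  prev.last
end

# Compare character by character (UTF-8 code points).
def codepoint_distance(str1, str2)
  edit_distance(str1.unpack('U*'), str2.unpack('U*'))
end

# Compare byte by byte, which is what a naive version ends up doing.
def byte_distance(str1, str2)
  edit_distance(str1.unpack('C*'), str2.unpack('C*'))
end

puts codepoint_distance("café", "cafe")   # => 1
puts byte_distance("café", "cafe")        # => 2
```

One accented letter is one edit when you count characters, but two when you count bytes, because é is two bytes in UTF-8.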

So just how does this String#unpack doohickey work, anyway…

Well, one could go look at the docs.

IT LOOKS LIKE C!!! *runs screaming*

Okay yeah. Well, that’s interesting. I’m going to have to read about that.

And maybe this, because that guy seems to know what the heck this jcount thing I found in /usr/lib/ruby/1.8/jcode.rb is all about, except that I’m going to have to use this because I couldn’t really tell you off hand what ここではまずはじめに jcode モジュールを呼び出します.つぎに,漢字コードをEUCに指定しています.その上で日本語用に追加された jlength により文字数を計算しています means.

Okay well specifically I don’t know what 指定する and 計算する mean.

It’s always the verbs that get you.

We’ll figure all that out tomorrow, mmkay?

ruby unicode

Want to Know How to Make Some Money?

June 25th, 2005

Here, I’ll tell you.

News Sentinel | 06/24/2005 | Funding cut for translator service

Asterisk + Wireless network + Laptops + Webcams + Subscriptions + Nationwide (Worldwide?) network of on-call interpreters for lots of languages.

Well, go on.

Update: This probably wouldn’t work: I Bet You Didn’t Make Any Money…