Continuing Adventures in Ruby and Unicode
I’m like reeeeeally tired right now, but here goes anyway.
Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.
Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.
Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through all this before, remember? Counting UTF-8 stuff in Ruby doesn’t work so hot, right?
Er, that’s what I thought. But behold, from the docs for that script:
¿Como say what?
Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):
s = str1.unpack('U*')
t = str2.unpack('U*')
n = s.length
m = t.length
return m if (0 == n)
return n if (0 == m)
And n and m now contain the lengths of str1 and str2, respectively.
That is, the “real” length.
The number of characters, not bytes.
Hmm, yeah. The jlength.
Except we didn’t even require 'jcode'.
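To see why that matters for edit distance, here's a minimal sketch. It is my own, not the code from levenshtein.rb: edit_distance_sketch is a made-up name for a plain two-row dynamic-programming version that works on arrays, so what you unpack the strings into decides what gets counted. It assumes the strings are UTF-8 and that your Ruby reads the source file as UTF-8.

```ruby
# encoding: utf-8

# Hypothetical helper (not from levenshtein.rb): textbook edit distance
# over two arrays, keeping only the previous and current rows.
def edit_distance_sketch(s, t)
  n = s.length
  m = t.length
  return m if n == 0
  return n if m == 0

  prev = (0..m).to_a                # cost of building t's prefixes from nothing
  s.each_with_index do |a, i|
    curr = [i + 1]                  # cost of deleting the first i + 1 items of s
    t.each_with_index do |b, j|
      cost = (a == b) ? 0 : 1
      curr << [prev[j + 1] + 1,     # deletion
               curr[j] + 1,         # insertion
               prev[j] + cost].min  # substitution, or a free match
    end
    prev = curr
  end
  prev[m]
end

str1 = "日本"
str2 = "日本語"

# 'C*' unpacks to raw bytes, 'U*' to Unicode code points.
by_bytes      = edit_distance_sketch(str1.unpack('C*'), str2.unpack('C*'))
by_characters = edit_distance_sketch(str1.unpack('U*'), str2.unpack('U*'))

puts by_bytes       # 3, because 語 is three bytes in UTF-8
puts by_characters  # 1, because it is one character
```

Same two strings, same algorithm, and the answer is off by a factor of three if you count bytes.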
So just how does this String#unpack doohickey work, anyway…
Well, one could go look at the docs…
IT LOOKS LIKE C!!! *runs screaming*
Okay yeah. Well, that’s interesting. I’m going to have to read about that.
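In the meantime, here's a rough sketch of what the 'U*' template appears to do, as far as I can tell: U decodes one UTF-8 character into its integer code point, and * repeats that for the rest of the string. Array#pack with the same template goes the other way. The sample string and variable names are mine, and String#bytesize assumes Ruby 1.8.7 or later.

```ruby
# encoding: utf-8

str = "日本語"

codepoints = str.unpack('U*')       # one Integer per character
round_trip = codepoints.pack('U*')  # and back to a UTF-8 string

puts codepoints.inspect   # [26085, 26412, 35486]
puts codepoints.length    # 3 characters
puts str.bytesize         # 9 bytes
puts round_trip == str    # true on a Ruby that reads this file as UTF-8
```

So the array's length is the character count, no jcode required.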
And maybe this, because that guy seems to know what the heck this jcount thing I found in /usr/lib/ruby/1.8/jcode.rb is all about, except that I’m going to have to use this because I couldn’t really tell you offhand what ここではまずはじめに jcode モジュールを呼び出します.つぎに,漢字コードをEUCに指定しています.その上で日本語用に追加された jlength により文字数を計算しています means.
Okay well specifically I don’t know what 指定する and 計算する mean.
It’s always the verbs that get you.
We’ll figure all that out tomorrow, mmkay?
