Continuing Adventures in Ruby and Unicode
I’m like reeeeeally tired right now, but here goes anyway.
Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.
Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.
Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through this all before, remember? Counting utf-8 stuff in Ruby doesn’t work so hot, right?
Er, that’s what I thought. But behold, from the docs for that script:
¿Como say what?
Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):
    s = str1.unpack('U*')
    t = str2.unpack('U*')
    n = s.length
    m = t.length
    return m if (0 == n)
    return n if (0 == m)
And n and m now contain the length of str1 and str2.
That is, the “real” length.
The number of characters, not bytes.
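To make the bytes-versus-characters thing concrete, here's a tiny sketch. The sample word is my own pick, and I'm assuming a Ruby that has String#bytesize (on old 1.8, plain length is what hands you the byte count):

```ruby
# Sample word chosen for illustration: three characters, three bytes each in UTF-8.
word = "日本語"

# unpack('U*') decodes the UTF-8 bytes into an array of integer code points.
codepoints = word.unpack('U*')

byte_count = word.bytesize      # how many bytes the string occupies
char_count = codepoints.length  # how many code points it contains

puts "bytes: #{byte_count}, characters: #{char_count}"
# prints "bytes: 9, characters: 3"
```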
Hmm, yeah. The jlength.
Except we didn’t even require 'jcode'.
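For the curious, here's roughly how a whole edit-distance function could hang off that opening, no jcode in sight. To be clear, this is my own minimal sketch and not the code from levenshtein.rb; the method name codepoint_distance is made up for illustration. It also compares code points, so an accented letter written as a base letter plus a combining mark would count as two.

```ruby
# Hypothetical helper (not the RAA levenshtein.rb implementation):
# Levenshtein distance computed over Unicode code points instead of bytes.
def codepoint_distance(str1, str2)
  s = str1.unpack('U*')
  t = str2.unpack('U*')
  return t.length if s.empty?
  return s.length if t.empty?

  # previous[j] is the distance between the code points of s handled so far
  # and the first j code points of t. It starts as the "empty s" row.
  previous = (0..t.length).to_a

  s.each_with_index do |s_code, i|
    current = [i + 1] # distance from the first i + 1 code points of s to empty t
    t.each_with_index do |t_code, j|
      cost = (s_code == t_code) ? 0 : 1
      current << [
        previous[j + 1] + 1, # deletion
        current[j] + 1,      # insertion
        previous[j] + cost   # substitution (free when the code points match)
      ].min
    end
    previous = current
  end

  previous.last
end

puts codepoint_distance("kitten", "sitting")
```

The two-row trick (previous and current) just saves keeping the whole matrix around; a full n by m table would give the same answer.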
So just how does this String#unpack doohickey work, anyway…
Well one could go look at the docs…
IT LOOKS LIKE C!!! *runs screaming*
Okay yeah. Well, that’s interesting. I’m going to have to read about that.
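In the meantime, a quick poke at it. As far as I can tell the format string is a list of directives: 'C' reads one unsigned byte, 'U' reads one UTF-8 character as a code point, and the '*' means "keep going until the string runs out". The sample word is mine, and this is just a sketch of those two directives, not a tour of the whole thing:

```ruby
# Sample word chosen for illustration; the "í" takes two bytes in UTF-8.
word = "día"

bytes = word.unpack('C*') # every byte as an unsigned 8-bit integer
codes = word.unpack('U*') # every UTF-8 character as a code point

# Array#pack is the inverse: it turns the code points back into a UTF-8 string.
round_trip = codes.pack('U*')

p bytes # prints [100, 195, 173, 97]
p codes # prints [100, 237, 97]
puts round_trip == word # prints true
```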
And maybe this, because that guy seems to know what the heck this jcount thing I found in /usr/lib/ruby/1.8/jcode.rb is all about, except that I’m going to have to use this because I couldn’t really tell you offhand what ここではまずはじめに jcode モジュールを呼び出します.つぎに,漢字コードをEUCに指定しています.その上で日本語用に追加された jlength により文字数を計算しています means.
Okay well specifically I don’t know what 指定する and 計算する mean.
It’s always the verbs that get you.
We’ll figure all that out tomorrow, mmkay?