Continuing Adventures in Ruby and Unicode
June 25th, 2005
I’m like reeeeeally tired right now, but here goes anyway.
Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.
Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.
Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through this all before, remember? Counting utf-8 stuff in Ruby doesn’t work so hot, right?
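To make that worry concrete, here’s a rough sketch. It’s a plain textbook edit distance, not Michael’s code and not levenshtein.rb, and edit_distance is just a name made up for this example. Fed raw UTF-8 bytes (which is what 'C*' hands back), it overcounts: “café” and “cafe” are one character apart, but the bytes say otherwise.

```ruby
# Textbook edit distance over two arrays of comparable units.
# (Hypothetical helper for illustration; not taken from levenshtein.rb.)
def edit_distance(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a            # distances from the empty prefix of a
  a.each_with_index do |x, i|
    curr = [i + 1]
    b.each_with_index do |y, j|
      cost = (x == y) ? 0 : 1
      curr << [prev[j + 1] + 1,        # deletion
               curr[j] + 1,            # insertion
               prev[j] + cost].min     # substitution (or match)
    end
    prev = curr
  end
  prev.last
end

# "caf\xC3\xA9" is "café" written out as UTF-8 bytes; "é" takes two of them.
byte_distance = edit_distance("caf\xC3\xA9".unpack('C*'), "cafe".unpack('C*'))
puts byte_distance  # 2, even though only one character differs
```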
Er, that’s what I thought. But behold, from the docs for that script:
¿Como say what?
Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):
s = str1.unpack('U*')
t = str2.unpack('U*')
n = s.length
m = t.length
return m if (0 == n)
return n if (0 == m)
And n and m now contain the length of str1 and str2.
That is, the “real” length.
The number of characters, not bytes.
Hmm, yeah. The jlength.
Except we didn’t even require 'jcode'.
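For the record, a tiny sketch of the difference, assuming the string really is UTF-8 (the sample string is made up for illustration, it’s not from the script): 'C*' unpacks bytes, 'U*' unpacks codepoints, and nothing here requires jcode.

```ruby
# "caf\xC3\xA9" is "café" written out as UTF-8 bytes.
str = "caf\xC3\xA9"

bytes      = str.unpack('C*')  # one Integer per byte
codepoints = str.unpack('U*')  # one Integer per Unicode codepoint

puts bytes.length        # 5
puts codepoints.length   # 4
puts codepoints.inspect  # [99, 97, 102, 233]
```

The last element, 233, is U+00E9, the codepoint for “é”.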
So just how does this String#unpack doohickey work, anyway…
Well one could go look at the docs…
IT LOOKS LIKE C!!! *runs screaming*
Okay yeah. Well, that’s interesting. I’m going to have to read about that.
And maybe this, because that guy seems to know what the heck this jcount thing I found in /usr/lib/ruby/1.8/jcode.rb is all about, except that I’m going to have to use this because I couldn’t really tell you off hand what ここではまずはじめに jcode モジュールを呼び出します.つぎに,漢字コードをEUCに指定しています.その上で日本語用に追加された jlength により文字数を計算しています means.
Okay well specifically I don’t know what 指定する and 計算する mean.
It’s always the verbs that get you.
We’ll figure all that out tomorrow, mmkay?
Hehe - I have Unicode-fu? I like that.
I’ve done a fair bit of multilingual text processing in Ruby, in UTF-8, with very few problems. Of course, a large part of ensuring an easy life is ensuring that all systems are speaking UTF-8 to start with, and to normalise any non-UTF-8 external input to UTF-8 before processing.
Regarding the 'U*' format specifier in pack/unpack, this converts UTF-8 text from/to an array of Unicode codepoints (or is that Unicodepoints? Whatever). That means that my code above makes an Array of Fixnum from a UTF-8 string, where each element corresponds to one letter in the source. For the purposes of the Levenshtein algorithm, it doesn’t even matter what those are as long as there is one per character, so my implementation uses the Unicode codepoints as the basis for calculation.
As I speak Japanese, I can also tell you that the text above says, “First, we call [require] the jcode module here. Next, we set the kanji code [$KCODE] to EUC-JP. After that, we use [the] jlength [method], which has been added for use with Japanese, to count the number of characters.”
- Paul Battley @ 3 July 2005
This is not fair that you have to speak Japanese to be able to use Ruby. Gives me this warm fuzzy feeling of discrimination.
- julik @ 10 July 2005
Speaking japanese doesn’t really help you with ruby. It does give more incentive to learn to use UTF-8 encoded strings though.
- Sean Bryant @ 25 July 2005
Yeah, Japanese isn’t necessary (at all). Actually, I get the impression that most of the really useful documentation nowadays is in English.
I just happened to take a few Japanese classes in college and I think it’s fun to try to decipher stuff.
Looking back at this post, it’s one of the worst things I’ve written, haha. It’s totally all over the map.
HOLD ME TO A HIGHER STANDARD HERE, PEOPLE!
- pat @ 25 July 2005
☺☻☺☻☺☻☺☻☺☻
指定する == shiteisuru, to designate or assign, I think
計算する == keisansuru, to calculate or compute, I think
With the help of this handy site: http://www.popjisyo.com, I think the whole thing means:
To begin with, the module is called ‘jcode’. The Kanji coding is
specified to be EUC. Additionally, the number of Japanese
characters is computed by using jlength.
わたしの 日本語は 度塩基性です。。。
- Dido Sevilla @ 19 September 2005
Thanks Dido,
I’m a popjisyo fan as well.
- pat @ 20 September 2005
Here is a relatively simple implementation of unicode-aware string class functions using scan(/./u):
====
$KCODE = 'U' # Ruby 1.8: treat strings and regexps as UTF-8

class String
  # Split the string into an array of UTF-8 characters.
  def __str2carr(str = self)
    str.scan(/./u)
  end

  # Character-based counterpart of String#[]: accepts an index,
  # a range, or a (start, length) pair.
  def uat(*args)
    arr = __str2carr
    if args.length == 2 # (start, length)
      arr = arr[args[0].to_i, args[1].to_i]
      arr ? arr.join : self
    elsif args[0].is_a?(Range)
      arr = arr[args[0]]
      arr ? arr.join : self
    else # single index
      arr[args[0]]
    end
  end

  # Character-based counterpart of String#index.
  def uindex(what, offset = 0)
    idx = nil
    chars = __str2carr
    chars = chars[offset..-1] || [] if offset > 0
    case what
    when String
      needle = __str2carr(what)
      # Try every starting position, not just the first candidate.
      0.upto(chars.length - needle.length) do |i|
        if chars[i, needle.length] == needle
          idx = i + offset
          break
        end
      end
    when Integer
      ## not sure what to do for this…what are the semantics for Unicode here? Look for a char represented by a unicode escape…or?
    when Regexp
      m = /(#{what})/u.match(chars.join)
      idx = m.pre_match.ulength + offset if m
    end
    idx
  end

  def uinspect
    "\"#{self}\""
  end

  def ulength
    __str2carr.length
  end

  # Note: the second argument is a length, as in uat(start, length).
  def uslice(from, length)
    uat(from, length)
  end

  def ureverse
    __str2carr.reverse.join
  end
end

s = "Καλημέρα κόσμε!"
puts(s.uat(3))        # η
puts(s.uat(3..4))     # ημ
puts(s.uat(3,2))      # ημ
puts(s.uslice(3,2))   # ημ
puts(s.ulength)       # 15
puts(s.uinspect)      # "Καλημέρα κόσμε!"
puts(s.ureverse)      # !εμσόκ αρέμηλαΚ
puts(s.uindex('μέρ')) # 4
====
It is by no means complete, just a proof of concept really. And obviously it is much slower than native support in the C backend would be. But still, it passes the suggested unit tests and is better than nothing.
- Jordan Callicoat @ 20 July 2006
ack formatting…
- Jordan Callicoat @ 20 July 2006
Hello? Oh well, I probably just duplicated what is already out there anyhow…I thought it was cool though
- Jordan Callicoat @ 25 July 2006