Archive for Ruby

Continuing Adventures in Ruby and Unicode

I’m like reeeeeally tired right now, but here goes anyway.

Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.

Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.

Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through this all before, remember? Counting utf-8 stuff in Ruby doesn’t work so hot, right?

Er, that’s what I thought. But behold, from the docs for that script:

distance(str1, str2) Calculate the Levenshtein distance between two strings str1 and str2. str1 and str2 should be ASCII or UTF-8 .

¿Como say what?

Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):

        s = str1.unpack('U*')
        t = str2.unpack('U*')
        n = s.length
        m = t.length
        return m if (0 == n)
        return n if (0 == m)

And n and m now contain the length of str1 and str2.

That is, the “real” length.

The number of characters, not bytes.

Hmm, yeah. The jlength.

Except we didn’t even require 'jcode'.
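A quick way to check that claim is to count both ways using nothing but String#unpack. This is only a sketch: char_length and byte_length are made-up helper names, and the only calls assumed are unpack with the 'U*' and 'C*' directives.

```ruby
# No jcode, no $KCODE: just String#unpack.

# 'U*' decodes the UTF-8 string into an array of codepoints, one per character.
def char_length(str)
  str.unpack('U*').length
end

# 'C*' gives one unsigned 8-bit integer per byte.
def byte_length(str)
  str.unpack('C*').length
end

s = "Καλημέρα κόσμε!"
puts char_length(s)  # 15
puts byte_length(s)  # 28
```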

So just how does this String#unpack doohickey work, anyway…

Well, one could go look at the docs…

IT LOOKS LIKE C!!! *runs screaming*

Okay yeah. Well, that’s interesting. I’m going to have to read about that.

And maybe this, because that guy seems to know what the heck this jcount thing I found in /usr/lib/ruby/1.8/jcode.rb is all about, except that I’m going to have to use this because I couldn’t really tell you off hand what ここではまずはじめに jcode モジュールを呼び出します.つぎに,漢字コードをEUCに指定しています.その上で日本語用に追加された jlength により文字数を計算しています means.

Okay well specifically I don’t know what 指定する and 計算する mean.

It’s always the verbs that get you.

We’ll figure all that out tomorrow, mmkay?
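In the meantime, here is a minimal sketch of how the rest of the calculation could go once both strings have been unpacked into arrays of codepoints. To be clear, this is the textbook dynamic-programming version written from scratch, not the code from levenshtein.rb, and the name levenshtein_sketch is made up for illustration.

```ruby
# Edit distance over characters (codepoints), not bytes.
def levenshtein_sketch(str1, str2)
  s = str1.unpack('U*')
  t = str2.unpack('U*')
  return t.length if s.empty?
  return s.length if t.empty?

  # prev[j] = distance between the first i characters of s
  # and the first j characters of t; start with i = 0.
  prev = (0..t.length).to_a
  s.each_with_index do |sc, i|
    curr = [i + 1]
    t.each_with_index do |tc, j|
      cost = (sc == tc) ? 0 : 1
      curr << [prev[j + 1] + 1,   # deletion
               curr[j] + 1,       # insertion
               prev[j] + cost].min # substitution (or match)
    end
    prev = curr
  end
  prev.last
end

puts levenshtein_sketch("kitten", "sitting")  # 3
```

Because the comparison happens on codepoints, swapping one Greek letter for another should cost 1, however many bytes each letter takes in UTF-8.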

Getting Unicode, MySQL, and Rails to Cooperate

In a post about Ruby and Unicode a while back, I mentioned the page at the Rails wiki called How To Use Unicode Strings in Rails. (Btw, check out Why the Lucky Stiff’s response to my post, some useful code there.)

It turns out that there were a few more mostly MySQL-specific steps involved in getting Unicode to work correctly with Rails. So I thought I’d describe all the steps we went through to get it set up in one place. This has only been tested with MySQL 4.1.

In MySQL: Set the Encoding when you Create Tables

You need to explicitly tell MySQL that you want your tables to be encoded in UTF-8. Here’s a sample table with 3 columns, id, foo, and bar:

create table samples (
    id int not null auto_increment,
    foo varchar(100) not null,
    bar text not null,
    primary key (id)
) Type=MyISAM CHARACTER SET utf8;

The line Type=MyISAM CHARACTER SET utf8; is where the action is. The table type has to be MyISAM, not InnoDB, because unfortunately InnoDB tables don’t support full-text searching of UTF-8 encoded content (details here). Apparently InnoDB tables are more flexible in general, but if full-text search is crucial for you, you’ll have to go with MyISAM.

Bummer, that.

In any case, then add the CHARACTER SET utf8 directive, as shown.

(I wonder if there is some way to set this as a default, without adding the line to the DDL for every table?)

Set the “charset” and “Content-type” in the Application Controller

This is also described at How To Use Unicode Strings in Rails.

class ApplicationController < ActionController::Base
  before_filter :set_charset
 
  def set_charset
    @headers["Content-Type"] = "text/html; charset=utf-8" 
  end
end

The previous steps were enough to get Unicode showing up in a little test app I generated with scaffold (the app just consisted of an input field and a textarea).

To test it, I pasted in some sample text in various languages. It worked okay for text containing only the characters found in ASCII or latin1, but among other characters there were weird cases of random characters being removed or added. The problem seemed not to have anything to do with the script (i.e., the Unicode block). For instance, in an Esperanto text, “ĉ” (U+0109 LATIN SMALL LETTER C WITH CIRCUMFLEX) came out fine, but “ĝ” (U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX) was borked. Go figure.

It took some digging to find the next bit — thanks to Ben Jackson of INCOMUM Design & Conceito for getting the straight story on the Rails list. The solution is…

Tell Rails to tell MySQL to Use UTF-8. (Got that?)

It came down to a MySQL configuration option: You have to tell MySQL to SET NAMES UTF8, as DHH pointed out in the previous link. You can either do it in the source to ActiveRecord, in mysql_adapter.rb, or you can just make the change in your own application. We chose the latter route.

So, here’s our app/controllers/application.rb as it stands:

class ApplicationController < ActionController::Base
  before_filter :set_charset
  before_filter :configure_charsets

  def set_charset
    @headers["Content-Type"] = "text/html; charset=utf-8"
  end

  def configure_charsets
    @response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Set connection charset. MySQL 4.0 doesn't support this so it
    # will throw an error, MySQL 4.1 needs this
    suppress(ActiveRecord::StatementInvalid) do
      ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
    end
  end
end

(In case you’re wondering, yes, you can have more than one before_filter.)

UPDATE Er, but you can get away with just one: Jonas refactored it. Use this version instead. :)

class ApplicationController < ActionController::Base
  before_filter :configure_charsets

  def configure_charsets
    @response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Set connection charset. MySQL 4.0 doesn't support this so it
    # will throw an error, MySQL 4.1 needs this
    suppress(ActiveRecord::StatementInvalid) do
      ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
    end
  end
end

UPDATED AGAIN Argh, I forgot to remove the set_charset definition before. It should be correct now…

Anyway, now our Rails installation seems to handle (almost) any nutty writing system that we throw at it.

I’m not sure these are all the Right Way™, I’m just saying it’s worked for us (so far).

For one thing, it seems like it might make sense to just go ahead and set the default character set and collation for MySQL to UTF-8, independently of any Rails stuff at all. Aren’t all character sets supposed to be slouching toward Unicode, anyway? But that sounds like a rather apocalyptic measure, for some reason… I guess I’ll hold off on the rapture of the character sets.

Collation is its own topic, but then we’ve not gotten to sorting anything yet. Here’s a recommendation for setting the collation order: build the tables with

CHARACTER SET utf8 COLLATE utf8_general_ci;

Wading through the MySQL docs on that is next on the agenda.

Reading lots of Rails Source

If, like me, you happen to have been studying Ruby on Rails, here’s a silly trick for reading through the source to the applications awaiting judging at Rails Day Contest.

The source of the entries is here in Subversion repositories, but there really isn’t any way to navigate between projects. Each project has the typical Rails folder hierarchy, under URLs like:

http://railsday.com/svn/railsday65/app/models/
http://railsday.com/svn/railsday66/app/models/
http://railsday.com/svn/railsday66/app/controllers/

So if you’re like me, you like to look at lots of code and compare stuff. In the case of Rails, I wanted to get a general feel, for instance, for the sorts of stuff that goes in /app/controllers or /app/models in various projects.

It so happens that Jesse Ruderman has written some navigation bookmarklets that work great for navigating around those Rails projects: if you drag those two linked words (“increment” and “decrement”) to your toolbar and then visit the first project, you can click them to navigate around.

increment: Increases the last number in the URL by 1.

decrement: Decreases the last number in the URL by 1.
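The trick behind the two bookmarklets is simple enough to sketch in Ruby. This is not Jesse Ruderman’s code (his are JavaScript bookmarklets); bump_last_number is a hypothetical helper that just does what the descriptions above say, and unlike a polished version it makes no attempt to preserve zero-padding.

```ruby
# Shift the last run of digits in a URL by delta (+1 or -1).
def bump_last_number(url, delta)
  # (\d+) not followed by any later digit = the last number in the string
  url.sub(/(\d+)(?!.*\d)/) { ($1.to_i + delta).to_s }
end

url = "http://railsday.com/svn/railsday65/app/models/"
puts bump_last_number(url, 1)   # http://railsday.com/svn/railsday66/app/models/
puts bump_last_number(url, -1)  # http://railsday.com/svn/railsday64/app/models/
```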

Okay, maybe that didn’t warrant an entire post.

Heh.

Class variables are Mysterious

I was just reading about Class variables, which, for some reason, don’t make an awful lot of intuitive sense to me… so, how important are they? Being a statistically-minded guy, when in doubt, I count.

UPDATED below, thanks sbp! (The script immediately following gives the right results, but only by chance. Find the bug? 8^) )

$ pwd 
/usr/lib/ruby/1.8
$  grep -c  '@@' *.rb |rev|sort -n |rev|tac|head
weakref.rb 17
yaml.rb 16
profiler.rb 12
gserver.rb 11
tempfile.rb 5
tmpdir.rb 4
set.rb 3
resolv.rb 2
matrix.rb 0
complex.rb 0
$ ls -alh *.rb |wc -l
80

Well, only 8 out of 80 files in the Ruby directory have Class variables, so I’m not going to go insane about it. I mean, I’ll learn it but… that gives me a little perspective.

UPDATE
Sean B. Palmer pointed out via a comment and instant messaging that there’s a problem with my little script: Uh, it doesn’t sort right. It’s sorting reversed numerals. (A count of 10, for instance, would be reversed to 01 and sort below a count of 2.) Funnily enough, the data I had (the number of hits for '@@' in those Ruby files) sorted the same way reversed.

sbp’s solution doesn’t suffer from this thinko:

$ grep -c '@@' *.rb | awk -F: '{print $2 " " $1}' | sort -rn | head

Which gives the same results as my borked command line. But it’s right.
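Since this is the Ruby archive, the same count can be sketched in Ruby itself, sorting on the number rather than on its digits as text. The helper names below (class_var_line_count, rank_by_count, top_class_var_files) are made up for illustration, and the sketch counts lines containing '@@' the way grep -c does, so a line with two class variables still counts once.

```ruby
# Count the lines of a source text that mention '@@',
# roughly what `grep -c '@@'` reports for one file.
def class_var_line_count(source)
  source.each_line.count { |line| line.include?('@@') }
end

# counts is a Hash of filename => count. Sort on the number itself
# (descending), then by name to break ties.
def rank_by_count(counts)
  counts.sort_by { |name, n| [-n, name] }
end

# Hypothetical driver: tally every .rb file in dir and keep the top few.
def top_class_var_files(dir, limit = 10)
  counts = {}
  Dir.glob(File.join(dir, '*.rb')).each do |path|
    # Read as binary so odd encodings in old library files don't matter.
    text = File.open(path, 'rb') { |f| f.read }
    counts[File.basename(path)] = class_var_line_count(text)
  end
  rank_by_count(counts).first(limit)
end
```

Something like top_class_var_files('/usr/lib/ruby/1.8').each { |name, n| puts "#{n} #{name}" } would then print the top ten, assuming that directory exists on your machine.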

Ah, blogging! The debugger of the future!

Ruby and Unicode

I’ve been looking into Ruby’s Unicode support, since I’m working on a Rails project. I had to jump through some hoops to figure out how to get Ruby to handle UTF-8 — it’s not too well documented.

The short answer can be found here: How To Use Unicode Strings in Rails. Bottom line: prefix your code with:

$KCODE = 'u'
require 'jcode'

… and replace length with jlength. You don’t have to change anything else in your source, which is rather nice. (In Python, for instance, you have to label Unicode strings.) You can just put Unicode stuff right in your source files, and pretty much think of those strings as “letters” in an intuitive way. Pretty much.

That’s the way I think of Unicode: it allows you to think of letters as letters, and not as “a letter of n bytes.” (Remember letters, o children of the computer age?)

Behind the scenes, your content will (probably) be stored in UTF-8, which is a variable-length, multi-byte encoding. This means that when you see the following:

U+062A ARABIC LETTER TEH
ت

That single letter is actually two bytes behind the scenes (0xD8 and 0xAA).

However, when you see:

U+3041 HIRAGANA LETTER SMALL A
ぁ

…there are three bytes behind the scenes (0xE3, 0x81, and 0x81). ASCII letters are still one byte, as ASCII is a subset of Unicode.
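Those byte counts are easy to check with pack and unpack. A small sketch, where utf8_bytes is a made-up helper name rather than anything built in:

```ruby
# Encode one codepoint as UTF-8 and list the resulting bytes.
def utf8_bytes(codepoint)
  [codepoint].pack('U').unpack('C*')
end

[0x0041, 0x062A, 0x3041].each do |cp|
  hex = utf8_bytes(cp).map { |b| format('0x%02X', b) }.join(' ')
  puts format('U+%04X: %s', cp, hex)
end
# U+0041: 0x41
# U+062A: 0xD8 0xAA
# U+3041: 0xE3 0x81 0x81
```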

Unicode hides all that nonsense from you as a programmer. You can tell a program to count the number of letters in an Arabic word or a Japanese word and it will tell you what you really want to know: how many letters are in those words, not how many bytes. Who cares how many bytes happen to be used to encode a U+062A ARABIC LETTER TEH? It’s just a letter!

So yeah. End rant.

But there are still some rough patches in Ruby’s Unicode support (or in my understanding of it; a correction of either would be appreciated).

For instance… in Mike Clark’s intro to learning Ruby through Unit Testing, he suggests testing some of the string methods like this:

require 'test/unit'
 
class StringTest < Test::Unit::TestCase
  def test_length
    s = "Hello, World!"
    assert_equal(13, s.length)
  end
end

So let’s use the Greek translation of “Hello, World!”

Καλημέρα κόσμε!

That has 15 characters, including the space and the exclamation point. Let’s test it:

require 'test/unit'
require 'jcode'
$KCODE = 'UTF8'
 
class StringTest < Test::Unit::TestCase
  def test_jlength
    s = "Καλημέρα κόσμε!"
    assert_equal(15, s.jlength) # Note the 'j'
    assert_not_equal(15, s.length) # Normal, non unicode length
    assert_equal(28, s.length) # Greek letters happen to take two-bytes
  end
end

All that works as expected. In fact, I went looking in the Pickaxe book and there was an example just like this.

But I’ll leave you with a few tests that fail, and seem to me like they shouldn’t (or am I misunderstanding?).

  def test_reverse
    # there are ways around this, but...
    s = "Καλημέρα κόσμε!"
    reversed = "!εμσόκ αρέμηλαΚ"
    srev = s.reverse
    assert_equal(reversed,srev) # fails
  end
 
  def test_index
    # String#index isn't Unicode-aware, it's counting bytes
    # there are ways around this, but...
    s = "Καλημέρα κόσμε!"
    assert_equal(0, s.index('Κ')) # passes
    assert_equal(1, s.index('α')) # fails!
    assert_equal(2, s.index('α')) # passes; byte offset 2, i.e. the 3rd byte!
  end

Neither of those work.

As Mike mentioned, there are about a bazillion methods in String, so there’s a lot more testing that could be done. I guess one approach to problems like these would be to write jindex, jreverse, and so on. The approach I have in mind (converting strings to arrays) would probably be slow… these are the kind of functions that I imagine would best be implemented way down in C, where linguistics geeks like myself dare not tread.
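The converting-strings-to-arrays approach could look something like the following. It is only a sketch: jreverse_sketch and jindex_sketch are hypothetical names, the conversion goes through unpack('U*') and pack('U*'), and nothing here is tuned for speed.

```ruby
# Reverse by characters: unpack to codepoints, reverse the array, repack as UTF-8.
def jreverse_sketch(str)
  str.unpack('U*').reverse.pack('U*')
end

# Character-based index: compare slices of the codepoint arrays.
# Returns the character offset of the first match, or nil if there is none.
def jindex_sketch(str, substr)
  haystack = str.unpack('U*')
  needle = substr.unpack('U*')
  (0..haystack.length - needle.length).each do |i|
    return i if haystack[i, needle.length] == needle
  end
  nil
end

s = "Καλημέρα κόσμε!"
puts jreverse_sketch(s)     # !εμσόκ αρέμηλαΚ
puts jindex_sketch(s, 'α')  # 1
```

Both helpers walk whole arrays of integers, so they are likely to be much slower than a byte-level implementation in C would be.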

Thanks to Chad Fowler for catching an error in an earlier version of this post… oops!

UPDATE
Why the Lucky Stiff has some interesting ideas about how to get around these limitations, at least until they’re fixed in Ruby: RedHanded » Closing in on Unicode with Jcode

Learning Ruby through Testing

This is a great idea:

Perhaps you’ve been meaning to learn Ruby for fun or profit, but you just don’t know where to start. I’d like to help by trying a bit of an experiment. No, I’m not going to send you a copy of my Ruby learning tests. The learning comes through doing.

Rather, I’ll start by showing you how I wrote my first Ruby learning test. Then, over the coming weeks and months, I’ll spoonfeed you more tests as a starting point for exploring new facets of Ruby. (Submissions are appreciated, too.) Of course, if I get the sense that nobody’s listening, I’ll stop.

Well, I for one am listening, Mike. Naturally, the first test I wrote myself had to do with Unicode… more on that later.
