Random thought…
June 30th, 2005
It occurred to me that advertisements are like an incredibly messy, distributed directory of companies.
A friend of mine from Helsinki taught me that. Apparently that’s Finnish slang for “what’s up?”
Don’t say you never learned anything on this blog, eh?
Haven’t updated in a bit, so here are the things I should have blogged about:
dl can contain unequal numbers of dds and dts: it’s many-to-many, not one-to-one.
unpack built-in and Unicode. But I’m losing sleep over it, so you can rest assured I’ll blog it sooner or later…
Thus ends today’s episode of mundanity and fluff. Nothing else to see here, move along.
I’m like reeeeeally tired right now, but here goes anyway.
Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.
Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.
Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through this all before, remember? Counting utf-8 stuff in Ruby doesn’t work so hot, right?
Er, that’s what I thought. But behold, from the docs for that script:
¿Como say what?
Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):
s = str1.unpack('U*')
t = str2.unpack('U*')
n = s.length
m = t.length
return m if (0 == n)
return n if (0 == m)
And n and m now contain the length of str1 and str2.
That is, the “real” length.
The number of characters, not bytes.
Hmm, yeah. The jlength.
Except we didn’t even require 'jcode'.
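Here's a tiny sketch of what that trick buys you. This is my own illustration, not code from levenshtein.rb, and the helper name codepoint_length is made up. It assumes a Ruby recent enough to have String#bytesize (1.9 or later) and a UTF-8 encoded string.

```ruby
# Hypothetical helper: count Unicode code points by unpacking the
# string as UTF-8 ('U*') and measuring the resulting integer array.
def codepoint_length(str)
  str.unpack('U*').length
end

word = "\u011Dojo" # "ĝojo": "ĝ" takes two bytes in UTF-8, the rest one each

puts word.bytesize           # number of bytes
puts codepoint_length(word)  # number of code points
```

Note that this counts code points, which is not quite the same thing as "characters" once combining marks get involved.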
So just how does this String#unpack doohickey work, anyway…
Well one could go look at the docs…
IT LOOKS LIKE C!!! *runs screaming*
Okay yeah. Well, that’s interesting. I’m going to have to read about that.
And maybe this, because that guy seems to know what the heck this jcount thing I found in /usr/lib/ruby/1.8/jcode.rb is all about, except that I’m going to have to use this because I couldn’t really tell you off hand what ここではまずはじめに jcode モジュールを呼び出します.つぎに,漢字コードをEUCに指定しています.その上で日本語用に追加された jlength により文字数を計算しています means.
Okay well specifically I don’t know what 指定する and 計算する mean.
It’s always the verbs that get you.
We’ll figure all that out tomorrow, mmkay?
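In the meantime, here is a rough sketch of how the whole thing might hang together: a plain two-row Levenshtein distance that starts with the same unpack('U*') step as the snippet above. To be clear, this is my own reconstruction for illustration, not Paul Battley's actual code, and the method name is just a guess at something sensible.

```ruby
# Sketch: edit distance over Unicode code points rather than bytes.
# Only the first few lines mirror the snippet quoted above; the rest
# is the textbook dynamic-programming approach, keeping two rows.
def levenshtein(str1, str2)
  s = str1.unpack('U*')
  t = str2.unpack('U*')
  return t.length if s.empty?
  return s.length if t.empty?

  # prev[j] is the distance between the code points of s consumed
  # so far and the first j code points of t.
  prev = (0..t.length).to_a
  s.each_with_index do |sc, i|
    curr = [i + 1]
    t.each_with_index do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [prev[j + 1] + 1,  # deletion
               curr[j] + 1,      # insertion
               prev[j] + cost].min # substitution (or match)
    end
    prev = curr
  end
  prev.last
end

puts levenshtein("kitten", "sitting")
puts levenshtein("\u0109a", "ca") # "ĉa" vs "ca"
```

Because the comparison happens on code points, swapping "ĉ" for "c" counts as one edit, even though "ĉ" is two bytes in UTF-8.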
Here, I’ll tell you.
News Sentinel | 06/24/2005 | Funding cut for translator service
Asterisk + Wireless network + Laptops + Webcams + Subscriptions + Nationwide (Worldwide?) network of on-call interpreters for lots of languages.
Well, go on.
I usually skip these sorts of memes, but this one is pretty interesting:
How many languages in your music collection? (kottke.org)
Here’s what I came up with:
(Don’t even get me started on the mp3s…)
Thank you for visiting my post full of indulgence and have a nice day.
In a post about Ruby and Unicode a while back, I mentioned the page at the Rails wiki called How To Use Unicode Strings in Rails. (Btw, check out Why the Lucky Stiff’s response to my post, some useful code there.)
It turns out that there were a few more mostly MySQL-specific steps involved in getting Unicode to work correctly with Rails. So I thought I’d describe all the steps we went through to get it set up in one place. This has only been tested with MySQL 4.1.
You need to explicitly tell MySQL that you want your tables to be encoded in UTF-8. Here’s a sample table with 3 columns, id, foo, and bar:
create table samples (
  id int not null auto_increment,
  foo varchar(100) not null,
  bar text not null,
  primary key (id)
) Type=MyISAM CHARACTER SET utf8;
The line Type=MyISAM CHARACTER SET utf8; is where the action is. The table type has to be MyISAM, not InnoDB, because unfortunately InnoDB tables don’t support full-text searching of UTF-8 encoded content (details here). Apparently InnoDB tables are more flexible in general, but if full-text search is crucial for you, you’ll have to go with MyISAM.
Bummer, that.
In any case, then add the CHARACTER SET utf8 directive, as shown.
(I wonder if there is some way to set this as a default, without adding the line to the DDL for every table?)
This is also described at How To Use Unicode Strings in Rails.
class ApplicationController < ActionController::Base
  before_filter :set_charset

  def set_charset
    @headers["Content-Type"] = "text/html; charset=utf-8"
  end
end
The previous steps were enough to get Unicode showing up in a little test app I generated with scaffold (the app just consisted of an input field and a textarea).
To test it, I pasted in some sample text in various languages. It worked okay for text containing only the characters found in ASCII or latin1, but among other characters there were weird cases of random characters being removed or added. The problem seemed not to have anything to do with the script (i.e., the Unicode block). For instance, in an Esperanto text, “ĉ” (U+0109 LATIN SMALL LETTER C WITH CIRCUMFLEX) came out fine, but “ĝ” (U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX) was borked. Go figure.
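One way to start poking at a puzzle like this is to look at the raw UTF-8 bytes of a character that survived and one that didn't. The sketch below only dumps bytes; the helper name utf8_bytes_hex is made up, and whether the differing byte is really what tripped things up is a guess on my part, not something I've confirmed.

```ruby
# Hypothetical helper: show the UTF-8 bytes of a string as hex pairs.
def utf8_bytes_hex(str)
  str.unpack('C*').map { |byte| format('%02X', byte) }
end

c_circumflex = "\u0109" # "ĉ", came out fine
g_circumflex = "\u011D" # "ĝ", got borked

puts utf8_bytes_hex(c_circumflex).join(' ')
puts utf8_bytes_hex(g_circumflex).join(' ')
```

Both characters share the same lead byte and differ only in the second one, which at least hints that something along the way was treating the text as single bytes rather than as UTF-8.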
It took some digging to find the next bit — thanks to Ben Jackson of INCOMUM Design & Conceito for getting the straight story on the Rails list. The solution is…
It came down to a MySQL configuration option: You have to tell MySQL to SET NAMES UTF8, as DHH pointed out in the previous link. You can either do it in the source to ActiveRecord, in mysql_adapter.rb, or you can just make the change in your own application. We chose the latter route.
So, here’s our app/controllers/application.rb as it stands:
class ApplicationController < ActionController::Base
  before_filter :set_charset
  before_filter :configure_charsets

  def set_charset
    @headers["Content-Type"] = "text/html; charset=utf-8"
  end

  def configure_charsets
    @response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Set connection charset. MySQL 4.0 doesn't support this so it
    # will throw an error, MySQL 4.1 needs this
    suppress(ActiveRecord::StatementInvalid) do
      ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
    end
  end
end
(In case you’re wondering, yes, you can have more than one before_filter.)
class ApplicationController < ActionController::Base
  before_filter :configure_charsets

  def configure_charsets
    @response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Set connection charset. MySQL 4.0 doesn't support this so it
    # will throw an error, MySQL 4.1 needs this
    suppress(ActiveRecord::StatementInvalid) do
      ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
    end
  end
end
UPDATED AGAIN: Argh, I forgot to remove the set_charset definition before. It should be correct now…
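If the suppress block looks mysterious, here is a plain-Ruby sketch of the idea with no Rails required. Everything in it is a stand-in I made up for illustration: suppress_errors imitates what the suppress helper does, and FakeConnection and FakeStatementInvalid pretend to be the ActiveRecord connection and ActiveRecord::StatementInvalid.

```ruby
# Stand-in for the suppress helper: run the block and swallow only
# the listed exception classes.
def suppress_errors(*exception_classes)
  yield
rescue *exception_classes
  nil
end

# Stand-in for ActiveRecord::StatementInvalid.
class FakeStatementInvalid < StandardError; end

# Stand-in for a database connection. An "old" server rejects
# SET NAMES, a "new" one records the statement.
class FakeConnection
  attr_reader :executed

  def initialize(supports_set_names)
    @supports_set_names = supports_set_names
    @executed = []
  end

  def execute(sql)
    raise FakeStatementInvalid, "unknown statement: #{sql}" unless @supports_set_names
    @executed << sql
  end
end

# Same shape as configure_charsets above, minus the header line.
def configure_connection_charset(connection)
  suppress_errors(FakeStatementInvalid) do
    connection.execute('SET NAMES UTF8')
  end
end

old_server = FakeConnection.new(false)
new_server = FakeConnection.new(true)
configure_connection_charset(old_server) # error is swallowed
configure_connection_charset(new_server) # statement goes through
p old_server.executed
p new_server.executed
```

The point is just that the same filter can run against a server that rejects the statement without the request blowing up.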
Anyway, now our Rails installation seems to handle (almost) any nutty writing system that we throw at it.
I’m not sure these are all the Right Way™, I’m just saying it’s worked for us (so far).
For one thing, it seems like it might make sense to just go ahead and set the default character set and collation for MySQL to UTF-8, independently of any Rails stuff at all. Aren’t all character sets supposed to be slouching toward Unicode, anyway? But that sounds like a rather apocalyptic measure, for some reason… I guess I’ll hold off on the rapture of the character sets.
Collation is its own topic, but then we’ve not gotten to sorting anything yet. Here’s a recommendation for building the tables with:
CHARACTER SET utf8 COLLATE utf8_general_ci;
That sets the collation order. Wading through the MySQL docs on that is next on the agenda.
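Until I find out whether there's a server-side default, one low-tech way to avoid retyping the table options is to generate the DDL from a small helper. This is only a sketch: table_options and create_table_sql are names I invented, it just builds strings in the same style as the sample table above, and it hasn't been run against a real MySQL server.

```ruby
# Hypothetical helper: build the trailing table options once, so the
# charset (and optionally the collation) lives in a single place.
def table_options(engine: 'MyISAM', charset: 'utf8', collation: nil)
  options = "Type=#{engine} CHARACTER SET #{charset}"
  options += " COLLATE #{collation}" if collation
  options
end

# Hypothetical helper: wrap column definitions in a create table
# statement that ends with the shared options.
def create_table_sql(name, column_definitions, **opts)
  body = column_definitions.map { |definition| "  #{definition}" }.join(",\n")
  "create table #{name} (\n#{body}\n) #{table_options(**opts)};"
end

puts create_table_sql(
  'samples',
  ['id int not null auto_increment',
   'foo varchar(100) not null',
   'bar text not null',
   'primary key (id)'],
  collation: 'utf8_general_ci'
)
```

It doesn't answer the question of a real default, but it does keep the encoding decision in one spot.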
DIT gives push to language software : HindustanTimes.com
The contents of the free CD will include Hindi language true type fonts with keyboard driver, Hindi Language Unicode Compliant Open Type Fonts, generic fonts code and storage code converter for Hindi, Hindi language version of Bharateeya OO, Firefox Browser in Hindi, Multi Protocol Messenger in Hindi, Email Client in Hindi among others.
This is forward-thinking on the part of the Indian government; for a long time it seemed to be the case that the only major website that encoded Hindi in UTF-8 was a foreign site, BBCHindi. Most news sites in Hindi use any of a bewildering array of proprietary encodings, each with a proprietary font to accompany it (intended, presumably, to lock in users).
But India is a country which stands to benefit more than most from Unicode: not only does it have a huge variety of languages, it has a large number of scripts (which are already defined in Unicode). Standardizing on a single character set will make it much easier to localize software and spread digital literacy.
And literacy, period…
Whether these efforts will be officially extended to other languages and scripts in India remains to be seen, but the fact that it’s been done in Unicode for Hindi will make the path much easier.
Incidentally, all of this is related to other domains besides news, such as email. Consider one blogger’s criticism of Yahoo Mail… (gaping void: Why Yahoo will not be my primary mail client?)
See also: वेब पर हिन्दी - हिन्दी - hindi A blog on the Hindi language, in Hindi and English.
If, like me, you happen to have been studying Ruby on Rails, here’s a silly trick for reading through the source to the applications awaiting judging at Rails Day Contest.
The source of the entries is here in Subversion repositories, but there really isn’t any way to navigate between projects. Each project has the typical Rails folder hierarchy, under URLs like:
http://railsday.com/svn/railsday65/app/models/
http://railsday.com/svn/railsday66/app/models/
http://railsday.com/svn/railsday66/app/controllers/
So if you’re like me, you like to look at lots of code and compare stuff. In the case of Rails, I wanted to get a general feel, for instance, for the sorts of stuff that goes in /app/controllers or /app/models in various projects.
It so happens that Jesse Ruderman has written some navigation bookmarklets that work great for navigating around those Rails projects: if you drag those two linked words (“increment” and “decrement”) to your toolbar and then visit the first project, you can click them to navigate around.
increment: Increases the last number in the URL by 1.
decrement: Decreases the last number in the URL by 1.
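The bookmarklets themselves are JavaScript, but the idea is simple enough to sketch in Ruby. This is my own approximation of the behavior described above, not Jesse Ruderman's code, and step_last_number is a made-up name. It doesn't try to preserve leading zeros or stop you from going below zero.

```ruby
# Hypothetical helper: find the last run of digits in the URL and
# add delta to it, leaving the rest of the URL alone.
def step_last_number(url, delta)
  # (\d+) followed by no further digit anywhere later in the string.
  url.sub(/(\d+)(?!.*\d)/m) { (Regexp.last_match(1).to_i + delta).to_s }
end

url = "http://railsday.com/svn/railsday65/app/models/"
puts step_last_number(url, 1)   # "increment"
puts step_last_number(url, -1)  # "decrement"
```

A URL with no digits in it comes back unchanged.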
Okay, maybe that didn’t warrant an entire post.
Heh.
I’ve been following the story about Microsoft’s latest adventure in China with some interest, but it really wasn’t until I read the latest post at Global Voices that I saw that this story is directly related to a topic I’ve been sort of obsessed with lately, what I think of as “the wall of translation.”
If you missed the story, basically what’s happened is that Microsoft is cooperating with China’s censorship of MSN Spaces blogs and blocking words like “democracy” and “human rights” in the way some blogging systems block words like “fuck.”
I’m glad mine doesn’t.
Anyway, what I mean by the “wall of translation” is that this is a story where a dialog could take place on a large scale between Chinese-speaking and English-speaking bloggers (or speakers of any language, really), if there were an effective mechanism for that translation to take place.
But the conversation hits a wall, because the connections and routines that make translation happen aren’t public.
Presumably some day machine translation will solve that problem. But that day isn’t today. And despite what Google says, I don’t think it’s going to come within the next few years.
People need to think about this problem, a lot, because it must be solved.
I think about it.
A lot.