Archive for Language

Tamil Blogs and Unicode


A letter in the Tamil script.

Seems to have become “languages of India week” here at Infundibulum.

Pankaj Narula dropped a friendly comment on my previous post on Hindi and Unicode, explaining that Hindi blogs are in fact almost universally encoded as Unicode, thanks in large part to Blogger.com’s good Unicode support. And so it seems that among Hindi bloggers at least, everyone is quite up-to-date with their language technology…

Just for fun I decided to poke around in the Tamil blogosphere to see if the situation was similar, and it turns out that Blogspot is equally prominent among Tamil bloggers:

Of the 613 Tamil blogs listed in the directory at the Tamil Bloggers List, 513 are hosted on Blogspot–so we can assume that most Tamil blogs are encoded in Unicode.

After a bit of digging I could only find one blog among the non-blogspot crowd that seemed to have encoding troubles — “peyariliyin pinaaththalgaL.” At first I thought it was in some mysterious legacy encoding, but it turns out that blogdrive.com seems to have its servers set to send Windows 1252. This one, on the same server, specified more fonts and ended up being visible to me. So it was mainly a font thing.

(Incidentally, poking about in the occasional English comment in Tamil blogs, it seems to be the case that the language name is often transliterated as Tamizh rather than Tamil.)

The rather magnificent-looking Tamil letter up there in the corner is TAMIL LETTER I (U+0B87). It looks fun to write. ☺

By the way, am I weird to be obsessed with figuring out how languages I can’t speak a word of are encoded?

Comments (8)

Translation and China

Since I’ve started paying more attention to translation on the web, one country has started to stand out in terms of the sheer amount of translation: China. Deutsche Welle and BBC World Service and VOA News all make massive translation efforts, but it turns out that China’s national Xinhua news service does as well: China to standardize minority language translation system drives home the point:

According to the State Ethnic Affairs Commission, more than 60 million people from 55 minority populations within China use more than 80 spoken languages and about 40 written languages.

China has about 300 minority language translation organizations with part-time and full time staff of more than 100,000.

CRI Online has a list of forty-plus languages. (Some of the content there doesn’t seem too impressive, however–the text on the Burmese page, for instance, is all images.)

Comments

On-the-fly ASCII to Unicode Transliteration with Javascript?

Here’s an interesting little script I found on the Reta Vortaro (that is, the Esperanto web dictionary).


anstataŭigu cx, gx, …, ux

Try typing the string jxauxdo in that box. Press “Trovu” if you like; that will search Google for ĵaŭdo (Esperanto for “Thursday”). Notice that jx becomes ĵ and ux becomes ŭ “on the fly,” as you type. (Come to think of it, maybe “transliteration” isn’t the right word for this process…)

So, backing up a bit, Esperanto has a few odd characters in its orthography:

Letter          Pronunciation (IPA)   Unicode   x-system
ĉ               [ʧ]                   U+0109    cx
ĝ               [ʤ]                   U+011D    gx
ĥ               [x]                   U+0125    hx
ĵ               [ʒ]                   U+0135    jx
ŝ               [ʃ]                   U+015D    sx
ŭ (as aŭ, eŭ)   [u̯]                   U+016D    ux

Even today those characters are relatively rare in fonts–if you can’t see them I imagine this post may not make too terribly much sense. 8^)

The good doktoro even got a little flak back in the day, for choosing to include such unusual characters in a supposedly universal language. Nowadays, however, they’re all in Unicode–here’s the full info for ŝ, for example:

U+015D LATIN SMALL LETTER S WITH CIRCUMFLEX
ŝ

But pragmatically speaking, there’s still a problem with input. Suppose you are a gold-star-wearing green-flag-waving Esperanto aficionado, and you want to post something on the internet. How do you actually type these characters? The “right” answer is that you install a keyboard layout for the language in question, and you memorize its layout.

This is a pain, of course.

And it’s nothing new: in the (typographical) bad old days of all-ASCII USENET, Unicode wasn’t widely available, and what people would generally do (for many languages, not just Esperanto) was come up with all-ASCII transliteration systems. The “x-system” added to the table above was probably the most popular. It so happens that there is no letter x in Esperanto, so it didn’t cause any massive problems with ambiguity.

So let’s look at the script in question; it’s quite simple:

function xAlUtf8(t) {
  if (document.getElementById("x").checked) {
    t = t.replace(/c[xX]/g, "\u0109");
    t = t.replace(/g[xX]/g, "\u011d");
    t = t.replace(/h[xX]/g, "\u0125");
    t = t.replace(/j[xX]/g, "\u0135");
    t = t.replace(/s[xX]/g, "\u015d");
    t = t.replace(/u[xX]/g, "\u016d");
    t = t.replace(/C[xX]/g, "\u0108");
    t = t.replace(/G[xX]/g, "\u011c");
    t = t.replace(/H[xX]/g, "\u0124");
    t = t.replace(/J[xX]/g, "\u0134");
    t = t.replace(/S[xX]/g, "\u015c");
    t = t.replace(/U[xX]/g, "\u016c");
    document.getElementById("q").value=t;
  }
}

Include it with something like:

<script type="text/javascript" src="http://example.com/translit.js"></script>

And the function gets called with an onkeyup="xAlUtf8(this.value)" inside the input tag.

(Using onkeyup is actually sort of verboten these days–it should be done unobtrusively, etc.)
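For the curious, here’s roughly what the unobtrusive version might look like. This is only a sketch, not the Reta Vortaro’s code: the names X_SYSTEM, xToUnicode and attachXSystem are made up for illustration, and I’m assuming the text box has the id "q" and the checkbox the id "x", as in the script above. It also listens for the input event rather than keyup, which is my own choice.

```javascript
// Hypothetical lookup table: base letter -> circumflexed/breved letter.
const X_SYSTEM = {
  c: "\u0109", g: "\u011d", h: "\u0125", j: "\u0135", s: "\u015d", u: "\u016d",
  C: "\u0108", G: "\u011c", H: "\u0124", J: "\u0134", S: "\u015c", U: "\u016c",
};

// Pure function: replace every x-system digraph (cx, gx, ..., ux) in one pass.
function xToUnicode(text) {
  return text.replace(/([cghjsuCGHJSU])[xX]/g, (match, letter) => X_SYSTEM[letter]);
}

// Attach the behaviour from script instead of an inline onkeyup attribute.
function attachXSystem(input, checkbox) {
  input.addEventListener("input", () => {
    if (checkbox.checked) {
      input.value = xToUnicode(input.value);
    }
  });
}

// Only wire things up when running in a page that has both elements.
if (typeof document !== "undefined") {
  const input = document.getElementById("q");
  const checkbox = document.getElementById("x");
  if (input && checkbox) {
    attachXSystem(input, checkbox);
  }
}
```

Keeping xToUnicode separate from the event wiring means the substitution can be tried out on plain strings, without a page.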

So anyway, that’s a pretty interesting way to enter some unusual characters. It’s interesting to muse on just how far one could take this approach. Would it be possible to create a script that would handle an entire writing system? Say, a script that would convert an entire textarea from an ASCII-based transliteration to Unicode characters, on the fly? Japanese and Chinese are definitely excluded from this approach (every Chinese character in RAM? Er, no.) but people who use those languages generally already have keyboard input taken care of.

That would be neat: you could, for instance, have textareas where users could input something in Amharic or Persian or whatever without having the keyboard layout actually installed.

But as it stands, it’s just simple substitution, and it only works because no string to be substituted is a substring of another such string. To handle a more general set of substitutions, you’d probably need to use a trie. (There’s a nice trie implementation in Python by James Tauber.)
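To make the trie idea a bit more concrete, here is a minimal sketch of longest-match substitution in JavaScript. Everything in it is an assumption for illustration: buildTrie and transliterate are names I made up, and TOY_TABLE is a toy Latin-to-Cyrillic mapping chosen only because "s", "sh" and "shch" overlap, not any real transliteration standard.

```javascript
// Build a trie where each node maps a character to a child node,
// and `output` holds the replacement if a table key ends at that node.
function buildTrie(table) {
  const root = { children: new Map(), output: null };
  for (const [source, target] of Object.entries(table)) {
    let node = root;
    for (const ch of source) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), output: null });
      }
      node = node.children.get(ch);
    }
    node.output = target;
  }
  return root;
}

// At each position, walk the trie as far as possible and use the
// longest key that matched; unmatched characters pass through as-is.
function transliterate(text, trie) {
  const chars = Array.from(text);
  let result = "";
  let i = 0;
  while (i < chars.length) {
    let node = trie;
    let bestOutput = null;
    let bestLength = 0;
    for (let j = i; j < chars.length; j++) {
      node = node.children.get(chars[j]);
      if (!node) break;
      if (node.output !== null) {
        bestOutput = node.output;
        bestLength = j - i + 1;
      }
    }
    if (bestOutput === null) {
      result += chars[i];
      i += 1;
    } else {
      result += bestOutput;
      i += bestLength;
    }
  }
  return result;
}

// Toy table (hypothetical): keys that are prefixes of one another.
const TOY_TABLE = {
  s: "\u0441",    // Cyrillic es
  sh: "\u0448",   // Cyrillic sha
  shch: "\u0449", // Cyrillic shcha
  a: "\u0430",    // Cyrillic a
  i: "\u0438",    // Cyrillic i
};

const toyTrie = buildTrie(TOY_TABLE);
console.log(transliterate("shchi sha sa", toyTrie));
```

The point of the longest-match walk is that "shch" wins over "sh" and "s" when all three could apply, which chained replace calls can’t guarantee.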

I’m sure there are complications that would arise from what’s called “font shaping” — that is, how operating systems combine adjacent characters. In Arabic or Thai, for instance, characters vary depending on which characters they’re adjacent to. How does this process affect text in textareas, for instance, or text which is mushed around with Javascript?

I’ll be playing around with this.

Comments (2)

How many languages in your music collection?

I usually skip these sorts of memes, but this one is pretty interesting:

How many languages in your music collection? (kottke.org)

Here’s what I came up with:

    More than five

  • tons of Brazilian Portuguese: Chico Buarque, bossa nova stuff, Caetano Veloso, Marisa Monte…
  • tons of Welsh: SFA, Gorky’s, Datblygu…
  • a fair amount of Japanese: Cornelius, Shonen Knife, Takako Minekawa…
  • French: Françoise Hardy, Gainsbourg…
  • Cape Verde Creole: Cesaria Evora, Simentera

    Just a few

  • Hopi: The soundtrack to Koyaanisqatsi (does that count?)
  • Tuvan: some throat-singing stuff
  • Amharic: Asnaqetch Werqu, Gigi, a couple other cds from the Ethiopiques series
  • Malagasy: a band called Tarika
  • Hungarian: Muzsikás

(Don’t even get me started on the mp3s…)

Thank you for visiting my post full of indulgence and have a nice day.

Comments (2)

Hindi and Unicode

यूनिकोड क्या है?
What is Unicode? in Hindi

DIT gives push to language software : HindustanTimes.com

The contents of the free CD will include Hindi language true type fonts with keyboard driver, Hindi Language Unicode Compliant Open Type Fonts, generic fonts code and storage code converter for Hindi, Hindi language version of Bharateeya OO, Firefox Browser in Hindi, Multi Protocol Messenger in Hindi, Email Client in Hindi among others.

This is forward-thinking on the part of the Indian government; for a long time it seemed that the only major website that encoded Hindi in UTF-8 was a foreign site, BBCHindi. Most Hindi news sites use one of a bewildering array of proprietary encodings, each with a proprietary font to accompany it (intended, presumably, to lock in users).

But India is a country which stands to benefit more than most from Unicode: not only does it have a huge variety of languages, it has a large number of scripts (which are already defined in Unicode). Standardizing on a single character set will make it much easier to localize software and spread digital literacy.

And literacy, period…

Whether these efforts will be officially extended to other languages and scripts in India remains to be seen, but the fact that it’s been done in Unicode for Hindi will make the path much easier.

Incidentally, all of this is related to other domains besides news (email, for instance). Consider one blogger’s criticism of Yahoo Mail… gaping void: Why Yahoo will not be my primary mail client?

See also: वेब पर हिन्दी - हिन्दी - hindi A blog on the Hindi language, in Hindi and English.

Comments (10)

Last words

I am about to — or I am going to — die: either expression is correct.
– Dominique Bouhours, French grammarian

Some people don’t know when to quit. 8^)

Comments

Brough Turner on Speech Recognition

A new blog to me: Communications: Brute force speech recognition

The Google mindset (“more data, please”) is creeping into other fields. Turner suggests that speech recognition folks should be thinking that way too.

Comments

Ruby and Unicode

I’ve been looking into Ruby’s Unicode support, since I’m working on a Rails project. I had to jump through some hoops to figure out how to get Ruby to handle UTF-8 — it’s not too well documented.

The short answer can be found here: How To Use Unicode Strings in Rails. Bottom line: prefix your code with:

$KCODE = 'u'
require 'jcode'

… and replace length with jlength. You don’t have to change anything else in your source, which is rather nice. (In Python, for instance, you have to label Unicode strings.) You can just put Unicode stuff right in your source files, and pretty much think of those strings as “letters” in an intuitive way. Pretty much.

That’s the way I think of Unicode: it allows you to think of letters as letters, and not as “a letter of n bytes.” (Remember letters, o children of the computer age?)

Behind the scenes, your content will (probably) be stored in UTF-8, which is a variable-length, multi-byte encoding. This means that when you see the following:

U+062A ARABIC LETTER TEH
ت

That single letter is actually two bytes behind the scenes (0xD8 and 0xAA).

However, when you see:

U+3041 HIRAGANA LETTER SMALL A
ぁ

…there are three bytes behind the scenes (0xE3, 0x81, and 0x81). ASCII letters are still one byte each, since UTF-8 encodes the ASCII range unchanged.
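If you want to check those byte counts yourself, here’s a small sketch. It’s in JavaScript rather than Ruby, simply because the standard TextEncoder makes the bytes easy to get at; utf8Info is a name I made up for this illustration.

```javascript
// Report how many characters (code points) a string has, and which
// bytes it turns into when encoded as UTF-8.
function utf8Info(text) {
  const bytes = new TextEncoder().encode(text); // always UTF-8
  return {
    characters: Array.from(text).length, // Array.from splits by code point
    bytes: bytes.length,
    hex: Array.from(bytes, (b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0")),
  };
}

console.log(utf8Info("\u062A")); // ARABIC LETTER TEH
console.log(utf8Info("\u3041")); // HIRAGANA LETTER SMALL A
console.log(utf8Info("a"));      // plain ASCII
```

Each call reports one character; only the byte count changes from letter to letter.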

Unicode hides all that nonsense from you as a programmer. You can tell a program to count the number of letters in an Arabic word or a Japanese word and it will tell you what you really want to know: how many letters are in those words, not how many bytes. Who cares how many bytes happen to be used to encode a U+062A ARABIC LETTER TEH? It’s just a letter!

So yeah. End rant.

But there are still some rough patches in Ruby’s Unicode support (or in my understanding of it; a correction of either would be appreciated).

For instance… in Mike Clark’s intro to learning Ruby through Unit Testing, he suggests testing some of the string methods like this:

require 'test/unit'
 
class StringTest < Test::Unit::TestCase
  def test_length
    s = "Hello, World!"
    assert_equal(13, s.length)
  end
end

So let’s use the Greek translation of “Hello, World!”

Καλημέρα κόσμε!

That has 15 characters, including the space and the exclamation mark. Let’s test it:

require 'test/unit'
require 'jcode'
$KCODE = 'UTF8'
 
class StringTest < Test::Unit::TestCase
  def test_jlength
    s = "Καλημέρα κόσμε!"
    assert_equal(15, s.jlength) # Note the 'j'
    assert_not_equal(15, s.length) # Normal, non-Unicode length
    assert_equal(28, s.length) # Greek letters happen to take two bytes each
  end
end

All that works as expected. In fact, I went and looked in the Pickaxe book and there was an example just like this.

But I’ll leave you with a few tests that fail, and seem to me like they shouldn’t (or am I misunderstanding?).

  def test_reverse
    # there are ways around this, but...
    s = "Καλημέρα κόσμε!"
    reversed = "!εμσόκ αρέμηλαΚ"
    srev = s.reverse
    assert_equal(reversed,srev) # fails
  end
 
  def test_index
    # String#index isn't Unicode-aware, it's counting bytes
    # there are ways around this, but...
    s = "Καλημέρα κόσμε!"
    assert_equal(0, s.index('Κ')) # passes
    assert_equal(1, s.index('α')) # fails!
    assert_equal(2, s.index('α')) # passes; byte offset 2, i.e. the 3rd byte
  end

Neither of those works the way I’d expect.

As Mike mentioned, there are about a bazillion methods in String, so there’s a lot more testing that could be done. I guess one approach to problems like these would be to write jindex, jreverse, and so on. The approach I have in mind (converting strings to arrays) would probably be slow… these are the kind of functions that I imagine would best be implemented way down in C, where linguistics geeks like myself dare not tread.
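Just to illustrate the convert-to-an-array idea, here’s a rough sketch. It’s in JavaScript, not Ruby, and it starts from raw UTF-8 bytes to mimic a byte-oriented string; the names utf8Chars, charReverse, charIndex and byteIndex are made up for this example and are not jcode or Ruby methods.

```javascript
const encoder = new TextEncoder();        // string -> UTF-8 bytes
const decoder = new TextDecoder("utf-8"); // UTF-8 bytes -> string

// Decode the bytes and split them into an array, one entry per character.
function utf8Chars(bytes) {
  return Array.from(decoder.decode(bytes));
}

// Reverse character by character, then re-encode to bytes.
function charReverse(bytes) {
  return encoder.encode(utf8Chars(bytes).reverse().join(""));
}

// Position of a single character, counted in characters.
function charIndex(bytes, ch) {
  return utf8Chars(bytes).indexOf(ch);
}

// Position of the same character counted in bytes, for comparison.
function byteIndex(bytes, ch) {
  const needle = encoder.encode(ch);
  for (let i = 0; i + needle.length <= bytes.length; i++) {
    if (needle.every((b, k) => bytes[i + k] === b)) return i;
  }
  return -1;
}

const greeting = encoder.encode("Καλημέρα κόσμε!");
console.log(decoder.decode(charReverse(greeting)));
console.log(charIndex(greeting, "α"), byteIndex(greeting, "α"));
```

Here charIndex reports 1 for α, which is the answer the failing test was hoping for. As suspected, every call decodes the whole string first, so this is not something you’d want in a tight loop.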

Thanks to Chad Fowler for catching an error in an earlier version of this post… oops!

UPDATE
Why the Lucky Stiff has some interesting ideas about how to get around these limitations, at least until they’re fixed in Ruby: RedHanded » Closing in on Unicode with Jcode

Comments (4)

How to Choose a Thai Name

Via the ever-awesome Global Voices, I ended up on a great blog about Thailand, Thai-Blogs.com. This post has some linguistic interest: How to Choose a Thai name.

A few years back I taught ESL, and I had several Thai students. Thai names are quite complex; the post above goes into the details on the “official” Thai names, which are derived from Pali and generally given to people by Buddhist monks.

But that’s not the end of it. Most people also have a short, one- or two-syllable nickname as well. And if they’re of Chinese heritage they often also have a Chinese given name, and finally a family name.

I think I will start collecting names for myself in different languages, heh.

Act quick! You can name me in Thai! Now’s your chance!

Comments

Teh

Is it just me or is U+062A ARABIC LETTER TEH a happy-looking letter?

Comments
