On-the-fly ASCII to Unicode Transliteration with Javascript?

Here’s an interesting little script I found on the Reta Vortaro (that is, the Esperanto web dictionary).


[Embedded demo: a text box with a “Trovu” button and a checkbox labeled “anstataŭigu cx, gx, …, ux” (“replace cx, gx, …, ux”).]

Try typing the string jxauxdo in that box. And press “Trovu” if you like; that will search Google for ĵaŭdo (Esperanto for “Thursday”). Notice that jx becomes ĵ and ux becomes ŭ “on the fly,” as you type. (Come to think of it, maybe “transliteration” isn’t the right word for this process…)

So, backing up a bit, Esperanto has a few odd characters in its orthography:

Letter           Pronunciation (IPA)   Unicode   x-system
ĉ                [ʧ]                   U+0109    cx
ĝ                [ʤ]                   U+011D    gx
ĥ                [x]                   U+0125    hx
ĵ                [ʒ]                   U+0135    jx
ŝ                [ʃ]                   U+015D    sx
ŭ (as aŭ, eŭ)    [u̯]                   U+016D    ux

Even today those characters are relatively rare in fonts–if you can’t see them I imagine this post may not make too terribly much sense. 8^)

The good doktoro (Zamenhof, that is) even got a little flak back in the day for choosing to include such unusual characters in a supposedly universal language. Nowadays, however, they’re all in Unicode–here’s the full info for ŝ, for example:

U+015D LATIN SMALL LETTER S WITH CIRCUMFLEX
ŝ

But pragmatically speaking, there’s still a problem with input. Suppose you are a gold-star-wearing, green-flag-waving Esperanto aficionado, and you want to post something on the internet. How do you actually type these characters? The “right” answer is that you install a keyboard layout for the language in question and memorize it.

This is a pain, of course.

And it’s nothing new: in the (typographical) bad old days of all-ASCII USENET, Unicode wasn’t widely available, and what people would generally do (for many languages, not just Esperanto) was come up with all-ASCII transliteration systems. The “x-system” shown in the last column of the table above was probably the most popular. It so happens that there is no letter x in Esperanto, so it didn’t cause any massive problems with ambiguity.

So let’s look at the script in question; it’s quite simple:

function xAlUtf8(t) {
  if (document.getElementById("x").checked) {
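    // If the checkbox (id "x") is checked, replace each x-system digraph
    // (lower- and upper-case) with the corresponding Unicode letter: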
    t = t.replace(/c[xX]/g, "\u0109");
    t = t.replace(/g[xX]/g, "\u011d");
    t = t.replace(/h[xX]/g, "\u0125");
    t = t.replace(/j[xX]/g, "\u0135");
    t = t.replace(/s[xX]/g, "\u015d");
    t = t.replace(/u[xX]/g, "\u016d");
    t = t.replace(/C[xX]/g, "\u0108");
    t = t.replace(/G[xX]/g, "\u011c");
    t = t.replace(/H[xX]/g, "\u0124");
    t = t.replace(/J[xX]/g, "\u0134");
    t = t.replace(/S[xX]/g, "\u015c");
    t = t.replace(/U[xX]/g, "\u016c");
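    // Write the converted text back into the input box (id "q"):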
    document.getElementById("q").value = t;
  }
}

Include it with something like:

<script type="text/javascript" src="http://example.com/translit.js"></script>

And the function gets called with an onkeyup="xAlUtf8(this.value)" inside the input tag.
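Reconstructing from the ids the script uses, the surrounding markup is presumably something along these lines (the form action and exact attributes are my guesses; the real ReVo page surely differs):

<form action="http://www.google.com/search" method="get">
  <!-- "q" is both the id the script writes back to and Google's query parameter -->
  <input type="text" name="q" id="q" onkeyup="xAlUtf8(this.value)">
  <!-- the checkbox that switches the conversion on and off -->
  <input type="checkbox" id="x" checked> anstataŭigu cx, gx, …, ux
  <input type="submit" value="Trovu">
</form>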

(Using an inline onkeyup is actually sort of verboten these days–the handler should be attached unobtrusively from the script itself, etc.)
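A minimal sketch of the unobtrusive version, assuming the same ids (“q” and “x”) and a browser with standard DOM event support:

window.addEventListener("load", function () {
  var box = document.getElementById("q");
  // Attach the handler here instead of via an inline onkeyup attribute
  box.addEventListener("keyup", function () {
    xAlUtf8(box.value);
  }, false);
}, false);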

So anyway, that’s a pretty interesting way to enter some unusual characters. It’s interesting to muse on just how far one could take this approach. Would it be possible to create a script that would handle an entire writing system? Say, a script that would convert an entire textarea from an ASCII-based transliteration to Unicode characters, on the fly? Japanese and Chinese are definitely excluded from this approach (every Chinese character in RAM? Er, no.) but people who use those languages generally already have keyboard input taken care of.

That would be neat: you could, for instance, have textareas where users could input something in Amharic or Persian or whatever without having the keyboard layout actually installed.

But as it stands, it’s just simple substitution, and it only works because none of the strings being substituted is a substring of another. In order to handle a more generalized set of substitutions, you’d probably need to use a trie structure. (James Tauber has a nice trie implementation in Python.)
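As a rough sketch (mine, not the ReVo script’s), a longest-match replacer built on a trie might look something like this:

function buildTrie(mapping) {
  var root = {};
  for (var key in mapping) {
    var node = root;
    for (var i = 0; i < key.length; i++) {
      var ch = key.charAt(i);
      if (!node[ch]) node[ch] = {};
      node = node[ch];
    }
    node.value = mapping[key];  // terminal node stores the replacement
  }
  return root;
}

function transliterate(text, trie) {
  var out = "";
  var i = 0;
  while (i < text.length) {
    var node = trie, j = i, match = null, matchEnd = i;
    // Walk down the trie as far as the input allows, remembering
    // the longest key that ended on a terminal node.
    while (j < text.length && node[text.charAt(j)]) {
      node = node[text.charAt(j)];
      j++;
      if (node.value !== undefined) { match = node.value; matchEnd = j; }
    }
    if (match !== null) {
      out += match;      // emit the replacement for the longest key found
      i = matchEnd;
    } else {
      out += text.charAt(i);  // no key starts here; copy the character
      i++;
    }
  }
  return out;
}

// The x-system mapping, lowercase only for brevity:
var eoTrie = buildTrie({ cx: "\u0109", gx: "\u011d", hx: "\u0125",
                         jx: "\u0135", sx: "\u015d", ux: "\u016d" });
// transliterate("jxauxdo", eoTrie) gives "ĵaŭdo"

Because the walk remembers the deepest terminal node it has passed, a short key can coexist with a longer key that begins with it, and everything is replaced in a single left-to-right pass.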

I’m sure there are complications that would arise from what’s called “font shaping” — that is, how operating systems combine adjacent characters. In Arabic or Thai, for instance, characters vary depending on which characters they’re adjacent to. How does this process affect text in textareas, for instance, or text which is mushed around with Javascript?

I’ll be playing around with this.

2 Comments »

  1. riny Said,

    July 11, 2005 @ 12:45 pm

    Nice, I’ve been playing around as well, but more with Cyrillic transliteration. A nice site for this is www.translit.ru, and http://www.spearance.ru/parser3/translit/ also has something like that.

    If you encounter a PHP script that does whole pages, like
    http://xxxxx/translit_esp.php?url=www.nytimes.com, I’d be interested :-)

    I’ll keep reading da blogs…

  2. pat Said,

    July 11, 2005 @ 4:30 pm

    Thanks riny!

    Hmm, translit.ru is quite nice, quite similar to what I was thinking of.

    I was actually thinking about Cyrillic as an option for this sort of thing; it seems like the mapping is simple enough to be pretty doable. There are lots of roman alphabet languages with just a few unusual characters as well (Welsh, Hausa, there are zillions).

    Where the process gets considerably more complex, I think, is with scripts where the logical order doesn’t map to the visual order, as is the case in Indic languages.

    If you encounter a PHP script that does whole pages, like
    http://xxxxx/translit_esp.php?url=www.nytimes.com, I’d be interested :-)

    I’m not sure what you have in mind. In this example, do you mean that you’d like a script to transliterate the New York Times so that the English text is more readable to folks who are accustomed to Cyrillic?

    (I haven’t used PHP much lately, myself.)

    Thanks for dropping by!

