Here’s an interesting little script I found on the Reta Vortaro (that is, the Esperanto web dictionary).
Try typing the string jxauxdo in that box. If you then press “Trovu”, it will search Google for ĵaŭdo (Esperanto for “Thursday”).
Notice that jx becomes ĵ and ux becomes ŭ “on the fly,” as you type. (Come to think of it, maybe “transliteration” isn’t the right word for this process…)
So, backing up a bit, Esperanto has a few odd characters in its orthography:
| Letter | Pronunciation (IPA) | Unicode | x-system |
|---|---|---|---|
| ĉ | [ʧ] | U+0109 | cx |
| ĝ | [ʤ] | U+011D | gx |
| ĥ | [x] | U+0125 | hx |
| ĵ | [ʒ] | U+0135 | jx |
| ŝ | [ʃ] | U+015D | sx |
| ŭ (as aŭ, eŭ) | [u̯] | U+016D | ux |
Even today those characters are relatively rare in fonts–if you can’t see them I imagine this post may not make too terribly much sense. 8^)
The good doktoro even got a little flak back in the day, for choosing to include such unusual characters in a supposedly universal language. Nowadays, however, they’re all in Unicode–here’s the full info for ŝ, for example:
U+015D LATIN SMALL LETTER S WITH CIRCUMFLEX
ŝ
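If you want to check a code point like that yourself, here is a minimal sketch in JavaScript. The helper name toUPlus is my own invention for illustration; codePointAt and normalize are standard string methods.

```javascript
// Hypothetical helper: format the first code point of a string as U+XXXX.
function toUPlus(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

console.log(toUPlus("\u015d")); // U+015D

// The precomposed letter decomposes into "s" plus a combining circumflex
// (U+0302), so its NFD form is two code units long.
console.log("\u015d".normalize("NFD").length); // 2
```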
But pragmatically speaking, there’s still a problem with input. Suppose you are a gold-star-wearing green-flag-waving Esperanto aficionado, and you want to post something on the internet. How do you actually type these characters? The “right” answer is that you install a keyboard layout for the language in question and memorize it.
This is a pain, of course.
And it’s nothing new: in the (typographical) bad old days of all-ASCII USENET, Unicode wasn’t widely available, and what people would generally do (for many languages, not just Esperanto) was come up with all-ASCII transliteration systems. The “x-system” added to the table above was probably the most popular. It so happens that there is no letter x in Esperanto, so it didn’t cause any massive problems with ambiguity.
So let’s look at the script in question; it’s quite simple:
function xAlUtf8(t) {
  // Only convert when the checkbox with id "x" is ticked.
  if (document.getElementById("x").checked) {
    // Lowercase letters; the trailing x may be either case.
    t = t.replace(/c[xX]/g, "\u0109");
    t = t.replace(/g[xX]/g, "\u011d");
    t = t.replace(/h[xX]/g, "\u0125");
    t = t.replace(/j[xX]/g, "\u0135");
    t = t.replace(/s[xX]/g, "\u015d");
    t = t.replace(/u[xX]/g, "\u016d");
    // Uppercase letters.
    t = t.replace(/C[xX]/g, "\u0108");
    t = t.replace(/G[xX]/g, "\u011c");
    t = t.replace(/H[xX]/g, "\u0124");
    t = t.replace(/J[xX]/g, "\u0134");
    t = t.replace(/S[xX]/g, "\u015c");
    t = t.replace(/U[xX]/g, "\u016c");
    // Write the converted text back into the field with id "q".
    document.getElementById("q").value = t;
  }
}
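The twelve replace calls could also be driven from a lookup table. What follows is only a sketch of that idea, not the dictionary’s actual code: the names X_SYSTEM and xToUnicode are made up for illustration, and the function is kept free of DOM access so it can be tried anywhere.

```javascript
// Hypothetical table-driven variant; X_SYSTEM and xToUnicode are
// illustrative names, not part of the original script.
const X_SYSTEM = {
  c: "\u0109", g: "\u011d", h: "\u0125", j: "\u0135", s: "\u015d", u: "\u016d",
  C: "\u0108", G: "\u011c", H: "\u0124", J: "\u0134", S: "\u015c", U: "\u016c",
};

// Replace each base letter that is followed by x or X with its accented form.
function xToUnicode(text) {
  return text.replace(/([cghjsuCGHJSU])[xX]/g, (match, base) => X_SYSTEM[base]);
}

console.log(xToUnicode("jxauxdo")); // ĵaŭdo
```

Adding a letter then means adding one table entry and one character to the pattern, rather than another replace line.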
Include it with something like:
<script type="text/javascript" src="http://example.com/translit.js"></script>
And the function gets called with an onkeyup="xAlUtf8(this.value)" attribute inside the input tag.
(Using an inline onkeyup attribute is actually sort of verboten these days; the handler should be attached unobtrusively, from the script itself.)
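For what it’s worth, the unobtrusive version might look something like the sketch below. It assumes the page has an input with id "q" (as the script above does) and that xAlUtf8 is already loaded; the helper name attachKeyup is hypothetical.

```javascript
// Hypothetical helper: call handler with the field's current value on keyup.
function attachKeyup(input, handler) {
  input.addEventListener("keyup", () => handler(input.value));
}

// Only touch the DOM when there is one (so the file also loads outside a browser).
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", () => {
    const box = document.getElementById("q"); // assumed id, as in the original script
    if (box && typeof xAlUtf8 === "function") {
      attachKeyup(box, xAlUtf8);
    }
  });
}
```

With this in place the input tag needs no onkeyup attribute at all. Listening for the "input" event instead of "keyup" would presumably also catch pasted text, but I have kept "keyup" to mirror the original.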
So anyway, that’s a pretty interesting way to enter some unusual characters. It’s fun to muse on just how far one could take this approach. Would it be possible to create a script that would handle an entire writing system? Say, a script that would convert an entire textarea from an ASCII-based transliteration to Unicode characters, on the fly? Japanese and Chinese are definitely excluded from this approach (every Chinese character in RAM? Er, no.) but people who use those languages generally already have keyboard input taken care of.
That would be neat: you could, for instance, have textareas where users could input something in Amharic or Persian or whatever without having the keyboard layout actually installed.
But as it stands, it’s just simple substitution, and it relies on no string to be substituted being a substring of another such string. To handle a more general set of substitutions, you’d probably need a trie. (There’s a nice trie implementation in Python by James Tauber.)
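To make the trie idea a bit more concrete, here is a rough sketch of a longest-match transliterator in JavaScript. It is not based on James Tauber’s code; buildTrie and transliterate are names I made up, and the little Cyrillic table is a toy chosen only because "s" is a prefix of "sh", which is a prefix of "shch".

```javascript
// Build a trie from a { asciiSequence: replacement } table.
function buildTrie(table) {
  const root = { children: new Map(), value: null };
  for (const [key, value] of Object.entries(table)) {
    let node = root;
    for (const ch of key) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), value: null });
      }
      node = node.children.get(ch);
    }
    node.value = value;
  }
  return root;
}

// Scan left to right, always taking the longest key that matches
// at the current position; unmapped characters pass through unchanged.
function transliterate(text, trie) {
  const chars = Array.from(text);
  let out = "";
  let i = 0;
  while (i < chars.length) {
    let node = trie;
    let best = null;
    let bestEnd = i;
    for (let j = i; j < chars.length; j++) {
      node = node.children.get(chars[j]);
      if (!node) break;
      if (node.value !== null) {
        best = node.value;
        bestEnd = j + 1;
      }
    }
    if (best === null) {
      out += chars[i];
      i += 1;
    } else {
      out += best;
      i = bestEnd;
    }
  }
  return out;
}

// Toy table (an assumption for illustration, not a complete system).
const trie = buildTrie({ s: "с", sh: "ш", shch: "щ", a: "а" });
console.log(transliterate("sasha", trie)); // саша
```

Because the scan prefers the longest match, "sh" wins over "s" followed by a stray "h", which is exactly the case the one-replace-per-letter approach can’t handle.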
I’m sure there are complications that would arise from what’s called “font shaping” (that is, how operating systems combine adjacent characters). In Arabic or Thai, for instance, characters vary depending on which characters they’re adjacent to. How does this process affect text in textareas, for instance, or text which is mushed around with JavaScript?
I’ll be playing around with this.