infundibulum

Transliteration Project

November 9th, 2007

(Warning: I am way tired right now, but I wanted to get this down…)

I have long been interested in transliteration:

mundotype was my first stab at all that, and it kind of works, for a few languages. But let me tell you, building that transliteration map for Amharic was no walk in the park (and it was mostly thanks to my pals Daniel Yacob and Ephrem Menji that I got anywhere!)

My goal remains the same: I want to create (or see someone else create, whatevarr) a JavaScript-based transliteration input system that covers a WHOLE BUNCH of languages, with a consistent, easy-to-understand format for writing and editing rules.

But even better than all that would be coming up with an automated way to infer the rules in the first place.

That’s what I’ve been playing with.

Eventually this should end up on my “serious” blog, over at Blogamundo, but I’ve become a little self-conscious about just rambling there ever since Planet I18n came into being; I’d really rather post there when I have something that’s distributable.

It’s late right now but let me give you the 5-minute rundown of where I am:

What’s a transliteration?

Ask Wikipedia. For example: Epictetus is a transliteration of the Greek name Επίκτητος into the Roman alphabet.

I have some code that goes through Wikipedia dumps and extracts all the interwiki links and article titles, and spits out gigantic “lexicons.”

Here’s an example where I grepped out a Greek/English lexicon (the original has a bazillion languages):

http://ruphus.com/svn/translit/en2el.txt

Which has 2432 lines, with stuff like this:

Archaeology Αρχαιολογία
Austria Αυστρία
Australia Αυστραλία
ASCII ASCII
Africa Αφρική

Now, some of these are “transliterations” and some are “translations,” and in the case of ASCII (oh, the irony), a straight-out borrowing in the original script.

(By the way, the distinction between “translation” and “transliteration” gets kind of blurry if you start thinking too hard… fortunately, I don’t.)

Having hacked thru the first few chapters of “Teach Yourself Greek,” I can surmise that the pairs Austria/Αυστρία and Australia/Αυστραλία look like “perfect” transliterations.

And by “perfect,” I mean:

  1. Each word in the pair is the same length
  2. Each word in the pair has the same “letter pattern”

(“Perfect” is just an arbitrary designation.)

It’s #2 that I’ve been thinking about, and getting some results with. It involves “patternizing” a word, and you do that like this:

Replace each letter in the word with the numeric index of the first occurrence of the letter in the word.

Examples:

cat → 012
asia → 0120
Ασία → 0123
Ωκεανός → 0123456
Βιόσφαιρα → 012345175

Get it?
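Here’s a minimal Python sketch of that rule. Python is just a convenient choice for illustration (the post doesn’t say what the original code is written in), and the function name `patternize` is made up here. It compares characters exactly, so `A` and `a` count as different letters, which is consistent with the examples above.

```python
def patternize(word: str) -> str:
    """Replace each letter with the index of its first occurrence in the word."""
    first_seen = {}  # letter -> index where it first appeared
    for i, ch in enumerate(word):
        first_seen.setdefault(ch, i)  # only records the first occurrence
    return "".join(str(first_seen[ch]) for ch in word)


if __name__ == "__main__":
    for word in ["cat", "asia", "Ασία", "Ωκεανός", "Βιόσφαιρα"]:
        print(word, "→", patternize(word))
```

One caveat: gluing the indexes into a string gets ambiguous once a word has more than ten letters (is `110` “1, 10” or “11, 0”?), so for real comparisons a tuple or list of indexes is probably the safer representation.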

Interestingly, this simple trick is very good at helping to find transliterated words. All I do is go through the word pairs in that file at the top, and check to see if both words produce the same pattern.
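As a rough sketch, assuming the lexicon has already been loaded into a list of (English, Greek) pairs, the check could look like the following. The names `pattern` and `same_pattern` are hypothetical, the sample pairs are copied from this post, and the Unicode NFC normalization step is an added assumption so that accented letters count as single characters.

```python
import unicodedata


def pattern(word):
    """Tuple of first-occurrence indexes, one entry per letter."""
    word = unicodedata.normalize("NFC", word)  # assumption: treat é, ό, etc. as one letter
    first_seen = {}
    for i, ch in enumerate(word):
        first_seen.setdefault(ch, i)
    return tuple(first_seen[ch] for ch in word)


def same_pattern(a, b):
    """True when both words have the same length and the same letter pattern."""
    return pattern(a) == pattern(b)


# A few pairs from the post, standing in for the full lexicon file.
pairs = [
    ("Archaeology", "Αρχαιολογία"),
    ("Austria", "Αυστρία"),
    ("Australia", "Αυστραλία"),
    ("Dance", "Χορός"),
]

matches = [(en, el) for en, el in pairs if same_pattern(en, el)]

if __name__ == "__main__":
    for en, el in matches:
        print(en, el)
```

Because the pattern is a tuple with one entry per letter, equal patterns also imply equal length, so the length check comes for free. Note that a pair like Dance/Χορός passes too, since any two five-letter words with no repeated letters share a pattern.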

Check out some results:

http://ruphus.com/svn/translit/matches-en2el.txt

Croatia
0123453
Κροατία

Cyclades
01234567
Κυκλάδες

Dance
01234
Χορός

Kilo
0123
Κιλό

Keflavík
01234567
Κεφλαβίκ

Methanol
01234567
Μεθανόλη

Montreal
01234567
Μόντρεαλ

For one thing, there are some mistakes: “Χορός” is no transliteration of “Dance”; it’s a translation. But mostly transliterated things come up. Notice all the place and personal names?

So from there, I zip up these pairs of words into pairs of letters, like this:

T Τ
r ρ
o ό
f φ
a α

And

K Κ
e ε
f φ
l λ
a α
v β
í ί
k κ
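The zipping step is about as simple as it sounds. A small sketch, where the function name `letter_pairs` is hypothetical and the NFC normalization is an added assumption to keep accented letters as single characters:

```python
import unicodedata


def letter_pairs(source, target):
    """Pair up the letters of two same-length words, position by position."""
    source = unicodedata.normalize("NFC", source)
    target = unicodedata.normalize("NFC", target)
    if len(source) != len(target):
        raise ValueError("words must be the same length to be zipped")
    return list(zip(source, target))


if __name__ == "__main__":
    for s, t in letter_pairs("Keflavík", "Κεφλαβίκ"):
        print(s, t)
```

This only makes sense for pairs that already passed the pattern check, since it assumes letter N of one word corresponds to letter N of the other.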

Rinse and repeat for every pair in the list, do a bit of frequency-based manipulatin’, and you get something that looks like this:

http://ruphus.com/svn/translit/schema-en2el.txt

Which is incomplete and imperfect, but pretty damn good for zero linguistic knowledge beforehand, aside from the lexicon.
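The post doesn’t spell out what the “frequency-based manipulatin’” is, so the following is only a guess at the simplest version: count how often each source letter lines up with each target letter across all matched pairs, and keep the most frequent target. The name `build_schema` and the sample data are illustrative, not the actual code or the actual schema file.

```python
from collections import Counter, defaultdict


def build_schema(matched_pairs):
    """Map each source letter to the target letter it lines up with most often."""
    counts = defaultdict(Counter)  # source letter -> Counter of target letters
    for source, target in matched_pairs:
        for s, t in zip(source, target):
            counts[s][t] += 1
    # Assumption: one winner per source letter, chosen by raw frequency.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


# Matched pairs from the post, including the false match Dance/Χορός.
matched = [
    ("Croatia", "Κροατία"),
    ("Kilo", "Κιλό"),
    ("Montreal", "Μόντρεαλ"),
    ("Dance", "Χορός"),
]

schema = build_schema(matched)

if __name__ == "__main__":
    for s, t in sorted(schema.items()):
        print(s, t)
```

The nice part of counting is that the occasional false match gets outvoted: in this sample, `a` lines up with `α` three times and with `ο` (from Dance/Χορός) only once, so `α` wins. With a lexicon this tiny, ties are still common, and those are resolved arbitrarily here.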

More soon.

(digraphs are a thorny problem, for one thing…)

Comments

  1. Does anyone know if there is any other information about this subject in other languages?

    - Dil Okulu
  2. Hello! Please forgive me if this post is inappropriate, but I couldn’t find a direct email address on your blog. I’m Anton and I’m launching my new blog dealing with language translation issues and would appreciate the opportunity to discuss mutual collaboration. You can contact me if you like at anton [at] icanlocalize {dot} com. Thanks!

    - Anton
