Here’s a thought I’ve never gotten around to implementing or really trying out.
Transliteration is the process of converting text in one script into another script. Here’s an example from Wikipedia, going from Greek script to Latin script:
Greek Script: Ελληνική Δημοκρατία
Transcription: Ellēnikē Dēmokratia
Transliteration: Elliniki Dimokratia
The details of such conversion are pretty complex — there are two distinct systems of conversion here. The Wikipedia article tries to maintain a distinction between “transcription” and “transliteration,” but whatever, you get the idea: convert from one writing system to another.
Now, let’s suppose you have reason to believe, as blogger Ethan Zuckerman recently did, that there is an article written about you in a language you don’t know:
…two days ago, when ego-surfing Technorati, I discovered that a Saudi blogger had linked to me, mentioning that an interview with me had just been published in Al-Hayat. I can’t read Arabic, but the few English phrases in the piece connected to topics I’m deeply interested in. So hey, perhaps it was an interview with me.
Let’s imagine Ethan hadn’t been fortunate enough to find an Arabic blogger to translate the article for him (in this instance he was). Is there some way he could determine whether his name is in the thing at all?
Maybe so, using automated transliteration (or transcription, whatever!) and a bit of fuzzy matching.
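To make the fuzzy matching half of that concrete, here is a minimal sketch using Python's standard difflib module. It assumes the article has already been run through some transliterator. The sample string below is invented for illustration and is not real output from any tool, and the function name best_match is my own.

```python
from difflib import SequenceMatcher


def best_match(name, text):
    """Return (score, window) for the run of words in text most similar to name."""
    words = text.lower().split()
    size = len(name.split())  # compare against windows with as many words as the name
    target = name.lower()
    best = (0.0, "")
    for i in range(max(len(words) - size + 1, 0)):
        window = " ".join(words[i:i + size])
        score = SequenceMatcher(None, target, window).ratio()
        if score > best[0]:
            best = (score, window)
    return best


# Hypothetical romanized text; a real transliterator would produce something messier.
sample = "muqabala maa ithan zukirman an al-mudawwanat"
score, window = best_match("Ethan Zuckerman", sample)
print(window, round(score, 2))
```

The score runs from 0 to 1, so one would pick a threshold by trial and error and treat anything above it as a probable mention rather than a certain one.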
When you get right down to it, the basic operation in transliteration is just making a bunch of substitutions. As with many tasks related to language processing, the best first step is often to simply think of what you’d do if you had to accomplish the task by hand.
Well, let’s say you were going to work with that Greek up there.
Ελληνική Δημοκρατία
Elliniki Dimokratia
(I picked the simpler transliteration system.)
Anyone can do a little inspection and make an educated guess as to which letter corresponds to which… something like this:
Ε E
λ l
λ l
η i
ν n
ι i
κ k
ή i
Δ D
η i
μ m
ο o
κ k
ρ r
α a
τ t
ί i
α a
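The same inspection can be done mechanically, at least for this one lucky example where both strings have the same number of characters. A rough sketch, assuming a strict one-letter-to-one-letter correspondence (the function name learn_pairs is mine):

```python
import unicodedata


def learn_pairs(source, target):
    """Pair characters by position; only valid when both strings have equal length."""
    # NFC keeps accented letters such as ή as single characters.
    source = unicodedata.normalize("NFC", source)
    target = unicodedata.normalize("NFC", target)
    if len(source) != len(target):
        raise ValueError("lengths differ, so positional pairing does not apply")
    mapping = {}
    for src, dst in zip(source, target):
        if src.isspace():
            continue
        mapping.setdefault(src, dst)  # keep the first pairing seen for each letter
    return mapping


pairs = learn_pairs("Ελληνική Δημοκρατία", "Elliniki Dimokratia")
print(pairs["η"])  # i
```

Note that upper- and lowercase letters end up as separate entries, so this single phrase only teaches us capital Ε and Δ.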
And of course we’ll need more such pairs to figure out all the letters, but those aren’t hard to find. In fact, we could just cut and paste 30 or 40 words from Wikipedia. (Say, city names Αθήνα Athína; Θεσσαλονίκη Thessaloníki; Πελοπόννησος Peloponnesos, etc.)
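One wrinkle shows up as soon as you look at those city names: Θ becomes two letters, Th, so the Greek and Latin spellings no longer line up position by position. A small sketch that sorts word pairs into those that can be paired letter by letter and those that can't (the helper name split_by_alignability is made up for this post):

```python
import unicodedata

word_pairs = [
    ("Αθήνα", "Athína"),
    ("Θεσσαλονίκη", "Thessaloníki"),
    ("Πελοπόννησος", "Peloponnesos"),
]


def split_by_alignability(pairs):
    """Separate pairs whose lengths match from pairs that need smarter alignment."""
    aligned, skipped = [], []
    for greek, latin in pairs:
        g = unicodedata.normalize("NFC", greek)
        l = unicodedata.normalize("NFC", latin)
        if len(g) == len(l):
            aligned.append((g, l))
        else:
            skipped.append((g, l))
    return aligned, skipped


aligned, skipped = split_by_alignability(word_pairs)
print(len(aligned), len(skipped))  # 1 2
```

Only Πελοπόννησος survives the length check here, so a real version would need some way of handling one-to-many correspondences instead of just skipping them.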
Once we’ve done that, we can write a simple program which will make those substitutions, and go from one script to the other.
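Such a program might look something like this. The table holds only the letters guessed from the one example above, so it is nowhere near complete, and the names GREEK_TO_LATIN and transliterate are just placeholders of mine.

```python
import unicodedata

# Only the pairs read off "Ελληνική Δημοκρατία"; a usable table needs many more.
GREEK_TO_LATIN = {
    "Ε": "E", "λ": "l", "η": "i", "ν": "n", "ι": "i", "κ": "k", "ή": "i",
    "Δ": "D", "μ": "m", "ο": "o", "ρ": "r", "α": "a", "τ": "t", "ί": "i",
}


def transliterate(text, table=GREEK_TO_LATIN):
    """Substitute each character using the table; unknown characters pass through."""
    text = unicodedata.normalize("NFC", text)
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate("Ελληνική Δημοκρατία"))  # Elliniki Dimokratia
```

Letting unmapped characters pass through unchanged makes the gaps in the table easy to spot in the output.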
And of course, this is all grossly simplified and won’t work very well at all.
More later…