infundibulum

Transliteration Project

November 9th, 2007

(Warning: I am way tired right now, but I wanted to get this down…)

I have long been interested in transliteration:

mundotype was my first stab at all that, and it kind of works, for a few languages. But let me tell you, building that transliteration map for Amharic was no walk in the park (and it was mostly thanks to my pals Daniel Yacob and Ephrem Menji that I got anywhere!)

My goal remains the same: I want to create (or see someone else create, whatevarr) a JavaScript-based transliteration input system that covers a WHOLE BUNCH of languages. With a consistent, easy-to-understand format for writing and editing rules.

But even better than all that would be coming up with an automated way to infer the rules in the first place.

That’s what I’ve been playing with.

Eventually this should end up on my “serious” blog, over at Blogamundo, but I’ve become a little self-conscious about just rambling there ever since Planet I18n came into being; I’d really rather post there when I have something that’s distributable.

It’s late right now but let me give you the 5-minute rundown of where I am:

What’s a transliteration?

Ask Wikipedia. For example: Epictetus is a transliteration of the Greek name Επίκτητος into the Roman alphabet.

I have some code that goes through Wikipedia dumps and extracts all the interwiki links and article titles, and spits out gigantic “lexicons.”

Here’s an example where I grepped out a Greek/English lexicon (the original has a bazillion languages):

http://ruphus.com/svn/translit/en2el.txt

Which has 2432 lines, with stuff like this:

Archaeology Αρχαιολογία
Austria Αυστρία
Australia Αυστραλία
ASCII ASCII
Africa Αφρική

Now, some of these are “transliterations” and some are “translations,” and in the case of ASCII (oh, the irony), a straight-out borrowing in the original script.

(By the way, the distinction between “translation” and “transliteration” gets kind of blurry if you start thinking too hard… fortunately, I don’t.)

Having hacked thru the first few chapters of “Teach Yourself Greek,” I can surmise that the pairs Austria/Αυστρία and Australia/Αυστραλία look like “perfect” transliterations.

And by “perfect,” I mean:

  1. Both words in the pair are the same length
  2. Both words in the pair have the same “letter pattern”

(“Perfect” is just an arbitrary designation.)

It’s #2 that I’ve been thinking about, and getting some results with. It involves “patternizing” a word, and you do that like this:

Replace each letter in the word with the zero-based index of that letter’s first occurrence in the word.

Examples:

cat → 012
asia → 0120
Ασία → 0123
Ωκεανός → 0123456
Βιόσφαιρα → 012345175

Get it?
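Here’s a minimal Python sketch of that patternizing step. The function name `patternize` and the NFC normalization are my own assumptions for illustration, not necessarily what the actual script does. It’s case-sensitive on purpose, since that’s what makes Ασία come out as 0123 rather than 0120.

```python
import unicodedata


def patternize(word: str) -> str:
    """Replace each letter with the index of its first occurrence in the word."""
    # Assumption: normalize to NFC so an accented letter such as "ό" is a
    # single code point rather than a base letter plus a combining mark.
    word = unicodedata.normalize("NFC", word)
    # str.index returns the position of the first occurrence of ch.
    return "".join(str(word.index(ch)) for ch in word)


for w in ["cat", "asia", "Ασία", "Βιόσφαιρα"]:
    print(w, patternize(w))
```

One caveat: once a word runs past ten letters the indexes become multi-digit, so the joined string can get ambiguous. Comparing tuples of integers instead of strings sidesteps that.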

Interestingly, this simple trick is very good at helping to find transliterated words. All I do is go through the word pairs in that file at the top, and check to see if both words produce the same pattern.
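The matching pass might look roughly like this in Python. The names `pattern` and `find_matches` are hypothetical, and the lexicon is assumed to have already been parsed into (English, Greek) pairs. Patterns are kept as tuples of integers here so that long words compare safely.

```python
import unicodedata


def pattern(word: str) -> tuple[int, ...]:
    """First-occurrence index of every letter, as a tuple of ints."""
    word = unicodedata.normalize("NFC", word)
    return tuple(word.index(ch) for ch in word)


def find_matches(pairs):
    """Keep only the pairs whose two words share a letter pattern."""
    # Equal tuples imply equal length, so no separate length check is needed.
    return [(a, b) for a, b in pairs if pattern(a) == pattern(b)]


lexicon = [
    ("Croatia", "Κροατία"),
    ("Archaeology", "Αρχαιολογία"),
    ("Dance", "Χορός"),
]
matches = find_matches(lexicon)
print(matches)
```

Archaeology/Αρχαιολογία drops out because its repeated letters fall in different places, while Dance/Χορός sneaks through: every letter in both words is unique, so the patterns agree even though it’s a translation.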

Check out some results:

http://ruphus.com/svn/translit/matches-en2el.txt

Croatia
0123453
Κροατία

Cyclades
01234567
Κυκλάδες

Dance
01234
Χορός

Kilo
0123
Κιλό

Keflavík
01234567
Κεφλαβίκ

Methanol
01234567
Μεθανόλη

Montreal
01234567
Μόντρεαλ

For one thing, there are some mistakes. “Χορός” is no transliteration of “Dance,” it’s a translation. But mostly transliterated things come up — notice all the place and personal names?

So from there, I zip up these pairs of words into pairs of letters, like this:

T Τ
r ρ
o ό
f φ
a α

And

K Κ
e ε
f φ
l λ
a α
v β
í ί
k κ
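The zipping step is about as simple as it sounds. A small Python sketch, with `letter_pairs` as a made-up name and the length check added as my own precaution:

```python
import unicodedata


def letter_pairs(source: str, target: str) -> list[tuple[str, str]]:
    """Pair up the letters of two same-length words, position by position."""
    source = unicodedata.normalize("NFC", source)
    target = unicodedata.normalize("NFC", target)
    if len(source) != len(target):
        raise ValueError("words must be the same length to be zipped")
    return list(zip(source, target))


for s, t in letter_pairs("Kilo", "Κιλό"):
    print(s, t)
```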

Rinse and repeat for every pair in the list, do a bit of frequency-based manipulatin’, and you get something that looks like this:

http://ruphus.com/svn/translit/schema-en2el.txt

Which is incomplete and imperfect, but pretty damn good for zero linguistic knowledge beforehand, aside from the lexicon.
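I haven’t spelled out the frequency-based part, so take this Python sketch as one plausible reading rather than the real thing: count how often each source letter lines up with each target letter across all the matched pairs, then keep the most frequent target. The name `build_schema` and the “most common wins” rule are assumptions.

```python
from collections import Counter, defaultdict


def build_schema(matches):
    """Map each source letter to the target letter it most often lines up with."""
    counts = defaultdict(Counter)
    for source, target in matches:
        # Matched pairs are assumed to be the same length already.
        for s, t in zip(source, target):
            counts[s][t] += 1
    # Assumption: the single most frequent target letter wins.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


matches = [("Kilo", "Κιλό"), ("Keflavík", "Κεφλαβίκ"), ("Croatia", "Κροατία")]
schema = build_schema(matches)
print(schema["K"], schema["a"], schema["l"])
```

With more data, the counts are also what would let false positives like Dance/Χορός wash out, since their letter pairings show up far less often than the real ones.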

More soon.

(digraphs are a thorny problem, for one thing…)

Facebook Groups

October 25th, 2007

Back in gringolândia, I guess I’ll start speaking gringuês again. Man, I miss Brazil.

Tonight I went to Starbucks, where I was reading a book. I had a few conversations. But it’s sort of weird trying to start conversations with random people. Especially if they’re all face down in their laptops (and lattes).

Thing is, though, other people have to be thinking the same thing: “Why am I so damn popular on Facebook but have no one to talk to at Starbucks??”

Or something.

You know how Facebook groups mostly suck? They’re just like Orkut groups. Or Friendster groups. People go there to be identified, and then they’re like… uh, what now? Because being off-topic seems pretty retarded in a group that’s defined by having a ridiculously specific topic.

It makes a lot more sense to “be identified” with reference to a place that’s… you know… social.

What I’m getting at is, when I got home, I wished there was a Facebook group (or something) for that one particular Starbucks.

(Okay, mainly so I could have the huevos to message up that one girl with the German accent.)

Does anyone know what I’m trying to say? Why doesn’t every place in the world have a place online that everyone knows about?

Gulp

April 21st, 2007

Reprap

I must admit…

March 30th, 2007

Mostly I post here now.

Extreeeeeeemly random.

Words? Not so much.

er

March 19th, 2007

man with no social life

No, really.

February 19th, 2007

Fucking, Austria

(click)

Faaa

December 20th, 2006

There is a place called… Faaa.

wtf?

November 30th, 2006

This is hilarious ☺

November 30th, 2006

DeVito’s Not So Sobering View

“I knew it was the last seven limoncellos that was going to get me,” a disheveled DeVito said as he plunked himself down on the View sofa.

Philippine Languages Month

August 12th, 2006

August is Philippine Languages Month in the Philippines.

Filipino Language Month poster

Speaking in tongues–Pilipino-style

This overview has some interesting Tagalog/Pilipino/Filipino words thrown in. (I think that last one is the current term for the national language… you know what? It’s complicated.)

For instance:

It is not Linggo ng Wika; it’s Buwan ng Wika. It’s not Abakada and Tagalog; it’s ABCD and Pilipino. It’s no longer Taglish as a language borrowed and corrupted; it’s now translation and code switching as proof of comprehension and multilingual mastery. It’s more than just stodgy textbooks and formal oratorical balagtasan; it’s also a celebration of comic-book lore and street corner kwentuhan. It’s no longer Isang Bansa, Isang Diwa; it’s now Buwan ng Wikang Pambansa ay Buwan ng mga Wika sa Pilipinas.

Ricardo Nolasco of the Philippine Languages Commission (whaddya know, they have a wiki) has some more background on that last pair of phrases:

Nolasco explains, “Buwan ng Wikang Pambansa ay Buwan ng mga Wika sa Pilipinas is a pitch for linguistic diversity. Isang Bansa, Isang Diwa was the slogan during the martial-law regime and that promoted dangerous ideas such as that having many languages was disadvantageous to the country—and that’s not correct.”

With a bit of digging I discovered that Isang Bansa, Isang Diwa means “One nation, One spirit.” It was the motto from the bad old days of the Marcos government. (And more amusingly, perhaps, it recently resurfaced in the name of the wacky Eddie Gil’s Partido Isang Bansa Isang Diwa. He promised to “make every Filipino a millionaire within one hundred days” of being elected. That didn’t work out! (Unfortunately!).)

Haven’t managed to decipher the first, more agreeable phrase that Nolasco mentions, but buwan is “month,” wikang pambansa is “national language,” and wika sa Pilipinas is (I think) “languages of the Philippines.” So I’m guessing the whole thing means something like “The Month of the National Language and (All?) the Languages of the Philippines”?

It would also be fun to know what the languages on that poster are, specifically.