infundibulum

Transliteration Project

November 9th, 2007

(Warning: I am way tired right now, but I wanted to get this down…)

I have long been interested in transliteration:

mundotype was my first stab at all that, and it kind of works, for a few languages. But let me tell you, building that transliteration map for Amharic was no walk in the park (and it was mostly thanks to my pals Daniel Yacob and Ephrem Menji that I got anywhere!)

My goal remains the same: I want to create (or see someone else create, whatevarr) a Javascript-based transliteration input system that covers a WHOLE BUNCH of languages. With a consistent, easy-to-understand format for writing and editing rules.

But even better than all that would be coming up with an automated way to infer the rules in the first place.

That’s what I’ve been playing with.

Eventually this should end up on my “serious” blog, over at Blogamundo, but I’ve become a little self-conscious about just rambling there ever since Planet I18n came into being; I’d really rather post there when I have something that’s distributable.

It’s late right now but let me give you the 5-minute rundown of where I am:

What’s a transliteration?

Ask Wikipedia. For example: Epictetus is a transliteration of the Greek name Επίκτητος into the Roman alphabet.

I have some code that goes through Wikipedia dumps and extracts all the interwiki links and article titles, and spits out gigantic “lexicons.”

Here’s an example where I grepped out a Greek/English lexicon (the original has a bazillion languages):

http://ruphus.com/svn/translit/en2el.txt

Which has 2432 lines, with stuff like this:

Archaeology Αρχαιολογία
Austria Αυστρία
Australia Αυστραλία
ASCII ASCII
Africa Αφρική

Now, some of these are “transliterations” and some are “translations” — and in the case of ASCII (oh, the irony), straight out borrowing in the original script.

(By the way, the distinction between “translation” and “transliteration” is kind of blurry if you start thinking too hard… fortunately, I don’t.)

Having hacked thru the first few chapters of “Teach Yourself Greek,” I can surmise that the pairs Austria/Αυστρία and Australia/Αυστραλία look like “perfect” transliterations.

And by “perfect,” I mean:

  1. Each word in the pair is the same length
  2. Each word in the pair has the same “letter pattern”

(“Perfect” is just an arbitrary designation.)

It’s #2 that I’ve been thinking about, and getting some results with. It involves “patternizing” a word, and you do that like this:

Replace each letter in the word with the numeric index of the first occurrence of the letter in the word.

Examples:

cat → 012
asia → 0120
Ασία → 0123
Ωκεανός → 0123456
Βιόσφαιρα → 012345175

Get it?
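
Here is a minimal sketch of that patternizing step in Python. It is not the actual code from the svn repo: the function name patternize and the NFC normalization step are assumptions for illustration. The patterns are case-sensitive, which is what the examples above imply (Ασία comes out as 0123, not 0120).

```python
import unicodedata


def patternize(word):
    """Replace each letter with the index of its first occurrence in the word."""
    # Assumption: normalize to NFC so an accented letter such as "ί" is a
    # single code point rather than a base letter plus a combining mark.
    word = unicodedata.normalize("NFC", word)
    # str.index returns the position of the first occurrence of ch.
    return "".join(str(word.index(ch)) for ch in word)


for w in ["cat", "asia", "Ασία", "Ωκεανός", "Βιόσφαιρα"]:
    print(w, patternize(w))
```

One caveat: once an index reaches 10, the joined digit string becomes ambiguous, so comparing a tuple of indices would be safer for long words.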

Interestingly, this simple trick is very good at helping to find transliterated words. All I do is go through the word pairs in that file at the top, and check to see if both words produce the same pattern.
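
The filtering step might look roughly like the sketch below. The function names and the tiny sample list are hypothetical; the real input would be the pairs from the lexicon file. It compares tuples of indices rather than digit strings, which is a small departure from the notation above but avoids ambiguity for long words.

```python
import unicodedata


def patternize(word):
    """Tuple of first-occurrence indices, one per letter."""
    # Assumption: NFC-normalize so accented letters count as one letter.
    word = unicodedata.normalize("NFC", word)
    return tuple(word.index(ch) for ch in word)


def matching_pairs(pairs):
    """Keep only the pairs whose two words share the same letter pattern."""
    # Equal tuples also imply equal length, so both "perfect" criteria
    # are checked at once.
    return [(a, b) for a, b in pairs if patternize(a) == patternize(b)]


# Hypothetical sample taken from word pairs shown in this post.
sample = [
    ("Austria", "Αυστρία"),
    ("Dance", "Χορός"),
    ("Archaeology", "Αρχαιολογία"),
    ("Kilo", "Κιλό"),
]

for english, greek in matching_pairs(sample):
    print(english, greek)
```

Note that a translation like Dance/Χορός passes this filter whenever neither word repeats a letter, so the filter alone cannot tell it apart from a real transliteration.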

Check out some results:

http://ruphus.com/svn/translit/matches-en2el.txt

Croatia
0123453
Κροατία

Cyclades
01234567
Κυκλάδες

Dance
01234
Χορός

Kilo
0123
Κιλό

Keflavík
01234567
Κεφλαβίκ

Methanol
01234567
Μεθανόλη

Montreal
01234567
Μόντρεαλ

For one thing, there are some mistakes. “Χορός” is no transliteration of “Dance,” it’s a translation. But mostly transliterated things come up — notice all the place and personal names?

So from there, I zip up these pairs of words into pairs of letters, like this:

T Τ
r ρ
o ό
f φ
a α

And

K Κ
e ε
f φ
l λ
a α
v β
í ί
k κ
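
The zipping step is about as simple as it sounds. A rough sketch, with a hypothetical function name and an assumed NFC normalization so that accented letters line up one-to-one:

```python
import unicodedata


def letter_pairs(source, target):
    """Pair up the letters of two same-length words, position by position."""
    # Assumption: NFC-normalize both words so "í" and "ί" are single letters.
    source = unicodedata.normalize("NFC", source)
    target = unicodedata.normalize("NFC", target)
    if len(source) != len(target):
        raise ValueError("words must have the same length")
    return list(zip(source, target))


for s, t in letter_pairs("Keflavík", "Κεφλαβίκ"):
    print(s, t)
```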

Rinse and repeat for every pair in the list, do a bit of frequency-based manipulatin’, and you get something that looks like this:

http://ruphus.com/svn/translit/schema-en2el.txt

Which is incomplete and imperfect, but pretty damn good for zero linguistic knowledge beforehand, aside from the lexicon.
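
The post does not spell out what the frequency-based step does, so the sketch below is only one plausible reading: count how often each source letter lines up with each target letter, then keep the most frequent target for each. The name build_schema and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def build_schema(word_pairs):
    """Map each source letter to the target letter it is most often paired with."""
    counts = defaultdict(Counter)
    for source, target in word_pairs:
        # Assumes the pairs already passed the same-pattern check,
        # so both words have the same length.
        for s, t in zip(source, target):
            counts[s][t] += 1
    # Majority vote per source letter (an assumption, not the author's rule).
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


# Hypothetical sample; Dance/Χορός is a translation that slipped through.
sample = [
    ("Kilo", "Κιλό"),
    ("Montreal", "Μόντρεαλ"),
    ("Croatia", "Κροατία"),
    ("Dance", "Χορός"),
]

schema = build_schema(sample)
print(schema["a"], schema["r"], schema["t"])
```

With enough real transliterations in the list, the occasional bad pair like Dance/Χορός simply gets outvoted: here "a" still maps to α even though Dance pairs it with ο.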

More soon.

(digraphs are a thorny problem, for one thing…)

i was thinking about transliteration

November 7th, 2007

again

and i wrote 30 lines of python about it.

$ svn co http://ruphus.com/svn/translit/

if you are bored and/or curious.

guantes

July 27th, 2007

Me: Excuse me, do you have any guantes?

Shop girl: Um… any what?

Me: Guantes.

*Shop girl stares at me, confused*

Me: You know, those things you put on your hands when you’re cleaning…

Shop girl: Do you mean luvas [gloves]? Like these here?

YES, FRIENDS AND NEIGHBORS, IT’S TRUE! YOU CAN TRAVEL ALL OF SOUTH AMERICA SPEAKING NOTHING BUT PORTUNHOL!

ORkut or orKUT?

July 20th, 2007

As everyone already knows perfectly well, Orkut is now a Brazilian company.

Well, it isn’t, but it could be, right?

But I’m having an argument with my colleague here about where the stress falls in that word. Is it like this:

ORkut

Or like this:

orKUT

Infundibulum is a democracy; you can vote in the comments.

Heh vs Hehe

March 12th, 2007

Is it just me, or have the words heh and hehe (or sometimes heheh) acquired different nuances?

Sometimes, I will type “heh” in a chat, and then think “wait, ‘heh’ will come across as sarcastic,” and then I’ll type “hehe” instead.

Dear lazyweb, am I insane?

How many clicks in Xhosa?

March 11th, 2007

A while back I happened to meet a couple people who spoke Xhosa (at a Starbucks, heh). So as is my wont I talked them into teaching me a couple phrases… the only one I was able to remember was Hamba kahle, which means something like “au revoir” or “goodbye” or something.

On a lark I stuck “Xhosa” and “Isixhosa” (which is Xhosa for Xhosa, heh) into Youtube’s search engine, and I found a couple of videos that are interesting to compare.

Now, Xhosa is well known for being a click language. If you’ve never heard such a language, you will upon watching these videos; it’s neat to hear.

The first is a tourist guide:

The second is a news report, considerably longer:

The thing that stands out for me is how there are far more clicks in the tour guide’s speech than what you hear in the news report. I imagine that he chose something that’s more or less a “tongue twister”, because it’s fun for the tourists to hear all the clicks.

But judging by how many clicks you hear on average in the news report (far fewer), it seems that this gives an incorrect impression of what the language really sounds like.

An additional bit of evidence for this is the fact that a commenter on Youtube says that the tour guide had used just the same phrase on a previous tour — it probably wasn’t just run of the mill speech.

But whatever, cool to hear.

In Which a Portuguese Word Enters English

March 11th, 2007

Eh, this week, anyway.

The word in question, of course, being “fora”.

A Ponca Family Reunion

February 16th, 2007

Technorati sent me a link to an interesting article at the ever-awesome LJWorld.com (the little paper/media empire that Django built):

Split-apart nation comes together

The article is about a family reunion, of sorts, and a long overdue one: the two parts of the Ponca tribe have been living as two separate entities in Nebraska and Oklahoma since the 1870s.

I happened to stumble across this history before, because I was digging around in the Wikipedia article on the Omaha, who speak a related Siouan language. This detail in the article caught my eye:

Congress terminated the tribe in Nebraska in the 1960s, and it was reinstated in 1990. The northern tribe is still feeling the effects of that period, as the Nebraska members have no fluent speakers of the Ponca language.

“In order for us to continue to be a strong nation, Poncas, we need to have that language. We need to have that culture,” Wright said.

Could “lol” ever become a real live word?

January 6th, 2007

Perhaps I’m off my rocker, but sometimes I wonder if the word lol, as in… erm, you know what it means… could ever escape from the intarwebs into meatspace?

I was just digging through a rather amazing series of photos from an ice storm, which contains the following caption:

Here is a closer crop of the above image. Look how the ice grew as it spun around lol There wasn’t any wind now, otherwise I’m guessing this would still spin like the one I saw the day before.

There’s something about this usage that seems “wordy” to me: for one thing, I find it difficult to avoid a comparison to Cantonese’s famous “tag” word, la, for which the always-amusing UrbanDictionary.com provides the definition:

cantonese exclamation which can be added after every single sentence
so cute la/ okay la/ bye la

I find it difficult to articulate why the particular caption above made me think of this, but there’s something about it… maybe it’s emphasized by the fact that there’s no comma?
Or maybe I really am off my rocker.

Language revitalization isn’t really mysterious

January 5th, 2007

The story of a guy who tries to use only Irish as he travels around… well, Ireland:
Cá Bhfuil Na Gaeilgeoirí? * | The Guardian | Guardian Unlimited

Today, a quarter of the population claim they speak it regularly. I have always suspected this figure and to test its accuracy I decided to travel around the country speaking only Irish to see how I would get on.

Things don’t go well. He struggles to buy a map, and he gets a lot of unfriendly, even menacing stare-downs from people all over.

What I had not factored for was the animosity. Part of it, I felt, stemmed from guilt - we feel inadequate that we cannot speak our own language.

I think he’s right about that. I think that generally speaking, people would like to be able to speak any language at all; if we really had Babelfishes, everyone would use them all the time. (Would they forget what language they were speaking at all?)

Of course, this article is meant to be interesting and to raise a rhetorical point. The problem he faces with the people he tries to talk to is that they know he also speaks English, but he’s refusing to. I can’t help but think that he could have gone about his task more wisely, and dropped the charade. I imagine that some of those people who felt so flustered might well have been more willing to use what Irish they had if he’d gone about it differently.

I dunno, maybe saying that makes me a sell-out or something, but I believe that you can’t learn or promote any language without being pragmatic about it. If you want to revitalize a Native American language, you have to allow for teenagers wanting to talk about their iPods and Metallica (or whatever you whippersnappers are listening to these days) as much as traditional stuff.

And anyway, he finally finds the real, indisputable, 100% certain answer to how to revitalize a language:

Teach kids.

I was rapidly approaching a point of despair when some children came on the line. I found they spoke clear and fluent Irish in a new and modern urban dialect. They told me how they spoke the language all the time, as did all their friends. They loved it, and they were outraged that I could suggest it was dead. These were the children of the new Gaelscoileanna - the all-Irish schools that are springing up throughout the country in increasing numbers every year.

Yep, that’s how you do it. And it takes about one generation’s worth of students.

(Kohanga reo teach the same lesson.)