infundibulum

Learning Ruby through Testing

May 31st, 2005

This is a great idea:

Perhaps you’ve been meaning to learn Ruby for fun or profit, but you just don’t know where to start. I’d like to help by trying a bit of an experiment. No, I’m not going to send you a copy of my Ruby learning tests. The learning comes through doing.

Rather, I’ll start by showing you how I wrote my first Ruby learning test. Then, over the coming weeks and months, I’ll spoonfeed you more tests as a starting point for exploring new facets of Ruby. (Submissions are appreciated, too.) Of course, if I get the sense that nobody’s listening, I’ll stop.

Well, I for one am listening, Mike. Naturally, the first test I wrote myself had to do with Unicode… more on that later.

Supporting Small Wikipedias Financially

May 31st, 2005

Over on the Wikipedia mailing list, where folks discuss the details of the sprawling Wiki projects under the “Wikimedia” banner, an intriguing idea has surfaced:

Pay people to write articles in languages that aren’t very strong on the web.

My first reaction to this was “hey wait a second, the whole point is that it’s a volunteer project.” And there are a million parameters involved in solving such an equation:

  • How do you define whether a particular language qualifies?
  • How do you actually get the money to people?
  • Won’t money influence Wikimedia’s “Neutral Point of View” tenet?

But maybe it is a good idea, even if it isn’t easy. Milos Rancic made the case better than I have:

There are a lot of Roma people in Serbia and they are very very poor. Maybe it is better to organize some stipendies for some of them to study and work on Wikipedia? Average salary in Serbia is around $350/month, but I think it would be enough $100 for some yung Roma who study high school or university. Romas in high school are very rare, so it can be the target population. (I would waste a lot of time to find some Roma who is studing something on university.)

I don’t think that it is bad idea. Almost all of us are working on Wikipedia in his/her free time, but a lot of small ethicities are living very poor and they don’t have enough of free time.

Wikipedia became important global cultural movement. And Wikipedians should start to thinkg about helping other people to become a part of their movement.

If there is no financial support for some of these languages that are new to the web, their Wikipedias may never get off the ground. That would suck, because not only can such projects preserve languages, they can help them to flower just a bit more, maybe just enough to make a big difference to their futures. Those who are lucky enough to be able to work on Wikipedia for fun might want to think about the chance to help people who don’t have that luxury.

Oh brother.

May 27th, 2005

The whole mainstream media vs. blogs thing can get really tiring and boring.

But sometimes, you have to admit, journalists are pretty clueless about the web:

Netscape 8.0 disables Internet Explorer XML capabilities | newratings.com (chortle.)

But the title isn’t the only funny part. There’s also this:

If one tries to browse an XML file or RSS speed in Internet Explorer, a blank page appears, he added.

The mind boggles.

My Money’s on Hausa

May 26th, 2005

Mark Liberman at Language Log has posted another language quiz… I love these things: the idea is that he posts an audio clip of a random language, and you pull out all your linguistic stops trying to figure out what it is.

Warning… spoilers ensue.

Or at least, if my guess turns out to be correct, then a spoiler ensues. Otherwise it’s just me babbling nonsensically.

Here’s the (very) loose transcription I came up with:

dem angong LIti shin ti mutani tere su MUtu a su ku la tin doko U SIN ji kata loko tin de SORURISE ke harbitz kinta RO masa zanga zangar.

(I’m a firm believer in using the simplest transcription conceivable when one starts transcribing an unknown language—jumping into using exotic IPA detracts from the goal at hand, at least in my experience.)

I used capitalization to indicate what sounded like tonal variation to me. As soon as I listened to the recording I suspected it was a language from Africa, where tonal languages are common, though I couldn’t really tell you why. My first guess was “something Bantu” but now I think that was the wrong family.

The first bit that caught my ear was something like zanga zangar, which is spoken quite clearly. It struck me as pretty hard to transcribe incorrectly (assuming that the language was written in a Roman-letter alphabet, as many African languages are). Furthermore it seemed likely that there was some sort of, um, what’s the technical word, “process” going on there—it looked like reduplication.

So I just stuck it into Google as a phrase, like this.

Golly, that wasn’t too hard: sure looks like Hausa.

So how can we try to verify this theory… well, usually I build a corpus of the language in question and start, uh, poking at it. I hacked together some code to build a corpus using the Yahoo search API, which is quite easy to use. (I’ll post it if anyone asks.)

So anyway, after looking at bigrams and such in the corpus I noticed that zanga zangar is preceded occasionally by masu, and sure enough, another Google search turns up 33 results with the string masu zanga zangar. I’d originally transcribed it as masa.
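For the curious, the bigram poking can be sketched in a few lines of Python. This is not the code I actually ran, just a minimal illustration: the function name `preceding_words` and the toy token list are made up for the example, and the filler tokens (`aaa`, `bbb`, `ccc`) are placeholders, not real Hausa.

```python
from collections import Counter


def preceding_words(tokens, target):
    """Count the words that appear immediately before `target`."""
    counts = Counter()
    # Walk over adjacent pairs (bigrams) of tokens.
    for left, right in zip(tokens, tokens[1:]):
        if right == target:
            counts[left] += 1
    return counts


# Toy stand-in for a corpus; the filler tokens are placeholders.
sample = "aaa masu zanga zangar bbb masu zanga zangar ccc zanga zangar".split()

# Most frequent left-hand neighbours of "zanga" in the toy corpus.
print(preceding_words(sample, "zanga").most_common())
```

On a real corpus you would tokenize a lot more carefully, but the idea is the same: tally what shows up to the left of the word you are sure about, and see whether one neighbour dominates.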

But whatever, it’s late, you know?

Volunteer Translation Banks

May 25th, 2005

I ran across an article from last year on something called a “language bank”: Volunteer translators break down barriers

It describes a program at the Seattle Red Cross that brings together translators for over 75 languages. They help with all kinds of needs that immigrants run into:

The bank and its volunteers negotiate with apartment managers, communicate with citizenship and immigration services, decipher cable bills, and even assist in emergency situations such as residential fires; it all adds up to about 4,000 cases a year.

I was unsurprised to find, after a little digging, that there’s a similar program in my own Montgomery County, Maryland: the Montgomery County, MD - Language Bank.

Cool!

I’ve done a tiny bit of interpreting and also some translation before, and lemme tell ya, it’s hard work. To do it under the kind of pressure that I’m sure these programs run into must be at least, uh, stressful.

The administrators and translators at these language banks deserve a lot of appreciation.

It seems like the only language policy stories you’ll ever read in big media in the States are about the English-only movement. But language banks are also concrete reminders of the fact that the US is actually an incredibly multilingual society, probably one of the most multilingual societies in the world.

We should be proud of that.

Web equivalents to the OSX dictionary application

May 24th, 2005

I’ve heard mixed reviews of OSX Tiger, but the little dictionary widget seems to be universally popular. There’s actually a class of applications that do something similar on the web, usually through a proxy:

I’d be interested to know of any others!

Two Google Flops in a Row?

May 21st, 2005

The Google Web Accelerator caused incredibly un-web problems with web apps.

So, okay, Google finally screws something up.

But twice in a row?

This Google portal doohickey is utterly underwhelming, verging on utter lameness. The reason Google’s minimalist web design works is that their products have astonishing utility, almost without fail: search, maps, email, sets, etc., etc. But sticking a big My-Yahoo-circa-1999 box under the search field hardly qualifies as astonishingly useful.

It just strikes me as lame. Presumably they have more plans for it. (A feed reader that actually works, maybe?)

But whatever, Google is still going to eat the universe.

Insomnia blogging…

May 21st, 2005

Wow, I am a total insomniac of late.

Sucks.

At least I read a lot of blogs… Here’s one that I’ve been interested in lately: The Search Guy’s Weblog. He addresses the tagging and ontology stuff that’s been blowing in the wind, but also some nuts and bolts of search, like the vector space model.

Hie thee hither, if that’s your cup of tea.

Transliteration as Poor Man’s Translation

May 20th, 2005

Here’s a thought I’ve never gotten around to implementing or really trying out.

Transliteration is the process of converting text in one script into another script. Here’s an example from Wikipedia: Greek -> English:

Greek Script: Ελληνική Δημοκρατία
Transliteration: Ellēnikē Dēmokratia
Transcription: Elliniki Dimokratia

The details of such conversion are pretty complex — there are two distinct systems of conversion here. The Wikipedia article tries to maintain a distinction between “transcription” and “transliteration,” but whatever, you get the idea: convert from one writing system to another.

Now, let’s suppose you have reason to believe, as blogger Ethan Zuckerman recently did, that there is an article written about you in a language you don’t know:

…two days ago, when ego-surfing Technorati, I discovered that a Saudi blogger had linked to me, mentioning that an interview with me had just been published in Al-Hayat. I can’t read Arabic, but the few English phrases in the piece connected to topics I’m deeply interested in. So hey, perhaps it was an interview with me.

Let’s imagine Ethan wasn’t fortunate enough to find an Arabic blogger to translate the article for him (which he in fact was in this instance). Is there some way that he might be able to determine if his name is in the thing at all?

Maybe so, using automated transliteration (or transcription, whatever!) and a bit of fuzzy matching.
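To make the fuzzy matching half of that concrete, here is a rough sketch using Python's standard `difflib` module. Everything specific in it is an assumption for illustration: the helper name `best_match` is hypothetical, and the "romanized" string is invented text standing in for whatever an automated transliterator might spit out, not real output from any article.

```python
from difflib import SequenceMatcher


def best_match(name, text):
    """Return (score, word) for the word in `text` most similar to `name`."""
    name = name.lower()
    scored = [
        (SequenceMatcher(None, name, word.lower()).ratio(), word)
        for word in text.split()
    ]
    # max() compares scores first; an empty text yields a zero score.
    return max(scored) if scored else (0.0, "")


# Invented romanized text, standing in for transliterator output.
romanized = "mqabla ma aythan zukrman hwl almdwnat"

score, word = best_match("zuckerman", romanized)
print(word, round(score, 3))
```

The point is that the transliteration does not have to be good. It only has to be close enough that a similarity score can pick the name out of the surrounding words, and you would presumably choose some cutoff below which you conclude the name isn't there.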

When you get right down to it, the basic operation in transliteration is just making a bunch of substitutions. As with many tasks related to language processing, the best first step is often to simply think of what you’d do if you had to accomplish the task by hand.

Well, let’s say you were going to work with that Greek up there.

Ελληνική Δημοκρατία
Elliniki Dimokratia

(I picked the simpler transliteration system.)

Anyone can do a little inspection and make an educated guess as to which letter corresponds to which… something like this:

Ε E
λ l
λ l
η i
ν n
ι i
κ k
ή i

Δ D
η i
μ m
ο o
κ k
ρ r
α a
τ t
ί i
α a

And of course we’ll need more such pairs to figure out all the letters, but those aren’t hard to find. In fact, we could just cut and paste 30 or 40 words from Wikipedia. (Say, city names Αθήνα Athína; Θεσσαλονίκη Thessaloníki; Πελοπόννησος Peloponnesos, etc.)

Once we’ve done that, we can write a simple program which will make those substitutions, and go from one script to the other.
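Here is roughly what such a program might look like in Python. It is a sketch, not a finished tool: the names `build_table` and `transliterate` are made up, and the table is learned only from the two words aligned above, by pairing letters position by position. Pairs whose lengths differ are skipped, which is an assumption of this sketch.

```python
import unicodedata

# The two aligned words from the example above.
PAIRS = [("Ελληνική", "Elliniki"), ("Δημοκρατία", "Dimokratia")]


def build_table(pairs):
    """Learn a letter-to-letter table from equal-length word pairs."""
    table = {}
    for greek, latin in pairs:
        # Normalize so accented letters are single code points.
        greek = unicodedata.normalize("NFC", greek)
        if len(greek) != len(latin):
            continue  # only handle strict one-to-one alignments
        table.update(zip(greek, latin))
    return table


def transliterate(text, table):
    """Substitute every known letter; pass unknown characters through."""
    text = unicodedata.normalize("NFC", text)
    return "".join(table.get(ch, ch) for ch in text)


table = build_table(PAIRS)
print(transliterate("Ελληνική Δημοκρατία", table))
```

Note what gets silently dropped: a pair like Αθήνα / Athína has five letters on one side and six on the other (Θ becomes Th), so the one-to-one pairing above can't learn anything from it.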

And of course, this is all grossly simplified and won’t work very well at all.

More later…

Rods from God?

May 19th, 2005

Just imagine when we combine Rods from God with Google Maps.

Works with Paypal!