infundibulum

Using search engines to find content in any language

October 20th, 2006

Here’s a little trick I use a lot. I don’t know if it’s common knowledge, maybe so. But in any case I find it pretty useful (and, I admit, fun).

Just now, I was reading about Madonna adopting a baby from Malawi. I was wondering if there were any blogs in a Malawian language that mentioned Madonna.

(I’ll tell you right now, I didn’t find any.)

A friend of mine did Peace Corps in Malawi, and she taught me a few phrases, so I happened to know that one local language is called Chichewa.

So here’s my technique:

  1. Find some text in Chichewa.
  2. Find the most frequent words.
  3. Pick a few, preferably short ones.
  4. Stick those in a search engine, optionally with the term you’re looking for (in my case, it was “Madonna”).

For step one, a convenient source is Eric Muller’s UDHR in Unicode. (And while you’re there, take a minute to read it if you never have…)

I had to run a few Google searches to figure out which name Eric was using: nope, nope … finally I loaded up this sampler and did a find for “Chewa”… bingo:

Nyanja (Chechewa)
Anthu onse amabadwa aufulu ndiponso ofanana mu ulemu ndi ufulu wao. Iwowa ndi wodalitsidwa ndi mphamvu zoganiza ndi chikumbumtima ndipo achitirane wina ndi mnzake mwaubale.

Nyanja (Chinyanja)
Anthu onse amabadwa mwa ufulu ndiponso olinganga m’ makhalidwe ao. Iwo amakhala ndi nzeru za cibadwidwe kotero ayenera kucitirana zabwino wina ndi mnzace.

It’s spelled “Chechewa”… hmm, come to think of it, I wonder why there are two versions there? I should drop Eric an email. (He’s very diligent about fixing things, so if you happen to run across a bug at http://udhrinunicode.org/ do let him know.)

So anyway. Then I went and found the full Chichewa text on the OHCHR (UN) site.

Now, step 2, find some frequent words.

JavaScript guru Jesse Ruderman has some Search Engine Optimization Bookmarklets, one of which happens to pop up a window with word frequency. I made a little test page here.

So, you run that bookmarklet from the aforelinked Chichewa page, and you get a window that says something like this:

1615 words


  • 133: ndi
  • 57: ufulu
  • 39: aliyense
  • 37: ali
  • 35: kapena
  • 31: m
  • 30: ndime
  • 28: pa
  • 28: wa
  • 20: anthu
  • 19: dziko
  • 18: a
  • 16: munthu
  • 15: maiko
  • 15: ndiponso
  • etc…
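If you'd rather not depend on a bookmarklet, the counting step is easy to approximate in a few lines of Python. This is just a sketch of the idea, not what Jesse's bookmarklet actually does: the function name `word_frequencies` is made up for illustration, and the sample is the single Chichewa article quoted earlier rather than the whole declaration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word tokens in a piece of text."""
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscores),
    # so punctuation and apostrophes act as word separators.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

# Article 1 of the UDHR in Chichewa, as quoted above.
sample = ("Anthu onse amabadwa aufulu ndiponso ofanana mu ulemu ndi ufulu wao. "
          "Iwowa ndi wodalitsidwa ndi mphamvu zoganiza ndi chikumbumtima ndipo "
          "achitirane wina ndi mnzake mwaubale.")

# Print the three most frequent words in "count: word" form.
for word, count in word_frequencies(sample).most_common(3):
    print(f"{count}: {word}")
```

Even on one sentence, "ndi" comes out on top, which matches what the full page shows.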

Pick a few really common ones — shorter is probably better; after all, the word “rights” will be all over that particular document, but as Zipf taught all us groundlings, common words tend to be short.
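Picking the terms can be sketched the same way: rank by frequency, keep only the short ones, and glue them into a query. The names `pick_terms` and `search_url`, the length cutoff, and the plain Google search URL are all my own assumptions for illustration; any search engine that takes a `q` parameter would do.

```python
from urllib.parse import urlencode

def pick_terms(counts, how_many=5, max_len=5):
    """Return the most frequent words that are at most max_len characters long."""
    # sorted() is stable, so words with equal counts keep their original order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    short = [word for word, _ in ranked if len(word) <= max_len]
    return short[:how_many]

def search_url(terms, extra=None):
    """Build a search URL from the chosen terms plus an optional topic word."""
    query = " ".join(terms + ([extra] if extra else []))
    return "https://www.google.com/search?" + urlencode({"q": query})

# Counts copied from the bookmarklet output above.
counts = {"ndi": 133, "ufulu": 57, "aliyense": 39, "ali": 37, "kapena": 35,
          "m": 31, "ndime": 30, "pa": 28, "wa": 28, "anthu": 20}

terms = pick_terms(counts, how_many=4, max_len=3)
print(terms)                         # ['ndi', 'ali', 'm', 'pa']
print(search_url(terms, "madonna"))  # https://www.google.com/search?q=ndi+ali+m+pa+madonna
```

A purely mechanical pick like this won't necessarily match the terms I chose by hand, and that's fine; the point is only that short, frequent words make good bait.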

Survey says:

wa pa a cha ndime madonna - Google Search. Nada. Rinse and repeat:

wa a cha ndime madonna - Google Search. Oh for two!

wa a cha madonna - Google Search. Now we’re talkin!

Look, Chichewa! Madonna amekanusha taarifa zilizotolewa na vyombo vya habari.

Oh wait, that’s Swahili.

Well anyway. I guess it’s not the year for blogs in Chichewa to catch on yet.

Yet.

In any case, leaving “Madonna” off certainly seems like it gets you some content in Chichewa:

wa pa a cha ndime - Google Search

Which in and of itself is useful when a language might not be indexed as that language by a search engine, but it’s indexed nonetheless.

A more industrialized version of this process could be easily scripted. I’ve got some rather crufty code lying around that will use this approach to spider content, find frequent words, submit them to the Yahoo API, spider the URLs that return, rinse and repeat. This is cool, because the more iterations you do, the more representative the frequent words become, and they function as ever-better search terms. Maybe I’ll clean it up and post it here.
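To give a flavor of that loop without the cruft, here is a minimal sketch. It is not the code mentioned above, and it doesn't talk to any real search API: `search` and `fetch` are hypothetical callables you would supply yourself (one returns URLs for a query, the other returns the text of a page), and the demo at the bottom fakes both with a tiny in-memory "web".

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def top_short_words(counts, how_many=5, max_len=5):
    """Most frequent words that are at most max_len characters long."""
    ranked = [word for word, _ in counts.most_common() if len(word) <= max_len]
    return ranked[:how_many]

def grow_corpus(seed_text, search, fetch, rounds=3):
    """Query with frequent words, fetch the results, recount, repeat.

    search(query) -> iterable of URLs and fetch(url) -> page text are
    caller-supplied stand-ins for a search API and an HTTP client.
    """
    counts = Counter(tokenize(seed_text))
    seen = set()
    for _ in range(rounds):
        query = " ".join(top_short_words(counts))
        found_new = False
        for url in search(query):
            if url in seen:
                continue  # never fetch or count the same page twice
            seen.add(url)
            counts.update(tokenize(fetch(url)))
            found_new = True
        if not found_new:
            break  # nothing new came back, so further rounds won't help
    return counts, seen

# Demo with a fake in-memory "web" instead of a real search engine.
pages = {
    "http://example.org/a": "ndi wa pa ndi ufulu",
    "http://example.org/b": "wa a cha ndi",
}

def fake_search(query):
    wanted = set(query.split())
    return [url for url, text in pages.items() if wanted & set(text.split())]

def fake_fetch(url):
    return pages[url]

counts, seen = grow_corpus("ndi ufulu ndi wa", fake_search, fake_fetch)
print(sorted(seen))
print(counts.most_common(3))
```

Passing the search and fetch functions in as arguments keeps the loop itself independent of whichever API happens to exist this year.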

There’s a paper from before the days of search engine APIs that describes a similar approach (better than I’ve described it here!):

Building Minority Language Corpora by Learning to Generate Web Search Queries - Ghani, Jones, Mladenić (ResearchIndex)

Happy fishing.