infundibulum

Wikipedia: WikiProject Endangered languages

January 2nd, 2007

Wikipedia:WikiProject Endangered languages - Wikipedia, the free encyclopedia

A new project on Wikipedia to work on filling out info on endangered languages. It seems to be quite new; give it a look if you’re interested in this important topic.

A Hmong Messianic Script… and Linguistics and that Whole Religion Deal.

December 21st, 2006

Long title, huh?

Long post.

I finally got around to ordering Mother of Writing : The Origin and Development of a Hmong Messianic Script from Amazon.

Anyway, I’ve only just begun to dig in, but the book is about a rather mysterious writing system which was invented for the Hmong language, spoken primarily in Laos, but also in France, Canada, Australia, China, Thailand, French Guiana and the United States.

This isn’t to be confused with the various roman-script systems for writing the language, which are rather interesting and worth a post in their own right– no, the book is about a totally distinct script called Pahawh Hmong. You can see images of it at Omniglot under Pahawh Hmong alphabet.

The alphabet is believed by some Hmong to have been the divinely inspired creation of Shong Lue Yang (Soob Lwj Yaj).

(And by the way, eek, the Wikipedia article on the Hmong language is abysmal…)

The book was written by a Christian missionary. I found it pretty interesting trying to peel back the layers of who was trying to interpret whom (and what). Yang himself was a (not exactly Christian) missionary, and part of the book was written by one of his followers. But the part which is strictly linguistic can be read without regard to any of that stuff, and the Pahawh Hmong is certainly a fine and interesting piece of orthographic engineering.

The intersection of religion and linguistics is something which kind of gets my blood boiling. Linguistics is a science, and it should be treated as such. But because missionary “work” overlaps so much with linguistic research, we end up with a religious organization as the authority for the new language codes… And I mean the authority. As in, the ISO calls these guys up and asks for “the answer” on linguistic nomenclature.
Now, think about that for a second. What if the ISO called up the “Intelligent Design” guys for authoritative answers on biological classification? Maybe they should call up a proper psychic to resolve disputes on whether global warming is real.

I kid, slightly.

So why is the linguistics world okay with being reduced to being the authority on “Ancient, historical, and constructed languages,” while SIL is the authority for… most of the languages on the planet, and most of the languages for which you really need unyielding impartiality?

Oh whatever.

November 23rd, 2006

NOW Magazine - Movies in Toronto, NOVEMBER 23 - 29, 2006

Cruz’s Spanish performances are quicksilver and funny, ever since her first major role in Bigas Luna’s Jam&oactue;n (1992), as a rural girl involved in a passionate affair with Javier Bardem.

After her sojourn in America, it’s a relief to see Cruz back where she belongs. More importantly, it’s a relief to hear her back where she belongs, not trying to wrap her Castilian consonants around English words.

That’s a pretty lame thing to say.

Penelope Cruz can speak in whateeeeeever language she wants, as far as I’m concerned.

Preferably to me.

Oh, and guys? It’s “oacute.”

Kthx.

Things Lakota/Dakota/Sioux. And copyright.

November 14th, 2006

I spent a few hours tonight poking around in the American University library, and as usual I headed for the “P” section… “PM,” as it happened.

That would be languages… Hyperborean, Indian and artificial languages, according to the ever-aleatoric Library of Congress classification. (Ugh, Shirky was right; ontology is overrated.)

The one I ended up reading was Dakota Grammar: With Texts and Ethnography. I didn’t dig too deeply but it looked like a nice, competent, descriptive piece of work. There is a text in a related language (Omaha) at Project Gutenberg with the interlinear text and everything, from an edition recorded by the same anthropologist: Illustration Of The Method Of Recording Indian Languages by James Owen Dorsey.

Now, here’s an honest question, one to which I don’t know the answer: that book is listed as having been published in 2004. But it was actually first published by the Government Printing Office in 1893. Now, doesn’t that mean that the book is in the public domain? Could I go and scan the whole thing and put it on the web, or would that be (by some reasoning unbeknownst to me) a violation of the Minnesota Historical Society edition?

Also interesting: Tampa, Follow the Stories: Lakota Dictionary

Ads in Bengali, Police work in Portuguese, and Medicine in Spanish

November 12th, 2006

Here are some language- and translation-related stories for your perusal.

The Telegraph - Calcutta : Metro “The mosquito coil brand being advertised is Maxo, marketed by Jyothi laboratories. It is a national brand and therefore must be having campaigns running in areas other than Bengal. This ad is in Bengali and from all indications it is not a translation of the national campaign. It is an ad conceived and created in the local language.”

The Enquirer - Translator helps patients overcome language barrier “‘When you have a child dying, you can barely remember your own language,’ Morales said.”

MetroWestDailyNews.com - News & Opinion: Police who speak the language “Just two hours into his shift for the night, Milford Police Officer Carlos Sousa encountered three drivers who spoke little or no English.
Sousa slipped easily into Portuguese to talk with a Brazilian teen whose car was towed from East Main Street because he had no valid license. After pulling over a pickup truck, Sousa broke into Spanish to explain a traffic ticket.
In a town with growing Ecuadorian and Brazilian immigrant populations, Sousa’s fluency in three languages is a valuable skill on the force.”

Quantum Information Retrieval Mechanics?

October 29th, 2006

Equal parts terrifying and fascinating:

C. J. van Rijsbergen - The Geometry of Information Retrieval

Keith Van Rijsbergen demonstrates how different models of information retrieval (IR) can be combined in the same framework used to formulate the general principles of quantum mechanics. All the standard results can be applied to address problems in IR, such as pseudo-relevance feedback, relevance feedback and ostensive retrieval. The relation with quantum computing is examined. Appendices with background material on physics and mathematics are also included.

Head a splodes.

Oh, and while I’m on the topic of geometry and IR, there’s another recent book out that seems pretty great:

Dominic Widdows - Geometry and Meaning

From the earliest applications in astronomy, music, and biology, to the design of today’s user interfaces and search engines, geometric insights have provided powerful tools and accurate scientific predictions. In Geometry and Meaning, these threads are gathered together and told as a single evolving story. Mathematical models from ancient times to the present are described for the general reader, together with the stories behind their discovery, and their applications in the new and vibrant field of natural language processing.

How zeitgeisty, huh?

It’s funny how IR as a field has this reputation of being very bland and nerdy. It seems to me that statistics has the same sort of reputation. I was never very good at math; I did just okay in calculus and so on. But to be honest, for whatever reason, I just never cared about physics. And that’s what you get, mostly, in calculus examples. Plug in formula blah blah.

Once I started learning about the weird relationship between numbers and words and letters and all that voodoo (and this is after college), I suddenly cared. Even now, I see something like the vector space model as very mysterious, very amazing, and even, maybe, profound. Dude, you can use cosines to figure shit out. Crazy.
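For the curious, here is roughly what “using cosines” looks like. This is only a minimal sketch of the vector space idea, assuming plain word counts as the vector components (real systems usually weight terms, for example with tf-idf); the function names and the sample texts are made up for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to a bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("endangered languages")
doc_about_languages = term_vector("a project on endangered languages")
doc_about_movies = term_vector("movies in toronto this week")

# The first score is well above zero; the second is exactly zero,
# because the query and the movie text share no words at all.
print(cosine(query, doc_about_languages))
print(cosine(query, doc_about_movies))
```

The closer the score is to 1, the more the two texts “point the same way” in word space, which is the whole trick.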

And the more you learn, the more you start thinking that way.

Language Search

October 26th, 2006

I put together one of those Google Co-op search engines for sites related to language, translation, linguistics, and stuff like that. Check it out:

Site recommendations welcome.

Languages that don’t use Spaces?

October 23rd, 2006

I’m trying to build a list:

  • Thai
  • Japanese
  • Chinese (Cantonese, Mandarin, etc)
  • Korean (oops, thanks dda)
  • Khmer
  • Burmese
  • Lao

I found this W3C slide which is relevant: W3C I18N Tutorial: CSS3 and International Text

I just haven’t checked yet, but don’t Burmese and Khmer fall into this category as well?



Update: dda has some insight on Korean in the comments, and I’ve added several other Southeast Asian languages, which seem to have the same system as Thai.

It’s not actually correct to say that Thai, Burmese, Lao, and Khmer don’t use spaces; they do. It’s just that they use them to separate long phrases, not single words. So the problems of indexing are similar to those of languages that don’t use spaces at all. The title of this post should have been “Languages that don’t use spaces between words, like, all the time?”

Using search engines to find content in any language

October 20th, 2006

Here’s a little trick I use a lot. I don’t know if it’s common knowledge, maybe so. But in any case I find it pretty useful (and, I admit, fun).

Just now, I was reading about Madonna stealing, er, adopting a baby from Malawi. I was wondering if there were any blogs in a Malawian language that mentioned Madonna.

(I’ll tell you right now, I didn’t find any.)

A friend of mine did Peace Corps in Malawi, and she taught me a few phrases, so I happened to know that one local language is called Chichewa.

So here’s my technique:

  1. Find some text in Chichewa.
  2. Find the most frequent words.
  3. Pick a few, preferably short ones.
  4. Stick those in a search engine, optionally with the term you’re looking for (in my case, it was “Madonna”).

For step one, a convenient source is Eric Muller’s UDHR in Unicode. (And while you’re there, take a minute to read it if you never have…)

I had to run a few Google searches to figure out which name Eric was using (nope, nope) … finally I loaded up this sampler and did a find for “Chewa”… bingo:

Nyanja (Chechewa)
Anthu onse amabadwa aufulu ndiponso ofanana mu ulemu ndi ufulu wao. Iwowa ndi wodalitsidwa ndi mphamvu zoganiza ndi chikumbumtima ndipo achitirane wina ndi mnzake mwaubale.

Nyanja (Chinyanja)
Anthu onse amabadwa mwa ufulu ndiponso olinganga m’ makhalidwe ao. Iwo amakhala ndi nzeru za cibadwidwe kotero ayenera kucitirana zabwino wina ndi mnzace.

It’s spelled “Chechewa”… hmm, come to think of it, I wonder why there are two versions there? I should drop Eric an email. (He’s very diligent about fixing things, so if you happen to run across a bug at http://udhrinunicode.org/ do let him know.)

So anyway. Then I went and found it on the OHCHR: UN site.

Now, step 2, find some frequent words.

Javascript guru Jesse Ruderman has some Search Engine Optimization Bookmarklets, one of which happens to pop up a window with word frequency. I made a little test page here.

So, you run that bookmarklet from the aforelinked Chichewa page, and you get a window that says something like this:

1615 words


  • 133: ndi
  • 57: ufulu
  • 39: aliyense
  • 37: ali
  • 35: kapena
  • 31: m
  • 30: ndime
  • 28: pa
  • 28: wa
  • 20: anthu
  • 19: dziko
  • 18: a
  • 16: munthu
  • 15: maiko
  • 15: ndiponso
  • etc…

Pick a few really common ones — shorter is probably better; after all, the word “rights” will be all over that particular document, but as Zipf taught all us groundlings, common words tend to be short.
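If you would rather not depend on a bookmarklet, steps 2 through 4 can be sketched in a few lines of Python. This is not Jesse’s code, just a rough equivalent under my own assumptions: words are runs of letters (so an apostrophe splits a word), “short” means five letters or fewer, and the function names are invented for this example. The sample text is the Chichewa Article 1 quoted above.

```python
import re
from collections import Counter

# Article 1 of the UDHR in Chichewa, as quoted earlier in this post.
SAMPLE = (
    "Anthu onse amabadwa aufulu ndiponso ofanana mu ulemu ndi ufulu wao. "
    "Iwowa ndi wodalitsidwa ndi mphamvu zoganiza ndi chikumbumtima ndipo "
    "achitirane wina ndi mnzake mwaubale."
)


def word_counts(text):
    """Count lowercase word tokens (runs of letters, no digits or punctuation)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def pick_terms(counts, how_many=5, max_len=5):
    """Return the most frequent words that are at most max_len letters long."""
    short = [word for word, _ in counts.most_common() if len(word) <= max_len]
    return short[:how_many]


def build_query(terms, extra=None):
    """Join the chosen words into a query, optionally adding a topic term."""
    return " ".join(list(terms) + ([extra] if extra else []))


counts = word_counts(SAMPLE)
terms = pick_terms(counts, how_many=3)
print(counts.most_common(1))
print(build_query(terms, extra="madonna"))
```

One sentence is far too little text for stable counts, which is why the full declaration makes a better seed than a single article.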

Survey says:

wa pa a cha ndime madonna - Google Search. Nada. Rinse and repeat:

wa a cha ndime madonna - Google Search Oh for two!

wa a cha madonna - Google Search Now we’re talkin!

Look, Chichewa! Madonna amekanusha taarifa zilizotolewa na vyombo vya habari.

Oh wait, that’s Swahili.

Well anyway. I guess it’s not the year for blogs in Chichewa to catch on yet.

Yet.

In any case, leaving “Madonna” off certainly seems like it gets you some content in Chichewa:

wa pa a cha ndime - Google Search

Which in and of itself is useful when a language might not be indexed as that language by a search engine, but it’s indexed nonetheless.

A more industrialized version of this process could be easily scripted. I’ve got some rather crufty code lying around that will use this approach to spider content, find frequent words, submit them to the Yahoo API, spider the URLs that return, rinse and repeat. This is cool, because the more iterations you do, the more representative the frequent words become, and they function as ever-better search terms. Maybe I’ll clean it up and post it here.
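Until then, here is the shape of that loop as a minimal sketch. It is not my crufty code and it does not talk to any real search API: `search` and `fetch` are hypothetical callables you would have to supply yourself, and the little `FAKE_WEB` dictionary exists only so the example runs on its own.

```python
from collections import Counter


def top_words(texts, how_many=5):
    """Most frequent whitespace-separated words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(how_many)]


def bootstrap(seed_text, search, fetch, rounds=3, how_many=5):
    """Grow a corpus by repeatedly searching for its own most frequent words.

    search(query) must return a list of URLs and fetch(url) the page text.
    Both are hypothetical stand-ins for a real search API and HTTP client.
    """
    corpus = {"seed": seed_text}
    for _ in range(rounds):
        query = " ".join(top_words(corpus.values(), how_many))
        new_urls = [url for url in search(query) if url not in corpus]
        if not new_urls:
            break  # nothing new came back, so further rounds would repeat
        for url in new_urls:
            corpus[url] = fetch(url)
    return corpus


# A pretend web, invented for this example.
FAKE_WEB = {
    "page1": "ndi wa pa ndi ufulu",
    "page2": "wa pa a ndi ndime",
    "page3": "the cat sat on the mat",
}


def fake_search(query):
    """Return every fake page that shares at least one word with the query."""
    terms = set(query.split())
    return [url for url, text in FAKE_WEB.items() if terms & set(text.split())]


corpus = bootstrap("ndi ufulu ndi wa", fake_search, FAKE_WEB.get)
print(sorted(corpus))
```

A real version would also want politeness delays, deduplication, and some check that the pages coming back are actually in the target language before their words get folded into the next query.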

There’s a paper from before the days of search engine APIs that describes a similar approach (better than I’ve described it here!):

Building Minority Language Corpora by Learning to Generate Web Search Queries - Ghani, Jones, Mladenić (ResearchIndex)

Happy fishing.

The Problem Is, Serendipity Works

October 18th, 2006

There are a million people in the world who want to tell you how to act. What the principles of effective life are, and crap like that.

Case in point.

The real work is happening in your brain and practically every other place that’s not an inbox. Stop allowing yourself to be brow-beaten by the latest, loudest, or most dramatic item that’s landed in your world.

The problem is, this is patently not true.

Randomly wandering around the internet, nay, pointlessly, obsessively, addictively wandering around the internet is productive. People who think that they will make themselves more efficient by not wandering around pointlessly on the internet are kidding themselves. People have an amazing ability to sort signal from noise.

But the thing is, the more noise there is, the more signal you get.

This is what the “efficiency” crowd doesn’t want to admit, because it means that their systems aren’t more productive than obsessive wandering and clicking through toplists.

A few hours ago (I really don’t even know what time it is), I was screwing around with some CSS to render parallel text. Basically I was looking for good ways to mark up a source text and its translation with HTML and CSS.

In the process, I started randomly sticking in sample texts from the first article of the Universal Declaration of Human Rights in a bazillion languages.

One of those languages was Thai, and I saw that the Thai text wasn’t wrapping correctly.

Ah, that old problem. Thai doesn’t use spaces (well, it does, but… erm… I don’t exactly understand when and why), so browsers don’t know where to break long strings of text. (They usually rely on spaces.)
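To make the wrapping problem concrete, here is one generic way to give a browser break points: find word boundaries, then slip a zero-width space (U+200B) in at each one. This is a minimal sketch under loud assumptions: the two-word `WORDS` set is a toy stand-in for a real Thai dictionary, and greedy longest-match is just the simplest segmenter I could write down, not a claim about how any particular tool does it.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible, but a legal line-break point

# Toy word list, a hypothetical stand-in for a real Thai dictionary.
WORDS = {"ภาษา", "ไทย"}  # "language", "Thai"


def segment(text, words):
    """Greedy longest-match segmentation against a non-empty word set.

    Characters that start no known word are passed through one at a time.
    """
    longest = max(map(len, words))
    pieces, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in words:
                break
        else:
            size = 1  # no dictionary word starts at this position
        pieces.append(text[i:i + size])
        i += size
    return pieces


def add_break_hints(text, words=WORDS):
    """Insert a zero-width space between segments so lines can wrap there."""
    return ZWSP.join(segment(text, words))


print(segment("ภาษาไทย", WORDS))
print(repr(add_break_hints("ภาษาไทย")))
```

Greedy matching gets ambiguous strings wrong and knows nothing about words missing from the list, which is exactly why the statistical approaches are interesting.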

This was not too far from my mind, because a few nights ago, while randomly looking through the web pages of NLP courses, I found this one at Stanford, which had a really interesting paper on Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences (pdf). The problem is similar in Japanese, of course, so my experience with miswrapped Thai immediately made me wonder whether the (very successful) technique from that paper could be ported to Thai.

But before I started looking into that (probably by trying to implement the paper’s algorithm for Japanese, to start with), I figured I would… wander around aimlessly a bit more googling for anything related to text wrapping and Thai.

So I started thinking of terms to look up. One thing that popped into my mind was the name of a guy who goes by “bact” on Wikipedia. So, totally randomly, I googled: thai bact.

Look at the first hit:

Thai Words Separator :: Mozilla Add-ons :: Add Features to Mozilla …
Thai Words Separator is an extension to fit thai words in webpage layout without … This implementation developed from bact’ (http://bact.blogspot.com/) …

And a few clicks away from that, Bact’s public domain ThaiWrap bookmarklet.

That little piece of code has a very original and useful approach to solving my CSS text-wrapping problem. But that’s not all, it’s another piece of the puzzle that could play a role in the much more critical problem of probabilistically splitting Thai (and Japanese, and Khmer, and…) text into words.

That’s a serious problem for Blogamundo, a real problem for which we have to find a solution, or at least, an approach.

And I got closer to a solution by just wandering around aimlessly.

Yes, I stayed up all night. Yes, I ate half a box of Triscuits.

And you know what? It was pretty damn productive.