infundibulum

Office of English Language Acquisition Blog

October 31st, 2006

I’ve been following the Office of English Language Acquisition (OELA) Newsline for a while in Bloglines. It’s updated a lot, and I would say it’s *cough* fair and balanced on most issues. Here are a few recent links:

Worth a look if you’re interested in language policy issues.

Quantum Information Retrieval Mechanics?

October 29th, 2006

Equal parts terrifying and fascinating:

C. J. van Rijsbergen - The Geometry of Information Retrieval

Keith Van Rijsbergen demonstrates how different models of information retrieval (IR) can be combined in the same framework used to formulate the general principles of quantum mechanics. All the standard results can be applied to address problems in IR, such as pseudo-relevance feedback, relevance feedback and ostensive retrieval. The relation with quantum computing is examined. Appendices with background material on physics and mathematics are also included.

Head a splodes.

Oh, and while I’m on the topic of geometry and IR, there’s another recent book out that seems pretty great:

Dominic Widdows - Geometry and Meaning

From the earliest applications in astronomy, music, and biology, to the design of today’s user interfaces and search engines, geometric insights have provided powerful tools and accurate scientific predictions. In Geometry and Meaning, these threads are gathered together and told as a single evolving story. Mathematical models from ancient times to the present are described for the general reader, together with the stories behind their discovery, and their applications in the new and vibrant field of natural language processing.

How zeitgeisty, huh?

It’s funny how IR as a field has this reputation of being very bland and nerdy. It seems to me that statistics has the same sort of reputation. I was never very good at math. I did just okay in calculus and so on, but to be honest, for whatever reason, I just never cared about physics. And that’s what you get, mostly, in calculus examples. Plug in formula blah blah.

Once I started learning about the weird relationship between numbers and words and letters and all that voodoo (and this is after college), I suddenly cared. Even now, I see something like the vector space model as very mysterious, very amazing, and even, maybe, profound. Dude, you can use cosines to figure shit out. Crazy.

And the more you learn, the more you start thinking that way.
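
For the curious, a minimal sketch of the cosine trick. It assumes plain whitespace tokenization and raw word counts, which are simplifications picked for illustration (real systems usually weight terms), and the name `cosine_similarity` is made up here.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Only words that occur in both texts contribute to the dot product.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    # An empty text has no direction, so call it "not similar".
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("geometry of information retrieval",
                            "geometry and meaning"))
```

Texts that share no words score 0, and texts with the same word proportions score (very nearly) 1, regardless of length.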

Language Search

October 26th, 2006

I put together one of those Google Co-op search engines for sites related to language, translation, linguistics, and stuff like that. Check it out:

Site recommendations welcome.

Sic Semper Sic?

October 24th, 2006

I was reading John Battelle’s The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture (good read, by the way), and something caught my attention.

In a footnote to the chapter on the birth of Google, Battelle gives a quote from Wikipedia with the definition of a graph (this definition of a graph, as a matter of fact).

And something occurred to me.

If you quote something from a wiki, using sic in your quotation would be… well, weird.

Because if you’re quoting a wiki, you’re implicitly giving your approval of the content. After all, if you find a mistake that you would otherwise mark as “sic,” you can fix it before you quote it!

So then, you’re actually sort of quoting yourself, right?

I don’t suggest thinking about this sort of thing whilst sober.

Fuggeddaboutit

October 23rd, 2006

Oh, awesome:

“Full text search in SQLite.”

Oh, the sound of a bazillion angels crying:

“The module currently uses the following generic tokenization mechanism. A token is a contiguous sequence of alphanumeric ASCII characters (A-Z, a-z and 0-9). All non-ASCII characters are ignored. Each token is converted to lowercase before it is stored in the index, so all full-text searches are case-insensitive. The module does not perform stemming of any sort.”

My forehead is really starting to hurt from banging it on the desk.
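
To see why that hurts, here is a rough sketch of the tokenization as the quoted paragraph describes it. This is one reading of that description, not SQLite's actual code: it assumes "ignored" means non-ASCII characters simply act as separators, and `simple_tokenize` is a made-up name.

```python
import re

# A token is a run of ASCII letters and digits; everything else separates tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def simple_tokenize(text):
    """Lowercased ASCII-alphanumeric tokens, with no stemming."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    print(simple_tokenize("Full-Text Search"))       # plain English is fine
    print(simple_tokenize("caf\u00e9 na\u00efve"))   # accented words get chopped up
    print(simple_tokenize("\u65e5\u672c\u8a9e"))     # Japanese text yields no tokens
```

Under that reading, anything written outside ASCII is either mangled or never makes it into the index at all.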

Languages that don’t use Spaces?

October 23rd, 2006

I’m trying to build a list:

  • Thai
  • Japanese
  • Chinese (Cantonese, Mandarin, etc)
  • Korean (oops, thanks dda)
  • Khmer
  • Burmese
  • Lao

I found this W3C slide which is relevant: W3C I18N Tutorial: CSS3 and International Text

I just haven’t checked yet, but don’t Burmese and Khmer fall into this category as well?



Update: dda has some insight on Korean in the comments, and I’ve added several other Southeast Asian languages, which seem to have the same system as Thai.

It’s not actually correct to say that Thai, Burmese, Lao, and Khmer don’t use spaces; they do. It’s just that they use them to separate long phrases, not single words. So, the problems of indexing are similar to those of languages that don’t use spaces at all. The title of this post should have been “Languages that don’t use spaces between words, like, all the time?”

Using search engines to find content in any language

October 20th, 2006

Here’s a little trick I use a lot. I don’t know if it’s common knowledge, maybe so. But in any case I find it pretty useful (and, I admit, fun).

Just now, I was reading about Madonna stealing, er, adopting a baby from Malawi. I was wondering if there were any blogs in a Malawian language that mentioned Madonna.

(I’ll tell you right now, I didn’t find any.)

A friend of mine did Peace Corps in Malawi, and she taught me a few phrases, so I happened to know that one local language is called Chichewa.

So here’s my technique:

  1. Find some text in Chichewa.
  2. Find the most frequent words.
  3. Pick a few, preferably short ones.
  4. Stick those in a search engine, optionally with the term you’re looking for (in my case, it was “Madonna”).
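
The four steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names, the `max_len` cutoff for "short," and the letters-only tokenizer are all made up here, and actually submitting the query to a search engine is left out.

```python
import re
from collections import Counter


def frequent_words(text, top_n=15):
    """Step 2: count the words in a sample text, most frequent first."""
    # [^\W\d_]+ matches runs of letters in any script (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(top_n)


def build_query(freq, how_many=5, max_len=5, extra_term=None):
    """Steps 3 and 4: keep a few short, frequent words and add the search term."""
    short = [word for word, _ in freq if len(word) <= max_len][:how_many]
    if extra_term:
        short.append(extra_term)
    return " ".join(short)


if __name__ == "__main__":
    sample = "ndi ufulu ndi aliyense ndi pa wa ufulu wa ndi"
    freq = frequent_words(sample)
    print(freq)
    print(build_query(freq, how_many=3, extra_term="madonna"))
```

The resulting string is what goes into the search box.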

For step one, a convenient source is Eric Muller’s UDHR in Unicode. (And while you’re there, take a minute to read it if you never have…)

I had to run a few Google searches to figure out which name Eric was using (nope, nope …). Finally I loaded up this sampler and did a find for “Chewa”… bingo:

Nyanja (Chechewa)
Anthu onse amabadwa aufulu ndiponso ofanana mu ulemu ndi ufulu wao. Iwowa ndi wodalitsidwa ndi mphamvu zoganiza ndi chikumbumtima ndipo achitirane wina ndi mnzake mwaubale.

Nyanja (Chinyanja)
Anthu onse amabadwa mwa ufulu ndiponso olinganga m’ makhalidwe ao. Iwo amakhala ndi nzeru za cibadwidwe kotero ayenera kucitirana zabwino wina ndi mnzace.

It’s spelled “Chechewa”… hmm, come to think of it, I wonder why there are two versions there? I should drop Eric an email. (He’s very diligent about fixing things, so if you happen to run across a bug at http://udhrinunicode.org/ do let him know.)

So anyway. Then I went and found it on the OHCHR: UN site.

Now, step 2, find some frequent words.

JavaScript guru Jesse Ruderman has some Search Engine Optimization Bookmarklets, one of which happens to pop up a window with word frequency. I made a little test page here.

So, you run that bookmarklet from the aforelinked Chichewa page, and you get a window that says something like this:

1615 words


  • 133: ndi
  • 57: ufulu
  • 39: aliyense
  • 37: ali
  • 35: kapena
  • 31: m
  • 30: ndime
  • 28: pa
  • 28: wa
  • 20: anthu
  • 19: dziko
  • 18: a
  • 16: munthu
  • 15: maiko
  • 15: ndiponso
  • etc…

Pick a few really common ones — shorter is probably better; after all, the word “rights” will be all over that particular document, but as Zipf taught all us groundlings, common words tend to be short.

Survey says:

wa pa a cha ndime madonna - Google Search. Nada. Rinse and repeat:

wa a cha ndime madonna - Google Search. Oh for two!

wa a cha madonna - Google Search. Now we’re talkin!

Look, Chichewa! Madonna amekanusha taarifa zilizotolewa na vyombo vya habari.

Oh wait, that’s Swahili.

Well anyway. I guess it’s not the year for blogs in Chichewa to catch on yet.

Yet.

In any case, leaving “Madonna” off certainly seems like it gets you some content in Chichewa:

wa pa a cha ndime - Google Search

Which in and of itself is useful when a language might not be indexed as that language by a search engine, but it’s indexed nonetheless.

A more industrialized version of this process could be easily scripted. I’ve got some rather crufty code lying around that will use this approach to spider content, find frequent words, submit them to the Yahoo API, spider the URLs that return, rinse and repeat. This is cool, because the more iterations you do, the more representative the frequent words become, and they function as ever-better search terms. Maybe I’ll clean it up and post it here.
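
In the meantime, here is a rough sketch of what such a loop might look like. It is not the crufty code mentioned above: `search` and `fetch` are hypothetical stand-ins that the caller supplies (a search API wrapper and a page downloader, respectively), and the word-length cutoff is an assumption.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters in any script


def top_words(counts, how_many=5, max_len=5):
    """The most frequent short words seen so far."""
    return [w for w, _ in counts.most_common() if len(w) <= max_len][:how_many]


def bootstrap_corpus(seed_text, search, fetch, rounds=3, how_many=5):
    """Grow a word-frequency table by searching, fetching, and recounting.

    search(query) returns a list of URLs and fetch(url) returns page text;
    both are supplied by the caller (hypothetical here).
    """
    counts = Counter(WORD_RE.findall(seed_text.lower()))
    seen = set()
    for _ in range(rounds):
        query = " ".join(top_words(counts, how_many))
        found_new = False
        for url in search(query):
            if url in seen:
                continue
            seen.add(url)
            found_new = True
            # Every new page refines the frequency table for the next round.
            counts.update(WORD_RE.findall(fetch(url).lower()))
        if not found_new:
            break  # nothing new came back, so stop early
    return seen, counts
```

Passing the search and fetch steps in as functions keeps the loop testable without touching the network.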

There’s a paper from before the days of search engine APIs that describes a similar approach (better than I’ve described it here!):

Building Minority Language Corpora by Learning to Generate Web Search Queries - Ghani, Jones, Mladenić (ResearchIndex)

Happy fishing.

The Problem Is, Serendipity Works

October 18th, 2006

There are a million people in the world who want to tell you how to act. What the principles of effective life are, and crap like that.

Case in point.

The real work is happening in your brain and practically every other place that’s not an inbox. Stop allowing yourself to be brow-beaten by the latest, loudest, or most dramatic item that’s landed in your world.

The problem is, this is patently not true.

Randomly wandering around the internet, nay, pointlessly, obsessively, addictively wandering around the internet is productive. People who think that they will make themselves more efficient by not wandering around pointlessly on the internet are kidding themselves. People have an amazing ability to sort signal from noise.

But the thing is, the more noise there is, the more signal you get.

This is what the “efficiency” crowd doesn’t want to admit, because it means that their systems aren’t more productive than obsessive wandering and clicking through toplists.

A few hours ago (I really don’t even know what time it is), I was screwing around with some CSS to render parallel text. Basically, I was looking for good ways to mark up a source text and its translation with HTML and CSS.

In the process, I started randomly sticking in sample texts from the first article of the Universal Declaration of Human Rights in a bazillion languages.

One of those languages was Thai, and I saw that the Thai text wasn’t wrapping correctly.

Ah, that old problem. Thai doesn’t use spaces (well, it does, but… erm… I don’t exactly understand when and why), so browsers don’t know where to break long strings of text. (They usually rely on spaces.)
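
One general workaround, sketched here under heavy assumptions, is to slip a zero-width space (U+200B) in at each word boundary, since that character gives a browser a legal place to break the line. The hard part is finding the boundaries. The toy `segment` function below uses greedy longest-match against a word list, which is only a stand-in for a real segmenter, and the two-word dictionary in the demo is purely illustrative.

```python
ZWSP = "\u200b"  # zero-width space: invisible, but a legal line-break point


def segment(text, dictionary):
    """Greedy longest-match segmentation against a set of known words."""
    longest = max((len(word) for word in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for size in range(min(longest, len(text) - i), 1, -1):
            if text[i:i + size] in dictionary:
                match = text[i:i + size]
                break
        words.append(match)
        i += len(match)
    return words


def add_break_hints(text, dictionary):
    """Join the segmented words with zero-width spaces."""
    return ZWSP.join(segment(text, dictionary))


if __name__ == "__main__":
    toy_dictionary = {"สวัสดี", "ครับ"}
    hinted = add_break_hints("สวัสดีครับ", toy_dictionary)
    print(hinted.split(ZWSP))
```

Greedy matching gets ambiguous strings wrong, which is exactly why a statistical approach is tempting.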

This was not too far from my mind, because a few nights ago, while randomly looking through the web pages of NLP courses, I found this one at Stanford, which had a really interesting paper on Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences (pdf). The problem is similar in Japanese, of course, so my experience with miswrapped Thai immediately made me wonder whether the (very successful) technique from that paper could be ported to Thai.

But before I started looking into that (probably by trying to implement the paper’s algorithm for Japanese, to start with), I figured I would… wander around aimlessly a bit more googling for anything related to text wrapping and Thai.

So I started thinking of terms to look up. One thing that popped into my mind was the name of a guy who goes by “bact” on Wikipedia. So, totally randomly, I googled: thai bact.

Look at the first hit:

Thai Words Separator :: Mozilla Add-ons :: Add Features to Mozilla …
Thai Words Separator is an extension to fit thai words in webpage layout without … This implementation developed from bact’ (http://bact.blogspot.com/) …

And a few clicks away from that, Bact’s public domain ThaiWrap bookmarklet.

That little piece of code has a very original and useful approach to solving my CSS text-wrapping problem. But that’s not all: it’s another piece of the puzzle that could play a role in the much more critical problem of probabilistically splitting Thai (and Japanese, and Khmer, and…) text into words.

That’s a serious problem for Blogamundo, a real problem for which we have to find a solution, or at least, an approach.

And I got closer to a solution by just wandering around aimlessly.

Yes, I stayed up all night. Yes, I ate half a box of Triscuits.

And you know what? It was pretty damn productive.

Bye Tower

October 14th, 2006

So Tower Records is closing.

You know, I’m surprised how bummed out that makes me. I have gone to Tower for years. That one and that one and that one and that one.

It’s sort of stupid to feel nostalgic about a retail chain. Retail chains don’t feel nostalgic about you. (Or your privacy — it always bugged the hell out of me when cashiers at Tower would ask for my zip code. I’d say “90210.” They didn’t like that.)

And yet, the fact that I can remember these places means that they have been part of my life like any other, I guess. And while Tower was a huge, corporate store peddling stuff sold by record companies that (on the whole) can only be described as clueless about recent changes in the way music is distributed, and terribly disrespectful of their customers… even then it makes me sad that Tower is going away.

I thought about it a little, and I think I put my finger on it.

Tower Records was the one holdout in suburbia that could make you think that somebody, somewhere, was still willing to invest in the idea that being a little unusual was alright.

They gave jobs to the pierced kids, the punks, the goths, and the classical wonks too. In fact, they gave the classical wonks a whole room of their own.

And, if you dug around in their book sections, you’d find stuff that you will not find at Barnes and Noble or Borders. Subversive stuff. Some stuff that I had zero interest in reading. But at least it was there, you know? At least you could still buy that stuff in suburbia. It existed in the brick and mortar world, and there was no massive catastrophe as a result of it.

As a business entity, I really won’t miss it. They didn’t figure out how to live in the age of the long tail. They never clued in to the iPod world, they never got creative, and that’s the way it goes.

But tonight I was talking to one of those hipster chicks that work there. You know her, the cute one with the slightly blue hair that you joke around with and wish you could keep talking to, but hey, transaction completed? The one who’s cooler than you. That one.

We were joking about what would be left at the bitter end — miles of NKOTB remixes? Would they turn it into a Costco-sized McDonalds? “Tower Records is the only place open at midnight for miles.” She said, “yeah.”

Then she told me not to say any more about it, it was depressing.

And it hit me: even someone who really values the idea that they’re outside of the system, and they say “screw you, system, my hair is blue and I’m putting tattoos on my eyelids!”…

They’re still stranded in suburbia too. And at quarter to midnight, where are they going to hang out now?

I have this icky feeling that a fundie somewhere is snickering.

Here Comes the Cornish…

October 11th, 2006

BBC NEWS | England | Cornwall | Beatles get the Cornish treatment

Covers of the Beatles in Cornish.

A few years back there were some interesting blog postings going about regarding whether Cornish was dead or not — see Languagehat for details. (My old blog exists only as an Internet Archive ghost.)