infundibulum

Random Thoughts on Compression and Linguistic Typology

March 22nd, 2005

Here’s a random thought I wanted to write down before I forgot it.

There has been discussion of using compression to identify languages. Here’s a neat little Python script by Dirk Holtwick that demonstrates the idea. It’s based on this short paper, which was quite controversial when it came out (mostly because it was published in a physics journal, and the physicists got grumpy).
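To make the idea concrete, here is a minimal sketch of compression-based language guessing. It is not Dirk Holtwick's script and not the paper's exact method, just one common way to set it up, assuming zlib as the compressor. The names `extra_cost`, `guess_language`, and the `references` dict (one sample text per language) are made up for illustration.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_cost(reference, sample):
    """Extra compressed bytes needed to append sample to reference.

    If sample shares many strings with reference, the compressor can
    point back at them, so the extra cost stays small.
    """
    return compressed_size(reference + sample) - compressed_size(reference)


def guess_language(sample, references):
    """Return the key of the reference text that sample is cheapest to append to.

    references is a hypothetical dict mapping a language name to a
    sample text in that language.
    """
    return min(references, key=lambda name: extra_cost(references[name], sample))


# Toy reference texts; a real test would need far longer samples.
references = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and runs through the quiet green forest ",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und läuft durch den stillen grünen wald ",
}

print(guess_language("the lazy dog runs through the quiet forest", references))
```

With reference texts this short the result only reflects shared phrases. Anything serious would need much longer samples per language, and note that zlib only looks back 32 KB.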

I wonder what other uses compression could be put to in the linguistic sphere. The idea that came to my mind was typology. It seems to me that agglutinative languages would show detectably different compression patterns from isolating languages, for instance.

For example, one would expect many strings in Turkish text to show up as substrings of other words, because Turkish is agglutinative: it builds words by chaining suffixes onto a stem, so shorter forms recur inside longer ones…
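Here is a rough sketch of two numbers one could compare across languages to poke at this hunch. Both are assumptions for illustration, not established typological measures: `compression_ratio` is the zlib-compressed size over the raw size, and `substring_share` is the fraction of distinct words that occur inside some other distinct word. The Turkish forms used below are ev (house), evler (houses), evlerim (my houses), and evlerimde (in my houses).

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size (lower means more redundancy)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must not be empty")
    return len(zlib.compress(raw, 9)) / len(raw)


def substring_share(text):
    """Fraction of distinct words that appear inside another distinct word.

    Splitting on whitespace and calling lower() is a naive choice:
    it ignores punctuation and is not aware of the Turkish dotted
    and dotless i.
    """
    words = sorted(set(text.lower().split()))
    if not words:
        return 0.0
    contained = sum(
        1 for word in words
        if any(word != other and word in other for other in words)
    )
    return contained / len(words)


# ev, evler, evlerim are each contained in a longer form; evlerimde is not.
print(substring_share("ev evler evlerim evlerimde"))  # 0.75
```

Whether either number really separates agglutinative from isolating languages is exactly the open question. Comparing them fairly would need texts of similar size and content, such as translations of the same document.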

I haven’t articulated this well; I just wanted to get it out of my head before I forgot it.

Love, Your Friendly Neighborhood Outlying App Developer
