Archive for Fieldmethods

Transliteration as Poor Man’s Translation

Here’s a thought I’ve never gotten around to implementing or really trying out.

Transliteration is the process of converting text in one script into another script. Here’s an example from Wikipedia, going from Greek script to the Latin alphabet:

Greek Script: Ελληνική Δημοκρατία
Transcription: Ellēnikē Dēmokratia
Transliteration: Elliniki Dimokratia

The details of such conversion are pretty complex — there are two distinct systems of conversion here. The Wikipedia article tries to maintain a distinction between “transcription” and “transliteration,” but whatever, you get the idea: convert from one writing system to another.

Now, let’s suppose you have reason to believe, as blogger Ethan Zuckerman recently did, that there is an article written about you in a language you don’t know:

…two days ago, when ego-surfing Technorati, I discovered that a Saudi blogger had linked to me, mentioning that an interview with me had just been published in Al-Hayat. I can’t read Arabic, but the few English phrases in the piece connected to topics I’m deeply interested in. So hey, perhaps it was an interview with me.

Let’s imagine Ethan wasn’t fortunate enough to find an Arabic blogger to translate the article for him (which he in fact was in this instance). Is there some way that he might be able to determine if his name is in the thing at all?

Maybe so, using automated transliteration (or transcription, whatever!) and a bit of fuzzy matching.

When you get right down to it, the basic operation in transliteration is just making a bunch of substitutions. As with many tasks related to language processing, the best first step is often to simply think of what you’d do if you had to accomplish the task by hand.

Well, let’s say you were going to work with that Greek up there.

Ελληνική Δημοκρατία
Elliniki Dimokratia

(I picked the simpler transliteration system.)

Anyone can do a little inspection and make an educated guess as to which letter corresponds to which… something like this:

Ε E
λ l
λ l
η i
ν n
ι i
κ k
ή i

Δ D
η i
μ m
ο o
κ k
ρ r
α a
τ t
ί i
α a

And of course we’ll need more such pairs to figure out all the letters, but that’s not hard to find. In fact, we could just cut and paste 30 or 40 words from Wikipedia. (Say, city names Αθήνα Athína; Θεσσαλονίκη Thessaloníki; Πελοπόννησος Peloponnesos, etc.)

Once we’ve done that, we can write a simple program which will make those substitutions, and go from one script to the other.
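
Here’s a minimal sketch of what I mean, in Python. The mapping table is just the handful of letters worked out above (a real one would need the fuller list of pairs), and the fuzzy matching leans on the standard difflib module. The Arabic table mentioned in the last comment is hypothetical.

# -*- coding: utf-8 -*-
# A toy transliterator: substitute the characters we know about, pass the
# rest through untouched.
import difflib

GREEK_TO_LATIN = {
    u'Ε': u'E', u'λ': u'l', u'η': u'i', u'ν': u'n', u'ι': u'i',
    u'κ': u'k', u'ή': u'i', u'Δ': u'D', u'μ': u'm', u'ο': u'o',
    u'ρ': u'r', u'α': u'a', u'τ': u't', u'ί': u'i',
}

def transliterate(text, table):
    return u''.join(table.get(ch, ch) for ch in text)

def find_name(name, text, table, cutoff=0.7):
    """Fuzzy-match a Latin-script name against each transliterated word."""
    words = transliterate(text, table).split()
    return difflib.get_close_matches(name, words, n=3, cutoff=cutoff)

print(transliterate(u'Ελληνική Δημοκρατία', GREEK_TO_LATIN))
# -> Elliniki Dimokratia
# With an Arabic-to-Latin table, find_name(u'Zuckerman', article_text, table)
# would be the Al-Hayat experiment.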

And of course, this is all grossly simplified and won’t work very well at all.

More later…


Random Thoughts on Compression and Linguistic Typology

Here’s a random thought I wanted to write down before I forgot it.

There has been discussion of using compression to identify languages. Here’s a neat little Python script by Dirk Holtwick that proves that the idea works. It’s based on this short paper, which was quite controversial when it came out (mostly because it was published in a physics journal, and the physicists got grumpy).
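
For the record, here’s a rough sketch of the trick itself, not Holtwick’s actual script: compress a reference sample of each candidate language with and without the mystery text appended, and pick the language whose sample “absorbs” the new text most cheaply. The sample file names here are made up; any few kilobytes of text per language should do.

# Compression-based language identification, sketched with zlib.
import zlib

def csize(data):
    return len(zlib.compress(data))

def guess_language(mystery, samples):
    """Return the language whose reference sample compresses best when the
    mystery text is appended to it (smallest increase in compressed size)."""
    costs = {}
    for lang, reference in samples.items():
        costs[lang] = csize(reference + mystery) - csize(reference)
    return min(costs, key=costs.get)

samples = {
    'english': open('english_sample.txt', 'rb').read(),  # made-up file names
    'turkish': open('turkish_sample.txt', 'rb').read(),
}
print(guess_language(open('mystery.txt', 'rb').read(), samples))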

I wonder what other uses compression could be put to in the linguistic sphere. The idea that came to my mind was typology. It seems to me that agglutinative languages would show detectably different compression patterns than isolating languages, for instance.

For example, one would expect many strings in Turkish text to show up as substrings of other words, because Turkish is agglutinative…

I haven’t articulated this well, just wanted to get it out of my head before I forgot it.


Programming in the browser…

Getting Unicode straight across platforms has been a huge hangup for me in trying to get together some tutorials on doing language processing with Python. And then, there’s another barrier to cross: how to deal with markup?

Generally speaking, what I’m interested in dealing with is text, but most multilingual text on the web is HTML.

One weird observation that keeps occurring to me is that you could teach text processing without teaching people to deal with setting up a programming environment at all: use Javascript.

This seems a little weird, but I think the reason it seems weird is that people who work with text processing have never thought of Javascript as a real language. But it is a real language. And the barriers to programming in Javascript are incredibly low. (Go type javascript:alert('hello world!') into your address bar to see what I mean.)

And then, I was reading through some stuff on Crockford.com, and I came across this:

String is a sequence of zero or more Unicode characters. There is no separate character type.

Good grief! Music to my ears!

And as for dealing with HTML, well, Javascript has that abstraction built in. Try explaining to a newbie how to extract the text from an HTML page in Python. “Well, you start by subclassing a parser and…” Javascript is designed for the browser, and the browser is where all that markup stuff gets handled in the first place: to turn a css rule into “put this text in a blue box in the corner,” the browser already has to know what the “text” is.
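
Just to make the comparison concrete, here’s a sketch of the Python route, using the standard library’s HTML parser (the file name is a stand-in). In the browser, the same text is already sitting there as a property of the page.

# The "subclass a parser" route: collect the text nodes of an HTML page.
# It works, but it's a fair amount of ceremony for "give me the text."
from html.parser import HTMLParser  # in Python 2 this lived in the HTMLParser module

class TextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)

extractor = TextExtractor()
extractor.feed(open('page.html').read())  # 'page.html' is a stand-in
print(extractor.get_text())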

Of course, it still looks like C, or at least it’s certainly not as friendly as Python. But I have to say, combining these characteristics with Greasemonkey opens up some very interesting possibilities… Input/output becomes “go to this URL.” “Process the text” becomes “paste this Greasemonkey script into the editor and run it; the result will investigate character distributions, do statistical language ID, sentence splitting, keyword extraction, blah blah blah….”

Is it crazy to think that such things can be done in a learnable way with Javascript? I don’t think it is…

I’m just thinking out loud. But lately I’ve been thinking about all that Ajax stuff (and rolling it into my present project), and it’s gotten me thinking about the browser as a place to do programming. Kind of blue sky, yes, but certainly a fun angle on the topic of processing natural language.


A Lo-Fi Stab at Automatically Finding Phrases

Part-of-speech tagging (often just called “POS tagging”) is one of the few NLP tasks that routinely gets very high accuracy scores, usually in the high nineties, so the idea is something of an old chestnut in NLP. The specific tags used vary a lot, but the following will give you an idea of what tagged text looks like, before and after:

Automatically add part-of-speech tags to words in text.

Automatically/adverb add/verb part-of-speech/adjective tags/noun to/to words/noun in/preposition text/noun./punctuation

So, assuming you’ve gone through some hocus-pocus to end up with a bunch of tagged text, what do you do with it?

Well, certain patterns of parts of speech tend to indicate terminology. If the same two nouns keep showing up next to each other in a text, for instance, chances are that the words are related, that is, they constitute a “term.” This approach is described in section 5.1 of this chapter of Manning and Schütze’s text. It’s a very intuitive approach, and even a very simple implementation will turn up some useful stuff.
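
Here’s a toy version of that idea, with a hand-tagged sentence standing in for real tagged text (the words and the tag names are made up for illustration): keep adjective+noun and noun+noun bigrams and count how often each one recurs.

# Filter tagged bigrams for likely terms, then count the repeats.
from collections import Counter

tagged = [('statistical', 'ADJ'), ('machine', 'NOUN'), ('translation', 'NOUN'),
          ('relies', 'VERB'), ('on', 'PREP'), ('parallel', 'ADJ'),
          ('corpora', 'NOUN'), ('and', 'CONJ'), ('statistical', 'ADJ'),
          ('machine', 'NOUN'), ('translation', 'NOUN'), ('improves', 'VERB'),
          ('with', 'PREP'), ('bigger', 'ADJ'), ('parallel', 'ADJ'),
          ('corpora', 'NOUN')]

candidates = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    # the classic filter: adjective-or-noun followed by a noun
    if t1 in ('ADJ', 'NOUN') and t2 == 'NOUN':
        candidates[(w1, w2)] += 1

for (w1, w2), count in candidates.most_common():
    print(w1, w2, count)
# -> statistical machine 2, machine translation 2, parallel corpora 2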

This kind of “shallow analysis” doesn’t attempt to find any long-distance relationships between phrases and words. There’s no parsing going on, in other words, just some pattern matching.

So part-of-speech tagging has its uses.

But what if we went even shallower than that? What if we tried to look exclusively at the statistical patterns of words? Can we get any useful information out?

So, I have in mind a small experiment, and I’ll just write it up here as I go.

As I’ve mentioned, one interest of mine is translation, of the human variety (as opposed to fully automated machine translation). I’m no pro at translation, so I don’t know if more qualified people feel the same way, but I find that it can be a pretty tedious endeavor. I find myself looking at a long sentence that I need to translate, and wishing that I could somehow subdivide it into phrases. So here’s a sentence I translated recently:

Es natural que exista un gran contraste entre su estilo de vida original y la nueva sociedad, que debe ser superado paulatinamente, pero la manera de resolverlo de Tatum bordea la ilegalidad. (Roughly: “It’s natural that there should be a great contrast between his original way of life and the new society, one that has to be overcome gradually, but Tatum’s way of resolving it borders on illegality.”)

What I want to see is something along the lines of:

Es natural
que exista
un gran contraste
entre su estilo
de vida original
y la nueva sociedad,
que debe ser superado paulatinamente,
pero la manera
de resolverlo
de Tatum bordea la ilegalidad.

I’m not even looking for indentation; I just want to see sensible subphrases on their own lines so I can break down the translation process. Like I said, I don’t know if real translators work this way, but when I translate, this approach seems like it would be useful.

So how to automate it? If we had a part of speech tagger for Spanish (and I’m sure it wouldn’t be too difficult to find one), and a parser (which would be a bit harder to find), we’d write some system to place noun phrases on their own line, etc etc.

But that seems like overkill… we’re just splitting up a sentence into more manageable chunks, after all. So here’s my observation: in the sentence I split up manually above, the words which begin each line seem to be frequent. What if we “split on frequency”? We’ll take the most frequent words in the document, and split up the sentence into subphrases beginning with those words. Will it be useful? I have no idea, let’s see. Tomorrow. 8^)
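
Actually, here’s a minimal sketch of the idea, just to make it concrete. Everything in it is a guess: the frequency cutoff, the punctuation handling, even whether counting over a single sentence (instead of the whole document) makes sense.

# -*- coding: utf-8 -*-
# "Split on frequency": start a new line before any word that ranks among the
# most frequent words. Here the "document" is just the sentence itself.
from collections import Counter
import re

sentence = (u"Es natural que exista un gran contraste entre su estilo de vida "
            u"original y la nueva sociedad, que debe ser superado paulatinamente, "
            u"pero la manera de resolverlo de Tatum bordea la ilegalidad.")

def frequent_words(text, n=5):
    words = re.findall(r'\w+', text.lower(), re.UNICODE)
    return set(word for word, count in Counter(words).most_common(n))

def split_on_frequency(sentence, frequent):
    lines, current = [], []
    for word in sentence.split():
        if current and word.lower().strip(u',.') in frequent:
            lines.append(u' '.join(current))
            current = []
        current.append(word)
    lines.append(u' '.join(current))
    return lines

for chunk in split_on_frequency(sentence, frequent_words(sentence)):
    print(chunk)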


My Favorite Techie/Language Books

Here’s a list of books that I like that are more or less related to the intersection of language and computing. I make no attempt to justify the grouping — it’s just that I refer to them enough that somebody else out there might be interested.

The Elements of Typographic Style, Robert Bringhurst
Although this is very much a book about print, I still think it’s a great introduction to the nature of typography. There’s an appendix which is especially useful for looking up the names of funny characters like Ą and Đ and so forth. You may think that’s something you can do by just searching for the character in the Unicode tables, but LATIN CAPITAL LETTER A WITH OGONEK only tells you so much. Bringhurst gives you much, much more. Besides, the book itself is one of the most beautiful pieces of typography I’ve ever seen.
Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard, Richard Gillam
If you really want to dig into Unicode (doesn’t everyone?), this is the book. If you’re a geeky-leaning language nerd, and are wondering if getting into internationalization and localization and programming and stuff like that is for you, then this is probably also the book to start with. Even reference tomes like Daniels (see below) are now out of date in the sense that they don’t convey how various writing systems are represented electronically. This book does that capably and readably, as opposed to the dry-as-dust Unicode specification itself. (Even I haven’t read that.) People don’t seem to realize what an amazing, amazing thing Unicode is. Just browsing this book conveys that.
Jurafsky & Martin and Manning & Schuetze
These two are NLP (Natural Language Processing) textbooks. They’re more on the mathematical side, and contain no code to speak of, outside of pseudocode for describing algorithms. They’re often mentioned together because they’re sort of complementary: J&M leans toward symbolic approaches (it’s heavy on parsing), whereas M&S leans more toward the statistical approach (which I personally find more interesting). Both require a significant dedication to understand. (I’ve only made dents.)
Text Processing With Python, David Mertz. (also free online)
Some pretty sound advice on handling text in Python. I don’t particularly like the approach he takes to Unicode, however.
The World’s Major Languages, Bernard Comrie, ed.
This is linguistics stuff. It’s probably the best single book for syntheses of grammar, phonetics/phonology, and writing systems of a broad variety of “important” languages. Of course, in this context “important” can be interpreted to mean “Let’s argue!” In my humble opinion, it’s absurd that Mayan or Quechua or Guaraní, or at least one language of the Americas, wasn’t included. But whatever, it’s still a useful book: if you need to know just a little about the structure of a language, and if it’s in here, it’s an excellent place to start.
The World’s Writing Systems, Daniels & Bright.
This is definitely a library-only kind of book. (But if you have a spare $170 lying around, my birthday is coming up next January.) As theory-independent as possible (and much better than Geoffrey Sampson’s Writing Systems in that respect), Daniels & Bright groans under the sheer amount of information it contains. It also groans under the weight of its weight: 919 pages. I’ve xeroxed a few zillion chapters out of here in my day. Endless bemusement.
Longman Dictionary of Contemporary English (searchable online)
This is a bit of an odd choice for this list, but my respect for this dictionary has grown and grown since I first started using it back when I was teaching English as a Second Language. I picked it up because I thought it would be good for learners — and it was. Many of my students ended up buying a copy for themselves. Oddly enough, I found myself using it on a regular basis, just because it’s so clear. I believe its utility is firmly based on one feature: it was built with corpora of actual usage. Not just frequencies of words, but frequencies of phrases. So it gives examples, for instance, of how the word “careful” is actually used: be careful is the most common, followed by careful person/work etc (that is, as an adjective), careful to do sth, and so on. It’s all about exemplification, and nothing about useless grammatical terminology. For a learner, that kind of information is solid gold, and it could only be obtained with statistical approaches to studying language.


Python as a Multilingual Command Line

Here’s a bit on what’s holding up progress on Fieldmethods, and some thoughts about Python and Unicode.

Consider what’s probably the simplest multilingual application imaginable: open some text from a file and print it out again. As an example, I used some text in Georgian, which I snagged from this Unicode.org page. (A good place for finding random samples of Unicode text in different languages.)


>>> import os, sys
>>> sys.getdefaultencoding()
'ascii'
>>> # Let's open up our utf-8 encoded Georgian text:
>>> georgian = open('georgian.txt').read()
>>> # Now we convert the text to Unicode:
>>> georgian = unicode(georgian)
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in -toplevel-
    georgian = unicode(georgian)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
>>> # Oops! Kartvelian chaos!
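
(For the record, the explicit fix is to tell Python which encoding the bytes are in, which is exactly the kind of detail I’d rather my readers not have to deal with up front:)

>>> # Decode explicitly, naming the encoding:
>>> georgian = unicode(open('georgian.txt').read(), 'utf-8')
>>> print georgian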

Do you want to get into character encoding and code pages and codecs and…? Well, maybe you do. I think it’s kind of interesting, myself, in the way that, uh, Rubik’s Cubes are interesting.

But my readers at Fieldmethods won’t. They’ll turn around and say “Sorry, too techie, I don’t have time for this.”

So I just want to tell them how to avoid thinking about encodings as much as possible. That means making utf-8 the default, and here’s what has to happen.

They need to take a few steps to set up a multilingual prompt.

Here’s what it looks like when everything is working:

[Screenshot: a multilingual Python prompt displaying Georgian text]

Great! See that Georgian text in there? If you have the fonts, and you have Python configured correctly, we get to deal directly with the text. You see text, not escape codes everywhere. (Although, as in that sixth line, in certain circumstances you still see escape codes. I’m not sure why that is, but we’ll have to be happy with visible text after print statements.) Note also that we’re not going to be inputting random stuff from the keyboard; we’re just going to read files, but we still want to be able to see the text.

So here are the three platforms I’m targeting:

  1. Linux
  2. Windows XP
  3. OSX

Now, one of Python’s strong points is that it’s pretty consistent across these platforms (the IDLE editor, in particular, is almost identical across all three). And the Unicode support is there.

Unfortunately, the default configuration for a new install of Python, on any platform, is not set up to encourage use of Unicode in this way. Namely, the default encoding is not utf-8 but ASCII. Why ASCII? Inertia, I guess, but I really have no idea. Arguments about default encodings, codecs, conversions, etc., go on endlessly in Python newsgroups. But pretty much everyone agrees that you can’t cover many languages with ASCII. Not even English, really, if you ask me.

Mostly such arguments are based on the idea that everything has to be portable between systems. But that’s not my prime consideration. My prime consideration is that programs be portable between human languages. We’re going to be “doing science” on language, and it only makes sense if we can apply it to any language. If I write a function count_letters(), I want it to work with English, French, Georgian, Persian, Cakchiquel, WHATEVER.
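
Here’s the kind of thing I mean, as a tiny sketch. count_letters is just an illustrative name, and it assumes the text is already a Unicode string, which is what the setup described below is all about.

# -*- coding: utf-8 -*-
# One function, any language: count everything the Unicode database calls a letter.
from collections import Counter

def count_letters(text):
    return Counter(ch for ch in text if ch.isalpha())

print(count_letters(u'Ελληνική Δημοκρατία'))  # Greek
print(count_letters(u'ქართული ენა'))          # Georgian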

With that goal in mind, having the default encoding set to utf-8 is the way to go.

And that’s what I might need your help with:

How do I make it as painless as possible for users to set that as the default under all three systems?

I learned from Mark Pilgrim’s book how to change Python’s default encoding. What has to happen is that a file named sitecustomize.py has to be put in the Python library. The trickiness comes in with the fact that my readers won’t necessarily be as savvy about things like file permissions as Mark’s are.

The file needs to contain just two lines:


import sys
sys.setdefaultencoding('utf-8')

That’s it!
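
(A quick sanity check, once the file is in place and Python has been restarted:)

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> georgian = unicode(open('georgian.txt').read())  # no UnicodeDecodeError this time
>>> print georgian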

So I plan to make a “Preliminaries” page on Fieldmethods that goes step-by-step through what you need to do to create that file, and where and how to put it where it needs to be on each platform. (And change it back, for whatever reason.)

Because of some nuttiness when Python starts up, you can’t just stick sitecustomize.py in your current directory, or some other directory that’s in your Python path. (Python deletes sys.setdefaultencoding once startup is finished, so the call has to happen in a file that gets imported during startup, and sitecustomize.py in the standard library location is that file.)

Linux isn’t much of a problem: become root, and create the file in

/usr/lib/python2.3/site-packages/

Or wherever your Linux distro puts your site-packages directory.

I’m not much of a Windows guy, but it’s my impression that under XP all you have to do is save sitecustomize.py in C:\Python23\ or C:\Python23\site-packages\. I don’t think Windows has any concept of “root” to speak of, so it seems that it’s just a matter of creating the file in one of those directories and saving it.

Now, OSX, I’m not so sure about: how do you log in as root? How do you write to that directory? If anyone can help me out I’d appreciate it.


Update:

I’ve gotten some help on the OSX front. It looks like there is a distinction between “admin” users and non-admin users, and the “admin” users pretty much have root. Most people who have their own Macs will probably be admin users, so I’ll make that assumption. (Basically, you either use sudo, or you drag the file into place in the file browser and get prompted for a password. Not too complex.) Off to write up these three sets of steps.

(Thanks for the help, anonymous Mac guy.)



Hi.

Hi there. Check out About me for the big picture.
