infundibulum

Text and Meaning

February 25th, 2005

An interesting post at “The Translator’s Blog”: The translation of text vs. the translation of meaning.

A colleague raised the issue of translation at the beginner stage, when you basically just “run” through the text word by word to polish the style afterwards, again and again until it works for you.

The experienced translator, in contrast, will extract the meaning of a text and start from there rather than “copy” the actual word into his target language.

I’ve been experiencing this distinction firsthand of late. I’ve been using a new, open source CAT (Computer Aided Translation) tool called OmegaT (http://www.omegat.org/omegat/omegat.html) to do some translations of my own, from Portuguese (oi Jonas) and also from Welsh (hylo Nic). I definitely fall into the “polish the style afterwards” camp, although I’d have to say that “polish” may be too ambitious a word: my Portuguese is rusty, and my Welsh is… well, no comment. Give me a couple more years.

Translation is a strange game. In a weird way, I can imagine being a good translator of a language without being a terribly fluent speaker. Becoming fluent is a process of internalizing the language completely, to the point where you speak by intuition. Translation is more like being hyper-aware of all the details of both languages at once: you have to know every possible rendering of a phrase in the target language, in order to reflect the original text as idiomatically as possible. That’s the impression I have had, in any case, in my limited attempts at translation.

(By the way, I’ve been caught up in another project, but that little phrase-splitting script I mentioned in the previous post will be coming up soon. I promise.)

A Lo-Fi Stab at Automatically Finding Phrases

February 21st, 2005

Part-of-speech tagging (often just called “POS tagging”) is one of the few NLP tasks that routinely get very high accuracy scores, usually in the high nineties. So the idea is something of an old chestnut in NLP. The specific tags used vary a lot, but this will give you an idea of what tagged text looks like, before and after:

Automatically add part-of-speech tags to words in text.

Automatically/adverb add/verb part-of-speech/adjective tags/noun to/to words/noun in/preposition text/noun./punctuation

So, assuming you’ve gone through some hocus-pocus to end up with a bunch of tagged text, what do you do with it?

Well, certain patterns of parts of speech tend to indicate terminology. If the same two nouns keep showing up in a text, for instance, chances are that the words are related, that is, they constitute a “term.” This approach is described in section 5.1 of this chapter of Manning and Schütze’s text. It’s a very intuitive approach, and even a very simple implementation will turn up some useful stuff.
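To make the noun-plus-noun idea concrete, here is a minimal sketch in Python. It assumes text already tagged in the word/tag style shown above; the tag names, the sample text, and the helper names `parse_tagged` and `candidate_terms` are all made up for illustration, and this is a simplification rather than the exact filter from Manning and Schütze.

```python
from collections import Counter

def parse_tagged(text):
    """Split 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so 'part-of-speech/adjective' works too
        word, _, tag = token.rpartition("/")
        pairs.append((word.lower(), tag))
    return pairs

def candidate_terms(pairs, pattern=("noun", "noun")):
    """Count adjacent word pairs whose tags match the given two-tag pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(pairs, pairs[1:]):
        if (t1, t2) == pattern:
            counts[(w1, w2)] += 1
    return counts

# Invented sample text, tagged by hand for the sake of the example
tagged = (
    "the/determiner language/noun model/noun scores/verb each/determiner "
    "sentence/noun ./punctuation a/determiner language/noun model/noun "
    "is/verb trained/verb on/preposition text/noun ./punctuation"
)

terms = candidate_terms(parse_tagged(tagged))
for pair, count in terms.most_common():
    print(" ".join(pair), count)
```

Pairs that keep recurring ("language model" here) are the candidate terms; on real text you would probably want to ignore pairs that only show up once.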

This kind of “shallow analysis” doesn’t attempt to find any long-distance relationships between phrases and words. There’s no parsing going on, in other words, just some pattern matching.

So part-of-speech tagging has its uses.

But what if we went even shallower than that? What if we tried to look exclusively at the statistical patterns of words? Can we get any useful information out?

So, I have in mind a small experiment, and I’ll just write it up here as I go.

As I’ve mentioned, one interest of mine is translation, of the human variety (as opposed to fully automated machine translation). I’m no pro at translation, so I don’t know if more qualified people feel the same way, but I find that it can be a pretty tedious endeavor. I find myself looking at a long sentence that I need to translate, and wishing that I could somehow subdivide it into phrases. So here’s a sentence I translated recently:

Es natural que exista un gran contraste entre su estilo de vida original y la nueva sociedad, que debe ser superado paulatinamente, pero la manera de resolverlo de Tatum bordea la ilegalidad.

What I want to see is something along the lines of:

Es natural
que exista
un gran contraste
entre su estilo
de vida original
y la nueva sociedad,
que debe ser superado paulatinamente,
pero la manera
de resolverlo
de Tatum bordea la ilegalidad.

I’m not even looking for indentation, I just want to see sensible subphrases on their own line so I can break down the translation process. Like I said, I don’t know if real translators work this way, but when I translate, this approach seems like it would be useful.

So how to automate it? If we had a part-of-speech tagger for Spanish (and I’m sure it wouldn’t be too difficult to find one), and a parser (which would be a bit harder to find), we’d write some system to place noun phrases on their own line, and so on.

But that seems like overkill… we’re just splitting up a sentence into more manageable chunks, after all. So here’s my observation: in the sentence I split up manually above, the words which begin each line seem to be frequent. What if we “split on frequency”? We’ll take the most frequent words in the document, and split up the sentence into subphrases beginning with those words. Will it be useful? I have no idea, let’s see. Tomorrow. 8^)
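Here is a rough sketch of what “split on frequency” could look like in Python. It is only one guess at the idea, not the finished script: the function names are hypothetical, and the set of frequent words in the usage example is hand-picked to stand in for a real frequency count over a whole document.

```python
import re
from collections import Counter

def frequent_words(document, top_n=10):
    """Return the top_n most frequent words in a document, lowercased."""
    words = re.findall(r"\w+", document.lower())
    return {word for word, _ in Counter(words).most_common(top_n)}

def split_on_frequency(sentence, frequent):
    """Start a new subphrase whenever a token is one of the frequent words."""
    lines, current = [], []
    for token in sentence.split():
        # Strip trailing punctuation only for the lookup; keep the token as written
        word = token.strip(".,;:!?").lower()
        if word in frequent and current:
            lines.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        lines.append(" ".join(current))
    return lines

sentence = (
    "Es natural que exista un gran contraste entre su estilo de vida original "
    "y la nueva sociedad, que debe ser superado paulatinamente, pero la manera "
    "de resolverlo de Tatum bordea la ilegalidad."
)

# Hand-picked stand-in for frequent_words(whole_document)
frequent = {"que", "un", "entre", "de", "y", "pero"}

for line in split_on_frequency(sentence, frequent):
    print(line)
```

With this hand-picked set the output matches the manual split above. A real frequency list would almost certainly include “la” as well, which would chop the sentence into smaller pieces, so how many top words to use is something to experiment with.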

Web-Friendly Translation

February 15th, 2005

I heard about a project called IRMA a while back, and their booth at a recent desktop Linux convention was described like this:

There was an IRMA booth at the conference. IRMA is a distributed group of translators that coordinate the task of internationalizing open source software.

They have a really slick website for brokering the resources of volunteer translators in order to aid open source developers. Since the translations are all open-sourced, they can be reused in other open source projects. They actually have a recommendation feature that finds similar strings in other projects that have already been translated in order to suggest translations for new apps. They currently support 45 different languages.

It seems to be the case that most open source efforts related to translation are about i18n and l10n, that is to say, collaboration on translating dialog boxes, GUIs, all that stuff — all about translating software.

Of course, software translation has been going on for a long time; what’s new is the presence of web interfaces to the actual translation. IRMA isn’t the only one: there’s also Pootle, a product of the fine work done at Translate.org.za.

The role of multilingualism in blogging, on the other hand, is a relatively new domain. Stephanie Booth has come up with an interesting Wordpress plugin for bloggers who blog in more than one language. There’s also been an interesting conversation running between Tim Oren and Kevin Marks about improving the findability of translations of content.

All this stuff is very good news.

Still, while these tools are certainly important in firing up bridge blogging, there is still room for other tools for translating content. I’ve been mulling a few ideas of my own, which aren’t really ready for prime time, but I’ll talk about them here sooner or later.

Why I Love Linux

February 13th, 2005

Actually, I love Linux for lots of reasons.

But here’s one you have to see to believe:

I was about to start a new image in The Gimp, so I opened up the template dialog:

[Image: Selecting a template size in The Gimp]

They thought of everything.

Gimp will print on ANYTHING.

That kind of insanity doesn’t happen on those other operating systems.

:-)

My Favorite Techie/Language Books

February 8th, 2005

Here’s a list of books that I like that are more or less related to the intersection of language and computing. I make no attempt to justify the grouping — it’s just that I refer to them enough that somebody else out there might be interested.

The Elements of Typographic Style, Robert Bringhurst
Although this is very much a book about print, I still think it’s a great introduction to the nature of typography. There’s an appendix which is especially useful for looking up the names of funny characters like Ą and Đ and so forth. You may think that’s something you can do by just searching for the character in the Unicode tables, but LATIN CAPITAL LETTER A WITH OGONEK only tells you so much. Bringhurst gives you much, much more. Besides, the book itself is one of the most beautiful pieces of typography I’ve ever seen.
Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard, Richard Gillam
If you really want to dig into Unicode (doesn’t everyone?), this is the book. If you’re a geeky-leaning language nerd, and are wondering if getting into internationalization and localization and programming and stuff like that is for you, then this is probably also the book to start with. Even reference tomes like Daniels (see below) are now out of date in the sense that they don’t convey how various writing systems are represented electronically. This book does that capably and readably, as opposed to the dry-as-dust Unicode specification itself. Even I haven’t read that. People don’t seem to realize what an amazing, amazing thing Unicode is. Just browsing this book conveys that.
Jurafsky & Martin and Manning & Schuetze
These two are NLP (Natural Language Processing) textbooks. They’re more on the mathematical side, and contain no code to speak of, outside of pseudocode for describing algorithms. They’re often mentioned together because they’re sort of complementary: J&M leans toward symbolic approaches (it’s heavy on parsing), whereas M&S leans more toward the statistical approach (which I personally find more interesting). Both require a significant dedication to understand. (I’ve only made dents.)
Text Processing in Python, David Mertz. (also free online)
Some pretty sound advice on handling text in Python. I don’t particularly like the approach he takes to Unicode, however.
The World’s Major Languages, Bernard Comrie, ed.
This is linguistics stuff. It’s probably the best single book for syntheses of grammar, phonetics/phonology, and writing systems of a broad variety of “important” languages. Of course, in this context “important” can be interpreted to mean “Let’s argue!” In my humble opinion, it’s absurd that Mayan or Quechua or Guaraní or at least one American language wasn’t included. But whatever, it’s still a useful book: if you need to know just a little about the structure of a language, and if it’s in here, it’s an excellent place to start.
The World’s Writing Systems, Daniels & Bright.
This is definitely a library-only kind of book. (But if you have a spare $170 lying around, my birthday is coming up next January.) As theory-independent as possible (and much better than Geoffrey Sampson’s Writing Systems in that respect), Daniels & Bright groans under the sheer amount of information it contains. It also groans under the weight of its weight: 919 pages. I’ve xeroxed a few zillion chapters out of here in my day. Endless bemusement.
Longman Dictionary of Contemporary English (searchable online)
This is a bit of an odd choice for this list, but my respect for this dictionary has grown and grown since I first started using it back when I was teaching English as a Second Language. I picked it up because I thought it would be good for learners — and it was. Many of my students ended up buying a copy for themselves. Oddly enough, I found myself using it on a regular basis, just because it’s so clear. I believe its utility is firmly based on one feature: it was built with corpora of actual usage. Not just frequencies of words, but frequencies of phrases. So it gives examples, for instance, of how the word “careful” is actually used: be careful is the most common, followed by careful person/work etc (that is, as an adjective), careful to do sth, and so on. It’s all about exemplification, and nothing about useless grammatical terminology. For a learner, that kind of information is solid gold, and it could only be obtained with statistical approaches to studying language.

Basque Looks Neat

February 5th, 2005

LuistxoBlog - Ingelesen Hilerria

I have a story about visiting Bilbao and thereabouts, but I’m out the door in a moment.

Nonetheless, I feel compelled to mention that I often find myself just looking at stuff in Basque and thinking “that looks so neat.”

In any case, I follow that guy’s English and Spanish blogs, where he writes about interesting stuff, including i18n and l10n and Python. He also writes about Zope, which I’ve always been sort of afraid of.

“Parsing” the State of the Union

February 4th, 2005

I found a really interesting way to look at the distribution of words in text by way of del.icio.us: Parsing the State of the Union.

Aside from the obvious current political interest of this tool, it’s really quite a unique and useful way to look at patterns of words in text.

It would be quite interesting (to me, anyway!) to apply this tool to looking at the distribution of words in translated text, with two columns: the source on the left, and the translation on the right. One should be able to eyeball, to some extent, how well a word and its putative translation correspond between the two columns.
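The numbers behind such a two-column view could be sketched like this in Python. This is just an illustration of the idea, not how the State of the Union tool works: the function names and the sample texts are invented, and it assumes the source and the translation line up sentence for sentence, which real translations often don’t.

```python
import re

def sentence_positions(text, word):
    """Return the indices of the sentences in which the word occurs."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    target = word.lower()
    return [
        index
        for index, sentence in enumerate(sentences)
        if target in re.findall(r"\w+", sentence.lower())
    ]

def overlap(positions_a, positions_b):
    """Share of sentence indices the two words have in common (0.0 to 1.0)."""
    a, b = set(positions_a), set(positions_b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

# Invented sample texts, aligned sentence by sentence
source = "La casa es grande. Vivo en la ciudad. La casa tiene jardín."
translation = "The house is big. I live in the city. The house has a garden."

src = sentence_positions(source, "casa")
tgt = sentence_positions(translation, "house")
print(src, tgt, overlap(src, tgt))
```

A word and its translation that show up in the same sentences score close to 1.0; a pair that rarely co-occurs scores close to 0.0, which is the kind of thing one would otherwise eyeball across the two columns.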

(On a technical note, it’s also interesting to look at how the tool is constructed: rather than using some complicated image generation package or futzing around with CSS properties, he just uses a single pixel with a “width” attribute to map sentences. Very clever.)

Guess what Universe! Unicode will make you INSANE!

February 3rd, 2005

Yes! Yes it will!

At six in the morning you will say “UnicodeEncodeError, why do you treat me thusly?”

And then you will say “Universe, can’t we just go back to the age of ENIAC and tell all those guys ‘PEOPLE, YOUR CHARS MUST HAVE 16 BITS. I KNOW, IT SUCKS, AND IT MEANS YOU’LL ONLY BE ABLE TO STORE 3 OR 4 WORDS IN YOUR ENIACS, BUT BELIEVE ME, IT WILL SAVE YOU A LOT OF UNICODEENCODEERRORS IN THE FUTURE.’?”

Yes, that’s what you’ll ask.

And you will ponder, looking up at the twinkling stars in the sky through your bleary, monitor-begoggled eyes, and you will start seeing constellations spelling out

ユニコードとは何か?

Instead of

ユニコードとは何か?

And then, finally, you look at it again…

ユニコードとは何か?

And you’ll say… Sigh. I guess I’ll just have to get used to reading it that way.

Dumb.

February 2nd, 2005

Dear people who make websites,

Could you please stop making web pages formatted like this?

kthxbye.

Robotic Nation Evidence is Back!

February 1st, 2005

Rad!

There’s a new post on Robotic Nation Evidence, Marshall Brain’s blog on robotic technology and how it’s going to change… well… everything.

If you’ve not heard of this guy then you are missing out. He’s written some really amazing stuff (and he also founded howstuffworks.com, which has some amazing stuff of its own).

But if you’ve not read anything he’s written on robotics, then go read Robotic Nation right now. I’m not joking when I say that it’s had a serious impact on my understanding of the future.

I’m so glad the accompanying blog is back in action; I hope it stays that way.