Archive for Translation

The Education Equality Act

Here’s a translation issue that will probably end up becoming a media circus in New York, if not elsewhere:

The Education Equality Act (Gotham Gazette. June, 2005)

Intro 464: The Education Equity Act was introduced in the City Council by council members Hiram Monserrate and David Yassky. The legislation requires the Department of Education to translate documents, such as report cards and notices, into the eight most widely spoken languages — Spanish, Chinese, Russian, Italian, French, Yiddish, Korean and Polish — and provide interpretation services for parents who don’t speak English.

Hmm, a quick look at Technorati already uncovers some indignation: Multicultural Madness in NYC implies that the “victims” of this legislation would be the students, except that it’s aimed at parents, who are attempting to help the students learn English. The example cited is a parent who doesn’t read English and doesn’t know that their child is skipping class.

Which is in English.

But whatever.

The conversation around this bill should prove interesting. (Granted, it is a pretty vague title, given what the legislation does.)


Volunteer Translation Banks

I ran across an article from last year on something called a “language bank”: Volunteer translators break down barriers

It describes a program at the Seattle Red Cross that brings together translators for over 75 languages. They help with all kinds of needs that immigrants run into:

The bank and its volunteers negotiate with apartment managers, communicate with citizenship and immigration services, decipher cable bills, and even assist in emergency situations such as residential fires; it all adds up to about 4,000 cases a year.

I was unsurprised to find, after a little digging, that there’s a similar program in my own Montgomery County, Maryland: the Montgomery County, MD - Language Bank.


I’ve done a tiny bit of interpreting and also some translation before, and lemme tell ya, it’s hard work. To do it under the kind of pressure that I’m sure these programs run into must be at least, uh, stressful.

The administrators and translators at these language banks deserve a lot of appreciation.

It seems like the only language policy stories you’ll ever read in big media in the States are about the English-only movement. But language banks are also concrete reminders that the US is actually an incredibly multilingual society, probably one of the most multilingual societies in the world.

We should be proud of that.


Web equivalents to the OSX dictionary application

I’ve heard mixed reviews of OSX Tiger, but the little dictionary widget seems to be universally popular. There’s actually a class of applications that do something similar on the web, usually through a proxy:

I’d be interested to know of any others!


Transliteration as Poor Man’s Translation

Here’s a thought I’ve never gotten around to implementing or really trying out.

Transliteration is the process of converting text in one script into another script. Here’s an example from Wikipedia: Greek -> English:

Greek Script: Ελληνική Δημοκρατία
Transliteration: Ellēnikē Dēmokratia
Transcription: Elliniki Dimokratia

The details of such conversion are pretty complex — there are two distinct systems of conversion here. The Wikipedia article tries to maintain a distinction between “transcription” and “transliteration,” but whatever, you get the idea: convert from one writing system to another.

Now, let’s suppose you have reason to believe, as blogger Ethan Zuckerman recently did, that there is an article written about you in a language you don’t know:

…two days ago, when ego-surfing Technorati, I discovered that a Saudi blogger had linked to me, mentioning that an interview with me had just been published in Al-Hayat. I can’t read Arabic, but the few English phrases in the piece connected to topics I’m deeply interested in. So hey, perhaps it was an interview with me.

Let’s imagine Ethan wasn’t fortunate enough to find an Arabic-speaking blogger to translate the article for him (in this instance, he was). Is there some way he might be able to determine whether his name is in the thing at all?

Maybe so, using automated transliteration (or transcription, whatever!) and a bit of fuzzy matching.

When you get right down to it, the basic operation in transliteration is just making a bunch of substitutions. As with many tasks related to language processing, the best first step is often to simply think of what you’d do if you had to accomplish the task by hand.

Well, let’s say you were going to work with that Greek up there.

Ελληνική Δημοκρατία
Elliniki Dimokratia

(I picked the simpler transliteration system.)

Anyone can do a little inspection and make an educated guess as to which letter corresponds to which… something like this:

Ε E
λ l
λ l
η i
ν n
ι i
κ k
ή i

Δ D
η i
μ m
ο o
κ k
ρ r
α a
τ t
ί i
α a

And of course we’ll need more such pairs to figure out all the letters, but those aren’t hard to find. In fact, we could just cut and paste 30 or 40 words from Wikipedia. (Say, place names: Αθήνα Athína; Θεσσαλονίκη Thessaloníki; Πελοπόννησος Peloponnesos, etc.)

Once we’ve done that, we can write a simple program which will make those substitutions, and go from one script to the other.
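Here is a minimal sketch of what such a program might look like in Python. The table is hand-built and covers only the letters in the two example words, and the names `GREEK_TO_LATIN` and `transliterate` are made up for illustration. A real table would need every letter, plus rules for letter combinations that a one-to-one mapping can't express.

```python
import unicodedata

# Hand-built table covering only the letters in the two example words
# (an illustrative assumption, not a complete Greek romanization scheme).
_PAIRS = {
    "ε": "e", "λ": "l", "η": "i", "ν": "n", "ι": "i", "κ": "k", "ή": "i",
    "δ": "d", "μ": "m", "ο": "o", "ρ": "r", "α": "a", "τ": "t", "ί": "i",
}
# Normalize keys so accented letters match however they were typed.
GREEK_TO_LATIN = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}

def transliterate(text, table=GREEK_TO_LATIN):
    """Substitute each character found in the table; pass the rest through."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        latin = table.get(ch.lower(), ch)
        # Carry the capitalization of the source letter over to the output.
        out.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(out)

print(transliterate("Ελληνική Δημοκρατία"))  # Elliniki Dimokratia
```

Characters missing from the table are passed through unchanged, so gaps in the table show up as leftover Greek letters in the output, which is a handy way to see which pairs still need to be added.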

And of course, this is all grossly simplified and won’t work very well at all.
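Fuzzy matching is the part that might make a rough transliteration usable anyway. The sketch below assumes the article has already been converted to Latin script somehow, and uses Python's standard `difflib` module to look for tokens that resemble a name. The function name `find_name`, the cutoff of 0.7, and the romanized sample string are all invented for illustration; the sample is not real output from any transliteration system.

```python
import difflib

def find_name(name, romanized_text, cutoff=0.7):
    """Return tokens of romanized_text that look similar to name, best first."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in romanized_text.split()]
    tokens = [t for t in tokens if t]
    return difflib.get_close_matches(name.lower(), tokens, n=3, cutoff=cutoff)

# Invented sample text standing in for a crudely romanized article.
sample = "muqabala maa ithan zukirman fi al-hayat"
print(find_name("zuckerman", sample))  # ['zukirman']
```

The cutoff is a trade-off: set it too low and unrelated words start matching, too high and a mangled spelling of the name slips past.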

More later…


“Môme du script???”

This is so bizarre… the Office québécois de la langue française proposing official translations for “script kiddie”:

script kiddy / pirate adolescent

Note(s) : Les pirates adolescents utilisent des programmes de script conçus par d’autres au lieu d’en créer eux-mêmes. Généralement, ils laissent des traces pour marquer leur passage. Leur but est souvent la célébrité (ou tout au moins d’impressionner les copains). Ils constituent une menace pour tous les systèmes informatiques, puisqu’ils font habituellement une sélection aléatoire de leurs cibles.

Les termes "pirate adolescent" et "pirate ado" ont été proposés par l’Office de la langue française comme équivalents de script kiddy. Une traduction littérale de l’anglais donnerait : enfant du script, môme du script ou gamin du script.

My lousy translation (my French is really bad):

Note(s): Adolescent pirates use script programs written by others instead of creating them themselves. Generally, they leave traces that mark their passage. Their goal is often celebrity status (or at least to impress their friends). They are a menace for all computer systems, since they tend to select their targets randomly.

The terms "adolescent pirate" and "teen pirate" have been proposed by the Office of the French Language as equivalents of script kiddie. A literal translation from the English would give: script kiddie, script monkey, or script urchin.

I don’t think that language academies are necessarily as pointless as most linguists would argue (few programmers would argue that the W3C is pointless, by way of comparison — standardization is sometimes necessary), but this is beyond absurd.

Official translations for slang?



If you’re interested in (machine) translation, check out Jeremy Faludi’s post on the Phraselator, a handheld real-time translation device.

It got its start, as many technologies do, in military use, but apparently the things are beginning to become available to the public. The actual data is stored on flash cards. I think there’s a niche for these devices, but one has to wonder how long they’ll survive in the face of the increasing ubiquity and processing power of cell phones.


Text and Meaning

An interesting post at “The Translator’s Blog”: The translation of text vs. the translation of meaning.

A colleague raised the issue of translation at the beginner stage, when you basically just “run” through the text word by word to polish the style afterwards, again and again until it works for you.

The experienced translator, in contrast, will extract the meaning of a text and start from there rather than “copy” the actual word into his target language.

I’ve been experiencing this distinction firsthand of late. I’ve been using a new, open source CAT (Computer-Aided Translation) tool called OmegaT to do some translations of my own, from Portuguese (oi Jonas) and also from Welsh (hylo Nic). I definitely fall into the “polish the style afterwards” camp, although I’d have to say that “polish” may be too ambitious a word: my Portuguese is rusty, and my Welsh is… well, no comment. Give me a couple more years.

Translation is a strange game. In a weird way, I can imagine being a good translator of a language without being a terribly fluent speaker. Becoming fluent is a process of internalizing the language completely, to the point where you speak by intuition. Translation is more like being hyper aware of all the details of both languages at once: you have to know every possible rendering of a phrase in the target language, in order to reflect the original text as idiomatically as possible. That’s the impression I have had, in any case, in my limited attempts at translation.

(By the way, I’ve been caught up in another project, but that little phrase-splitting script I mentioned in the previous post will be coming up soon. I promise.)


A Lo-Fi Stab at Automatically Finding Phrases

Part-of-speech tagging (often just called “POS tagging”) is one of the few NLP tasks that routinely get very high accuracy scores, usually in the high nineties. So the idea is something of an old chestnut in NLP. The specific tags used vary a lot, but this will give you an idea of what tagged text looks like, before and after:

Automatically add part-of-speech tags to words in text.

Automatically/adverb add/verb part-of-speech/adjective tags/noun to/to words/noun in/preposition text/noun ./punctuation

So, assuming you’ve gone through some hocus-pocus to end up with a bunch of tagged text, what do you do with it?

Well, certain patterns of parts of speech tend to indicate terminology. If the same two nouns keep showing up next to each other in a text, for instance, chances are that the words are related, that is, they constitute a “term.” This approach is described in section 5.1 of this chapter of Manning and Schütze’s text. It’s a very intuitive approach, and even a very simple implementation will turn up some useful stuff.
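To make that concrete, here is a rough Python sketch that counts adjacent word pairs whose tags match a pattern. It assumes the text is already tagged in a `word/tag` format with spelled-out tag names; the two patterns in `TERM_PATTERNS` and the sample sentence are illustrative choices, not a reproduction of the method in the book.

```python
from collections import Counter

# Tag patterns that tend to signal terminology (an assumed, minimal set).
TERM_PATTERNS = {("adjective", "noun"), ("noun", "noun")}

def parse_tagged(text):
    """Split 'word/tag word/tag ...' into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash so words containing "/" survive intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word.lower(), tag))
    return pairs

def candidate_terms(tagged):
    """Count adjacent word pairs whose tags match one of TERM_PATTERNS."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in TERM_PATTERNS:
            counts[(w1, w2)] += 1
    return counts

# Invented sample, tagged by hand.
sample = ("language/noun bank/noun volunteers/noun staff/verb "
          "the/determiner language/noun bank/noun")
print(candidate_terms(parse_tagged(sample)).most_common(1))
# [(('language', 'bank'), 2)]
```

Pairs that match a pattern only once are probably noise; the ones that keep recurring are the candidates worth a look.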

This kind of “shallow analysis” doesn’t attempt to find any long-distance relationships between phrases and words. There’s no parsing going on, in other words, just some pattern matching.

So part-of-speech tagging has its uses.

But what if we went even shallower than that? What if we tried to look exclusively at the statistical patterns of words? Can we get any useful information out?

So, I have in mind a small experiment, and I’ll just write it up here as I go.

As I’ve mentioned, one interest of mine is translation, of the human variety (as opposed to fully automated machine translation). I’m no pro at translation, so I don’t know if more qualified people feel the same way, but I find that it can be a pretty tedious endeavor. I find myself looking at a long sentence that I need to translate, and wishing that I could somehow subdivide it into phrases. So here’s a sentence I translated recently:

Es natural que exista un gran contraste entre su estilo de vida original y la nueva sociedad, que debe ser superado paulatinamente, pero la manera de resolverlo de Tatum bordea la ilegalidad.

What I want to see is something along the lines of:

Es natural
que exista
un gran contraste
entre su estilo
de vida original
y la nueva sociedad,
que debe ser superado paulatinamente,
pero la manera
de resolverlo
de Tatum bordea la ilegalidad.

I’m not even looking for indentation, I just want to see sensible subphrases on their own line so I can break down the translation process. Like I said, I don’t know if real translators work this way, but when I translate, this approach seems like it would be useful.

So how to automate it? If we had a part of speech tagger for Spanish (and I’m sure it wouldn’t be too difficult to find one), and a parser (which would be a bit harder to find), we’d write some system to place noun phrases on their own line, etc etc.

But that seems like overkill… we’re just splitting up a sentence into more manageable chunks, after all. So here’s my observation: in the sentence I split up manually above, the words which begin each line seem to be frequent. What if we “split on frequency”? We’ll take the most frequent words in the document, and split up the sentence into subphrases beginning with those words. Will it be useful? I have no idea, let’s see. Tomorrow. 8^)
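As a rough illustration of the “split on frequency” idea (a sketch only, not the script promised above), the Python below counts words in a document, keeps the most frequent ones, and starts a new line before each of them. The function names and the `top_n` parameter are hypothetical, and the single example sentence stands in for a whole document.

```python
import re
from collections import Counter

def frequent_words(document, top_n=10):
    """Return the top_n most frequent lowercase words in the document."""
    words = re.findall(r"\w+", document.lower())
    return {w for w, _ in Counter(words).most_common(top_n)}

def split_on_frequency(sentence, frequent):
    """Start a new line before every word that is in the frequent set."""
    lines, current = [], []
    for token in sentence.split():
        # Compare without punctuation, but keep the original token in the output.
        word = re.sub(r"\W+", "", token).lower()
        if word in frequent and current:
            lines.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        lines.append(" ".join(current))
    return lines

sentence = ("Es natural que exista un gran contraste entre su estilo de vida "
            "original y la nueva sociedad, que debe ser superado paulatinamente, "
            "pero la manera de resolverlo de Tatum bordea la ilegalidad.")
# Assumption: the sentence itself stands in for the full document here.
for line in split_on_frequency(sentence, frequent_words(sentence, top_n=3)):
    print(line)
```

With only this one sentence to count, the three most frequent words are “de,” “la,” and “que,” so the breaks fall before those. Counting over a full document, with a larger `top_n`, would presumably pull in words like “un,” “y,” and “entre” as well.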


Web-Friendly Translation

I heard about a project called IRMA a while back, and their booth at a recent desktop Linux convention was described like this:

There was an IRMA booth at the conference. IRMA is a distributed group of translators that coordinate the task of internationalizing open source software.

They have a really slick website for brokering the resources of volunteer translators in order to aid open source developers. Since the translations are all open-sourced, they can be reused in other open source projects. They actually have a recommendation feature that finds similar strings in other projects that have already been translated in order to suggest translations for new apps. They currently support 45 different languages.

It seems to be the case that most open source efforts related to translation are about i18n and l10n, that is to say, collaboration on translating dialog boxes, GUIs, all that stuff — all about translating software.

Of course, software translation has been going on for a long time — what’s new is the presence of web interfaces to the actual translation. IRMA isn’t the only one, there’s also Pootle, a product of the fine work done at

The role of multilingualism in blogging, on the other hand, is a relatively new domain. Stephanie Booth has come up with an interesting Wordpress plugin for bloggers who blog in more than one language. There’s also been an interesting conversation running between Tim Oren and Kevin Marks about improving the findability of translations of content.

All this stuff is very good news.

Still, while these tools are certainly important in firing up bridge blogging, there is still room for other tools for translating content. I’ve been mulling a few ideas of my own, which aren’t really ready for prime time, but I’ll talk about them here sooner or later.

