infundibulum

Um, any day now.

July 29th, 2005

Well begun is not even halfway finished. I KNOW YOU WERE WONDERING… what the titles of the unfinished posts in my stack of drafts were.

…weren’t you?

  • The Unicode Problem in Python
  • Unicode Again…
  • The Way Multilingualism Looks
  • Google Toolbar: Autolink and WordTranslator
  • cutting snippet
  • Localization, Firefox, and Google
  • 6 O’Clock Links
  • Google and Machine Translation
  • Wolof and Unicode
  • A Hmong Messianic Script
  • More wacky ideas on remote hospital interpretation
  • Home Grown Search
  • Down with the Count: Regular Expressions vs. Statistics: PART THE FIRST

Sitepoint’s CSS and DHTML Books

July 25th, 2005

I’ve recently become a fan of Sitepoint’s books on programming. They’re very cleanly put together, and generally speaking seem to be quite up to date. Here are a couple of titles I went ahead and took the plunge on:

HTML Utopia: Designing Without Tables Using CSS
I like this book quite a bit. The CSS reference in the back is almost worth the price of admission… there are references online (duh) but I guess I’m just still a sucker for paper. There’s a lot of useful info on styling text, which turns out to have more tricks available than I’d ever heard of. One thing about this book that annoyed me intensely was in chapter 6, “Putting Things in Their Place,” when he gives a Javascript solution to the problem of getting columns to flow to equal heights. Admittedly, he gives an alternative, but there are a lot of pure CSS solutions to this problem out there, and one would think that if a reliable one exists, this would be the book to find it in. So yeah, that bit rubbed me the wrong way.
DHTML Utopia: Modern Web Design Using JavaScript & DOM
I’ve been looking forward to this one for quite a while. At that link you can get the first four chapters for free. To be honest, I debated whether to buy the book, because judging from the table of contents, it seems that most of the stuff that I had doubts about was in the free sample chapters. But I’m a big fan of the author and editor: Stuart Langridge through the ridiculously awesome LugRadio (or listen on Odeo) and Javascript/Python guru Simon Willison. So in the end I felt pretty good about picking up a copy. Haven’t started digging in yet. One nit to pick: forty smackers is a lot to ask for a book that’s just 300 pages. Not saying it won’t turn out to be worth it in the end, but dag.
update… The sample chapters are available as HTML now: DHTML Utopia: Modern Web Design Using JavaScript & DOM. I can’t seem to get the example from this chapter to work, though, can you?

All this DHTML stuff is surprisingly fun. And I’ve mentioned before that Javascript has the right policy on Unicode, which makes me pretty happy.

Like this ☞ ☺

Especially considering the headache that is dealing with multibyte stuff in just about every other scripting language. Which makes me kind of sad.

☹ ☜ Like that.

Unicode News

July 23rd, 2005

I thought it might be interesting to look through what a Technorati search for “Unicode” turns up recently. This may be of no interest to others… but I like whiling away hours reading about Unicode.

Heh, stop that snickering. I could have a crack habit.

  • Urdu Blogging: Discussion on Urdu Content Management: “I have deliberately chosen only the sites that use Unicode Urdu.”
  • A pretty long thread over at the SitePoint Forums - how to differenciate between unicode and plain text. This is a surprisingly complex task, especially when you’re talking about web apps (and isn’t everyone?). There is what looks to be a link to a pretty interesting reference at the end of the thread, but it was down when I checked…
  • Interesting: a Chinese blogger explicitly requesting that users switch to Unicode. love is beautiful. Isn’t it though? Now first of all… I thought all Blogspot blogs were sent as UTF-8 in the first place. But my browser (Firefox) defaulted to ISO-8859-1 (which is the same thing as Latin-1, IIRC). So I had to heed the blogger’s request to see the Chinese: “change yr character coding of yr browser to unicode if u cant c e Chinese characters above.” Weird.
  • Okay, doubly weird, another Blogspot blog with the same problem: fallen angel says: “*to view -> view -> encoding -> unicode (utf-8)” I’m not sure what’s going on here, and I’m too tired to venture a guess. Explanations welcome. It seems to be related to Blogspot, and it’s not just a Chinese thing: here’s the same problem on an Urdu blog. Ok, one more:
  • Malayalam Related Topics Oh, jackpot. A whole blog about Malayalam and Unicode. Yikes, according to the Malayalam Unicode font tester, neither Opera nor Firefox passes (under Linux, anyway). Here are some screenshots, see for yourself (I added the red boxes): part a, part b. Opera does slightly better. Man, Malayalam is one complex script.
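One plausible explanation for the Blogspot symptom above is that the pages were sent as UTF-8 bytes while the browser decoded them as ISO-8859-1. That is a guess, not something confirmed for those particular blogs. Here is a minimal Python sketch of the effect; the sample string is made up for illustration and is not taken from either blog.

```python
# Hypothetical sample text, not copied from the blogs mentioned above.
text = "中文"

# What a page encoded as UTF-8 actually puts on the wire.
utf8_bytes = text.encode("utf-8")

# What a browser shows if it assumes ISO-8859-1: every byte becomes
# its own (wrong) character, so the Chinese turns into accented junk.
garbled = utf8_bytes.decode("iso-8859-1")

# Switching the browser's encoding to UTF-8 amounts to re-reading
# the very same bytes with the right decoder.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(len(text), len(garbled))  # 2 characters become 6
print(recovered == text)
```

Nothing is lost in the garbled version, which is why changing the browser's encoding setting is enough to bring the text back.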

“The Most Dangerous Civilian Job in Iraq”

July 18th, 2005

An opinion piece from the Japan Times:

The most dangerous civilian job in Iraq.

Being an interpreter, of course.

Interpreting is the most dangerous civilian job among employees of private contractors with the U.S. Labor Department. Interpreters’ deaths accounted for more than 40 percent of the more than 300 death claims filed by all private contractors operating in Iraq.

One interpreter said if he were caught by insurgents his head would be cut off because imams say interpreters are spies. This interpreter has been threatened 15 times, including by a neighbor. One female interpreter was shot execution-style at her home in front of her family.

Yikes.

Meet Blòg d’Oc!

July 18th, 2005

A few nights back I spent a while talking a friend of mine into starting a blog about her studies of Occitan and thereabouts. I even hacked up a personalized blog layout to persuade her… to make a long story short, I think we may have a new linguablogger on the block ☺

Check it out: Blòg d’Oc.

As she’s just dipping her toes into the blogosphere at this point, she’s adopting the penname of Jo, which is the Occitan word for “I”. Her recent work has consisted of studying Gascon dialectology as well as general theoretical linguistics. She tells me that the blog will be multilingual: English, Occitan (the Gascon dialect, right Jo?), German, and perhaps some French as well. So give her the official linguablog welcome!

Er… is there an official linguablog welcome? There should be. ☺

(By the way, anyone know of other blogs in Occitan?)

Yahoo’s Cross-Language Search

July 15th, 2005

Yahoo! Search blog: Sprechen Sie Deutsch?

Machine translation was once a rather obscure field — until Babelfish hit the web, I suppose.

Wait, I take that back. There was a period of time back in the 50’s when MT was very much in the public eye — until it became clear that it wasn’t going to be useful (well, not for a few more decades, anyway). Check out this nice history of MT in a nutshell for details.

But I digress.

If you’ve nosed around in academic MT within the last decade or so (or even just poked dilettanteishly at its periphery, as I have), then you were surely inundated by the torrents of boffin-speak.

I find it interesting to watch public-facing search engine companies like Google and Yahoo being forced to find simple terminology to describe their work in MT. I often find myself mentally, uh, translating from these more verbose descriptions back into the terminology of academia. From the above link, for instance:

So what does this really mean? We apply our Yahoo! Search Translation Technology by taking your query, looking across the entire Web and across languages to assemble the most comprehensive set of relevant results, and then returning that information in your local language.

“Oh, you mean this thing does CLIR…”

People complain a lot about technical terminology, but of course it’s actually useful. It’s just that it’s more trouble than it’s worth, for most people. In any case, it’s great to see this kind of tech seeping out onto the web.

Elvish Opera

July 11th, 2005

Oh boy…

Made-up language makes for tough translation

“Many of the singers have sung in a dead language, but never in a made-up language,” laughs chorus Music Director L. Brett Scott. “Normally when you sing in a foreign language there is someone in the choir who speaks it, or you can track down someone who has studied the language or who can help. But you can’t do that with Elvish.”

You gotta be kidding me!

Obviously these orchestra types don’t fraternize with D&D types.

Compuglot on Language and Business

July 8th, 2005

compuglot, a nice blog on computing and language that I’ve been following for a while now, comments on how language skills affect business:

Monoglot Means Mono-Business:

It seems obvious that to sell in foreign markets you need to speak the language, but apparently it isn’t obvious enough for some companies. The most complacent seem to be those in English speaking countries.

He points to a BBC article about the UK’s shrinking language skills, and concomitant effects on trade:

…when trading with English-speaking countries, UK businesses export more than they import - but when they trade with non-English speakers, the exports are much greater than the imports.

This trade gap is linked by the report to the language gap - with claims that contracts are lost and opportunities missed because UK firms are failing to develop the language skills needed to communicate.

I’ve mentioned before that I suspect that the multilingual nature of the US is in fact one of its greatest assets, and that we should recognize the fact that the US is already a very polyglot society, even if the law doesn’t recognize that fact. Not a single day goes by here in Maryland that I don’t hear at least four languages, usually including at least English, Spanish, and Amharic; I’ve also heard Haitian Creole, French, Wolof, Arabic, German, Persian, Portuguese (both Brazilian and continental), several flavors of Chinese, Thai, Vietnamese, and too many others to name (or identify).

The report described in the BBC article doesn’t describe the situation in the US, but I wonder, just how much of the importing and exporting done in the US is arranged in some language besides English? How critical a part of our economy is such trade? I’d like to know; I suspect it’s more than most media opinion and kvetching over the “monoglot American” would suggest.

Explosions on the London metro (translation)

July 7th, 2005

Here is a translation I just did of a blog post by someone who was on the Piccadilly Line this morning:
::Hora Cero::: Explosiones en el metro de Londres

NOTE: I am not a professional translator. I found this post through a search engine and I can’t verify its veracity in any way.

When I left the Piccadilly metro I was surprised because the Piccadilly line itself had been suspended. It’s one of the most important lines, and to suspend it there must have been something very serious…

When I arrived at ComSec they told me that there had been an explosion in the metro on Liverpool Street. One of the architects said: “We won the Olympics, and the next day the metro system goes to shit because of… rain,” since that was the first story that the BBC was giving, that several generators had exploded because of the rain.

Now there is general chaos in the city. They have shut down all the metro lines, and rumors about terrorist attacks have started to circulate, etc…

Update (10:30):

They just found a device that seems to be a bomb in one of the metros…

It’s for sure that the GSM networks have collapsed…

Update (10:43):

They’re saying that several buses “flew” through the air… the matter is getting more complicated, although there is a lot of disinformation going around. A few people were just interviewed, describing the situation on television.

Update (10:59):

Scotland Yard just gave an order to hospitals that they not accept any emergencies except for critical cases. They are evacuating Liverpool Street and Edgware Road; there are two trains trapped with people inside. Edgware Road is one of the most problematic cases. You have to enter and leave by elevator, and the service stairs are horrible.

Tamil Blogs and Unicode

July 6th, 2005

A letter in the Tamil script.

Seems to have become “languages of India week” here at Infundibulum.

Pankaj Narula dropped a friendly comment on my previous post on Hindi and Unicode, explaining that Hindi blogs are in fact almost universally encoded as Unicode, thanks in large part to Blogger.com’s good Unicode support. And so it seems that among Hindi bloggers at least, everyone is quite up-to-date with their language technology…

Just for fun I decided to poke around in the Tamil blogosphere to see if the situation was similar, and it turns out that Blogspot is equally prominent among Tamil bloggers:

Of the 613 blogs in Tamil listed in the directory at the Tamil Bloggers List, 513 (about 84 percent) are hosted on Blogspot, so we can assume that most Tamil blogs are encoded in Unicode.

After a bit of digging I could only find one blog among the non-Blogspot crowd that seemed to have encoding troubles: “peyariliyin pinaaththalgaL.” At first I thought it was in some mysterious legacy encoding, but it turns out that blogdrive.com seems to have its servers set to send Windows-1252. This one, on the same server, specified more fonts and ended up being visible to me. So it was mainly a font thing. (Update: That blog is utf-8 now as well.)

(Incidentally, poking about in the occasional English comment in Tamil blogs, it seems to be the case that the language name is often transliterated as Tamizh rather than Tamil.)

The rather magnificent-looking Tamil letter up there in the corner is TAMIL LETTER I (U+0B87). It looks fun to write. ☺
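For the curious, here is a small Python sketch of how that one letter is encoded. It uses only the standard `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

# The code point named in the post: U+0B87.
letter = "\u0b87"

# Look up the official Unicode character name for the code point.
name = unicodedata.name(letter)

# In UTF-8 (which is what Blogspot pages are assumed to use here),
# this single letter takes three bytes.
utf8 = letter.encode("utf-8")

print(name)
print(utf8.hex(" "))
```

Tamil sits in the U+0B80 to U+0BFF block, so every letter in the script comes out as three bytes in UTF-8.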

By the way, am I weird to be obsessed with figuring out how languages I can’t speak a word of are encoded?