infundibulum

AAARGH

November 2nd, 2006

They’re sending messages from all over the world to MARS

But the page is encoded in Latin-1!!!

Oh the humanity.
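
A minimal sketch of why that hurts, assuming Python 3. The `can_encode` helper and the sample greetings are hypothetical illustrations, not anything taken from the Mars page itself:

```python
def can_encode(text, encoding="latin-1"):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical messages from around the world.
greetings = ["Hello world!", "Olá mundo!", "你好,世界!", "Привет, мир!"]
for greeting in greetings:
    verdict = "fits" if can_encode(greeting) else "does not fit"
    print(greeting, "->", verdict, "in Latin-1")
```

Anything outside Western European text simply has no byte to land on in Latin-1, whereas all four greetings encode fine as UTF-8.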

Language Search

October 26th, 2006

I put together one of those Google Co-op search engines for sites related to language, translation, linguistics, and stuff like that. Check it out:

Site recommendations welcome.

Fuggeddaboutit

October 23rd, 2006

Oh, awesome:

“Full text search in SQLite.”

Oh, the sound of a bazillion angels crying:

“The module currently uses the following generic tokenization mechanism. A token is a contiguous sequence of alphanumeric ASCII characters (A-Z, a-z and 0-9). All non-ASCII characters are ignored. Each token is converted to lowercase before it is stored in the index, so all full-text searches are case-insensitive. The module does not perform stemming of any sort.”

My forehead is really starting to hurt from banging it on the desk.
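
To see what that rule does to real text, here is a rough Python imitation of the quoted description. It is a sketch of the behavior as documented above, not SQLite’s actual code; the name `ascii_tokenize` is made up, and I am assuming that “ignored” means non-ASCII characters act as token separators:

```python
import re


def ascii_tokenize(text):
    """Lowercased runs of ASCII letters and digits; everything else splits tokens."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


print(ascii_tokenize("Full Text Search"))    # plain English survives
print(ascii_tokenize("Merhaba dünya!"))      # the Turkish word is cut in two
print(ascii_tokenize("什么是Unicode(统一码)?"))  # only the ASCII word is indexed
```

Under that assumption, a word with one accented letter turns into two unrelated fragments, and text in a non-Latin script contributes nothing to the index at all.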

Testing, testing… Chinese utf-8…

December 22nd, 2005

Some test content from the Unicode site: What is Unicode? in Simplified Chinese

什么是Unicode(统一码)?

Testing Wordpress’s Unicode Chinese support. Move along, nothing to see here! (Hi Viv. :P)

Mojibake

November 2nd, 2005

Now there’s a useful word:

Mojibake is a Japanese loanword which refers to the incorrect, unreadable characters shown when a piece of computer software fails to render a text correctly according to its character encoding.

Mojibake - Wikipedia, the free encyclopedia
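
Mojibake is easy to produce on purpose. A small Python sketch, where the helper name `mojibake` and the choice of UTF-8 read as Latin-1 are just one illustrative pairing among many:

```python
def mojibake(text, actual="utf-8", assumed="latin-1"):
    """Encode text one way, then decode the bytes as if they were something else.

    Latin-1 maps every byte to a character, so this pairing never raises;
    other pairings may raise UnicodeDecodeError instead of producing garbage.
    """
    return text.encode(actual).decode(assumed)


garbled = mojibake("Olá mundo!")
print(garbled)

# As long as no bytes were lost, reversing the two steps repairs the text.
print(garbled.encode("latin-1").decode("utf-8"))
```

The bytes are never wrong here; only the label used to read them is.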

Laser Printer Fonts

October 28th, 2005

Ah, the endless confusion of all the little squiggles on the intarweb.

I bought a Samsung laser printer, which is really quite nice. It’s just black and white, which is fine for me, and it’s really fast, and the quality is much better than the last two crappy inkjets I’ve owned.

Thing is though… fonts. I vaguely remember reading some stuff about “where the fonts live” being different between inkjets and laser printers… or “real postscript” only being available in laser printers… or something like that.

Bottom line: sigh.

Original text: What my printer printed:
Hallo Welt!	German
你好,世界!	Chinese
Hello world!	English
Olá mundo!	Portuguese
Hallo wereld!	Dutch
こんにちは 世界!	Japanese
Καλημέρα κόσμε!	Greek
Merhaba dünya!	Turkish
Hola mundo!	Spanish
Halo dunia!	Bahasa Indonesia
Helló Világ!	Hungarian
Salut le monde!	French
Hallo verden!	Norwegian/Bokmal
Chào thế giới!	Vietnamese
Hejsan, världen!	Swedish
Привет, мир!	Russian
Tere, maailm!	Estonian
안녕, 세상!	Korean
Saluton Mondo!	Esperanto
Ahoj svet!	Czech
Hylô byd!	Welsh
Terve maailma!	Finnish
Laba ryta, pasauli!	Lithuanian
Halló heimur!	Icelandic
Sveika, pasaule!	Latvian
哈佬世界!	Cantonese
สวัสดีราคาถูก!	Thai
Hallo, wrâld	Frisian
Ave, Munde!	Latin
photo of lousy font handling by printer

It’s so RANDOM. Okay, so I can determine that there are missing fonts for several Asian languages by looking at this stuff. But what about Greek? Why does the “mu” show up but the rest is just blank?

And where do I look to start debugging such a problem? Which kinds of fonts does my printer “understand”?
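
One place to start, at least, is working out which scripts a test page actually asks the printer for. A rough Python sketch: the function name `scripts_used` is made up, and taking the first word of each character’s Unicode name is only a crude stand-in for real script data:

```python
import unicodedata


def scripts_used(text):
    """Collect the first word of each letter's Unicode name, e.g. GREEK or HANGUL."""
    found = set()
    for ch in text:
        if ch.isalpha():  # skip spaces, punctuation, and combining marks
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


for line in ["Hallo Welt!", "Καλημέρα κόσμε!", "こんにちは 世界!", "안녕, 세상!"]:
    print(line, sorted(scripts_used(line)))
```

This does not say which fonts the printer has, only which families of characters a font would need to cover for each line.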

In situations like this I generally think to myself… uh… I’ll solve this later.

And then I don’t.

Translation watch…

August 25th, 2005

I’m subscribed to some news feeds that send me updates with articles about translation, and here’s the latest one I came across:

Speaking the same language

Police are reaching out to migrant communities, with information in seven languages now available on the force’s national website.

Chinese, Arabic, Hindi, Japanese, Korean, Somali and Vietnamese speakers can now access police information online.

The site which was translated by NZ Translation Centre explains how to contact local police and liaison officers as well as giving tips on crime prevention and safety tips.

So I took a look at the site itself:

New Zealand Police Official Website

All said, it’s a pretty nice site — they’ve done a good job localizing it. One interesting bit, however: the character encodings aren’t consistent.

Arabic UTF-8
English ISO-8859-1 (Latin-1)
Hindi UTF-8
Japanese Shift_JIS
Korean EUC-KR
(Simplified) Chinese UTF-8
Somali UTF-8
Vietnamese UTF-8

I suppose there are compelling reasons to use those legacy encodings for Korean and Japanese — but it really doesn’t make sense to encode English as Latin-1, when the same site is using UTF-8 for a language like Somali, whose alphabet is strictly “roman” characters.

It seems to me that they’ll be looking at more headaches down the road as a result of not just going ahead and serving the whole site in a single encoding.
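
A small Python sketch of the trade-off. The sample strings are my own, and the guess that byte size is one reason people stick with legacy encodings is an assumption, not something the site states:

```python
# (sample text, legacy encoding) per language; samples are illustrative only.
samples = {
    "English": ("Hello world!", "latin-1"),
    "Japanese": ("こんにちは", "shift_jis"),
    "Korean": ("안녕", "euc_kr"),
}


def compare_with_utf8(text, legacy):
    """Return the same text as (legacy-encoded bytes, UTF-8 bytes)."""
    return text.encode(legacy), text.encode("utf-8")


for language, (text, legacy) in samples.items():
    legacy_bytes, utf8_bytes = compare_with_utf8(text, legacy)
    print(language, legacy, len(legacy_bytes), "bytes vs UTF-8",
          len(utf8_bytes), "bytes, identical:", legacy_bytes == utf8_bytes)
```

For plain ASCII English the Latin-1 and UTF-8 bytes are identical, so relabeling those pages as UTF-8 would cost nothing. The Japanese and Korean samples are somewhat smaller in their legacy encodings, which may be part of the appeal.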

Font Problems with Hindi in Firefox

August 1st, 2005

Debugging font issues is a pain, in my experience. If something isn’t rendering correctly, my first reaction is usually “I have absolutely no idea why that’s happening.” Gentle reader, feel my pain.

I find myself working with an awful lot of languages (you’ll see why when Jonas and I launch our project), and I often have to learn just enough characters to determine that a particular script seems to be rendering correctly. We have to know if rendering problems are caused by some kind of configuration problem that we can fix, or if it’s something out of our control: “Sorry, no hieroglyphics in Unicode, not our problem!”

Debugging such stuff is not the same thing as actually being able to read in all these languages: in most cases it’s enough to learn just a bit about how the script is put together and how characters combine, and perhaps a few words for testing purposes.

So here’s an example of a typical problem that I face. Compare the two screenshot clips I took this morning. I added the red-bordered boxes to point out the differences:

Even if you don’t know Devanāgarī from a salad fork, it doesn’t take much to guess that something is askew in my Firefox’s rendering of that page. (Never mind the fact that the word “Hindi” is actually spelled incorrectly… Doh!) Opera seems to get it right.

Now I’m not going to get into the details of how Devanagari works in Hindi at the moment (primarily because I don’t know much, heheh). The main problem for me is that there are so many possible causes for any problem in text rendering. Is this a configuration problem on my end, or is it some pernicious software problem buried in a library underneath the text?

  1. The font could be bad.
  2. The browser?
  3. Is it the case that my operating system is missing some library? (Linux, in my case.) If so, what library? Can I upgrade something to fix it? Who ya gonna call?
  4. Or maybe it’s part of my desktop environment? I wonder if it works in that other desktop environment… blech, switching desktops is a pain…
  5. Could it be an encoding problem? Maybe the HTML page is encoded incorrectly in the first place.
  6. Or, maybe their server is futzing up the encoding somehow?
  7. Is it part of that “font shaping” thing, Pango? Am I even using Pango?

nd, but dag.

Update: Σμς suggests an eighth potential culprit in this situation: there could be a problem with CSS. He also found a relevant bug in the bug database for Firefox. (See the comments. Thanks, Simos!)

In this particular case, the comparison above leads me to suspect #2, of course. But you get the picture here: these kinds of problems are a mess. Particularly in the open source world, it’s hard to know what to do in this situation. And I’m moderately techie. Imagine what a run of the mill user faces.
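
Suspects #5 and #6, at least, can be checked mechanically: compare the charset the server announces in its Content-Type header with the one the page declares for itself. A rough Python sketch, where both helper names are hypothetical and the regexes are deliberately simple rather than a full HTML parser:

```python
import re


def charset_from_header(content_type):
    """Pull the charset out of an HTTP Content-Type header value, if present."""
    match = re.search(r"charset=\s*\"?([\w.:-]+)", content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html_bytes):
    """Look for a charset declaration in a <meta> tag near the top of the page."""
    head = html_bytes[:2048].decode("ascii", errors="replace")
    match = re.search(r"<meta[^>]+charset=[\"']?\s*([\w.:-]+)", head, re.IGNORECASE)
    return match.group(1).lower() if match else None


header = "text/html; charset=ISO-8859-1"
page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
if charset_from_header(header) != charset_from_meta(page):
    print("server says", charset_from_header(header),
          "but the page says", charset_from_meta(page))
```

A mismatch like the one in this made-up example would point at the server or the page rather than at fonts or the browser.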

I was chatting with Chad Fowler and he made an interesting observation: for the development of any given application, in order to be sure, really sure, that everything is okay for every particular writing system, each development group would have to have someone who can read each language. Which, er, ain’t gonna happen.

And it shouldn’t really have to: the operating system is supposed to abstract the basic rendering of text away from coding.

OSX is pretty darn good at this. But then, it’s also a very closed system: it’s all tested, Apple owns and delivers a wide variety of high-quality (proprietary) fonts with its machines, and there are far fewer points of variation than you’ll see in your average Linux distribution.

Matters in Windows are less variable than in Linux, but more complex than in OSX, as Michael Kaplan can attest in great detail at his excellent blog.

I think these complexities are what make many programmers reticent about Unicode: they’ve been burned in the past with encoding matters, gotten a glimpse of the gruesome entrails underlying text rendering on their platform, and decided, “I just don’t have time to really learn how all these text rendering variables fit together.”

And quite frankly, despite being something of a Unicode zealot myself, I can sympathize.

Most developers accept that they need to know the absolute minimum about Unicode. They already know that Unicode is good. The thing is, as a previous commenter pointed out, and as this tiny example demonstrates, the “Unicode” part of handling text is only the tip of the iceberg.

And it’s a big iceberg.

Unicode News

July 23rd, 2005

I thought it might be interesting to look through what a Technorati search for “Unicode” turns up recently. This may be of no interest to others… but I like whiling away hours reading about Unicode.

Heh, stop that snickering. I could have a crack habit.

  • Urdu Blogging: Discussion on Urdu Content Management: “I have deliberately chosen only the sites that use Unicode Urdu.”
  • A pretty long thread over at SitePoint Forums - how to differentiate between unicode and plain text. This is a surprisingly complex task, especially when you’re talking about web apps (and isn’t everyone?). There is what looks to be a link to a pretty interesting reference at the end of the thread, but it was down when I checked…
  • Interesting: a Chinese blogger explicitly requesting that users switch to Unicode. love is beautiful. Isn’t it though? Now first of all… I thought all Blogspot blogs were sent as UTF-8 in the first place. But my browser (Firefox) defaulted to ISO-8859-1 (another name for Latin-1). So I had to heed the blogger’s request to see the Chinese: change yr character coding of yr browser to unicode if u cant c e Chinese characters above. Weird.
  • Okay, doubly weird, another Blogspot blog with the same problem: fallen angel says: “*to view -> view -> encoding -> unicode (utf-8)” I’m not sure what’s going on here, and I’m too tired to venture a guess. Explanations welcome. It seems to be related to Blogspot, and it’s not just a Chinese thing; here’s the same problem on an Urdu blog. Ok, one more:
  • Malayalam Related Topics Oh, jackpot. A whole blog about Malayalam and Unicode. Yikes, according to the Malayalam Unicode font tester, neither Opera nor Firefox passes (under Linux, anyway). Here are some screenshots, see for yourself (I added the red boxes): part a, part b. Opera does slightly better. Man, Malayalam is one complex script.
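
On the “is this Unicode or not” question from the SitePoint thread: one common heuristic is simply to try decoding the bytes as UTF-8 and see whether that fails. A minimal Python sketch, with `looks_like_utf8` as a made-up name and no claim that this is what the thread recommends:

```python
def looks_like_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


chinese = "什么是Unicode".encode("utf-8")
print(looks_like_utf8(chinese))                  # valid UTF-8
print(looks_like_utf8("Olá".encode("latin-1")))  # a lone 0xE1 byte is not

# The reverse check is useless: Latin-1 accepts any bytes at all,
# which is how a browser can happily show mojibake instead of an error.
print(chinese.decode("latin-1"))
```

It is only a heuristic: pure ASCII passes as UTF-8 too, and a passing check says nothing about which legacy encoding a failing page is actually in.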

Tamil Blogs and Unicode

July 6th, 2005

A letter in the Tamil script.

Seems to have become “languages of India week” here at Infundibulum.

Pankaj Narula dropped a friendly comment on my previous post on Hindi and Unicode, explaining that Hindi blogs are in fact almost universally encoded as Unicode, thanks in large part to Blogger.com’s good Unicode support. And so it seems that among Hindi bloggers at least, everyone is quite up-to-date with their language technology…

Just for fun I decided to poke around in the Tamil blogosphere to see if the situation was similar, and it turns out that Blogspot is equally prominent among Tamil bloggers:

Of the 613 blogs in Tamil listed in the directory at the Tamil Bloggers List, 513 are hosted on Blogspot, so we can assume that most Tamil blogs are encoded in Unicode.

After a bit of digging I could find only one blog among the non-Blogspot crowd that seemed to have encoding troubles: “peyariliyin pinaaththalgaL.” At first I thought it was in some mysterious legacy encoding, but it turns out that blogdrive.com seems to have its servers set to send Windows-1252. This one, on the same server, specified more fonts and ended up being visible to me. So it was mainly a font thing. (Update: That blog is now UTF-8 as well.)

(Incidentally, poking about in the occasional English comment in Tamil blogs, it seems to be the case that the language name is often transliterated as Tamizh rather than Tamil.)

The rather magnificent-looking Tamil letter up there in the corner is TAMIL LETTER I (U+0B87). It looks fun to write. ☺
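
For the curious, a tiny Python sketch that looks the letter up in the Unicode character database and shows how it travels as UTF-8. The `describe` helper is a made-up name, and the `hex(" ")` call assumes Python 3.8 or later:

```python
import unicodedata


def describe(ch):
    """Return (code point label, Unicode name, UTF-8 bytes in hex) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex(" "))


print(describe("\u0b87"))
```

So that one letter is three bytes on the wire in UTF-8.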

By the way, am I weird to be obsessed with figuring out how languages I can’t speak a word of are encoded?