AAARGH
November 2nd, 2006
They’re sending messages from all over the world to MARS…
But the page is encoded in Latin-1!!!
Oh the humanity.
I put together one of those Google Co-op search engines for sites related to language, translation, linguistics, and stuff like that. Check it out:
Site recommendations welcome.
Oh, awesome:
“Full text search in SQLite.”
Oh, the sound of a bazillion angels crying:
“The module currently uses the following generic tokenization mechanism. A token is a contiguous sequence of alphanumeric ASCII characters (A-Z, a-z and 0-9). All non-ASCII characters are ignored. Each token is converted to lowercase before it is stored in the index, so all full-text searches are case-insensitive. The module does not perform stemming of any sort.”
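To see what that tokenizer does to anything beyond English, here is a minimal Python sketch of the behaviour as the quote describes it. It is my reading of that paragraph, not SQLite's actual code, and `ascii_tokens` is a made-up name. I'm assuming "ignored" means a non-ASCII character simply ends the current token.

```python
import re

# Assumption: this mirrors only the behaviour quoted above
# (runs of A-Z, a-z, 0-9, lowercased); it is not SQLite's implementation.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def ascii_tokens(text: str) -> list:
    """Return lowercase runs of ASCII letters and digits found in text."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


print(ascii_tokens("Full text search in SQLite"))
print(ascii_tokens("Merhaba dünya!"))   # the ü splits the word in two
print(ascii_tokens("Привет, мир!"))     # nothing indexable at all
```

Under that reading, a Turkish word gets chopped into fragments and a Russian sentence produces no tokens whatsoever. Hence the forehead.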
My forehead is really starting to hurt from banging it on the desk.
Some test content from the Unicode site: What is Unicode? in Simplified Chinese
什么是Unicode(统一码)?
Testing Wordpress’s Unicode Chinese support. Move along, nothing to see here! (Hi Viv.)
Now there’s a useful word:
Mojibake is a Japanese loanword which refers to the incorrect, unreadable characters shown when a piece of computer software fails to render a text correctly according to its character encoding.
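The most common way to manufacture mojibake is to decode bytes with the wrong encoding. A small Python sketch, assuming the classic case of UTF-8 bytes read as Latin-1 (the function names `garble` and `repair` are mine, purely for illustration):

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo garble(): recover the bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "什么是Unicode"
broken = garble(original)

print(repr(broken))               # one junk character per UTF-8 byte
print(repair(broken) == original)
```

The repair only works here because Latin-1 maps every byte to some character, so nothing is lost on the way. With other wrong guesses (or after a second round of mangling) the original may not be recoverable.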
Ah, the endless confusion of all the little squiggles on the intarweb.
I bought a Samsung laser printer, which is really quite nice. It’s just black and white, which is fine for me, and it’s really fast, and the quality is much better than the last two crappy inkjets I’ve owned.
Thing is though… fonts. I vaguely remember reading some stuff about “where the fonts live” being different between inkjets and laser printers… or “real PostScript” only being available in laser printers… or something like that.
Bottom line: sigh.
| Original text: | What my printer printed: |
|---|---|
| Hallo Welt! (German); 你好,世界! (Chinese); Hello world! (English); Olá mundo! (Portuguese); Hallo wereld! (Dutch); こんにちは 世界! (Japanese); Καλημέρα κόσμε! (Greek); Merhaba dünya! (Turkish); Hola mundo! (Spanish); Halo dunia! (Bahasa Indonesia); Helló Világ! (Hungarian); Salut le monde! (French); Hallo verden! (Norwegian/Bokmal); Chào thế giới! (Vietnamese); Hejsan, världen! (Swedish); Привет, мир! (Russian); Tere, maailm! (Estonian); 안녕, 세상! (Korean); Saluton Mondo! (Esperanto); Ahoj svet! (Czech); Hylô byd! (Welsh); Terve maailma! (Finnish); Laba ryta, pasauli! (Lithuanian); Halló heimur! (Icelandic); Sveika, pasaule! (Latvian); 哈佬世界! (Cantonese); สวัสดีราคาถูก! (Thai); Hallo, wrâld (Frisian); Ave, Munde! (Latin) | |
It’s so RANDOM. Okay, so I can determine that there are missing fonts for several Asian languages by looking at this stuff. But what about Greek? Why does the “mu” show up but the rest is just blank?
And where do I look to start debugging such a problem? Which kinds of fonts does my printer “understand”?
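One place to start is a rough inventory of which scripts a test page actually uses, so you at least know which font coverage to go looking for. Here is a hedged Python sketch. It uses the first word of each character's Unicode name as a stand-in for its script, which is only an approximation (the standard library does not expose the real Script property), and `script_inventory` is a name I made up.

```python
import unicodedata
from collections import Counter


def script_inventory(text: str) -> Counter:
    """Count non-ASCII characters by the first word of their Unicode name."""
    counts = Counter()
    for ch in text:
        if ord(ch) < 128:
            continue  # plain ASCII is rarely the problem
        name = unicodedata.name(ch, "UNKNOWN")
        counts[name.split()[0]] += 1
    return counts


print(script_inventory("Привет, мир!"))
print(script_inventory("こんにちは 世界!"))
print(script_inventory("Merhaba dünya!"))
```

A side note that might bear on the lone “mu”: Latin-1 has its own MICRO SIGN (U+00B5), which is a different code point from GREEK SMALL LETTER MU (U+03BC) but looks the same. That could be how a mu-shaped glyph sneaks into a font with no other Greek in it, though whether that is what happened on my printer is a guess.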
In situations like this I generally think to myself… uh… I’ll solve this later.
And then I don’t.
I’m subscribed to some news feeds that send me updates with articles about translation, and here’s the latest one I came across:
Police are reaching out to migrant communities, with information in seven languages now available on the force’s national website.
Chinese, Arabic, Hindi, Japanese, Korean, Somali and Vietnamese speakers can now access police information online.
The site which was translated by NZ Translation Centre explains how to contact local police and liaison officers as well as giving tips on crime prevention and safety tips.
So I took a look at the site itself:
New Zealand Police Official Website
All said, it’s a pretty nice site — they’ve done a good job localizing it. One interesting bit, however: the character encodings aren’t consistent.
| Language | Encoding |
|---|---|
| Arabic | UTF-8 |
| English | ISO-8859-1 (Latin-1) |
| Hindi | UTF-8 |
| Japanese | Shift_JIS |
| Korean | EUC-KR |
| (Simplified) Chinese | UTF-8 |
| Somali | UTF-8 |
| Vietnamese | UTF-8 |
I suppose there are compelling reasons to use those legacy encodings for Korean and Japanese — but it really doesn’t make sense to encode English as Latin-1, when the same site is using UTF-8 for a language like Somali, whose alphabet is strictly “roman” characters.
It seems to me that they’ll be looking at more headaches down the road as a result of not just going ahead and serving the whole site in a single encoding.
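To make the trade-off concrete, here is a small Python sketch comparing byte counts for a few of the “hello world” strings under the encodings in that table. The sample strings are my own picks, not text from the police site, and `byte_lengths` is just an illustrative helper.

```python
def byte_lengths(text: str, encodings: list) -> dict:
    """Map each encoding name to the number of bytes text occupies in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}


samples = {
    "Hello world!": ["latin-1", "utf-8"],
    "Ol\u00e1 mundo!": ["latin-1", "utf-8"],                              # Olá mundo!
    "\u3053\u3093\u306b\u3061\u306f \u4e16\u754c!": ["shift_jis", "utf-8"],  # こんにちは 世界!
    "\uc548\ub155, \uc138\uc0c1!": ["euc_kr", "utf-8"],                    # 안녕, 세상!
}

for text, encodings in samples.items():
    print(text, byte_lengths(text, encodings))

# Pure ASCII text is byte-for-byte identical in Latin-1 and UTF-8.
print("Hello world!".encode("latin-1") == "Hello world!".encode("utf-8"))
```

The legacy encodings do save some bytes for Japanese and Korean, which is presumably part of their appeal. For plain English the two encodings produce the same bytes, so switching those pages to UTF-8 would cost nothing until the first accented character shows up, and at that point UTF-8 is the one that will not surprise you.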
Debugging font issues is a pain, in my experience. If something isn’t rendering correctly, my first reaction is usually “I have absolutely no idea why that’s happening.” Gentle reader, feel my pain.
I find myself working with an awful lot of languages (you’ll see why when Jonas and I launch our project), and I often have to learn just enough characters to determine that a particular script seems to be rendering correctly. We have to know if rendering problems are caused by some kind of configuration problem that we can fix, or if it’s something out of our control: “Sorry, no hieroglyphics in Unicode, not our problem!”
Debugging such stuff is not the same thing as actually being able to read in all these languages: in most cases it’s enough to learn just a bit about how the script is put together and how characters combine, and perhaps a few words for testing purposes.
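For “how the script is put together,” I find it helps to dump the code points of a test word and read their names. A minimal Python sketch, using the Hindi word for “Hindi” as the test word (written with escapes so the stored order is unambiguous; `describe` is a hypothetical helper name):

```python
import unicodedata

# The word "Hindi" in Devanagari, in stored (logical) order: हिन्दी
WORD = "\u0939\u093f\u0928\u094d\u0926\u0940"


def describe(text: str) -> list:
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(WORD):
    print(code_point, name)
# U+0939 DEVANAGARI LETTER HA
# U+093F DEVANAGARI VOWEL SIGN I
# U+0928 DEVANAGARI LETTER NA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0926 DEVANAGARI LETTER DA
# U+0940 DEVANAGARI VOWEL SIGN II
```

Two things in that list are exactly what a renderer has to get right: VOWEL SIGN I is stored after HA but drawn to its left, and NA plus VIRAMA plus DA are supposed to fuse into a conjunct. If either looks off on screen, the data can still be perfectly fine and the fault lies somewhere in the font or shaping layer.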
So here’s an example of a typical problem that I face. Compare the two screenshot clips I took this morning. I added the red-bordered boxes to point out the differences:

Even if you don’t know Devanāgarī from a salad fork, it doesn’t take much to guess that something is askew in my Firefox’s rendering of that page. (Never mind the fact that the word “Hindi” is actually spelled incorrectly… Doh!) Opera seems to get it right.
Now I’m not going to get into the details of how Devanagari works in Hindi at the moment (primarily because I don’t know much, heheh). The main problem for me is that there are so many possible causes for any problem in text rendering. Is this a configuration problem on my end, or is it some pernicious software problem buried in a library underneath the text?
In this particular case, the comparison above leads me to suspect #2, of course. But you get the picture here: these kinds of problems are a mess. Particularly in the open source world, it’s hard to know what to do in this situation. And I’m moderately techie. Imagine what a run-of-the-mill user faces.
I was chatting with Chad Fowler and he made an interesting observation: for the development of any given application, in order to be sure, really sure, that everything is okay for every particular writing system, each development group would have to have someone who can read each language. Which, er, ain’t gonna happen.
And it shouldn’t really have to: the operating system is supposed to abstract the basic rendering of text away from coding.
OSX is pretty darn good at this. But then, it’s also a very closed system: it’s all tested, Apple owns and delivers a wide variety of high-quality (proprietary) fonts with its machines, and there are far fewer points of variation than you’ll see in your average Linux distribution.
Matters in Windows are less variable than Linux, but more complex than OSX, as Michael Kaplan can attest in great detail at his excellent blog.
I think these complexities are what make many programmers reticent about Unicode: they’ve been burned in the past with encoding matters, gotten a glimpse of the gruesome entrails underlying text rendering on their platform, and decided, “I just don’t have time to really learn how all these text rendering variables fit together.”
And quite frankly, despite being something of a Unicode zealot myself, I can sympathize.
Most developers accept that they need to know the absolute minimum about Unicode. They already know that Unicode is good. The thing is, as a previous commenter pointed out, and as this tiny example demonstrates, the “Unicode” part of handling text is only the tip of the iceberg.
And it’s a big iceberg.
I thought it might be interesting to look through what a Technorati search for “Unicode” turns up recently. This may be of no interest to others… but I like whiling away hours reading about Unicode.
Heh, stop that snickering. I could have a crack habit.
Seems to have become “languages of India week” here at Infundibulum.
Pankaj Narula dropped a friendly comment on my previous post on Hindi and Unicode, explaining that Hindi blogs are in fact almost universally encoded as Unicode, thanks in large part to Blogger.com’s good Unicode support. And so it seems that among Hindi bloggers at least, everyone is quite up-to-date with their language technology…
Just for fun I decided to poke around in the Tamil blogosphere to see if the situation was similar, and it turns out that Blogspot is equally prominent among Tamil bloggers:
Of the 613 blogs in Tamil listed at the directory at the Tamil Bloggers List, 513 are hosted on Blogspot, so we can assume that most Tamil blogs are encoded in Unicode.
After a bit of digging I could only find one blog among the non-Blogspot crowd that seemed to have encoding troubles: “peyariliyin pinaaththalgaL.” At first I thought it was in some mysterious legacy encoding, but it turns out that blogdrive.com seems to have its servers set to send Windows-1252. This one, on the same server, specified more fonts and ended up being visible to me. So it was mainly a font thing. (Update: That blog is now UTF-8 as well.)
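When a server announces one charset and the bytes are really in another, the browser believes the header. Here is a hedged Python sketch of that mismatch. The header value is a hypothetical example, not something I captured from blogdrive.com, and `declared_charset` is a made-up helper built on the standard library's `email.message.Message` header parsing.

```python
from email.message import Message

TAMIL = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # தமிழ்


def declared_charset(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


# Hypothetical response: UTF-8 bytes sent with a Windows-1252 label.
header = "text/html; charset=windows-1252"
body = TAMIL.encode("utf-8")

charset = declared_charset(header)
as_declared = body.decode(charset, errors="replace")  # what the label implies
as_utf8 = body.decode("utf-8")                        # what the bytes really are

print(charset)
print(repr(as_declared))
print(as_utf8 == TAMIL)
```

The `errors="replace"` matters: some byte values are undefined in Windows-1252, so a strict decode of Tamil UTF-8 bytes can fail outright instead of merely producing junk.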
(Incidentally, poking about in the occasional English comment in Tamil blogs, it seems to be the case that the language name is often transliterated as Tamizh rather than Tamil.)
The rather magnificent-looking Tamil letter up there in the corner is TAMIL LETTER I (U+0B87). It looks fun to write. ☺
By the way, am I weird to be obsessed with figuring out how languages I can’t speak a word of are encoded?