infundibulum

Archive for Unicode

Translation watch…

August 25, 2005 @ 8:53 pm · Filed under Translation, Language, Unicode

I’m subscribed to some news feeds that send me updates with articles about translation, and here’s the latest one I came across:

Speaking the same language

Police are reaching out to migrant communities, with information in seven languages now available on the force’s national website.
Chinese, Arabic, Hindi, Japanese, Korean, Somali and Vietnamese speakers can now access police information online.

The site which was translated by NZ Translation Centre explains how to contact local police and liaison officers as well as giving tips on crime prevention and safety tips.

So I took a look at the site itself:

New Zealand Police Official Website

All said, it’s a pretty nice site — they’ve done a good job localizing it. One interesting bit, however: the character encodings aren’t consistent.

Arabic	UTF-8
English	ISO-8859-1 (Latin-1)
Hindi	UTF-8
Japanese	Shift_JIS
Korean	EUC-KR
(Simplified) Chinese	UTF-8
Somali	UTF-8
Vietnamese	UTF-8

I suppose there are compelling reasons to use those legacy encodings for Korean and Japanese — but it really doesn’t make sense to encode English as Latin-1, when the same site is using UTF-8 for a language like Somali, whose alphabet is strictly “roman” characters.

It seems to me that they’ll be looking at more headaches down the road as a result of not just going ahead and serving the whole site in a single encoding.
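To make the inconsistency concrete, here is a minimal Ruby sketch. It assumes a modern Ruby (1.9 or later, where `String#encode` exists); the helper names `byte_sizes` and `same_bytes?` and the sample strings are illustrative assumptions, not anything taken from the police site. It shows that pure ASCII text is byte-for-byte identical in Latin-1 and UTF-8, while anything beyond ASCII turns into different bytes depending on the encoding.

```ruby
# Hypothetical helper: byte length of +text+ in each named encoding.
def byte_sizes(text, encodings)
  encodings.each_with_object({}) do |name, sizes|
    sizes[name] = text.encode(name).bytesize
  end
end

# Hypothetical helper: true when +text+ is byte-for-byte identical
# in both encodings.
def same_bytes?(text, enc_a, enc_b)
  text.encode(enc_a).bytes == text.encode(enc_b).bytes
end

# Plain ASCII (most English or Somali text): identical bytes either way.
puts same_bytes?("police", "ISO-8859-1", "UTF-8")

# "café": the accented letter is 1 byte in Latin-1 but 2 bytes in UTF-8.
p byte_sizes("caf\u00E9", %w[ISO-8859-1 UTF-8])

# "日本語": 2 bytes per character in Shift_JIS, 3 per character in UTF-8.
p byte_sizes("\u65E5\u672C\u8A9E", %w[Shift_JIS UTF-8])
```

So an all-ASCII English page costs nothing to relabel as UTF-8, which is part of why serving it as Latin-1 next to UTF-8 pages looks so arbitrary.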


Font Problems with Hindi in Firefox

August 1, 2005 @ 7:17 am · Filed under Language, Tech, Unicode

Debugging font issues is a pain, in my experience. If something isn’t rendering correctly, my first reaction is usually “I have absolutely no idea why that’s happening.” Gentle reader, feel my pain.

I find myself working with an awful lot of languages (you’ll see why when Jonas and I launch our project), and I often have to learn just enough characters to determine that a particular script seems to be rendering correctly. We have to know if rendering problems are caused by some kind of configuration problem that we can fix, or if it’s something out of our control: “Sorry, no hieroglyphics in Unicode, not our problem!”

Debugging such stuff is not the same thing as actually being able to read in all these languages: in most cases it’s enough to learn just a bit about how the script is put together and how characters combine, and perhaps a few words for testing purposes.

So here’s an example of a typical problem that I face. Compare the two screenshot clips I took this morning. I added the red-bordered boxes to point out the differences:

Hindi Font Problems

Even if you don’t know Devanāgarī from a salad fork, it doesn’t take much to guess that something is askew in my Firefox’s rendering of that page. (Never mind the fact that the word “Hindi” is actually spelled incorrectly… Doh!) Opera seems to get it right.

Now I’m not going to get into the details of how Devanagari works in Hindi at the moment (primarily because I don’t know much, heheh). The main problem for me is that there are so many possible causes for any problem in text rendering. Is this a configuration problem on my end, or is it some pernicious software problem buried in a library underneath the text?

1. The font could be bad.
2. The browser?
3. Is it the case that my operating system is missing some library? (Linux, in my case.) If so, what library? Can I upgrade something to fix it? Who ya gonna call?
4. Or maybe it’s part of my desktop environment? I wonder if it works in that other desktop environment… blech, switching desktops is a pain…
5. Could it be an encoding problem? Maybe the HTML page is encoded incorrectly in the first place.
6. Or, maybe their server is futzing up the encoding somehow?
7. Is it part of that “font shaping” thing, Pango? Am I even using Pango?

nd, but dag.

Update… Σμς suggests an eighth potential culprit: there could be a problem with CSS. He also found a relevant bug in the bug database for Firefox. (See the comments. Thanks, Simos!)

In this particular case, the comparison above leads me to suspect #2, of course. But you get the picture here: these kinds of problems are a mess. Particularly in the open source world, it’s hard to know what to do in this situation. And I’m moderately techie. Imagine what a run-of-the-mill user faces.

I was chatting with Chad Fowler and he made an interesting observation: for the development of any given application, in order to be sure, really sure, that everything is okay for every particular writing system, each development group would have to have someone who can read each language. Which, er, ain’t gonna happen.

And it shouldn’t really have to: the operating system is supposed to abstract the basic rendering of text away from coding.

OSX is pretty darn good at this. But then, it’s also a very closed system: it’s all tested, Apple owns and delivers a wide variety of high-quality (proprietary) fonts with its machines, and there are far fewer points of variation than you’ll see in your average Linux distribution.

Matters in Windows are less variable than Linux, but more complex than OSX, as Michael Kaplan can attest in great detail at his excellent blog.

I think these complexities are what make many programmers reticent about Unicode: they’ve been burned in the past with encoding matters, gotten a glimpse of the gruesome entrails underlying text rendering on their platform, and decided, “I just don’t have time to really learn how all these text rendering variables fit together.”

And quite frankly, despite being something of a Unicode zealot myself, I can sympathize.

Most developers accept that they need to know the absolute minimum about Unicode. They already know that Unicode is good. The thing is, as a previous commenter pointed out, and as this tiny example demonstrates, the “Unicode” part of handling text is only the tip of the iceberg.

And it’s a big iceberg.


Unicode News

July 23, 2005 @ 3:54 am · Filed under Unicode

I thought it might be interesting to look through what a Technorati search for “Unicode” turns up recently. This may be of no interest to others… but I like whiling away hours reading about Unicode.

Heh, stop that snickering. I could have a crack habit.

Urdu Blogging: Discussion on Urdu Content Management: “I have deliberately chosen only the sites that use Unicode Urdu.”
A pretty long thread over at SitePoint Forums: how to differentiate between unicode and plain text. This is a surprisingly complex task, especially when you’re talking about web apps (and isn’t everyone?). There is what looks to be a link to a pretty interesting reference at the end of the thread, but it was down when I checked…
Interesting: a Chinese blogger explicitly requesting that users switch to Unicode. love is beautiful. Isn’t it though? Now first of all… I thought all Blogspot blogs were sent as UTF-8 in the first place. But my browser (Firefox) defaulted to ISO-8859-1 (that is, Latin-1). So I had to heed the blogger’s request to see the Chinese: change yr character coding of yr browser to unicode if u cant c e Chinese characters above. Weird.
Okay, doubly weird, another Blogspot blog with the same problem: fallen angel says: “*to view -> view -> encoding -> unicode (utf-8)” I’m not sure what’s going on here, and I’m too tired to venture a guess. Explanations welcome. It seems to be related to Blogspot, and it’s not just a Chinese thing: here’s the same problem on an Urdu blog. Ok, one more:
Malayalam Related Topics Oh, jackpot. A whole blog about Malayalam and Unicode. Yikes, according to the Malayalam Unicode font tester, neither Opera nor Firefox passes (under Linux, anyway). Here are some screenshots, see for yourself (I added the red boxes): part a, part b. Opera does slightly better. Man, Malayalam is one complex script.
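I can’t tell what Blogspot was actually sending, but the symptom on those two blogs looks like UTF-8 bytes being read as Latin-1. Here is a minimal Ruby sketch of that failure mode, assuming a modern Ruby (1.9 or later); `misread_as_latin1` and `repair` are hypothetical helper names, and the sample text is my own.

```ruby
# Decode UTF-8 bytes as if they were ISO-8859-1, the way a browser does
# when it guesses the wrong encoding. (Hypothetical helper.)
def misread_as_latin1(utf8_text)
  utf8_text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Undo the damage: map the mis-decoded characters back to the original
# bytes and read them as UTF-8 again. (Hypothetical helper.)
def repair(garbled)
  garbled.encode("ISO-8859-1").force_encoding("UTF-8")
end

original = "\u4E2D\u6587"   # "中文", two characters, six UTF-8 bytes
garbled  = misread_as_latin1(original)

puts original.length             # 2 characters
puts garbled.length              # 6 characters of Latin-1 junk, one per byte
puts repair(garbled) == original # the bytes were never lost, only mislabeled
```

Switching the browser’s encoding menu to UTF-8, as those bloggers ask, is the manual version of `repair`: the bytes on the wire stay the same and only the interpretation changes.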


Tamil Blogs and Unicode

July 6, 2005 @ 2:53 am · Filed under Language, Blogging, Unicode

A letter in the Tamil script.

Seems to have become “languages of India week” here at Infundibulum.

Pankaj Narula dropped a friendly comment on my previous post on Hindi and Unicode, explaining that Hindi blogs are in fact almost universally encoded as Unicode, thanks in large part to Blogger.com’s good Unicode support. And so it seems that among Hindi bloggers at least, everyone is quite up-to-date with their language technology…

Just for fun I decided to poke around in the Tamil blogosphere to see if the situation was similar, and it turns out that Blogspot is equally prominent among Tamil bloggers:

Of the 613 blogs in Tamil listed in the directory at the Tamil Bloggers List, 513 are hosted on Blogspot–so we can assume that most Tamil blogs are encoded in Unicode.

After a bit of digging I could only find one blog among the non-Blogspot crowd that seemed to have encoding troubles: “peyariliyin pinaaththalgaL.” At first I thought it was in some mysterious legacy encoding, but it turns out that blogdrive.com seems to have its servers set to send Windows-1252. This one, on the same server, specified more fonts and ended up being visible to me. So it was mainly a font thing.

(Incidentally, poking about in the occasional English comment in Tamil blogs, it seems to be the case that the language name is often transliterated as Tamizh rather than Tamil.)

The rather magnificent-looking Tamil letter up there in the corner is TAMIL LETTER I (U+0B87). It looks fun to write. ☺

By the way, am I weird to be obsessed with figuring out how languages I can’t speak a word of are encoded?


On-the-fly ASCII to Unicode Transliteration with Javascript?

July 2, 2005 @ 5:14 pm · Filed under Language, Tech, Javascript, Unicode

Here’s an interesting little script I found on the Reta Vortaro (that is, the Esperanto web dictionary).

Try typing the string jxauxdo in that box. And press “Trovu” if you like; that will search Google for ĵaŭdo (Esperanto for “Thursday”). Notice that jx → ĵ and ux → ŭ “on the fly,” as you type. (Come to think of it, maybe “transliteration” isn’t the right word for this process…)

So, backing up a bit, Esperanto has a few odd characters in its orthography:

Letter	Pronunciation (IPA)	Unicode	x-system
ĉ	[ʧ]	U+0109	cx
ĝ	[ʤ]	U+011D	gx
ĥ	[x]	U+0125	hx
ĵ	[ʒ]	U+0135	jx
ŝ	[ʃ]	U+015D	sx
ŭ (as aŭ, eŭ)	[u̯]	U+016D	ux

Even today those characters are relatively rare in fonts–if you can’t see them I imagine this post may not make too terribly much sense. 8^)

The good doktoro even got a little flak back in the day, for choosing to include such unusual characters in a supposedly universal language. Nowadays, however, they’re all in Unicode–here’s the full info for ŝ, for example:

U+015D LATIN SMALL LETTER S WITH CIRCUMFLEX
ŝ

But pragmatically speaking, there’s still a problem with input. Suppose you are a gold-star-wearing green-flag-waving Esperanto aficionado, and you want to post something on the internet. How do you actually type these characters? The “right” answer is that you install a keyboard layout for the language in question, and you memorize its layout.

This is a pain, of course.

And it’s nothing new: in the (typographical) bad old days of all-ASCII USENET, Unicode wasn’t widely available, and what people would generally do (for many languages, not just Esperanto) was come up with all-ASCII transliteration systems. The “x-system” added to the table above was probably the most popular. It so happens that there is no letter x in Esperanto, so it didn’t cause any massive problems with ambiguity.

So let’s look at the script in question; it’s quite simple:

function xAlUtf8(t) {
  if (document.getElementById("x").checked) {
    t = t.replace(/c[xX]/g, "\u0109");
    t = t.replace(/g[xX]/g, "\u011d");
    t = t.replace(/h[xX]/g, "\u0125");
    t = t.replace(/j[xX]/g, "\u0135");
    t = t.replace(/s[xX]/g, "\u015d");
    t = t.replace(/u[xX]/g, "\u016d");
    t = t.replace(/C[xX]/g, "\u0108");
    t = t.replace(/G[xX]/g, "\u011c");
    t = t.replace(/H[xX]/g, "\u0124");
    t = t.replace(/J[xX]/g, "\u0134");
    t = t.replace(/S[xX]/g, "\u015c");
    t = t.replace(/U[xX]/g, "\u016c");
    document.getElementById("q").value=t;
  }
}

Include it with something like:

<script type="text/javascript" src="http://example.com/translit.js"></script>

And the function gets called with an onkeyup="xAlUtf8(this.value)" inside the input tag.

(Using onkeyup is actually sort of verboten these days–it should be done unobtrusively, etc.)

So anyway, that’s a pretty interesting way to enter some unusual characters. It’s interesting to muse on just how far one could take this approach. Would it be possible to create a script that would handle an entire writing system? Say, a script that would convert an entire textarea from an ASCII-based transliteration to Unicode characters, on the fly? Japanese and Chinese are definitely excluded from this approach (every Chinese character in RAM? Er, no.) but people who use those languages generally already have keyboard input taken care of.

That would be neat: you could, for instance, have textareas where users could input something in Amharic or Persian or whatever without having the corresponding keyboard layout installed.

But as it stands, it’s just simple substitution, and no string which is to be substituted can be a substring of another such string. In order to handle a more generalized set of substitutions, you’d probably need to use a trie structure. (There’s a nice trie implementation in Python by James Tauber.)
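Here is a rough Ruby sketch of what the trie idea could look like: walk the trie from each position and keep the longest key that matched, so that overlapping keys no longer cause trouble. The class name `TranslitTrie`, the constant `ESPERANTO_X`, and the little Cyrillic table in the usage lines are all hypothetical; this is not the Reta Vortaro script and not James Tauber’s implementation, and it assumes a modern Ruby (1.9 or later).

```ruby
# A tiny trie for longest-match transliteration (illustrative sketch).
class TranslitTrie
  Node = Struct.new(:children, :output)

  # +table+ maps ASCII sequences to their Unicode replacements.
  def initialize(table)
    @root = Node.new({}, nil)
    table.each { |ascii, unicode| insert(ascii, unicode) }
  end

  # Replace every known sequence in +text+, preferring the longest match.
  def convert(text)
    chars = text.chars
    result = []
    i = 0
    while i < chars.length
      node = @root
      match_len = 0
      match_out = nil
      j = i
      # Follow the trie as far as the input allows, remembering the
      # last position where a complete key ended.
      while j < chars.length && (node = node.children[chars[j]])
        j += 1
        if node.output
          match_len = j - i
          match_out = node.output
        end
      end
      if match_out
        result << match_out
        i += match_len
      else
        result << chars[i]   # no key starts here: copy the character
        i += 1
      end
    end
    result.join
  end

  private

  def insert(key, value)
    node = @root
    key.each_char { |c| node = (node.children[c] ||= Node.new({}, nil)) }
    node.output = value
  end
end

# The lowercase half of the x-system table from above.
ESPERANTO_X = {
  "cx" => "\u0109", "gx" => "\u011D", "hx" => "\u0125",
  "jx" => "\u0135", "sx" => "\u015D", "ux" => "\u016D"
}.freeze

puts TranslitTrie.new(ESPERANTO_X).convert("jxauxdo")

# Overlapping keys, which plain sequential replacement cannot handle:
cyrillic = TranslitTrie.new("s" => "\u0441", "sh" => "\u0448", "shch" => "\u0449")
puts cyrillic.convert("shch")
```

The design choice is to record only the last complete match while walking, so input like "shc" falls back to the "sh" key and then copies the stray "c" unchanged.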

I’m sure there are complications that would arise from what’s called “font shaping” — that is, how operating systems combine adjacent characters. In Arabic or Thai, for instance, characters vary depending on which characters they’re adjacent to. How does this process affect text in textareas, for instance, or text which is mushed around with Javascript?

I’ll be playing around with this.


Continuing Adventures in Ruby and Unicode

June 25, 2005 @ 3:00 am · Filed under Ruby, Unicode

I’m like reeeeeally tired right now, but here goes anyway.

Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.

Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.

Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through this all before, remember? Counting utf-8 stuff in Ruby doesn’t work so hot, right?

Er, that’s what I thought. But behold, from the docs for that script:

distance(str1, str2) Calculate the Levenshtein distance between two strings str1 and str2. str1 and str2 should be ASCII or UTF-8.

¿Como say what?

Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):

        s = str1.unpack('U*')
        t = str2.unpack('U*')
        n = s.length
        m = t.length
        return m if (0 == n)
        return n if (0 == m)

And n and m now contain the length of str1 and str2.

That is, the “real” length.

The number of characters, not bytes.

Hmm, yeah. The jlength.

Except we didn’t even require 'jcode'.

So just how does this String#unpack doohickey work, anyway…

Well one could go look at the docs…

IT LOOKS LIKE C!!! *runs screaming*

Okay yeah. Well, that’s interesting. I’m going to have to read about that.
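For what it’s worth, the trick seems to be that `unpack('U*')` decodes a UTF-8 string into an array of integer code points, so the array’s length is the character count. Below is a minimal sketch of an edit distance built on that idea. It is my own illustration, not Paul Battley’s code; the method name `levenshtein` is an assumption, and the sample strings are made up.

```ruby
# Levenshtein distance counted in code points rather than bytes.
# (Illustrative sketch; not the levenshtein.rb implementation.)
def levenshtein(str1, str2)
  s = str1.unpack("U*")   # array of integer code points
  t = str2.unpack("U*")
  return t.length if s.empty?
  return s.length if t.empty?

  # previous[j] holds the distance between the first i code points of s
  # and the first j code points of t.
  previous = (0..t.length).to_a
  s.each_with_index do |s_char, i|
    current = [i + 1]
    t.each_with_index do |t_char, j|
      cost = s_char == t_char ? 0 : 1
      current << [
        previous[j + 1] + 1,   # deletion
        current[j] + 1,        # insertion
        previous[j] + cost     # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

word = "\u0135a\u016Ddo"          # "ĵaŭdo"
puts word.unpack("U*").length     # 5 code points
puts word.bytesize                # 7 bytes in UTF-8
puts levenshtein(word, "jaudo")   # 2: two letters differ
```

Because the comparison happens on integers, a two-byte letter counts as one substitution, which is what you want from an edit distance.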

And maybe this, because that guy seems to know what the heck this jcount thing I found in /usr/lib/ruby/1.8/jcode.rb is all about, except that I’m going to have to use this because I couldn’t really tell you off hand what ここではまずはじめに jcode モジュールを呼び出します．つぎに，漢字コードをEUCに指定しています．その上で日本語用に追加された jlength により文字数を計算しています means.

Okay well specifically I don’t know what 指定する and 計算する mean.

It’s always the verbs that get you.

We’ll figure all that out tomorrow, mmkay?


Getting Unicode, MySql, and Rails to Cooperate

June 23, 2005 @ 11:22 am · Filed under Ruby, Unicode

In a post about Ruby and Unicode a while back, I mentioned the page at the Rails wiki called How To Use Unicode Strings in Rails. (Btw, check out Why the Lucky Stiff’s response to my post, some useful code there.)

It turns out that there were a few more mostly MySQL-specific steps involved in getting Unicode to work correctly with Rails. So I thought I’d describe all the steps we went through to get it set up in one place. This has only been tested with MySQL 4.1.

In MySQL: Set the Encoding when you Create Tables

You need to explicitly tell MySQL that you want your tables to be encoded in UTF-8. Here’s a sample table with 3 columns, id, foo, and bar:

create table samples (
    id int not null auto_increment,
    foo varchar(100) not null,
    bar text not null,
    primary key (id)
) Type=MyISAM CHARACTER SET utf8;

The line Type=MyISAM CHARACTER SET utf8; is where the action is. The table type has to be MyISAM, not InnoDB, because unfortunately InnoDB tables don’t support full text searching of UTF-8 encoded content (details here). Apparently InnoDB tables are more flexible in general, but if full-text search is crucial for you, you’ll have to go with MyISAM.

Bummer, that.

In any case, then add the CHARACTER SET utf8 directive, as shown.

(I wonder if there is some way to set this as a default, without adding the line to the DDL for every table?)

Set the “charset” and “Content-type” in the Application Controller

This is also described at How To Use Unicode Strings in Rails.

class ApplicationController < ActionController::Base
  before_filter :set_charset
 
  def set_charset
    @headers["Content-Type"] = "text/html; charset=utf-8" 
  end
end

The previous steps were enough to get Unicode showing up in a little test app I generated with scaffold (the app just consisted of an input field and a textarea).

To test it, I pasted in some sample text in various languages. It worked okay for text containing only the characters found in ASCII or latin1, but among other characters there were weird cases of random characters being removed or added. The problem seemed not to have anything to do with the script (i.e., the Unicode block). For instance, in an Esperanto text, “ĉ” (U+0109 LATIN SMALL LETTER C WITH CIRCUMFLEX) came out fine, but “ĝ” (U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX) was borked. Go figure.

It took some digging to find the next bit — thanks to Ben Jackson of INCOMUM Design & Conceito for getting the straight story on the Rails list. The solution is…

Tell Rails to tell MySQL to Use UTF-8. (Got that?)

It came down to a MySQL configuration option: You have to tell MySQL to SET NAMES UTF8, as DHH pointed out in the previous link. You can either do it in the source to ActiveRecord, in mysql_adapter.rb, or you can just make the change in your own application. We chose the latter route.

So, here’s our app/controllers/application.rb as it stands:

class ApplicationController < ActionController::Base
  before_filter :set_charset
  before_filter :configure_charsets

  def set_charset
    @headers["Content-Type"] = "text/html; charset=utf-8"
  end

  def configure_charsets
    @response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Set connection charset. MySQL 4.0 doesn't support this so it
    # will throw an error, MySQL 4.1 needs this
    suppress(ActiveRecord::StatementInvalid) do
      ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
    end
  end
end

(In case you’re wondering, yes, you can have more than one before_filter.)

UPDATE Er, but you can get away with just one: Jonas refactored it. Use this version instead.

class ApplicationController < ActionController::Base
  before_filter :configure_charsets

  def configure_charsets
    @response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Set connection charset. MySQL 4.0 doesn't support this so it
    # will throw an error, MySQL 4.1 needs this
    suppress(ActiveRecord::StatementInvalid) do
      ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
    end
  end
end

UPDATED AGAIN Argh, I forgot to remove the set_charset definition before. It should be correct now…

Anyway, now our Rails installation seems to handle (almost) any nutty writing system that we throw at it.

I’m not sure these are all the Right Way™, I’m just saying it’s worked for us (so far).

For one thing, it seems like it might make sense to just go ahead and set the default character set and collation for MySQL to UTF-8, independently of any Rails stuff at all. Aren’t all character sets supposed to be slouching toward Unicode, anyway? But that sounds like a rather apocalyptic measure, for some reason… I guess I’ll hold off on the rapture of the character sets.

Collation is its own topic, but then we’ve not gotten to sorting anything yet. Here’s a recommendation for setting the collation order when building the tables:

CHARACTER SET utf8 COLLATE utf8_general_ci;

Wading through the MySQL docs on that is next on the agenda.


Hindi and Unicode

June 19, 2005 @ 1:05 pm · Filed under Language, Tech, Unicode

यूनिकोड क्या है?
What is Unicode? in Hindi

DIT gives push to language software : HindustanTimes.com

The contents of the free CD will include Hindi language true type fonts with keyboard driver, Hindi Language Unicode Compliant Open Type Fonts, generic fonts code and storage code converter for Hindi, Hindi language version of Bharateeya OO, Firefox Browser in Hindi, Multi Protocol Messenger in Hindi, Email Client in Hindi among others.

This is forward-thinking on the part of the Indian government; for a long time it seemed to be the case that the only major website that encoded Hindi in UTF-8 was a foreign site, BBCHindi. Most news sites in Hindi use any of a bewildering array of proprietary encodings, each with a proprietary font to accompany it (intended, presumably, to lock in users).

But India is a country which stands to benefit more than most from Unicode: not only does it have a huge variety of languages, it has a large number of scripts (which are already defined in Unicode). Standardizing on a single character set will make it much easier to localize software and spread digital literacy.

And literacy, period…

Whether these efforts will be officially extended to other languages and scripts in India remains to be seen, but the fact that it’s been done in Unicode for Hindi will make the path much easier.

Incidentally, all of this is related to other domains besides news: email, for instance. (Consider one blogger’s criticism of Yahoo Mail… gaping void: Why Yahoo will not be my primary mail client?)

See also: वेब पर हिन्दी - हिन्दी - hindi A blog on the Hindi language, in Hindi and English.


Ruby and Unicode

June 11, 2005 @ 4:28 am · Filed under Language, Ruby, Unicode

I’ve been looking into Ruby’s Unicode support, since I’m working on a Rails project. I had to jump through some hoops to figure out how to get Ruby to handle UTF-8 — it’s not too well documented.

The short answer can be found here: How To Use Unicode Strings in Rails. Bottom line: prefix your code with:

$KCODE = 'u'
require 'jcode'

… and replace length with jlength. You don’t have to change anything else in your source, which is rather nice. (In Python, for instance, you have to label Unicode strings.) You can just put Unicode stuff right in your source files, and pretty much think of those strings as “letters” in an intuitive way. Pretty much.

That’s the way I think of Unicode: it allows you to think of letters as letters, and not as “a letter of n bytes.” (Remember letters, o children of the computer age?)

Behind the scenes, your content will (probably) be stored in UTF-8, which is a variable-length, multi-byte encoding. This means that when you see the following:

U+062A ARABIC LETTER TEH
ت

That single letter is actually two bytes behind the scenes (0xD8 and 0xAA).

However, when you see:

U+3041 HIRAGANA LETTER SMALL A
ぁ

…there are three bytes behind the scenes (0xE3, 0x81, and 0x81). ASCII letters are still one byte, as ASCII is a subset of Unicode.
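If you want to check those byte values yourself, here is a small Ruby sketch. It assumes a modern Ruby (1.9 or later, where `String#bytes` and `String#encode` are available); the helper name `utf8_bytes` is made up for the illustration.

```ruby
# Hypothetical helper: the UTF-8 bytes of +text+ as hex strings.
def utf8_bytes(text)
  text.encode("UTF-8").bytes.map { |byte| format("0x%02X", byte) }
end

p utf8_bytes("\u062A")   # ARABIC LETTER TEH: two bytes, 0xD8 and 0xAA
p utf8_bytes("\u3041")   # HIRAGANA LETTER SMALL A: three bytes, 0xE3 0x81 0x81
p utf8_bytes("A")        # an ASCII letter: a single byte, 0x41
```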

Unicode hides all that nonsense from you as a programmer. You can tell a program to count the number of letters in an Arabic word or a Japanese word and it will tell you what you really want to know: how many letters are in those words, not how many bytes. Who cares how many bytes happen to be used to encode a U+062A ARABIC LETTER TEH? It’s just a letter!

So yeah. End rant.

But there are still some rough patches in Ruby’s Unicode support (or in my understanding of it; a correction of either would be appreciated).

For instance… in Mike Clark’s intro to learning Ruby through Unit Testing, he suggests testing some of the string methods like this:

require 'test/unit'
 
class StringTest < Test::Unit::TestCase
  def test_length
    s = "Hello, World!"
    assert_equal(13, s.length)
  end
end

So let’s use the Greek translation of “Hello, World!”

Καλημέρα κόσμε!

That has 15 characters, including the space and the exclamation mark. Let’s test it:

require 'test/unit'
require 'jcode'
$KCODE = 'UTF8'
 
class StringTest < Test::Unit::TestCase
  def test_jlength
    s = "Καλημέρα κόσμε!"
    assert_equal(15, s.jlength) # Note the 'j'
    assert_not_equal(15, s.length) # Normal, non unicode length
    assert_equal(28, s.length) # Greek letters happen to take two-bytes
  end
end

All that works as expected. In fact, I went and looked in the Pickaxe book and there was an example just like this.

But I’ll leave you with a few tests that fail, and seem to me like they shouldn’t (or am I misunderstanding?).

  def test_reverse
    # String#reverse reverses bytes, not characters
    # there are ways around this, but...
    s = "Καλημέρα κόσμε!"
    reversed = "!εμσόκ αρέμηλαΚ"
    srev = s.reverse
    assert_equal(reversed,srev) # fails
  end
 
  def test_index
    # String#index isn't Unicode-aware, it's counting bytes
    # there are ways around this, but...
    s = "Καλημέρα κόσμε!"
    assert_equal(0, s.index('Κ')) # passes
    assert_equal(1, s.index('α')) # fails!
    assert_equal(2, s.index('α')) # passes; byte offset 2 is the 3rd byte!
  end

Neither of those works.

As Mike mentioned, there are about a bazillion methods in String, so there’s a lot more testing that could be done. I guess one approach to problems like these would be to write jindex, jreverse, and so on. The approach I have in mind (converting strings to arrays) would probably be slow… these are the kind of functions that I imagine would best be implemented way down in C, where linguistics geeks like myself dare not tread.
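Here is roughly what I mean by the array approach, as a minimal sketch. The names `jreverse` and `jindex` are hypothetical (they just follow the `jlength` naming convention and are not part of jcode as far as I know), and the sketch assumes the strings are valid UTF-8.

```ruby
# Hypothetical helper: reverse a UTF-8 string character by character
# by going through an array of code points.
def jreverse(str)
  str.unpack("U*").reverse.pack("U*")
end

# Hypothetical helper: character (not byte) offset of the first
# occurrence of +substring+ in +str+, or nil if it is absent.
def jindex(str, substring)
  haystack = str.unpack("U*")
  needle   = substring.unpack("U*")
  (0..haystack.length - needle.length).find do |i|
    haystack[i, needle.length] == needle
  end
end

s = "Καλημέρα κόσμε!"
puts jreverse(s)
puts jindex(s, "α")       # 1: the second character
puts jindex(s, "κόσμε")   # 9: just after the space
```

One caveat: reversing code points would scramble any text that uses combining marks, since a mark would end up in front of its base letter. The precomposed Greek letters in this sample avoid that problem.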

Thanks to Chad Fowler for catching an error in an earlier version of this post… oops!

UPDATE
Why the Lucky Stiff has some interesting ideas about how to get around these limitations, at least until they’re fixed in Ruby: RedHanded » Closing in on Unicode with Jcode


Programming in the browser…

March 10, 2005 @ 2:23 pm · Filed under MetaFieldmethods, Javascript, Unicode

Getting Unicode straight across platforms has been a huge hangup for me in trying to get together some tutorials on doing language processing with Python. And then, there’s another barrier to cross: how to deal with markup?

Generally speaking, what I’m interested in dealing with is text, but most multilingual text on the web is HTML.

One weird observation that keeps occurring to me is that you could teach text processing without teaching people to deal with setting up a programming environment at all: use Javascript.

This seems a little weird, but I think the reason that it seems weird is because people who work with text processing have never thought of Javascript as a real language. But it is a real language. And the barriers to programming in Javascript are incredibly low. (Go type javascript:alert('hello world!') on your address bar to see what I mean.)

And then, I was reading through some stuff on Crockford.com, and I came across this:

String is a sequence of zero or more Unicode characters. There is no separate character type.

Good grief! Music to my ears!

And as for dealing with HTML, well, Javascript has that abstraction built in. Try explaining to a newbie how to extract the text from an HTML page in Python. “Well, you start by subclassing a parser and…” Javascript is designed for a browser; and browsers are where all that markup stuff comes from in the first place: to turn a css rule into “put this text in a blue box in the corner,” the “text” bit is a given.

Of course, it still looks like C (or at least, it’s certainly not as friendly as Python), but I have to say, combining these characteristics with Greasemonkey opens up some very interesting possibilities… input/output becomes “go to this URL.” “Process the text” becomes “Paste this Greasemonkey script into the editor and run it; the result will be character distributions/statistical language ID/sentence splitting/keyword extraction/blah blah blah…”

Is it crazy to think that such things can be done in a learnable way with Javascript? I don’t think it is…

I’m just thinking out loud. But lately I’ve been thinking about all that Ajax stuff (and rolling it into my present project), and it’s gotten me thinking about the browser as a place to do programming. Kind of blue sky, yes, but certainly a fun angle on the topic of processing natural language.

