infundibulum

Fuggeddaboutit

October 23rd, 2006

Oh, awesome:

“Full text search in SQLite.”

Oh, the sound of a bazillion angels crying:

“The module currently uses the following generic tokenization mechanism. A token is a contiguous sequence of alphanumeric ASCII characters (A-Z, a-z and 0-9). All non-ASCII characters are ignored. Each token is converted to lowercase before it is stored in the index, so all full-text searches are case-insensitive. The module does not perform stemming of any sort.”

My forehead is really starting to hurt from banging it on the desk.
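To make the pain concrete, here is a rough Python sketch of the tokenizer as that quote describes it. This is not SQLite's code: the name `ascii_tokens` is made up for illustration, and I'm assuming "ignored" means non-ASCII characters simply break tokens rather than being stitched over.

```python
import re

# Hypothetical re-creation of the quoted rule: a token is a run of
# ASCII letters and digits, lowercased before indexing.
# Assumption: any other character (including all non-ASCII) ends a token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def ascii_tokens(text):
    """Return the lowercased ASCII-alphanumeric tokens found in text."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]

print(ascii_tokens("Full-Text Search in SQLite"))
# ['full', 'text', 'search', 'in', 'sqlite']

print(ascii_tokens("Caf\u00e9 Z\u00fcrich"))  # "Café Zürich"
# ['caf', 'z', 'rich']

print(ascii_tokens("ภาษาไทย"))  # Thai for "Thai language"
# []
```

Under those assumptions, accented European words get chopped into fragments and a Thai string produces no tokens at all, so there is nothing to search.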

Languages that don’t use Spaces?

October 23rd, 2006

I’m trying to build a list:

  • Thai
  • Japanese
  • Chinese (Cantonese, Mandarin, etc.)
  • Korean (oops, thanks dda)
  • Khmer
  • Burmese
  • Lao

I found this W3C slide which is relevant: W3C I18N Tutorial: CSS3 and International Text

I just haven’t checked yet, but don’t Burmese and Khmer fall into this category as well?



Update: dda has some insight on Korean in the comments, and I’ve added several other Southeast Asian languages, which seem to have the same system as Thai.

It’s not actually correct to say that Thai, Burmese, Lao, and Khmer don’t use spaces; they do. It’s just that they use them to separate long phrases, not single words. So the problems of indexing are similar to those of languages that don’t use spaces at all. The title of this post should have been “Languages that don’t use spaces between words, like, all the time?”
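A quick way to see the indexing problem is to split on whitespace and count what comes back. This is only a sketch: `space_chunks` is a made-up helper, and the Thai sample (roughly "Hello, my name is Somchai") is my own example, not something from the W3C slide.

```python
def space_chunks(text):
    """Naive 'word' splitter: break the text wherever there is whitespace."""
    return text.split()

english = "I am trying to build a list"
thai = "สวัสดีครับ ผมชื่อสมชาย"  # several words, but only one space

print(len(space_chunks(english)))  # 7
print(len(space_chunks(thai)))     # 2
```

The English sentence comes out as one chunk per word, while the Thai one comes out as two phrase-sized chunks. An index built this way would only match whole phrases, which is why these languages need a real word segmenter (usually dictionary-based) before indexing.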