Fuggeddaboutit
October 23rd, 2006
Oh, awesome:
“Full text search in SQLite.”
Oh, the sound of a bazillion angels crying:
“The module currently uses the following generic tokenization mechanism. A token is a contiguous sequence of alphanumeric ASCII characters (A-Z, a-z and 0-9). All non-ASCII characters are ignored. Each token is converted to lowercase before it is stored in the index, so all full-text searches are case-insensitive. The module does not perform stemming of any sort.”
My forehead is really starting to hurt from banging it on the desk.
That’s completely and utterly fucking useless. And very, very strange given the fact that SQLite does a great job with UTF-8 in general.
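To make the damage concrete, here is a minimal Python sketch of the rule as quoted above. This is not SQLite's code; `ascii_tokenize` is a made-up name, and I'm assuming that "ignored" means non-ASCII characters act as separators, the same way punctuation does.

```python
import re

# Runs of ASCII letters and digits, per the quoted description.
# Assumption: anything else (including non-ASCII letters) splits tokens.
_ASCII_TOKEN = re.compile(r"[A-Za-z0-9]+")

def ascii_tokenize(text):
    """Split text into lowercase ASCII alphanumeric tokens."""
    return [m.group(0).lower() for m in _ASCII_TOKEN.finditer(text)]

print(ascii_tokenize("Full text search in SQLite"))
# ['full', 'text', 'search', 'in', 'sqlite']
print(ascii_tokenize("Señor Müller"))
# ['se', 'or', 'm', 'ller']
print(ascii_tokenize("日本語"))
# []
```

Under that reading, a name like "Müller" ends up in the index as "m" and "ller", and text in a non-Latin script produces no tokens at all.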
But there is some hope:
“Soon, we hope to allow applications to define their own tokenizers (we in fact already have a generic tokenizer mechanism in our code; we just have yet to expose it to the outside world).”
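For comparison, a rough idea of what an application-defined tokenizer rule could look like, again as a Python sketch. It says nothing about the actual SQLite tokenizer interface, which isn't exposed yet; `unicode_tokenize` is a hypothetical helper that only illustrates the rule.

```python
import re
import unicodedata

# [^\W_]+ matches runs of Unicode letters and digits
# (str patterns in Python 3 are Unicode-aware by default).
_WORD = re.compile(r"[^\W_]+")

def unicode_tokenize(text):
    """Split text into case-folded tokens of Unicode letters and digits."""
    # Compose "n" + combining tilde into "ñ" so the regex sees one letter.
    text = unicodedata.normalize("NFC", text)
    return [m.group(0).casefold() for m in _WORD.finditer(text)]

print(unicode_tokenize("Señor Müller"))
# ['señor', 'müller']
```

This still does no stemming and no real word segmentation for scripts that don't put spaces between words, so it is a starting point at best.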
Maybe we should start a fund for buying and sending infrastructure developers the Unicode 5 book? Hm…
- Thijs van der Vossen @ 24 October 2006
I’m a little afraid to know what their definition of “generic” is… O.o
- pat @ 24 October 2006
generic as in your international language has to be written with letters, a to z, and their squiggly variants [àñÿôè etc…]?
- dda @ 26 October 2006
Hey, who authorized your ÿ???
- pat @ 26 October 2006
Are they completely, totally off their rocker?
I just caught up with Mark Liberman’s post and was still in the “oh, awesome” state… Grmph.
(But then, there’s a tool I use at work that has similar ideas about what text should be. I’ve taken to communicating with the maintainers while carefully replacing all occurrences of the letters D and O with “?”. Verrrry slowly some people seem to be getting the point.)
- chris @ 28 October 2006
You know, it just occurred to me that the last sentence in that paragraph is funny:
I know what they mean but… haha.
- pat @ 28 October 2006