Languages that don’t use Spaces?
October 23rd, 2006
I’m trying to build a list:
- Thai
- Japanese
- Chinese (Cantonese, Mandarin, etc)
- Korean (removed: oops, thanks dda)
- Khmer
- Burmese
- Lao
I found this W3C slide which is relevant: W3C I18N Tutorial: CSS3 and International Text
I just haven’t checked yet, but don’t Burmese and Khmer fall into this category as well?
Update: dda has some insight on Korean in the comments, and I’ve added several other Southeast Asian languages, which seem to use the same system as Thai.
It’s not actually correct to say that Thai, Burmese, Lao, and Khmer don’t use spaces; they do. It’s just that they use them to separate long phrases, not single words. So the problems of indexing are similar to those for languages that don’t use spaces at all. The title of this post should have been “Languages that don’t use spaces between words, like, all the time?”
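To make the indexing problem concrete, here is a minimal sketch of why splitting on whitespace stops working once spaces mark phrases instead of words. It uses English with the spaces stripped as a stand-in for real Thai or Khmer text, and `avg_token_length` is just a name made up for illustration.

```python
def avg_token_length(text: str) -> float:
    """Mean length of the whitespace-delimited chunks in text."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)


spaced = "hi where you going?"
unspaced = "hiwhereyougoing?"  # same text, no spaces between words

print(spaced.split())    # four chunks, each one an indexable word
print(unspaced.split())  # one chunk: the whole phrase becomes a single "word"
print(avg_token_length(spaced), avg_token_length(unspaced))
```

An index built on the second form would only match queries for the entire phrase, which is roughly the situation with phrase-level spacing.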
Nope, Korean uses spaces. 띄어쓰기, aka the proper use of spaces [which is not as easy as in western languages] is very important in Korean.
- dda @ 23 October 2006
Interesting, thanks dda. Kind of a “duh” on my part.
I googled “띄어쓰기” and I ran into this:
한국어 자동 띄어쓰기 교정기 데모 “Automatic Korean Word Spacing Error Corrector (KSPACER)”
Any idea what that’s about, or why it’s necessary?
- pat @ 23 October 2006
Hmm, digging around some more there seems to be a fair amount of research on automated word spacing in Korean, perhaps for the very reason you mention: the rules are complex and people make mistakes.
Automatic Segmentation of Words using Syllable Bigram Statistics (pdf) says, for instance:
From my barely-informed-on-Korean POV, it seems that these rules you describe engender enough variation that the same sorts of NLP approaches that are used for Thai and Japanese are applied to Korean.
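As a rough illustration of the general idea (not the method from that paper, which I haven’t reproduced), here is a toy sketch that learns from correctly spaced text how often a space falls between each adjacent pair of characters, then re-inserts spaces into unspaced text. It assumes character pairs as a stand-in for syllable bigrams; the function names and the 0.5 threshold are hypothetical.

```python
from collections import Counter


def train(spaced_sentences):
    """Count, per adjacent character pair, how often a space separated it."""
    seen = Counter()   # times the pair occurred at all
    split = Counter()  # times a space fell between the pair
    for sentence in spaced_sentences:
        words = sentence.split()
        chars = "".join(words)
        # Offsets in the unspaced string where a word boundary occurred.
        boundaries = set()
        pos = 0
        for word in words[:-1]:
            pos += len(word)
            boundaries.add(pos)
        for i in range(1, len(chars)):
            pair = chars[i - 1:i + 1]
            seen[pair] += 1
            if i in boundaries:
                split[pair] += 1
    return seen, split


def respace(text, seen, split, threshold=0.5):
    """Insert a space wherever the pair was split more often than threshold."""
    chars = "".join(text.split())
    out = []
    for i, ch in enumerate(chars):
        if i > 0:
            pair = chars[i - 1:i + 1]
            if seen[pair] and split[pair] / seen[pair] > threshold:
                out.append(" ")
        out.append(ch)
    return "".join(out)


seen, split = train(["hi where you going"])
print(respace("hiwhereyougoing", seen, split))
```

With one training sentence this only memorizes it; a real system would need a large corpus and some smoothing for pairs it has never seen.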
- pat @ 24 October 2006
yep. in Thai, we rarely use spaces between words.
and in a case where we intentionally do, to some people it emphasizes the word/phrase. like:
hiwhereyougoing? → hiwhere you going?
- bact' @ 25 October 2006
Hindi, Arabic and Hebrew may not as well. No spaces around words.
- bact' @ 25 October 2006
But I’m not sure.
I’m pretty sure most Indic languages, including Hindi, and Arabic and Hebrew all use spaces to distinguish words.
Actually, I thought of a somewhat more systematic way to test this issue; I’ll write it up in another post.
- pat @ 25 October 2006
pat
The Korean link you showed is a “spell checker” for proper space usage. In Korean, space usage is grammatical, i.e. it is neither optional nor custom-built. But this is the 21st Century, where people think a pen is something to whirl around your fingers or decorate your shirt pocket with, not to write with, and where typing generally equates to “I am too busy to pay attention to hair-splitting details like grammar or spelling”; thus, in Korea like in the West, the young ‘uns are starting to forget how to write proper [insert your favourite language]. And this page helps fix the problem – at least for the space problem…
- dda @ 26 October 2006
Hebrew and Arabic both put spaces between words. Not sure about Hindi, but I believe it does as well.
Ethiopic script (Amharic), on the other hand, is kind of weird in that it has a word separator character in place of a space, at least in traditional usage. Computers seem to have instituted the use of spaces in texts typeset in Ethiopic.
- don hosek @ 2 November 2006
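For anyone indexing Ethiopic text, a minimal sketch of handling both conventions: it assumes the separator in question is the Unicode character U+1361 (ETHIOPIC WORDSPACE) and simply treats it like a space before splitting. The function name and the sample string are made up for illustration.

```python
ETHIOPIC_WORDSPACE = "\u1361"  # the traditional word separator, '፡'


def split_words(text: str) -> list:
    """Split on ordinary whitespace and on the Ethiopic wordspace alike."""
    return text.replace(ETHIOPIC_WORDSPACE, " ").split()


# Illustrative sample: two words joined by the traditional separator.
sample = "ሰላም" + ETHIOPIC_WORDSPACE + "ዓለም"
print(split_words(sample))
```

Text that already uses plain spaces passes through unchanged, so the same function covers traditional and computer-typeset usage.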