Here’s a bit on what’s holding up progress on Fieldmethods, and some thoughts about Python and Unicode.
Consider what’s probably the simplest multilingual application imaginable: open some text from a file and print it out again. As an example, I used some text in Georgian, which I snagged from this Unicode.org page. (A good place for finding random samples of Unicode text in different languages.)
>>> import os,sys
>>> sys.getdefaultencoding()
'ascii'
>>> # Let's open up our utf-8 encoded Georgian text:
>>> georgian = open('georgian.txt').read()
>>> # Now we convert the text to Unicode:
>>> georgian = unicode(georgian)
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in -toplevel-
    georgian = unicode(georgian)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
>>> # Oops! Kartvelian chaos!
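The failure comes from unicode() falling back on the default ascii codec. One way around it, without touching any configuration, is to read the raw bytes and name the codec yourself. Here's a minimal sketch of that; the sample word and the temporary file are my own stand-ins for georgian.txt, not the text from the Unicode.org page.

```python
import io
import os
import tempfile

# Stand-in for georgian.txt: the Georgian word for "Georgian", written with
# escapes so the snippet itself stays pure ASCII. (Assumed sample data.)
sample = u"\u10e5\u10d0\u10e0\u10d7\u10e3\u10da\u10d8"

path = os.path.join(tempfile.mkdtemp(), "georgian.txt")
with io.open(path, "wb") as f:
    f.write(sample.encode("utf-8"))

# Read bytes, then decode them with an explicit codec instead of relying
# on whatever sys.getdefaultencoding() happens to be.
with io.open(path, "rb") as f:
    raw = f.read()
georgian = raw.decode("utf-8")
```

Each Georgian letter takes three bytes in UTF-8, so raw is three times as long as the decoded text. This works, but it means naming the codec at every read, which is exactly the kind of detail I'd like readers not to have to think about.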
Do you want to get into character encoding and code pages and codecs and…? Well, maybe you do. I think it’s kind of interesting, myself, in the way that, uh, Rubik’s Cubes are interesting.
But my readers at Fieldmethods won’t. They’ll turn around and say “Sorry, too techie, I don’t have time for this.”
So I just want to tell them how to avoid thinking about encodings as much as possible. That means making utf-8 the default, and here’s what has to happen.
They need to take a few steps to set up a multilingual prompt.
Here’s what it looks like when everything is working:
Great! See that Georgian text in there? If you have the fonts, and you have Python configured correctly, you get to deal directly with the text. You see text, not escape codes everywhere. (Although in certain circumstances, as in that sixth line, you still see escape codes. I’m not sure why that is, but we’ll have to be happy with visible text after print statements.) Note also that we’re not going to be inputting random stuff from the keyboard; we’re just going to read files, but we still want to be able to see what's in them.
So here are the three platforms I’m targeting:
- Linux
- Windows XP
- OSX
Now, one of Python’s strong points is that it’s pretty consistent across these platforms (the IDLE editor, in particular, is almost identical across all three). And the Unicode support is there.
Unfortunately, the default configuration for a new install of Python, on any platform, is not set up to encourage use of Unicode in this way. Namely, the default encoding is not utf-8 but ascii. Why ascii? Inertia, I guess, but I really have no idea. Arguments about default encodings, codecs, conversions, etc., go on endlessly in Python newsgroups. But pretty much everyone agrees that you can’t cover too many languages with ASCII. Not really even English, if you ask me.
Mostly such arguments are based on the idea that everything has to be portable between systems. But that’s not my prime consideration. My prime consideration is that programs be portable between human languages. We’re going to be “doing science” on language, and it only makes sense if we can apply it to any language. If I write a function count_letters(), I want it to work with English, French, Georgian, Persian, Cakchiquel, WHATEVER.
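To make that concrete, here's a rough sketch of what such a count_letters() might look like. The body is my guess at a reasonable implementation, not a finished design: it treats any character whose Unicode general category starts with "L" as a letter, so it isn't tied to one alphabet. It assumes the text has already been decoded to Unicode, and it skips combining marks, so decomposed accents would go uncounted.

```python
import unicodedata
from collections import Counter

def count_letters(text):
    """Count each letter in a Unicode string, ignoring everything else."""
    # General categories beginning with "L" (Lu, Ll, Lo, ...) are letters
    # in every script, so no alphabet is hard-coded here.
    return Counter(ch for ch in text
                   if unicodedata.category(ch).startswith("L"))

# English sample: punctuation is ignored.
english = count_letters(u"banana!")

# Georgian sample (two words, written with escapes): the space is ignored.
georgian = count_letters(
    u"\u10e5\u10d0\u10e0\u10d7\u10e3\u10da\u10d8 \u10d4\u10dc\u10d0")
```

The same function handles both samples without knowing anything about either language, which is the whole point.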
With that goal in mind, having the default encoding set to utf-8 is the way to go.
And that’s what I might need your help with:
How do I make it as painless as possible for users to set that as the default under all three systems?
I learned from Mark Pilgrim’s book how to change Python’s default encoding. What has to happen is that a file named sitecustomize.py has to be put in the Python library. The trickiness comes in with the fact that my readers won’t necessarily be as savvy about things like file permissions as Mark’s are.
The file needs to contain just two lines:
import sys
sys.setdefaultencoding('utf-8')
That’s it!
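Once the file is in place, a freshly started interpreter should report utf-8 as its default. Here's a small check along those lines; default_is_utf8() is a hypothetical helper name of mine, not something from Python or from Mark's book. (For what it's worth, Python 3 already defaults to utf-8 and has no sys.setdefaultencoding at all, so this whole step only applies to Python 2.)

```python
import sys

def default_is_utf8():
    """Return True if the interpreter's default encoding is UTF-8."""
    # Normalise spellings such as "UTF-8", "utf8" and "utf_8" before comparing.
    name = sys.getdefaultencoding().lower().replace("-", "").replace("_", "")
    return name == "utf8"

print(sys.getdefaultencoding())
print(default_is_utf8())
```

If this still reports ascii after a restart, the likely culprit is that sitecustomize.py isn't where Python looks for it at startup.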
So I plan to make a “Preliminaries” page on Fieldmethods that goes step-by-step through what you need to do to create that file, and where and how to put it where it needs to be on each platform. (And change it back, for whatever reason.)
Because of some nuttiness when Python starts up, you can’t just stick sitecustomize.py in your current directory, or some other directory that’s in your Python path.
Linux isn’t much of a problem: become root, and create the file in
/usr/lib/python2.3/site-packages/
or wherever your Linux distro puts your site-packages directory.
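If you're not sure where that directory is, Python can usually tell you. Here's a sketch using the standard sysconfig module, which is an assumption about your setup: it only exists in Python 2.7 and later, and some distros use a differently named directory (Debian's dist-packages, for one). sitecustomize_path() is just a name I made up for the helper.

```python
import os
import sysconfig

def sitecustomize_path():
    """Return where sitecustomize.py would live for this interpreter."""
    # "purelib" is the install directory for pure-Python third-party
    # modules, which is normally the site-packages directory.
    site_packages = sysconfig.get_paths()["purelib"]
    return os.path.join(site_packages, "sitecustomize.py")

print(sitecustomize_path())
```

The same call works on Windows and OSX too, which should save me from hard-coding three different paths in the instructions.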
I’m not much of a Windows guy, but it’s my impression that under XP all you have to do is save sitecustomize.py in C:\Python23\ or C:\Python23\site-packages\. I don’t think Windows has any concept of “root” to speak of, so it seems that it’s just a matter of opening the file in one of those directories and saving it.
Now, OSX, I’m not so sure about: how do you log in as root? How do you write to that directory? If anyone can help me out I’d appreciate it.
Update:
I’ve gotten some help on the OSX front. It looks like there is a distinction between “admin” users and non-admin users, and the “admin” users pretty much have root. Most people who have their own Macs will probably be admin users, so I’ll make that assumption. (Basically you either use sudo or you drag the file with the file system, and get prompted for a password. Not too complex.) Off to write up these three sets of steps.
(Thanks for the help, anonymous Mac guy.)