UnicodeEncodeError in add_action_print with non ascii files names
Nir Soffer
nirs at freeshell.org
Sun Feb 5 06:31:38 GMT 2006
On 5 Feb, 2006, at 6:50, John A Meinel wrote:
> It isn't that Mac doesn't support unicode filenames, but that it
> normalizes them.
It decomposes them into multiple characters ('a with a circle' -> 'a' +
'circle'), as explained here:
<http://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/FileEncodings.html>
Note that according to the doc above, any system routine should be
called with the decomposed form, but both forms work with Python
os.path.exists.
> Probably this doesn't matter for Hebrew characters,
> because they don't have combiners. But the European character 'å'
> has 2 forms: u'a\u030a' and u'\xe5'. The former is 'a +
> circle', the latter is 'a with circle'.
>
> The issue is that XML states that the latter should be used, while Mac
> OS X creates files with the former normalization.
>
> So if you go to Mac and do:
>
> python
> >>> import os
> >>> open(u'\xe5', 'wb').write('hello')
> >>> os.listdir(u'.')
> [u'a\u030a']
> >>> print open(u'\xe5', 'rb').read()
> hello
> >>> print open(u'a\u030a', 'rb').read()
> hello
>
> Mac will let you access the file with either method, as it treats them
> the same.
>
> The problem for bzr is that on Linux, you might create the file
> '\xe5.txt', and then bzr will record that filename. Then if you check
> that project out on Mac, it will create what it thinks is '\xe5.txt',
> but when it tries to list the directory, that file has disappeared, and
> this unknown 'a\u030a.txt' file has appeared.
>
> Anyway, right now Mac OS X is the only filesystem that seems to do this.
> Windows & Linux leave the normalization alone. That means on Linux you
> can have 2 files which *look* like the same filename. Windows doesn't
> seem to understand \u030a, and just puts a box for the unknown
> character.
>
> We discussed the issue, and decided that it made the most sense to
> always normalize filenames internally, and complain if the user tries to
> add a non-normalized filename. (On Mac you can't create one.)
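The policy described above could be sketched roughly like this, using only the standard unicodedata module. The names normalized_filename and check_can_add are hypothetical, and picking NFC as the internal form is an assumption based on the remark that XML prefers the precomposed form; this is not bzr's actual code.

```python
import unicodedata


def normalized_filename(name):
    """Return the precomposed (NFC) form of a unicode filename."""
    return unicodedata.normalize('NFC', name)


def check_can_add(name):
    """Refuse a filename that is not already in NFC form.

    Hypothetical helper: raises ValueError for a non-normalized name
    and returns the name unchanged otherwise.
    """
    if normalized_filename(name) != name:
        raise ValueError('filename %r is not in normalized (NFC) form' % (name,))
    return name
```

With this sketch, check_can_add(u'\xe5.txt') returns the name, check_can_add(u'a\u030a.txt') raises ValueError, and a name read back from a Mac directory listing can be passed through normalized_filename before comparing it with the recorded one.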
Using PyObjC, this function precomposes a name back to the common form:
>>> from Foundation import NSString
>>> def normalize(name):
...     return NSString.stringWithString_(name).precomposedStringWithCanonicalMapping()
>>> normalize(u'a\u030a')
u'\xe5'
There is also - (NSString *)precomposedStringWithCompatibilityMapping,
which gives the same result with this test string. The first uses
Unicode Normalization Form C, the second Unicode Normalization Form KC
(I have no idea what the difference is :-) )
<http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/Classes/NSString.html#//apple_ref/occ/instm/NSString/precomposedStringWithCanonicalMapping>
I guess the same call is available through Carbon/CoreFoundation.
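Outside Cocoa, Python's standard unicodedata module offers the same normalization forms, so a rough equivalent of the normalize function above can be sketched without PyObjC. The sample strings are illustrative assumptions: the two forms agree for 'a' + combining ring, but appear to differ for a compatibility character such as the 'fi' ligature (U+FB01), which only Form KC expands.

```python
import unicodedata

decomposed = u'a\u030a'   # 'a' followed by COMBINING RING ABOVE
ligature = u'\ufb01'      # LATIN SMALL LIGATURE FI, a compatibility character

# Canonical composition (Form C) and compatibility composition (Form KC)
# agree on the test string from this thread.
nfc = unicodedata.normalize('NFC', decomposed)
nfkc = unicodedata.normalize('NFKC', decomposed)

# They differ on compatibility characters: Form C keeps the ligature,
# Form KC replaces it with the plain letters 'f' and 'i'.
lig_nfc = unicodedata.normalize('NFC', ligature)
lig_nfkc = unicodedata.normalize('NFKC', ligature)
```

For filenames, Form C is probably the safer choice of the two, since Form KC can change which characters a name contains.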
I'll be happy to help with the Unicode support. I have some free time
later this month.
> By the way, it is nice to have some Hebrew characters. Do you have a
> specific meaning for 'שלום'? I've been collecting non-English words, and
> I prefer to have a translation with them.
שלום (sha-lom) is both hello and peace :-)
Maybe you will like Limon, which is a Free (GPL) Hebrew-English online
dictionary for Mac OS X, written using PyObjC. You can type English
words and get the Hebrew translation.
http://nirs.freeshell.org/limon/
Best Regards,
Nir Soffer