UnicodeEncodeError in add_action_print with non ascii files names

Nir Soffer nirs at freeshell.org
Sun Feb 5 06:31:38 GMT 2006


On 5 Feb, 2006, at 6:50, John A Meinel wrote:

> It isn't that Mac doesn't support unicode filenames, but that it
> normalizes them.

It decomposes them into multiple characters: 'a with a circle' -> 'a' +
'circle', as explained here:
<http://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/FileEncodings.html>

Note that according to the doc above, any system routine should be
called with the decomposed form, but both forms work with Python's
os.path.exists.
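
The decomposition itself can be reproduced without touching the filesystem. This is a minimal sketch, assuming Python 3 and the standard unicodedata module (neither appears in this thread). NFD is used here as an approximation of what the Mac filesystem does; Apple's decomposition is reported to differ from standard NFD for some characters.

```python
import unicodedata

composed = "\xe5"  # 'a with circle' as a single code point

# NFD splits the character into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", composed)

print([hex(ord(ch)) for ch in decomposed])  # ['0x61', '0x30a']
print(unicodedata.name(decomposed[1]))      # COMBINING RING ABOVE
```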

> Probably this doesn't matter for Hebrew characters,
> because they don't have combiners. But for the European character 'å'
> (u'\xe5') this has 2 forms. u'a\u030a', and u'\xe5', The former is 'a +
> circle', the latter is 'a with circle'.
>
> The issue is that XML states that the latter should be used, while Mac
> OS X creates files with the former normalization.
>
> So if you go to Mac and do:
>
> python
> >>> import os
> >>> open(u'\xe5', 'wb').write('hello')
> >>> os.listdir(u'.')
> [u'a\u030a']
> >>> print open(u'\xe5', 'rb').read()
> hello
> >>> print open(u'a\u030a', 'rb').read()
> hello
>
> Mac will let you access the file with either method, as it treats them
> the same.
>
> The problem for bzr is that on Linux, you might create the file
> '\xe5.txt', and then bzr will record that filename. Then if you check
> that project out on Mac, it will create what it thinks is '\xe5.txt',
> but when it tries to list the directory, that file has disappeared, and
> this unknown 'a\u030a.txt' file has appeared.
>
> Anyway, right now Mac OS X is the only filesystem that seems to do this.
> Windows & Linux leave the normalization alone. That means on Linux you
> can have 2 files which *look* like the same filename, Windows doesn't
> seem to understand \u030a, and just puts a box for the unknown character.
>
> We discussed the issue, and decided that it made the most sense to
> always normalize filenames internally. And complain if the user tries to
> add a non-normalized filename. (On Mac you can't create one).
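
The policy in the quoted paragraph, and the checkout problem described before it, could be sketched roughly as follows. This assumes Python 3, the standard unicodedata module, and NFC (the composed form) as the internal form. The function names are hypothetical and are not bzr's actual API.

```python
import unicodedata


def is_normalized(name):
    """Return True if name is already in composed (NFC) form."""
    return unicodedata.normalize("NFC", name) == name


def check_add(name):
    """Hypothetical check: reject a filename that is not in NFC form."""
    if not is_normalized(name):
        raise ValueError("filename is not normalized: %r" % (name,))
    return name


def find_on_disk(recorded, listing):
    """Match a recorded name against directory entries, ignoring normalization."""
    wanted = unicodedata.normalize("NFC", recorded)
    for entry in listing:
        if unicodedata.normalize("NFC", entry) == wanted:
            return entry
    return None
```

With this sketch, find_on_disk("\xe5.txt", ["a\u030a.txt"]) returns the decomposed entry instead of reporting the file as missing.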

Using PyObjC, this function precomposes a name back to the common form:

 >>> from Foundation import NSString
 >>> def normalize(name):
 ...     return NSString.stringWithString_(name).precomposedStringWithCanonicalMapping()

 >>> normalize(u'a\u030a')
u'\xe5'
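
For comparison, here is a sketch of the same function without PyObjC. It substitutes Python's standard unicodedata module for the NSString call, which is an assumption of this example and not what the session above uses. It is written for Python 3.

```python
import unicodedata


def normalize(name):
    # NFC is the canonical composed form (Unicode Normalization Form C).
    return unicodedata.normalize("NFC", name)


print(ascii(normalize("a\u030a")))  # '\xe5'
```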

There is also - (NSString *)precomposedStringWithCompatibilityMapping,
which gives the same result with this test string. The first uses
Unicode Normalization Form C, the second Unicode Normalization Form KC
(I have no idea what the difference is :-) ).

<http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/Classes/NSString.html#//apple_ref/occ/instm/NSString/precomposedStringWithCanonicalMapping>
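
The difference between the two forms shows up with compatibility characters such as ligatures. This is a small sketch using Python's standard unicodedata module in place of NSString (an assumption of this example), written for Python 3.

```python
import unicodedata

ring = "a\u030a"     # 'a' + combining ring
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# Both forms compose 'a' + ring into a single character.
assert unicodedata.normalize("NFC", ring) == "\xe5"
assert unicodedata.normalize("NFKC", ring) == "\xe5"

# NFC keeps the ligature; NFKC replaces it with plain 'f' + 'i'.
nfc = unicodedata.normalize("NFC", ligature)
nfkc = unicodedata.normalize("NFKC", ligature)
print(ascii(nfc), ascii(nfkc))  # '\ufb01' 'fi'
```

For filenames this suggests Form C is the safer choice, since Form KC can change which characters a name contains.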

I guess the same call is available through Carbon/CoreFoundation.

I'll be happy to help with the Unicode support. I have some free time  
later this month.

> By the way, it is nice to have some hebrew characters. Do you have a
> specific meaning for 'שלום'? I've been collecting non-english words, and
> I prefer to have a translation with them.

שלום (sha-lom) is both hello and peace :-)

Maybe you will like Limon, which is a Free (GPL) Hebrew-English online
dictionary for Mac OS X, written using PyObjC. You can write English
words and get the Hebrew translation.
http://nirs.freeshell.org/limon/


Best Regards,

Nir Soffer




