Unicode through filesystem tricks
John A Meinel
john at arbash-meinel.com
Fri Jan 13 07:03:16 GMT 2006
I just found something interesting about how the Mac filesystem handles
Unicode filenames.
Specifically, the problem shows up if we create a file named:
räksmörgås
This corresponds to the unicode string:
u"r\xe4ksm\xf6rg\xe5s"
Where \xe4 is the letter 'a' with two dots on it (U+00E4, LATIN SMALL
LETTER A WITH DIAERESIS).
However, the string we get back from the filesystem is:
u"ra\u0308ksmo\u0308rga\u030as"
You'll notice that this string uses:
u"a\u0308"
\u0308 is COMBINING DIAERESIS, the 'put two dots above the previous
character' code, and \u030a is COMBINING RING ABOVE.
If I print either of the above 2 strings, they both look correct.
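The two spellings look like what Unicode calls the composed (NFC) and decomposed (NFD) forms of the same text. A minimal sketch to check that with the standard unicodedata module, assuming Python 3 (the variable names are mine, not anything in bzr):

```python
import unicodedata

composed = u"r\xe4ksm\xf6rg\xe5s"                # what we asked the filesystem to create
decomposed = u"ra\u0308ksmo\u0308rga\u030as"    # what the Mac filesystem handed back

# Names of the combining code points in the string from the filesystem.
combining_names = [unicodedata.name(c) for c in decomposed
                   if unicodedata.combining(c)]

# The two strings differ as code point sequences...
same_raw = (composed == decomposed)
# ...but agree once both are put into the same normalization form.
same_nfc = (unicodedata.normalize("NFC", decomposed) == composed)
same_nfd = (unicodedata.normalize("NFD", composed) == decomposed)
```

Here same_raw is False while same_nfc and same_nfd are both True, so the two strings are different spellings of one name rather than different names.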
What is also interesting (and problematic), is:
>>> u"ra\u0308ksmo\u0308rga\u030as".encode('iso-8859-1')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0308' in
position 2: ordinal not in range(256)
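One possible way around this error, sketched under the assumption that every character in the name has a precomposed equivalent in the target encoding, is to compose the string (NFC) before encoding it. This is an illustration in Python 3, not a description of what bzr does:

```python
import unicodedata

decomposed = u"ra\u0308ksmo\u0308rga\u030as"

# Encoding the decomposed form directly fails, as in the traceback above,
# because U+0308 has no equivalent in iso-8859-1.
try:
    decomposed.encode("iso-8859-1")
    direct_encode_failed = False
except UnicodeEncodeError:
    direct_encode_failed = True

# After composing, each accented letter is a single code point below 256.
latin1_bytes = unicodedata.normalize("NFC", decomposed).encode("iso-8859-1")
```

Names that use combining sequences with no precomposed form would still fail to encode, so this only helps for text that iso-8859-1 can represent at all.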
Now, one interesting thing is that Mac will translate either one. So
you get the same file if you use open() with either filename. But to
'bzr' they look like 2 different unicode paths.
Note that these paths do have different forms in utf-8 as well.
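To make the utf-8 difference concrete, here is a small sketch (Python 3, variable names are mine) that encodes both spellings and compares the bytes:

```python
composed = u"r\xe4ksm\xf6rg\xe5s"
decomposed = u"ra\u0308ksmo\u0308rga\u030as"

# Each precomposed letter becomes one 2-byte sequence in utf-8.
composed_utf8 = composed.encode("utf-8")
# Each decomposed letter becomes a 1-byte base letter plus a 2-byte combining mark.
decomposed_utf8 = decomposed.encode("utf-8")

utf8_differs = (composed_utf8 != decomposed_utf8)
```

The composed form comes out as 13 bytes and the decomposed form as 16, so anything that compares paths byte for byte will treat them as unrelated.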
Basically, it seems we need some sort of unicode normalization. Or at
least to use the same normalization that Mac does. I'm guessing that
projects with unicode filenames created on a Mac will work everywhere
else, but that projects created under Linux might fail to check out under
Mac, since it translates behind the scenes.
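A minimal sketch of what "some sort of normalization" could look like when comparing paths. The helper names normalize_path and same_path are hypothetical, and the choice of NFC as the default is an assumption; the decomposed string returned above suggests the Mac side uses something close to NFD, but that is an inference from this one example.

```python
import unicodedata

def normalize_path(path, form="NFC"):
    """Return path in a single normalization form (hypothetical helper)."""
    return unicodedata.normalize(form, path)

def same_path(a, b, form="NFC"):
    """Compare two unicode paths after putting both in the same form."""
    return normalize_path(a, form) == normalize_path(b, form)

from_linux = u"r\xe4ksm\xf6rg\xe5s"
from_mac = u"ra\u0308ksmo\u0308rga\u030as"

match_nfc = same_path(from_linux, from_mac)          # compare in composed form
match_nfd = same_path(from_linux, from_mac, "NFD")   # compare in decomposed form
```

Either form gives the same answer for a comparison; what matters is that both sides agree on one form before paths are stored or compared.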
Oh, and because of the lack of normalization under linux, I can do:
x = u"r\xe4ksm\xf6rg\xe5s"
y = u"ra\u0308ksmo\u0308rga\u030as"
open(x, 'wb').write('x\n')
open(y, 'wb').write('y\n')
$ ls r*
räksmörgås räksmörgås
(well, that is how it looks on my terminal: two files that appear to
have the exact same name, in the same directory)
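A sketch of how such look-alike names could be detected in a directory listing. The function find_lookalikes is a hypothetical helper, and it works on a plain list of names so it does not depend on what the local filesystem allows:

```python
import unicodedata
from collections import defaultdict

def find_lookalikes(names, form="NFC"):
    """Group names that become identical after normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    # Keep only groups where more than one distinct spelling shares a key.
    return {key: found for key, found in groups.items()
            if len(set(found)) > 1}

names = [u"r\xe4ksm\xf6rg\xe5s", u"ra\u0308ksmo\u0308rga\u030as", u"README"]
clashes = find_lookalikes(names)
```

In practice the list of names could come from os.listdir() on the directory in question; on Linux the example above would then report the two files shown by ls as one clashing group.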
John
=:->