[win32] non-ascii/non-english file names: internal usage of file names

Thu Dec 1 07:58:39 GMT 2005

On Wed, Nov 30, 2005 at 16:31:36 -0600, John A Meinel wrote:
> Jan Hudec wrote:
> > On Wed, Nov 30, 2005 at 10:23:23 -0500, Aaron Bentley wrote:
> 
> ...
> 
> > Converting filenames from local encoding to unicode is not a problem (as bzr
> > can always refuse to work if it is not possible). But it IS a problem the
> > other way round. Say someone on iso-8859-2 system creates a file named 'kř'
> > (k&#rcaron; for those who can't display that character). And someone else on
> > iso-8859-1 system tries to check it out. Then bzr should not just throw up
> > its hands and say it's not possible.
> 
> But what would you have it do? It has no way to legally represent the
> file on the local system. I can think of possible workarounds, but what
> would you recommend?
>
> Also, we will run into it in other places. For example, Windows does not
> allow many characters (", \, :, *, etc) which are legitimate under unix.
> So we need some sort of resolution. Which could be not allowing checking
> out a current version which has bogus names, but allowing a checkout
> even if the names used to be invalid.
> Or somehow munging the names. But if you have munged the names, how do
> you munge them. Do you try to do it so that they are still considered
> internally to have the old name, the naive implementation would have
> them show up as a new file, and the old version being deleted. A second
> method would just have them automatically renamed.

I would prefer using xml character entity encoding -- so 'kř' ('k&#rcaron;')
would be 'k&#345;'. But I am not sure &, # and ; are allowed in
filenames on Windows. If they are not, we can either use some other
characters, or uri-encoded utf-8, so the same filename would be
'k%c5%99', which I think is allowed everywhere and does not even need
quoting. It extends to the other funny characters, e.g. 'a*b' ends up
as 'a%2ab'.
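
To make the uri-encoding fallback concrete, here is a minimal sketch in Python 3. It is only an illustration, not anything bzr does: the names escape_filename, unescape_filename and WINDOWS_FORBIDDEN are made up for this example, and the forbidden set is an assumption. A character is escaped only if the target encoding cannot represent it, if it is in the forbidden set, or if it is the escape character '%' itself (so the mapping stays reversible).

```python
from urllib.parse import unquote_to_bytes

# Assumed set of characters that cannot appear in a Windows file name
# component; hypothetical, adjust per platform.
WINDOWS_FORBIDDEN = set('"\\:*?<>|/')


def escape_filename(name, encoding, forbidden=WINDOWS_FORBIDDEN):
    """Percent-escape the UTF-8 bytes of characters the target cannot hold."""
    out = []
    for ch in name:
        try:
            ch.encode(encoding)
            representable = True
        except UnicodeEncodeError:
            representable = False
        if representable and ch not in forbidden and ch != "%":
            out.append(ch)
        else:
            # Lowercase hex, matching the 'k%c5%99' spelling used above.
            out.extend("%{:02x}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def unescape_filename(escaped):
    """Reverse escape_filename and return the original unicode name."""
    return unquote_to_bytes(escaped).decode("utf-8")


print(escape_filename("k\u0159", "iso-8859-1"))  # k%c5%99
print(escape_filename("a*b", "iso-8859-1"))      # a%2ab
```

On an iso-8859-2 system the same call leaves 'kř' untouched, so the escaping really is just a fallback for names the local encoding cannot express.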

Note that if the escape character is allowed and merely hard to type, I
think it's not a problem. It is only a fallback mechanism.

> I would imagine that there might be a valid near replacement for k&#rcaron;
> 
> But I'm also positive that I can write something in arabic, which has no
> replacement in latin-1. ??????????

It does not have a replacement in iso8859-2 either, and this time I'm
posting from a system I have not yet switched to unicode... Anyway, those
characters do have a unicode representation, which can be encoded to
entities or uri-escapes.
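
For the entity variant, present-day Python happens to ship an error handler that does exactly this substitution, so a sketch is short. This is only an illustration of the idea under that assumption; the helper names to_local and from_local are hypothetical, and note that a literal '&' in a name is not escaped here, so this form is not strictly reversible without extra care.

```python
import html


def to_local(name, encoding):
    """Encode name, replacing unencodable characters with &#NNN; entities."""
    return name.encode(encoding, "xmlcharrefreplace")


def from_local(raw, encoding):
    """Decode bytes from the local encoding and expand character entities."""
    return html.unescape(raw.decode(encoding))


# U+0159 (r with caron) has no iso-8859-1 code point.
print(to_local("k\u0159", "iso-8859-1"))  # b'k&#345;'

# Three Arabic letters: no replacement in iso-8859-1 or iso-8859-2.
arabic = "\u0645\u0644\u0641"
print(to_local(arabic, "iso-8859-2"))
```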

> I believe on windows (NT/XP) the real encoding is actually UTF-16, so it
> shouldn't be a problem there.

I believe it actually depends on the filesystem type, i.e. that they use
utf-16 on NTFS, but cp<whatever> on FAT. And they have two ways of
calling syscalls, one using cp<whatever> and another using utf-16.
But unices certainly do syscalls with whatever the locale encoding is, so
legacy unices do have a problem there.
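
The "two ways of calling" split is visible from present-day Python 3 (which postdates this thread, so take it as an illustration only): os.listdir() returns unicode names when given a str path and raw byte names when given a bytes path, with os.fsencode()/os.fsdecode() converting between the two using the filesystem encoding. The helper name list_both_ways is hypothetical.

```python
import os
import sys
import tempfile


def list_both_ways(path):
    """List a directory once as unicode names and once as raw byte names."""
    as_text = sorted(os.listdir(os.fsdecode(path)))
    as_bytes = sorted(os.listdir(os.fsencode(path)))
    return as_text, as_bytes


# The encoding Python uses to map between the two views on this system.
print(sys.getfilesystemencoding())

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.txt"), "w").close()
    text_names, byte_names = list_both_ways(tmp)

print(text_names, byte_names)
```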

> >> And if people scream, we can go to a more complex approach of requiring
> >> versioned files to be unicode, but not unversioned files in the tree.
> >>
> >> And if people scream, we can find ways to jam binary data into unicode,
> >> in one of the user-defined sections.
> > 
> > Well, 'latin-1' can always be decoded to unicode, so that part is not too
> > hard.
> 
> Sure, but then you always have to decode it into something, which can
> get really ugly.

Yes. Actually I think it's no big deal if users can't add files under, or
rename files to, names they can't express in their locale encoding. They will still be
able to use the files someone else added (but the software using such
names likely won't, because it won't be able to use them).

--
						 Jan 'Bulb' Hudec <bulb at ucw.cz>