ignore files with invalid filenames

John Arbash Meinel john at arbash-meinel.com
Wed Aug 8 16:02:11 BST 2007


Martin Pool wrote:
> On 8/7/07, Fabio Machado de Oliveira <absfabio at terra.com.br> wrote:
>> Hi again Martin,
>>
>> I found that "bzr pull" also has a problem with the existence of
>> unversioned files with invalid filenames, and I expect it to happen
>> with many other commands.
>>
>> I am wondering if it's a case of replacing all of the "os.listdir" calls
>> with something that already excludes these files, but I think it could
>> cost some performance, as there is a utf8 encoding cache that
>> would probably lose part or all of its performance gain.
>>
>> Or, if the patches I submitted are going in the right direction, I will
>> wait for someone to review that patch before trying to continue.
> 
> I think rather than replacing the calls individually, you probably
> want to put access to workingtree files under the control of the
> workingtree so that this policy is centralized.
> 
> I think it would be nice if files with invalid/unrepresentable names
> were not seen outside of the workingtree.
> 
> We need to decide just what should happen to files with invalid names.
> Should they just be ignored entirely, or should we give the user some
> kind of notification?  I think a good tradeoff would be:
> 
> 1 - if they explicitly name the file, give an error
> 2 - if it's just unknown or ignored, ignore it
> 
> I think we can accomplish that by
> 
> 1 - when a filename is given, if we can't decode it on the command
> line, or can't convert it into the fsencoding, error
> 2 - otherwise, when listing the workingtree, skip files that can't be decoded.
> 
> Not totally sure though...
> 

It would be nice if we could warn when a file is 'unknown' (not ignored,
not versioned) and its name cannot be interpreted. (It obviously can't be
versioned.)
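
A rough sketch of that policy, operating on raw byte-string names. The
names InvalidFilename, decode_explicit, decode_listed and the is_ignored
callback are all hypothetical, made up for illustration; none of them are
bzrlib APIs, and utf-8 is just an assumed filesystem encoding.

```python
import warnings


class InvalidFilename(Exception):
    """Hypothetical error for an explicitly named file that cannot be decoded."""


def decode_explicit(raw_name, encoding="utf-8"):
    # The user named this file directly, so an undecodable name is an error.
    try:
        return raw_name.decode(encoding)
    except UnicodeDecodeError:
        raise InvalidFilename(repr(raw_name))


def decode_listed(raw_names, is_ignored, encoding="utf-8"):
    # Names found while listing the tree: skip the ones that cannot be
    # decoded, and warn only if no ignore rule covers them ('unknown' files).
    decoded = []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            if not is_ignored(raw):
                warnings.warn("skipping undecodable filename %r" % (raw,))
    return decoded
```

Here an explicitly named bad file raises, while a bad file met during
listing is dropped, with a warning only when it is not ignored.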

My idea is that you could ignore it by using an appropriate pattern which
leaves out those characters. So to ignore "fo\xff\xff" you could ignore
"fo??". Or something like that.

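To illustrate the wildcard idea, here is a small sketch that uses the
standard library fnmatch module as a stand-in for bzr's own ignore
matching (an assumption; bzr's real matcher may behave differently). The
helper name is_ignored is hypothetical.

```python
import fnmatch


def is_ignored(raw_name, patterns):
    # Match the raw byte-string name against byte-string glob patterns.
    # '?' matches any single byte, so the pattern never has to spell out
    # the undecodable bytes themselves.
    return any(fnmatch.fnmatchcase(raw_name, pat) for pat in patterns)


bad_name = b"fo\xff\xff"  # not valid utf-8
print(is_ignored(bad_name, [b"fo??"]))
print(is_ignored(b"foo", [b"fo??"]))
```

Note that when matching on bytes, each '?' stands for one byte rather than
one character, so the number of wildcards has to match the byte count.
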
I should also chime in with some implementation details.

Python's os.listdir() API works like this: if you pass a Unicode string,
you get back Unicode paths. However, any path that cannot be decoded
comes back as an 8-bit string instead.

So actually, one way to detect bad filenames is to do:

import os

for path in os.listdir(u'.'):
    if isinstance(path, str):
        # An 8-bit str here means the name could not be decoded to Unicode
        ...


However, our walkdirs_utf8 code doesn't do this, specifically because
converting every path we encounter to Unicode is slower than we would
like. So we have _walkdirs_utf8, which is designed so that if the
filesystem is (theoretically) utf-8 encoded, we just return the paths
'as is'. That means we have to do the detection later.

Ultimately, I don't think we want an os.listdir() that returns utf-8
paths. I think catching it at an appropriate time (during _iter_changes,
etc) is fine. (Note that _iter_changes doesn't know whether files are
ignored or unknown, just that they are not versioned.)
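
As a sketch of what "catching it later" might look like (this is not the
actual _iter_changes code; iter_unversioned and its arguments are
hypothetical): versioned names are compared byte-for-byte and never
decoded, and only the unversioned leftovers pay for a decode attempt.

```python
def iter_unversioned(raw_names, versioned_utf8, encoding="utf-8"):
    """Yield (raw_name, decoded_name_or_None) for names that are not versioned."""
    for raw in raw_names:
        if raw in versioned_utf8:
            # Fast path: a versioned name matches its utf-8 bytes exactly,
            # so no decoding is needed.
            continue
        try:
            decoded = raw.decode(encoding)
        except UnicodeDecodeError:
            # Not representable; the caller decides whether to warn or skip.
            decoded = None
        yield raw, decoded


names = [b"README", b"notes.txt", b"fo\xff\xff"]
for raw, decoded in iter_unversioned(names, {b"README"}):
    print(repr(raw), repr(decoded))
```

Like _iter_changes, this only knows that a name is not versioned; telling
ignored from unknown would have to happen in the caller.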

John
=:->


