x.split('/') < y.split('/') is still wrong

John Arbash Meinel john at arbash-meinel.com
Tue Jul 10 19:31:14 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've been looking into my pyrex dirstate reader, and I figured I would cleanup
the bisect-on-disk code while I'm there. And it turns out that some of our
assumptions about the paths were sorted are incorrect.

At one point we were comparing "path1 < path2" which we know is incorrect.
So we switched to "path1.split('/') < path2.split('/')".
But I now think that it is incorrect as well.

Our sorting is actually:

 1) create an entry for the root, set root as 'current' directory
 2) list all children of the current directory
 3) put all children which are directories into a (FIFO) queue
 4) move to the next directory in the queue
 5) goto step 2 until done

If you are more programmatically minded it is:

        def _key(entry):
            # sort by: directory parts, file name, file id
            return entry[0][0].split('/'), entry[0][1], entry[0][2]

This is subtly, but importantly, different from "path.split('/')".

Specifically, take 4 paths: a, a-b, a/b, a=b (note that - < / < =)

If we just sort by string we get:
  a, a-b, a/b, a=b

path.split('/) will yeild:

  ['a']
  ['a-b']
  ['a', 'b']
  ['a=b']

And sorting by path.split('/') will give:
  a, a/b, a-b, a=b

But sorting based on:
  dirname, basename = os.path.split(path)
  dir_parts = dirname.split('/')
  (dir_parts, basename)

  Will give:
  a, a-b, a=b, a/b

Which is actually what we want.

Now, I've tried to work through '_iter_changes' and it seems like we can't
actually run into this bug. The specific bug is that it sorts paths like 'a-a'
after 'a/a', however, because we iterate in dirblock order, we've already
processed 'a-a' by the time we get to 'a/a'. I haven't really gone through all
the logical work to figure it out.

I only know that it is a specific issue, because when bisecting for a entry, it
makes a big difference (because you are comparing paths that are not going to
be next to next to eachother when iterating).

Maybe I just need to go get some food :). Anyway, it is a little hazy, but I
*think* we might have yet another latent DirState sorting bug. Certainly I know
that for arbitrary paths:
   cmp(x.split('/'), y.split('/'))
will give different values than:
   cmp((dirname_x.split('/'), basename_x), (dirname_y.split('/'), basename_y)
(if only because dirname('a') == '', and '', 'a' sorts differently than 'a'.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGk9BiJdeBCYSNAAMRAurxAKCEWfrXqEyNwziBwKOr1bczV+NBuQCfcfP6
73ZKQcdobPe4/y4qNX3fQcg=
=ToPU
-----END PGP SIGNATURE-----



More information about the bazaar mailing list