x.split('/') < y.split('/') is still wrong
John Arbash Meinel
john at arbash-meinel.com
Tue Jul 10 19:31:14 BST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I've been looking into my pyrex dirstate reader, and I figured I would cleanup
the bisect-on-disk code while I'm there. And it turns out that some of our
assumptions about the paths were sorted are incorrect.
At one point we were comparing "path1 < path2" which we know is incorrect.
So we switched to "path1.split('/') < path2.split('/')".
But I now think that it is incorrect as well.
Our sorting is actually:
1) create an entry for the root, set root as 'current' directory
2) list all children of the current directory
3) put all children which are directories into a (FIFO) queue
4) move to the next directory in the queue
5) goto step 2 until done
If you are more programmatically minded it is:
def _key(entry):
# sort by: directory parts, file name, file id
return entry[0][0].split('/'), entry[0][1], entry[0][2]
This is subtly, but importantly, different from "path.split('/')".
Specifically, take 4 paths: a, a-b, a/b, a=b (note that - < / < =)
If we just sort by string we get:
a, a-b, a/b, a=b
path.split('/) will yeild:
['a']
['a-b']
['a', 'b']
['a=b']
And sorting by path.split('/') will give:
a, a/b, a-b, a=b
But sorting based on:
dirname, basename = os.path.split(path)
dir_parts = dirname.split('/')
(dir_parts, basename)
Will give:
a, a-b, a=b, a/b
Which is actually what we want.
Now, I've tried to work through '_iter_changes' and it seems like we can't
actually run into this bug. The specific bug is that it sorts paths like 'a-a'
after 'a/a', however, because we iterate in dirblock order, we've already
processed 'a-a' by the time we get to 'a/a'. I haven't really gone through all
the logical work to figure it out.
I only know that it is a specific issue, because when bisecting for a entry, it
makes a big difference (because you are comparing paths that are not going to
be next to next to eachother when iterating).
Maybe I just need to go get some food :). Anyway, it is a little hazy, but I
*think* we might have yet another latent DirState sorting bug. Certainly I know
that for arbitrary paths:
cmp(x.split('/'), y.split('/'))
will give different values than:
cmp((dirname_x.split('/'), basename_x), (dirname_y.split('/'), basename_y)
(if only because dirname('a') == '', and '', 'a' sorts differently than 'a'.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGk9BiJdeBCYSNAAMRAurxAKCEWfrXqEyNwziBwKOr1bczV+NBuQCfcfP6
73ZKQcdobPe4/y4qNX3fQcg=
=ToPU
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list