[MERGE] faster iter_changes
Robert Collins
robertc at robertcollins.net
Tue Sep 23 01:08:48 BST 2008
On Mon, 2008-09-22 at 19:21 -0400, John Arbash Meinel wrote:
> John Arbash Meinel has voted comment.
> Status is now: Semi-approved
> Comment:
> I think this is going to take a few rounds of review of the different
> pieces. I think I like where it is going, but it isn't ready to land for
> 'tweak' (doesn't work on win32). But I don't really want it out of the
> queue via 'resubmit'. So for now I'm just voting 'comment'.
I think resubmit is appropriate, because it needs to be taken away and
worked on :).
> === modified file 'bzrlib/_dirstate_helpers_c.pyx'
>
> On win32 we don't have:
>
> +cdef extern from "arpa/inet.h":
> + unsigned long htonl(unsigned long)
This one is apparently in Winsock2.h.
>
> or
>
> +cdef extern from 'sys/stat.h':
> + int S_ISDIR(int mode)
> + int S_ISREG(int mode)
> + int S_ISLNK(int mode)
> + int S_IXUSR
>
> This may depend on the compiler tools. I'm currently using mingw32, and
> it
> seems to be fine with "S_ISDIR and S_ISREG" but fails trying to find
> S_ISLNK.
> More exploration is needed to know what VS 7/8/9 would have available.
> (They
> tend to put POSIX stuff into _foo, like _open and _stat sort of things.)
>
> Perhaps we just need a compatibility header, that on win32 S_ISLINK is
> defined
> as: "#define S_ISLNK(mode) (0)"
Could be.
Can we just not build this extension for windows for the moment? (Or can
I get you/markh/alexander/someone to port it?)
> And I don't think you meant to keep:
> +cdef extern from "stdio.h":
> + void printf(char *format, ...)
True, thanks.
>
> +from bzrlib import cache_utf8, errors, osutils
> +from bzrlib.dirstate import DirState, pack_stat
> +from bzrlib.osutils import pathjoin, splitpath
>
> ^- you don't actually use 'pack_stat' here. Also, I think it would be
> better to
> do "pathjoin = osutils.pathjoin" at the appropriate times, rather than
> importing them again here.
I'll look into these.
> +#cdef object _encode
> +_encode = binascii.b2a_base64
> +
>
> ^- I don't think cdef object gets you anything here, but commenting it
> out gets
> you even less.
Heh. I think I know why I did that. It will buy less name lookups if it
works.
> ...
>
> +
> +from struct import pack
> +cdef _pack_stat(stat_value):
>
>
> ^- This seems like a place, that if you could determine you had
> _walkdirs_win32._Win32Stat or _readdir_pyx._Stat you could access the
> values as
> C types, rather than converting into python and back down to C.
>
> I would also be *really* curious to poke a sprintf("%8X.%8X.%8X") in
> there, instead of
> the base64.encode() and see what we get.
I've profiled that; not significant in the past; overall performance may
be low enough for it to matter now.
Using .pyd files would be nice, but again - minimum pyrex version
dependencies.
> ...
> +cdef _update_entry(self, entry, abspath, stat_value):
> + """Update the entry based on what is actually on disk.
> +
> + :param entry: This is the dirblock entry for the file in question.
> + :param abspath: The path on disk for this file.
> + :param stat_value: (optional) if we already have done a stat on the
> + file, re-use it.
> +
> + :return: The sha1 hexdigest of the file (40 bytes) or link target
> of a
> + symlink.
> + """
>
> ^- 'stat_value' isn't optional anymore. We moved that code up the stack,
> and
> just forgot to update the docstring. We need to update the "def
> update_entry"
> docstring as well.
Thanks for noting that, will do while I'm here.
> + # TODO - require pyrex 0.8, then use a pyd file to define access to
> the _st
> + # mode of the compiled stat objects.
>
> ^- Do you really mean pyrex 0.8? That seems like a perfectly reasonable
> request. If you mean 0.9.8 that might be more difficult.
Ah yes, it was 0.9.8.
> + cdef int minikind, saved_minikind
> + cdef void * details
> + # pyrex 0.9.7 would allow cdef list details_list, and direct access
> rather
> + # than PyList_GetItem_void_void below
>
> I didn't realize this. Is that pyrex 0.9.7, or cython?
pyrex 0.9.7, one of the cython improvements.
> Regardless, 'details' is a Tuple, not a List object.
>
> Also "cdef _update_entry(self, ...)" since _update_entry is no longer a
> member
> of DirState, I think it might be more obvious to use
> "_update_entry(state,
> ...)".
Do you mean, move _update_entry from the iterator class to a top level
method ? Or just that it has a confusion parameter name?
> +cdef char _minikind_from_string(object string):
> + """Convert a python string to a char."""
> + return PyString_AsString(string)[0]
> +
>
> ^- This is a little unsafe in the case that 'string' is an empty string.
> Perhaps a quick "if PyString_Size(string) > 0" ?
Well, its unsafe in a bunch of ways; I'm trusting the rest of our code
here. To be _safe_ it needs a check-exact call, and a size call.
> ...
>
> +cdef object _minikind_to_kind(char minikind):
> + """Create a string kind for minikind."""
> + cdef char _minikind[1]
> + if minikind == c'f':
> + return _kind_file
> + elif minikind == c'd':
> + return _kind_directory
> + elif minikind == c'a':
> + return _kind_absent
> + elif minikind == c'r':
> + return _kind_relocated
> + elif minikind == c'l':
> + return _kind_symlink
> + elif minikind == c't':
> + return _kind_tree_reference
> + _minikind[0] = minikind
> + raise KeyError(PyString_FromStringAndSize(_minikind, 1))
>
> ^- Why do you use the "_minikind[0]" workaround, rather than &minikind?
pyrex preferred it IIRC. Anyhow I think its clearer like this.
> I'm skipping the big ProcessEntryC code block for now, I'll try to get
> back to
> it later.
>
> === modified file 'bzrlib/delta.py'
> --- bzrlib/delta.py 2008-04-24 07:22:53 +0000
> +++ bzrlib/delta.py 2008-09-15 03:23:53 +0000
> @@ -241,7 +241,7 @@
> (executable[0] != executable[1])))
> elif kind[0] != kind[1]:
> delta.kind_changed.append((path[1], file_id, kind[0],
> kind[1]))
> - elif content_change is True or executable[0] != executable[1]:
> + elif content_change or executable[0] != executable[1]:
> delta.modified.append((path[1], file_id, kind[1],
> content_change,
> (executable[0] != executable[1])))
>
>
> ^- any comment on why you had to do this?
yes, I'm returning 0 and 1, not False and True; so 'is True' does not
work anymore. This was purely performance based.
> ...
>
> === modified file 'bzrlib/dirstate.py'
> --- bzrlib/dirstate.py 2008-07-30 08:55:10 +0000
> +++ bzrlib/dirstate.py 2008-09-18 06:36:28 +0000
> @@ -211,6 +211,7 @@
> import time
> import zlib
>
> +import bzrlib
> from bzrlib import (
>
> ^- As near as I can tell, you do the import here, so that later you can
> do:
>
> def iter_changes(self):
> """Iterate over the changes."""
> utf8_decode = cache_utf8._utf8_decode
> ==> cmp_by_dirs = bzrlib.dirstate.cmp_by_dirs
> _process_entry = self._process_entry
>
> It seems like it would be a lot cleaner to just use a different variable
> name.
> Like "_cmp_by_dirs = cmp_by_dirs".
Sure, I can do that.
> ...
> +from bzrlib.osutils import pathjoin, splitpath
>
> Similarly, I find it better to use "pathjoin = osutils.pathjoin" when
> you want
> it as a simple function rather than a module lookup.
A lot of this code was moved; its easier I find to review moved code
when its not willy-nilly changed; these imports were like that in the
source file.
> ...
>
> @@ -2774,6 +2738,709 @@
> raise errors.ObjectNotLocked(self)
>
>
> +def py_update_entry(self, entry, abspath, stat_value,
> + _stat_to_minikind=DirState._stat_to_minikind,
> + _pack_stat=pack_stat):
>
> ^- Shouldn't this be moved into _dirstate_helpers_py.py ?
>
> Similarly for:
> +class ProcessEntryPython(object):
Oh, I guess. Not playing with the idiom quite-right, sorry.
> === modified file 'bzrlib/tests/test__dirstate_helpers.py'
>
> ...
>
> +
> +class TestUpdateEntry(test_dirstate.TestCaseWithDirState):
>
>
> I'm assuming this is mostly just a Cut & Paste from the test_dirstate
> tests?
Its a move from unparameterised to parameterised test, thats all. The
only change is to use the right function.
> I'll also note that you don't seem to have direct tests for the other
> helpers
> like pack_stat and ProcessEntry[C/Py]. I don't know that we have to, but
> I
> usually feel better having implementation tests to make sure that the
> Pyrex code performs identically to the Python code.
ProcessEntry is directly tested via InterTree:
> I suppose you are testing it indirectly with iter_changes and:
s/indirect/direct/ :)
> As mentioned, I haven't gone over the ProcessEntryX classes yet, but I
> figured
> I could at least start the discussion.
>
> Overall, I think the basic refactoring you've done is good. We still
> need to
> figure out how to get everything to compile on all platforms, etc.
I'd really appreciate a windows using dev to step up and help port it.
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080923/5bbb9cab/attachment.pgp
More information about the bazaar
mailing list