[MERGE] faster iter_changes

Robert Collins robertc at robertcollins.net
Tue Sep 23 01:08:48 BST 2008


On Mon, 2008-09-22 at 19:21 -0400, John Arbash Meinel wrote:
> John Arbash Meinel has voted comment.
> Status is now: Semi-approved
> Comment:
> I think this is going to take a few rounds of review of the different 
> pieces. I think I like where it is going, but it isn't ready to land for 
> 'tweak' (doesn't work on win32). But I don't really want it out of the 
> queue via 'resubmit'. So for now I'm just voting 'comment'.

I think resubmit is appropriate, because it needs to be taken away and
worked on :).

> === modified file 'bzrlib/_dirstate_helpers_c.pyx'
> 
> On win32 we don't have:
> 
> +cdef extern from "arpa/inet.h":
> +    unsigned long htonl(unsigned long)

This one is apparently in Winsock2.h. 

> 
> or
> 
> +cdef extern from 'sys/stat.h':
> +    int S_ISDIR(int mode)
> +    int S_ISREG(int mode)
> +    int S_ISLNK(int mode)
> +    int S_IXUSR
> 
> This may depend on the compiler tools. I'm currently using mingw32, and 
> it
> seems to be fine with "S_ISDIR and S_ISREG" but fails trying to find 
> S_ISLNK.
> More exploration is needed to know what VS 7/8/9 would have available. 
> (They
> tend to put POSIX stuff into _foo, like _open and _stat sort of things.)
> 
> Perhaps we just need a compatibility header, that on win32 S_ISLINK is 
> defined
> as: "#define S_ISLNK(mode) (0)"

Could be.

Can we just not build this extension for windows for the moment? (Or can
I get you/markh/alexander/someone to port it?)

> And I don't think you meant to keep:
> +cdef extern from "stdio.h":
> +    void printf(char *format, ...)

True, thanks.

> 
> +from bzrlib import cache_utf8, errors, osutils
> +from bzrlib.dirstate import DirState, pack_stat
> +from bzrlib.osutils import pathjoin, splitpath
> 
> ^- you don't actually use 'pack_stat' here. Also, I think it would be 
> better to
> do "pathjoin = osutils.pathjoin" at the appropriate times, rather than
> importing them again here.

I'll look into these.

> +#cdef object _encode
> +_encode = binascii.b2a_base64
> +
> 
> ^- I don't think cdef object gets you anything here, but commenting it 
> out gets
> you even less.

Heh. I think I know why I did that. It will buy less name lookups if it
works.

> ...
> 
> +
> +from struct import pack
> +cdef _pack_stat(stat_value):
> 
> 
> ^- This seems like a place, that if you could determine you had
> _walkdirs_win32._Win32Stat or _readdir_pyx._Stat you could access the 
> values as
> C types, rather than converting into python and back down to C.
> 
> I would also be *really* curious to poke a sprintf("%8X.%8X.%8X") in 
> there, instead of
> the base64.encode() and see what we get.

I've profiled that; not significant in the past; overall performance may
be low enough for it to matter now.

Using .pyd files would be nice, but again - minimum pyrex version
dependencies.

> ...
> +cdef _update_entry(self, entry, abspath, stat_value):
> +    """Update the entry based on what is actually on disk.
> +
> +    :param entry: This is the dirblock entry for the file in question.
> +    :param abspath: The path on disk for this file.
> +    :param stat_value: (optional) if we already have done a stat on the
> +        file, re-use it.
> +
> +    :return: The sha1 hexdigest of the file (40 bytes) or link target 
> of a
> +            symlink.
> +    """
> 
> ^- 'stat_value' isn't optional anymore. We moved that code up the stack, 
> and
> just forgot to update the docstring. We need to update the "def 
> update_entry"
> docstring as well.

Thanks for noting that, will do while I'm here.

> +    # TODO - require pyrex 0.8, then use a pyd file to define access to 
> the _st
> +    # mode of the compiled stat objects.
> 
> ^- Do you really mean pyrex 0.8? That seems like a perfectly reasonable
> request. If you mean 0.9.8 that might be more difficult.

Ah yes, it was 0.9.8.

> +    cdef int minikind, saved_minikind
> +    cdef void * details
> +    # pyrex 0.9.7 would allow cdef list details_list, and direct access 
> rather
> +    # than PyList_GetItem_void_void below
> 
> I didn't realize this. Is that pyrex 0.9.7, or cython?

pyrex 0.9.7, one of the cython improvements.

> Regardless, 'details' is a Tuple, not a List object.
> 
> Also "cdef _update_entry(self, ...)" since _update_entry is no longer a 
> member
> of DirState, I think it might be more obvious to use 
> "_update_entry(state,
> ...)".

Do you mean, move _update_entry from the iterator class to a top level
method ? Or just that it has a confusion parameter name?

> +cdef char _minikind_from_string(object string):
> +    """Convert a python string to a char."""
> +    return PyString_AsString(string)[0]
> +
> 
> ^- This is a little unsafe in the case that 'string' is an empty string.
> Perhaps a quick "if PyString_Size(string) > 0" ?

Well, its unsafe in a bunch of ways; I'm trusting the rest of our code
here. To be _safe_ it needs a check-exact call, and a size call.

> ...
> 
> +cdef object _minikind_to_kind(char minikind):
> +    """Create a string kind for minikind."""
> +    cdef char _minikind[1]
> +    if minikind == c'f':
> +        return _kind_file
> +    elif minikind == c'd':
> +        return _kind_directory
> +    elif minikind == c'a':
> +        return _kind_absent
> +    elif minikind == c'r':
> +        return _kind_relocated
> +    elif minikind == c'l':
> +        return _kind_symlink
> +    elif minikind == c't':
> +        return _kind_tree_reference
> +    _minikind[0] = minikind
> +    raise KeyError(PyString_FromStringAndSize(_minikind, 1))
> 
> ^- Why do you use the "_minikind[0]" workaround, rather than &minikind?

pyrex preferred it IIRC. Anyhow I think its clearer like this.

> I'm skipping the big ProcessEntryC code block for now, I'll try to get 
> back to
> it later.
> 
> === modified file 'bzrlib/delta.py'
> --- bzrlib/delta.py     2008-04-24 07:22:53 +0000
> +++ bzrlib/delta.py     2008-09-15 03:23:53 +0000
> @@ -241,7 +241,7 @@
>                                     (executable[0] != executable[1])))
>           elif kind[0] != kind[1]:
>               delta.kind_changed.append((path[1], file_id, kind[0], 
> kind[1]))
> -        elif content_change is True or executable[0] != executable[1]:
> +        elif content_change or executable[0] != executable[1]:
>               delta.modified.append((path[1], file_id, kind[1],
>                                      content_change,
>                                      (executable[0] != executable[1])))
> 
> 
> ^- any comment on why you had to do this?

yes, I'm returning 0 and 1, not False and True; so 'is True' does not
work anymore. This was purely performance based.

> ...
> 
> === modified file 'bzrlib/dirstate.py'
> --- bzrlib/dirstate.py  2008-07-30 08:55:10 +0000
> +++ bzrlib/dirstate.py  2008-09-18 06:36:28 +0000
> @@ -211,6 +211,7 @@
>   import time
>   import zlib
> 
> +import bzrlib
>   from bzrlib import (
> 
> ^- As near as I can tell, you do the import here, so that later you can 
> do:
> 
>      def iter_changes(self):
>          """Iterate over the changes."""
>          utf8_decode = cache_utf8._utf8_decode
> ==>     cmp_by_dirs = bzrlib.dirstate.cmp_by_dirs
>          _process_entry = self._process_entry
> 
> It seems like it would be a lot cleaner to just use a different variable 
> name.
> Like "_cmp_by_dirs = cmp_by_dirs".

Sure, I can do that.

> ...
> +from bzrlib.osutils import pathjoin, splitpath
> 
> Similarly, I find it better to use "pathjoin = osutils.pathjoin" when 
> you want
> it as a simple function rather than a module lookup.

A lot of this code was moved; its easier I find to review moved code
when its not willy-nilly changed; these imports were like that in the
source file.

> ...
> 
> @@ -2774,6 +2738,709 @@
>               raise errors.ObjectNotLocked(self)
> 
> 
> +def py_update_entry(self, entry, abspath, stat_value,
> +                 _stat_to_minikind=DirState._stat_to_minikind,
> +                 _pack_stat=pack_stat):
> 
> ^- Shouldn't this be moved into _dirstate_helpers_py.py ?
> 
> Similarly for:
> +class ProcessEntryPython(object):

Oh, I guess. Not playing with the idiom quite-right, sorry.

> === modified file 'bzrlib/tests/test__dirstate_helpers.py'
> 
> ...
> 
> +
> +class TestUpdateEntry(test_dirstate.TestCaseWithDirState):
> 
> 
> I'm assuming this is mostly just a Cut & Paste from the test_dirstate 
> tests?

Its a move from unparameterised to parameterised test, thats all. The
only change is to use the right function.

> I'll also note that you don't seem to have direct tests for the other 
> helpers
> like pack_stat and ProcessEntry[C/Py]. I don't know that we have to, but 
> I
> usually feel better having implementation tests to make sure that the
> Pyrex code performs identically to the Python code.

ProcessEntry is directly tested via InterTree:

> I suppose you are testing it indirectly with iter_changes and:

s/indirect/direct/ :)


> As mentioned, I haven't gone over the ProcessEntryX classes yet, but I 
> figured
> I could at least start the discussion.
> 
> Overall, I think the basic refactoring you've done is good. We still 
> need to
> figure out how to get everything to compile on all platforms, etc.

I'd really appreciate a windows using dev to step up and help port it.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080923/5bbb9cab/attachment.pgp 


More information about the bazaar mailing list