'bzr status' stats each file multiple times

John A Meinel john at arbash-meinel.com
Sun Dec 4 14:43:54 GMT 2005


Michael Ellerman wrote:
> On Fri, 2 Dec 2005 23:35, John A Meinel wrote:
>> [snip]
>> I've noticed that "bzr status" on a clean tree seems to take quite a
>> while.
>> [snip] 
> 
> Me too.
> 
> On my tree, it seems to recompute the sha1 for about a third of
> the files every time. It then caches them, but never writes the hashcache?
> I also don't see the point of calling hashcache.scan() prematurely?

The idea of calling hashcache.scan() early is that you can stat the
files in inode order, which according to git should be faster. And who
would know better than kernel hackers? :)
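
Roughly the kind of trick I mean (just a sketch in newer Python, not
the actual hashcache code; the function is made up):

   import os

   def stat_in_inode_order(basedir):
       """Stat directory entries sorted by inode number.

       readdir/scandir hands back inode numbers cheaply, and statting
       in inode order tends to touch the disk roughly sequentially,
       which is the trick git reportedly uses for its index refresh.
       """
       entries = sorted(os.scandir(basedir), key=lambda e: e.inode())
       return dict((e.name, os.lstat(e.path)) for e in entries)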

Now, the other issue is that we still re-stat each file when we come
back to it later, just to make sure it hasn't changed in the meantime.

I think we should have a timeout value, so that if I stat'd the file
within the last X seconds, I assume the file hasn't changed.
We can even go one better and do a double check: if the file's mtime is
older than Y seconds, and I have stat'd the file within the last X
seconds, don't stat it again.

I could see using Y=1 day and X=5 seconds: if the file is more than a
day old, we don't really expect it to change frequently, and certainly
not within the last 5 seconds.
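
In hashcache terms the check would be something like this (just a
sketch; the names and constants are placeholders, not the real
HashCache API):

   import time

   FRESH_STAT_WINDOW = 5.0          # X: trust a stat made in the last 5 seconds
   OLD_FILE_THRESHOLD = 24 * 3600.0 # Y: "old" means mtime more than a day ago

   def can_reuse_cached_stat(last_stat_time, cached_mtime, now=None):
       """Return True if we may skip calling stat() again for this file.

       We only skip when both conditions hold: we already stat'd the
       file within the last X seconds, and its mtime says it hasn't
       been touched for at least Y seconds, so it is unlikely to have
       just changed.
       """
       if now is None:
           now = time.time()
       stat_is_fresh = (now - last_stat_time) < FRESH_STAT_WINDOW
       file_is_old = (now - cached_mtime) > OLD_FILE_THRESHOLD
       return stat_is_fresh and file_is_old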

All it really means is that if you start a "bzr status" and something
changes *while* you are running status, it won't show up until the next
status.

"bzr commit" also won't pick up changes that occur while it is
committing. *However* there is a nice property that whenever you commit,
you do compute the sha1 sum from the actual lines that you are
committing. You don't pay any attention to the cache at that point.
Actually, at that point you do a sha1 check multiple times on the same
strings loaded in memory. Specifically at line 574 in inventory.py:

   new_lines = work_tree.get_file(self.file_id).readlines()
   self._add_text_to_weave(new_lines, file_parents, weave_store,
                           transaction)
   self.text_sha1 = sha_strings(new_lines)
   self.text_size = sum(map(len, new_lines))

So we read the text, add it to the weave (which computes a sha1 sum),
and then we compute the sha1 sum again to add it to the inventory entry.
sha1 is relatively cheap (compared with stuff like adding a text to a
weave file), so it doesn't really matter.

> 
> This makes it a bit quicker.

Actually, on my tree, I have much worse behavior: I saw it stat each
file multiple times.
I think what is happening is that we have multiple code paths creating
a working tree, each of which opens a new hashcache, and they can get
out of sync.
I'm going to test creating a weakref dictionary of hashcaches keyed by
path, and see whether the same path is getting opened more than once.
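
Something along these lines for the test (a rough sketch; the registry
and function are mine, only HashCache/read()/mutter are real bzrlib
pieces, and the import paths are assumptions):

   import weakref

   from bzrlib.hashcache import HashCache  # assumed module path
   from bzrlib.trace import mutter         # assumed module path

   # Hypothetical registry: remember which basedirs already have a live
   # HashCache, so we can spot a second one being opened for the same tree.
   _live_hashcaches = weakref.WeakValueDictionary()

   def open_hashcache(basedir):
       if _live_hashcaches.get(basedir) is not None:
           mutter("opening a second HashCache for %s" % basedir)
       hc = HashCache(basedir)
       hc.read()
       _live_hashcaches[basedir] = hc
       return hc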

I would like to refactor everything a little bit anyway.
For one, I would like to see the hash cache be called .bzr/hash-cache
rather than .bzr/stat-cache, since it is no longer a stat cache (though
it sort of is). This can wait for the next branch-format bump, though.

Does anyone have a significant problem with my idea of letting the
hashcache skip re-statting a file within 5 seconds of the last time it
stat'd it? I'm also thinking of making it even longer (10s), since bzr
doesn't really need to know if something changes while it is in the
middle of doing something; it will still stat everything at least once
every time it runs. (I'm not planning on saving the last-stat'ed time
to disk, just keeping it in memory.)

John
=:->



> 
> === modified file 'bzrlib/workingtree.py'
> --- bzrlib/workingtree.py
> +++ bzrlib/workingtree.py
> @@ -202,7 +202,7 @@
>          # given path only.
>          hc = self._hashcache = HashCache(basedir)
>          hc.read()
> -        hc.scan()
> +        #hc.scan()
>  
>          if hc.needs_write:
>              mutter("write hc")
> @@ -908,6 +908,9 @@
>          between multiple working trees, i.e. via shared storage, then we 
>          would probably want to lock both the local tree, and the branch.
>          """
> +        if '_hashcache' in self.__dict__ and self._hashcache.needs_write:
> +            self._hashcache.write()
> +
>          return self.branch.unlock()
>  
>      @needs_write_lock
> 



