'bzr status' stats each file multiple times

Sun Dec 4 20:24:23 GMT 2005

Michael Ellerman wrote:
> On Sun, 4 Dec 2005 13:41, John A Meinel wrote:
>> Michael Ellerman wrote:
>>> On Sun, 4 Dec 2005 08:43, John A Meinel wrote:
>>>> The idea of calling hashcache.scan() early is that you can stat in
>>>> inode order, which according to git should be faster. And who would
>>>> know better than kernel hackers? :)
>>> OK, that sounds reasonable, except that we then re-stat everything. So
>>> that's two stats for every file in the cache right there, which sounds
>>> like a false optimisation to me.
>>>
>>> If we're interested in being really fast, then we should do the entire
>>> status operation in inode order - we can sort the output if we want to.
>>> But step one should be to make sure status only stats each file exactly
>>> once.
>>>
>>>> I think we should have a timeout value, so that if I stat'd the file
>>>> within the last X seconds, I assume the file hasn't changed.
>>>> We can even go one better and do a double check: if the file's
>>>> mtime is older than Y seconds, and I have stat'd the file within X
>>>> seconds, don't stat again.
>>> Hmm, that sounds like a kludge to me - I think we can improve on the
>>> current times before we need to resort to something like that.
>> How is it a kludge? Isn't it exactly what you just requested: stat
>> everything one time?
> 
> I think it's a kludge because you're potentially losing information, i.e.
> that a file has changed recently, in order to gain performance. Why 5
> seconds? Why not 1, 10, 60, 120?
>
> What I was suggesting is that we should work on the higher level code, e.g.
> compare_trees(), to make sure it only requires one stat per file - and
> preferably in inode order.

Well, there are lots of places that need information about a file
(whether it exists, its size, whether it is executable, etc.). It sounds
like you are saying that "compare_trees" should keep a cache (possibly as
part of each file entry) of what it has stat'd.

I'm saying we already have a location which has the stat results, along
with the sha1 hash for each file (hashcache). Why cache the same
information twice in two different locations?
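
To make that concrete, here is a minimal sketch of the idea, not the
actual hashcache code: one cache holds the stat result for each path, a
scan pass fills it so every file is stat'd once, and later lookups reuse
the cached result. The names StatCache, scan, get and stat_calls are all
hypothetical. Inode order is only available for files the cache has seen
before, so a first scan simply falls back to path order.

```python
import os


class StatCache:
    """Hypothetical single home for stat results (not the real hashcache API)."""

    def __init__(self):
        self._stats = {}      # path -> os.stat_result
        self.stat_calls = 0   # counts lstat attempts, to spot repeated stats

    def _lstat(self, path):
        self.stat_calls += 1
        return os.lstat(path)

    def scan(self, paths):
        """Stat every path once, visiting already-known files in inode order."""
        def known_inode(path):
            cached = self._stats.get(path)
            # Paths never seen before have no known inode; visit them last.
            return (cached is None, cached.st_ino if cached else 0, path)

        for path in sorted(paths, key=known_inode):
            try:
                self._stats[path] = self._lstat(path)
            except FileNotFoundError:
                # The file is gone; forget any stale entry.
                self._stats.pop(path, None)

    def get(self, path):
        """Return the cached result, stat'ing only on a cache miss."""
        if path not in self._stats:
            self._stats[path] = self._lstat(path)
        return self._stats[path]
```

Anything that needs the size or the executable bit would then call
get() instead of stat'ing the file itself.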

Yes, 5 seconds is an arbitrary timeout period. I'm thinking more of the
fact that bzrlib can be used as a library by a process that stays open
for a long time. For a small tree, that means you might call
"compare_trees" several times in quick succession. By caching the stat
results, you get a speedup across those invocations without sacrificing
accuracy over the long haul.

The timeout is a simple number which users of the library could set. I
think the bzr command line could certainly set it very high, since it
can treat the tree as a snapshot from the time the program started.
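
Roughly what I have in mind, as a sketch only (nothing like this exists
in bzrlib today; TimedStatCache and its parameters are made-up names).
"timeout" is the X from earlier in the thread and "min_age" is the Y: a
cached result is reused only if it was taken within the last X seconds
and the file's mtime is at least Y seconds old.

```python
import os
import time


class TimedStatCache:
    """Hypothetical stat cache with a user-settable timeout."""

    def __init__(self, timeout=5.0, min_age=0.0, clock=time.time):
        self.timeout = timeout   # X: how long a cached stat is trusted
        self.min_age = min_age   # Y: how old the mtime must be to trust it
        self._clock = clock      # injectable so the behaviour can be tested
        self._entries = {}       # path -> (time of stat, os.stat_result)
        self.stat_calls = 0      # counts successful lstat calls

    def stat(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None:
            checked_at, result = entry
            recently_checked = now - checked_at < self.timeout
            settled = now - result.st_mtime >= self.min_age
            if recently_checked and settled:
                return result
        # No usable entry: stat the file and remember when we did so.
        result = os.lstat(path)
        self.stat_calls += 1
        self._entries[path] = (now, result)
        return result
```

A long-running library user would keep the default few seconds, while
the command line could pass timeout=float("inf") so that nothing is
stat'd twice within one run.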

John
=:->

> 
> cheers
> 
