[MERGE] Reading pack-names triggers 3 reads

Robert Collins robertc at robertcollins.net
Mon Nov 10 23:49:21 GMT 2008


On Mon, 2008-11-10 at 16:05 -0600, John Arbash Meinel wrote:


> >  - I don't like the 'read full if we don't know size' change - I often
> >    work with bzrlib on big indices - a hundred-plus megabytes - and
> >    this makes the API stop being memory-capped, so it becomes easier
> >    to get wrong. I appreciate the need to reduce round trips here, but
> >    I think simply changing it like you have will harm things. I'd like
> >    to see it require an explicit parameter or setting rather than be
> >    unconditional.
> 
> Well, the old code was wrong whenever the number of pages was > 1. It
> read a single page at most, and then set the recorded size of the
> entire index to the size of that single read.
> 
> We only run into this if self._size is None, and that only happens for
> the pack-names file.

In bzrlib itself.

> If you have other use cases for "self._size = None", I'm happy to
> accommodate.

I do, as I said above.

>  But the existing code was wrong, and doing "transport.get()"
> over HTTP downloads the whole file anyway. So I figured it was far
> better to actually get_bytes() and make use of it.
> 
> Again, I'm open to suggestions.

That's fine for pack-names; I concur there - I suggested having a flag
or option on the index to trigger this, something like the sketch below.
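
A minimal sketch of that opt-in (the names here - 'Index',
'allow_full_read', '_read_root' - are illustrative, not bzrlib's
actual API):

    _PAGE_SIZE = 4096

    class Index(object):

        def __init__(self, transport, name, size=None,
                     allow_full_read=False):
            self._transport = transport
            self._name = name
            self._size = size
            # Explicit opt-in: only slurp the whole file when the
            # caller knows it is small (e.g. pack-names).
            self._allow_full_read = allow_full_read

        def _read_root(self):
            if self._size is not None:
                # Size known: a ranged readv() keeps memory capped.
                return list(self._transport.readv(
                    self._name, [(0, min(_PAGE_SIZE, self._size))]))
            if not self._allow_full_read:
                raise ValueError(
                    'size of %r unknown and full reads not enabled'
                    % (self._name,))
            # Caller opted in: one round trip, whole file in memory,
            # and self._size ends up correct rather than capped at
            # one page.
            data = self._transport.get_bytes(self._name)
            self._size = len(data)
            return [(0, data)]

That keeps the default memory-capped, while letting the pack-names
caller get the single-round-trip behaviour deliberately.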

> > 
> >> There was also a subtle bug in the code if pack-names ever became larger
> >> than a single page (which is really unlikely, but theoretically possible).
> > 
> > I'm not sure where that bug was - could you enlarge on it?
> > 
> > bb:resubmit
> > 
> > -Rob
> 
> To be more explicit, the old code did:
> 
> if index == 0:
>     # Root node - special case
>     if self._size:
>         size = min(_PAGE_SIZE, self._size)
>     else:
>         stream = self._transport.get(self._name)
>         start = stream.read(_PAGE_SIZE)
>         # Avoid doing this again
>         self._size = len(start)
>         size = min(_PAGE_SIZE, self._size)
> 
> The important bits are:
> 
>         stream = self._transport.get(self._name)
>         ^- Get a file handle; for Local/SFTP this really is just a
>            file handle, but for HTTP/FTP it downloads the whole file
>            into a StringIO.
> 
>         start = stream.read(_PAGE_SIZE)
>         ^- Read at most _PAGE_SIZE bytes.
> 
>         # Avoid doing this again
>         self._size = len(start)
>         ^- Set the size of the index to the number of bytes read.
>            If the file is <= 1 page this is correct; otherwise it
>            claims the whole file is 1 page long (because the read
>            above was capped at _PAGE_SIZE).
> 
>         size = min(_PAGE_SIZE, self._size)

Yah, I can see the bug; interestingly it only matters because we assert
that the offset is no larger than the size of the file (that code isn't
quoted here) - but the index header has enough data to tell us the
total number of pages in the file. I think I'd be inclined to include
the index size in future format revisions, so that this can be done
more cleanly. Perhaps a stat() rather than a get(), to determine the
size.
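
Sketch of the stat() approach ('_get_size' is a made-up helper name;
this assumes the Transport.stat() API, which returns an os.stat-like
result - note that not every transport supports stat(), e.g. plain
HTTP, so a fallback would still be needed there):

    def _get_size(self):
        # One round trip and no content transferred: ask the
        # transport for the file's size instead of downloading it.
        if self._size is None:
            self._size = self._transport.stat(self._name).st_size
        return self._size

The root read then stays memory-capped:

    size = min(_PAGE_SIZE, self._get_size())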

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.