[MERGE] Reading pack-names triggers 3 reads
Robert Collins
robertc at robertcollins.net
Mon Nov 10 23:49:21 GMT 2008
On Mon, 2008-11-10 at 16:05 -0600, John Arbash Meinel wrote:
> > - I don't like the 'read full if we don't know size' change - I often
> > work with bzrlib on big indices - hundred-plus megabytes - and this makes
> > the API stop being memory-capped - it becomes easier to get wrong.
> > I appreciate the need to reduce round trips here, but I think simply
> > changing it like you have will harm things. I'd like to see
> > it require an explicit parameter or setting rather than be
> > unconditional.
>
> Well, the old code was wrong in the case that the number of pages was >
> 1. It read a single page at most, and set the size of the entire index
> to that size.
>
> We only run into this if self._size is None, and that only happens for
> the pack-names file.
In bzrlib itself.
> If you have other use cases for "self._size = None", I'm happy to
> accommodate.
I do, as I said above.
> But the existing code was wrong, and doing "transport.get()"
> over HTTP downloads the whole file anyway. So I figured it was far
> better to actually get_bytes() and make use of it.
>
> Again, I'm open to suggestions.
That's fine for pack-names; I concur there - I suggested having a flag or
option on the index to trigger this.
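One way such an opt-in could look - a minimal sketch, where all names
(Index, FakeTransport, allow_full_read) are hypothetical stand-ins, not
the real bzrlib API:

```python
_PAGE_SIZE = 4096


class FakeTransport:
    """In-memory stand-in for a bzrlib transport."""

    def __init__(self, files):
        self._files = files

    def get_bytes(self, name):
        return self._files[name]


class Index:
    """Sketch: only fall back to reading the whole file when the
    caller has explicitly opted in, so the default stays memory-capped."""

    def __init__(self, transport, name, size=None, allow_full_read=False):
        self._transport = transport
        self._name = name
        self._size = size
        self._allow_full_read = allow_full_read

    def read_root_page(self):
        if self._size is None:
            if not self._allow_full_read:
                # Refuse unbounded reads unless explicitly enabled.
                raise AssertionError(
                    'size of %s unknown; pass allow_full_read=True'
                    % self._name)
            data = self._transport.get_bytes(self._name)
            self._size = len(data)  # the real size, not one page's worth
            return data[:_PAGE_SIZE]
        # Known size: real code would issue a ranged read here; the
        # slice just keeps the sketch simple.
        return self._transport.get_bytes(self._name)[:min(_PAGE_SIZE,
                                                          self._size)]
```

With the flag off, an index of unknown size fails fast instead of
silently downloading everything; with it on, the full read happens once
and the true size is recorded.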
> >
> >> There was also a subtle bug in the code if pack-names ever became larger
> >> than a single page (which is really unlikely, but theoretically possible).
> >
> > I'm not sure where that bug was - could you enlarge on it?
> >
> > bb:resubmit
> >
> > -Rob
>
> To be more explicit, the old code did:
>
>     if index == 0:
>         # Root node - special case
>         if self._size:
>             size = min(_PAGE_SIZE, self._size)
>         else:
>             stream = self._transport.get(self._name)
>             start = stream.read(_PAGE_SIZE)
>             # Avoid doing this again
>             self._size = len(start)
>             size = min(_PAGE_SIZE, self._size)
>
> The important bits are:
>
> stream = self._transport.get(self._name)
> ^- Get a file handle, for Local/SFTP this is just a file handle,
> for HTTP/FTP this downloads the whole file into a StringIO
>
> start = stream.read(_PAGE_SIZE)
> ^- Read at most _PAGE_SIZE bytes
>
> # Avoid doing this again
> self._size = len(start)
> ^- set the size of the index to the length of the read bytes
> If the file is <= 1 PAGE this is correct, otherwise it
> claims that the whole file is 1 PAGE long (due to the read
> previously being capped at _PAGE_SIZE)
>
> size = min(_PAGE_SIZE, self._size)
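That failure mode is easy to reproduce in isolation - a sketch, where
FakeTransport and buggy_root_read are stand-ins for the quoted logic,
not bzrlib code:

```python
import io

_PAGE_SIZE = 4096


class FakeTransport:
    """Stand-in: get() returns a file-like object, the way the HTTP/FTP
    transports do after downloading the entire file."""

    def __init__(self, files):
        self._files = files

    def get(self, name):
        return io.BytesIO(self._files[name])


def buggy_root_read(transport, name):
    """Reproduce the quoted logic: read one page and record its length
    as the size of the whole index."""
    stream = transport.get(name)
    start = stream.read(_PAGE_SIZE)
    size = len(start)  # wrong whenever the file exceeds one page
    return start, size
```

For a two-page index the recorded size is capped at _PAGE_SIZE, so any
later page would be believed not to exist.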
Yah, I can see the bug; interestingly, it only matters because we assert
that the offset is no larger than the size of the file (that code isn't
quoted here) - but the index header has enough data to tell us the
total number of pages in the file. I think I'd be inclined to include the
index size in future revisions, so that this can be done more cleanly.
Perhaps a stat() rather than a get(), to determine the size.
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.