BundleReader, Containers, and file IO
Andrew Bennetts
andrew at canonical.com
Thu Oct 25 07:25:52 BST 2007
Hi all, and especially Aaron,
As mentioned in a previous mail, I have written an alternative to
ContainerReader, called ContainerPushParser. ContainerPushParser works much
like the smart protocol decoding logic: it has an “accept_bytes” method which a
caller can use to stuff bytes into it, and a “read_pending_records” method to
retrieve the resulting records[1].
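Roughly, usage looks like this (a sketch only; “source” and
“handle_record” are illustrative stand-ins, not part of the actual API):

    parser = ContainerPushParser()
    while True:
        bytes = source.read(4096)
        if not bytes:
            break
        # feed whatever arrived into the parser...
        parser.accept_bytes(bytes)
        # ...and drain any records it has finished parsing
        for record in parser.read_pending_records():
            handle_record(record)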
So I effectively have the logic for reading containers written twice, in two
slightly different ways. It is possible to reimplement ContainerReader in terms
of ContainerPushParser (but not vice versa), but getting good performance is
non-trivial. Currently the only non-test use of ContainerReader in the tree is
in BundleReader, which only uses the iter_records method.
The bundle code conveniently has benchmarks to tell me if I'm making it worse.
So I'd like a replacement for the existing ContainerReader to be at least as
fast as the current implementation.
The current ContainerReader implementation never over-reads. With a well-formed
container, it will never read even a single byte past the end-marker. This can
matter if the file-like object being read from is actually a pipe; reading more
data than actually exists will cause a hang while the code blocks for data that
never arrives. I've implemented a “ContainerPushParser.bytes_to_read” method
and, on top of it, an iter_records replacement that never over-reads. With this
I can replace the existing ContainerReader, but the bundle benchmarks are mostly
a bit worse, some by up to 10%. I don't think this is acceptable.
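In outline, the no-over-read replacement looks like this (a sketch;
“finished” is a stand-in for however the parser signals that the
end-marker has been consumed):

    def iter_records_exact(read):
        parser = ContainerPushParser()
        while not parser.finished:
            # bytes_to_read returns the most we know is safe to
            # consume without reading past the end-marker
            parser.accept_bytes(read(parser.bytes_to_read()))
            for record in parser.read_pending_records():
                yield record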
However, I'm not sure that for bundles we actually care about the requirement
that we never over-read; as far as I can tell, that code is just used with real
files that simply return '' when there are no more bytes available. In this
case we don't need to limit our reads to how much we are certain is safe to read,
and can simply keep pulling in e.g. 16K chunks (or more if we know we're reading
a large body). The logic is simpler, and much faster. The benchmarks are all
better than bzr.dev, and the “big_tree” ones are *twice* as fast!
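The greedy variant is simply (again a sketch; the chunk size and
termination details are illustrative):

    def iter_records_greedy(read):
        parser = ContainerPushParser()
        while True:
            bytes = read(16 * 1024)
            parser.accept_bytes(bytes)
            for record in parser.read_pending_records():
                yield record
            if not bytes:
                # a real file just returns '' at EOF, so we stop here
                break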
So, is this approach reasonable?
-Andrew.
[1] I suspect passing a “record_parsed” callback may be slightly faster and
more flexible than polling read_pending_records each time, but it's
probably not a big difference. read_pending_records is more convenient for
now.