notes/plan for hpss performance work

Thu May 1 06:59:53 BST 2008

Andrew and I have been studying hpss performance using -Dhpss and the
netem hack to simulate a 1000ms rtt network.  Andrew has the full data
and more details but here are some things we've noticed.

 * Generally, RPCs do take around 1.003s, precisely what we'd expect
for this network.  So this shows we're not suffering from bad
buffering at any level: we send one packet, and get one packet back.
(We are not testing with ssh; it's possible problems could occur
there.)  There is one notable exception: writing a stream to a file
using the vfs put or append often takes 3s (3 round trips), even when
the stream is very small.  Probably either we really do accidentally
have round trips within the rpc, or something we're doing is causing
tcp to stall.  The tcp window-opening behaviour is very noticeable as
we send a large pack - it speeds up over time - but there's nothing
much we can do about this.

 * Format 5 branches (with a revision-history file) do a lot of work
to update that file, reading the graph to update it.  I think we
should just start suggesting that users upgrade these old formats.

 * Pushing up pack data is actually pretty fast, because we write
generally only one file and the 1MB buffer keeps even a fat/long pipe
quite full; however creating the indexes takes one round trip each,
and writing the pack names requires taking a lock (see below).  When
pushing a large new branch, performance is quite good to the network
maximum (imposed by the tcp maximum window size).  When there's less
real data to send, the unnecessary roundtrips really hurt.

 * We always create new components empty and unlocked, and then later
lock them and put content into them.  This is a straightforward api
and does mean that if the initial push fails you will generally get a
correct though empty branch.  We should possibly create the new object
locked, and avoid reading things back if we know it's empty.  A single
rpc to create bzrdirs would help.

 * Perhaps surprisingly, graph operations are not showing up as
dominant, at least in the cases we did here: pushing just one
revision, and pushing all of history.  If you have two branches with
substantially different history on both sides it may be more
important, but there is plenty of low-hanging fruit before getting
into that.  We spend a lot of time in graph operations for diverged
branches but it seems totally unnecessary, as we should already know
they're diverged.

 * Taking and releasing a lockdir at the vfs layer takes about 9
roundtrips, which is pretty high, and we can get some big wins by
avoiding it - either by using a lock/unlock rpc, or by making the lock
implicit in rpcs like "add a pack" or "set last revision".  It looks
like we need to clean up LockableFiles' position in locking, and then
allow both the Remote and vfs objects to share a single .lock, which
will work over rpc.  As a first step, we could make sure to take the
lock over rpc, then just allow the vfs object to at least observe it's
already held.

 * Repacking is currently done over vfs and is pretty slow when it
happens; we haven't measured this yet.  We probably want it to either
be totally automatic on the server, or perhaps to have it happen on
request from the client.

 * There are some in-detail inefficiencies, where the trace makes it
clear that we're reading back data that we either should already know
or don't need to use, or where we're re-opening objects that we should
already have open.  (For example, a push to a diverged branch keeps
running long after it should know that it won't succeed.)

 * -Dhpss is really good; we should look out for more opportunities to
add tools that make performance easier to improve.  I hope we can add
some tests that prohibit silly behaviour, either during the test suite
or when turned on by a -D flag.  At the moment there is an option that
bans all vfs operations, but we could change that to a filter of
disallowed operations, so we can trap access to locks over vfs for
instance.
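
As a rough sketch of the filter idea in the last point: nothing below
is bzrlib API, and the class, exception and pattern names are all
hypothetical.  It only illustrates matching a vfs call name and path
against a list of disallowed glob patterns, so that lock access over
vfs can be trapped while other vfs calls still pass.

```python
import fnmatch


class DisallowedVfsOperation(Exception):
    """Raised when a vfs call matches a banned pattern."""


class VfsOperationFilter(object):
    """Hypothetical filter: bans (method, path) pairs by glob pattern."""

    def __init__(self, banned):
        # banned: list of (method_pattern, path_pattern) glob pairs.
        self.banned = list(banned)
        self.seen = []

    def check(self, method, path):
        """Record the call, and raise if it matches a banned pattern."""
        self.seen.append((method, path))
        for method_pat, path_pat in self.banned:
            if (fnmatch.fnmatchcase(method, method_pat)
                    and fnmatch.fnmatchcase(path, path_pat)):
                raise DisallowedVfsOperation(
                    "%s(%r) is banned by (%s, %s)"
                    % (method, path, method_pat, path_pat))


# Trap any vfs access to a lock directory, but allow everything else.
lock_filter = VfsOperationFilter([("*", "*/lock"), ("*", "*/lock/*")])
```

A test or a -D flag could call check() from wherever vfs requests are
issued; an empty pattern list allows everything, and ("*", "*") gives
the current ban-everything behaviour.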

Andrew and I are going to first look at the particular case of pushing
just one new revision, which on this really slow network takes 2
minutes, starting specifically with the way it locks and unlocks the
Branch repeatedly.
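
For a sense of scale, here is a back-of-envelope sketch using only
figures from these notes (1000ms rtt, about 9 round trips per vfs
lock/unlock, 2 minutes for the one-revision push).  The function names
are made up for illustration, and the three lock cycles in the last
line are a hypothetical count, not a measurement.

```python
RTT_SECONDS = 1.0           # simulated rtt from the netem setup
LOCK_CYCLE_ROUNDTRIPS = 9   # approx. vfs lock + unlock, per the notes


def roundtrips_for(elapsed_seconds, rtt=RTT_SECONDS):
    """Upper bound on round trips, if all elapsed time were latency."""
    return int(elapsed_seconds / rtt)


def lock_overhead_seconds(lock_cycles, rtt=RTT_SECONDS):
    """Time spent purely on taking and releasing locks over vfs."""
    return lock_cycles * LOCK_CYCLE_ROUNDTRIPS * rtt


print(roundtrips_for(120))        # 2 minute push: at most about 120
print(lock_overhead_seconds(3))   # hypothetical 3 lock cycles: 27.0s
```

On these assumptions each avoided lock/unlock cycle saves about 9
seconds, which is why the repeated Branch locking looks like a good
first target.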

-- 
Martin <http://launchpad.net/~mbp/>