Autopack over dumb transports causes excess use of bandwidth.
John Arbash Meinel
john at arbash-meinel.com
Wed Sep 9 15:43:37 BST 2009
Gary van der Merwe wrote:
> I've been using bzr recently for managing websites. Autopack is really
> hurting me like this.
>
> The histories for these sites generally look like this:
>
> Rev 1 - Initial commit of existing site. (for one site, the example I'll
> use, this is about 20MB)
> Rev 2 - Small change
> Rev 3 - Small change
> Rev 4 - Small change
> Rev 5 - Small change
> etc.
>
> Let's say I push at rev 2 (via ftp). 20MB gets uploaded.
>
> Commit some more small revs...
>
> Push. Autopack runs, which downloads 20mb, and then uploads 20mb+1. <-
> Unnecessary.
So this would probably happen, given that you have only 10 revs, and we
don't sort packs by size, only by number-of-revisions count.
>
> Pull on computer B - Downloads 20MB+1
>
> Commit some more small revs on computer A, Push. Autopack runs, which
> downloads 20mb+1, and then uploads 20mb+2. <- Unnecessary.
This seems wrong. Once a pack file has been 'autopacked' it should then
contain 10 revs, and not be scheduled for repacking until you have 100 revs.
>
> Pull on computer B - Downloads 20MB+2 <- Unnecessary.
>
This should also not really be happening, but it would depend on some
specifics about how the groups are laid out, etc. I'm guessing you
aren't positive about the numbers other than knowing that they are
"bigger than I would like".
>
> Repos are in 2a format, using bzr 1.7
>
> I don't know if this is a bug, or a user problem, but it is driving me crazy.
>
> Can any one suggest how I can avoid this excess use of bandwidth?
>
> Regards
>
> Gary
I realize your case is extra sensitive, because you have bandwidth caps
and fees based on amount transferred.
I'll try to break down what *I* think is happening.
1) Commit 1 includes 20MB of raw content.
2) All commits past this point have a trivial amount of changes to that
content.
3) At commit 10, we will recombine the 10 packs, creating one
20MB+epsilon pack. At this point, all of the content for commits 2-9
will be mixed with the data from the original commit. The actual layout
depends on how the texts get grouped. However, given the short
history, I would guess the groups are probably something like:
[(f1, r1), (f2, r1), (f3, r1), (f4, r10), (f4, r9), (f4, r8), (f4, r7), ...]
Note that in *one* group we have the content for several different
files, as well as the content across the history for a single file. The
max size for a mixed group like this is generally 2MB. (For
single-file-content we grow the group to 4MB.)
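To make that layout a bit more concrete, here is a toy sketch of how
texts end up sharing groups. The sizes and the exact rules here are
invented for illustration; the real insertion logic lives in
bzrlib/groupcompress.py and is quoted further down:

# Toy model only: a group that mixes file ids is cut off around 2MB,
# while a run of content from a single file can grow the group to
# roughly 4MB.
MIXED_CAP = 2 * 1024 * 1024
SINGLE_CAP = 4 * 1024 * 1024

def build_groups(texts):
    # texts is a list of (file_id, rev_id, size_in_bytes), in insertion order
    groups = []
    current, current_size, last_file = [], 0, None
    for file_id, rev_id, size in texts:
        cap = SINGLE_CAP if file_id == last_file else MIXED_CAP
        if current and current_size + size > cap:
            groups.append(current)
            current, current_size = [], 0
        current.append((file_id, rev_id))
        current_size += size
        last_file = file_id
    if current:
        groups.append(current)
    return groups

Feed it the 20MB initial texts followed by the small later revisions and
you will see the small texts land in the same (large) groups as the tail
of the import.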
4) When computer B wants to get the new content for revisions 9 & 10, it
has no choice but to grab whole groups (over a dumb transport). So when
it tries to grab (f4, r10) and (f4, r9) it grabs the whole 2MB group,
and then extracts just those texts and inserts them into the local
repository.
5) At commit 20, autopack should not touch the current size-10 pack, and
only repack the new data. This should hold until you hit 100 commits,
when we will rebuild all packs into a single size-100 pack.
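For reference, the revision-count heuristic behaves roughly like the
sketch below. This is my recollection of the logic in
bzrlib/repofmt/pack_repo.py, not the exact code:

def max_pack_count(total_revisions):
    # Autopack allows roughly "sum of the decimal digits" packs: 9
    # revisions may live in up to 9 packs, 10 revisions collapse into 1
    # pack, 11 into 2, and only at 100 revisions does everything get
    # rebuilt into a single size-100 pack again.
    if not total_revisions:
        return 1
    return sum(int(digit) for digit in str(total_revisions))

# max_pack_count(10) == 1, max_pack_count(20) == 2, max_pack_count(100) == 1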
So what are the answers for you personally...
1) Don't use dumb transports (e.g. push over bzr+ssh:// rather than
ftp://). This may not be an option, but the smart server already knows
how to take a group and break it apart when only some of the content is
requested. (e.g. fetching (f4, r10) and (f4, r9) will create a new group
'on-the-fly' and send it over the wire, rather than transmitting the
entire 2MB group.)
The smart server *also* knows how to do autopacking locally. So when the
target repository needs to be repacked, your local client doesn't have
to download anything.
2) Poke at the internals of bzrlib/groupcompress.py and change some of
those constants. It is unfortunate that we didn't make them globals, so
they could be changed by a plugin. I would certainly approve of a patch
that changed that.
The line you are looking for is:
elif (prefix is not None and prefix != last_prefix
      and end_point > 2*1024*1024):
However, looking at it, I might actually be wrong about what I said.
That code says, "If I add a new key which is from a different file, and
it pushes me over 2MB, pop it off and start a new group." Which means
that a mixed-content group can actually grow to 4MB as long as the last
file inserted has a lot of content.
Anyway, don't worry about it too much, but the 4*1024*1024 and
2*1024*1024 are the key bits that control the size of groups. If you set
those to something smaller:
a) You will probably get slightly worse compression (your 20MB will
become 25MB or so).
b) The minimum size transferred over dumb requests will go down, so it
will be more likely that the little bits you need will be in
independent groups.
That won't fix autopack's upload and download, but it will help computer B.
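If someone did want to write that patch, I imagine it would look
something like the sketch below. The names are invented here, and the
surrounding elif chain is from memory, so treat it as a sketch rather
than a diff:

# Hypothetical module-level knobs for bzrlib/groupcompress.py. These
# names do not exist today; the 4*1024*1024 and 2*1024*1024 values are
# hard-coded literals in the insertion loop.
_MAX_GROUP_SIZE = 4 * 1024 * 1024        # cap for same-file content runs
_MAX_MIXED_GROUP_SIZE = 2 * 1024 * 1024  # cap once the group mixes file ids

# ...and in the insertion loop, roughly:
#     elif end_point > _MAX_GROUP_SIZE:
#         start_new_block = True
#     elif (prefix is not None and prefix != last_prefix
#           and end_point > _MAX_MIXED_GROUP_SIZE):
#         start_new_block = True

# A plugin (or a local edit) could then shrink the groups, for example:
import bzrlib.groupcompress as groupcompress
groupcompress._MAX_GROUP_SIZE = 512 * 1024
groupcompress._MAX_MIXED_GROUP_SIZE = 256 * 1024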
3) Look into teaching autopack to use local content if possible.
This is a harder sell, and more work, but a bigger potential win for
you. Basically, at 'push needs to autopack time' we wouldn't have to
re-download all that 20MB that we already have locally. It is orthogonal
to (2), in that it wouldn't change the final content on the remote site.
It also doesn't change how much you actually upload, but it would reduce
your download a bit.
4) Teach autopack to deal in content size, rather than 'number of
revisions'. At the moment, autopack considers every commit to be
approximately the same size, which is generally true in 'steady state',
but ignores the fact that the initial commit is often an import which is >>>
every other commit. (In the case of MySQL, I believe this is actually
commit 3 or so, because of how "bk init" worked back when they started.)
This is also a bit harder than 1 or 2. The main problem is that
number-of-keys is cheap to determine (it is in the header of every btree
index), but bytes-in-the-pack is not. We only have that information by
reading all of the indexes for a given pack file and finding the largest
reference. (a is at 100, b is at 200, c is at 300, pack must be >300 bytes.)
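A rough sketch of what that scan could look like. I'm assuming here that
each index value starts with an "offset length" pair of decimal fields,
which you would want to verify against the real index classes:

# Sketch only: walk every index belonging to a pack and find the largest
# byte reference, which gives a lower bound on the pack's size.
def estimate_pack_size(indices):
    largest = 0
    for index in indices:
        for entry in index.iter_all_entries():
            value = entry[2]   # entries look like (index, key, value[, refs])
            offset, length = value.split()[:2]
            end = int(offset) + int(length)
            if end > largest:
                largest = end
    return largest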
Most likely, that object will be a text key, because of the standard
order of insertions (revs, inventories, chks, texts), though there is
nothing that requires it. Conversions actually fetch as (texts,
inventories, chks, revs) because that was the order required by 'knit'
repositories.
I suppose we could look at modifying the btree header to have a "last
reference" sort of thing, but there are some layering violations there.
(btrees don't really know what the 'value' field means, so you'd have
to pass in a callable to evaluate it, etc.)
5) Cheat.
Rather than just doing "bzr init; bzr commit; bzr push", do something like:
bzr init
bzr commit -m "initial import"
for i in `seq 100`; do bzr commit --unchanged -m "padding"; done
bzr pack
bzr push
bzr uncommit -r 1
bzr push --overwrite
This will upload 100 'fake' revisions, quite well packed, into the
remote site, and then remove them from your ancestry. When the next
autopack triggers, it won't try to repack that 100-revision file.
(At least, not until you get to 1000 revisions, though if you wanted,
you could use 1000 revs to pad your repository.)
It is hackish, but it fits exactly what you want *today*, which is to
keep autopack from touching that initial commit.
Note that, again, your repository will be larger than optimal, as all of
your small changes will be expanded into fulltexts in the new pack files.
John
=:->