[Bug 737234] Re: too much data transferred making a new stacked branch

Jelmer Vernooij 737234 at bugs.launchpad.net
Wed Jun 8 09:15:15 UTC 2011


** Also affects: bzr (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: bzr (Ubuntu)
       Status: New => Fix Released

** Also affects: bzr (Ubuntu Natty)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to bzr in Ubuntu.
https://bugs.launchpad.net/bugs/737234

Title:
  too much data transferred making a new stacked branch

Status in Bazaar Version Control System:
  Fix Released
Status in Bazaar 2.3 series:
  Fix Released
Status in “bzr” package in Ubuntu:
  Fix Released
Status in “bzr” source package in Natty:
  In Progress

Bug description:
  In thread "Linaro bzr feedback" John writes:

  Note, I just did 'bzr branch lp:gcc-linaro', and it transferred about
  500MB, about 457MB on disk. (Not bad considering lp:emacs transferred
  400-500MB and was only 200MB on disk.)

  I then ran 'bzr serve' and 'bzr branch --stacked bzr://localhost:...'.
  What was scary was:

  8141442kB 24128kB/s / Finding Revisions
  ...
  > Grepping the .bzr.log file in question, I do, indeed see about 8.1GB of
  > data transferred before we read the first .tix.
  > If my grep fu is strong, then we only read 30MB of .cix data. Which
  > leaves us with 8GB of .pack content, or actual CHK page content.

  This is a change that drops the 8GB down to 150MB:

  === modified file 'bzrlib/inventory.py'
  --- bzrlib/inventory.py 2010-09-14 13:12:20 +0000
  +++ bzrlib/inventory.py 2011-03-17 15:38:40 +0000
  @@ -736,6 +736,13 @@
              specific_file_ids = set(specific_file_ids)
          # TODO? Perhaps this should return the from_dir so that the root is
          # yielded? or maybe an option?
  +        if from_dir is None and specific_file_ids is None:
  +            # They are iterating from the root, assume they are iterating
  +            # everything and preload all file_ids into the
  +            # _fileid_to_entry_cache. This doesn't build things into .children
  +            # for each directory, but that will happen later.
  +            for _ in self.iter_just_entries():
  +                continue
          if from_dir is None:
              if self.root is None:
                  return

  
  Basically, iter_entries_by_dir yields entries in a specific order that
  doesn't match the order they are stored in the repository.
  'iter_just_entries' loads everything in repository order and puts it
  into CHKInventory._fileid_to_entry_cache, and the rest of the requests
  are then fed from there.
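
  To make that idiom concrete, here is a toy sketch (illustrative only,
  not bzrlib code; everything except the iter_just_entries and
  _fileid_to_entry_cache names is made up). One cheap pass in repository
  order fills the per-entry cache, so the by-directory walk never falls
  back to per-entry fetches:

  class ToyInventory(object):
      """Toy model of the preload idiom in the patch above."""

      def __init__(self, repo_entries):
          # (file_id, entry) pairs, in repository storage order.
          self._repo_entries = repo_entries
          self._fileid_to_entry_cache = {}
          self.slow_fetches = 0

      def iter_just_entries(self):
          # One cheap pass in repository order, filling the cache.
          for file_id, entry in self._repo_entries:
              self._fileid_to_entry_cache[file_id] = entry
              yield entry

      def _get_entry(self, file_id):
          # Each miss here stands in for a separate, expensive trip
          # to the repository.
          if file_id not in self._fileid_to_entry_cache:
              self.slow_fetches += 1
              self._fileid_to_entry_cache[file_id] = \
                  dict(self._repo_entries)[file_id]
          return self._fileid_to_entry_cache[file_id]

      def iter_entries_by_dir(self, dir_order):
          # Preload first, exactly as the patch does; the walk below
          # is then served entirely from the cache.
          for _ in self.iter_just_entries():
              continue
          for file_id in dir_order:
              yield file_id, self._get_entry(file_id)

  inv = ToyInventory([('f1', 'a'), ('f2', 'b'), ('f3', 'c')])
  list(inv.iter_entries_by_dir(['f3', 'f1', 'f2']))
  print(inv.slow_fetches)  # -> 0: the walk never hit a slow fetch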

  We don't usually notice this effect because of the
  chk_map._thread_caches.page_cache and the GCCHKRepository block cache.
  Once the inventory is too large to fit in those byte caches, we have
  to load it from the repository again.
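
  A toy model of that effect (hypothetical page and cache sizes, not the
  real bzrlib structures): entries live in fixed-size pages behind a
  small LRU cache, so a traversal that keeps jumping between pages
  re-reads pages that were just evicted, while repository order reads
  each page exactly once:

  from collections import OrderedDict

  PAGE_SIZE = 4      # entries per page
  CACHE_PAGES = 2    # cache deliberately smaller than the inventory

  class PagedStore(object):
      def __init__(self, file_ids):
          # Pages hold entries in repository order.
          self.pages = [file_ids[i:i + PAGE_SIZE]
                        for i in range(0, len(file_ids), PAGE_SIZE)]
          self.page_of = dict((f, i // PAGE_SIZE)
                              for i, f in enumerate(file_ids))
          self.cache = OrderedDict()   # page number -> page, LRU order
          self.page_reads = 0

      def get(self, file_id):
          page_no = self.page_of[file_id]
          if page_no in self.cache:
              self.cache[page_no] = self.cache.pop(page_no)  # refresh LRU
          else:
              self.page_reads += 1     # stands in for re-reading pack data
              if len(self.cache) >= CACHE_PAGES:
                  self.cache.popitem(last=False)             # evict oldest
              self.cache[page_no] = self.pages[page_no]
          return file_id

  file_ids = ['f%02d' % i for i in range(16)]

  store = PagedStore(file_ids)
  for f in file_ids:                   # repository order
      store.get(f)
  print(store.page_reads)              # -> 4: each page read once

  store = PagedStore(file_ids)
  jumpy = [file_ids[i + 4 * j] for i in range(4) for j in range(4)]
  for f in jumpy:                      # cycles across all four pages
      store.get(f)
  print(store.page_reads)              # -> 16: every lookup is a miss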

  I just checked, and this also has a large effect for local
  repositories.

  'time list(rev_tree.inventory.iter_entries_by_dir())'
  drops from 4m30s down to 13s with the patch.
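
  For reference, a rough sketch of how such a timing could be taken
  against a local branch (the path is a placeholder; the bzrlib calls
  are the standard ones used above):

  import time
  from bzrlib.branch import Branch

  branch = Branch.open('/path/to/local/branch')  # placeholder path
  branch.lock_read()
  try:
      rev_tree = branch.repository.revision_tree(branch.last_revision())
      start = time.time()
      entries = list(rev_tree.inventory.iter_entries_by_dir())
      print('%d entries in %.1fs' % (len(entries), time.time() - start))
  finally:
      branch.unlock()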

  So we should certainly think about other ramifications, but in the
  short term it looks quite good.