[merge] Faster 'bzr check' by doing parallel extraction

John A Meinel john at arbash-meinel.com
Mon Dec 19 20:55:54 GMT 2005


It took me a while to figure out what the proper set functions were (and
a while to realize that IntSet really is slower), but I updated
Weave.check() so that it checks all of the revisions in parallel. This is
part of my plan to change "bzr check" so that it checks each weave
independently, rather than extracting each text one at a time.

Anyway, with the new code, Weave.check() sets up an array of sha sums and
an array of inclusions, one entry per revision. It then runs through the
entire weave once and, for each line, updates the sha sums of whatever
revisions are active for that line.
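
The single-pass idea can be sketched roughly as follows. This is a minimal
illustration, not the actual Weave.check() code: the function name
check_all_versions, the toy weave layout (control tuples '{', '}', '[', ']'
mixed with text lines) and the parents list are assumptions made for the
example, and it is written for a current Python rather than python2.4.

```python
import hashlib


def check_all_versions(weave, parents):
    """Return one sha1 hex digest per version, computed in a single pass.

    weave   -- list of text lines (bytes) and control tuples (op, version)
    parents -- parents[v] is the list of parent versions of version v
    Both layouts are assumptions for this sketch.
    """
    # inclusions[v] is v plus all of its ancestors.
    inclusions = []
    for version, version_parents in enumerate(parents):
        included = {version}
        for parent in version_parents:
            included |= inclusions[parent]
        inclusions.append(included)

    shas = [hashlib.sha1() for _ in parents]
    insert_stack = []      # versions whose insertion blocks are open
    delete_set = set()     # versions whose deletion blocks are open

    for item in weave:
        if isinstance(item, tuple):
            op, version = item
            if op == '{':
                insert_stack.append(version)
            elif op == '}':
                insert_stack.pop()
            elif op == '[':
                delete_set.add(version)
            elif op == ']':
                delete_set.remove(version)
            continue
        inserted_by = insert_stack[-1]
        # A line is active in a version if that version includes the
        # inserting version and none of the currently deleting versions.
        for version, included in enumerate(inclusions):
            if inserted_by in included and not (delete_set & included):
                shas[version].update(item)
    return [sha.hexdigest() for sha in shas]


# Version 0 is "a", "b"; version 1 (child of 0) deletes "b" and adds "c".
toy_weave = [
    ('{', 0), b'a\n',
    ('[', 1), b'b\n', (']', 1),
    ('{', 1), b'c\n', ('}', 1),
    ('}', 0),
]
toy_parents = [[], [0]]
digests = check_all_versions(toy_weave, toy_parents)
```

Each digest can then be compared against the sha stored for that version,
so the weave is walked once instead of once per extracted text.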

What I see is roughly a 4-fold improvement in Weave.check() speed (7m13s
down to 1m47s wall clock):
$ time PYTHONPATH='../bzr.dev' python2.4 \
	../bzr.dev/bzrlib/weave.py check .bzr/inventory.weave
3030 versions ok


real    7m13.695s
user    4m53.629s
sys     0m5.264s

$ time PYTHONPATH='.' python2.4 \
	./bzrlib/weave.py check .bzr/inventory.weave
3032 versions ok


real    1m47.000s
user    1m13.941s
sys     0m2.410s


(What I found interesting is that IntSet.update() is way faster than the
set version, 0.17s versus 3.8s. However, even when I unrolled the
__contains__ and __and__ functions inline, the overall check was still 2x
slower with IntSet. I'm guessing that if we switched to a C++
implementation of IntSet it would be faster overall, but that may not be
our bottleneck. Set(ints) should be pretty fast, since the hash table
implementation doesn't have to do a lot to check a long or an int.)
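
For anyone who wants to try this kind of comparison, here is a rough sketch.
BitSet below is a hypothetical stand-in that stores its members as bits of
one Python integer; it is not bzrlib's IntSet, and the timings it prints
will not match the numbers quoted above.

```python
import timeit


class BitSet:
    """Toy set of small non-negative ints, stored as bits of one integer."""

    def __init__(self, values=()):
        self._bits = 0
        self.update(values)

    def update(self, values):
        if isinstance(values, BitSet):
            # Merging two bit sets is a single integer OR.
            self._bits |= values._bits
        else:
            for value in values:
                self._bits |= 1 << value

    def __contains__(self, value):
        return ((self._bits >> value) & 1) == 1

    def __and__(self, other):
        result = BitSet()
        result._bits = self._bits & other._bits
        return result

    def __bool__(self):
        return self._bits != 0


evens = BitSet(range(0, 3000, 2))
odds = BitSet(range(1, 3000, 2))
plain_evens = set(range(0, 3000, 2))
plain_odds = set(range(1, 3000, 2))

# Compare the operations the check loop leans on: update, membership
# and intersection. Results depend heavily on the interpreter.
for label, statement in [
    ("BitSet update", "BitSet().update(evens)"),
    ("set update", "set().update(plain_evens)"),
    ("BitSet contains", "1500 in evens"),
    ("set contains", "1500 in plain_evens"),
    ("BitSet and", "bool(evens & odds)"),
    ("set and", "bool(plain_evens & plain_odds)"),
]:
    seconds = timeit.timeit(statement, globals=globals(), number=2000)
    print("%-16s %.4fs" % (label, seconds))
```

Timing the individual operations separately helps show whether a fast
update() is being outweighed by slower membership or intersection tests.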

After my improvements, Weaves check their own contents, so InventoryFile
can assume that the weave prelude is accurate. (We lose the text length
check, because the weave doesn't store that. It should, by the way.) With
that change I saw a large decrease in the overall time of bzr check.

The old code was:
$ time BZR_PLUGIN_PATH="" ../bzr.dev/bzr check
checked branch /Users/jameinel/dev/bzr/bzr-jam-integration format 6

  3033 revisions
  6882 unique file texts
553642 repeated file texts
     2 ghost revisions
     2 revisions missing parents in ancestry

real    16m59.083s
user    12m16.969s
sys     0m26.660s

With the parallel extraction it took:
$ time BZR_PLUGIN_PATH="" ./bzr check
checked branch /Users/jameinel/dev/bzr/bzr-jam-integration format 6

  3033 revisions
  6882 unique file texts
553642 repeated file texts
   347 weaves
     2 ghost revisions
     2 revisions missing parents in ancestry

real    11m10.503s
user    8m41.388s
sys     0m15.454s

Also, one of the expensive parts of the new code is that it checks all
of the inventory.weave file, which the old code didn't check at all.
That by itself adds 1m47s.
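
To put the timings above side by side, here is a small sketch that turns
the `real` values into speedup ratios. The helper names parse_time and
speedup are made up for this example, and subtracting the 1m47s inventory
check from the new total is only a rough like-for-like adjustment.

```python
import re


def parse_time(text):
    """Convert a shell `time` value such as '7m13.695s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError("unrecognised time value: %r" % text)
    return int(match.group(1)) * 60 + float(match.group(2))


def speedup(old, new):
    """Ratio of old wall-clock time to new wall-clock time."""
    return parse_time(old) / parse_time(new)


# 'real' times quoted above.
weave_check = speedup("7m13.695s", "1m47.000s")
full_check = speedup("16m59.083s", "11m10.503s")

# Rough adjustment: remove the inventory.weave check that the old code
# never performed before comparing the two full runs.
adjusted_new = parse_time("11m10.503s") - parse_time("1m47.000s")
adjusted_check = parse_time("16m59.083s") / adjusted_new

print("Weave.check():        %.2fx" % weave_check)
print("bzr check:            %.2fx" % full_check)
print("bzr check (adjusted): %.2fx" % adjusted_check)
```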

All of this is available in my integration branch.
http://bzr.arbash-meinel.com/branches/bzr/jam-integration/

John
=:->


PS> On my slower machine, the times are:
$ time BZR_PLUGIN_PATH='' ./bzr check
checked branch /home/jameinel/dev/bzr/bzr-jam-integration format 6

  3028 revisions
  6873 unique file texts
552271 repeated file texts
     2 ghost revisions
     2 revisions missing parents in ancestry

real    27m33.188s
user    27m17.538s
sys     0m11.577s

versus, with the parallel extraction:
$ time BZR_PLUGIN_PATH='' ./bzr check
checked branch /home/jameinel/dev/bzr/bzr-jam-integration format 6

  3034 revisions
  6889 unique file texts
553911 repeated file texts
   347 weaves
     2 ghost revisions
     2 revisions missing parents in ancestry

real    18m11.597s
user    18m2.656s
sys     0m5.696s
