Rev 5: Try a 'chained-btree' approach. Instead of putting all in http://bzr.arbash-meinel.com/plugins/per_file_graph

John Arbash Meinel john at arbash-meinel.com
Thu Oct 29 16:44:40 GMT 2009


At http://bzr.arbash-meinel.com/plugins/per_file_graph

------------------------------------------------------------
revno: 5
revision-id: john at arbash-meinel.com-20091029164427-a2wozogxe33gm1qm
parent: john at arbash-meinel.com-20091029154811-cnsasss0shnr6r3p
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: per_file_graph
timestamp: Thu 2009-10-29 11:44:27 -0500
message:
  Try a 'chained-btree' approach. Instead of putting all
  the (file_id, revision_id) tuples into one index, create a separate
  index for each file_id, and then create one more index mapping each
  file_id to the individual index it uses.
  Initial results are not great. Many of the per-file graphs are
  tiny, so the per-index btree header data starts to cause significant
  overhead. In testing the result is about 4.3MB (equivalent to a
  regular b+tree without 'set_optimize(for_size=True)'), and the
  resulting file can be run through gzip to reduce it to 3.5MB
  (equivalent to setting the optimize flag).
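  
  As a small illustration of the key split (hypothetical file and
  revision ids, not from a real repository):
  
      # Hypothetical data. One combined index maps two-element keys:
      combined = {('file-a', 'rev-2'): [('file-a', 'rev-1')]}
      # Each per-file sub-index is keyed by revision id alone:
      sub_index = {('rev-2',): (('rev-1',),)}
      # And one trailing index maps each file_id to the byte offset of
      # its sub-index within the combined file:
      trailing = {('file-a',): '0'}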
  
  
  There is probably stuff we could tweak, like moving the header info
  into the overall index, since most of that info is redundant.
  (They are all the same format index, they are all ref-list=1 and
  key-elements=1; only 'len' and 'row_lengths' are unique.)
  There are a small handful that have more than 1 row, and of those,
  one actually splits to 3 levels (TREE_ROOT, I assume).
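  
  A rough sketch of that tweak (hypothetical; neither these helpers nor
  the packed value format are part of this revision) would fold the
  fields that vary per sub-index into the value of its node in the
  overall index, so the sub-indexes could drop their headers:
  
      # Hypothetical: pack the only per-sub-index header fields that
      # differ ('len' and 'row_lengths') into the top-level node value
      # next to the byte offset. Assumes at least one row.
      def pack_subindex_value(offset, num_nodes, row_lengths):
          return '%d %d %s' % (offset, num_nodes,
                               ','.join(str(r) for r in row_lengths))
  
      def unpack_subindex_value(value):
          offset, num_nodes, rows = value.split(' ', 2)
          return (int(offset), int(num_nodes),
                  [int(r) for r in rows.split(',')])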
-------------- next part --------------
=== modified file '__init__.py'
--- a/__init__.py	2009-10-29 15:18:30 +0000
+++ b/__init__.py	2009-10-29 16:44:27 +0000
@@ -33,6 +33,10 @@
 format_registry.register_lazy('simple-btree-16k',
     'bzrlib.plugins.per_file_graph.simple_btree', 'SimpleBTreeBuilder16k',
     help='a regular BTreeGraphIndex with 16k pages')
+format_registry.register_lazy('chained-btree',
+    'bzrlib.plugins.per_file_graph.chained_btree', 'ChainedBTreeBuilder',
+    help='Create one btree graph per file-id, and then index those'
+         ' so that we can find the individual graph we care about')
 format_registry.default_key = 'simple-btree'
 
 

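(Not part of the diff: resolving that lazy registration goes through the
standard bzrlib Registry lookup, and 'chained-btree' has to be requested
explicitly since 'simple-btree' remains the default key.)

    # format_registry.get() imports chained_btree on first use and
    # returns the ChainedBTreeBuilder class.
    from bzrlib.plugins.per_file_graph import format_registry
    builder_cls = format_registry.get('chained-btree')
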
=== added file 'chained_btree.py'
--- a/chained_btree.py	1970-01-01 00:00:00 +0000
+++ b/chained_btree.py	2009-10-29 16:44:27 +0000
@@ -0,0 +1,88 @@
+# Copyright (C) 2009 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+"""Repeated btree indexes, one for each file-id"""
+
+from bzrlib import btree_index
+
+from bzrlib.plugins.per_file_graph import builder
+
+
+class ChainedBTreeBuilder(builder.Builder):
+    """Create a BTreeGraphIndex for each file-id, and pack them all together.
+    
+    Each file-id's graph only has revision-ids, so we add one more index which
+    has the file-id => sub-index mapping.
+    """
+
+    # So far, this has *some* merit, but the tiny graphs are killing it.
+    # There are 2 sources of overhead:
+    #   1) BTreeGraphIndex always pads at least 120 bytes because of the
+    #      reserved space. Fixing that took 4.6M => 4.3M, and a patch
+    #      has been sent to bzr.dev.
+    #   2) The per-sub-index "B+Tree Graph Index" signature header was
+    #      similarly costly. I don't know the specifics, but running
+    #      *gzip* on the final file dropped it down to 3.5MB, which is
+    #      the same size as the per-file btree @ 4k pages.
+
+    def _build(self):
+        parent_map = self.get_per_file_graph()
+        total = len(parent_map)
+        # Split into per-file graphs
+        by_file_id = {}
+        for idx, (file_key, parent_keys) in enumerate(parent_map.iteritems()):
+            if not (idx & 0x1FF):  # only update progress every 512 items
+                self.update('splitting', idx, total)
+            file_id, revision_id = file_key
+            try:
+                file_id_map = by_file_id[file_id]
+            except KeyError:
+                file_id_map = {}
+                by_file_id[file_id] = file_id_map
+            parent_rev_keys = tuple(
+                [(p_rev_id,) for _, p_rev_id in parent_keys])
+            file_id_map[(revision_id,)] = parent_rev_keys
+
+        out = self.transport.open_write_stream(self.filename)
+        try:
+            file_id_offsets = {}
+            offset = 0
+            total = len(by_file_id)
+            for idx, file_id in enumerate(sorted(by_file_id)):
+                self.update('building indexes', idx, total)
+                file_id_offsets[file_id] = offset
+                file_id_map = by_file_id[file_id]
+                btbuilder = btree_index.BTreeBuilder(reference_lists=1,
+                                                     key_elements=1)
+                btbuilder.set_optimize(for_size=True)
+                for rev_key, parent_rev_keys in file_id_map.iteritems():
+                    btbuilder.add_node(rev_key, '', (parent_rev_keys,))
+                stream = btbuilder.finish()
+                bytes = stream.read()
+                out.write(bytes)
+                offset += len(bytes)
+            # Now that we have written all of the individual graphs,
+            # write out the final index mapping each file_id to its offset
+            btbuilder = btree_index.BTreeBuilder(reference_lists=0,
+                                                 key_elements=1)
+            btbuilder.set_optimize(for_size=True)
+            for file_id, offset in file_id_offsets.iteritems():
+                btbuilder.add_node((file_id,), str(offset))
+            stream = btbuilder.finish()
+            bytes = stream.read()
+            out.write(bytes)
+        finally:
+            out.close()


