Rev 2655: Merge index improvements. in http://people.ubuntu.com/~robertc/baz2.0/repository
Robert Collins
robertc at robertcollins.net
Sun Jul 15 08:35:58 BST 2007
At http://people.ubuntu.com/~robertc/baz2.0/repository
------------------------------------------------------------
revno: 2655
revision-id: robertc at robertcollins.net-20070715073554-ge1dpaaa12s4ex7p
parent: robertc at robertcollins.net-20070715041046-c1jgc3fjq3v4rvdy
parent: robertc at robertcollins.net-20070715073137-cd9kb764q4e40o0f
committer: Robert Collins <robertc at robertcollins.net>
branch nick: repository
timestamp: Sun 2007-07-15 17:35:54 +1000
message:
Merge index improvements.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
bzrlib/tests/test_index.py test_index.py-20070712131115-lolkarso50vjr64s-2
doc/developers/indices.txt indices.txt-20070713142939-m5cdnp31u8ape0td-1
doc/developers/repository.txt repository.txt-20070709152006-xkhlek456eclha4u-1
------------------------------------------------------------
revno: 2626.1.11
revision-id: robertc at robertcollins.net-20070715073137-cd9kb764q4e40o0f
parent: robertc at robertcollins.net-20070715045353-27opxm5h91ez0fjs
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 17:31:37 +1000
message:
Tweak documentation as per Aaron's review.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
doc/developers/indices.txt indices.txt-20070713142939-m5cdnp31u8ape0td-1
doc/developers/repository.txt repository.txt-20070709152006-xkhlek456eclha4u-1
------------------------------------------------------------
revno: 2626.1.10
revision-id: robertc at robertcollins.net-20070715045353-27opxm5h91ez0fjs
parent: robertc at robertcollins.net-20070715044519-140kzz00uzldgt7z
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:53:53 +1000
message:
Remove some unneeded index iteration by checking if we have found all keys, and grammar improvements from Aaron's review.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
doc/developers/indices.txt indices.txt-20070713142939-m5cdnp31u8ape0td-1
doc/developers/repository.txt repository.txt-20070709152006-xkhlek456eclha4u-1
------------------------------------------------------------
revno: 2626.1.9
revision-id: robertc at robertcollins.net-20070715044519-140kzz00uzldgt7z
parent: robertc at robertcollins.net-20070715044034-121yu86vvyet4akv
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:45:19 +1000
message:
Various index tweaks and test clarity from John's review.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
bzrlib/tests/test_index.py test_index.py-20070712131115-lolkarso50vjr64s-2
------------------------------------------------------------
revno: 2626.1.8
revision-id: robertc at robertcollins.net-20070715044034-121yu86vvyet4akv
parent: robertc at robertcollins.net-20070715043527-ub3bjsi71j9jnzum
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:40:34 +1000
message:
Check the index length is as expected, when we have done preprocessing.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
------------------------------------------------------------
revno: 2626.1.7
revision-id: robertc at robertcollins.net-20070715043527-ub3bjsi71j9jnzum
parent: robertc at robertcollins.net-20070715043029-59iiywcyu729js37
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:35:27 +1000
message:
Remove duplication in the index serialisation logic with John's suggestion.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
------------------------------------------------------------
revno: 2626.1.6
revision-id: robertc at robertcollins.net-20070715043029-59iiywcyu729js37
parent: robertc at robertcollins.net-20070714145757-n37rf8ezk0avc1eh
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:30:29 +1000
message:
Reverse index ordering - we do not have date prefixed revids.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
bzrlib/tests/test_index.py test_index.py-20070712131115-lolkarso50vjr64s-2
=== modified file 'bzrlib/index.py'
--- a/bzrlib/index.py 2007-07-14 14:57:57 +0000
+++ b/bzrlib/index.py 2007-07-15 07:31:37 +0000
@@ -67,7 +67,7 @@
def add_node(self, key, references, value):
"""Add a node to the index.
- :param key: The key. keys must be whitespace free utf8.
+ :param key: The key. keys must be whitespace-free utf8.
:param references: An iterable of iterables of keys. Each is a
reference to another key.
:param value: The value to associate with the key. It may be any
@@ -106,15 +106,24 @@
# one to pad all the data with reference-length and determine entry
# addresses.
# One to serialise.
- nodes = sorted(self._nodes.items(),reverse=True)
+
+ # forward sorted by key. In future we may consider topological sorting,
+ # at the cost of table scans for direct lookup, or a second index for
+ # direct lookup
+ nodes = sorted(self._nodes.items())
+ # if we do not prepass, we don't know how long it will be up front.
+ expected_bytes = None
# we only need to pre-pass if we have reference lists at all.
if self.reference_lists:
+ key_offset_info = []
non_ref_bytes = prefix_length
total_references = 0
# TODO use simple multiplication for the constants in this loop.
- # TODO: factor out the node length calculations so this loop
- # and the next have less (no!) duplicate code.
for key, (absent, references, value) in nodes:
+ # record the offset known *so far* for this key:
+ # the non reference bytes to date, and the total references to
+ # date - saves reaccumulating on the second pass
+ key_offset_info.append((key, non_ref_bytes, total_references))
# key is literal, value is literal, there are 3 null's, 1 NL
non_ref_bytes += len(key) + len(value) + 3 + 1
# one byte for absent if set.
@@ -136,27 +145,11 @@
while 10 ** digits < possible_total_bytes:
digits += 1
possible_total_bytes = non_ref_bytes + total_references*digits
+ expected_bytes = possible_total_bytes + 1 # terminating newline
# resolve key addresses.
key_addresses = {}
- current_offset = prefix_length
- for key, (absent, references, value) in nodes:
- key_addresses[key] = current_offset
- # key is literal, value is literal, there are 3 null's, 1 NL
- current_offset += len(key) + len(value) + 3 + 1
- # one byte for absent if set.
- if absent:
- current_offset += 1
- elif self.reference_lists:
- # (ref_lists -1) tabs
- current_offset += self.reference_lists - 1
- # (ref-1 cr's per ref_list)
- for ref_list in references:
- # accrue reference bytes
- current_offset += digits * len(ref_list)
- # accrue reference separators
- if ref_list:
- # accrue reference separators
- current_offset += len(ref_list) - 1
+ for key, non_ref_bytes, total_references in key_offset_info:
+ key_addresses[key] = non_ref_bytes + total_references*digits
# serialise
format_string = '%%0%sd' % digits
for key, (absent, references, value) in nodes:
@@ -169,6 +162,11 @@
lines.append("%s\0%s\0%s\0%s\n" % (key, absent,
'\t'.join(flattened_references), value))
lines.append('\n')
+ result = StringIO(''.join(lines))
+ if expected_bytes and len(result.getvalue()) != expected_bytes:
+ raise errors.BzrError('Failed index creation. Internal error:'
+ ' mismatched output length and expected length: %d %d' %
+ (len(result.getvalue()), expected_bytes))
return StringIO(''.join(lines))
@@ -178,13 +176,15 @@
The index maps keys to a list of key reference lists, and a value.
Each node has the same number of key reference lists. Each key reference
list can be empty or an arbitrary length. The value is an opaque NULL
- terminated string without any newlines.
+ terminated string without any newlines. The storage of the index is
+ hidden in the interface: keys and key references are always bytestrings,
+ never the internal representation (e.g. dictionary offsets).
It is presumed that the index will not be mutated - it is static data.
- Currently successive iter_entries/iter_all_entries calls will read the
- entire index each time. Additionally iter_entries calls will read the
- entire index always. XXX: This must be fixed before the index is
+ Successive iter_all_entries calls will read the entire index each time.
+ Additionally, iter_entries calls will read the index linearly until the
+ desired keys are found. XXX: This must be fixed before the index is
suitable for production use. :XXX
"""
@@ -214,7 +214,8 @@
if line == '\n':
trailers += 1
continue
- key, absent, references, value = line[:-1].split('\0')
+ key, absent, references, value = line.split('\0')
+ value = value[:-1] # remove the newline
ref_lists = []
for ref_string in references.split('\t'):
ref_lists.append(tuple([
@@ -223,7 +224,7 @@
ref_lists = tuple(ref_lists)
self.keys_by_offset[pos] = (key, absent, ref_lists, value)
pos += len(line)
- for key, absent, references, value in self.keys_by_offset.values():
+ for key, absent, references, value in self.keys_by_offset.itervalues():
if absent:
continue
# resolve references:
@@ -260,9 +261,14 @@
efficient order for the index.
"""
keys = set(keys)
+ if not keys:
+ return
for node in self.iter_all_entries():
+ if not keys:
+ return
if node[0] in keys:
yield node
+ keys.remove(node[0])
def _signature(self):
"""The file signature for this index type."""
@@ -280,12 +286,18 @@
The backing indices must implement GraphIndex, and are presumed to be
static data.
+
+ Queries against the combined index will be made against the first index,
+ and then the second and so on. The order of indices can thus influence
+ performance significantly. For example, if one index is on local disk and a
+ second on a remote server, the local disk index should be before the other
+ in the index list.
"""
def __init__(self, indices):
"""Create a CombinedGraphIndex backed by indices.
- :param indices: The indices to query for data.
+ :param indices: An ordered list of indices to query for data.
"""
self._indices = indices
@@ -300,6 +312,9 @@
def iter_all_entries(self):
"""Iterate over all keys within the index
+ Duplicate keys across child indices are presumed to have the same
+ value and are only reported once.
+
:return: An iterable of (key, reference_lists, value). There is no
defined order for the result iteration - it will be in the most
efficient order for the index.
@@ -314,6 +329,9 @@
def iter_entries(self, keys):
"""Iterate over keys within the index.
+ Duplicate keys across child indices are presumed to have the same
+ value and are only reported once.
+
:param keys: An iterable providing the keys to be retrieved.
:return: An iterable of (key, reference_lists, value). There is no
defined order for the result iteration - it will be in the most
@@ -321,6 +339,8 @@
"""
keys = set(keys)
for index in self._indices:
+ if not keys:
+ return
for node in index.iter_entries(keys):
keys.remove(node[0])
yield node
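The early-exit added to `iter_entries` in the hunk above is a small but general pattern: track the outstanding keys as a set, drop each key as it is found, and stop scanning the moment the set empties. A minimal standalone sketch — the function name and tuple shape are illustrative, not bzrlib's actual API:

```python
def iter_entries(all_entries, keys):
    """Yield nodes whose key is in `keys`, stopping as soon as all
    requested keys have been seen (the early-exit from the diff)."""
    keys = set(keys)
    if not keys:
        return
    for node in all_entries:
        if not keys:
            # every requested key already yielded; skip the rest
            return
        if node[0] in keys:
            keys.remove(node[0])
            yield node
```

With a large index and a handful of requested keys this avoids scanning past the last hit, which is exactly the behaviour the revised docstring describes: "read the index linearly until the desired keys are found".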
=== modified file 'bzrlib/tests/test_index.py'
--- a/bzrlib/tests/test_index.py 2007-07-14 14:57:57 +0000
+++ b/bzrlib/tests/test_index.py 2007-07-15 04:45:19 +0000
@@ -47,7 +47,7 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=0\n"
- "akey\0\0\0data\n\n", contents)
+ "akey\x00\x00\x00data\n\n", contents)
def test_add_node_empty_value(self):
builder = GraphIndexBuilder()
@@ -55,23 +55,23 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=0\n"
- "akey\0\0\0\n\n", contents)
+ "akey\x00\x00\x00\n\n", contents)
- def test_build_index_two_nodes_sorted_reverse(self):
+ def test_build_index_two_nodes_sorted(self):
# the highest sorted node comes first.
builder = GraphIndexBuilder()
# use three to have a good chance of glitching dictionary hash
# lookups etc. Insert in randomish order that is not correct
# and not the reverse of the correct order.
+ builder.add_node('2002', (), 'data')
+ builder.add_node('2000', (), 'data')
builder.add_node('2001', (), 'data')
- builder.add_node('2000', (), 'data')
- builder.add_node('2002', (), 'data')
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=0\n"
- "2002\0\0\0data\n"
- "2001\0\0\0data\n"
- "2000\0\0\0data\n"
+ "2000\x00\x00\x00data\n"
+ "2001\x00\x00\x00data\n"
+ "2002\x00\x00\x00data\n"
"\n", contents)
def test_build_index_reference_lists_are_included_one(self):
@@ -80,7 +80,7 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
- "key\0\0\0data\n"
+ "key\x00\x00\x00data\n"
"\n", contents)
def test_build_index_reference_lists_are_included_two(self):
@@ -89,7 +89,7 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
- "key\0\0\t\0data\n"
+ "key\x00\x00\t\x00data\n"
"\n", contents)
def test_node_references_are_byte_offsets(self):
@@ -99,8 +99,8 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
- "reference\0\0\0data\n"
- "key\0\x0038\0data\n"
+ "key\x00\x0051\x00data\n"
+ "reference\x00\x00\x00data\n"
"\n", contents)
def test_node_references_are_cr_delimited(self):
@@ -111,51 +111,64 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
- "reference2\0\0\0data\n"
- "reference\0\0\0data\n"
- "key\0\x0056\r38\0data\n"
+ "key\x00\x0054\r71\x00data\n"
+ "reference\x00\x00\x00data\n"
+ "reference2\x00\x00\x00data\n"
"\n", contents)
def test_multiple_reference_lists_are_tab_delimited(self):
builder = GraphIndexBuilder(reference_lists=2)
- builder.add_node('reference', ([], []), 'data')
- builder.add_node('key', (['reference'], ['reference']), 'data')
+ builder.add_node('keference', ([], []), 'data')
+ builder.add_node('rey', (['keference'], ['keference']), 'data')
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
- "reference\0\0\t\0data\n"
- "key\0\x0038\t38\0data\n"
+ "keference\x00\x00\t\x00data\n"
+ "rey\x00\x0038\t38\x00data\n"
"\n", contents)
def test_add_node_referencing_missing_key_makes_absent(self):
builder = GraphIndexBuilder(reference_lists=1)
- builder.add_node('key', (['reference', 'reference2'], ), 'data')
+ builder.add_node('rey', (['beference', 'aeference2'], ), 'data')
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
- "reference2\0a\0\0\n"
- "reference\0a\0\0\n"
- "key\0\x0053\r38\0data\n"
+ "aeference2\x00a\x00\x00\n"
+ "beference\x00a\x00\x00\n"
+ "rey\x00\x0053\r38\x00data\n"
"\n", contents)
def test_node_references_three_digits(self):
# test the node digit expands as needed.
builder = GraphIndexBuilder(reference_lists=1)
- references = map(str, range(9))
- builder.add_node('5-key', (references, ), '')
+ references = map(str, reversed(range(9)))
+ builder.add_node('2-key', (references, ), '')
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
+ "0\x00a\x00\x00\n"
+ "1\x00a\x00\x00\n"
+ "2\x00a\x00\x00\n"
+ "2-key\x00\x00130\r124\r118\r112\r106\r100\r050\r044\r038\x00\n"
+ "3\x00a\x00\x00\n"
+ "4\x00a\x00\x00\n"
+ "5\x00a\x00\x00\n"
+ "6\x00a\x00\x00\n"
+ "7\x00a\x00\x00\n"
"8\x00a\x00\x00\n"
- "7\x00a\x00\x00\n"
- "6\x00a\x00\x00\n"
- "5-key\x00\x00130\r124\r118\r112\r106\r100\r050\r044\r038\x00\n"
- "5\x00a\x00\x00\n"
- "4\x00a\x00\x00\n"
- "3\x00a\x00\x00\n"
- "2\x00a\x00\x00\n"
- "1\x00a\x00\x00\n"
- "0\x00a\x00\x00\n"
+ "\n", contents)
+
+ def test_absent_has_no_reference_overhead(self):
+ # the offsets after an absent record should be correct when there are
+ # >1 reference lists.
+ builder = GraphIndexBuilder(reference_lists=2)
+ builder.add_node('parent', (['aail', 'zther'], []), '')
+ stream = builder.finish()
+ contents = stream.read()
+ self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
+ "aail\x00a\x00\x00\n"
+ "parent\x00\x0038\r63\t\x00\n"
+ "zther\x00a\x00\x00\n"
"\n", contents)
def test_add_node_bad_key(self):
@@ -171,7 +184,7 @@
self.assertRaises(errors.BadIndexValue, builder.add_node, 'akey',
(), 'data\naa')
self.assertRaises(errors.BadIndexValue, builder.add_node, 'akey',
- (), 'data\0aa')
+ (), 'data\x00aa')
def test_add_node_bad_mismatched_ref_lists_length(self):
builder = GraphIndexBuilder()
@@ -215,19 +228,6 @@
builder.add_node('key', (['reference'], ), 'data')
builder.add_node('reference', ([],), 'data')
- def test_absent_has_no_reference_overhead(self):
- # the offsets after an absent record should be correct when there are
- # >1 reference lists.
- builder = GraphIndexBuilder(reference_lists=2)
- builder.add_node('parent', (['tail', 'other'], []), '')
- stream = builder.finish()
- contents = stream.read()
- self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
- "tail\x00a\x00\x00\n"
- "parent\x00\x0038\r63\t\x00\n"
- "other\x00a\x00\x00\n"
- "\n", contents)
-
class TestGraphIndex(TestCaseWithMemoryTransport):
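The serialisation change in `bzrlib/index.py` above removes a duplicated second loop by recording a `(key, non_ref_bytes, total_references)` triple per node during the first pass: once the digit width of an offset is known, each key's address is simply `non_ref_bytes + total_references * digits`. A hedged sketch of the arithmetic, using a simplified row format (names and layout are illustrative, not the real index file format):

```python
def layout_offsets(nodes, prefix_length):
    """Compute the byte offset of each key in a serialised index.

    Simplified row format: key NUL value NUL refs NL, where each
    reference is later written as a fixed-width decimal offset of
    `digits` digits.  One pass accumulates, per key, the non-reference
    bytes and the running reference count seen so far; computing the
    final address then needs no second scan of the node data.
    """
    key_offset_info = []
    non_ref_bytes = prefix_length
    total_references = 0
    for key, (references, value) in nodes:
        # record what is known *so far* for this key
        key_offset_info.append((key, non_ref_bytes, total_references))
        non_ref_bytes += len(key) + len(value) + 3  # two NULs + newline
        total_references += len(references)
    # widen the field until every offset fits in `digits` digits
    digits = 1
    while 10 ** digits < non_ref_bytes + total_references * digits:
        digits += 1
    addresses = {key: nrb + refs * digits
                 for key, nrb, refs in key_offset_info}
    return addresses, digits
```

The real builder additionally accounts for absent markers, tab separators between reference lists, and CR separators within a list; the triple-recording trick that replaces the deleted second loop is the same.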
=== modified file 'doc/developers/indices.txt'
--- a/doc/developers/indices.txt 2007-07-13 15:05:36 +0000
+++ b/doc/developers/indices.txt 2007-07-15 07:31:37 +0000
@@ -34,7 +34,7 @@
========
bzr is moving to a write-once model for repository storage in order to
-achieve lock-free repositories eventually. In order to support this we are
+achieve lock-free repositories eventually. In order to support this, we are
making our new index classes **immutable**. That is, one creates a new
index in a single operation, and after that it is read only. To combine
two indices a ``Combined*`` index may be used, or an **index merge** may
@@ -44,10 +44,20 @@
General Index API
=================
-While different indices will likely require different interfaces, we
-should keep these consistent to encourage easy adaption and reuse between
-indices. For instance a simple key-value index should be layerable on top
-of a more complex graph-aware index.
+We may end up with multiple different Index types (e.g. GraphIndex,
+Index, WhackyIndex). Even though these may require different method
+signatures to operate would strive to keep the signatures and return
+values as similar as possible. e.g.::
+
+ GraphIndexBuilder - add_node(key, references, value)
+ IndexBuilder - add_node(key, value)
+ WhackyIndexBuilder - add_node(key, whackiness, value)
+
+as opposed to something quite different like::
+
+ node = IncrementalBuilder.get_node()
+ node.key = 'foo'
+ node.value = 'bar'
Services
--------
=== modified file 'doc/developers/repository.txt'
--- a/doc/developers/repository.txt 2007-07-13 15:05:36 +0000
+++ b/doc/developers/repository.txt 2007-07-15 07:31:37 +0000
@@ -44,8 +44,9 @@
introduced by a set of revisions in some cheap form, insert
data from a stream, validate data during insertion.
Garbage Collection Exclusive lock the repository preventing readers.
-Revert Revision graph access, Inventory extraction, file text
- access.
+Revert Delta from working tree to historical tree, and then
+ arbitrary file access to obtain the texts of differing
+ files.
Uncommit Revision graph access.
Status Revision graph access, revision text access, file
fingerprint information, inventory differencing.
@@ -55,8 +56,11 @@
fetch is needed.
Log Revision graph (entire at the moment) access,
sometimes status between adjacent revisions. Log of a
- file needs per-file-graph.
-Missing Revision graph access.
+ file needs per-file-graph. Dominator caching or
+ similar tools may be needed to prevent entire graph
+ access.
+Missing Revision graph access, and revision texts to show
+ output.
Update As for merge, but twice.
================== ====================
@@ -65,8 +69,11 @@
Ideally we can make our data access for commands such as branch to
dovetail well with the native storage in the repository, in the common
-case. Doing this may require the commands to operate in predictable
-manners.
+case. Doing this may require choosing the behaviour of some commands to
+allow us to have a smaller range of access patterns which we can optimise
+more heavily. Alternatively, if each command is very predictable in its
+data access pattern we may be able to hint to the low level layers which
+pattern is needed on a per command basis to get efficient behaviour.
=================== ===================================================
Command Data access pattern
@@ -134,6 +141,13 @@
Patterns used
-------------
+Note that these are able to be changed by changing what we store. For
+instance if the repository satisfies mpdiff requests, then bundle can be
+defined in terms of mpdiff lookups rather than file text lookups
+appropriate to create mpdiffs. If the repository satisfies full text
+requests only, then you need the topological access to build up the
+desired mpdiffs.
+
=========================================== =========
Pattern Commands
=========================================== =========
@@ -262,7 +276,7 @@
Discovery of files
~~~~~~~~~~~~~~~~~~
-With non listable transports how should the collection of pack/index files
+With non-listable transports how should the collection of pack/index files
be found ? Initially record a list of all the pack/index files from
write actions. (Require writable transports to be listable). We can then
use a heuristic to statically combine pack/index files later.