Rev 2655: Merge index improvements. in http://people.ubuntu.com/~robertc/baz2.0/repository
Robert Collins
robertc at robertcollins.net
Sun Jul 15 08:35:58 BST 2007
At http://people.ubuntu.com/~robertc/baz2.0/repository
------------------------------------------------------------
revno: 2655
revision-id: robertc at robertcollins.net-20070715073554-ge1dpaaa12s4ex7p
parent: robertc at robertcollins.net-20070715041046-c1jgc3fjq3v4rvdy
parent: robertc at robertcollins.net-20070715073137-cd9kb764q4e40o0f
committer: Robert Collins <robertc at robertcollins.net>
branch nick: repository
timestamp: Sun 2007-07-15 17:35:54 +1000
message:
Merge index improvements.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
bzrlib/tests/test_index.py test_index.py-20070712131115-lolkarso50vjr64s-2
doc/developers/indices.txt indices.txt-20070713142939-m5cdnp31u8ape0td-1
doc/developers/repository.txt repository.txt-20070709152006-xkhlek456eclha4u-1
------------------------------------------------------------
revno: 2626.1.11
revision-id: robertc at robertcollins.net-20070715073137-cd9kb764q4e40o0f
parent: robertc at robertcollins.net-20070715045353-27opxm5h91ez0fjs
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 17:31:37 +1000
message:
Tweak documentation as per Aaron's review.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
doc/developers/indices.txt indices.txt-20070713142939-m5cdnp31u8ape0td-1
doc/developers/repository.txt repository.txt-20070709152006-xkhlek456eclha4u-1
------------------------------------------------------------
revno: 2626.1.10
revision-id: robertc at robertcollins.net-20070715045353-27opxm5h91ez0fjs
parent: robertc at robertcollins.net-20070715044519-140kzz00uzldgt7z
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:53:53 +1000
message:
Remove some unneeded index iteration by checking if we have found all keys, and grammar improvements from Aaron's review.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
doc/developers/indices.txt indices.txt-20070713142939-m5cdnp31u8ape0td-1
doc/developers/repository.txt repository.txt-20070709152006-xkhlek456eclha4u-1
------------------------------------------------------------
revno: 2626.1.9
revision-id: robertc at robertcollins.net-20070715044519-140kzz00uzldgt7z
parent: robertc at robertcollins.net-20070715044034-121yu86vvyet4akv
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:45:19 +1000
message:
Various index tweaks and test clarity from John's review.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
bzrlib/tests/test_index.py test_index.py-20070712131115-lolkarso50vjr64s-2
------------------------------------------------------------
revno: 2626.1.8
revision-id: robertc at robertcollins.net-20070715044034-121yu86vvyet4akv
parent: robertc at robertcollins.net-20070715043527-ub3bjsi71j9jnzum
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:40:34 +1000
message:
Check the index length is as expected, when we have done preprocessing.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
------------------------------------------------------------
revno: 2626.1.7
revision-id: robertc at robertcollins.net-20070715043527-ub3bjsi71j9jnzum
parent: robertc at robertcollins.net-20070715043029-59iiywcyu729js37
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:35:27 +1000
message:
Remove duplication in the index serialisation logic with John's suggestion.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
------------------------------------------------------------
revno: 2626.1.6
revision-id: robertc at robertcollins.net-20070715043029-59iiywcyu729js37
parent: robertc at robertcollins.net-20070714145757-n37rf8ezk0avc1eh
committer: Robert Collins <robertc at robertcollins.net>
branch nick: index
timestamp: Sun 2007-07-15 14:30:29 +1000
message:
Reverse index ordering - we do not have date prefixed revids.
modified:
bzrlib/index.py index.py-20070712131115-lolkarso50vjr64s-1
bzrlib/tests/test_index.py test_index.py-20070712131115-lolkarso50vjr64s-2
=== modified file 'bzrlib/index.py'
--- a/bzrlib/index.py 2007-07-14 14:57:57 +0000
+++ b/bzrlib/index.py 2007-07-15 07:31:37 +0000
@@ -67,7 +67,7 @@
def add_node(self, key, references, value):
"""Add a node to the index.
- :param key: The key. keys must be whitespace free utf8.
+ :param key: The key. keys must be whitespace-free utf8.
:param references: An iterable of iterables of keys. Each is a
reference to another key.
:param value: The value to associate with the key. It may be any
@@ -106,15 +106,24 @@
# one to pad all the data with reference-length and determine entry
# addresses.
# One to serialise.
- nodes = sorted(self._nodes.items(),reverse=True)
+
+ # forward sorted by key. In future we may consider topological sorting,
+ # at the cost of table scans for direct lookup, or a second index for
+ # direct lookup
+ nodes = sorted(self._nodes.items())
+ # if we do not prepass, we don't know how long it will be up front.
+ expected_bytes = None
# we only need to pre-pass if we have reference lists at all.
if self.reference_lists:
+ key_offset_info = []
non_ref_bytes = prefix_length
total_references = 0
# TODO use simple multiplication for the constants in this loop.
- # TODO: factor out the node length calculations so this loop
- # and the next have less (no!) duplicate code.
for key, (absent, references, value) in nodes:
+ # record the offset known *so far* for this key:
+ # the non reference bytes to date, and the total references to
+ # date - saves reaccumulating on the second pass
+ key_offset_info.append((key, non_ref_bytes, total_references))
# key is literal, value is literal, there are 3 null's, 1 NL
non_ref_bytes += len(key) + len(value) + 3 + 1
# one byte for absent if set.
@@ -136,27 +145,11 @@
while 10 ** digits < possible_total_bytes:
digits += 1
possible_total_bytes = non_ref_bytes + total_references*digits
+ expected_bytes = possible_total_bytes + 1 # terminating newline
# resolve key addresses.
key_addresses = {}
- current_offset = prefix_length
- for key, (absent, references, value) in nodes:
- key_addresses[key] = current_offset
- # key is literal, value is literal, there are 3 null's, 1 NL
- current_offset += len(key) + len(value) + 3 + 1
- # one byte for absent if set.
- if absent:
- current_offset += 1
- elif self.reference_lists:
- # (ref_lists -1) tabs
- current_offset += self.reference_lists - 1
- # (ref-1 cr's per ref_list)
- for ref_list in references:
- # accrue reference bytes
- current_offset += digits * len(ref_list)
- # accrue reference separators
- if ref_list:
- # accrue reference separators
- current_offset += len(ref_list) - 1
+ for key, non_ref_bytes, total_references in key_offset_info:
+ key_addresses[key] = non_ref_bytes + total_references*digits
# serialise
format_string = '%%0%sd' % digits
for key, (absent, references, value) in nodes:
@@ -169,6 +162,11 @@
lines.append("%s\0%s\0%s\0%s\n" % (key, absent,
'\t'.join(flattened_references), value))
lines.append('\n')
+ result = StringIO(''.join(lines))
+ if expected_bytes and len(result.getvalue()) != expected_bytes:
+ raise errors.BzrError('Failed index creation. Internal error:'
+ ' mismatched output length and expected length: %d %d' %
+ (len(result.getvalue()), expected_bytes))
return StringIO(''.join(lines))
@@ -178,13 +176,15 @@
The index maps keys to a list of key reference lists, and a value.
Each node has the same number of key reference lists. Each key reference
list can be empty or an arbitrary length. The value is an opaque NULL
- terminated string without any newlines.
+ terminated string without any newlines. The storage of the index is
+ hidden in the interface: keys and key references are always bytestrings,
+ never the internal representation (e.g. dictionary offsets).
It is presumed that the index will not be mutated - it is static data.
- Currently successive iter_entries/iter_all_entries calls will read the
- entire index each time. Additionally iter_entries calls will read the
- entire index always. XXX: This must be fixed before the index is
+ Successive iter_all_entries calls will read the entire index each time.
+ Additionally, iter_entries calls will read the index linearly until the
+ desired keys are found. XXX: This must be fixed before the index is
suitable for production use. :XXX
"""
@@ -214,7 +214,8 @@
if line == '\n':
trailers += 1
continue
- key, absent, references, value = line[:-1].split('\0')
+ key, absent, references, value = line.split('\0')
+ value = value[:-1] # remove the newline
ref_lists = []
for ref_string in references.split('\t'):
ref_lists.append(tuple([
@@ -223,7 +224,7 @@
ref_lists = tuple(ref_lists)
self.keys_by_offset[pos] = (key, absent, ref_lists, value)
pos += len(line)
- for key, absent, references, value in self.keys_by_offset.values():
+ for key, absent, references, value in self.keys_by_offset.itervalues():
if absent:
continue
# resolve references:
@@ -260,9 +261,14 @@
efficient order for the index.
"""
keys = set(keys)
+ if not keys:
+ return
for node in self.iter_all_entries():
+ if not keys:
+ return
if node[0] in keys:
yield node
+ keys.remove(node[0])
def _signature(self):
"""The file signature for this index type."""
@@ -280,12 +286,18 @@
The backing indices must implement GraphIndex, and are presumed to be
static data.
+
+ Queries against the combined index will be made against the first index,
+ and then the second and so on. The order of indices can thus influence
+ performance significantly. For example, if one index is on local disk and a
+ second on a remote server, the local disk index should be before the other
+ in the index list.
"""
def __init__(self, indices):
"""Create a CombinedGraphIndex backed by indices.
- :param indices: The indices to query for data.
+ :param indices: An ordered list of indices to query for data.
"""
self._indices = indices
@@ -300,6 +312,9 @@
def iter_all_entries(self):
"""Iterate over all keys within the index
+ Duplicate keys across child indices are presumed to have the same
+ value and are only reported once.
+
:return: An iterable of (key, reference_lists, value). There is no
defined order for the result iteration - it will be in the most
efficient order for the index.
@@ -314,6 +329,9 @@
def iter_entries(self, keys):
"""Iterate over keys within the index.
+ Duplicate keys across child indices are presumed to have the same
+ value and are only reported once.
+
:param keys: An iterable providing the keys to be retrieved.
:return: An iterable of (key, reference_lists, value). There is no
defined order for the result iteration - it will be in the most
@@ -321,6 +339,8 @@
"""
keys = set(keys)
for index in self._indices:
+ if not keys:
+ return
for node in index.iter_entries(keys):
keys.remove(node[0])
yield node
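The early-exit added to `iter_entries` in the hunk above is a small but general pattern: track the outstanding keys as a set, drop each key as it is found, and stop scanning the moment the set empties. A minimal standalone sketch — the function name and tuple shape are illustrative, not bzrlib's actual API:

```python
def iter_entries(all_entries, keys):
    """Yield nodes whose key is in `keys`, stopping as soon as all
    requested keys have been seen (the early-exit from the diff)."""
    keys = set(keys)
    if not keys:
        return
    for node in all_entries:
        if not keys:
            # every requested key already yielded; skip the rest
            return
        if node[0] in keys:
            keys.remove(node[0])
            yield node
```

With a large index and a handful of requested keys this avoids scanning past the last hit, which is exactly the behaviour the revised docstring describes: "read the index linearly until the desired keys are found".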
=== modified file 'bzrlib/tests/test_index.py'
--- a/bzrlib/tests/test_index.py 2007-07-14 14:57:57 +0000
+++ b/bzrlib/tests/test_index.py 2007-07-15 04:45:19 +0000
@@ -47,7 +47,7 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=0\n"
- "akey\0\0\0data\n\n", contents)
+ "akey\x00\x00\x00data\n\n", contents)
def test_add_node_empty_value(self):
builder = GraphIndexBuilder()
@@ -55,23 +55,23 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=0\n"
- "akey\0\0\0\n\n", contents)
+ "akey\x00\x00\x00\n\n", contents)
- def test_build_index_two_nodes_sorted_reverse(self):
+ def test_build_index_two_nodes_sorted(self):
# the highest sorted node comes first.
builder = GraphIndexBuilder()
# use three to have a good chance of glitching dictionary hash
# lookups etc. Insert in randomish order that is not correct
# and not the reverse of the correct order.
+ builder.add_node('2002', (), 'data')
+ builder.add_node('2000', (), 'data')
builder.add_node('2001', (), 'data')
- builder.add_node('2000', (), 'data')
- builder.add_node('2002', (), 'data')
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=0\n"
- "2002\0\0\0data\n"
- "2001\0\0\0data\n"
- "2000\0\0\0data\n"
+ "2000\x00\x00\x00data\n"
+ "2001\x00\x00\x00data\n"
+ "2002\x00\x00\x00data\n"
"\n", contents)
def test_build_index_reference_lists_are_included_one(self):
@@ -80,7 +80,7 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
- "key\0\0\0data\n"
+ "key\x00\x00\x00data\n"
"\n", contents)
def test_build_index_reference_lists_are_included_two(self):
@@ -89,7 +89,7 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
- "key\0\0\t\0data\n"
+ "key\x00\x00\t\x00data\n"
"\n", contents)
def test_node_references_are_byte_offsets(self):
@@ -99,8 +99,8 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
- "reference\0\0\0data\n"
- "key\0\x0038\0data\n"
+ "key\x00\x0051\x00data\n"
+ "reference\x00\x00\x00data\n"
"\n", contents)
def test_node_references_are_cr_delimited(self):
@@ -111,51 +111,64 @@
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
- "reference2\0\0\0data\n"
- "reference\0\0\0data\n"
- "key\0\x0056\r38\0data\n"
+ "key\x00\x0054\r71\x00data\n"
+ "reference\x00\x00\x00data\n"
+ "reference2\x00\x00\x00data\n"
"\n", contents)
def test_multiple_reference_lists_are_tab_delimited(self):
builder = GraphIndexBuilder(reference_lists=2)
- builder.add_node('reference', ([], []), 'data')
- builder.add_node('key', (['reference'], ['reference']), 'data')
+ builder.add_node('keference', ([], []), 'data')
+ builder.add_node('rey', (['keference'], ['keference']), 'data')
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
- "reference\0\0\t\0data\n"
- "key\0\x0038\t38\0data\n"
+ "keference\x00\x00\t\x00data\n"
+ "rey\x00\x0038\t38\x00data\n"
"\n", contents)
def test_add_node_referencing_missing_key_makes_absent(self):
builder = GraphIndexBuilder(reference_lists=1)
- builder.add_node('key', (['reference', 'reference2'], ), 'data')
+ builder.add_node('rey', (['beference', 'aeference2'], ), 'data')
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
- "reference2\0a\0\0\n"
- "reference\0a\0\0\n"
- "key\0\x0053\r38\0data\n"
+ "aeference2\x00a\x00\x00\n"
+ "beference\x00a\x00\x00\n"
+ "rey\x00\x0053\r38\x00data\n"
"\n", contents)
def test_node_references_three_digits(self):
# test the node digit expands as needed.
builder = GraphIndexBuilder(reference_lists=1)
- references = map(str, range(9))
- builder.add_node('5-key', (references, ), '')
+ references = map(str, reversed(range(9)))
+ builder.add_node('2-key', (references, ), '')
stream = builder.finish()
contents = stream.read()
self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=1\n"
+ "0\x00a\x00\x00\n"
+ "1\x00a\x00\x00\n"
+ "2\x00a\x00\x00\n"
+ "2-key\x00\x00130\r124\r118\r112\r106\r100\r050\r044\r038\x00\n"
+ "3\x00a\x00\x00\n"
+ "4\x00a\x00\x00\n"
+ "5\x00a\x00\x00\n"
+ "6\x00a\x00\x00\n"
+ "7\x00a\x00\x00\n"
"8\x00a\x00\x00\n"
- "7\x00a\x00\x00\n"
- "6\x00a\x00\x00\n"
- "5-key\x00\x00130\r124\r118\r112\r106\r100\r050\r044\r038\x00\n"
- "5\x00a\x00\x00\n"
- "4\x00a\x00\x00\n"
- "3\x00a\x00\x00\n"
- "2\x00a\x00\x00\n"
- "1\x00a\x00\x00\n"
- "0\x00a\x00\x00\n"
+ "\n", contents)
+
+ def test_absent_has_no_reference_overhead(self):
+ # the offsets after an absent record should be correct when there are
+ # >1 reference lists.
+ builder = GraphIndexBuilder(reference_lists=2)
+ builder.add_node('parent', (['aail', 'zther'], []), '')
+ stream = builder.finish()
+ contents = stream.read()
+ self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
+ "aail\x00a\x00\x00\n"
+ "parent\x00\x0038\r63\t\x00\n"
+ "zther\x00a\x00\x00\n"
"\n", contents)
def test_add_node_bad_key(self):
@@ -171,7 +184,7 @@
self.assertRaises(errors.BadIndexValue, builder.add_node, 'akey',
(), 'data\naa')
self.assertRaises(errors.BadIndexValue, builder.add_node, 'akey',
- (), 'data\0aa')
+ (), 'data\x00aa')
def test_add_node_bad_mismatched_ref_lists_length(self):
builder = GraphIndexBuilder()
@@ -215,19 +228,6 @@
builder.add_node('key', (['reference'], ), 'data')
builder.add_node('reference', ([],), 'data')
- def test_absent_has_no_reference_overhead(self):
- # the offsets after an absent record should be correct when there are
- # >1 reference lists.
- builder = GraphIndexBuilder(reference_lists=2)
- builder.add_node('parent', (['tail', 'other'], []), '')
- stream = builder.finish()
- contents = stream.read()
- self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
- "tail\x00a\x00\x00\n"
- "parent\x00\x0038\r63\t\x00\n"
- "other\x00a\x00\x00\n"
- "\n", contents)
-
class TestGraphIndex(TestCaseWithMemoryTransport):
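The serialisation change in `bzrlib/index.py` above removes a duplicated second loop by recording a `(key, non_ref_bytes, total_references)` triple per node during the first pass: once the digit width of an offset is known, each key's address is simply `non_ref_bytes + total_references * digits`. A hedged sketch of the arithmetic, using a simplified row format (names and layout are illustrative, not the real index file format):

```python
def layout_offsets(nodes, prefix_length):
    """Compute the byte offset of each key in a serialised index.

    Simplified row format: key NUL value NUL refs NL, where each
    reference is later written as a fixed-width decimal offset of
    `digits` digits.  One pass accumulates, per key, the non-reference
    bytes and the running reference count seen so far; computing the
    final address then needs no second scan of the node data.
    """
    key_offset_info = []
    non_ref_bytes = prefix_length
    total_references = 0
    for key, (references, value) in nodes:
        # record what is known *so far* for this key
        key_offset_info.append((key, non_ref_bytes, total_references))
        non_ref_bytes += len(key) + len(value) + 3  # two NULs + newline
        total_references += len(references)
    # widen the field until every offset fits in `digits` digits
    digits = 1
    while 10 ** digits < non_ref_bytes + total_references * digits:
        digits += 1
    addresses = {key: nrb + refs * digits
                 for key, nrb, refs in key_offset_info}
    return addresses, digits
```

The real builder additionally accounts for absent markers, tab separators between reference lists, and CR separators within a list; the triple-recording trick that replaces the deleted second loop is the same.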
=== modified file 'doc/developers/indices.txt'
--- a/doc/developers/indices.txt 2007-07-13 15:05:36 +0000
+++ b/doc/developers/indices.txt 2007-07-15 07:31:37 +0000
@@ -34,7 +34,7 @@
========
bzr is moving to a write-once model for repository storage in order to
-achieve lock-free repositories eventually. In order to support this we are
+achieve lock-free repositories eventually. In order to support this, we are
making our new index classes **immutable**. That is, one creates a new
index in a single operation, and after that it is read only. To combine
two indices a ``Combined*`` index may be used, or an **index merge** may
@@ -44,10 +44,20 @@
General Index API
=================
-While different indices will likely require different interfaces, we
-should keep these consistent to encourage easy adaption and reuse between
-indices. For instance a simple key-value index should be layerable on top
-of a more complex graph-aware index.
+We may end up with multiple different Index types (e.g. GraphIndex,
+Index, WhackyIndex). Even though these may require different method
+signatures to operate would strive to keep the signatures and return
+values as similar as possible. e.g.::
+
+ GraphIndexBuilder - add_node(key, references, value)
+ IndexBuilder - add_node(key, value)
+ WhackyIndexBuilder - add_node(key, whackiness, value)
+
+as opposed to something quite different like::
+
+ node = IncrementalBuilder.get_node()
+ node.key = 'foo'
+ node.value = 'bar'
Services
--------
=== modified file 'doc/developers/repository.txt'
--- a/doc/developers/repository.txt 2007-07-13 15:05:36 +0000
+++ b/doc/developers/repository.txt 2007-07-15 07:31:37 +0000
@@ -44,8 +44,9 @@
introduced by a set of revisions in some cheap form, insert
data from a stream, validate data during insertion.
Garbage Collection Exclusive lock the repository preventing readers.
-Revert Revision graph access, Inventory extraction, file text
- access.
+Revert Delta from working tree to historical tree, and then
+ arbitrary file access to obtain the texts of differing
+ files.
Uncommit Revision graph access.
Status Revision graph access, revision text access, file
fingerprint information, inventory differencing.
@@ -55,8 +56,11 @@
fetch is needed.
Log Revision graph (entire at the moment) access,
sometimes status between adjacent revisions. Log of a
- file needs per-file-graph.
-Missing Revision graph access.
+ file needs per-file-graph. Dominator caching or
+ similar tools may be needed to prevent entire graph
+ access.
+Missing Revision graph access, and revision texts to show
+ output.
Update As for merge, but twice.
================== ====================
@@ -65,8 +69,11 @@
Ideally we can make our data access for commands such as branch to
dovetail well with the native storage in the repository, in the common
-case. Doing this may require the commands to operate in predictable
-manners.
+case. Doing this may require choosing the behaviour of some commands to
+allow us to have a smaller range of access patterns which we can optimise
+more heavily. Alternatively, if each command is very predictable in its
+data access pattern we may be able to hint to the low level layers which
+pattern is needed on a per command basis to get efficient behaviour.
=================== ===================================================
Command Data access pattern
@@ -134,6 +141,13 @@
Patterns used
-------------
+Note that these are able to be changed by changing what we store. For
+instance if the repository satisfies mpdiff requests, then bundle can be
+defined in terms of mpdiff lookups rather than file text lookups
+appropriate to create mpdiffs. If the repository satisfies full text
+requests only, then you need the topological access to build up the
+desired mpdiffs.
+
=========================================== =========
Pattern Commands
=========================================== =========
@@ -262,7 +276,7 @@
Discovery of files
~~~~~~~~~~~~~~~~~~
-With non listable transports how should the collection of pack/index files
+With non-listable transports how should the collection of pack/index files
be found ? Initially record a list of all the pack/index files from
write actions. (Require writable transports to be listable). We can then
use a heuristic to statically combine pack/index files later.