Rev 4301: (jam) Tweaks to the pure-python group compressor, in file:///home/pqm/archives/thelove/bzr/%2Btrunk/
Canonical.com Patch Queue Manager
pqm at pqm.ubuntu.com
Thu Apr 23 03:08:23 BST 2009
At file:///home/pqm/archives/thelove/bzr/%2Btrunk/
------------------------------------------------------------
revno: 4301
revision-id: pqm at pqm.ubuntu.com-20090423015537-xfgqsbjj9ctpcd3o
parent: pqm at pqm.ubuntu.com-20090420092748-tm2cofylpjauo1nw
parent: john at arbash-meinel.com-20090423005830-kkdc31tqjetbj2f0
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Thu 2009-04-23 02:55:37 +0100
message:
(jam) Tweaks to the pure-python group compressor,
shrinks the time from 30min => 4min in some circumstances.
modified:
NEWS NEWS-20050323055033-4e00b5db738777ff
bzrlib/_groupcompress_py.py _groupcompress_py.py-20090324110021-j63s399f4icrgw4p-1
bzrlib/groupcompress.py groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
bzrlib/tests/test__groupcompress.py test__groupcompress_-20080724145854-koifwb7749cfzrvj-1
bzrlib/tests/test_groupcompress.py test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
------------------------------------------------------------
revno: 4300.1.7
revision-id: john at arbash-meinel.com-20090423005830-kkdc31tqjetbj2f0
parent: john at arbash-meinel.com-20090422231241-rb3imoltcpzeghfe
parent: john at arbash-meinel.com-20090421235416-f0cz6ilf5cufbugi
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: groupcompress_info
timestamp: Wed 2009-04-22 19:58:30 -0500
message:
Bring in the other test cases.
Also, remove the assert statements.
modified:
NEWS NEWS-20050323055033-4e00b5db738777ff
bzrlib/_groupcompress_py.py _groupcompress_py.py-20090324110021-j63s399f4icrgw4p-1
bzrlib/tests/test__groupcompress.py test__groupcompress_-20080724145854-koifwb7749cfzrvj-1
------------------------------------------------------------
revno: 4300.2.1
revision-id: john at arbash-meinel.com-20090421235416-f0cz6ilf5cufbugi
parent: pqm at pqm.ubuntu.com-20090420092748-tm2cofylpjauo1nw
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: 1.15-gc-python
timestamp: Tue 2009-04-21 18:54:16 -0500
message:
Fix bug #364900: properly remove the 64kB that was just encoded from the remaining copy length.
Also, stop supporting None as a copy length in 'encode_copy_instruction'.
It was only used by the test suite, and it is good to pull that sort of thing out of
production code. (Besides, passing a length of 64kB produces the same encoding;
a sketch of that encoding follows this entry.)
modified:
NEWS NEWS-20050323055033-4e00b5db738777ff
bzrlib/_groupcompress_py.py _groupcompress_py.py-20090324110021-j63s399f4icrgw4p-1
bzrlib/tests/test__groupcompress.py test__groupcompress_-20080724145854-koifwb7749cfzrvj-1
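A minimal standalone sketch of that copy-instruction encoding, reconstructed from the
encode_copy_instruction hunk further down and the test vectors in test__groupcompress.py;
treat it as an approximation rather than the exact bzrlib implementation. The command byte
has its high bit set, bits 0-3 flag which little-endian offset bytes follow, and bits 4-5
flag which length bytes follow; a length of exactly 64kB is implied by emitting no length
bytes at all, which is why passing 64*1024 replaces the old None spelling.

def encode_copy_instruction(offset, length):
    # Sketch only: approximates the pure-python encoder in _groupcompress_py.py.
    copy_command = 0x80
    copy_bytes = []
    for copy_bit in (0x01, 0x02, 0x04, 0x08):
        base_byte = offset & 0xff
        if base_byte:
            copy_command |= copy_bit
            copy_bytes.append(chr(base_byte))
        offset >>= 8
    if length is None:
        raise ValueError("cannot supply a length of None")
    if length > 0x10000:
        raise ValueError("we don't emit copy records for lengths > 64KiB")
    if length == 0:
        raise ValueError("we don't emit copy records for lengths == 0")
    if length != 0x10000:
        # A 64kB copy is implied by having no length bytes at all.
        for copy_bit in (0x10, 0x20):
            base_byte = length & 0xff
            if base_byte:
                copy_command |= copy_bit
                copy_bytes.append(chr(base_byte))
            length >>= 8
    return chr(copy_command) + ''.join(copy_bytes)

# encode_copy_instruction(0, 64*1024) == '\x80'
# encode_copy_instruction(257, 64*1024) == '\x83\x01\x01'
# encode_copy_instruction(0, 1) == '\x90\x01'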
------------------------------------------------------------
revno: 4300.1.6
revision-id: john at arbash-meinel.com-20090422231241-rb3imoltcpzeghfe
parent: john at arbash-meinel.com-20090422225955-xkcuonztuijyxec2
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: groupcompress_info
timestamp: Wed 2009-04-22 18:12:41 -0500
message:
Remove a couple TODOs that don't matter.
modified:
bzrlib/_groupcompress_py.py _groupcompress_py.py-20090324110021-j63s399f4icrgw4p-1
------------------------------------------------------------
revno: 4300.1.5
revision-id: john at arbash-meinel.com-20090422225955-xkcuonztuijyxec2
parent: john at arbash-meinel.com-20090422221458-wg8pwibhdvgvvths
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: groupcompress_info
timestamp: Wed 2009-04-22 17:59:55 -0500
message:
A couple more cleanups of the pure-python implementation.
This drops the time for 'bzr pack' from 30min+ down to 4min.
1) Keep the matching entries as a set, rather than keeping a list and
casting it into a set all the time.
2) Delay the +1 increment until we actually compare against the next line's
matches, and then only increment the small set rather than the large one:
'prev' has gone through a set intersection in most code paths, so it will be
a lot smaller than the raw 'locations'.
A sketch of the resulting matching loop follows this entry.
modified:
bzrlib/_groupcompress_py.py _groupcompress_py.py-20090324110021-j63s399f4icrgw4p-1
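That sketch, condensed (the function name is simplified; the real method is
LinesDeltaIndex._get_longest_match in the hunk further down): 'matching' maps each indexed
line to the set of positions where it occurs in the source text, and a run is extended by
intersecting the current line's position set with the previous, already-intersected
(hence small) set shifted by one.

def longest_match(matching, new_lines, pos):
    # Sketch: returns ((old_start, new_start, length), next_pos), or
    # (None, next_pos) when new_lines[pos] matches nothing in the source.
    range_start, range_len, prev_locations = pos, 0, None
    max_pos = len(new_lines)
    while pos < max_pos:
        try:
            locations = matching[new_lines[pos]]
        except KeyError:
            pos += 1
            break
        if prev_locations is None:
            # First line of a new matching range
            prev_locations = locations
            range_len = 1
        else:
            # Only the small, already-intersected set gets the +1 here.
            next_locations = locations.intersection(
                [loc + 1 for loc in prev_locations])
            if not next_locations:
                break   # no region continues to match
            prev_locations = next_locations
            range_len += 1
        pos += 1
    if prev_locations is None:
        return None, pos
    return (min(prev_locations) - range_len + 1, range_start, range_len), pos

Building 'matching' is roughly: for idx, line in enumerate(source_lines):
matching.setdefault(line, set()).add(idx), which is what the set-based
_update_matching_lines change below now does.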
------------------------------------------------------------
revno: 4300.1.4
revision-id: john at arbash-meinel.com-20090422221458-wg8pwibhdvgvvths
parent: john at arbash-meinel.com-20090422205425-ujz47ris3ekak1h4
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: groupcompress_info
timestamp: Wed 2009-04-22 17:14:58 -0500
message:
Change self._matching_lines to use a set rather than a list.
We still need to consider memory consumption, etc., but it means we don't
have to cast into a set() to do the intersection check.
We might consider redoing the copy_ends code of _get_longest_match.
modified:
bzrlib/_groupcompress_py.py _groupcompress_py.py-20090324110021-j63s399f4icrgw4p-1
------------------------------------------------------------
revno: 4300.1.3
revision-id: john at arbash-meinel.com-20090422205425-ujz47ris3ekak1h4
parent: john at arbash-meinel.com-20090422204951-xykrubpy1zehhr9p
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: groupcompress_info
timestamp: Wed 2009-04-22 15:54:25 -0500
message:
The assertion is <= 127, not < 127
modified:
bzrlib/_groupcompress_py.py _groupcompress_py.py-20090324110021-j63s399f4icrgw4p-1
------------------------------------------------------------
revno: 4300.1.2
revision-id: john at arbash-meinel.com-20090422204951-xykrubpy1zehhr9p
parent: john at arbash-meinel.com-20090422171845-5dmqokv8ygf3cvs5
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: groupcompress_info
timestamp: Wed 2009-04-22 15:49:51 -0500
message:
Change the pure-python compressor a bit.
Specifically, change how we encode insertions, and factor that code out into
another class.
The primary change is trying to get better line-based alignment for inserts,
subject to the 127-byte insert limit.
The old code would take a long insert, split it into 127-byte chunks, and then
split those chunks into lines.
However, that tends to leave hunks that can't be indexed, because they aren't
complete lines.
So now we iterate over the lines, fitting them into 127-byte insertions where
possible, so we get proper indexing; a short sketch of the packing follows
this entry.
Note that this means any line > 127 bytes will never be matched, which is
a fairly serious issue in the pure-python matcher, but not worth fixing,
because you can just use the compiled matcher instead.
modified:
bzrlib/_groupcompress_py.py _groupcompress_py.py-20090324110021-j63s399f4icrgw4p-1
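A short sketch of that packing strategy (pack_insert_lines is a hypothetical standalone
helper, not bzrlib API; the real logic lives in the _OutputHandler class added in the hunk
further down, which additionally skips indexing inserts shorter than a minimum length):
whole lines are grouped into inserts of at most 127 bytes, and any single line longer than
127 bytes is split into raw 127-byte pieces that are never indexed.

def pack_insert_lines(lines, max_insert=127):
    # Sketch: returns a list of (chunks, indexable) groups, one group per
    # insert instruction.  'indexable' is False for the pieces of an
    # over-long line, since a partial line can never be matched later.
    groups = []
    cur, cur_len = [], 0
    for line in lines:
        if len(line) > max_insert:
            if cur:
                groups.append((cur, True))
                cur, cur_len = [], 0
            for start in xrange(0, len(line), max_insert):
                groups.append(([line[start:start + max_insert]], False))
        elif cur_len + len(line) > max_insert:
            # Adding this line would overflow, so flush and start over.
            groups.append((cur, True))
            cur, cur_len = [line], len(line)
        else:
            cur.append(line)
            cur_len += len(line)
    if cur:
        groups.append((cur, True))
    return groups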
------------------------------------------------------------
revno: 4300.1.1
revision-id: john at arbash-meinel.com-20090422171845-5dmqokv8ygf3cvs5
parent: pqm at pqm.ubuntu.com-20090420092748-tm2cofylpjauo1nw
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: groupcompress_info
timestamp: Wed 2009-04-22 12:18:45 -0500
message:
Add the ability to convert a gc block into 'human readable' form.
modified:
bzrlib/groupcompress.py groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
bzrlib/tests/test_groupcompress.py test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
=== modified file 'NEWS'
--- a/NEWS 2009-04-20 08:37:32 +0000
+++ b/NEWS 2009-04-21 23:54:16 +0000
@@ -44,6 +44,9 @@
* Non-recursive ``bzr ls`` now works properly when a path is specified.
(Jelmer Vernooij, #357863)
+* Fix a bug in the pure-python ``GroupCompress`` code when handling copies
+ longer than 64KiB. (John Arbash Meinel, #364900)
+
Documentation
*************
=== modified file 'bzrlib/_groupcompress_py.py'
--- a/bzrlib/_groupcompress_py.py 2009-04-09 20:23:07 +0000
+++ b/bzrlib/_groupcompress_py.py 2009-04-23 00:58:30 +0000
@@ -23,6 +23,74 @@
from bzrlib import osutils
+class _OutputHandler(object):
+ """A simple class which just tracks how to split up an insert request."""
+
+ def __init__(self, out_lines, index_lines, min_len_to_index):
+ self.out_lines = out_lines
+ self.index_lines = index_lines
+ self.min_len_to_index = min_len_to_index
+ self.cur_insert_lines = []
+ self.cur_insert_len = 0
+
+ def add_copy(self, start_byte, end_byte):
+ # The data stream allows >64kB in a copy, but to match the compiled
+ # code, we will also limit it to a 64kB copy
+ for start_byte in xrange(start_byte, end_byte, 64*1024):
+ num_bytes = min(64*1024, end_byte - start_byte)
+ copy_bytes = encode_copy_instruction(start_byte, num_bytes)
+ self.out_lines.append(copy_bytes)
+ self.index_lines.append(False)
+
+ def _flush_insert(self):
+ if not self.cur_insert_lines:
+ return
+ if self.cur_insert_len > 127:
+ raise AssertionError('We cannot insert more than 127 bytes'
+ ' at a time.')
+ self.out_lines.append(chr(self.cur_insert_len))
+ self.index_lines.append(False)
+ self.out_lines.extend(self.cur_insert_lines)
+ if self.cur_insert_len < self.min_len_to_index:
+ self.index_lines.extend([False]*len(self.cur_insert_lines))
+ else:
+ self.index_lines.extend([True]*len(self.cur_insert_lines))
+ self.cur_insert_lines = []
+ self.cur_insert_len = 0
+
+ def _insert_long_line(self, line):
+ # Flush out anything pending
+ self._flush_insert()
+ line_len = len(line)
+ for start_index in xrange(0, line_len, 127):
+ next_len = min(127, line_len - start_index)
+ self.out_lines.append(chr(next_len))
+ self.index_lines.append(False)
+ self.out_lines.append(line[start_index:start_index+next_len])
+ # We don't index long lines, because we won't be able to match
+ # a line split across multiple inserts anyway
+ self.index_lines.append(False)
+
+ def add_insert(self, lines):
+ if self.cur_insert_lines != []:
+ raise AssertionError('self.cur_insert_lines must be empty when'
+ ' adding a new insert')
+ for line in lines:
+ if len(line) > 127:
+ self._insert_long_line(line)
+ else:
+ next_len = len(line) + self.cur_insert_len
+ if next_len > 127:
+ # Adding this line would overflow, so flush, and start over
+ self._flush_insert()
+ self.cur_insert_lines = [line]
+ self.cur_insert_len = len(line)
+ else:
+ self.cur_insert_lines.append(line)
+ self.cur_insert_len = next_len
+ self._flush_insert()
+
+
class LinesDeltaIndex(object):
"""This class indexes matches between strings.
@@ -33,6 +101,9 @@
:ivar endpoint: The total number of bytes in self.line_offsets
"""
+ _MIN_MATCH_BYTES = 10
+ _SOFT_MIN_MATCH_BYTES = 200
+
def __init__(self, lines):
self.lines = []
self.line_offsets = []
@@ -50,7 +121,11 @@
for idx, do_index in enumerate(index):
if not do_index:
continue
- matches.setdefault(new_lines[idx], []).append(start_idx + idx)
+ line = new_lines[idx]
+ try:
+ matches[line].add(start_idx + idx)
+ except KeyError:
+ matches[line] = set([start_idx + idx])
def get_matches(self, line):
"""Return the lines which match the line in right."""
@@ -59,7 +134,7 @@
except KeyError:
return None
- def _get_longest_match(self, lines, pos, locations):
+ def _get_longest_match(self, lines, pos):
"""Look at all matches for the current line, return the longest.
:param lines: The lines we are matching against
@@ -74,48 +149,45 @@
"""
range_start = pos
range_len = 0
- copy_ends = None
+ prev_locations = None
max_pos = len(lines)
+ matching = self._matching_lines
while pos < max_pos:
- if locations is None:
- # TODO: is try/except better than get(..., None)?
- try:
- locations = self._matching_lines[lines[pos]]
- except KeyError:
- locations = None
- if locations is None:
+ try:
+ locations = matching[lines[pos]]
+ except KeyError:
# No more matches, just return whatever we have, but we know
# that this last position is not going to match anything
pos += 1
break
+ # We have a match
+ if prev_locations is None:
+ # This is the first match in a range
+ prev_locations = locations
+ range_len = 1
+ locations = None # Consumed
else:
- # We have a match
- if copy_ends is None:
- # This is the first match in a range
- copy_ends = [loc + 1 for loc in locations]
- range_len = 1
+ # We have a match started, compare to see if any of the
+ # current matches can be continued
+ next_locations = locations.intersection([loc + 1 for loc
+ in prev_locations])
+ if next_locations:
+ # At least one of the regions continues to match
+ prev_locations = set(next_locations)
+ range_len += 1
locations = None # Consumed
else:
- # We have a match started, compare to see if any of the
- # current matches can be continued
- next_locations = set(copy_ends).intersection(locations)
- if next_locations:
- # At least one of the regions continues to match
- copy_ends = [loc + 1 for loc in next_locations]
- range_len += 1
- locations = None # Consumed
- else:
- # All current regions no longer match.
- # This line does still match something, just not at the
- # end of the previous matches. We will return locations
- # so that we can avoid another _matching_lines lookup.
- break
+ # All current regions no longer match.
+ # This line does still match something, just not at the
+ # end of the previous matches. We will return locations
+ # so that we can avoid another _matching_lines lookup.
+ break
pos += 1
- if copy_ends is None:
+ if prev_locations is None:
# We have no matches, this is a pure insert
- return None, pos, locations
- return (((min(copy_ends) - range_len, range_start, range_len)),
- pos, locations)
+ return None, pos
+ smallest = min(prev_locations)
+ return (smallest - range_len + 1, range_start, range_len), pos
def get_matching_blocks(self, lines, soft=False):
"""Return the ranges in lines which match self.lines.
@@ -133,15 +205,13 @@
# instructions.
result = []
pos = 0
- locations = None
max_pos = len(lines)
result_append = result.append
- min_match_bytes = 10
+ min_match_bytes = self._MIN_MATCH_BYTES
if soft:
- min_match_bytes = 200
+ min_match_bytes = self._SOFT_MIN_MATCH_BYTES
while pos < max_pos:
- block, pos, locations = self._get_longest_match(lines, pos,
- locations)
+ block, pos = self._get_longest_match(lines, pos)
if block is not None:
# Check to see if we match fewer than min_match_bytes. As we
# will turn this into a pure 'insert', rather than a copy.
@@ -178,38 +248,6 @@
' got out of sync with the line counter.')
self.endpoint = endpoint
- def _flush_insert(self, start_linenum, end_linenum,
- new_lines, out_lines, index_lines):
- """Add an 'insert' request to the data stream."""
- bytes_to_insert = ''.join(new_lines[start_linenum:end_linenum])
- insert_length = len(bytes_to_insert)
- # Each insert instruction is at most 127 bytes long
- for start_byte in xrange(0, insert_length, 127):
- insert_count = min(insert_length - start_byte, 127)
- out_lines.append(chr(insert_count))
- # Don't index the 'insert' instruction
- index_lines.append(False)
- insert = bytes_to_insert[start_byte:start_byte+insert_count]
- as_lines = osutils.split_lines(insert)
- out_lines.extend(as_lines)
- index_lines.extend([True]*len(as_lines))
-
- def _flush_copy(self, old_start_linenum, num_lines,
- out_lines, index_lines):
- if old_start_linenum == 0:
- first_byte = 0
- else:
- first_byte = self.line_offsets[old_start_linenum - 1]
- stop_byte = self.line_offsets[old_start_linenum + num_lines - 1]
- num_bytes = stop_byte - first_byte
- # The data stream allows >64kB in a copy, but to match the compiled
- # code, we will also limit it to a 64kB copy
- for start_byte in xrange(first_byte, stop_byte, 64*1024):
- num_bytes = min(64*1024, stop_byte - first_byte)
- copy_bytes = encode_copy_instruction(start_byte, num_bytes)
- out_lines.append(copy_bytes)
- index_lines.append(False)
-
def make_delta(self, new_lines, bytes_length=None, soft=False):
"""Compute the delta for this content versus the original content."""
if bytes_length is None:
@@ -217,6 +255,8 @@
# reserved for content type, content length
out_lines = ['', '', encode_base128_int(bytes_length)]
index_lines = [False, False, False]
+ output_handler = _OutputHandler(out_lines, index_lines,
+ self._MIN_MATCH_BYTES)
blocks = self.get_matching_blocks(new_lines, soft=soft)
current_line_num = 0
# We either copy a range (while there are reusable lines) or we
@@ -224,11 +264,16 @@
for old_start, new_start, range_len in blocks:
if new_start != current_line_num:
# non-matching region, insert the content
- self._flush_insert(current_line_num, new_start,
- new_lines, out_lines, index_lines)
+ output_handler.add_insert(new_lines[current_line_num:new_start])
current_line_num = new_start + range_len
if range_len:
- self._flush_copy(old_start, range_len, out_lines, index_lines)
+ # Convert the line based offsets into byte based offsets
+ if old_start == 0:
+ first_byte = 0
+ else:
+ first_byte = self.line_offsets[old_start - 1]
+ last_byte = self.line_offsets[old_start + range_len - 1]
+ output_handler.add_copy(first_byte, last_byte)
return out_lines, index_lines
@@ -271,9 +316,7 @@
copy_bytes.append(chr(base_byte))
offset >>= 8
if length is None:
- # None is used by the test suite
- copy_bytes[0] = chr(copy_command)
- return ''.join(copy_bytes)
+ raise ValueError("cannot supply a length of None")
if length > 0x10000:
raise ValueError("we don't emit copy records for lengths > 64KiB")
if length == 0:
@@ -337,7 +380,6 @@
def make_delta(source_bytes, target_bytes):
"""Create a delta from source to target."""
- # TODO: The checks below may not be a the right place yet.
if type(source_bytes) is not str:
raise TypeError('source is not a str')
if type(target_bytes) is not str:
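The delta header written by make_delta stores the uncompressed length via
encode_base128_int. Judging from the test vector further down (the '\xdc\x86\x0a' prefix
in test_make_delta_with_large_copies, which under this scheme decodes to 164700 and is
consistent with the 64kB + 64kB + 33628-byte copies that follow it), this is the usual
little-endian base-128 varint; a hedged sketch of the pair of helpers, assuming that
format:

def encode_base128_int(val):
    # Sketch: little-endian 7-bit groups, high bit set on all but the last.
    out = []
    while val >= 0x80:
        out.append(chr((val & 0x7f) | 0x80))
        val >>= 7
    out.append(chr(val))
    return ''.join(out)

def decode_base128_int(data):
    # Sketch: returns (value, number_of_bytes_consumed).
    val, shift, offset = 0, 0, 0
    while True:
        byte = ord(data[offset])
        offset += 1
        val |= (byte & 0x7f) << shift
        shift += 7
        if not byte & 0x80:
            return val, offset

# decode_base128_int('\xdc\x86\x0a') == (164700, 3)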
=== modified file 'bzrlib/groupcompress.py'
--- a/bzrlib/groupcompress.py 2009-04-20 08:37:32 +0000
+++ b/bzrlib/groupcompress.py 2009-04-22 17:18:45 +0000
@@ -299,6 +299,66 @@
]
return ''.join(chunks)
+ def _dump(self, include_text=False):
+ """Take this block, and spit out a human-readable structure.
+
+ :param include_text: Inserts also include text bits; choose whether you
+ want this displayed in the dump or not.
+ :return: A dump of the given block. The layout is something like:
+ [('f', length), ('d', delta_length, text_length, [delta_info])]
+ delta_info := [('i', num_bytes, text), ('c', offset, num_bytes),
+ ...]
+ """
+ self._ensure_content()
+ result = []
+ pos = 0
+ while pos < self._content_length:
+ kind = self._content[pos]
+ pos += 1
+ if kind not in ('f', 'd'):
+ raise ValueError('invalid kind character: %r' % (kind,))
+ content_len, len_len = decode_base128_int(
+ self._content[pos:pos + 5])
+ pos += len_len
+ if content_len + pos > self._content_length:
+ raise ValueError('invalid content_len %d for record @ pos %d'
+ % (content_len, pos - len_len - 1))
+ if kind == 'f': # Fulltext
+ result.append(('f', content_len))
+ elif kind == 'd': # Delta
+ delta_content = self._content[pos:pos+content_len]
+ delta_info = []
+ # The first entry in a delta is the decompressed length
+ decomp_len, delta_pos = decode_base128_int(delta_content)
+ result.append(('d', content_len, decomp_len, delta_info))
+ measured_len = 0
+ while delta_pos < content_len:
+ c = ord(delta_content[delta_pos])
+ delta_pos += 1
+ if c & 0x80: # Copy
+ (offset, length,
+ delta_pos) = decode_copy_instruction(delta_content, c,
+ delta_pos)
+ delta_info.append(('c', offset, length))
+ measured_len += length
+ else: # Insert
+ if include_text:
+ txt = delta_content[delta_pos:delta_pos+c]
+ else:
+ txt = ''
+ delta_info.append(('i', c, txt))
+ measured_len += c
+ delta_pos += c
+ if delta_pos != content_len:
+ raise ValueError('Delta consumed a bad number of bytes:'
+ ' %d != %d' % (delta_pos, content_len))
+ if measured_len != decomp_len:
+ raise ValueError('Delta claimed fulltext was %d bytes, but'
+ ' extraction resulted in %d bytes'
+ % (decomp_len, measured_len))
+ pos += content_len
+ return result
+
class _LazyGroupCompressFactory(object):
"""Yield content from a GroupCompressBlock on demand."""
@@ -1661,6 +1721,7 @@
apply_delta_to_source,
encode_base128_int,
decode_base128_int,
+ decode_copy_instruction,
LinesDeltaIndex,
)
try:
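A hedged usage sketch for the new _dump() method, assuming 'block' is a
GroupCompressBlock built the way the tests build one (their make_block() helper); it
simply walks the structure described in the docstring above:

# 'block' is assumed to be a GroupCompressBlock, e.g. from the make_block()
# helper used in test_groupcompress.py.
for record in block._dump(include_text=True):
    if record[0] == 'f':
        # ('f', length): a fulltext record
        print 'fulltext: %d bytes' % record[1]
    else:
        # ('d', delta_length, text_length, [delta_info])
        print 'delta: %d bytes expanding to %d bytes' % (record[1], record[2])
        for instruction in record[3]:
            if instruction[0] == 'c':
                # ('c', offset, num_bytes)
                print '  copy   %d bytes from offset %d' % (
                    instruction[2], instruction[1])
            else:
                # ('i', num_bytes, text)
                print '  insert %d bytes: %r' % (instruction[1], instruction[2])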
=== modified file 'bzrlib/tests/test__groupcompress.py'
--- a/bzrlib/tests/test__groupcompress.py 2009-04-09 20:23:07 +0000
+++ b/bzrlib/tests/test__groupcompress.py 2009-04-21 23:54:16 +0000
@@ -186,6 +186,19 @@
'N\x90\x1d\x1ewhich is meant to differ from\n\x91:\x13',
delta)
+ def test_make_delta_with_large_copies(self):
+ # We want to have a copy that is larger than 64kB, which forces us to
+ # issue multiple copy instructions.
+ big_text = _text3 * 1220
+ delta = self.make_delta(big_text, big_text)
+ self.assertDeltaIn(
+ '\xdc\x86\x0a' # Encoding the length of the uncompressed text
+ '\x80' # Copy 64kB, starting at byte 0
+ '\x84\x01' # and another 64kB starting at 64kB
+ '\xb4\x02\x5c\x83', # And the bit of tail.
+ None, # Both implementations should be identical
+ delta)
+
def test_apply_delta_is_typesafe(self):
self.apply_delta(_text1, 'M\x90M')
self.assertRaises(TypeError, self.apply_delta, object(), 'M\x90M')
@@ -358,18 +371,18 @@
self.assertEqual((exp_offset, exp_length, exp_newpos), out)
def test_encode_no_length(self):
- self.assertEncode('\x80', 0, None)
- self.assertEncode('\x81\x01', 1, None)
- self.assertEncode('\x81\x0a', 10, None)
- self.assertEncode('\x81\xff', 255, None)
- self.assertEncode('\x82\x01', 256, None)
- self.assertEncode('\x83\x01\x01', 257, None)
- self.assertEncode('\x8F\xff\xff\xff\xff', 0xFFFFFFFF, None)
- self.assertEncode('\x8E\xff\xff\xff', 0xFFFFFF00, None)
- self.assertEncode('\x8D\xff\xff\xff', 0xFFFF00FF, None)
- self.assertEncode('\x8B\xff\xff\xff', 0xFF00FFFF, None)
- self.assertEncode('\x87\xff\xff\xff', 0x00FFFFFF, None)
- self.assertEncode('\x8F\x04\x03\x02\x01', 0x01020304, None)
+ self.assertEncode('\x80', 0, 64*1024)
+ self.assertEncode('\x81\x01', 1, 64*1024)
+ self.assertEncode('\x81\x0a', 10, 64*1024)
+ self.assertEncode('\x81\xff', 255, 64*1024)
+ self.assertEncode('\x82\x01', 256, 64*1024)
+ self.assertEncode('\x83\x01\x01', 257, 64*1024)
+ self.assertEncode('\x8F\xff\xff\xff\xff', 0xFFFFFFFF, 64*1024)
+ self.assertEncode('\x8E\xff\xff\xff', 0xFFFFFF00, 64*1024)
+ self.assertEncode('\x8D\xff\xff\xff', 0xFFFF00FF, 64*1024)
+ self.assertEncode('\x8B\xff\xff\xff', 0xFF00FFFF, 64*1024)
+ self.assertEncode('\x87\xff\xff\xff', 0x00FFFFFF, 64*1024)
+ self.assertEncode('\x8F\x04\x03\x02\x01', 0x01020304, 64*1024)
def test_encode_no_offset(self):
self.assertEncode('\x90\x01', 0, 1)
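For the large-copy test above, the expected bytes can be reproduced from the 64kB
splitting loop in _OutputHandler.add_copy; a sketch reusing the encode_copy_instruction
sketch from earlier, and assuming the 164700-byte total implied by the '\xdc\x86\x0a'
length prefix:

out = []
start_byte, end_byte = 0, 164700   # total length implied by the base128 prefix
for start in xrange(start_byte, end_byte, 64 * 1024):
    num_bytes = min(64 * 1024, end_byte - start)
    out.append(encode_copy_instruction(start, num_bytes))
# out should be ['\x80', '\x84\x01', '\xb4\x02\x5c\x83'], matching the copy
# instructions asserted in test_make_delta_with_large_copies.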
=== modified file 'bzrlib/tests/test_groupcompress.py'
--- a/bzrlib/tests/test_groupcompress.py 2009-04-20 08:37:32 +0000
+++ b/bzrlib/tests/test_groupcompress.py 2009-04-22 17:18:45 +0000
@@ -447,6 +447,18 @@
# And the decompressor is finalized
self.assertIs(None, block._z_content_decompressor)
+ def test__dump(self):
+ dup_content = 'some duplicate content\nwhich is sufficiently long\n'
+ key_to_text = {('1',): dup_content + '1 unique\n',
+ ('2',): dup_content + '2 extra special\n'}
+ locs, block = self.make_block(key_to_text)
+ self.assertEqual([('f', len(key_to_text[('1',)])),
+ ('d', 21, len(key_to_text[('2',)]),
+ [('c', 2, len(dup_content)),
+ ('i', len('2 extra special\n'), '')
+ ]),
+ ], block._dump())
+
class TestCaseWithGroupCompressVersionedFiles(tests.TestCaseWithTransport):