Rev 3893: Merge the updates to the groupcompress DeltaIndex. in http://bazaar.launchpad.net/%7Ebzr/bzr/brisbane-core

John Arbash Meinel john at arbash-meinel.com
Fri Mar 20 03:23:08 GMT 2009


At http://bazaar.launchpad.net/%7Ebzr/bzr/brisbane-core

------------------------------------------------------------
revno: 3893
revision-id: john at arbash-meinel.com-20090320032107-bm9wg421rtcacy5i
parent: john at arbash-meinel.com-20090320031652-jjy97n2zsjq1ouxp
parent: john at arbash-meinel.com-20090319233050-tf8ah6zasmeaetr0
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: brisbane-core
timestamp: Thu 2009-03-19 22:21:07 -0500
message:
  Merge the updates to the groupcompress DeltaIndex.
modified:
  bzrlib/delta.h                 delta.h-20090227173129-qsu3u43vowf1q3ay-1
  bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
  bzrlib/tests/test__groupcompress_pyx.py test__groupcompress_-20080724145854-koifwb7749cfzrvj-1
    ------------------------------------------------------------
    revno: 3888.1.18
    revision-id: john at arbash-meinel.com-20090319233050-tf8ah6zasmeaetr0
    parent: john at arbash-meinel.com-20090319145132-e7eu3p75btuidhu2
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Thu 2009-03-19 18:30:50 -0500
    message:
      *grow* the local hmask if it is smaller than expected, don't *shrink* it
      (a sketch of the grow-only sizing follows below).
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
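      The check itself is small: the freshly computed hsize/hmask is bumped up to
      the old index's width whenever it would come out smaller, as in the sizing
      code in the diff below. A minimal standalone sketch, where delta_index_stub
      and pick_hsize are illustrative stand-ins rather than the real bzrlib names:

      #include <assert.h>

      /* Illustrative stand-in for the real struct delta_index. */
      struct delta_index_stub {
          unsigned int hash_mask;   /* hsize - 1, always one less than a power of two */
      };

      /* Pick a hash size for total_num_entries, but never come out narrower than
       * an existing index: the packing code copes with growing, not shrinking. */
      static unsigned int
      pick_hsize(unsigned int total_num_entries, const struct delta_index_stub *old)
      {
          unsigned int i, hsize, hmask;

          hsize = total_num_entries / 4;
          for (i = 4; (1u << i) < hsize && i < 31; i++)
              ;                        /* round up to a power of two, at least 16 */
          hsize = 1u << i;
          hmask = hsize - 1;
          if (old != NULL && old->hash_mask > hmask) {
              /* Grow to the old index's width rather than shrinking below it. */
              hmask = old->hash_mask;
              hsize = hmask + 1;
          }
          assert(hmask == hsize - 1);  /* mask and size stay consistent */
          return hsize;
      }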
    ------------------------------------------------------------
    revno: 3888.1.17
    revision-id: john at arbash-meinel.com-20090319145132-e7eu3p75btuidhu2
    parent: john at arbash-meinel.com-20090319144153-y4m58rs011omd0g3
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Thu 2009-03-19 09:51:32 -0500
    message:
      Remove an invalid assert.
      
      The assert was only valid if we grew by at most 1 level;
      with the new constraint of forcing the hsize, we can
      grow by a lot more than that (e.g. 0xf => 0xff).
      Rather than writing a new assert that fits all cases, just
      remove it.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.16
    revision-id: john at arbash-meinel.com-20090319144153-y4m58rs011omd0g3
    parent: john at arbash-meinel.com-20090319062205-cy7f49htv3vet8g2
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Thu 2009-03-19 09:41:53 -0500
    message:
      Handle when our current packing is sub-optimal.
      
      It sometimes happens that our estimated hsize is too big, so
      the next estimate tries to shrink it. However, code
      like pack_delta_index only copes with growing, and that is
      honestly all we really care about.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.15
    revision-id: john at arbash-meinel.com-20090319062205-cy7f49htv3vet8g2
    parent: john at arbash-meinel.com-20090319061549-fzk7na4uczpev2iy
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Thu 2009-03-19 01:22:05 -0500
    message:
      Remove the noisy debugging code. (down to 23.1s)
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.14
    revision-id: john at arbash-meinel.com-20090319061549-fzk7na4uczpev2iy
    parent: john at arbash-meinel.com-20090319061002-bcf6ikop39ap4s6w
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Thu 2009-03-19 01:15:49 -0500
    message:
      Simplify the code a bit. We don't repack as often, so go with the
      more obvious code rather than trying tricks with memcpy()
      (it didn't seem to really help, anyway).
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.13
    revision-id: john at arbash-meinel.com-20090319061002-bcf6ikop39ap4s6w
    parent: john at arbash-meinel.com-20090319060153-p0e7qedqdq9vd8iq
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Thu 2009-03-19 01:10:02 -0500
    message:
      Tweak the number of blank slots up just a tad.
      It seems that setting it to 8 doesn't yield a net gain. There is a slight
      improvement in the number of readjustments done, but that is counteracted
      by the make_delta time.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.12
    revision-id: john at arbash-meinel.com-20090319060153-p0e7qedqdq9vd8iq
    parent: john at arbash-meinel.com-20090319044457-py567xqq3nyzwau4
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Thu 2009-03-19 01:01:53 -0500
    message:
      Now we are able to weave 'add_source' into the existing index.
      This brings 'bzr pack' time down to ~23.6s (with debugging on).
      According to lsprof, the overall time for 'add_delta_source' dropped from ~5s
      down to about 300ms, and the time for 'add_source' has now dropped
      5s -> 3.3s -> 1.6s.
      The next step is probably to bump the number of free slots.
      (A sketch of how new entries get woven into the existing free slots
      follows below.)
    modified:
      bzrlib/delta.h                 delta.h-20090227173129-qsu3u43vowf1q3ay-1
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
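      The weave works by walking back from the end of the target hash bucket over
      its trailing NULL padding and dropping the new entry into the first free
      slot, as in the create_delta_index_from_delta hunk in the diff below. A
      rough sketch with simplified stand-in types (entry and weave_entry are
      illustrative names, not the real diff-delta.c code):

      #include <stddef.h>

      /* Illustrative stand-ins; hash[i]..hash[i+1] delimits bucket i of the packed
       * index, and unused trailing slots in a bucket have ptr == NULL. */
      struct entry { const unsigned char *ptr; unsigned int val; };

      /* Try to drop one new entry into the free space at the end of its bucket.
       * Returns 1 on success, 0 if the bucket is full (caller must then repack). */
      static int
      weave_entry(struct entry **hash, unsigned int hash_mask,
                  const struct entry *new_entry)
      {
          unsigned int i = new_entry->val & hash_mask;
          struct entry *slot = hash[i + 1];

          slot--;                         /* walk back over the NULL padding... */
          while (slot >= hash[i] && slot->ptr == NULL)
              slot--;
          slot++;                         /* ...onto the first free slot, if any */

          if (slot >= hash[i + 1] || slot->ptr != NULL)
              return 0;                   /* no room left in this bucket */
          *slot = *new_entry;             /* in place, no reallocation needed */
          return 1;
      }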
    ------------------------------------------------------------
    revno: 3888.1.11
    revision-id: john at arbash-meinel.com-20090319044457-py567xqq3nyzwau4
    parent: john at arbash-meinel.com-20090319040218-pqex5298ifl1ds54
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 23:44:57 -0500
    message:
      Shave 5s->3.3s in add_source by copying old entries across directly.
      This works quite a bit better than copying them over into the hash
      and then copying them *back* into the final packed form.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.10
    revision-id: john at arbash-meinel.com-20090319040218-pqex5298ifl1ds54
    parent: john at arbash-meinel.com-20090319035245-54xrfw2ztsjzbdo3
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 23:02:18 -0500
    message:
      Fix a bug when there is more than one entry (increment out_entry).
      Also make mini_hsize always the same size as hsize;
      otherwise we end up walking the same nodes over and over again.
      This way, we only walk nodes when we are going to insert them.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.9
    revision-id: john at arbash-meinel.com-20090319035245-54xrfw2ztsjzbdo3
    parent: john at arbash-meinel.com-20090318224524-ve32it3ddqfzvi6q
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 22:52:45 -0500
    message:
      When expanding an index, put the new entries into a hash.
      
      Rather than iterating all of the new entries for every hash bucket,
      create a separate mini hash and pull them out of there (a sketch
      follows below).
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
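      A rough sketch of the idea, mirroring _put_entries_into_hash from the diff
      below but with simplified stand-in types (entry, entry_node and
      bucket_new_entries are illustrative names):

      #include <stdlib.h>
      #include <string.h>

      /* Illustrative stand-ins for the real bzrlib structures. */
      struct entry { const unsigned char *ptr; unsigned int val; };
      struct entry_node { const struct entry *p_entry; struct entry_node *next; };

      /* Group the new entries by (val & hmask) so the repack loop only visits the
       * entries that belong to the bucket it is currently emitting, instead of
       * scanning the whole new-entry array once per bucket. */
      static struct entry_node **
      bucket_new_entries(const struct entry *entries, unsigned int num_entries,
                         unsigned int hsize)
      {
          unsigned int hmask = hsize - 1, i;
          struct entry_node **hash, *node;
          void *mem;

          /* One allocation holds the bucket heads followed by the list nodes, so
           * the caller releases everything with a single free(hash). */
          mem = malloc(sizeof(*hash) * hsize + sizeof(*node) * num_entries);
          if (mem == NULL)
              return NULL;
          hash = mem;
          node = (struct entry_node *)(hash + hsize);
          memset(hash, 0, sizeof(*hash) * hsize);

          /* Walk backwards and push to the front, so each bucket's list ends up
           * in file order. */
          for (i = num_entries; i-- > 0; ) {
              unsigned int bucket = entries[i].val & hmask;
              node->p_entry = &entries[i];
              node->next = hash[bucket];
              hash[bucket] = node++;
          }
          return hash;
      }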
    ------------------------------------------------------------
    revno: 3888.1.8
    revision-id: john at arbash-meinel.com-20090318224524-ve32it3ddqfzvi6q
    parent: john at arbash-meinel.com-20090318224055-hs7nxyq5fjz0pr19
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 17:45:24 -0500
    message:
      Reverted back to the same hash width, and bumped EXTRA_NULLS to 3.
      Most entries in a hash bucket are genuinely random, so they don't trigger
      extra comparisons, and walking 4-7 nodes is fairly cheap at that level.
      My guess is that bumping EXTRA_NULLS has a bigger effect when you hit the
      occasional non-random data that forces an expansion because it gets a
      collision.
      Data with a repetition period that is a multiple of 16 (but not 16 itself)
      will cause this, as you can get a large insertion with lots of dupes.
      We already filter out dupes whose period is exactly 16 (a sketch of that
      filter follows below); we may want to do something similar at larger ranges
      (or use limit_hash_buckets on the data, possibly with a much smaller value
      than 64).
      The most important next step is to handle the large update case.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
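      The existing filter is just the "only keep the first of consecutive data"
      check on the Rabin value. A tiny sketch of that behaviour over a list of
      precomputed window values (count_indexed is an illustrative name, not part
      of diff-delta.c):

      /* Only the first of a run of identical Rabin values gets indexed; when the
       * input repeats with a period of exactly RABIN_WINDOW (16) bytes, every
       * sampled window hashes to the same value and all but one are dropped. */
      static unsigned int
      count_indexed(const unsigned int *vals, unsigned int n)
      {
          unsigned int i, kept = 0, prev_val = ~0u;

          for (i = 0; i < n; i++) {
              if (vals[i] != prev_val) {  /* keep only the first of consecutive dupes */
                  prev_val = vals[i];
                  kept++;
              }
          }
          return kept;
      }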
    ------------------------------------------------------------
    revno: 3888.1.7
    revision-id: john at arbash-meinel.com-20090318224055-hs7nxyq5fjz0pr19
    parent: john at arbash-meinel.com-20090318223201-rpttk4ur07xar0z5
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 17:40:55 -0500
    message:
      A different attempt, which I thought would give similar results but doesn't:
      change it back to EXTRA_NULLS=1 and make the hash map twice as wide.
      This results in 9.2k inserts, 1.3k expands, and overall 3m58s.
      Also, the assumption of 'not many added' is patently false.
      The biggest I've seen now is 2.2k insertions, and a 2.2k*4k index is going to
      be stupid-painful.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.6
    revision-id: john at arbash-meinel.com-20090318223201-rpttk4ur07xar0z5
    parent: john at arbash-meinel.com-20090318220835-hfmreyshsh78l8pf
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 17:32:01 -0500
    message:
      Increasing EXTRA_NULLS from 1 to 2 ups the hit rate
      9.1k => 10k without expanding, and drops expansions 1407 => 433.
      That drops the overall time 3m45s => 3m40s.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.5
    revision-id: john at arbash-meinel.com-20090318220835-hfmreyshsh78l8pf
    parent: john at arbash-meinel.com-20090318215056-dzx4j8ym5yhwh67b
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 17:08:35 -0500
    message:
      Restoring the debugging for now.
      Only indexing insert commands that are > RABIN_WINDOW + 3 bytes reduces the
      total number of resizes from 3.1k to 1.4k, and reduces overall
      matches from 9.6k => 9.1k. Those short commands were flooding
      the hash map, because they get repeated and always hit the same
      hash bucket. (A sketch of the filtered indexing loop follows below.)
      
      That said, this seems overall slower than the old code; my
      guess is the O(MN) behavior of the resize loop. Time to put
      the new data into its own hash. :)
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
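      A rough sketch of the filtered indexing loop, following the hunk in
      create_delta_index_from_delta in the diff below; the record callback and the
      simplified hash update are placeholders for the real Rabin T[]-based
      computation and index_entry bookkeeping:

      #define RABIN_WINDOW 16

      /* Sketch of indexing one insert command of 'cmd' bytes starting at 'data'.
       * Windows are only sampled while more than RABIN_WINDOW + 3 bytes remain,
       * because create_delta needs a match of at least 4 bytes past the window
       * before it will emit a copy; shorter tails could never match anyway. */
      static unsigned int
      index_insert(const unsigned char *data, unsigned int cmd,
                   void (*record)(const unsigned char *ptr, unsigned int val))
      {
          unsigned int sampled = 0;

          for (; cmd > RABIN_WINDOW + 3; cmd -= RABIN_WINDOW, data += RABIN_WINDOW) {
              unsigned int val = 0, i;
              for (i = 1; i <= RABIN_WINDOW; i++)
                  val = (val << 8) | data[i];   /* placeholder for the T[] update */
              record(data + RABIN_WINDOW, val);
              sampled++;
          }
          return sampled;
      }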
    ------------------------------------------------------------
    revno: 3888.1.4
    revision-id: john at arbash-meinel.com-20090318215056-dzx4j8ym5yhwh67b
    parent: john at arbash-meinel.com-20090318174524-djvr0t7qwyca10jn
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 16:50:56 -0500
    message:
      The new layout is working.
      
      Commenting out the debug info for now.
      What I'm finding is a surprising number of repeated strings:
      basically, common strings of length < 20, which end up
      indexed by the RABIN code but don't get copied in the output
      (because RABIN is a 16-byte match, while the copy command has
      a minimum size of 20 bytes). Perhaps we need to change the
      code so that it doesn't try to index inserts of < 20 characters,
      or change the copy code so that it allows shorter copies.
    modified:
      bzrlib/delta.h                 delta.h-20090227173129-qsu3u43vowf1q3ay-1
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
      bzrlib/tests/test__groupcompress_pyx.py test__groupcompress_-20080724145854-koifwb7749cfzrvj-1
    ------------------------------------------------------------
    revno: 3888.1.3
    revision-id: john at arbash-meinel.com-20090318174524-djvr0t7qwyca10jn
    parent: john at arbash-meinel.com-20090318171041-pccbt1pvfq4doe4i
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 12:45:24 -0500
    message:
      include_entries_from_hash wasn't properly skipping NULL records.
      
      Now the tests pass again, and we can look at bringing back a simpler
      create_delta_index_from_delta.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.2
    revision-id: john at arbash-meinel.com-20090318171041-pccbt1pvfq4doe4i
    parent: john at arbash-meinel.com-20090317222027-tko3r2lzg6blf96y
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Wed 2009-03-18 12:10:41 -0500
    message:
      Revert some of the previous work.
      The tests start failing if we insert extra null spaces, so get back to a
      point where they are passing and work from there.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
    ------------------------------------------------------------
    revno: 3888.1.1
    revision-id: john at arbash-meinel.com-20090317222027-tko3r2lzg6blf96y
    parent: john at arbash-meinel.com-20090317201340-amjnj1wl78iwcxae
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: gc_delta_index_room
    timestamp: Tue 2009-03-17 17:20:27 -0500
    message:
      (broken, in progress) change the data structures to allow for inserting small deltas.
      By adding 2 blank slots per hash bucket, we can normally update the structure without
      having to resize the whole thing. (A sketch of the packed layout with spare
      slots follows below.)
      We'll probably want to tune how many extra slots to provide.
      The general work is probably good, but I need to finish handling the case when we
      really *do* need to resize the structure.
    modified:
      bzrlib/diff-delta.c            diffdelta.c-20090226042143-l9wzxynyuxnb5hus-1
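      A minimal sketch of the packed layout with spare slots; this first cut used
      2 blank slots per bucket, while the merged code settles on EXTRA_NULLS 4.
      The types and pack_with_spare_room are simplified stand-ins for the real
      pack_delta_index:

      #include <stdlib.h>

      #define EXTRA_NULLS 2   /* this first cut; the merged code uses 4 */

      /* Illustrative stand-ins for the real structures. */
      struct entry { const unsigned char *ptr; unsigned int val; };
      struct node  { struct entry entry; struct node *next; };  /* unpacked bucket */

      /* Pack linked-list buckets into one flat array, leaving EXTRA_NULLS empty
       * (ptr == NULL) slots at the end of every bucket, so a slightly larger
       * source can usually be woven in later without reallocating anything.
       * On return, (*hash_out)[i]..(*hash_out)[i+1] delimits bucket i (the last
       * pointer is a sentinel); the caller frees both the returned array and
       * *hash_out. */
      static struct entry *
      pack_with_spare_room(struct node **buckets, unsigned int hsize,
                           unsigned int num_entries, struct entry ***hash_out)
      {
          struct entry null_entry = {0}, *packed, *p;
          struct entry **hash;
          unsigned int i, j;

          hash = malloc(sizeof(*hash) * (hsize + 1));
          packed = malloc(sizeof(*packed) * (num_entries + hsize * EXTRA_NULLS));
          if (hash == NULL || packed == NULL) {
              free(hash);
              free(packed);
              return NULL;
          }
          p = packed;
          for (i = 0; i < hsize; i++) {
              struct node *n;
              hash[i] = p;
              for (n = buckets[i]; n != NULL; n = n->next)
                  *p++ = n->entry;                /* real entries first */
              for (j = 0; j < EXTRA_NULLS; j++)
                  *p++ = null_entry;              /* then the spare slots */
          }
          hash[hsize] = p;   /* sentinel marking the end of the last bucket */
          *hash_out = hash;
          return packed;
      }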
-------------- next part --------------
=== modified file 'bzrlib/delta.h'
--- a/bzrlib/delta.h	2009-03-11 06:26:22 +0000
+++ b/bzrlib/delta.h	2009-03-19 06:01:53 +0000
@@ -32,7 +32,7 @@
  */
 extern struct delta_index *
 create_delta_index(const struct source_info *src,
-                   const struct delta_index *old);
+                   struct delta_index *old);
 
 
 /*
@@ -47,7 +47,7 @@
  */
 extern struct delta_index *
 create_delta_index_from_delta(const struct source_info *delta,
-                              const struct delta_index *old);
+                              struct delta_index *old);
 /*
  * free_delta_index: free the index created by create_delta_index()
  *

=== modified file 'bzrlib/diff-delta.c'
--- a/bzrlib/diff-delta.c	2009-03-11 06:50:59 +0000
+++ b/bzrlib/diff-delta.c	2009-03-19 23:30:50 +0000
@@ -12,6 +12,8 @@
  * published by the Free Software Foundation.
  */
 
+#include <stdio.h>
+
 #include "delta.h"
 #include <stdlib.h>
 #include <string.h>
@@ -23,6 +25,12 @@
 #define RABIN_SHIFT 23
 #define RABIN_WINDOW 16
 
+/* The hash map is sized to put 4 entries per bucket, this gives us ~even room
+ * for more data. Tweaking this number above 4 doesn't seem to help much,
+ * anyway.
+ */
+#define EXTRA_NULLS 4
+
 static const unsigned int T[256] = {
     0x00000000, 0xab59b4d1, 0x56b369a2, 0xfdeadd73, 0x063f6795, 0xad66d344,
     0x508c0e37, 0xfbd5bae6, 0x0c7ecf2a, 0xa7277bfb, 0x5acda688, 0xf1941259,
@@ -121,6 +129,11 @@
     unsigned int val;
 };
 
+struct index_entry_linked_list {
+    struct index_entry *p_entry;
+    struct index_entry_linked_list *next;
+};
+
 struct unpacked_index_entry {
     struct index_entry entry;
     struct unpacked_index_entry *next;
@@ -197,20 +210,86 @@
 
 static struct delta_index *
 pack_delta_index(struct unpacked_index_entry **hash, unsigned int hsize,
-                 unsigned int num_entries)
+                 unsigned int num_entries, struct delta_index *old_index)
 {
-    unsigned int i, memsize;
+    unsigned int i, j, hmask, memsize, fit_in_old, copied_count;
     struct unpacked_index_entry *entry;
     struct delta_index *index;
-    struct index_entry *packed_entry, **packed_hash;
+    struct index_entry *packed_entry, **packed_hash, *old_entry, *copy_from;
+    struct index_entry null_entry = {0};
     void *mem;
+
+    hmask = hsize - 1;
+
+    // if (old_index) {
+    //     fprintf(stderr, "Packing %d entries into %d for total of %d entries"
+    //                     " %x => %x\n",
+    //                     num_entries - old_index->num_entries,
+    //                     old_index->num_entries, num_entries,
+    //                     old_index->hash_mask, hmask);
+    // } else {
+    //     fprintf(stderr, "Packing %d entries into a new index\n",
+    //                     num_entries);
+    // }
+    /* First, see if we can squeeze the new items into the existing structure.
+     */
+    fit_in_old = 0;
+    copied_count = 0;
+    if (old_index && old_index->hash_mask == hmask) {
+        fit_in_old = 1;
+        for (i = 0; i < hsize; ++i) {
+            packed_entry = NULL;
+            for (entry = hash[i]; entry; entry = entry->next) {
+                if (packed_entry == NULL) {
+                    /* Find the last open spot */
+                    packed_entry = old_index->hash[i + 1];
+                    --packed_entry;
+                    while (packed_entry >= old_index->hash[i]
+                           && packed_entry->ptr == NULL) {
+                        --packed_entry;
+                    }
+                    ++packed_entry;
+                }
+                if (packed_entry >= old_index->hash[i+1]
+                    || packed_entry->ptr != NULL) {
+                    /* There are no free spots here :( */
+                    fit_in_old = 0;
+                    break;
+                }
+                /* We found an empty spot to put this entry
+                 * Copy it over, and remove it from the linked list, just in
+                 * case we end up running out of room later.
+                 */
+                *packed_entry++ = entry->entry;
+                assert(entry == hash[i]);
+                hash[i] = entry->next;
+                copied_count += 1;
+                old_index->num_entries++;
+            }
+            if (!fit_in_old) {
+                break;
+            }
+        }
+    }
+    if (old_index) {
+        if (fit_in_old) {
+            // fprintf(stderr, "Fit all %d entries into old index\n",
+            //                 copied_count);
+            /* No need to allocate a new buffer */
+            return NULL;
+        } else {
+            // fprintf(stderr, "Fit only %d entries into old index,"
+            //                 " reallocating\n", copied_count);
+        }
+    }
     /*
      * Now create the packed index in array form
      * rather than linked lists.
+     * Leave a 2-entry gap for inserting more entries between the groups
      */
     memsize = sizeof(*index)
         + sizeof(*packed_hash) * (hsize+1)
-        + sizeof(*packed_entry) * num_entries;
+        + sizeof(*packed_entry) * (num_entries + hsize * EXTRA_NULLS);
     mem = malloc(memsize);
     if (!mem) {
         return NULL;
@@ -218,161 +297,77 @@
 
     index = mem;
     index->memsize = memsize;
-    index->hash_mask = hsize - 1;
+    index->hash_mask = hmask;
     index->num_entries = num_entries;
-
-    mem = index->hash;
-    packed_hash = mem;
-    mem = packed_hash + (hsize+1);
-    packed_entry = mem;
-
-    for (i = 0; i < hsize; i++) {
-        /*
-         * Coalesce all entries belonging to one linked list
-         * into consecutive array entries.
-         */
-        packed_hash[i] = packed_entry;
-        for (entry = hash[i]; entry; entry = entry->next)
-            *packed_entry++ = entry->entry;
-    }
-
-    /* Sentinel value to indicate the length of the last hash bucket */
-    packed_hash[hsize] = packed_entry;
-
-    assert(packed_entry - (struct index_entry *)mem == num_entries);
-    index->last_entry = (packed_entry - 1);
-    return index;
-}
-
-void include_entries_from_index(struct unpacked_index_entry **hash,
-                                unsigned int *hash_count,
-                                unsigned int hsize,
-                                struct unpacked_index_entry *entry,
-                                const struct delta_index *old)
-{
-    unsigned int i, hmask, masked_val;
-    struct index_entry *old_entry;
-
-    hmask = hsize - 1;
-    /* Iterate backwards through the existing entries
-     * This is so that matches early in the file end up being the first pointer
-     * in the linked list.
-     * TODO: We know that all old entries are going to occur before the new
-     *       entries, and that we are going to follow this with a limit and
-     *       then pack step. There is probably a way we could optimize this
-     *       step, so that we wouldn't have to copy everything into a new
-     *       buffer, and then copy everything again into yet another buffer.
-     */
-    for (old_entry = old->last_entry, i = 0; i < old->num_entries;
-         i++, old_entry--) {
-        masked_val = old_entry->val & hmask;
-        entry->entry = *old_entry;
-        entry->next = hash[masked_val];
-        hash[masked_val] = entry++;
-        hash_count[masked_val]++;
-    }
-}
-
-struct delta_index *
-create_index_from_old_and_hash(const struct delta_index *old,
-                               struct unpacked_index_entry **hash,
-                               unsigned int hsize,
-                               unsigned int num_entries)
-{
-    unsigned int i, memsize, total_num_entries;
-    struct unpacked_index_entry *entry;
-    struct delta_index *index;
-    struct index_entry *packed_entry, **packed_hash;
-    struct index_entry *copy_from_start, *copy_to_start;
-    size_t total_copy, to_copy;
-    unsigned int num_ops;
-    unsigned int bytes_copied;
-    void *mem;
-    /*
-     * Now create the packed index in array form
-     * rather than linked lists.
-     */
-    total_num_entries = num_entries + old->num_entries;
-    memsize = sizeof(*index)
-        + sizeof(*packed_hash) * (hsize+1)
-        + sizeof(*packed_entry) * total_num_entries;
-    mem = malloc(memsize);
-    if (!mem) {
-        return NULL;
-    }
-
-    index = mem;
-    index->memsize = memsize;
-    index->hash_mask = hsize - 1;
-    index->num_entries = total_num_entries;
-
-    mem = index->hash;
-    packed_hash = mem;
-    mem = packed_hash + (hsize+1);
-    packed_entry = mem;
-
-    total_copy = 0;
-    bytes_copied = 0;
-    num_ops = 0;
-    copy_from_start = copy_to_start = NULL;
-    for (i = 0; i < hsize; i++) {
-        /*
-         * Coalesce all entries belonging to one linked list
-         * into consecutive array entries.
-         * The entries in old all come before the entries in hash.
-         */
-        packed_hash[i] = packed_entry;
-        to_copy = (old->hash[i+1] - old->hash[i]);
-        if (to_copy > 0) {
-            /* This next range can be copied wholesale. However, we will wait
-             * until we have to insert something from the new data before we
-             * copy everything.
-             * So for now, just reserve space for the copy.
+    if (old_index) {
+        if (hmask < old_index->hash_mask) {
+            fprintf(stderr, "hash mask was shrunk %x => %x\n",
+                            old_index->hash_mask, hmask);
+        }
+        assert(hmask >= old_index->hash_mask);
+    }
+
+    mem = index->hash;
+    packed_hash = mem;
+    mem = packed_hash + (hsize+1);
+    packed_entry = mem;
+
+    for (i = 0; i < hsize; i++) {
+        /*
+         * Coalesce all entries belonging to one linked list
+         * into consecutive array entries.
+         */
+        packed_hash[i] = packed_entry;
+        /* Old comes earlier as a source, so it always comes first in a given
+         * hash bucket.
+         */
+        if (old_index) {
+            /* Could we optimize this to use memcpy when hmask ==
+             * old_index->hash_mask? Would it make any real difference?
              */
-            if (total_copy == 0) {
-                copy_from_start = old->hash[i];
-                copy_to_start = packed_entry;
+            j = i & old_index->hash_mask;
+            copy_from = old_index->hash[j];
+            for (old_entry = old_index->hash[j];
+                 old_entry < old_index->hash[j + 1] && old_entry->ptr != NULL;
+                 old_entry++) {
+                if ((old_entry->val & hmask) == i) {
+                    *packed_entry++ = *old_entry;
+                }
             }
-            packed_entry += to_copy;
-            total_copy += to_copy;
-            assert((packed_entry - copy_to_start) == total_copy);
-            assert((old->hash[i+1] - copy_from_start) == total_copy);
         }
         for (entry = hash[i]; entry; entry = entry->next) {
-            /* We have an entry to insert, so flush the copy buffer */
-            if (total_copy > 0) {
-                assert(copy_to_start != NULL);
-                assert(copy_from_start != NULL);
-                memcpy(copy_to_start, copy_from_start,
-                       total_copy * sizeof(struct index_entry));
-                bytes_copied += (total_copy * sizeof(struct index_entry));
-                total_copy = 0;
-                copy_from_start = copy_to_start = NULL;
-                num_ops += 1;
-            }
             *packed_entry++ = entry->entry;
         }
-    }
-    if (total_copy > 0) {
-        assert(copy_to_start != NULL);
-        assert(copy_from_start != NULL);
-        memcpy(copy_to_start, copy_from_start,
-               total_copy * sizeof(struct index_entry));
-        bytes_copied += (total_copy * sizeof(struct index_entry));
-        num_ops += 1;
+        /* TODO: At this point packed_entry - packed_hash[i] is the number of
+         *       records that we have inserted into this hash bucket.
+         *       We should *really* consider doing some limiting along the
+         *       lines of limit_hash_buckets() to avoid pathological behavior.
+         */
+        /* Now add extra 'NULL' entries that we can use for future expansion. */
+        for (j = 0; j < EXTRA_NULLS; ++j ) {
+            *packed_entry++ = null_entry;
+        }
     }
 
     /* Sentinel value to indicate the length of the last hash bucket */
     packed_hash[hsize] = packed_entry;
 
-    assert(packed_entry - (struct index_entry *)mem == total_num_entries);
+    if (packed_entry - (struct index_entry *)mem
+        != num_entries + hsize*EXTRA_NULLS) {
+        fprintf(stderr, "We expected %d entries, but created %d\n",
+                num_entries + hsize*EXTRA_NULLS,
+                (int)(packed_entry - (struct index_entry*)mem));
+    }
+    assert(packed_entry - (struct index_entry *)mem
+            == num_entries + hsize*EXTRA_NULLS);
     index->last_entry = (packed_entry - 1);
     return index;
 }
 
 
-struct delta_index * create_delta_index(const struct source_info *src,
-                                        const struct delta_index *old)
+struct delta_index *
+create_delta_index(const struct source_info *src,
+                   struct delta_index *old)
 {
     unsigned int i, hsize, hmask, num_entries, prev_val, *hash_count;
     unsigned int total_num_entries;
@@ -398,6 +393,10 @@
     for (i = 4; (1u << i) < hsize && i < 31; i++);
     hsize = 1 << i;
     hmask = hsize - 1;
+    if (old && old->hash_mask > hmask) {
+        hmask = old->hash_mask;
+        hsize = hmask + 1;
+    }
 
     /* allocate lookup index */
     memsize = sizeof(*hash) * hsize +
@@ -442,37 +441,249 @@
             hash_count[i]++;
         }
     }
-    /* If we have an existing delta_index, we want to bring its info into the
-     * new structure. We assume that the existing structure's objects all occur
-     * before the new source, so they get first precedence in the index.
-     */
-    if (old != NULL)
-        include_entries_from_index(hash, hash_count, hsize, entry, old);
-
+    /* TODO: It would be nice to limit_hash_buckets at a better time. */
     total_num_entries = limit_hash_buckets(hash, hash_count, hsize,
                                            total_num_entries);
     free(hash_count);
-    index = pack_delta_index(hash, hsize, total_num_entries);
-    index->last_src = src;
+    if (old) {
+        old->last_src = src;
+    }
+    index = pack_delta_index(hash, hsize, total_num_entries, old);
     free(hash);
     if (!index) {
         return NULL;
     }
-    return index;
+    index->last_src = src;
+    return index;
+}
+
+/* Take some entries, and put them into a custom hash.
+ * @param entries   A list of entries, sorted by position in file
+ * @param num_entries   Length of entries
+ * @param out_hsize     The maximum size of the hash, the final size will be
+ *                      returned here
+ */
+struct index_entry_linked_list **
+_put_entries_into_hash(struct index_entry *entries, unsigned int num_entries,
+                       unsigned int hsize)
+{
+    unsigned int hash_offset, hmask, memsize;
+    struct index_entry *entry, *last_entry;
+    struct index_entry_linked_list *out_entry, **hash;
+    void *mem;
+
+    hmask = hsize - 1;
+
+    memsize = sizeof(*hash) * hsize +
+          sizeof(*out_entry) * num_entries;
+    mem = malloc(memsize);
+    if (!mem)
+        return NULL;
+    hash = mem;
+    mem = hash + hsize;
+    out_entry = mem;
+
+    memset(hash, 0, sizeof(*hash)*(hsize+1));
+
+    /* We know that entries are in the order we want in the output, but they
+     * aren't "grouped" by hash bucket yet.
+     */
+    last_entry = entries + num_entries;
+    for (entry = entries + num_entries - 1; entry >= entries; --entry) {
+        hash_offset = entry->val & hmask;
+        out_entry->p_entry = entry;
+        out_entry->next = hash[hash_offset];
+        /* TODO: Remove entries that have identical vals, or at least filter
+         *       the map a little bit.
+         * if (hash[i] != NULL) {
+         * }
+         */
+        hash[hash_offset] = out_entry;
+        ++out_entry;
+    }
+    return hash;
+}
+
+
+struct delta_index *
+create_index_from_old_and_new_entries(const struct delta_index *old_index,
+                                      struct index_entry *entries,
+                                      unsigned int num_entries)
+{
+    unsigned int i, j, hsize, hmask, total_num_entries;
+    struct delta_index *index;
+    struct index_entry *entry, *packed_entry, **packed_hash;
+    struct index_entry *last_entry, null_entry = {0};
+    void *mem;
+    unsigned long memsize;
+    struct index_entry_linked_list *unpacked_entry, **mini_hash;
+
+    /* Determine index hash size.  Note that indexing skips the
+       first byte to allow for optimizing the Rabin's polynomial
+       initialization in create_delta(). */
+    total_num_entries = num_entries + old_index->num_entries;
+    hsize = total_num_entries / 4;
+    for (i = 4; (1u << i) < hsize && i < 31; i++);
+    hsize = 1 << i;
+    if (hsize < old_index->hash_mask) {
+        /* For some reason, there was a code path that would actually *shrink*
+         * the hash size. This screws with some later code, and in general, I
+         * think it better to make the hash bigger, rather than smaller. So
+         * we'll just force the size here.
+         * Possibly done by create_delta_index running into a
+         * limit_hash_buckets call, that ended up transitioning across a
+         * power-of-2. The cause isn't 100% clear, though.
+         */
+        hsize = old_index->hash_mask + 1;
+    }
+    hmask = hsize - 1;
+    // fprintf(stderr, "resizing index to insert %d entries into array"
+    //                 " with %d entries: %x => %x\n",
+    //         num_entries, old_index->num_entries, old_index->hash_mask, hmask);
+
+    memsize = sizeof(*index)
+        + sizeof(*packed_hash) * (hsize+1)
+        + sizeof(*packed_entry) * (total_num_entries + hsize*EXTRA_NULLS);
+    mem = malloc(memsize);
+    if (!mem) {
+        return NULL;
+    }
+    index = mem;
+    index->memsize = memsize;
+    index->hash_mask = hmask;
+    index->num_entries = total_num_entries;
+    index->last_src = old_index->last_src;
+
+    mem = index->hash;
+    packed_hash = mem;
+    mem = packed_hash + (hsize+1);
+    packed_entry = mem;
+
+    mini_hash = _put_entries_into_hash(entries, num_entries, hsize);
+    if (mini_hash == NULL) {
+        free(index);
+        return NULL;
+    }
+    last_entry = entries + num_entries;
+    for (i = 0; i < hsize; i++) {
+        /*
+         * Coalesce all entries belonging in one hash bucket
+         * into consecutive array entries.
+         * The entries in old_index all come before 'entries'.
+         */
+        packed_hash[i] = packed_entry;
+        /* Copy any of the old entries across */
+        /* Would we rather use memcpy? */
+        if (hmask == old_index->hash_mask) {
+            for (entry = old_index->hash[i];
+                 entry < old_index->hash[i+1] && entry->ptr != NULL;
+                 ++entry) {
+                assert((entry->val & hmask) == i);
+                *packed_entry++ = *entry;
+            }
+        } else {
+            /* If we resized the index from this action, all of the old values
+             * will be found in the previous location, but they will end up
+             * spread across the new locations.
+             */
+            j = i & old_index->hash_mask;
+            for (entry = old_index->hash[j];
+                 entry < old_index->hash[j+1] && entry->ptr != NULL;
+                 ++entry) {
+                assert((entry->val & old_index->hash_mask) == j);
+                if ((entry->val & hmask) == i) {
+                    /* Any entries not picked up here will be picked up on the
+                     * next pass.
+                     */
+                    *packed_entry++ = *entry;
+                }
+            }
+        }
+        /* Now see if we need to insert any of the new entries.
+         * Note that loop ends up O(hsize*num_entries), so we expect that
+         * num_entries is always small.
+         * We also help a little bit by collapsing the entry range when the
+         * endpoints are inserted. However, an alternative would be to build a
+         * quick hash lookup for just the new entries.
+         * Testing shows that this list can easily get up to about 100
+         * entries, the tradeoff is a malloc, 1 pass over the entries, copying
+         * them into a sorted buffer, and a free() when done,
+         */
+        for (unpacked_entry = mini_hash[i];
+             unpacked_entry;
+             unpacked_entry = unpacked_entry->next) {
+            assert((unpacked_entry->p_entry->val & hmask) == i);
+            *packed_entry++ = *(unpacked_entry->p_entry);
+        }
+        /* Now insert some extra nulls */
+        for (j = 0; j < EXTRA_NULLS; ++j) {
+            *packed_entry++ = null_entry;
+        }
+    }
+    free(mini_hash);
+
+    /* Sentinel value to indicate the length of the last hash bucket */
+    packed_hash[hsize] = packed_entry;
+
+    if ((packed_entry - (struct index_entry *)mem)
+        != (total_num_entries + hsize*EXTRA_NULLS)) {
+        fprintf(stderr, "We expected %d entries, but created %d\n",
+                total_num_entries + hsize*EXTRA_NULLS,
+                (int)(packed_entry - (struct index_entry*)mem));
+        fflush(stderr);
+    }
+    assert((packed_entry - (struct index_entry *)mem)
+           == (total_num_entries + hsize * EXTRA_NULLS));
+    index->last_entry = (packed_entry - 1);
+    return index;
+}
+
+
+void
+get_text(char buff[128], const unsigned char *ptr)
+{
+    unsigned int i;
+    const unsigned char *start;
+    unsigned char cmd;
+    start = (ptr-RABIN_WINDOW-1);
+    cmd = *(start);
+    if (cmd < 0x80) {// This is likely to be an insert instruction
+        if (cmd < RABIN_WINDOW) {
+            cmd = RABIN_WINDOW;
+        }
+    } else {
+        /* This was either a copy [should never be] or it
+         * was a longer insert so the insert start happened at 16 more
+         * bytes back.
+         */
+        cmd = RABIN_WINDOW + 1;
+    }
+    if (cmd > 60) {
+        cmd = 60; /* Be friendly to 80char terms */
+    }
+    /* Copy the 1 byte command, and 4 bytes after the insert */
+    cmd += 5;
+    memcpy(buff, start, cmd);
+    buff[cmd] = 0;
+    for (i = 0; i < cmd; ++i) {
+        if (buff[i] == '\n') {
+            buff[i] = 'N';
+        } else if (buff[i] == '\t') {
+            buff[i] = 'T';
+        }
+    }
 }
 
 struct delta_index *
 create_delta_index_from_delta(const struct source_info *src,
-                              const struct delta_index *old)
+                              struct delta_index *old_index)
 {
-    unsigned int i, hsize, hmask, num_entries, prev_val, *hash_count;
-    unsigned int total_num_entries;
+    unsigned int i, num_entries, max_num_entries, prev_val, num_inserted;
+    unsigned int hash_offset;
     const unsigned char *data, *buffer, *top;
     unsigned char cmd;
-    struct delta_index *index;
-    struct unpacked_index_entry *entry, **hash;
-    void *mem;
-    unsigned long memsize;
+    struct delta_index *new_index;
+    struct index_entry *entry, *entries, *old_entry;
 
     if (!src->buf || !src->size)
         return NULL;
@@ -486,47 +697,21 @@
        actual number will be recomputed during processing.
        */
 
-    num_entries = (src->size - 1)  / RABIN_WINDOW;
-    if (old != NULL)
-        total_num_entries = num_entries + old->num_entries;
-    else
-        total_num_entries = num_entries;
-    hsize = total_num_entries / 4;
-    for (i = 4; (1u << i) < hsize && i < 31; i++);
-    hsize = 1 << i;
-    hmask = hsize - 1;
-
-    /* allocate lookup index */
-    memsize = sizeof(*hash) * hsize +
-          sizeof(*entry) * total_num_entries;
-    mem = malloc(memsize);
-    if (!mem)
-        return NULL;
-    hash = mem;
-    mem = hash + hsize;
-    entry = mem;
-
-    memset(hash, 0, hsize * sizeof(*hash));
-
-    /* allocate an array to count hash num_entries */
-    hash_count = calloc(hsize, sizeof(*hash_count));
-    if (!hash_count) {
-        free(hash);
-        return NULL;
-    }
+    max_num_entries = (src->size - 1)  / RABIN_WINDOW;
+
+    /* allocate an array to hold whatever entries we find */
+    entries = malloc(sizeof(*entry) * max_num_entries);
+    if (!entries) /* malloc failure */
+        return NULL;
 
     /* then populate the index for the new data */
-    /* The create_delta_index code starts at the end and works backward,
-     * because it is easier to update the entry pointers (you always update the
-     * head, rather than updating the tail).
-     * However, we can't walk the delta structure that way.
-     */
     prev_val = ~0;
     data = buffer;
     /* source size */
     get_delta_hdr_size(&data, top);
     /* target size */
     get_delta_hdr_size(&data, top);
+    entry = entries; /* start at the first slot */
     num_entries = 0; /* calculate the real number of entries */
     while (data < top) {
         cmd = *data++;
@@ -545,7 +730,13 @@
                 /* Invalid insert, not enough bytes in the delta */
                 break;
             }
-            for (; cmd > RABIN_WINDOW; cmd -= RABIN_WINDOW,
+            /* The create_delta code requires a match at least 4 characters
+             * (including only the last char of the RABIN_WINDOW) before it
+             * will consider it something worth copying rather than inserting.
+             * So we don't want to index anything that we know won't ever be a
+             * match.
+             */
+            for (; cmd > RABIN_WINDOW + 3; cmd -= RABIN_WINDOW,
                                        data += RABIN_WINDOW) {
                 unsigned int val = 0;
                 for (i = 1; i <= RABIN_WINDOW; i++)
@@ -553,27 +744,16 @@
                 if (val != prev_val) {
                     /* Only keep the first of consecutive data */
                     prev_val = val;
-                    i = val & hmask;
-                    entry->entry.ptr = data + RABIN_WINDOW;
-                    entry->entry.val = val;
-                    entry->entry.src = src;
-                    entry->next = NULL;
-                    /* Now we have to insert this entry at the end of the hash
-                     * map
-                     */
-                    if (hash[i] == NULL) {
-                        hash[i] = entry;
-                    } else {
-                        struct unpacked_index_entry *this;
-                        for (this = hash[i];
-                            this->next != NULL;
-                            this = this->next) { /* No action */ }
-                        this->next = entry;
-                        this = entry;
-                    }
-                    hash_count[i]++;
+                    num_entries++;
+                    entry->ptr = data + RABIN_WINDOW;
+                    entry->val = val;
+                    entry->src = src;
                     entry++;
-                    num_entries++;
+                    if (num_entries > max_num_entries) {
+                        /* We ran out of entry room, something is really wrong
+                         */
+                        break;
+                    }
                 }
             }
             /* Move the data pointer by whatever remainder is left */
@@ -589,49 +769,72 @@
     }
     if (data != top) {
         /* Something was wrong with this delta */
-        free(hash);
-        free(hash_count);
+        free(entries);
         return NULL;
     }
     if (num_entries == 0) {
         /** Nothing to index **/
-        free(hash);
-        free(hash_count);
+        free(entries);
         return NULL;
     }
-    if (old != NULL) {
-        if (hmask == old->hash_mask) {
-            /* The total hash table size didn't change, which means that
-             * entries will end up in the same pages. We can do bulk-copying to
-             * get the final output
-             */
-            index = create_index_from_old_and_hash(old, hash, hsize,
-                                                   num_entries);
-            free(hash_count);
-            free(hash);
-            if (!index) {
-                return NULL;
-            }
-            index->last_src = src;
-            return index;
-        } else {
-            total_num_entries = num_entries + old->num_entries;
-            include_entries_from_index(hash, hash_count, hsize, entry, old);
-        }
+    assert(old_index != NULL);
+    old_index->last_src = src;
+    /* See if we can fill in these values into the holes in the array */
+    entry = entries;
+    num_inserted = 0;
+    for (; num_entries > 0; --num_entries, ++entry) {
+        hash_offset = (entry->val & old_index->hash_mask);
+        /* The basic structure is a hash => packed_entries that fit in that
+         * hash bucket. Things are structured such that the hash-pointers are
+         * strictly ordered. So we start by pointing to the next pointer, and
+         * walk back until we stop getting NULL targets, and then go back
+         * forward. If there are no NULL targets, then we know because
+         * entry->ptr will not be NULL.
+         */
+        old_entry = old_index->hash[hash_offset + 1];
+        old_entry--;
+        while (old_entry->ptr == NULL
+               && old_entry >= old_index->hash[hash_offset]) {
+            old_entry--;
+        }
+        old_entry++;
+        if (old_entry->ptr != NULL
+            || old_entry >= old_index->hash[hash_offset + 1]) {
+            /* There is no room for this entry, we have to resize */
+            // char buff[128];
+            // get_text(buff, entry->ptr);
+            // fprintf(stderr, "Failed to find an opening @%x for %8x:\n '%s'\n",
+            //         hash_offset, entry->val, buff);
+            // for (old_entry = old_index->hash[hash_offset];
+            //      old_entry < old_index->hash[hash_offset+1];
+            //      ++old_entry) {
+            //     get_text(buff, old_entry->ptr);
+            //     fprintf(stderr, "  [%2d] %8x %8x: '%s'\n",
+            //             (int)(old_entry - old_index->hash[hash_offset]),
+            //             old_entry->val, old_entry->ptr, buff);
+            // }
+            break;
+        }
+        num_inserted++;
+        *old_entry = *entry;
+        /* For entries which we *do* manage to insert into old_index, we don't
+         * want them double copied into the final output.
+         */
+        old_index->num_entries++;
+    }
+    if (num_entries > 0) {
+        /* We couldn't fit the new entries into the old index, so allocate a
+         * new one, and fill it with stuff.
+         */
+        // fprintf(stderr, "inserted %d before resize\n", num_inserted);
+        new_index = create_index_from_old_and_new_entries(old_index,
+            entry, num_entries);
     } else {
-        total_num_entries = num_entries;
-    }
-
-    total_num_entries = limit_hash_buckets(hash, hash_count, hsize,
-                                           total_num_entries);
-    free(hash_count);
-    index = pack_delta_index(hash, hsize, total_num_entries);
-    free(hash);
-    if (!index) {
-        return NULL;
-    }
-    index->last_src = src;
-    return index;
+        new_index = NULL;
+        // fprintf(stderr, "inserted %d without resizing\n", num_inserted);
+    }
+    free(entries);
+    return new_index;
 }
 
 void free_delta_index(struct delta_index *index)
@@ -731,7 +934,8 @@
              *       You end up getting a lot more collisions in the hash,
              *       which doesn't actually lead to a entry->val match.
              */
-            for (entry = index->hash[i]; entry < index->hash[i+1];
+            for (entry = index->hash[i];
+                 entry < index->hash[i+1] && entry->src != NULL;
                  entry++) {
                 const unsigned char *ref;
                 const unsigned char *src;

=== modified file 'bzrlib/tests/test__groupcompress_pyx.py'
--- a/bzrlib/tests/test__groupcompress_pyx.py	2009-03-11 06:50:59 +0000
+++ b/bzrlib/tests/test__groupcompress_pyx.py	2009-03-18 21:50:56 +0000
@@ -82,6 +82,16 @@
 common with the next text
 """
 
+_fourth_text = """\
+123456789012345
+same rabin hash
+123456789012345
+same rabin hash
+123456789012345
+same rabin hash
+123456789012345
+same rabin hash
+"""
 
 class Test_GroupCompress(tests.TestCase):
     """Direct tests for the compiled extension."""
@@ -191,15 +201,17 @@
 
     def test_delta_with_delta_bytes(self):
         di = self._gc_module.DeltaIndex()
+        source = _first_text
         di.add_source(_first_text, 0)
         self.assertEqual(len(_first_text), di._source_offset)
         delta = di.make_delta(_second_text)
         self.assertEqual('Dh\tsome more\x91\x019'
                          '&previous text\nand has some extra text\n', delta)
         di.add_delta_source(delta, 0)
+        source += delta
         self.assertEqual(len(_first_text) + len(delta), di._source_offset)
-        third_delta = di.make_delta(_third_text)
-        result = self._gc_module.apply_delta(_first_text + delta, third_delta)
+        second_delta = di.make_delta(_third_text)
+        result = self._gc_module.apply_delta(source, second_delta)
         self.assertEqualDiff(_third_text, result)
         # We should be able to match against the 'previous text\nand has some...'
         # that was part of the delta bytes
@@ -207,4 +219,31 @@
         # enough to match in the original text, and those bytes are not present
         # in the delta for the second text.
         self.assertEqual('z\x85\x01\x90\x14\x1chas some in common with the '
+                         '\x91T&\x03and\x91\x18,', second_delta)
+        # Add this delta, and create a new delta for the same text. We should
+        # find the remaining text, and only insert the short 'and' text.
+        di.add_delta_source(second_delta, 0)
+        source += second_delta
+        third_delta = di.make_delta(_third_text)
+        result = self._gc_module.apply_delta(source, third_delta)
+        self.assertEqualDiff(_third_text, result)
+        self.assertEqual('\xa6\x01\x85\x01\x90\x14\x91\x80\x1c'
                          '\x91T&\x03and\x91\x18,', third_delta)
+        # Now create a delta, which we know won't be able to be 'fit' into the
+        # existing index
+        fourth_delta = di.make_delta(_fourth_text)
+        self.assertEqual(_fourth_text,
+                         self._gc_module.apply_delta(source, fourth_delta))
+        self.assertEqual('\xa6\x01\x80\x01'
+                         '\x7f123456789012345\nsame rabin hash\n'
+                         '123456789012345\nsame rabin hash\n'
+                         '123456789012345\nsame rabin hash\n'
+                         '123456789012345\nsame rabin hash'
+                         '\x01\n', fourth_delta)
+        di.add_delta_source(fourth_delta, 0)
+        source += fourth_delta
+        # With the next delta, everything should be found
+        fifth_delta = di.make_delta(_fourth_text)
+        self.assertEqual(_fourth_text,
+                         self._gc_module.apply_delta(source, fifth_delta))
+        self.assertEqual('\xac\x02\x80\x01\x91\xab\x7f\x01\n', fifth_delta)


