Rev 4180: The mini-index is the key to shrinking the key width and still being able to 'scale'. in http://bzr.arbash-meinel.com/branches/bzr/jam-integration

John Arbash Meinel john at arbash-meinel.com
Sat Mar 21 02:01:17 GMT 2009


At http://bzr.arbash-meinel.com/branches/bzr/jam-integration

------------------------------------------------------------
revno: 4180
revision-id: john at arbash-meinel.com-20090321020106-507m1ij4mk38cwh3
parent: john at arbash-meinel.com-20090321013631-wbiiaca5hpfcnlo6
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: jam-integration
timestamp: Fri 2009-03-20 21:01:06 -0500
message:
  The mini-index is the key to shrinking the key width and still being able to 'scale'.
  As the index gets longer, we add log(N) bits to the mini-index, which is
  exactly what we need to preserve the same level of collision avoidance.
-------------- next part --------------
=== modified file 'doc/developers/improved_chk_index.txt'
--- a/doc/developers/improved_chk_index.txt	2009-03-21 01:36:31 +0000
+++ b/doc/developers/improved_chk_index.txt	2009-03-21 02:01:06 +0000
@@ -100,7 +100,7 @@
 
 5. So for 1M keys, an ideal chk+group index would be:
     
-    a. 8-byte hash prefix
+    a. 6-byte hash prefix
 
     b. 2-byte group number
 
@@ -108,22 +108,22 @@
 
     d. a separate lookup of 12-byte group number to offset + length
 
-    e. a variable width mini-index that splits X bits of the key. (for the
-       smallest keys, and lowest chance of collision, this could *not* be
-       redundant with the value stored in (a)) This should then dereference
-       into a location in the index. This should probably be a 4-byte
-       reference. It is unlikely, but possible, to have an index >16MB. With
-       an 8-byte hash, it takes 1.3M chk nodes to do so.
-       At the smallest end, this will probably be a 256-way (8-bits) fan out,
-       at the high end it could go up to 64k-way (16-bits) or maybe even
-       1M-way (20-bits). (64k-way should handle up to 5-16M nodes and still
-       allow a cheap <4k read to find the final entry.)
+    e. a variable width mini-index that splits X bits of the key. (to keep
+       keys small with a low chance of collision, this is *not* redundant with the
+       value stored in (a)) This should then dereference into a location in
+       the index. This should probably be a 4-byte reference. It is unlikely,
+       but possible, to have an index >16MB. With a 10-byte entry, it only
+       takes 1.6M chk nodes to do so. At the smallest end, this will probably
+       be a 256-way (8-bits) fan out, at the high end it could go up to
+       64k-way (16-bits) or maybe even 1M-way (20-bits). (64k-way should
+       handle up to 5-16M nodes and still allow a cheap <4k read to find the
+       final entry.)
 
 So the max size for the optimal groupcompress+chk index with 10M entries would be::
 
-  12 * 10M (entries) + 64k * 12 (group) + 64k * 4 (mini index) = 121 MiB
+  10 * 10M (entries) + 64k * 12 (group) + 64k * 4 (mini index) = 101 MiB
 
-So 121MiB which breaks down as 120MiB for the actual entries, 0.75MiB for the
+So 101MiB which breaks down as 100MiB for the actual entries, 0.75MiB for the
 group records, and 0.25MiB for the mini index.
 
 1. Looking up a key would involve:
@@ -158,20 +158,30 @@
       
    f. If the size of these mini headers becomes critical (8 bytes per record
       is 8% overhead for 100 byte records), we could also compress this mini
-      header. Alternatively we could set the 'width' of records. So instead of
-      always using 4-byte wide offset + length, we could switch to 2-byte
-      wide. Though groups are likely to always be >64KiB long (uncompressed).
+      header. Changing the number of bytes per entry is unlikely to help,
+      because groups standardize on 4MiB wide, which is >>64KiB for a 2-byte
+      offset; 3 bytes would be enough as long as we never store an ISO image
+      as a single entry in the content. Variable width also isn't a big win,
+      since base-128 hits 4 bytes at just 2MiB.
+
       For minimum size without compression, we could only store the 4-byte
       length of each node. Then to compute the offset, you have to sum all
-      previous nodes. This should still be very cheap in compiled code, and
-      would also have the advantage that it would be highly compressible
-      itself. (Most nodes are going to have a length that fits 1-2 bytes.)
-      An alternative form would be to use the base-128 encoding. (If the MSB
+      previous nodes. We require <64k nodes in a group, so this header is at
+      most 256KiB, but we would lose the ability to read only part of it.
+      Summing should still be cheap in compiled code (needs benchmarking,
+      since partial reads are no longer possible), and fixed width would also
+      have the advantage of being highly compressible itself. (Most nodes are
+      going to have a length that fits in 1-2 bytes.)
+      
+      An alternative form would be to use the base-128 encoding.  (If the MSB
       is set, then the next byte needs to be added to the current value
       shifted by 7*n bits.) This encodes 4GiB in 5 bytes, but stores 127B in 1
-      byte, and 2MiB in 3 bytes. If we only stored 64k entries in a group, we
-      would expect this header to be <128KiB long, which should be fast to
-      process.
+      byte, and 2MiB in 3 bytes. If we only stored 64k entries in a 4 MiB
+      group, the average size can only be 64B, which fits in a single byte
+      length, so 64KiB for this header, or only 1.5% overhead. We also don't
+      have to compute the offset of *all* nodes, just the ones before the one
+      we want, which is similar to what we have to do to get the actual
+      content out.
 
 
 Partial Hash
@@ -255,12 +265,17 @@
 certainly acceptable.  (note that this isn't 1 in 1000 of those keys being a
 collision, but a 1 in 1000 chance of a *single* collision).  Using a
 collision chance of 10^-3, and number of keys 10^6, means we need (12+3)*0.4 =
-6 bytes. For 10M keys, you need (14+3)*0.4 = 6.8 aka 7.
+6 bytes. For 10M keys, you need (14+3)*0.4 = 6.8 aka 7. We get that extra byte
+from the ``mini-index``. In an index with a lot of keys, you want a bigger
+fan-out up front anyway, which consumes more bits of the key and extends your
+effective key width.
 
 Also taking one more look at ``H ~ 0.4 (2N + E)``, you can rearrange and
 consider that for every order of magnitude more keys you insert, your chance
 for collision goes up by 2 orders of magnitude. But for 100M keys, 8 bytes
-gives you a 1 in 10,000 chance of collision.
+gives you a 1 in 10,000 chance of collision, and that already assumes a
+16-bit fan-out (64k-way); for 100M keys, we would likely want at least a
+20-bit fan-out.
 
 You can also see this from the original equation with a bit of rearranging::
 

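The base-128 length encoding discussed in the diff is easy to sanity check. A
minimal sketch (``encode_base128`` is a hypothetical helper, not code from
bzr), confirming the byte counts quoted in the text:

```python
def encode_base128(n):
    """Encode a non-negative integer as base-128 varint bytes.

    If the MSB of a byte is set, the next byte continues the value,
    shifted by a further 7 bits (the scheme described in the doc).
    """
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, MSB clear
            return bytes(out)

# Sizes match the figures in the text:
assert len(encode_base128(127)) == 1              # 127B record -> 1 byte
assert len(encode_base128(2 * 1024**2 - 1)) == 3  # just under 2MiB -> 3 bytes
assert len(encode_base128(2 * 1024**2)) == 4      # base-128 hits 4 bytes at 2MiB
assert len(encode_base128(4 * 1024**3)) == 5      # 4GiB -> 5 bytes
```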

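The partial-hash sizing follows ``H ~ 0.4 (2N + E)``. A quick sketch of that
arithmetic (``hash_bytes_needed`` is a hypothetical helper for illustration),
reproducing the 6 / 6.8 / 8 byte figures from the doc:

```python
import math

def hash_bytes_needed(num_keys, collision_chance):
    # H ~ 0.4 * (2N + E), where N = log10(number of keys) and
    # E = -log10(acceptable chance of a single collision anywhere).
    n = math.log10(num_keys)
    e = -math.log10(collision_chance)
    return 0.4 * (2 * n + e)

assert round(hash_bytes_needed(10**6, 1e-3), 1) == 6.0  # 1M keys -> 6 bytes
assert round(hash_bytes_needed(10**7, 1e-3), 1) == 6.8  # 10M keys -> ~7 bytes
assert round(hash_bytes_needed(10**8, 1e-4), 1) == 8.0  # 100M keys, 1 in 10,000
```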

More information about the bazaar-commits mailing list