Rev 4804: Some small doc updates to chk_index. in http://bazaar.launchpad.net/~jameinel/bzr/chk-index

Fri Oct 30 14:29:40 GMT 2009

At http://bazaar.launchpad.net/~jameinel/bzr/chk-index

------------------------------------------------------------
revno: 4804
revision-id: john at arbash-meinel.com-20091030142922-5iipnhlg49r3rgi9
parent: john at arbash-meinel.com-20091028204625-b0owje7tzg60y96o
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: chk-index
timestamp: Fri 2009-10-30 09:29:22 -0500
message:
  Some small doc updates to chk_index.
=== modified file 'doc/developers/improved_chk_index.txt'

--- a/doc/developers/improved_chk_index.txt	2009-03-24 16:35:22 +0000
+++ b/doc/developers/improved_chk_index.txt	2009-10-30 14:29:22 +0000
@@ -13,10 +13,11 @@
 Btree indexes also rely on zlib compression, in order to get their compact
 size, and further has to try hard to fit things into a compressed 4k page.
 When the key is a sha1 hash, we would not expect to get better than 20bytes
-per key, which is the same size as the binary representation of the hash. This
-means we could write an index format that gets approximately the same on-disk
-size, without having the overhead of ``zlib.decompress``. Some thought would
-still need to be put into how to efficiently access these records from remote. 
+per key, which is the same size as the binary representation of the hash (zlib
+compressing a sorted list of 10M hashes shrunk to only 97%). This means we
+could write an index format that gets approximately the same on-disk size,
+without having the overhead of ``zlib.decompress``. Some thought would still
+need to be put into how to efficiently access these records from remote. 
 
 
 Required information
@@ -112,7 +113,7 @@
        small keys, low chance of collision, this is *not* redundant with the
        value stored in (a)) This should then dereference into a location in
        the index. This should probably be a 4-byte reference. It is unlikely,
-       but possible, to have an index >16MB. With an 10-byte entry, it only
+       but possible, to have an index >16MB. With a 10-byte entry, it only
        takes 1.6M chk nodes to do so.  At the smallest end, this will probably
        be a 256-way (8-bits) fan out, at the high end it could go up to
        64k-way (16-bits) or maybe even 1M-way (20-bits). (64k-way should
@@ -385,6 +386,9 @@
 64k records. And our groups are currently scaled that we require at least
 1-2MB before they can be considered 'full'.
 
+However, there are also extremely pessimistic cases that can exist. So a
+variable number of bytes per group offset is probably the best answer.
+
 
 variable length index entries
 -----------------------------
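
Note on the figures quoted in the hunks above: the following is a minimal sketch, not part of the commit, of how one might check two of the numbers for oneself. It assumes synthetic SHA-1 digests of the decimal strings "0", "1", ... as stand-ins for real chk keys, and the helper names ``sorted_sha1_zlib_ratio`` and ``entries_to_exceed`` are hypothetical. The doc does not say how the 97% figure was measured, so the ratio printed here may differ from it.

```python
import hashlib
import zlib


def sorted_sha1_zlib_ratio(num_hashes):
    """Return compressed_size / raw_size for sorted 20-byte SHA-1 digests.

    The digests are synthetic (hashes of "0", "1", ...), an assumption made
    only so the sketch is self-contained.
    """
    digests = sorted(
        hashlib.sha1(str(i).encode("ascii")).digest() for i in range(num_hashes)
    )
    raw = b"".join(digests)
    return len(zlib.compress(raw)) / len(raw)


def entries_to_exceed(index_bytes, entry_bytes):
    """Smallest count of fixed-size entries whose total size exceeds index_bytes."""
    return index_bytes // entry_bytes + 1


if __name__ == "__main__":
    # 100k hashes keeps the run short; the doc's figure was for 10M.
    print("zlib ratio: %.3f" % sorted_sha1_zlib_ratio(100000))
    # With 10-byte entries, how many chk nodes push an index past 16MB?
    print("entries for >16MB:", entries_to_exceed(16 * 1024 * 1024, 10))
```

The second helper only restates the arithmetic behind the "1.6M chk nodes" remark: 16MB divided by 10 bytes per entry is a little under 1.7 million entries.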