Prototype "improved_chk_index"
John Arbash Meinel
john at arbash-meinel.com
Fri Oct 30 01:59:34 GMT 2009
Ian Clatworthy wrote:
> John Arbash Meinel wrote:
>
>> So I think we *could* do better about size if we want to put a fairly
>> significant amount of effort into it. The "easy" fixes would be:
>>
>> 1) Move the text content into .cix, and then only have the per-file
>> graph available in .tix. (Allowing us to remove the 'value' field),
>> saving about 1.25:1
>> 2) Fix the per-file graph for root nodes to not require a node for every
>> revision that came from a non-rich-root source. That saves another
>> 1.28:1 for a total 1.6:1 space savings in .tix
>> 3) Think about some way to combine .rix and .iix. Possibly just dropping
>> the inventory records entirely. We talked about doing that in the
>> past. The most significant issue is stacked branches needing the
>> 'parent inventories but *not* the parent revisions'. Though we could
>> do that with a simple flag in the index that said "this revision not
>> considered 'present'"...
>> This is 2.3MB of the 30MB in indexes for LP, so <10% total space. But
>> becoming a more significant fraction if we shrink .cix and .tix.
>
> As another data point, the FireFox 3.5 import shows:
>
> * 123M pack file
> * 13M indices
> * 11M checkout/dirstate
> * 4.1M checkout/merge-hashes
>
> The index sizes are:
>
> 6.1M .bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.cix
> 1.2M .bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.iix
> 1.2M .bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.rix
> 4.0K .bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.six
> 4.3M .bzr/repository/indices/43a941041bdf68b431bc9c73b9004fd1.tix
>
> Looking inside the matching .git import (after pack -adf --window=250):
>
> 4.0K .git/branches
> 4.0K .git/COMMIT_EDITMSG
> 4.0K .git/config
> 4.0K .git/description
> 4.0K .git/HEAD
> 48K .git/hooks
> 4.1M .git/index
> 16K .git/info
> 88K .git/logs
> 123M .git/objects
> 332K .git/refs
>
> .git/objects is the matching pack file.
>
> So head-to-head, both tools have a 123M pack file. Beyond the pack file,
> git's overheads are 4.1M and ours are 28.1M. That certainly suggests we
> have room for improvement in this area.
>
> Ian C.
>
Not quite. ".git/index" is actually the staging area, aka 'checkout'.
.git/objects/**/*.idx are the index files, and .git/objects/**/*.pack are
the pack files.
My guess is that the .pack file is >118MB, with ~5MB for the .idx files.
ISTR that git index files scale at about 24-28 bytes per sha. Each entry
is a sha-hash and an offset in the .pack file. The compression parents
are in the data stream, not the index, etc. I don't know if the 8-byte
offset is always triggered in newer versions, or only if the .pack is
big enough.
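
A rough sketch of the per-sha arithmetic above. It assumes the version-2 .idx layout described in Git's pack-format documentation (an 8-byte header, a 256-entry fanout table, then a 20-byte sha, a 4-byte CRC and a 4-byte offset per object, plus a 40-byte trailer). None of those constants come from this thread, and the function name and the object count in the example are hypothetical.

```python
# Size estimate for a Git pack index (.idx) file.
# ASSUMPTION: version-2 .idx layout as given in Git's pack-format docs;
# the names and the example object count are illustrative only.

HEADER = 8           # magic number + version number
FANOUT = 256 * 4     # 256 four-byte cumulative counts
SHA1 = 20            # object name
CRC32 = 4            # checksum of the packed object data
OFFSET32 = 4         # offset into the .pack file
OFFSET64 = 8         # extra entry for an offset that needs more than 31 bits
TRAILER = 2 * 20     # pack checksum + index checksum


def estimate_idx_size(num_objects, num_large_offsets=0):
    """Return the estimated .idx size in bytes."""
    per_object = SHA1 + CRC32 + OFFSET32  # 28 bytes per sha
    return (HEADER + FANOUT
            + num_objects * per_object
            + num_large_offsets * OFFSET64
            + TRAILER)


if __name__ == "__main__":
    # Hypothetical object count, not measured from any real import.
    print(estimate_idx_size(200000))  # 5601072
```

Under that layout an 8-byte entry is only added for an object whose pack offset does not fit in 31 bits, so a pack of around 123M would not need any and the cost stays at 28 bytes per sha plus about 1K of fixed overhead.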
John
=:->