Bazaar, OLPC, and the "no-packs" repository format.
John Arbash Meinel
john at arbash-meinel.com
Thu Dec 20 14:44:26 GMT 2007
Lukáš Lalinský wrote:
> On Thu, 2007-12-20 at 07:54 -0600, John Arbash Meinel wrote:
>> Lukáš Lalinský wrote:
>>> On Thu, 2007-12-20 at 06:25 -0600, John Arbash Meinel wrote:
>>>> Robert has at least conceptualized a hash-based index, so lookups become more
>>>> of an O(1) operation. Which makes it O(N) for N packs, but that is better than
>>>> O(N log(M)) for our current indexes. I wonder if you could even combine the
>>>> hash maps to make it O(1) across all indexes that you have read...
>>> I implemented a prototype of a similar idea about two weeks ago, but
>>> gave up on finishing it because I found that the main bottleneck for
>>> me was reading data from all over the big pack files. The indexing code
>>> is in the attachment, but it's only a part of the plugin, which itself
>>> requires a patched bzrlib and breaks 'packs-0.92', so I didn't bother
>>> attaching the whole package (it's not very usable, anyway).
>> For what operations?
>
> Various log operations were my primary tests, since I was playing with this
> because of the per-file log performance bug I've reported. To get all
> revisions it had to seek so many times that, even though I had a faster
> index than with knits, the whole operation was slower than with knits.
Have you used the recent 'bzr pack' which sorts the revision data by reverse
topological order?
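To illustrate what reverse topological order means here (this is just a sketch, not bzrlib's actual implementation): revisions are written so that every child precedes its parents, which keeps recent history clustered near the front of the pack. A minimal version, assuming a graph given as a dict of revision ID to parent IDs, could look like:

```python
def reverse_topological_sort(graph):
    """Order revisions so every child precedes its parents.

    graph: dict mapping revision_id -> list of parent revision_ids.
    A depth-first postorder emits parents before children (roots
    first); reversing that list puts each child before its parents.
    """
    order = []
    seen = set()

    def visit(rev):
        if rev in seen:
            return
        seen.add(rev)
        for parent in graph.get(rev, []):
            visit(parent)
        order.append(rev)  # postorder: parents land before children

    for rev in graph:
        visit(rev)
    order.reverse()  # now every child precedes its parents
    return order
```

Writing revision texts in this order means a walk backwards from the tip reads the pack roughly front to back, with far fewer seeks.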
For *my* use of log, I do:
bzr log --short --forward -r-10..-1
Which is still extra slow because it has to grab the whole revision graph
(to generate the dotted revnos).
So making 'get_revision_graph()' faster would help me a lot.
>
>>> The general idea is that the index is split into segments, which are
>>> grouped by their left-hand history (similar to toposort, but backwards)
>>> and the index header contains 32 bit hashes of each key of each segment.
>>> The combined index also uses a global cache, so in most cases you don't
>>> search in all indexes, but only the ones that actually contain the key
>>> (for 15k revisions in bzr.dev I get no hash collisions, so it's
>>> something like O(1+very_small_constant*N)). But that helps only in cases
>>> where you have a few big indexes, not a *lot* of tiny ones.
>> That seems like it would be difficult to look up an arbitrary key, since you
>> don't know where it fits relative to other keys.
>
> No, lookup for an arbitrary key within one index is O(1) operation. You
> have a hash map of all revisions IDs pointing to segment number in which
> they are. This makes deciding whether a revision ID is in the index
> really trivial, and even for Python SVN repository with ~33k revisions I
> had no hash collisions, so in order to lookup one key I had to load only
> one segment. Though, the main advantage is that for most operations you
> need the whole ancestry of a key, and again you read only the
> interesting segments (even if they are in multiple indexes). You never
> read segments you don't need, and the related segments are close
> together, so in most cases you can read them without seeking.
>
> Lukas
>
Ok, the way you said it sounded like you have a pointer to just the tip
revision in the global hash map. Instead you have a hash of every key, pointing
to the segment it is in.
How do you know how long the segments are? Is that stored in the index as well?
I suppose I should look at your code a bit more.
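For anyone else following along, here is a rough sketch of how I read the scheme Lukáš describes. All names are mine, not his plugin's: a global dict maps every revision ID to its segment number, so membership tests and lookups cost one hash probe plus a single segment read.

```python
class SegmentedIndex(object):
    """Hypothetical sketch of a segment-based index with a global key map.

    segments: list of dicts, each mapping revision_id -> record. Each
    dict stands in for one on-disk segment grouped by left-hand history.
    """

    def __init__(self, segments):
        self._segments = segments
        # Global hash map: revision_id -> segment number. In the real
        # design this would be built from per-segment 32-bit key hashes
        # stored in the index header.
        self._key_to_segment = {}
        for num, segment in enumerate(segments):
            for revision_id in segment:
                self._key_to_segment[revision_id] = num

    def __contains__(self, revision_id):
        # Deciding whether a key is present is a plain dict probe: O(1).
        return revision_id in self._key_to_segment

    def lookup(self, revision_id):
        # One probe to find the segment, then one read inside it. Only
        # the segment that actually holds the key is ever touched.
        num = self._key_to_segment.get(revision_id)
        if num is None:
            raise KeyError(revision_id)
        return self._segments[num][revision_id]
```

Since related revisions share a segment, walking an ancestry mostly re-reads segments already in memory rather than seeking across the whole pack.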
John
=:->