[CHK/MERGE] CHKMap.iter_interesting_nodes() and fetch updates
Robert Collins
robertc at robertcollins.net
Mon Dec 1 04:02:36 GMT 2008
On Thu, 2008-11-20 at 14:05 -0600, John Arbash Meinel wrote:
>
>
> 4) The biggest update is chk_map.iter_interesting_nodes() which takes
> some root keys to exclude and some keys to include, and then can walk
> what it needs to find the subset that is interesting.
>
> At the moment, it does one "deep" pass on uninteresting, and then does
> the minimal pass it can on all of the interesting keys. We need to have
> the full set of uninteresting, because the values may only exist in a
> deep LeafNode.
>
> However, at the very root page, it *is* smart enough to filter out any
> exact matches it finds at that level.
>
> The algorithm itself could be tweaked significantly, using some info
> from how prefixes work, etc, so that it can filter out sub-trees that
> can't possibly be interesting/uninteresting.
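To check that I'm reading the above right, the current shape is roughly the
sketch below. This is not the real chk_map code: store.read_node(),
node.child_keys() and node.leaf_items() are invented helpers standing in for
the actual page handling, with leaf_items() assumed to return nothing for
internal nodes.

def iter_interesting_sketch(store, interesting_roots, uninteresting_roots):
    """Yield (node, new_items) reachable only from the interesting roots."""
    # Pass 1: full-depth walk of the uninteresting side, recording every
    # page key and every leaf (key, value) item, however deep it lives.
    uninteresting_keys = set(uninteresting_roots)
    uninteresting_items = set()
    pending = list(uninteresting_roots)
    while pending:
        node = store.read_node(pending.pop())
        uninteresting_items.update(node.leaf_items())
        for child in node.child_keys():
            if child not in uninteresting_keys:
                uninteresting_keys.add(child)
                pending.append(child)
    # Pass 2: walk the interesting side. Exact page matches (including at
    # the root) are pruned, and leaf items are filtered against the full
    # uninteresting set collected above.
    seen = set(uninteresting_keys)
    pending = [key for key in interesting_roots if key not in seen]
    while pending:
        key = pending.pop()
        if key in seen:
            continue
        seen.add(key)
        node = store.read_node(key)
        new_items = [item for item in node.leaf_items()
                     if item not in uninteresting_items]
        yield node, new_items
        pending.extend(node.child_keys())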
The iter_changes algorithm already does this without a deep scan - I'm sure
its approach can be adapted here. It does a parallel scan of all the
lowest-prefix pointers; common subtrees are always eliminated, regardless of
height.
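Roughly, I mean something like the following: walk both sides one level at a
time, and drop any page key that appears on both sides before descending, so
a shared subtree is never read no matter how deep it sits. Same invented
helpers as above; the two sides are assumed to stay in step depth-wise, and
leaf-item filtering is left out for brevity.

def parallel_prefix_scan(store, interesting_roots, uninteresting_roots):
    """Yield pages unique to the interesting side, pruning shared subtrees."""
    interesting = set(interesting_roots)
    uninteresting = set(uninteresting_roots)
    while interesting:
        # Identical page keys mean identical content, so anything present
        # on both sides can be dropped unread, at any height in the tree.
        common = interesting & uninteresting
        interesting -= common
        uninteresting -= common
        next_interesting = set()
        next_uninteresting = set()
        for key in interesting:
            node = store.read_node(key)
            yield node
            next_interesting.update(node.child_keys())
        for key in uninteresting:
            node = store.read_node(key)
            next_uninteresting.update(node.child_keys())
        interesting = next_interesting
        uninteresting = next_uninteresting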
> I have the feeling that the current algorithm is actually sufficient for
> the code to be "production ready", though. If the trees really are
> similar, then that first culling can filter out most of the subtrees.
> And after that we do one full-depth search, and everything else is truly
> minimal. Though it is a bit of a shame the full-depth is on the
> uninteresting side, as we know we don't want those pages.
We do need to get rid of that to be able to pull small changes efficiently.
At the moment this would be much worse than the current examine-knit-delta
case for a single change in a large tree.
> 5) iter_interesting_nodes() works very well for both
> item_keys_introduced_by() and general fetching, and has been hooked into
> both places, replacing the customized _find_file_keys* that Andrew and I
> put together.
Excellent.
> The big win here is that doing "bzr branch bzrtools-d4 xxx" takes 8s
> instead of >1min. (Doing it with existing formats takes 4-5s, so there
> are still bits that we can improve upon.)
> One of the big issues is that we still read all of the inventory bits
> twice: once for item_keys_introduced_by, and once for actually copying
> the pages. This should be addressed by better generic fetch code.
>
> The way it is written, we *do* end up buffering the records from
> get_record_stream(), which is generally not preferred. At the peak, that
> will be 1 CHK page for every revision. So for bzr.dev that could be
> approx 25,000*4k = 100MB, and 260MB for MySQL.
>
> Again, the algorithm can be tweaked so it doesn't have to process as
> deeply, or buffer as much. I can't think of a great way to avoid *any*
> buffering that doesn't end up causing us to read the pages twice. If we
> *really* didn't want to buffer the records, the algorithm could be
> trivially tweaked to only track the keys rather than the whole record
> object.
I think that reading twice is fine for streaming pull, and avoidable for
VFS pull, by copying everything we read if we might need it (if we read
it, write it to the local repo).
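Concretely, I'm picturing the two shapes below; walk_pages() and write_page()
are invented stand-ins for the interestingness walk and the repository page
access, not real bzrlib calls.

def streaming_pull_wanted_keys(source, interesting_roots,
                               uninteresting_roots):
    """Streaming pull: remember only keys, re-request the bytes later.

    This is the 'read twice' trade: peak memory is one key per page
    rather than one buffered record per revision, at the cost of a
    second get_record_stream() pass over the same pages.
    """
    wanted = []
    for key, _bytes, _items in source.walk_pages(interesting_roots,
                                                 uninteresting_roots):
        wanted.append(key)          # drop the record, keep the key
    return wanted


def vfs_pull_copy_as_read(source, target, interesting_roots,
                          uninteresting_roots):
    """VFS pull: write every page the moment it is read.

    Nothing is re-read, and nothing beyond the walk's own key sets is
    buffered.
    """
    for key, bytes_, _items in source.walk_pages(interesting_roots,
                                                 uninteresting_roots):
        target.write_page(key, bytes_)

The streaming shape keeps the pull stream simple at the cost of reading the
pages a second time; the VFS shape spends local writes to avoid that second
read.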
I haven't read the exact code yet; I'll do that shortly, on your next
edition of the patch.
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.