[CHK/MERGE] Use chk map to determine file texts to fetch

John Arbash Meinel john at arbash-meinel.com
Tue Nov 11 22:50:36 GMT 2008


Robert Collins wrote:
> On Tue, 2008-11-11 at 19:14 +1000, Andrew Bennetts wrote:
>> Andrew Bennetts wrote:
>>> Hi Robert,
>>>
>>> This patch improves chk->chk fetch to compare hash values rather than
>>> inventory entries when determining the text versions to transfer.  Fetching is
>>> still very slow for other reasons, but this seems like a step in the right
>>> direction.
>> This update adds CHKMap.copy_to(store) and uses it to make fetch run in a
>> reasonable amount of time.
> 
> I'd like to make sure I understand what this is doing a bit better...
> 
> If I understand it correctly, it's so that a fetch of a CHKInventory from
> repo A to repo B doesn't end up generating a whole new CHKInventory from
> scratch.
> 
> Wouldn't it be better though, to copy all the root nodes, then all the
> unique key references from all the inventories? That would give a
> breadth first traversal. Or better still use the same logic as you have
> put in pack_repo for determining which inventory entries to examine, to
> determine which parts of the inventory trees need copying.
> 
> I think the primary thing I'm concerned about here is that this doesn't
> stream - it needs to chat, so the source side and the target side need a
> full duplex connection, which means it won't fit well into a streaming
> push from/to the smart server.
> 
> -Rob
> 

So:

1) It doesn't stream; that is correct.

2) We should be doing it across all inventories we are copying at once,
rather than one inventory at a time.

3) We *do* read the root, then copy the unique keys + root (see the
sketch just below this list). I'm not sure what else you think it is
doing.
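
Roughly, the copy looks like this. The store methods and the
iter_child_references() helper are just stand-ins here, not the real
chk_map API:

def copy_chk_pages(source_store, target_store, root_keys,
                   iter_child_references):
    """Copy every chk page reachable from root_keys, as raw bytes.

    iter_child_references(page_bytes) must yield the keys of the child
    pages a node refers to; it stands in for whatever the serializer
    actually exposes.  Nothing is deserialised into InventoryEntry
    objects on the way through.
    """
    pending = set(root_keys)
    seen = set()
    while pending:
        key = pending.pop()
        if key in seen:
            continue
        seen.add(key)
        # Pages the target already has could be skipped here to get
        # the "unique keys only" behaviour.
        page_bytes = source_store.get(key)
        target_store.add(key, page_bytes)
        pending.update(iter_child_references(page_bytes))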

In the end, I think a better solution is to get a better streaming fetch
written, but this work is in the "get it to the point where it is
usable" category, rather than rewriting everything. (Such as having a
usertest run complete in less than overnight.)

Our logic for streaming data from the remote isn't really complete. We
have the "get_record_stream()" functionality, which is a decent start,
but you have to know all of the keys you are going to need ahead of time.
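
For example, driving it today looks something like this, where 'source'
and 'target' are VersionedFiles-like objects such as repo.texts (a
sketch, not real fetch code):

def copy_known_keys(source, target, keys):
    # The complete key list has to be handed over up front; there is
    # no way to ask for "these keys plus everything they reference"
    # in the same request.
    stream = source.get_record_stream(keys, 'unordered', False)
    target.insert_record_stream(stream)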

What we really want is something that could be given a "search" on
revision-ids, and then fill in all the details from there on down. This
can be layered on top of our work for improving
"item_keys_introduced_by", which can figure out what *texts* need to be
transmitted; at the same time, we could figure out which inventory
pages need to be transmitted. I don't know whether this fits into the
"generic" fetch code, though.
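
Something along these lines is what I have in mind; the names are
purely hypothetical, nothing like this exists yet:

def stream_for_search(source_repo, revision_ids):
    """Yield (kind, record_stream) pairs for everything the given
    revisions introduce, so the target never has to enumerate keys."""
    rev_keys = [(rev_id,) for rev_id in revision_ids]
    yield 'revisions', source_repo.revisions.get_record_stream(
        rev_keys, 'topological', False)
    # The inventory pages and file texts introduced by those revisions
    # would be computed here (e.g. by diffing each chk root against
    # its parents' roots) and streamed as raw bytes in the same way,
    # without deserialising them at either end.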


I think get_record_stream() is a decent step in that direction; the
major limitation at the moment is that you need to know the full list
of keys you are going to request.

Imagine if we unified the namespace of all the various
revisions/inventories/texts/chk nodes, etc. So that at the top level you
would have something like "item_keys_introduced_by()" which would yield:

('revision', 'revision-id')
('inventory', 'revision-id')
('chk', 'sha1:xxyyzz')
('chk', 'sha1:aabbcc')
('text', 'file-id', 'revision-id')
...

You could either unify the namespace, or have the first entry define
which index needs to be used, etc.

Perhaps all we really need is to update item_keys_introduced_by() to
allow it to return the chk pages that need to be copied.
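
A generic fetch loop could then route each item on its first element.
Roughly (the store attribute names are what I would expect, but treat
this as a sketch):

def fetch_items(source_repo, target_repo, items):
    """Group a unified (kind, key...) item stream and copy each group."""
    kind_to_store = {'revision': 'revisions',
                     'inventory': 'inventories',
                     'chk': 'chk_bytes',
                     'text': 'texts'}
    by_kind = {}
    for item in items:
        by_kind.setdefault(item[0], []).append(tuple(item[1:]))
    for kind, keys in by_kind.items():
        source = getattr(source_repo, kind_to_store[kind])
        target = getattr(target_repo, kind_to_store[kind])
        target.insert_record_stream(
            source.get_record_stream(keys, 'unordered', False))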


The other concern is the need to adapt between repository formats.
Certainly streaming chk inventory records into a Knit repository
requires someone to do the translation. (And is it even possible to do
that translation if you are only given the minimal stream you would
want when streaming into another chk repository?)
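
A crude version of that adapter would have to rebuild whole inventories
on the source side and re-add them, approximately like this (the
Repository method names are from memory, so treat it as a sketch):

def adapt_chk_inventories(source_repo, target_repo, revision_ids):
    # Rebuilding the whole inventory needs every page of every chk
    # inventory, not just the pages that differ from the parents,
    # which is exactly why the minimal chk->chk stream is not enough
    # input for this translation.
    parent_map = source_repo.get_parent_map(revision_ids)
    for rev_id in revision_ids:
        inv = source_repo.get_inventory(rev_id)
        target_repo.add_inventory(rev_id, inv, parent_map[rev_id])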

John
=:->