Dev6 + stacking

Mon May 18 17:51:34 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ping on this, as it is blocking me from finishing up making dev6 stackable.

It breaks down into:

1) Change fileids_altered_by_revision_ids to use the iter_interesting
code, to handle ghosts, etc.
2a) Changing insert_stream() so that it knows whether or not to check
for missing text keys as part of missing inventories or
2b) Change iter_interesting to allow entries to be absent

John
=:->

John Arbash Meinel wrote:
> Sorry to break continuity, but due to power and internet outages, my
> mail is a bit messed up right now. In the end, though, it is okay,
> because I really wanted to start from scratch. (I also apologize if
> the formatting is wonky, I'm sending via gmail, and I'm not sure how
> 'text mode' works with indenting for lists). Writing the email ended
> up revealing some things, so I'll try to summarize properly.
> 
> For starters
> 
> 1) Robert recently changed 'fileids_altered_by' to be a set difference
> between the revisions listed, and any parents of those revisions whose
> inventories can be found. However, CHKInventoryRepository.fileids_...
> was *not* updated to conform to this.
> 
> 2) This shows that there probably are no direct tests of
> fileids_altered_by... under 'ghost' circumstances. It may be
> indirectly tested via fetch operations.
> 
> This doesn't break CHK repositories, because they don't use
> fileids_altered_by to do fetch. The code has inlined the comparison,
> so that as it reads the chk pages to analyze what texts (and more chk
> pages) to send, it then yields the record so that it can be
> transmitted on the wire.
> 
> The actual transmission code does effectively the same as (1), so it
> should be correct.
> 
> However, fileids_altered_by has not been updated to match the
> behavior, because it wasn't actually being used anymore.
> 
> 3) I *could* update fileids_altered... to be built on top of
> 'iter_interesting_nodes'. However, it will currently fail if attempted
> on a stacked branch before we fill in missing parent inventories.
> First I'll try to define some scenarios for XML code:
> 
>   a) We have not yet filled in missing parent inventories, however the
> current inventory texts are fulltexts. Under this condition
> fileids_altered_by will return all texts present in the given
> revision.
>   b) We have not filled in missing parents, and the current inventory
> texts are deltas. Here, fileids_altered_by cannot give all text keys,
> but can only give the text keys for things modified in this revision.
> (all it has is a line-delta.)
>      I think one could argue that this is an incorrect result, though
> one can also argue that compression parents cannot be ghosts, thus it
> is actually implicitly doing a difference between the revision and the
> missing parent, as though the parent was actually present and we then
> removed the items from it.
>      I also think that this could give incorrect results for cases
> where one inventory has a file claiming a last modified revision, such
> that the inventory for that revision does not also claim the file is
> modified. 'inv_Y[file_id].revision == X' but 'inv_X[file_id].revision
> != X'. My quick investigation doesn't show this, so I could be wrong.
>   c) We have filled in all parent inventories (sans ghosts). I believe
> this walks the deltas for the new inventories, and subtracts out the
> set of references from the full inventories of the parents.
> 
> Now if we consider this for CHK:
>   a) This doesn't quite follow for CHK. All individual texts will be
> complete, but the total inventory shape will not be. If we had all
> pages for the revisions to transmit, but not the pages for parent
> inventories, iter_interesing_nodes will compare 'interesting' with
> set(), and will give all text keys present in the revision.
>   b) This is closest to the case where we don't have parent inventory
> pages, and we only have chk pages that the client determined were
> 'interesting' for this revision. In this case, iter_interesting_nodes
> will break, as it doesn't handle absent references. We will compare
> 'interesting' with set(), and see a reference to a page we don't have.
>   c) With the current fill code, we will fill in the complete chk tree
> for parent nodes, which should also mean that we have complete chk
> trees for 'transmitted' revisions.
> 
> So the problem really is in (b), in that if we don't have a either a
> complete target inventory, or enough pages from the parent inventory,
> the walking code will fail. One could posit that it is equivalent to
> the XML case where all we have is the delta. Under those circumstances
> we could just treat 'absent' chk pages as 'implicitly matching some
> absent parent', the way we do for XML deltas. Having the page missing
> marks it as implicitly uninteresting, otherwise the client would have
> uploaded it for us.
> 
> I'm hesitant to do this, as it trusts the person who sent me the data
> to have given me everything I need, rather than validating it.
> 
> 
> 4) There is a point that Robert and I seem to be disagreeing on. Most
> notably the ordering of:
> 
>   a) Transmit the minimally correct set of revisions, inventories,
> (chk pages), texts, signatures.
>   b) Fill in missing parent inventories, (chk pages,) and compression
> parents for inventories and texts.
>   c) Check to see if there are any file texts which should be present
> but are not. (Based on the difference between the revisions we claim
> to have, and the parent inventories which we have available.)
> 
> I think we agree that the ordering is good. Where I disagree is that
> the check (c) is being done between a + b. The code is:
> 
>     insert_stream(get_stream #minimal)
>         => missing_keys = self.target_repo.get_missing_parent_inventories()
>             => get_parent_map(revisions).difference(get_parent_map(inventories))
>             => check fileids_altered_by_revision_ids()
>             <= return missing-inventories if fileids missing
>         => + get_missing_compression_parents()
>         <= return missing_keys
>     missing_stream = get_stream_for_missing_keys(missing_keys)
>     missing_keys = insert_stream(missing_stream)
>             => get_parent_map(revisions).difference(get_parent_map(inventories))
>             => check fileids_altered_by_revision_ids()
>             <= return missing-inventories if fileids missing
>         => + get_missing_compression_parents()
>         <= return missing_keys
>     if missing_keys: abort
> 
> The problem is that fileids_altered_by_revision_ids is done inside the
> first insert_stream(), which triggers 3(b) from before.  *As written*,
> the chk fetch code needs either:
>     a) Enough parent pages to be able to filter out known-common texts
>     b) All chk pages for a revision
> (Because I don't handle absent pages, and I don't particularly *want* to.)
> 
> Now if we changed the code to:
>     insert_stream(get_stream #minimal)
>         => missing_keys = self.target_repo.get_missing_parent_inventories()
>             => get_parent_map(revisions).difference(get_parent_map(inventories))
>             <= return missing-inventories
>         => + get_missing_compression_parents()
>         <= return missing_keys
>     missing_stream = get_stream_for_missing_keys(filter_absent(missing_keys))
>     missing_keys = insert_stream(missing_stream)
>             => get_parent_map(revisions).difference(get_parent_map(inventories))
>             => check fileids_altered_by_revision_ids()
>             <= return missing-inventories if fileids missing
>         => + get_missing_compression_parents()
>         <= return missing_keys
>     if missing_keys: abort
> 
> Then we don't have a problem. fileids_altered_by should be updated to
> use the set difference code, but we don't have to support absent pages
> that we don't *know* are common.
> 
> There are 2 primary changes:
>     1) Don't check for missing text keys until the missing inventory
> texts have been added
>     2) Don't attempt to transmit ghosts for missing inventories
> 
> (1) is necessary for the current chk_map.iter_interesting_nodes code,
> rather than teaching it to ignore absent records.
> (2) is necessary for the case of missing some parent inventories, some
> fraction of which are ghosts. (With the current XML code, you can
> still have missing texts until the inventories are filled in, but the
> inventory request will fail with an AbsentContentFactory because one
> of the parents is a real ghost.)
> 
> John
> =:->
> 
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoRkhYACgkQJdeBCYSNAANhSACfbxW7OLGviV92bEn5pTtTcpVO
ew0AnjjMA4M8T8Ntshr8y53TMjLhU+GR
=cVjq
-----END PGP SIGNATURE-----