[MERGE] Graph.heads()
John Arbash Meinel
john at arbash-meinel.com
Fri Aug 24 20:39:52 BST 2007
...
> Another thing to consider is that querying the per-file graph is not
> needed. The per-file nodes are a subset of the revision graph nodes. And
> they have the same topological order. I suspect a little theory work
> will prove that a per-file graph node N is an ancestor of A IFF revision
> graph node N is an ancestor of A in the revision graph. If so (and
> please do check it) querying the revision graph is enough to answer the
> question 'what are the per file heads'. And that gives us one graph,
> which we can cache in memory and add data to as queries come in that
> have not had enough read to answer yet.
I haven't been able to come up with a counterexample yet. I'm not positive that
it isn't possible, though.
The way I thought of it was that, given 2 nodes, either they are both heads, or
one of them supersedes the other. And in a subset graph you are missing nodes,
so you don't have linkages you might otherwise have. (So a query against the
full revision graph, where those linkages do exist, could report a node as
superseded when in the per-file graph it really is a head.)
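To make the question concrete, here is a minimal sketch in plain Python (not
the bzrlib API; is_ancestor, heads and the dict-based graph are all made up
for illustration) of deciding heads by walking parent pointers:

def is_ancestor(graph, maybe_ancestor, node):
    """Return True if maybe_ancestor is reachable from node via parents.

    graph is a dict mapping node -> tuple of parent nodes.
    """
    todo = [node]
    seen = set()
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        if current == maybe_ancestor:
            return True
        todo.extend(graph.get(current, ()))
    return False

def heads(graph, candidates):
    """Return the candidates that are not superseded by another candidate."""
    result = set(candidates)
    for a in candidates:
        for b in candidates:
            if a != b and is_ancestor(graph, a, b):
                # a is an ancestor of b, so b supersedes a
                result.discard(a)
    return result

With this, heads({'A': (), 'B': ('A',), 'C': ('A',)}, {'B', 'C'}) gives
{'B', 'C'}, while heads of {'A', 'B'} in the same graph gives just {'B'}.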
But there is also this property:
The last-modified value for a file is either the revision id of the node that
introduced (or changed) it, or the value inherited from a parent. If it is
inherited from a parent, then none of the other parents can have modified the
file.
So in the per-file graph all potentially interesting nodes have to be present.
For example, in the revision graph:
A
|\
B C
|\|
D E
|/
F
Suppose a given file in 'E' had a last-modified of 'C', and in 'D' it had a
last-modified of 'B'. Then merging E and D would say that you have 2 heads
(B, C), which seems incorrect because revision_id 'E' supersedes 'B'. But:
a) Under our current model, if the parents carry last-modified values of B and
C, then E is required to record a last-modified of E.
b) Even if you treated E as a "bzr revert file" and we left the text marked as
C without updating the record, it would just mean that you need to resolve the
B,C conflict in F instead.
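Using the heads() sketch from above, the scenario looks like this (the
last-modified values are the hypothetical ones from the example, not something
the current model would actually produce):

# The example revision graph, as parent pointers.
revision_graph = {
    'A': (),
    'B': ('A',),
    'C': ('A',),
    'D': ('B',),
    'E': ('B', 'C'),
    'F': ('D', 'E'),
}

# D's copy of the file is marked 'B', E's copy is marked 'C', so merging
# D and E asks for heads({'B', 'C'}).
print(heads(revision_graph, {'B', 'C'}))    # {'B', 'C'}: two heads, merge needed

# A per-file graph containing only the modifying revisions (assuming the
# file was introduced in A) gives the same answer for the same query.
per_file_graph = {'A': (), 'B': ('A',), 'C': ('A',)}
print(heads(per_file_graph, {'B', 'C'}))    # {'B', 'C'}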
So I haven't done a rigorous proof, but after a few hours of trying different
combinations, I wasn't able to come up with a counterexample.
For even more assurance, we could actually run tests on some of our
repositories, looking at the heads computed from each per-file graph and
comparing them to what you would get from the revision graph as a whole.
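A sketch of what that cross-check might look like (again plain Python rather
than actual bzrlib code; check_file_heads, per_file_graphs and last_modified
are hypothetical names):

def check_file_heads(revision_graph, per_file_graphs, last_modified):
    """Compare per-file-graph heads against revision-graph heads.

    revision_graph: dict node -> tuple of parents
    per_file_graphs: dict file_id -> (dict node -> tuple of parents)
    last_modified: dict (revision, file_id) -> last-modified revision
    """
    mismatches = []
    for revision, parents in revision_graph.items():
        if len(parents) < 2:
            continue    # only merges are interesting
        file_ids = set(f for (r, f) in last_modified if r in parents)
        for file_id in file_ids:
            candidates = set(last_modified[(p, file_id)]
                             for p in parents if (p, file_id) in last_modified)
            in_rev_graph = heads(revision_graph, candidates)
            in_file_graph = heads(per_file_graphs[file_id], candidates)
            if in_rev_graph != in_file_graph:
                mismatches.append((revision, file_id,
                                   in_rev_graph, in_file_graph))
    return mismatches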
That said, a per-file graph (having fewer nodes) could be significantly faster
to crawl than the whole-tree graph.
One could imagine a file being introduced in revision 1, and then not modified
until revision 10,000 (where it was modified in 2 branches simultaneously).
The per-file graph would have 3 nodes, while you would have to search through
the entire repository graph to encounter those three nodes.
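As a rough, made-up illustration of the difference in crawl cost, using the
same heads() sketch:

# A linear history of 10,000 revisions where the file only changed in
# revision 1 and then in two branches at the tip (numbers are made up).
revision_graph = {'rev-1': ()}
revision_graph.update(('rev-%d' % i, ('rev-%d' % (i - 1),))
                      for i in range(2, 10001))
revision_graph['tip-a'] = ('rev-10000',)
revision_graph['tip-b'] = ('rev-10000',)

# The per-file graph only has the three modifying revisions.
per_file_graph = {'rev-1': (), 'tip-a': ('rev-1',), 'tip-b': ('rev-1',)}

# Same answer either way, but the revision-graph walk touches ~10,000
# nodes, while the per-file walk touches 3.
print(heads(revision_graph, {'tip-a', 'tip-b'}))    # {'tip-a', 'tip-b'}
print(heads(per_file_graph, {'tip-a', 'tip-b'}))    # {'tip-a', 'tip-b'}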
More important, I guess, is the general/average case, and determining the cost
of a round trip to read the individual per-file graph versus traversing the
average whole-tree graph.
John
=:->