[RFC] dirstate design update

Mon Feb 19 02:31:42 GMT 2007

So, dirstate is now provisionally working, but I've realised theres a
flaw in the current design. Its one that John and I discussed without
really getting detailed enough last year, and I've now got a good handle
on it.

I think we should modify dirstate to be roughly the following for each
line in the state file: Make the tuple (dirname, basename, fileid) be
the unique key (rather than (dirname, basename, fileid).

Secondly, we should make the data for each tree more consistent,
probably by factoring out the path for each parent; this will make the
state file smaller in the common case without reducing our ability to
bisect for paths in each tree, because, rather than storing the details
for parent entries under the path in the current tree, we should move
them to be at the path in their own tree, giving us a per-treeid
path->id mapping that we can bisect into. To keep the unique key unique,
we need a record in a tree that can act as a pointer to the alternative
path. Currently we have the following structure:

rows = dirname basename KIND fileid_utf8 size packed_stat fingerprint {parent_details} 
parent_details = revision_utf8 KIND dirname basename size executable fingerprint

I propose that it become (all fields in utf8):
row = key current_details {parent_details}
key = dirname basename fileid
current_details = common_details working_details
common_details = minikind fingerprint size executable
working_details = packed_stat
parent_details = common_details history_details
history_details = revisionid

In any tree, a kind of 'moved' indicates that the fingerprint field
(which we treat as opaque data specific to the 'kind' anyway) has the
details for the id of this row in that tree.

I'm strongly tempted to add a id->path index as well, but I think that
where we need id->path mapping; we also usually read the whole file, so
I'm going to skip that for the moment, as we have the ability to locate
via bisect any path in any tree, and if we lookup things by path, we can
accumulate a id->path mapping as we go, which will tend to match what we
looked for.

I plan to implement this asap, so please speak up now to alter/tweak the
design - and once we stabilise on this, I'll update the wiki page for
it.

The rationale for all this is that we want fast operations for the
common case (diff/status/commit/merge on all files) and extremely fast
operations for the less common but still occurs a lot status/diff/commit
on specific files). Operations on specific files involve a scan for all
the children of a path, *in every involved tree*, which the current
format did not accommodate. 

Cheers,
Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070219/0992c26d/attachment.pgp