cssc vs bzr weave
John A Meinel
john at arbash-meinel.com
Tue Oct 11 05:50:25 BST 2005
In an attempt to try and figure out what is going on, I wrote a script
which uses cssc (the open source SCCS replacement:
http://cssc.sourceforge.net/)
Basically, my idea was to try and convert the bzr inventory weave, into
an SCCS weave, just to see what the file size, insert, get, etc times
would be.
Now, I don't have any real idea how to use sccs, but I figured out the
"get -e" and "delta" commands enough that I think I'm doing the right thing.
My testing was done with version 1.0.1 (the script expects the
executables to be in the parent directory).
Anyway, what I found is that bzr weave is much faster than cssc, even
though it is written in python.
CSSC is either O(N^2) or O(e^N) in the number of revisions. It is hard
to say which, because both provide a good fit to the data (R^2 > 0.98)
See my plot here:
http://bzr.arbash-meinel.com/other/timing/CSSCvsWeave.png
Regardless, the bzr weave code is O(N), which is much better.
Now, all of this could be because of CSSC using a poor diff
implementation (Martin was talking that python's diff should be
O(text_lines^2)), I can't really say. Though it is at least O(N^2),
where N is the total number of lines in the file.
I think there is a bug in CSSC 1.0.1, because it seems to always add the
last line, even though it hasn't really changed. This may be a primary
source of the problem, or it may be secondary.
Does anyone have a better knowledge of SCCS, that could modify my script
to use sccs directly.
I tried reading up on SCCS's weave format, which apparently had more
tricks than the bzr's weave format. But at least CSSC is not a place to
look for optimal solutions.
I've tried to think of ways to try and improve bzr's O(weavelines)
behavior. It isn't that bad, but because inventory is now weaved, we
will always see pathological behavior, because the inventory is modified
*every* revision.
With the current implementation, even "bzr status" is O(num changes to
all files in entire history), because we have to extract the last
committed inventory in order to do a comparison. (Which on my machines
dwarfs the stat cost, especially with the hashcache).
With inventory, one could certainly just "start over" every 100 revs or
so, since we don't actually care about the weavelines (we won't weave
merge), we just care about a decent method for compacting the history.
This might be all we need to make bzr work with reasonable speed again.
(Probably this would mean a slightly different weave format, maybe
splitting of the revision list into a separate file, so that you could
determine what texts were labeled with what number).
But this doesn't help weaves in general, it just helps bzr in a specific
case. So I was wondering what tricks SCCS might have implemented in the
past, rather than trying to rediscover everything.
I thought about changing the in-memory representation. Potentially
keeping some sort of tree data structure, which would let you skip
regions that do not involve the lines you care about. Though I have some
difficulty coming up with a structure that doesn't end up just unweaving
every revision into memory. (Fast access, but you pay for it dearly).
I don't know if there might be something similar to my append-only
format, since you could keep a store of texts, and then just the
add/delete operations. Then rather than scanning linearly through
memory, you could just rebuild the text from those operations. (You
would probably need weave independent, not my weave-dependent,
operations.) Maybe Aaron Bentley's WeaveDiff format would be a better
starting point.
Just some thoughts (and another pretty picture),
John
=:->
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: time_convert.py
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20051010/da281809/attachment.diff
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 253 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20051010/da281809/attachment.pgp
More information about the bazaar
mailing list