Rev 2622: Add performance analysis of diff in file:///home/pqm/archives/thelove/bzr/%2Btrunk/
Canonical.com Patch Queue Manager
pqm at pqm.ubuntu.com
Tue Jul 17 06:49:20 BST 2007
At file:///home/pqm/archives/thelove/bzr/%2Btrunk/
------------------------------------------------------------
revno: 2622
revision-id: pqm at pqm.ubuntu.com-20070717054918-530mjsb0mvyehe8h
parent: pqm at pqm.ubuntu.com-20070716205413-42lqws7bkld2gbju
parent: aaron.bentley at utoronto.ca-20070717002716-npegr2fup7cb1g5b
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Tue 2007-07-17 06:49:18 +0100
message:
Add performance analysis of diff
added:
doc/developers/diff.txt diff.txt-20070716233605-2q6jzorua7mr42jk-1
modified:
doc/developers/performance-roadmap.txt performanceroadmap.t-20070507174912-mwv3xv517cs4sisd-2
doc/developers/performance.dot performance.dot-20070527173558-rqaqxn1al7vzgcto-3
------------------------------------------------------------
revno: 2621.1.1
merged: aaron.bentley at utoronto.ca-20070717002716-npegr2fup7cb1g5b
parent: pqm at pqm.ubuntu.com-20070716205413-42lqws7bkld2gbju
committer: Aaron Bentley <aaron.bentley at utoronto.ca>
branch nick: bzr.docs
timestamp: Mon 2007-07-16 20:27:16 -0400
message:
Add performance analysis of diff
=== added file 'doc/developers/diff.txt'
--- a/doc/developers/diff.txt 1970-01-01 00:00:00 +0000
+++ b/doc/developers/diff.txt 2007-07-17 00:27:16 +0000
@@ -0,0 +1,110 @@
+diff Performance Analysis
+=========================
+
+.. contents:: :local:
+
+Minimal Work
+------------
+
+Reuse of historical comparisons
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A significant part of the work done by diff is sequence matching. This
+scales O(n^2) with the number of lines in the file. Therefore, it
+is worthwile to avoid content comparisons as much as possible.
+
+Our current knit format contains content comparisons, and this data can
+be converted into lists of matching blocks. Other future formats such as
+mpdiff may also support such conversion. So it is possible to reuse past
+comparisons.
+
+It is also possible to combine sequential comparisons. So given a comparison
+of "foo" to "bar", and "bar" to "baz", it is possible to derive a comparison of
+"foo" to "baz".
+
+Reuse of historical comparisons will scale with the number of uncommon
+build-parents between the two historical revisions. This will typically be
+proportional to the amount of change that the file has undergone. Therefore,
+in the common case, reuse of historical comparisons will scale with the
+amount of change.
+
+The downside of such reuse is that it ties the comparison to the historical
+data. But given the performance improvement, it seems to be worth
+consideration. Fresh comparisons can be performed if the user requests them.
+
+It may also be possible to accelerate comparisons by including annotation data,
+thus increasing the number of unique lines.
+
+Historical Tree Against Historical Tree
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This operation should be strictly proportional to the amount of change, because
+a comparison has already been done at commit time. Achieving that performance
+requires the committed data to be properly structured, so that the comparison
+can be extracted and combined with other comparisons. This comparision
+extraction should be possible at the inventory and file-content levels.
+
+Minimum work:
+
+1. Extract and combine inventory comparisons
+2. Extract and combine text comparisions for modified texts
+
+Basis Against Historical Tree
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This is another case of Historical Tree Against Historical Tree.
+
+Basis Against Basis
+~~~~~~~~~~~~~~~~~~~
+This is another case of Historical Tree Against Historical Tree.
+
+Working Tree Against Basis
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+This must scale with the number of versioned files, unless the user indicates
+that only certain files should be compared.
+
+Performance can be further improved by caching comparisons to avoid repeating
+them. Caching could potentially be performed by ``diff`` and perhaps by
+``merge``. Merge is aware of the relationship of a text merge's result to
+the THIS value, and the THIS value is generally the basis value. So the
+comparison is latent, but present. The only issue is extracting it.
+
+The cache could be indexed by sha1sum pairs. It could also be indexed by
+file-id, to facilitate removal of stale data.
+
+Minimum work:
+
+1. Scan working tree for modified files
+2. Retrieve cached comparisons
+3. Perform comparisons on files with no cached comparisons
+4. Cache comparisons for files with no cached comparisons
+
+Working Tree Against Historical Tree
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This can be structured as a comparison of working tree against basis tree,
+followed by basis tree against historical tree. Therefore, it combines the
+performance characteristics of "Working Tree Against Basis" with "Basis Against
+Historical Tree".
+
+Working Tree Against Working Tree
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This can be structured as two comparisons against basis, and one comparison
+of basis against basis. Its performance is therefore similar to Working Tree
+Against Historical Tree.
+
+API Changes
+-----------
+
+Desired API:
+
+ - Tree.get_comparision(file_id, tree)
+
+This probably entails:
+
+ - WorkingTree.store_comparison(file_id, revision_id, sha1, comparison)
+ - WorkingTree.get_comparison(file_id, revision_id, sha1)
+ - Repository.get_comparision(file_id, revision_id, revision_id)
+ - merge_comparisions(comparison, comparision)
+
+Storage considerations
+----------------------
+It must be cheap (e.g. scale with number of intermediate revisions) to perform
+comparison of two historical texts. It must be cheap to perform comparison of
+the inventories of two historical trees.
=== modified file 'doc/developers/performance-roadmap.txt'
--- a/doc/developers/performance-roadmap.txt 2007-07-02 02:31:28 +0000
+++ b/doc/developers/performance-roadmap.txt 2007-07-17 00:27:16 +0000
@@ -38,6 +38,8 @@
.. include:: commit.txt
+.. include:: diff.txt
+
.. include:: gc.txt
.. include:: revert.txt
=== modified file 'doc/developers/performance.dot'
--- a/doc/developers/performance.dot 2007-07-06 00:59:22 +0000
+++ b/doc/developers/performance.dot 2007-07-17 00:27:16 +0000
@@ -16,12 +16,12 @@
uncommit_analysis[label="Work required analysis for uncommit"];
wt_disk_order[label="Working Tree disk ordering\n6-8 weeks"];
iter_merge[label="iter_changes based merge\n2 days"];
+ diff_analysis[label="Work required analysis for diff"];
/* uncompleted node list - add new tasks here */
node[color="blue"];
log_analysis[label="Work required analysis for log"];
log_path_analysis[label="Work required analysis for log of selected paths."];
- diff_analysis[label="Work required analysis for diff"];
diff_path_analysis[label="Work required analysis for diff of selected paths"];
merge_analysis[label="Work required analysis for merge"];
missing_analysis[label="Work required analysis for missing"];
More information about the bazaar-commits
mailing list