[MERGE] Faster 'build_tree'

Thu Jul 26 15:49:44 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> I was poking around our knit extraction code, because it is one of our current
> bottlenecks. And I found a few areas that could use some cleanup.

I'm not sure that's the best use of your time, but okay, I'll review it.

Anyhow, I vote <resubmit>.  It needs a bit more cleanup, and possibly
refactoring.

=== modified file 'bzrlib/knit.py'
- --- bzrlib/knit.py	2007-07-20 12:56:33 +0000
+++ bzrlib/knit.py	2007-07-23 23:28:06 +0000
@@ -226,11 +226,6 @@
         lines = iter(lines)
         next = lines.next

- -        cache = {}
- -        def cache_and_return(line):
- -            origin, text = line.split(' ', 1)
- -            return cache.setdefault(origin, origin), text

^^^ Dead code elimination?

- -
         # walk through the lines parsing.
         for header in lines:
             start, end, count = [int(n) for n in header.split(',')]
@@ -238,6 +233,27 @@
             result.append((start, end, count, contents))
         return result

+    def parse_line_delta_no_annotations(self, lines):
+        """Convert the delta lines into real lines, but ignore annotations.
+
+        line delta is in the form of:
+        intstart intend intcount
+        1..count lines:
+        revid(utf8) newline\n
+        internal representation is
+        (start, end, count, [1..count line)])
+        """
+        result = []
+        lines = iter(lines)
+        next = lines.next
+
+        # walk through the lines parsing.
+        for header in lines:
+            start, end, count = [int(n) for n in header.split(',')]
+            contents = [next().split(' ', 1)[1] for i in xrange(count)]
+            result.append((start, end, count, contents))
+        return result
+
     def get_fulltext_content(self, lines):
         """Extract just the content lines from a fulltext."""
         return (line.split(' ', 1)[1] for line in lines)
@@ -307,6 +323,18 @@
     def parse_line_delta(self, lines, version_id):
         return list(self.parse_line_delta_iter(lines, version_id))

+    def parse_line_delta_no_annotations(self, lines):
+        cur = 0
+        num_lines = len(lines)
+        result = []
+        while cur < num_lines:
+            header = lines[cur]
+            cur += 1
+            start, end, c = [int(n) for n in header.split(',')]
+            result.append((start, end, c, lines[cur:cur+c]))
+            cur += c
+        return result

^^^ It would aid comprehension and comparison if both of these were
written in the same style.  KnitAnnotateFactory.parse_line_delta is
index-based, KnitAnnotateFactory.parse_line_delta_no_annotations
iterator-based, KnitPlainFactory.parse_line_delta_iter is index-based,
and KnitPlainFactory.parse_line_delta_no_annotations is index-based.

You might also see a performance improvement from avoiding iteration.

+    def _get_text_map(self, version_ids):
+        """Produce maps of text and KnitContents

^^^ This doesn't produce a map of KnitContents.

Also, if we have_get_text_map, does _get_contents_map need to return a
text map?  I think we can remove that.

Finally, if we do that, I think we may be able to make a parametrizable
method that either does get_text_map or get_contents_map.

+        no_newlines = []
+        raw_text_map = {}
+        text_map = {}
+        for version_id in version_ids:
+            components = []
+            cursor = version_id
+            while cursor is not None:
+                method, data, digest, next = record_map[cursor]
+                components.append((cursor, method, data, digest))
+                if cursor in raw_text_map:
+                    break
+                cursor = next

^^^ Ah, this brings back memories.  Perhaps a comment explaining this
creates a set of instructions for building the target version?  Or since
it's common, maybe we can factor it out?

+            raw_text = None
+            for component_id, method, data, digest in reversed(components):
+                if component_id in raw_text_map:
+                    raw_text = raw_text_map[component_id]
+                else:
+                    if method == 'fulltext':
+                        assert raw_text is None
+                        raw_text =
list(self.factory.get_fulltext_content(data))
+                    elif method == 'line-delta':
+                        assert raw_text is not None
+                        delta =
self.factory.parse_line_delta_no_annotations(data)
+                        raw_text = self._apply_delta(raw_text, delta)
+                    raw_text_map[component_id] = raw_text
+
+            if 'no-eol' in self._index.get_options(version_id):
+                text = raw_text[:] # Don't change the cached text
+                assert len(raw_text) > 0, ('We have no-eol on a text that'
+                    'has no actual lines for version_id %s in %s'
+                    % (version_id, self))
+                text[-1] = raw_text[-1].rstrip('\n')
+            else:
+                text = raw_text
+            text_map[version_id] = text
+
+            # digest here is the digest from the last applied component.
+            if sha_strings(text) != digest:
+                raise KnitCorrupt(self.filename,
+                                  'sha-1 does not match %s' % version_id)
+        return text_map

^^^ A bunch of these lines exceed 79 chars.

=== modified file 'bzrlib/revisiontree.py'
- --- bzrlib/revisiontree.py	2007-07-17 20:04:13 +0000
+++ bzrlib/revisiontree.py	2007-07-24 00:13:12 +0000
@@ -79,7 +79,9 @@

     def get_file_text(self, file_id):
         file_id = osutils.safe_file_id(file_id)
- -        return ''.join(self.get_file_lines(file_id))
+        ie = self._inventory[file_id]
+        weave = self._get_weave(file_id)
+        return weave.get_text(ie.revision)

^^^ Was this just a drive-by?  (You don't seem to use it anywhere.)

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGqLSI0F+nu1YWqI0RAtnOAJ44EAu5oQO3rAMAYR0QsT7Z8x8DxACbBhAW
aJaT7OSDZu/PurYfVjEqJnQ=
=l8+4
-----END PGP SIGNATURE-----