Rev 3897: Change _LazyGroupContentManager.get_record_stream() to invalidate the nodes. in http://bazaar.launchpad.net/%7Ebzr/bzr/brisbane-core

John Arbash Meinel john at arbash-meinel.com
Mon Mar 23 20:13:04 GMT 2009


At http://bazaar.launchpad.net/%7Ebzr/bzr/brisbane-core

------------------------------------------------------------
revno: 3897
revision-id: john at arbash-meinel.com-20090323201046-ek580vnq69i270lp
parent: john at arbash-meinel.com-20090321032222-n2wbqe0ozhhizwxm
parent: john at arbash-meinel.com-20090323200442-10qwt4ws636wwjwl
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: brisbane-core
timestamp: Mon 2009-03-23 15:10:46 -0500
message:
  Change _LazyGroupContentManager.get_record_stream() to invalidate the nodes.
  This means that records are genuinely only valid until you go to the next one.
  This breaks the reference cycle with the manager and drops peak memory by a
  decent amount (otherwise we have to wait until gc.collect() runs to reclaim
  the cycle).
  Also, change get_bytes_as() so that it caches the bytes (though this cache is
  likewise cleared when we move on to the next record).
  This is mostly useful for CHK pages, which get processed at several layers:
  the get_record_stream() code uses the content to determine what to send next,
  and insert_record_stream() needs the raw bytes to put into the target repo.
  This saves an extract() on every CHK page during 'pack'. apply_delta is
  rather fast, but there is no reason to waste the calls.
modified:
  bzrlib/groupcompress.py        groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
  bzrlib/repofmt/groupcompress_repo.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 3893.1.3
    revision-id: john at arbash-meinel.com-20090323200442-10qwt4ws636wwjwl
    parent: john at arbash-meinel.com-20090323032950-lmbrocu79l90dqn5
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: refcycles
    timestamp: Mon 2009-03-23 15:04:42 -0500
    message:
      Cleanup, in preparation for merging to brisbane-core.
    modified:
      bzrlib/groupcompress.py        groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      bzrlib/lru_cache.py            lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
      bzrlib/repofmt/groupcompress_repo.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 3893.1.2
    revision-id: john at arbash-meinel.com-20090323032950-lmbrocu79l90dqn5
    parent: john at arbash-meinel.com-20090320150205-kcmh70biyo76p0kn
    parent: john at arbash-meinel.com-20090321032222-n2wbqe0ozhhizwxm
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: refcycles
    timestamp: Sun 2009-03-22 22:29:50 -0500
    message:
      Merge brisbane-core tip, resolve differences.
      Finish making the various get_record_stream() calls break reference
      cycles and release memory after yielding each record.
    modified:
      bzrlib/groupcompress.py        groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      bzrlib/repofmt/groupcompress_repo.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
      bzrlib/tests/test_groupcompress.py test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
    ------------------------------------------------------------
    revno: 3893.1.1
    revision-id: john at arbash-meinel.com-20090320150205-kcmh70biyo76p0kn
    parent: john at arbash-meinel.com-20090320032107-bm9wg421rtcacy5i
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: refcycles
    timestamp: Fri 2009-03-20 10:02:05 -0500
    message:
      Some testing to see if we can decrease the peak memory consumption a bit.
      It looks like we can; it just needs some more performance work, etc.
    modified:
      bzrlib/groupcompress.py        groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      bzrlib/lru_cache.py            lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
      bzrlib/repofmt/groupcompress_repo.py repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
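
The invalidation pattern described in the top message, in rough outline. This
is a minimal sketch, not the real bzrlib classes: _Factory and _Manager are
simplified stand-ins for the record factory and _LazyGroupContentManager shown
in the diff below, and block extraction is abbreviated to a dict lookup.

class _Factory(object):
    """One lazily-extracted record, valid only until the next is yielded."""

    def __init__(self, key, manager):
        self.key = key
        self._manager = manager
        self._bytes = None

    def get_bytes_as(self, storage_kind):
        if self._bytes is None:
            # Extract once and cache, then drop the manager reference so
            # this factory no longer keeps the whole block alive.
            self._bytes = self._manager.extract(self.key)
            self._manager = None
        return self._bytes


class _Manager(object):
    """Holds one block and the factories that point into it."""

    def __init__(self, block):
        self._block = block  # here simply {key: bytes}
        self._factories = []

    def add_factory(self, key):
        self._factories.append(_Factory(key, self))

    def extract(self, key):
        return self._block[key]

    def get_record_stream(self):
        for factory in self._factories:
            yield factory
            # Invalidate once the consumer moves on: break the
            # factory<->manager cycle and free the cached bytes now,
            # rather than waiting for gc.collect() to find the cycle.
            factory._bytes = None
            factory._manager = None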
-------------- next part --------------
=== modified file 'bzrlib/groupcompress.py'
--- a/bzrlib/groupcompress.py	2009-03-21 03:22:22 +0000
+++ b/bzrlib/groupcompress.py	2009-03-23 20:04:42 +0000
@@ -339,8 +339,6 @@
         :param sha1: TODO (should we validate only when sha1 is supplied?)
         :return: The bytes for the content
         """
-        # Handle the 'Empty Content' record, even if we don't always write it
-        # yet.
         if start == end == 0:
             return ''
         self._ensure_content(end)
@@ -477,6 +475,7 @@
         #       get_bytes_as call? After Manager.get_record_stream() returns
         #       the object?
         self._manager = manager
+        self._bytes = None
         self.storage_kind = 'groupcompress-block'
         if not first:
             self.storage_kind = 'groupcompress-block-ref'
@@ -496,15 +495,20 @@
             else:
                 return ''
         if storage_kind in ('fulltext', 'chunked'):
-            self._manager._prepare_for_extract()
-            block = self._manager._block
-            bytes = block.extract(self.key, self._start, self._end)
+            if self._bytes is None:
+                # Grab and cache the raw bytes for this entry
+                # and break the ref-cycle with _manager since we don't need it
+                # anymore
+                self._manager._prepare_for_extract()
+                block = self._manager._block
+                self._bytes = block.extract(self.key, self._start, self._end)
+                self._manager = None
             if storage_kind == 'fulltext':
-                return bytes
+                return self._bytes
             else:
-                return [bytes]
+                return [self._bytes]
         raise errors.UnavailableRepresentation(self.key, storage_kind,
-            self.storage_kind)
+                                               self.storage_kind)
 
 
 class _LazyGroupContentManager(object):
@@ -531,6 +535,9 @@
         """Get a record for all keys added so far."""
         for factory in self._factories:
             yield factory
+            # Break the ref-cycle
+            factory._bytes = None
+            factory._manager = None
         # TODO: Consider setting self._factories = None after the above loop,
         #       as it will break the reference cycle
 
@@ -1281,6 +1288,7 @@
         for key in missing:
             yield AbsentContentFactory(key)
         manager = None
+        last_read_memo = None
         # TODO: This works fairly well at batching up existing groups into a
         #       streamable format, and possibly allowing for taking one big
         #       group and splitting it when it isn't fully utilized.
@@ -1295,39 +1303,39 @@
                 for key in keys:
                     if key in self._unadded_refs:
                         if manager is not None:
-                            # Yield everything buffered so far
                             for factory in manager.get_record_stream():
                                 yield factory
-                            manager = None
+                            last_read_memo = manager = None
                         bytes, sha1 = self._compressor.extract(key)
                         parents = self._unadded_refs[key]
                         yield FulltextContentFactory(key, parents, sha1, bytes)
                     else:
                         index_memo, _, parents, (method, _) = locations[key]
-                        block = self._get_block(index_memo)
+                        read_memo = index_memo[0:3]
+                        if last_read_memo != read_memo:
+                            # We are starting a new block. If we have a
+                            # manager, we have found everything that fits for
+                            # now, so yield records
+                            if manager is not None:
+                                for factory in manager.get_record_stream():
+                                    yield factory
+                            # Now start a new manager
+                            block = self._get_block(index_memo)
+                            manager = _LazyGroupContentManager(block)
+                            last_read_memo = read_memo
                         start, end = index_memo[3:5]
-                        if manager is None:
-                            manager = _LazyGroupContentManager(block)
-                        elif manager._block is not block:
-                            # Flush and create a new manager
-                            for factory in manager.get_record_stream():
-                                yield factory
-                            manager = _LazyGroupContentManager(block)
                         manager.add_factory(key, parents, start, end)
             else:
                 if manager is not None:
-                    # Yield everything buffered so far
                     for factory in manager.get_record_stream():
                         yield factory
-                    manager = None
+                    last_read_memo = manager = None
                 for record in source.get_record_stream(keys, ordering,
                                                        include_delta_closure):
                     yield record
         if manager is not None:
-            # Yield everything buffered so far
             for factory in manager.get_record_stream():
                 yield factory
-            manager = None
 
     def get_sha1s(self, keys):
         """See VersionedFiles.get_sha1s()."""

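The restructured loop above batches on index_memo[0:3], the "read memo" that
names the on-disk block, instead of comparing block objects, which also lets
it skip _get_block() when the next key lives in the block it already holds.
A sketch of just that batching step, assuming a simplified locations mapping
of key -> index_memo:

def batch_by_read_memo(locations, keys):
    # Group consecutive keys whose read memo matches, so one manager can
    # serve every record in a block before the block is released.
    last_read_memo = None
    batch = []
    for key in keys:
        read_memo = locations[key][0:3]
        if read_memo != last_read_memo:
            if batch:
                yield last_read_memo, batch
            batch = []
            last_read_memo = read_memo
        batch.append(key)
    if batch:
        yield last_read_memo, batch
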
=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- a/bzrlib/repofmt/groupcompress_repo.py	2009-03-20 15:53:00 +0000
+++ b/bzrlib/repofmt/groupcompress_repo.py	2009-03-23 20:04:42 +0000
@@ -302,7 +302,7 @@
                 # next_keys = next_keys.intersection(remaining_keys)
                 cur_keys = []
                 for prefix in sorted(keys_by_search_prefix):
-                    cur_keys.extend(keys_by_search_prefix[prefix])
+                    cur_keys.extend(keys_by_search_prefix.pop(prefix))
         for stream in _get_referenced_stream(self._chk_id_roots,
                                              self._gather_text_refs):
             yield stream

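The one-line groupcompress_repo.py change swaps a plain lookup for dict.pop(),
so each bucket of keys is released as soon as it has been queued instead of
living until the whole dict is dropped. Illustratively (drain_sorted is a
made-up name, not bzrlib code):

def drain_sorted(buckets):
    # sorted(buckets) snapshots the keys first, so it is safe to pop while
    # iterating; each value list becomes garbage as soon as it has been
    # consumed, so peak memory tracks the largest bucket rather than the
    # sum of all of them.
    out = []
    for prefix in sorted(buckets):
        out.extend(buckets.pop(prefix))
    return out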

