[MERGE] Packs. Kthxbye.

John Arbash Meinel john at arbash-meinel.com
Wed Oct 17 23:39:52 BST 2007


John Arbash Meinel has voted comment.
Status is now: Semi-approved
Comment:
General design question:
Are the indexes purely supplemental data (can they be rebuilt from the
.pack file)? I know for Knits this is not the case, but I was hoping you
had a chance to fix that.



Why did you feel the need to implement both a Knit1 format and a Knit3
format? It would make more sense to me to just have one format going
forward, though I suppose the time to convert from K1 => K3 might be a
bit slow.

+    def get_format_string(self):
+        """See RepositoryFormat.get_format_string()."""
+        return "Bazaar Experimental no-subtrees\n"
+
+    def get_format_description(self):
+        """See RepositoryFormat.get_format_description()."""
+        return "Experimental no-subtrees"

Obviously you'll want to have a name other than 'experimental', and
these strings should end up including the version of Bazaar that
introduced them. I'm sure you know; I just wanted to remind you.
Are we going to call these just 'pack'? (bzr init --format=pack)


      def insert_data_stream(self, stream):
+        """XXX What does this really do?
+
+        Is it a substitute for fetch?
+        Should it manage its own write group?
+        """

^- This is part of Andrew's new work to stream data across using the
smart server. So, AFAICT it is indeed a substitute for fetch: you ask
the source for a stream, and hand that off to the target.



+    def _find_file_ids_from_xml_inventory_lines(self, line_iterator,
+        revision_ids):
^- Was there a real need to factor this out of
fileids_altered_by_revision_ids(self, revision_ids)?
Is it re-used somewhere? Or is this just to decrease the level of
indenting?


+class Pack(object):
+    """An in memory proxy for a pack and its indices.
+
+    This is a base class that is not directly used, instead the classes
+    ExistingPack and NewPack are used.
+    """
+
...

+
+    def inventory_index_name(self, name):
+        """The inv index is the name + .iix."""
+        return self.index_name('inventory', name)

^- It would probably be good to have a

   def index_name(self, kind, name):
     raise NotImplementedError

In general, I prefer to see any functions the abstract class will use be
at least defined, since it makes the contract more obvious (you don't
have to search for all the implementations and check whether they agree
on how it should work). Actually, I don't even see an implementation of
it on ExistingPack, so I'm wondering whether these should exist on the
base Pack object at all.
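To illustrate what I mean, here is a sketch (hypothetical, not the
actual bzrlib code): the base class declares index_name() as
NotImplementedError so the contract is visible in one place, and the
suffix mapping below is my guess from the '.rix'/'.iix'/'.tix'/'.six'
offsets used elsewhere in the patch.

```python
class Pack(object):
    """An in-memory proxy for a pack and its indices (sketch)."""

    def index_name(self, kind, name):
        """Return the on-disk index name for an index of this kind.

        Declared here so readers of the base class can see the
        contract; concrete subclasses must override it.
        """
        raise NotImplementedError(self.index_name)

    def inventory_index_name(self, name):
        """The inv index is the name + .iix."""
        return self.index_name('inventory', name)


class ExistingPack(Pack):
    # Guessed kind -> suffix mapping, matching the suffixes seen in
    # the patch ('.rix', '.iix', '.tix', '.six').
    _suffixes = {'revision': '.rix', 'inventory': '.iix',
                 'text': '.tix', 'signature': '.six'}

    def index_name(self, kind, name):
        return name + self._suffixes[kind]
```

With this, calling inventory_index_name() on a bare Pack fails loudly
with NotImplementedError instead of an AttributeError deep inside some
helper.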



+class NewPack(Pack):
+    """An in memory proxy for a pack which is being created."""
...

+    def __init__(self, upload_transport, index_transport, pack_transport,
+        upload_suffix=''):
+        """Create a NewPack instance.
+
+        :param upload_transport: A writable transport for the pack to be
+            incrementally uploaded to.
+        :param index_transport: A writable transport for the pack's
+            indices to be written to when the pack is finished.
+        :param pack_transport: A writable transport for the pack to be
+            renamed to when the upload is complete. This *must* be the
+            same as upload_transport.clone('../packs').

^- Passing in 3 separate transport objects seems a bit odd to me. I
suppose this is because you want to stage the files in 'upload', rename
them into 'packs', and then write the index? Though it would seem to me
that you should upload the file and the index, and then rename them into
place (obviously the .pack before the .*ix).

Also, passing in a target transport which *must* be clone('../packs')
seems a bit odd. Why not clone it yourself if it must be in a fixed
relative position?


+        #  - refresh the pack-list to see if the pack is now absent
+        self.upload_transport.rename(self.random_name,
+                '../packs/' + self.name + '.pack')
+        self._state = 'finished'

^- I realize that this works, but I would honestly prefer it if
Transports were not allowed to use '..' for anything but cloning.
Since you are hard-coding the transport locations anyway, you could just
take an 'upload_transport', and then create:

  self.parent_transport = upload_transport.clone('..')
  self.pack_transport = self.parent_transport.clone('packs')

And then your rename becomes:

  self.parent_transport.rename(self.upload_name + self.random_name,
                               'packs/' + self.name + '.pack')

Just a thought.
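To show the shape of what I'm suggesting, here is a toy stand-in for a
Transport (hypothetical; not the real bzrlib Transport API). With only
the upload transport passed in, the other locations are derived by
cloning, and the final rename never needs '..' in its arguments:

```python
import os


class Transport(object):
    """Minimal toy transport: just enough to show clone/rename.

    Not the bzrlib API; a sketch of the layering suggested above.
    """

    def __init__(self, base):
        self.base = base

    def clone(self, relpath):
        # '..' is confined to cloning, never to rename() arguments.
        return Transport(os.path.normpath(os.path.join(self.base, relpath)))

    def rename(self, src, dst):
        os.rename(os.path.join(self.base, src), os.path.join(self.base, dst))


# Derive everything from the one transport the caller hands in:
upload = Transport('/repo/.bzr/repository/upload')
parent = upload.clone('..')       # /repo/.bzr/repository
packs = parent.clone('packs')     # /repo/.bzr/repository/packs
# The rename then stays inside parent:
#   parent.rename('upload/' + random_name, 'packs/' + name + '.pack')
```

The point is only that the constructor signature shrinks to one
transport, and the '../packs' invariant is enforced by construction
rather than documented as a *must*.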

+    def add_writable_index(self, index, pack):
+        """Add an index which is able to have data added to it.
+
+        :param index: An index from the pack parameter.
+        :param pack: A Pack instance.
+        """
+        assert self.add_callback is None

^- Why does add_callback need to be None at this point? If there is a
specific reason, it would be good to have it documented in the
docstring.

Also, it isn't 100% clear why you have an AggregatedIndex on top of a
CombinedIndex. I'm sure there is a reason, but they sure sound like they
would be the same class.


It would be nice to see a docstring at the top of
bzrlib/repofmt/pack_repo.py which defines the overall layout of a pack
repository. For example, it seems you have a 'packs' directory, an
'indexes' directory, and a directory for uploading.
Why aren't the indexes next to the packs?


+class RepositoryPackCollection(object):
+    """Management of packs within a repository."""
+
+    def __init__(self, repo, transport, index_transport, upload_transport,
+                 pack_transport):
+        """Create a new RepositoryPackCollection.
+
+        :param transport: Addresses the repository base directory
+            (typically .bzr/repository/).
+        :param index_transport: Addresses the directory containing
+            indexes.
+        :param upload_transport: Addresses the directory into which
+            packs are written while they're being created.
+        :param pack_transport: Addresses the directory of existing
+            complete packs.
+        """
+        self.repo = repo
+        self.transport = transport
+        self._index_transport = index_transport
+        self._upload_transport = upload_transport
+        self._pack_transport = pack_transport
+        self._suffix_offsets = {'.rix':0, '.iix':1, '.tix':2, '.six':3}
+        self.packs = []
+        # name:Pack mapping
+        self._packs = {}

^- You are using 'self.packs', which is a list of packs, and
'self._packs', which is a dictionary mapping pack names to Pack objects.
I would at least call them 'self.packs' and 'self._pack_map'. Why is one
public and the other private?
Why are repo, transport, packs, and the indexes public?


Why do you have:

+    def all_packs(self):
+        """Return a list of all the Pack objects this repository has.
+
+        Note that an in-progress pack being created is not returned.
+
+        :return: A list of Pack objects for all the packs in the
+            repository.
+        """
+        result = []
+        for name in self.names():
+            result.append(self.get_pack_by_name(name))
+        return result

if you made self.packs public? It seems silly to go through self.names()
and another get_pack_by_name() call for all of that.

Do we need get_pack_by_name() and names() at all?
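For example, here is a sketch of the simplification I mean (names are
my suggestions from above, e.g. '_pack_map'; this is not the actual
bzrlib code): if the name -> Pack mapping already holds every finished
pack, all_packs() can just return its values.

```python
class RepositoryPackCollection(object):
    """Sketch: management of packs within a repository."""

    def __init__(self):
        # pack name -> Pack object; in-progress packs are only added
        # here once they are finished, so all_packs() naturally
        # excludes them.
        self._pack_map = {}

    def all_packs(self):
        """Return every finished Pack, without a names() round-trip."""
        return list(self._pack_map.values())
```

One dict lookup structure, and no need for separate names() and
get_pack_by_name() traversals just to enumerate packs.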


During autopack:

+        pack_distribution = self.pack_distribution(total_revisions)
+        existing_packs = []
+        for pack in self.all_packs():
+            revision_count = pack.get_revision_count()
+            if revision_count == 0:
+                # revision less packs are not generated by normal
+                # operation, only by operations like sign-my-commits,
+                # and thus will not tend to grow rapidly or without
+                # bound like commit containing packs do - leave them
+                # alone as packing them really should group their data
+                # with the relevant commit, and that may involve
+                # rewriting ancient history - which autopack tries to
+                # avoid. Alternatively we could not group the data but
+                # treat each of these as having a single revision, and
+                # thus add one revision for each to the total revision
+                # count, to get a matching distribution.
+                continue
+            existing_packs.append((revision_count, pack))

^- This would seem to indicate that counting by number of revisions may
not be the best method. I would probably treat them as 1-count files,
though you make some decent points here.


Your "plan_autopack_combinations" seems interesting. I'm not sure the
algorithm is 100% correct. Take this example:
  existing_packs = 50, 40, 12, 7, 3
  50+40+12+7+3 = 112
  pack_distribution = 100, 10, 1, 1

I tried to walk through what it would do, but it seems to do very weird
things when trying to add them in.

Also, in the above scenario, you are doing

+            if next_pack_rev_count >= pack_distribution[0]:

But you are never decrementing pack_distribution[0], which means that in
the above you will pack everything into the '100' pack, even though it
overflows a lot. (You never get a pack > 100, so you always stay in the

+                # add the revisions we're going to add to the next output pack

portion.)

At the very least, I think when you add a pack to the current output
pack, you need to decrement the pack distribution, or something. I could
just be misunderstanding what your algorithm was trying to do.
If you are trying not to split existing packs, I would have thought
100, 10, 1, 1 would have at least tried to do
  90, 12, 10
or maybe
  112, 10
(overflowing slightly on one is better than not packing at all).

I do think the algorithm tries to pack big ones into bigger ones before
trying to pack small ones in, which is probably correct (small ones ==
recent ones, and you want recent ones to be fast to access, with small
indexes).

But I do think you need to consider a few more arrangements of packs,
and make sure it is doing what you think it is. (Maybe it is, but it
isn't doing what *I* think it should be doing.)
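To make the decrementing point concrete, here is a sketch of what I had
in mind (hypothetical names, not the actual patch code, and
pack_distribution() here is just my reading of the one-pack-per-decimal-
digit scheme): the planner pops a distribution slot each time it fills
one, so the '100' slot cannot absorb everything.

```python
def pack_distribution(total_revisions):
    """Ideal pack sizes, one per decimal digit: 112 -> [100, 10, 1, 1].

    This is my reading of the intended scheme, not the patch's code.
    """
    sizes = []
    scale = 1
    while scale * 10 <= total_revisions:
        scale *= 10
    while scale > 0:
        digit = total_revisions // scale
        sizes.extend([scale] * digit)
        total_revisions -= digit * scale
        scale //= 10
    return sizes


def plan_autopack(existing_counts, distribution):
    """Greedy sketch of the suggested fix.

    Accumulate packs (largest first) into the current output pack;
    once it reaches the current target size, emit it and *pop the
    slot*, so later packs go into later, smaller slots.
    """
    plans, current, size = [], [], 0
    targets = sorted(distribution, reverse=True)
    for count in sorted(existing_counts, reverse=True):
        current.append(count)
        size += count
        if targets and size >= targets[0]:
            plans.append(current)
            targets.pop(0)          # the step the patch seems to miss
            current, size = [], 0
    if current:
        plans.append(current)
    return plans
```

For the example above this yields groups summing to 102 and 10, i.e. it
overflows the '100' slot slightly and still leaves a small, fast pack
for recent revisions, instead of one 112-revision pack.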

+      100 every 100, 1000 every 1000

  ....

  Caching and writeing of data
  ============================

^- At the very end you have this, which is pretty obviously not correct
(though you didn't write it).


Sorry I didn't get past pack_distribution(). I'll try to review more
later.


For details, see: 
http://bundlebuggy.aaronbentley.com/request/%3C1192614619.31212.3.camel%40lifeless-64%3E


