[RFC] bzr.jrydberg.versionedfile

Wed Dec 21 11:32:42 GMT 2005

Robert Collins <robertc at robertcollins.net> writes:

>> The reason rollback is important is that several parts of the
>> knit-design depends in the notion that changes can be rolled back in
>> the event of a fatal error.  For example, the fetcher simply merges
>> all remote revisions into the local knit, _BEFORE_ it starts to look
>> at texts and inventories.  If something goes bad during the latter
>> phases, all fetched revisions must be thrown away.
>
> Except we have no guarantee (unless we get into WALF implementations)
> that this throw away will occur. That is, all it takes is a SEGSIGV and
> the rollback will not occur. Thus the use case of 'ensure a consistent
> repository' CANNOT be satisfied with a rollback mechanism on our
> transactions.

Robert and I discussed this a bit more in detail on #bzr.  According
to Robert there are three groups of files in a repository:

 1. revisions & signatures.
 2. inventories
 3. texts

All of group 3 must be present on disk for an inventory before that
inventory can be written to disk.  All of group 2 must be present on
disk for a revision before that revision can be written to disk.

KnitFetcher violates these rules right now, assuming that the
transaction can be roll back changes to a known state, which is not
really true as Robert pointed out.  The reason this was done in the
first place was to gain performance while fetching large amounts of
revisions.

To follow the rules above, but still have acceptable performance, a
few tricks must be used.  For control knits (knits in group 1 and 2) a
in-memory knit is created, and merged with the on-disk knit.  The
in-memory knit is then also merged with the remote knit.  At this
point all entries needed is in memory, and if fetched in the most
effective way (KnitVersionedFile.join is more effective than
extracting texts).  When the time comes, the on-disk knit is merged
with the in-memory knit.  

This will give a greater memory overhead, but should be a lot safer.
Some performance will most likely be lost, but hopefully not that
much.

>> > The id that was passed in was the logical id of the weave - the file
>> > id. The thing I think I object to here is making the api too
>> > generic. I'd rather have specific calls on it for the types of
>> > things it needs to keep unique.
>> 
>> The mapping is kept using using a (store, file_id) id.
>
> Which is now the responsibility of *all* callers that want to put an
> object in the map. I'm concerned here about the ability for callers to
> skew their apis - having a semantic api to the map prevents that. Its
> fine that the key be more specific - but we should not encourage skew.

I think having a generic API is worthy goal, and the user should be
well aware if there is risk for collisions; if so, she should use a
more specific key.

>> > The remove_* functions should probably not exist either - the point of
>> > the map is to exist for the whole transaction, when things are finished
>> > with the whole map should go away when the transaction does.
>> 
>> I put in the remove_object method to handle cache invalidations.  This
>> originates from the need to flush the cache in WeaveStore.copy_file.
>> My personal opinion is that weaves should never be copied, but instead
>> always merged (using WeaveVersionedFile.join), but I did not know the
>> performance implications so I left it in.
>
> mmm, I'll need to look at this. I'll do so next time I get a good shot
> to eyeball the code.

I just think we should let copy_file go.

>> > +    def get_text(self, file_id, version_id, transaction):
>> > +        """Return a sequence of lines that makes out version-id of
>> > +        specified file."""
>> > +        file = self.get_file(file_id, transaction)
>> > +        return file.get_text(version_id)
>> > +
>> > +    get_lines = get_text
>> > +
>> >
>> > I'd expect get_text and get_lines to return (respectively) a string and
>> > a sequence of lines. Having them the same is rather, confusing.
>> 
>> I rather have get_text return a sequence of lines, and have a
>> get_string method that returns the lines concatenated.
>> 
>> The reason get_lines is there is that I am lazy and didn't want to
>> change a large set of tests.
>
> If I have a variable 'foo_text' with no extra explanation, would you
> assume that to be a string or a list of lines ? I'm pretty sure the
> former would be the case -> and thats why I'm encouring get_text return
> a string.

I disagree.  I think get_text should be the primary operation, and
that it should return a list of lines.  If you want it as a string,
we'll provide a convenience method for you.

I do not think of 'text' as a stream of bytes.  I guess that is the
difference.

~j