[MERGE] Eliminate the use of VersionedFile.join when fetching data.
Andrew Bennetts
andrew at canonical.com
Fri May 2 08:21:33 BST 2008
Robert Collins wrote:
> On Fri, 2008-05-02 at 14:57 +1000, Andrew Bennetts wrote:
[...]
> > > === modified file 'bzrlib/knit.py'
> > > + if build_details[0] == 'line-delta':
> > > + kind = 'delta'
> > > + else:
> > > + kind = 'ft'
> > > + if annotated:
> > > + annotated_kind = 'annotated-'
> > > + else:
> > > + annotated_kind = ''
> > > + self.storage_kind = 'knit-%s%s-gz' % (annotated_kind, kind)
> >
> > I wonder a little if a tuple would be better than a str for storage_kind.
>
> Perhaps, but it would be a little less opaque and I think that would be
> unhelpful.
Fair enough.
> > I'm curious about why this is a sequence of one element tuples, rather than just
> > of strings.
>
> Compatibility with VersionedFiles (which is coming up), which uses tuple
> keys everywhere. VersionedFile only has one element in its keyspace,
> which is why it used strings.
Ok.
> > > + :param ordering: Either 'unordered' or 'topological'. A topologically
> > > + sorted stream has compression parents strictly before their
> > > + children.
> > > + :param include_delta_closure: If True then the closure across any
> > > + compression parents will be included (in the opaque data).
> > > + :return: An iterator of ContentFactory objects, each of which is only
> > > + valid until the iterator is advanced.
> > > + """
> >
> > You use the term “opaque data” in this docstring without really explaining what
> > you mean by it.
>
> Hmm. What I mean is that whatever magic is needed is not shown in the
> public api.
Saying “magic” isn't really any more helpful than saying “opaque”.
The docstring should say how this flag affects the resulting stream, e.g. that
the resulting stream is guaranteed to be usable in situations where it
otherwise would not be.
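For example, something roughly like this (just a sketch of the sort of wording
I mean; the UnavailableRepresentation detail is my guess at what happens when
the closure is omitted, so correct it if I'm wrong):

:param include_delta_closure: If True, the stream will also include any
    compression parents needed, so that get_bytes_as('fulltext') is
    guaranteed to work for every record returned. If False, records whose
    compression parents are not in the stream may raise
    UnavailableRepresentation when a fulltext is requested.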
> > > + # Double index lookups here : need a unified api ?
> > > + parent_map = self.get_parent_map(versions)
> >
> > Where? Do you mean an index lookup has already happened by the time
> > get_record_stream is called, and now get_record_stream is doing another one?
>
> No. get_parent_map and the following lines both end up querying the
> index.
Ah ok. The comment should be specific about the lines: unlike you I don't (yet)
have deeply embedded in my brain exactly which lines hit the index. :)
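Even something as simple as this would do (a sketch; tweak the wording to
match what the code actually does):

# XXX: get_parent_map and _get_components_positions below both end up
# querying the index; a unified API would let us do this in one lookup.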
> > > + position_map = self._get_components_positions(present_versions)
> > > + # c = component_id, r = record_details, i_m = index_memo, n = next
> >
> > Which code is this comment meant to be a legend for?
>
> I don't know - you've stripped too much content as far as I can tell.
In get_record_stream. (It was the only line like that in the entire diff, which
is why I thought it would be clear with minimal context.)
> > > + records = [(version, position_map[version][1]) for version in
> > > + present_versions]
> > > + record_map = {}
> > > + for version in absent_versions:
> > > + yield AbsentContentFactory((version,))
> >
> > Why not yield the absent_versions as soon as you calculate them?
>
> Do you mean just move this block higher up? We could; I don't think it
> makes any difference.
No functional difference, but it'd be a bit clearer. There are a lot of things
going on in this function, and the less state that leaks across sections of it,
the easier it will be to grok.
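i.e. something like this (a sketch using the names from the diff; the trailing
comment just marks where the rest of the function would carry on):

# Yield the absent versions first, so no state about them leaks into the
# rest of the function.
for version in absent_versions:
    yield AbsentContentFactory((version,))
# ... then build position_map, records etc. for present_versions only.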
> > > + if self.factory.annotated:
> > > + # self is annotated, we need annotated knits to use directly.
> > > + annotated = "annotated-"
> > > + convertibles = []
> > > + else:
> > > + # self is not annotated, but we can strip annotations cheaply.
> > > + annotated = ""
> > > + convertibles = set(["knit-annotated-delta-gz",
> > > + "knit-annotated-ft-gz"])
> > > + native_types = set()
> > > + native_types.add("knit-%sdelta-gz" % annotated)
> > > + native_types.add("knit-%sft-gz" % annotated)
> > > + knit_types = native_types.union(convertibles)
> >
> > It might be good to extract this section into its own method.
> > insert_record_stream is quite long.
> >
> > What exactly is the distinction between a “native” type and a “knit” type?
> > Native seems obvious enough: types that this knit vf uses itself (and can thus
> > insert directly from a stream (assuming basis parents are present for a delta)),
> > but “knit” types seem a bit fuzzier. They are types that are cheaply
> > translatable to native types, I guess? “knit_types” doesn't seem like a great
> > name for that concept.
>
> They are the entire set of knit types. When the encoding matches that of
> self then it's a native type; when it doesn't, then as long as we can
> translate the record with no additional data it's able to be optimised.
> Improved name suggestions are welcome.
“they are the entire set of knit types” is not a great way to explain the
meaning of “knit_types” :)
You seem to be saying here that it's a global thing: there's a fixed set of
types that are knit types. But the code says otherwise: it varies depending on
the instance of the KnitVersionedFile, specifically whether it's annotated or
not. The key thing that isn't spelled out clearly in the code is the “we can
translate the record with no additional data” part.
A wordy name might be “trivially_compatible_types”. A shorter name would be ok
(maybe even the current name) if it were prefixed with a comment like “#
foo_types is the set of types we can translate without requiring any additional
data.”
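For example (a sketch; “trivially_compatible_types” is only a suggestion, and
the surrounding names are the ones from the diff):

# trivially_compatible_types is the set of storage kinds we can insert here
# without requiring any additional data: our native kinds plus the kinds we
# can cheaply translate into them.
trivially_compatible_types = native_types.union(convertibles)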
>
>
> > > + # Buffered index entries that we can't add immediately because their
> > > + # basis parent is missing. We don't buffer all because generating
> >
> > That sentence no verb. ;)
> >
> > Maybe it should read “buffered_index_entries holds entries that...”?
> >
> > (Short comments like “foo = {} # widgets to frob” often make perfectly good
> > sense, but an incomplete sentence in a comment that is otherwise a full
> > paragraph breaks my mental parser.)
>
> "buffer all index entries" is the only change needed there. I don't see
> an incomplete sentence.
Sure, s/Buffered index entries/Buffer all index entries/ fixes the sentence.
> > > + # annotations may require access to some of the new records. However we
> > > + # can't generate annotations from new deltas until their basis parent
> > > + # is present anyway, so we get away with not needing an index that
> > > + # reports on the new keys.
> >
> > I don't quite follow this last sentence. It's not immediately obvious to me why we
> > might have needed “an index that reports on the new keys”, or what that means.
>
> Because we may need to generate fulltexts using inserted data, and that
> means it needs to show up in the index for the current knit.
Hmm. I think the bit that I'm stuck on is “an index that reports on ...”: that
verb just doesn't go with that noun in my head. If it were “an index that
includes ...” then it would make sense to me.
> > # pseudo-code
> > adapter = get_best_adapter(record.storage_kind, native_types)
> > compatible_record = adapter.adapt(record)
> > self.insert_record(compatible_record)
> >
> > Which is what I had been expecting.
>
> In a .pack repository this will work fine (it will throw on an unordered
> stream where we need fulltexts during insertion, but packs are not
> annotated so don't need that). But this code is currently used for knits
> and packs. In VersionedFiles I will likely have a good home to split it
> out into.
Ah, so eventually the packs code will be neater than this? Nice.
> > > + elif record.storage_kind == 'fulltext':
> > > + self.add_lines(record.key[0], parents,
> > > + split_lines(record.get_bytes_as('fulltext')))
> > > + else:
> > > + adapter_key = record.storage_kind, 'fulltext'
> > > + adapter = get_adapter(adapter_key)
> > > + lines = split_lines(adapter.get_bytes(
> > > + record, record.get_bytes_as(record.storage_kind)))
> > > + try:
> > > + self.add_lines(record.key[0], parents, lines)
> > > + except errors.RevisionAlreadyPresent:
> > > + pass
> >
> > I don't really understand why the record-is-full-text and
> > record-is-adaptable-to-fulltext cases are slightly different to each other.
>
> Going through the adapter is slightly different. The try:except: block
> probably needs to be present in both sides, which suggests room for
> improving testing, and I have been thinking that get_bytes_as should be
> 'get_bytes' with no parameters. I'm not sure how to avoid abstraction
> violations for network streams at that point though :(.
Well, as long as this code is known to be not as neat as it ought to be, I'm
fine with an XXX comment saying so for now.
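If you did want to tidy it rather than leave an XXX, something roughly like
this is what I had in mind (a sketch only, not tested):

if record.storage_kind == 'fulltext':
    bytes = record.get_bytes_as('fulltext')
else:
    adapter = get_adapter((record.storage_kind, 'fulltext'))
    bytes = adapter.get_bytes(record,
        record.get_bytes_as(record.storage_kind))
# The adapter lookup is now the only difference between the two cases, and
# both go through the same RevisionAlreadyPresent handling.
try:
    self.add_lines(record.key[0], parents, split_lines(bytes))
except errors.RevisionAlreadyPresent:
    pass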
> > > + def test_get_record_stream_interface(self):
> > > + """each item in a stream has to provide a regular interface."""
> >
> > Just a nitpick: sentences ought to start with a captial letter.
>
> ITYM Capital :).
I was tempted :)
> > > + f, parents = get_diamond_vf(self.get_file())
> > > + entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> > > + 'unordered', False)
> >
> > Why use this particular subset of versions? (ObXUnitPatterns: Mystery Guest)
>
> I will comment.
I'm looking forward to finding out what that comment says :)
> > > + def test_get_record_stream_interface_ordered(self):
> > > + """each item in a stream has to provide a regular interface."""
> >
> > Ideally I think different test methods would have different docstrings...
>
> These test methods I'd rather not docstring at all; the name is IMNSHO
> sufficient, but we have a coding standard to docstring them all. My
> ingenuity ran out.
(Personally, I think the standard should be basically the same for test code as
not test code: if it's clear without a docstring, then it shouldn't be
necessary. The difference with tests is that it's extra important that the
intent of a test method is clear. In well-organised test code a TestCase with a
clear purpose can easily not require docstrings on each method, IMO.)
In this case, just say “ordered stream” rather than “stream”.
> > > + def test_get_record_stream_interface_ordered_with_delta_closure(self):
> > > + """each item in a stream has to provide a regular interface."""
> > > + f, parents = get_diamond_vf(self.get_file())
> > > + entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> > > + 'topological', True)
> > > + seen = []
> > > + for factory in entries:
> > > + seen.append(factory.key)
> > > + self.assertValidStorageKind(factory.storage_kind)
> > > + self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
> > > + self.assertEqual(parents[factory.key[0]], factory.parents)
> > > + self.assertEqual(f.get_text(factory.key[0]),
> > > + factory.get_bytes_as('fulltext'))
> > > + self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
> > > + str)
> >
> > You appear to have a half-done refactoring here. Re-use capture_stream, maybe
> > by adding a parameter for a function to run on the factory. Is the extra
> > assertion this does vs. capture_stream significant?
>
> Very. There is no guarantee that get_bytes_as('fulltext') will work
> without passing True to the get_record_stream call.
I thought it would be important. :)
I almost didn't realise it was there, because it looked so much like verbatim
copy-and-paste code! (Hence my asking for a refactoring to remove the
duplication.)
> > Also, you don't appear to have any tests anywhere that delta_closure=True gives
> > different results to False.
>
> That's because it's allowable for it not to give different results at this
> point. When someone (you :)) adds serialised streams, I would expect a
> parameterised version of this test that gets the versioned file to test
> from a smart server, and in that case a specific test that
> delta_closure=False makes fulltexts inaccessible.
Ah, hmm. Fair enough.
> > > + def test_get_record_stream_unknown_storage_kind_raises(self):
> > > + """Asking for a storage kind that the stream cannot supply raises."""
> > > + f, parents = get_diamond_vf(self.get_file())
> > > + entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> > > + 'unordered', False)
> > > + # We track the contents because we should be able to try, fail a
> > > + # particular kind and then ask for one that works and continue.
> > > + seen = set()
> > > + for factory in entries:
> > > + seen.add(factory.key)
> > > + self.assertValidStorageKind(factory.storage_kind)
> > > + self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
> > > + self.assertEqual(parents[factory.key[0]], factory.parents)
> > > + # currently no stream emits mpdiff
> > > + self.assertRaises(errors.UnavailableRepresentation,
> > > + factory.get_bytes_as, 'mpdiff')
> > > + self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
> > > + str)
> > > + self.assertEqual(set([('base',), ('left',), ('right',), ('merged',)]),
> > > + seen)
> >
> > Definitely extend capture_stream to make this neater. The key assertion of this
> > test (the assertRaises) is buried by boilerplate.
>
> Review-Meta: When suggesting something that is likely 'obvious' please
> sketch out 'how' to avoid round trips. 'Write better prose' << 'Your
> paragraph will be more clear if you remove the double negatives and
> start with the current closing sentence.'
Sorry about that. I didn't want to dictate exactly what the code should be,
partly because as a reviewer I don't mind how the issues are addressed (so long
as they are addressed), but also to avoid insulting your intelligence. I seem
to have misjudged the balance in this instance.
In this case, I see that currently there's a fixed list of assertions, and
sometimes some others, used in a “for factory in entries” loop. So if
capture_stream was more like:
def capture_stream(self, f, entries, on_seen, parents,
        extra_assertions=()):
    for factory in entries:
        ...
        for assertion in extra_assertions:
            assertion(factory)
Then your test case would be (comment trimmed for brevity):
f, parents = get_diamond_vf(self.get_file())
entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
    'unordered', False)
seen = set()
self.capture_stream(f, entries, seen.add, parents,
    [lambda fac: self.assertRaises(
        errors.UnavailableRepresentation, fac.get_bytes_as, 'mpdiff')])
self.assertEqual(set([('base',), ('left',), ('right',), ('merged',)]),
    seen)
Not the most beautiful code in the world, but it helps keep the interesting bits
of the code up-front and centre in the test.
Hmm, here's a thought, less magic and probably better would be:
f, parents = get_diamond_vf(self.get_file())
entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
    'unordered', False)
seen = set()
for factory in entries:
    seen.add(factory.key)
    self.assertFactoryIsSane(factory, f, parents)
    self.assertRaises(errors.UnavailableRepresentation,
        factory.get_bytes_as, 'mpdiff')
self.assertEqual(set([('base',), ('left',), ('right',), ('merged',)]),
    seen)
assertFactoryIsSane would be:
def assertFactoryIsSane(self, factory, f, parents):
    self.assertValidStorageKind(factory.storage_kind)
    self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
    self.assertEqual(parents[factory.key[0]], factory.parents)
    self.assertIsInstance(factory.get_bytes_as(factory.storage_kind), str)
(And probably assertFactoryIsSane would have a better name.)
That's all I had in mind: refactoring the test code so that the interesting
parts (e.g. where one test differs from the next) are easy to see, rather than
lost in a mass of copy-and-pasted lines.
> > > + def test_get_record_stream_missing_records_are_absent(self):
> > > + f, parents = get_diamond_vf(self.get_file())
> > > + entries = f.get_record_stream(['merged', 'left', 'right', 'or', 'base'],
> >
> > Shouldn't it be “origin”, not “or”? Oh, no, that's intentional. Please use a
> > more obvious value, like “absent-key”.
> >
> > Also, I don't understand why you need to use all of merged, left, right and base
> > here.
>
> To guard against stupid code. Such as only returning the absent key.
That's fine (I agree that the practicality of doing that here outweighs the
purity of testing strictly one condition per test), but the intent should be
clear in the test. So it should explicitly say something like “# We ask for a
stream with a mix of absent and present versions to make sure there aren't dumb
bugs like only returning the absent key.”
> > > + def test_insert_record_stream_fulltexts_noeol(self):
> > > + """Any file should accept a stream of fulltexts."""
> >
> > The docstring ought to mention the noeols aspect of this test.
>
> Yes :(. I reiterate my objection to redundant docstrings in tests.
Ok.
> > > +class TestContentFactoryAdaption(TestCaseWithMemoryTransport):
> > > +
> >
> > > + def helpGetBytes(self, f, ft_adapter, delta_adapter):
> > > + """grab the interested adapted texts for tests."""
> >
> > Capital letter at the start of a sentence. Also, what are “interested adapted
> > texts”?
>
> interesting ;).
I was more interested in seeing what makes them interesting described
explicitly :)
> > > === modified file 'bzrlib/versionedfile.py'
> > [...]
> > > +class RecordingVersionedFileDecorator(object):
> > > + """A minimal versioned file that records calls made on it.
> > > +
> > > + Only enough methods have been added to support tests using it to date.
> >
> > (In my ideal world, this sort of class would live under bzrlib.tests somewhere,
> > rather than in the production code. That's not the policy atm though, so don't
> > worry about it.)
>
> I think it would be a mistake to put code like this in bzrlib.tests.
Well, that's a separate discussion, and one we can completely ignore for now. :)
I'm in no hurry to change the status quo; I was just wishing out loud.
-Andrew.