RFC: scaling-inventory rework current thoughts

John Arbash Meinel john at arbash-meinel.com
Wed Aug 13 23:36:07 BST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Mon, 2008-08-11 at 17:38 -0500, John Arbash Meinel wrote:
> 
>>> Going back to basics then, for serialisation scaling, we want something
>>> that:
>>>  1 allows commit to write less data than the full tree most of the time
>>>  2 allows the data to be written to be calculated without processing
>>>    size(tree) information [this lets importers and named-file commits
>>>    be faster, and allows a nice coupling between workingtree (all-data
>>>    by nature) and repositories.
>>>  3 generates the same exact representation for a given inventory
>>>    regardless of the amount of history available
>> I think there is some flexibility here between the logical form, and the
>> stored form on disk. For example, if we have deltas be optional in the
>> stream. So converters are welcome to generate deltas directly, but
>> inserting full versions in their place would be acceptable as well. (As
>> I understand your "journaling" system, this was not the case, as the
>> delta text was the canonical form.)
> 
> I agree that the storage subsystem might do a binary delta to store
> things; but actually I think we want to control this: lets say I have a
> 100K path tree split into 5K elements - so 20 separate serialised
> components, if the storage layer then does deltas which add latency like
> knits do - 100 elements on average, that's 200 IOs to read in the
> inventory, rather than 20 which the higher level code will be
> anticipating.
> 
> So we should analyse the full stack and be clear about the impact of
> such things. Additionally, if the storage layer *needs* to do binary
> deltas to have efficient storage, then we undo the goal of doing less
> work to some degree. So it needs to be considered holistically.

I agree we need to consider it holistically. I would bring up an item 10, then:
transmitting and storing the inventory needs to be bounded on average.

It has certainly been an issue that we've had inventory.knit dominate the
overall size of a repository. And unfortunately, that sort of information is
really hard to tease out of pack repositories. (I think the indexes have the
information, but we would have to poke at them and add them together, etc.)

Anyway, I don't think we should require binary deltas, etc. But we should make
specific mention of the size of the stored inventory (and the size for
transmission, if they are separate).
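The holistic-cost point can be made concrete with a toy read model
(illustrative only; real IO counts depend on how the storage layer lays out
and coalesces delta chains):

```python
def reads_for_inventory(num_fragments, avg_chain_length):
    """Toy cost model: rebuilding each fragment from a delta chain costs
    roughly one read per chain element, so the total read count scales
    with both the fragment count and the chain length."""
    return num_fragments * avg_chain_length
```

With Rob's 20 fragments, fulltext storage needs ~20 reads; put each fragment
behind a knit-style chain and the count multiplies by the chain length, which
is why the storage layer's delta policy has to be weighed against the
higher-level fragmentation design rather than in isolation.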

> 
>>>  4 allows a delta to be generated *from the serialised form* with work 
>>>    independent of the actual size of the tree - and ideally ~= the
>>>    actual changes involved. (The emphasis on the serialised form is
>>>    simply that upcasting to an in memory structure requiring the entire
>>>    inventory data to be read would defeat the point, even if the in
>>>    memory logic was very cheap).
>> Are you requesting compositing here? (Given 4 deltas, combine them into
>> an overall delta).
> 
> No. Given two serialised inventories, get a delta.
> 
>> Perhaps a better way of phrasing this, would be that you can map between
>>  logical/serialized and fulltext/delta without going to the other.
> 
> I haven't suggested in this document that we use deltas per se for
> storage at all - rather than we need to take deltas in and output
> deltas, or something similar anyhow. (Because we don't want to work in
> size-of-tree datasets).
> 
>> LF ↔ LD
>>  ↕   ↕
>> SF ↔ SD
>>
>> So if you had 2 serialized fulltexts, you could generate a serialized
>> delta without converting to logical fulltext and then logical deltas,
>> and then back down to serialized deltas.
> 
> This implies we have serialized deltas, I don't think that we can do
> that sanely given my constraint 3.
> 
>>>  5 lets bzr determine what we need to pull without having to probe for
>>>    size-of-tree index keys to determine if we have the referenced data
>>>    locally. (This is something that the current format allows).
>>>  6 lets bzr map paths to fileids without reading the entire serialised
>>>    form (this is used by e.g. diff -r X PATH)
>>>  7 lets bzr map fileids to paths without reading the entire serialised 
>>>    form (this is used by e.g. loggerhead, bzr search, and anything
>>>    doing UI output of historical partial trees.)
>>>
> 
>>> Note that currently inventories can be split amongst up to 200 text
>>> deltas regardless of tree size so while we can't ignore the impact of
>>> needing to do more separate IO's to deal with the full inventory of
> large trees, it's not a foregone conclusion that it's a problem.
>> ^- I don't think I follow this statement. It sounds like you have a
>> design in mind already, but haven't actually stated what it is.
> 
> I have thoughts toward a design, my mail contained all the ones I am
> sure of. The statement you can't follow can be rephrased as:
> "its probably ok to need to do multiple reads to get a full inventory,
> because we do up to 200 reads today to get an inventory"
> 
>>> Aarons concept of using the dirstate fast-diff facilities to get a delta
>>> from the working tree to the basis tree, and then using purely
>>> historical diffs to work from there to other trees is a very useful one
>>> for increasing performance, because it allows us to have two different
>>> storage systems - one optimised for cheap low-latency high bandwidth IO,
>>> and one optimised for transmission and network use.
>> This is patch composition, correct? (Being able to determine the delta
>> from A=>C by going along A=>B=>C.)
> 
> Yes and no. It's similar, but it's not linear the way patch composition is.
> 
> Say we have three trees:
> W: working
> B: Basis
> O: Any other tree
> 
> One common operation is W.iter_changes(O).
> 
> We have fast iter_changes from W<->B thanks to dirstate.
> And the design goals above give us fast iter_changes from B<->O if they
> are both in the same serialised form.
> So we can get fast W<->O by bouncing through B. 
> 
> That is, it's always 2 steps long, never 500, because ancestry is not part
> of the logic.
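Rob's two-step bounce can be sketched as delta composition (a simplified
model; real iter_changes results carry (path, kind, executable, ...) tuples,
and `compose` here is a hypothetical helper, not bzrlib API):

```python
def compose(o_to_b, b_to_w):
    """Compose two deltas {file_id: (old_value, new_value)} -- O->B and
    B->W -- into a single O->W delta.  Only the two inputs are consulted:
    always two steps, never a walk over ancestry."""
    composed = {}
    for fid, (o_val, b_val) in o_to_b.items():
        # If W also changed this file, take its new value; otherwise the
        # B value is still current in W.
        composed[fid] = (o_val, b_to_w.get(fid, (b_val, b_val))[1])
    for fid, change in b_to_w.items():
        # Files untouched between O and B have the same value in both,
        # so the B->W change is already the O->W change.
        composed.setdefault(fid, change)
    # Drop entries whose value ended up back where it started.
    return {f: v for f, v in composed.items() if v[0] != v[1]}
```

Note that a change made between O and B and then reverted between B and W
correctly drops out of the composed result.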
> 
>>> Accordingly, I think the most important thing to discuss is how strictly
>>> we want to satisfy 3; and if we go for 'very strictly' what we should do
>>> with the derived data (which amongst other things is used to generate
>>> per-file graphs to do [relatively] cheap 'log FILENAME'.
>> Well, we currently have it serve double duty. One is to provide a
>> per-file graph, which is useful for annotate, weave + lca merge (both do
>> per-file merges, and I might make --merge3 do it as well), log, and a
>> few other places.
>>
>> The other is just a handle to access the text (file_id, revision_id) is
>> our "text_key".
> 
> We can't assume that a (fileid, revid) key exists a-priori, we could
> easily change to a different key system for text keys and have that work
> - we already should always be redirecting through the inventory for the
> revision we're examining anyhow - except when we're doing per-file graph
> traversal. And for that we'd need to push the actual reference into the
> tree data, if we wanted to change the text key definition. (I don't).

Right. Stuff like "log" doesn't have inventories around to redirect through.
It can probably get them, but not terribly efficiently. (This goes back to the
efficient-mapping idea. We might extend that to include "efficiently map from
(revision_id, file_id) => text".)
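A sketch of what that extended mapping might look like — `get_inventory` and
the text-key layout here are assumptions for illustration, not the current
bzrlib interface:

```python
class TextKeyResolver:
    """Resolve (revision_id, file_id) -> text key by redirecting through
    the inventory, caching whole inventories so log-style commands that
    hit many (revision, file) pairs pay each inventory read only once."""

    def __init__(self, get_inventory):
        # get_inventory: revision_id -> {file_id: text_key} (assumed API)
        self._get_inventory = get_inventory
        self._cache = {}

    def text_key(self, revision_id, file_id):
        key = (revision_id, file_id)
        if key not in self._cache:
            inv = self._get_inventory(revision_id)
            self._cache.update(
                ((revision_id, f), t) for f, t in inv.items())
        return self._cache[key]
```

The caching only papers over the cost, of course; the point of the efficient
mapping would be to avoid materialising the whole inventory at all.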


> 
> 
>>> In my view it may be desirable to split the derived data out, but its
>>> definitely more work and we can achieve a looser form of 3 above without
>>> making the issues in that area that we have today any *worse*, so it is
>>> definitely a choice rather than a requirement. As its a choice I think
>>> we should not couple it to fixing the scaling bugs we have in the
>>> commands mentioned above.
>>>
>> This would be nice, and I think is a "correct" way to go. I'm not sure
>> about before or after getting the other work done. So I agree it
>> shouldn't be strictly coupled, but *could* be an opportune time to make
>> changes that require "expensive" format upgrades.
> 
> I think its massively harder to do the decoupling, because we'd have to
> solve three or four additional deep performance problems - more scatter
> gather, updatable local caches, testament rework, maybe more.
> 
>>> One possible approach to storing-and-serialisation is to write some very
>>> low level code that can store many texts with common regions with
>>> partial access to a given text, in an efficient manner. I think this
>>> would be hard to do and that there are probably easier ways. Another
>>> approach is to split the overall document up as part of the
>>> serialisation step, and then store all the fragments - this is
>>> conceptually very similar to our current storage facilities and as such
>>> likely easy to accomplish. The key thing I can see being of issue there
>>> is how to ensure that the names of the fragments honours condition 3
>>> above - because the names are most likely embedded as references.
>>> (Alternate: write something to store many copies of an identical text
>>> with different names efficiently). Roughly speaking, the more we can
>>> push complexity about this up towards the application, the simpler the
>>> storage layer can be - and up to a point this is always good :). (The
>>> point that its not is when you make other things unreasonably complex by
>>> doing this).
>> I think the combination of 2 & 3 make it difficult to do a radix tree.
> 
> I'm quite certain we could update a radix tree given an inventory delta
> to create a new radix tree (satisfies 2). And I'm quite certain we could
> make packing of internal nodes into named byte sequences that we can
> point to deterministic, both on initial creation and update (satisfies
> 3). I don't know that a radix tree is the right structure, but certainly
> I don't see any reason why we couldn't use one.

I certainly agree that given a delta we can create a new radix tree. And I
believe we could make packing internal nodes deterministic; I'm just not sure
we can do so without evaluating the whole tree.

You and I discussed this a bit in London. I've thought about it some more, and
I'm still uneasy about how much context you would have to expand before you
know where the dividing points are.

Especially such that it cleanly handles the "10,000 directories, 1 file each"
case as well as the "1 directory, 10,000 files" case. If you know that you'll
never include a given directory you don't have to evaluate it, but are you
sure the person next to you will feel the same way? I suppose if you had a
specific page size that could hold 100 entries, you could always evaluate >100
entries before and after your current location. But even then I'm not
convinced. (Unless you know the previous node is "full", how do you know where
the next node starts? You could restrict nodes to never cross a certain
boundary regardless of how full they are, but at that point, does it still
handle both edge cases?)
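One way to get history-independent split points — not something proposed in
this thread, just a sketch of a known technique (content-defined chunking) —
is to let the entries themselves pick the boundaries, so neither history nor
neighbouring context matters:

```python
import hashlib

def split_points(paths, mask=0x3F):
    """Start a new fragment whenever the hash of a path matches a
    pattern (expected fragment size ~= mask + 1 entries).  Boundaries
    depend only on the entries themselves, never on history, so two
    identical inventories always fragment identically (goal 3), and the
    scheme is indifferent to directory shape."""
    fragments, current = [], []
    for path in sorted(paths):
        current.append(path)
        digest = hashlib.sha1(path.encode("utf-8")).digest()
        if digest[0] & mask == 0:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments
```

Because the boundary test looks at one entry at a time, a local change only
re-fragments the pages around it; it never requires expanding the whole tree
to find the dividing points.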


> 
>>> A useful thing to note is that while we need access to parts of an
>>> inventory, we don't need that keyed by revision id - that is, we can be
>>> sure that we'll start at a revision object, which has the property of
>>> having a normal revision id; we can follow from the into whatever
>>> storage/serialisation is used for inventories.
>> "follow from there" ?
> 
> Yes.
> 
>  
>>> In terms of generating fragments for storage and naming them - I'm
>>> inclined to propose using CHK - content hash keys - to name them. For
>>> instance, we might call fragments ('sha1:123456789....',). I suggest
>>> this because such keys are dependent purely on the inventory content
>>> rather than how we arrived at it - so its part of the answer for goal 3.
>>> Note that this is not the thin edge of the wedge to be same as GIT - it
>>> is not changing the naming scheme for revisions, simply some of the
>>> internals for how inventories are stored.
>> Note also that splitting into fragments makes things like file_id <=>
>> path mapping a bit harder. For example, if split up based on path you
>> could pass a hint about where the file is located in *this* inventory as
>> a likely place that you would want to look for it in *that* inventory.
> 
> I don't think we can satisfy the 'does less than size-tree work' without
> splitting the document into smaller pieces - fragments. So it might make
> things a little harder, but that shouldn't matter.
> 
>> It should be noted that one problem with CHK is that you don't know the
>> key until you've finished serializing. Which is one of the reasons to
>> make mostly a storage detail.
> 
> The entry key to the system has always been the revision id - and I
> explicitly noted that I don't want to change that. So I don't see any
> reason that CHK's would be a problem for us.
> 
>> Oh, and it seems a bit inefficient if you actually had to generate the
>> serialized form for every fragment, try to insert it, and then have a
>> no-op on insertion.
> 
> That would violate goal 2.

I can see that, given a delta, you can determine which fragments need to be
updated without evaluating all of them.
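A sketch of how that could look with directory-keyed fragments and sha1 CHK
names (the layout, the entry format, and the cut-down 3-tuple delta here are
all illustrative assumptions, not bzrlib's actual structures):

```python
import hashlib
import posixpath

def chk_name(lines):
    """A content hash key: the fragment's name depends only on its
    content, in line with goal 3."""
    return "sha1:" + hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

def update_fragments(fragments, delta):
    """fragments: {dir_path: sorted ['name file_id', ...] lines}.
    delta: (old_path, new_path, file_id) tuples, a cut-down inventory
    delta.  Only directories the delta touches are re-serialised and
    re-named; every other fragment keeps its existing CHK name."""
    touched = set()
    for old_path, new_path, file_id in delta:
        if old_path is not None:
            parent = posixpath.dirname(old_path)
            fragments[parent] = [line for line in fragments.get(parent, [])
                                 if not line.endswith(" " + file_id)]
            touched.add(parent)
        if new_path is not None:
            parent = posixpath.dirname(new_path)
            fragments.setdefault(parent, []).append(
                "%s %s" % (posixpath.basename(new_path), file_id))
            fragments[parent].sort()
            touched.add(parent)
    return {d: chk_name(fragments[d]) for d in touched}
```

(A real scheme also has to rename the ancestors of a touched directory, since
their child references change; that propagation is omitted here.)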


> 
>> There is also the question of what does a delta look like, when you are
>> using fragments. Do you get a delta-text per fragment? Are the deltas a
>> logical whole and then fragmented  (rather than fragment and delta, you
>> delta and fragment). Is the top level of a delta just the list of
> fragments that were affected? (Or doesn't it matter because the
>> fragments would be 'chained', the intermediates will always be updated
>> when a child is updated.)
> 
> As I said above deltas are the inventory deltas we have today - the in
> memory list of changes.

Then again, is that the "apply_inventory_delta" style of delta, the TreeDelta
style, the "iter_changes" style, or the "knit-delta" style?
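For reference, the apply_inventory_delta style is a list of (old_path,
new_path, file_id, new_entry) tuples; a minimal sketch of applying one to a
flat path→entry map (real InventoryEntry objects and delta-ordering
constraints omitted):

```python
def apply_inventory_delta(inv, delta):
    """inv: {path: (file_id, entry)}.  delta: a list of
    (old_path, new_path, file_id, new_entry) tuples -- old_path of None
    means an add, new_path of None a removal, both set a rename/modify
    (mirroring the apply_inventory_delta convention).  Ordering concerns
    such as path swaps are ignored in this sketch."""
    for old_path, new_path, file_id, new_entry in delta:
        if old_path is not None:
            del inv[old_path]
        if new_path is not None:
            inv[new_path] = (file_id, new_entry)
    return inv
```

The iter_changes and TreeDelta styles carry much richer per-entry detail
(kinds, executability, content flags), which is exactly why pinning down which
shape "the" delta is matters for the serialisation design.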

> 
>>> There are many ways we can split up an inventory - the most obvious in
>>> some respects is a tree formed around the directory structure of the
>>> tree. This is problematic in several regards for me. The primary problem
>>> I have is that we cannot balance such a tree - the user determines the
>>> tree shape and if they want 100K directories with files in the leaf;
>>> we'll be stuck trying to make that work - or equally very deep trees, or
>>> very broad trees with 10's of thousands of files in a single directory.
>> I agree that we can't force balancing, I'm not sure it is strictly
>> problematic. It can simply be an edge case that we don't support yet,
>> because the 90% most common use cases don't trigger it, and are thus not
>> penalized (either by performance, or just simply complexity).
>>
>> As long as it would still *work*, I don't think it is critical that it
>> also be *optimal* in all possible cases.
> 
> We've said that about many things in the past; and then they bite us.
> I'd rather not put a bunch of effort into creating and optimising
> something with *known flaws*.

I would guess that any design we can come up with will have *some* flaws. We
just need to decide where the best trade-off is (complexity being a flaw in
itself).

> 
>> I think having it at least be recommended to be able to go from
>> serialized delta <=> logical delta directly would be a nice property
>> (which I don't feel you explicitly defined). I think you defined
>> serialized fulltext <=> serialized delta.
> 
> I don't think we want a serialised delta concept - it will violate goal
> 3. Trivially, you have revision X and Y:[X], I have just Y with X a
> ghost (because you elided patent affected code or something when you
> pushed). My Y cannot have a serialised delta against X. So serialised
> delta <=> logical delta would be a misfeature.
> 
> Having efficient 'give me a delta from X and Y in serialised form' is a
> key goal though: goal 4 above. And it means you don't need to be able to
> go from serialised delta <=> logical delta.
> 
> -Rob

Sure. I'm not 100% convinced that we'll get a system that is optimal in all
cases without on-disk deltas, but I think we can try to think about it at as
many levels as we can before a specific implementation is written.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIo2HXJdeBCYSNAAMRAsUdAJ93SxSiAQRxFQ6zygMd6cLKsxNc5ACghGuU
xeDRmjhRa6FeZrRGyzH8QDo=
=vI/F
-----END PGP SIGNATURE-----


