Loggerhead directions

Thu Apr 15 02:54:32 BST 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> On 15/04/10 06:10, John Arbash Meinel wrote:
> 
>> Better yet, the cache can be shared between all branches. So switching
>> from emacs/trunk to emacs/feature1 to emacs/feature2 will avoid the
>> 'fresh-start w/ on disk cache' performance, and even better always
>> avoids the 'fresh start no-disk cache' performance. So once you've done
>> the 30s import, your worst-case time becomes closer to 700ms (440+188)
>> rather than 13000ms when a new feature branch is browsed.
> 
> That's certainly a major improvement over the per-branch architecture of
> my historycache plugin. I'm curious though about how stacked branches
> play in with this? Can you chain the caches so that the cache for a
> stacked branch gets deleted if and when a stacked branch is deleted?
> 

So I just have 1 cache, shared by all the branches you want to put in
it, without chaining. I haven't really worked out how to clean data out
of the cache. Certainly it is possible, but I don't see it being a big
deal.

Right now, the cache is a local-only thing, based purely on the revision
graph. The location of the cache is a branch config, but I would expect
people to either set it up as a global, or possibly per repo, etc.
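
A minimal sketch of that idea, not the actual bzr-history-db code: one
store keyed purely on revision ids, so importing a second branch only
walks the revisions the first one did not already contribute.
SharedGraphCache and the toy parent map are made up for illustration.

```python
class SharedGraphCache(object):
    """Hypothetical single cache shared by every branch imported into it."""

    def __init__(self):
        self.parents = {}  # revision_id -> tuple of parent revision_ids

    def import_branch(self, tip, get_parents):
        """Walk ancestry from tip, storing only revisions not yet cached.

        Returns the number of newly imported revisions.
        """
        added = 0
        pending = [tip]
        while pending:
            rev = pending.pop()
            if rev in self.parents:
                # Already stored; its ancestry is stored or already queued.
                continue
            self.parents[rev] = tuple(get_parents(rev))
            added += 1
            pending.extend(self.parents[rev])
        return added


# Toy history: trunk is A <- B <- C, feature1 adds D on top of B.
toy_graph = {'A': (), 'B': ('A',), 'C': ('B',), 'D': ('B',)}
cache = SharedGraphCache()
trunk_cost = cache.import_branch('C', toy_graph.__getitem__)
feature_cost = cache.import_branch('D', toy_graph.__getitem__)
```

In this toy case the trunk import stores three revisions and the
feature branch only one, which is the reason a fresh feature branch
should not pay the full import cost again.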

>> So now I'm trying to figure out if I should go back to at least an
>> in-memory cache, and populate it from the data I query via bzrlib apis,
>> or whether I should just mandate that bzr-history-db is available.
>> If I do the latter, then I may as well change the Loggerhead code to
>> query the database directly, and only use stuff like
>> Branch.last_revision()...
>>
>> Some minor guidance would be appreciated.
> 
> In terms of performance tuning, I think you ought to assume that
> history-db or equivalent is deployed. In terms of dependencies though,
> you should assume it's *not* there. In other words, please go via
> bzrlib, even if you need to introduce new Branch and Repository APIs, to
> access data.

The main problem is that we instantiate a new Branch instance *per
request*, which means that any caching I assume to happen on or under
Branch won't persist between HTTP requests.

So far, I've just gone via the bzrlib APIs you added recently
(dotted_revno_to_revision_id, iter_merge_sorted_revisions, etc.). I
haven't quite worked out if they are enough. But the *big* issue is that
you have 0 caching between requests.

A Branch caches the iter_merge_sorted result as long as you hold a read
lock. However, Branches do not share that cache.
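
To illustrate the problem (a sketch only): FakeBranch below is a
stand-in, NOT the bzrlib Branch class, and cached_merge_sorted is a
hypothetical helper. It shows a per-instance cache being thrown away on
every request, versus a process-wide cache keyed on the tip revision id.

```python
import threading


class FakeBranch(object):
    """Stand-in for a Branch; not bzrlib, just enough to show the issue."""

    graph_loads = 0  # counts how often the expensive load happens

    def __init__(self, tip):
        self._tip = tip
        self._merge_sorted = None  # per-instance cache, dies with the object

    def last_revision(self):
        return self._tip

    def merge_sorted(self):
        if self._merge_sorted is None:
            FakeBranch.graph_loads += 1  # the slow step on a real branch
            self._merge_sorted = ['rev-for-%s' % self._tip]
        return self._merge_sorted


_cache = {}
_cache_lock = threading.Lock()


def cached_merge_sorted(branch):
    """Process-wide cache keyed on the tip revision id (hypothetical)."""
    key = branch.last_revision()
    with _cache_lock:
        if key not in _cache:
            _cache[key] = branch.merge_sorted()
        return _cache[key]


# Three "requests", each with a fresh branch object.
for _ in range(3):
    FakeBranch('tip-1').merge_sorted()
uncached_loads = FakeBranch.graph_loads

FakeBranch.graph_loads = 0
for _ in range(3):
    cached_merge_sorted(FakeBranch('tip-1'))
cached_loads = FakeBranch.graph_loads
```

The lock here is held during the load, which keeps the sketch simple
but serializes requests; it also means the cache now lives outside
Branch, which is exactly the state loggerhead has been trying to avoid.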

Loggerhead is in a bit of a pickle, trying to stay stateless and yet
handle cache state... I don't have a great answer here.

> 
> Why do I say that? I suspect 90% of projects on any hosting site are
> small to medium in size. They *may* benefit from history-db but
> Loggerhead ought to perform fine for those branches without it. By
> sticking with bzrlib APIs, we can selectively enable history-db only on
> large projects, at least until a descendant of it makes it into the core.

What are you defining as 'medium'? Bzr itself is now 30k revisions and
1k files.

If I stick with Branch APIs, it costs 1-2s to load the revision graph,
as overhead on pretty much every query, unless we do something to try
to cache Branch objects (with the fairly major downside that Branch
isn't particularly threadsafe).

Also note that in loggerhead trunk, viewing the 'trunk' branch of emacs
(when cached) takes say 700ms, but consumes 120MB of RAM. My history-db
branch can do the work in say 450ms, and consume only 30MB of RAM.
(Note, viewing 2 emacs branches only goes up to ~140MB, as a lot of the
StaticTuples get to be shared between them [I think].)

> 
> On a semi-related topic, I'm hoping to do a review of Loggerhead's UI
> soon and look at what we can take out to get closer to O(1) performance.
> For example, I don't think that we need to render the revno column in
> the Files view. (I'm planning to introduce one or more configuration
> settings that will control whether that data is displayed or not.)
> 
> Ian C.

A link is pretty cheap (via revision-id). I don't think a revision-id
is great for display, though.

The main display is also fairly cheap (it only shows the mainline, and
links to merged revs). As such if we just got rid of dotted revnos, we
could have everything be O(1) without any caching.
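
As a rough sketch of why the mainline view is cheap (the mainline
function and toy graph are illustrative, not loggerhead code): a page
of N mainline revisions only needs N left-hand-parent lookups from the
tip, however long the history is.

```python
def mainline(tip, get_parents, limit):
    """Return up to `limit` revision ids, following left-hand parents."""
    revs = []
    rev = tip
    while rev is not None and len(revs) < limit:
        revs.append(rev)
        parents = get_parents(rev)
        # The first parent is the mainline; the others are merged revs.
        rev = parents[0] if parents else None
    return revs


# Toy graph: C merges feature rev X into the mainline A <- B <- C.
mainline_graph = {'A': (), 'B': ('A',), 'X': ('A',), 'C': ('B', 'X')}
page = mainline('C', mainline_graph.__getitem__, 2)
```

Dotted revnos are different: numbering X requires knowing how the whole
graph merge-sorts, which is where the caching need comes from.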

Oh, and "merged_into" links would have to be sorted out. (They walk
children pointers, which we don't have cheaply.)
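
For illustration only (build_children is a hypothetical helper, not
how loggerhead does it): the graph stores parent pointers, so getting
children means inverting the whole parent map, which touches every
revision.

```python
def build_children(parent_map):
    """Invert a {revision_id: parent_ids} map into {revision_id: children}."""
    children = {}
    for rev, parents in parent_map.items():
        for parent in parents:
            children.setdefault(parent, []).append(rev)
    return children


# Toy graph: C merges feature rev X into the mainline A <- B <- C.
parent_map = {'A': (), 'B': ('A',), 'X': ('A',), 'C': ('B', 'X')}
children = build_children(parent_map)
merged_into_x = sorted(children.get('X', []))
children_of_a = sorted(children.get('A', []))
```

Here X's only child is C, the revision that merged it, but finding
that out required a pass over the full graph.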

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvGcdgACgkQJdeBCYSNAAMLWACeJ0z+g7QdtrU6FCTyiLXQGF4y
8OUAoKGft6iHBT274Tni5ot+3BUqyRHX
=FR07
-----END PGP SIGNATURE-----