Out of Memory a bridge too far

Andrew Bennetts andrew at bemusement.org
Sun Nov 13 06:48:23 UTC 2011


Chris Hecker wrote:
> 
> >> What I am actually looking for (I believe) is something along the
> >> lines of the new largefile support in Mercurial.
> > Yes, we are interested in both helping you write it, and in getting
> > it merged in to core.
> 
> My fear about doing a hacky "external largefile solution" like in hg is
> that it will be "good enough" and relieve any pressure to solve the real
> problem, but it's really a quite crappy solution to the actual problem.
>  Solving it the right way seems like it would be only a little more work
> (once you take into account testing and everything over the lifetime),
> yet it would set bzr up to be a real "dvcs 2.0" project, leaving the
> hacks behind.
> 
> Of course, I ranted about this before on this list, and since I don't
> have time to do it myself, I guess it's just a bunch of hot air right now.

I think you underestimate the effort involved.  If it were only a
“little more work” to do this “the right way”, we would have done it by
now.  There have been threads delving into the details in the past, but
I'll recap the basic points I remember.

In principle, it's not too hard to modify the way bzr stores large files
to store them as N moderate-sized chunks rather than one really huge
record.
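
To make that concrete, here's the shape of the idea as a purely
illustrative sketch: the key layout, helper name and 4MB figure below
are made up for the example, and bear no relation to bzr's actual
storage model or APIs.

    def chunk_keys(file_id, revision_id, size, chunk_size=4 * 1024 * 1024):
        """One storage key per moderate-sized chunk of a file text,
        instead of one record holding the whole thing."""
        n_chunks = (size + chunk_size - 1) // chunk_size
        return [('chunk', file_id, revision_id, i) for i in range(n_chunks)]

    # e.g. a 9MB file split at 4MB:
    #   chunk_keys('file-id', 'rev-1', 9 * 1024 * 1024)
    #   -> [('chunk', 'file-id', 'rev-1', 0),
    #       ('chunk', 'file-id', 'rev-1', 1),
    #       ('chunk', 'file-id', 'rev-1', 2)]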

But there are significant practical issues:

1. it's a format change.  We don't do those lightly.

2. it does involve considerable (but entirely feasible) work to make
   sure the internals of bzr always deal with chunks or streams for
   large files, never just a string of the whole file (there's a small
   sketch of the difference after this list).  There has been constant,
   quiet progress on this in basically every release for ages, but it
   does mean fixing up a *lot* of code paths, so it's not there yet.
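
To make point 2 concrete: the difference is between code that reads the
whole file into one string and code that only ever holds a bounded
chunk at a time.  The snippet below uses plain hashlib rather than any
real bzrlib code, so treat it as a sketch of the pattern, not of the
actual code paths involved.

    import hashlib

    def sha1_whole_string(path):
        # The pattern point 2 is about eliminating: the entire file ends
        # up in memory as a single string.
        with open(path, 'rb') as f:
            return hashlib.sha1(f.read()).hexdigest()

    def sha1_streamed(path, chunk_size=1024 * 1024):
        # The pattern the internals need everywhere instead: memory use
        # stays bounded no matter how large the file is.
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(chunk_size), b''):
                h.update(block)
        return h.hexdigest()

Both produce identical digests; only the second is safe on a multi-GB
file.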

If I were to try to make a tasteful workaround in current formats I'd
probably look into a different approach from the ones suggested on this
thread so far: a view plugin (or something in that style) that
transparently notices when a new revision adds a large file, breaks it
into a series of 10MB (say) chunks, and commits those instead.  The
plugin would then of course recombine those chunks when extracting that
file.
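
As a rough sketch of what such a plugin might do under the hood (the
helper names are hypothetical and the 10MB figure is just the example
from above; none of this is an existing API):

    CHUNK_SIZE = 10 * 1024 * 1024  # 10MB, say

    def split_into_chunks(path):
        """Write path-chunk-000, path-chunk-001, ... alongside 'path'.

        Something like this would run when a commit is about to add a
        large file; the chunks would be committed instead of the file.
        """
        chunk_names = []
        with open(path, 'rb') as f:
            index = 0
            while True:
                data = f.read(CHUNK_SIZE)
                if not data:
                    break
                chunk_name = '%s-chunk-%03d' % (path, index)
                with open(chunk_name, 'wb') as out:
                    out.write(data)
                chunk_names.append(chunk_name)
                index += 1
        return chunk_names

    def recombine_chunks(chunk_names, path):
        """The inverse, run when extracting the file again."""
        with open(path, 'wb') as out:
            for chunk_name in sorted(chunk_names):
                with open(chunk_name, 'rb') as f:
                    # copy a chunk at a time; never the whole file at once
                    for block in iter(lambda: f.read(1024 * 1024), b''):
                        out.write(block)

Zero-padded chunk names are what make the manual
'cat largefile-chunk-* > largefile' recovery mentioned below come out
in the right order.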

My thinking here is that:

 - this has a fairly clear upgrade path to a future format change that
   works along the same lines;
 - clients without the plugin can still make use of the branch and
   revision involved (and with some effort, even reconstruct the large
   file manually if they have to: 'cat largefile-chunk-* > largefile' or
   similar);
 - and, of course, this greatly mitigates the memory consumption of
   bzr's current implementation, which IIRC needs roughly 2-3x the
   memory of the largest file in the tree.

Even this I don't think I'd call “easy”: I bet there are some fiddly
corner cases, and also IIRC the existing view hooks assume a 1-to-1
relationship between transformed and untransformed files.

-Andrew.



