Recommended backup procedure and preserving my data...

John Arbash Meinel john at arbash-meinel.com
Thu Oct 15 19:22:11 BST 2009



John Szakmeister wrote:
> On Thu, Oct 15, 2009 at 12:47 PM, John Arbash Meinel
> <john at arbash-meinel.com> wrote:
> [snip]
>> You *can* do a fast-export, but last I heard, it wasn't made as a
>> 'round-tripping' tool. Meaning that you can export, and potentially
>> import all of that history into a new repository. But the new repository
>> will likely get all new revision ids and file-ids, and not be compatible
>> with the old repository.
> 
> Bummer.  It'd be nice to have a format where I could be guaranteed to
> get back the original repository.

There has been some work to make this possible. The main remaining
problem is that bzr can represent "ghosts" (revisions whose identifier
we know, but whose actual content we do not have).

The fast-export stream does not seem to have a way to represent those
sorts of objects. (The stream format comes from git, where everything is
addressed by the hash of its content, so if you can name an object, you
must have its actual content available.)

Aside from that, I thought Ian had round-tripping working. Though you
would need to get that from *him*, since I've never done any experiments
myself.

I believe the idea is that you need the fast-export stream plus a
'marks' file that records which identifier bzr uses for each object in
the stream.

> 
>> If you *just* wanted a single-file dump of the whole history, you could do:
>>
>> bzr init empty-branch
>> cd trunk
>> bzr send -o ../big-dump.patch ../empty-branch
>>
>> It effectively generates the delta of your entire history, and puts it
>> into the file. It doesn't work with multiple branches, though. (Well,
>> you can do them each separately, but you'll have lots of really big
>> files when you are done.)
>>
>> It also isn't optimized for this case, and will probably be quite slow.
> 
> I can imagine. :-)
> 
>>> Sorry for all the questions, but I'd like to seriously consider
>>> rolling out Bazaar in our infrastructure.  I can't really do it
>>> though, unless I can take care of these issues as well.
> [snip handy script]
> 
> The script assumes that everything is in a shared repository... which
> I suppose is a fair assumption.

Well, you could use it once per repo. Or do something like:

find . -path '*.bzr/branch' -print0 | xargs -0 -n1 ...
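The '...' is whatever per-branch backup command you use. As a runnable
sketch of the loop (against a mock layout, with `echo` standing in for
the real backup command):

```shell
# Sketch of iterating over every branch found under the current tree;
# proj-a/proj-b are mock branches created just for the demonstration.
mkdir -p proj-a/.bzr/branch proj-b/.bzr/branch

# -print0 / read -d '' keep paths with unusual characters intact.
find . -path '*/.bzr/branch' -print0 |
while IFS= read -r -d '' branch_dir; do
    # Replace echo with your actual per-branch backup command.
    echo "backing up ${branch_dir%/.bzr/branch}"
done
```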


There are other possibilities, though. Bazaar is careful to order its
reads and writes such that things always stay consistent. So if you were
to do a backup in the same order that we do our reading, you would be
sure to maintain the same consistency.

Namely, you would have to:

1) Back up the contents of all branches before backing up the content of
   the repository.
2) Back up the repository such that '.bzr/repository/pack-names' is
   copied before the contents of .bzr/repository/packs and
   .bzr/repository/indices.

3) You shouldn't have to back up the contents of .bzr/repository/upload
   or .bzr/repository/obsolete_packs. Those directories must exist when
   restoring (we don't create them on demand), but anything written
   there is in a 'temporary' state. (When updating, we write everything
   to .bzr/repository/upload, rename it into position, and then update
   pack-names to reference it.)
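As a concrete sketch of that ordering (pure shell over a mock 2a
repository layout; demo-repo and demo-backup are stand-ins for your
real repository and staging paths):

```shell
# Mock shared-repository layout, so the copy ordering is visible
# end to end; adapt the cp commands to your real paths.
set -e
REPO=demo-repo; BACKUP=demo-backup
rm -rf "$REPO" "$BACKUP"
mkdir -p "$REPO/.bzr/repository/packs" "$REPO/.bzr/repository/indices" \
         "$REPO/trunk/.bzr/branch"
echo "pack-1" > "$REPO/.bzr/repository/pack-names"
echo "bytes"  > "$REPO/.bzr/repository/packs/pack-1.pack"
echo "index"  > "$REPO/.bzr/repository/indices/pack-1.rix"

# 1) Branch metadata first.
mkdir -p "$BACKUP/trunk/.bzr"
cp -R "$REPO/trunk/.bzr/branch" "$BACKUP/trunk/.bzr/"

# 2) pack-names before packs/ and indices/.
mkdir -p "$BACKUP/.bzr/repository"
cp "$REPO/.bzr/repository/pack-names" "$BACKUP/.bzr/repository/"
cp -R "$REPO/.bzr/repository/packs"   "$BACKUP/.bzr/repository/"
cp -R "$REPO/.bzr/repository/indices" "$BACKUP/.bzr/repository/"

# 3) upload/ and obsolete_packs/ must exist on restore, but stay empty.
mkdir -p "$BACKUP/.bzr/repository/upload" \
         "$BACKUP/.bzr/repository/obsolete_packs"
```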

As long as you worked in that order, you *could* end up with
unreferenced data, but you shouldn't ever end up with data that is
referenced but is not available.

Well, I have to take that back slightly. If an autopack/pack run was
happening while you were backing up, it may move content into a new
pack file (written under .bzr/repository/upload and then renamed into
place), and then update the pack-names file. It won't write the new
pack-names file until the data has been fully updated, though. So one
option is to loop: grab pack-names, back up everything in
.bzr/repository/packs and indices, and then check whether pack-names
has changed. If it has, go around again until you have copied
everything without pack-names changing.

Note that this is what 'bzr branch' does, which is why I recommend
staging everything to a 'warm backup' location first.
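That retry loop might look something like this (loop-repo and
loop-backup are mock stand-ins for the real repository and staging
paths):

```shell
# Retry until pack-names is stable, i.e. no autopack finished while
# we were copying. The mock layout exists only so the sketch runs.
set -e
SRC=loop-repo/.bzr/repository
DST=loop-backup/.bzr/repository
rm -rf loop-repo loop-backup
mkdir -p "$SRC/packs" "$SRC/indices" "$DST"
echo "pack-1" > "$SRC/pack-names"
echo "bytes"  > "$SRC/packs/pack-1.pack"

while :; do
    rm -rf "$DST/packs" "$DST/indices"
    cp "$SRC/pack-names" "$DST/pack-names.copied"
    cp -R "$SRC/packs"   "$DST/packs"
    cp -R "$SRC/indices" "$DST/indices"
    # If pack-names changed while we were copying (a concurrent
    # autopack finished), go around again.
    if cmp -s "$SRC/pack-names" "$DST/pack-names.copied"; then
        mv "$DST/pack-names.copied" "$DST/pack-names"
        break
    fi
done
```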

> 
>> Note that this should be ~nice to your backup tapes. Bazaar will
>> autopack the repository from time to time, but does so in an
>> 'exponential backoff' fashion. So the *first* time you run this script,
>> I would add a "bzr pack" just before $RUN_BACKUP_TO_TAPE.
>> That should give you a single minimal pack file that gets backed up.
> 
> It doesn't need to be absolutely minimal churn... I can cope with the
> autopacking.  We don't have much (in terms of size), but we have 50 or
> more Subversion repositories at the moment.  And it seems to grow
> every week. :-)

As for Bazaar repos, you can have as many or as few as works for you.
You can share multiple projects in one repo, or have one repo per
project, or one repo per branch... The actual layout tends to be
dictated by access control (balanced against disk storage).

> 
> [snip]
>> You can use 'bzr_access', you can use bzr+http + .htaccess files. You
>> can use just "bzr://" access and just use firewall rules to restrict who
>> can actually access the server.
> 
> Is that new?  I don't remember seeing that in the guides.  What's the
> performance impact?

I'm not sure which guide you were looking at, but it has been around
for quite some time. Basically, it just uses HTTP 'POST' as an RPC
layer to send requests. The protocol is specifically designed to be
'stateless' so that we can tunnel it over HTTP. The performance impact
should range from negligible to actually better than 'ssh', since you
avoid the ssh handshake overhead.

Though if you do "bzr+https://" then it should essentially be identical.

Note that Loggerhead now supports:

  bzr serve --http

That provides a pre-configured bzr smart server. Many people like to
run loggerhead and proxy it through Apache. I *think* that in doing so
you can get access control at the Apache layer, with minimal setup
overhead via loggerhead. Not to mention nice visuals when you manually
browse to "http://host/my/branch".


> 
>> It really depends how much access control support you really need. I
>> think someone was also working on adding AC to 'bzr://' but I don't
>> think that is 'ready' in any sense.
>>
>> bzr+http might be your best bet here.
> 
> Thanks for taking the time to answer my questions John!  BTW, are any
> of you guys going to be at PyCon?  I'd love to meet you.
> 
> Thanks again!
> 
> -John
> 

I'm not sure if we are sending anyone this year. I've gone to the last
2, because they were in Chicago (about 1hr away from where I live).
Being in Atlanta would require traveling away from my family...

I know in the past we've had quite a few Canonical people travel to
PyCon, I just don't know if someone from the Bazaar group will be
specifically going.

John
=:->



More information about the bazaar mailing list