Any plans/needs to extend the fast-import format?

Tue Aug 25 03:38:21 BST 2009

+Sverre, who has been doing work on hg->git lately.

Ian Clatworthy <ian.clatworthy at canonical.com> wrote:
> I'd like to use bzr fast-export + bzr fast-import to "round-trip" Bazaar
> branches but the format doesn't currently support all the metadata I
> need to do that. One option is to extend the above tools to use a
> superset and only dump the additional data if requested.

Yea, git supports round trips, almost, but the format started out for
git, so it's easier to be lossless there.  :-)

> Before I do that though, I thought I should ping you and other
> fast-import hackers to check if any of you had any plans/needs along
> those lines. Is there additional data required for round-tripping hg
> repositories, for example, that I ought to allow for while designing the
> extended format? If so, it would be good to pool our requirements so we
> end up with one common extended format (not 10) and higher fidelity
> migrations.

I think there is.  Rename information, for starters, might be hard to
encode.  fast-import does have a rename command, but the language says
the rename is applied immediately, which makes cases like
"rename A->B, rename B->A" really nasty to encode.
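
To make the rename problem concrete, here is a rough Python sketch. It
is not fast-import code: apply_renames is a made-up helper that only
models "each rename takes effect immediately" on a dict of paths.

```python
def apply_renames(tree, renames):
    """Apply (src, dst) renames one at a time, each taking effect immediately."""
    tree = dict(tree)  # work on a copy so the caller's tree is untouched
    for src, dst in renames:
        tree[dst] = tree.pop(src)  # an existing dst is overwritten
    return tree


tree = {"A": "contents of A", "B": "contents of B"}

# Naive encoding of a swap: the first rename clobbers B before it can move.
naive = apply_renames(tree, [("A", "B"), ("B", "A")])

# A working encoding needs a temporary path the exporter has to invent.
swapped = apply_renames(tree, [("A", "tmp"), ("B", "A"), ("tmp", "B")])
```

In this model the naive sequence loses the original contents of B, so
an exporter would have to detect the cycle and route it through a
temporary name, which is exactly the kind of thing that is awkward to
emit.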

> To be explicit, here's the sort of stuff I'm thinking of:
> 
> * An optional properties section in a commit. Each property would
>   have a name and value, both utf-8 encoded.

This I find could be dangerous.  What are the additional properties,
and what should they look like when loaded into another VCS?  Git has
tried to resist adding hidden fields to the commit headers, because
then they aren't as easy for a human to access.

> * Multiple author sections in a commit, not just one. (I guess authors
>   beyond the first wouldn't need the when data so I'm leaning towards
>   leaving that out.)

I guess this is sane.  An importer could just take the extra authors
and drop them into the commit message if the VCS doesn't have a
native way to represent them.  But that does make round-tripping
back out to a system that supports multiple authors harder.
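
A minimal sketch of that fallback, in Python.  Both the helper
fold_extra_authors and the "Also-by:" trailer name are invented for
illustration; nothing in the fast-import format defines them.

```python
def fold_extra_authors(message, authors):
    """Keep the first author; append the rest to the message as trailer lines."""
    primary, extras = authors[0], authors[1:]
    if not extras:
        return primary, message
    # "Also-by:" is a hypothetical trailer name chosen for this sketch.
    trailers = "\n".join(f"Also-by: {a}" for a in extras)
    return primary, message.rstrip("\n") + "\n\n" + trailers + "\n"


primary, folded = fold_extra_authors(
    "fix bug #1234\n",
    ["Bill Bloggs <bill at example.com>", "Sue Wong <sue at example.com>"],
)
```

The round-trip cost shows up on the way back out: an exporter would
have to parse those trailers out of the message again, and it cannot
tell them apart from identical lines a user typed by hand.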

> I'm sure more will be required (e.g. empty directory support, handling
> differences in tag name rules, ghost revisions) so I'm hesitant to lock
> down a detailed design without trying some code, but we should certainly
> chat about some key policies w.r.t. extending the format ...

Agree.

> Do you have any preferred direction w.r.t. indicating extended vs
> current data streams? For example, should we add a format command that
> goes at the top of the stream something like ...
> 
> format (git|bzr) [version]

Sverre recently added patches to git fast-import to declare options
at the top of the stream, but these are implementation specific
options unique to git-fast-import:

  http://article.gmane.org/gmane.comp.version-control.git/125853

> If missing, git 1.6 would be assumed. If present, old importers would
> stop on finding a command they didn't understand?

Yes, that seems sane, but I'd hate to lock into a particular
version number.  Instead we might get better mileage by declaring
the features that the stream uses, so that parsers which do not know
a feature abort:

  feature multiple-authors
  feature ghost-revisions
  feature ...
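
On the importer side that could look roughly like the Python sketch
below.  The names SUPPORTED_FEATURES, UnsupportedFeature and
check_features are assumptions made up for this example, not part of
any existing importer.

```python
# Hypothetical: the set of features this particular importer implements.
SUPPORTED_FEATURES = {"multiple-authors"}


class UnsupportedFeature(Exception):
    """Raised when the stream declares a feature the importer does not know."""


def check_features(lines):
    """Read leading 'feature' lines and abort on the first unknown one."""
    declared = []
    for line in lines:
        if not line.startswith("feature "):
            break  # declarations are only expected at the top of the stream
        name = line[len("feature "):].strip()
        if name not in SUPPORTED_FEATURES:
            raise UnsupportedFeature(name)
        declared.append(name)
    return declared
```

An importer that only knows multiple-authors would accept a stream
declaring just that, and stop before touching any data if the stream
also declared ghost-revisions.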

> Alternatively, we could put the format information (and all extended
> metadata?) into meta-comments something like ...
> 
> #+ format bzr 2.0
> ...
> commit refs/heads/master
> mark :22
> author Bill Bloggs <bill at example.com> datetime
> #+ author Sue Wong <sue at example.com>
> #+ author Chuck Jones <chuck at example.com>
> committer Sarah Watson <sarah at example.com> datetime
> data 13
> fix bug #1234
> from :19
> merge :21
> #+ properties 2
> #+ name 11 branch-nick
> #+ value 12 bug-fix-1234
> #+ name 5 fixes
> #+ value 7 lp:1234
> M 644 inline NEWS ...
> 
> I find that a lot less readable myself but it's worth considering.

No, please, let's not do that.  It risks a parser claiming it
understands the format when really it doesn't.  That's worse.
The whole point of the fast-import language is to convert the data
from one VCS to another, quickly but accurately.  If there is a
conversion error, we need to report it up front and allow the user
to resolve it (maybe by teaching their importer to understand
or skip a directive), but we should never silently produce an
incomplete import.
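
A small Python sketch of the failure mode.  Lines starting with "#" are
comments in the fast-import language, so an importer that predates the
"#+" convention throws them away; parse_commands is a made-up stand-in
for such an importer's line reader.

```python
def parse_commands(stream):
    """Return the lines an importer unaware of '#+' would actually act on."""
    return [line for line in stream.splitlines() if not line.startswith("#")]


stream = """\
author Bill Bloggs <bill at example.com> datetime
#+ author Sue Wong <sue at example.com>
committer Sarah Watson <sarah at example.com> datetime
"""

seen = parse_commands(stream)
```

The second author never reaches the importer, and nothing is reported:
the import "succeeds" with data missing.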

-- 
Shawn.