Advice/help wanted on bzr fast-export-from-cvs

Tue Aug 11 23:45:24 BST 2009

Ian Clatworthy wrote:
> I'm the primary author of bzr-fastimport. I'm pretty sure we've chatted
> before re cvs2git and using it to generate fast-import streams for Bazaar?

Yes.

> I've been ill for much of the last year so I haven't been around as
> often as I hoped. bzr-fastimport has been on the back burner
> accordingly. :-( I'm getting back to consistently good health now

I'm very glad for that and continue wishing you the best :-)

> In the last few days, I've been working on making migration to Bazaar
> via bzr fast-import easier and more reliable. In particular, I've added
> 4 wrapper scripts that make it dead simple for average users to generate
> fast-import dump files from foreign tools:
> 
> * bzr fast-export-from-darcs
> * bzr fast-export-from-hg
> * bzr fast-export-from-git
> * bzr fast-export-from-svn
> 
> In a nutshell, these commands:
> 
> * check required dependencies are installed and tell the user what to
>   install if they are missing
> 
> * call the bundled or external fast-export scripts using the recommended
>   set of options
> 
> * all follow the same UI pattern:
> 
>   bzr fast-export-from-xxx source-repo dump-file

That's an interesting concept and very user-friendly.
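
A minimal sketch of how a CVS exporter might follow the same pattern
(check dependencies, then call the underlying tool with a fixed set of
options).  The class name CvsExporter, the method names, and the list of
required tools are assumptions made for illustration; this is not
bzr-fastimport's actual code, and the option spellings should be checked
against cvs2git's own help output.

```python
import shutil


class CvsExporter:
    """Hypothetical exporter; names and structure are illustrative only."""

    # Executables assumed to be needed; adjust to the real requirements.
    required_tools = ("cvs2git", "cvs")

    def missing_dependencies(self, which=shutil.which):
        """Return the required tools that cannot be found on PATH."""
        return [tool for tool in self.required_tools if which(tool) is None]

    def build_command(self, source_repo, dump_file, blob_file):
        """Build the argument list for the underlying converter."""
        return [
            "cvs2git",
            "--blobfile=" + blob_file,
            "--dumpfile=" + dump_file,
            "--username=cvs2git",
            source_repo,
        ]
```

The wrapper would report missing_dependencies() to the user before
running anything, and could print the result of build_command() so that
problems can be reported upstream with the exact command line.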

> I'm keen to add fast-export-from-cvs and I need your advice/help to do
> that. The other fast-export-from-xxx commands needed 2 small bits of
> code to implement:
> 
> * a command wrapper that defines the usage and help
> 
> * an XxxExporter class that defines the dependencies and how to call
>   the underlying script/command that does the actual work.
> 
> See
> http://bazaar.launchpad.net/~bzr/bzr-fastimport/fastimport.dev/annotate/head%3A/exporters/__init__.py
> for the existing XxxExporter classes.
> 
> So what I need to know is:
> 
> 1. Are there commonly recommended options that projects ought to use as
>    a starting point?

First of all, I suggest basing your work on the trunk version of cvs2svn.
It is long overdue that I make a new release, and trunk contains a few
improvements that would be important for exporting to git-fast-import
format.

The trunk version of cvs2svn includes a "cvs2git" script that exports a
CVS repository to git-fast-import format.  It only requires a few
options to be set; all of the rest have sensible defaults.  It should be
a very good starting point (and maybe even an ending point!) for cvs2bzr.

For experimentation, the most flexible way to configure a conversion is
by using an options file (specified with the --options option).  The
options files cvs2git-example.options and cvs2hg-example.options, which
are included in the cvs2svn source tree, should be good examples.  There
are a couple of places where they differ, and undoubtedly a
cvs2bzr-example.options file would have other differences.

For example, the cvs2git script creates separate git-fast-import
dumpfiles for the blobs and for the commits, whereas until recently "hg
fastimport" only supported inline blobs.  If "bzr fastimport" requires
inline blobs, then you would have to use the (much slower) output option
using GitRevisionInlineWriter as done in the cvs2hg-example.options file.
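
If an importer can resolve blobs by mark but expects a single stream,
one possible workaround is to concatenate the two dumpfiles, blobs
first, so that every mark is defined before a commit refers to it.  The
sketch below shows only that concatenation step.  Whether "bzr
fastimport" accepts such a stream is an assumption that would need to be
tested, and the function name is made up for this example.

```python
import shutil


def concatenate_streams(blob_path, commit_path, out_path):
    """Write the blob dumpfile followed by the commit dumpfile to out_path."""
    with open(out_path, "wb") as out:
        # Blobs go first so their marks exist before any commit uses them.
        for path in (blob_path, commit_path):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```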

Another difference between cvs2git and cvs2hg is that hg only supports
0, 1, or 2 parents per commit, whereas git allows an unlimited number.
This can also be adjusted in cvs2svn, using GitOutputOption's max_merges
parameter.

But anyway, there is a serious question as to what parents to record for
branch-creation commits and commits that involve adding new files to a
branch.  Currently, cvs2git records all branches that contribute files
to a branch as parents, but (having gained more experience with git) I
am skeptical whether that behavior is correct.  I think it would be more
in the spirit of DAG-based VCSs to only consider the "best" source
branch to be a parent of a new branch.  Greg Ward, who has recently done
some work on an improved cvs2svn-based cvs2hg, plans to do the latter.
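
To make the "best source branch" idea concrete, here is a small sketch
under one possible definition of "best": the branch that contributes the
most files to the new branch.  That definition, the function name, and
the input format are all assumptions for illustration; this is not how
cvs2git or cvs2hg actually chooses parents.

```python
from collections import Counter


def best_source_branch(file_sources):
    """Pick the branch that contributes the most files.

    file_sources maps each file path to the branch it was copied from and
    must not be empty.  Ties are broken alphabetically so that the result
    is deterministic.
    """
    counts = Counter(file_sources.values())
    return min(counts, key=lambda branch: (-counts[branch], branch))
```

A converter using this rule would record only the returned branch as the
parent of the branch-creation commit, instead of every contributing
branch.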

> 2. How often do users use 'rcs co' vs cvs to access the data?

I don't have any insight into this.  "--use-cvs" is the default because,
even though it is *much* slower than "--use-rcs", it is more robust in a
few unusual situations.  I figured that the naive users who rely on
default parameters will mostly have small repositories, whereas people
who worry about performance are likely to read the instructions more
carefully and perhaps opt for --use-rcs.

Even faster than those two options, by a significant factor, is a
--use-internal-co option that I have prototyped on my hard disk but not
yet released to the wild.  The analogous option is the default for
cvs2svn and would probably be the best default for the other converters,
if it can get released in time.

> 3. How stable is the code?

There hasn't been much feedback from cvs2git users, but what there has
been has mostly been positive.  I have heard very little from cvs2hg
users and don't recall any feedback at all for "cvs2bzr".  I have the
feeling that (1) most users of DVCSs are early adopters who have
probably already left CVS for something else (maybe Subversion), and (2)
many other git users just use the "default" converter, git-cvsimport,
buggy though it is.

Greg Ward was working on making contrib/verify-cvs2svn.py able to test
the accuracy of conversions to all backends.  It only tests that the
contents of tags and the tips of branches were correct, but that would
already be a useful confidence-builder.  I think that very little code
would suffice to teach it to test cvs2bzr conversions.

> 4. Is it worth bundling the necessary pieces in bzr-fastimport itself
>    rather than asking users to separately install it? (A separate
>    install is a minor thing for Ubuntu/Debian users, say, but a PITA
>    for Windows users IIUIC.)(#)

That's a nice idea.  Here are some considerations:

* As you can probably imagine, I am not anxious to have to support
multiple cvs2xxx variants as distributed (and perhaps even modified) by
different downstream projects.  But I suppose if you would distribute an
unmodified, defined version of cvs2svn and make it easy for users to see
the cvs2svn command line that was used and to report problems upstream
in a usable form, it wouldn't be so terrible.

* Aside from the few inconvenient dependencies (GNU sort, gdbm, CVS or
RCS), cvs2svn is pure Python and doesn't need to be compiled or
"installed" in any special way.  This both lowers the barrier for users
to install it, and lowers the difficulty of your including it in your
distribution.

* cvs2svn is currently under a CollabNet license which, as far as I can
figure out, is not GPL-compatible.  This might or might not present a
problem, depending on how you want to connect your code to cvs2svn.  It
is conceivable that CollabNet would agree to change the license, but
that is unfortunately not my decision.

> 5. Does the gnu 'sort' dependency still hold? Is there a good reason for
>    needing that versus doing the sorting in Python, say?

Yes, we still require GNU sort.  The sorting could definitely be done in
Python, but not in-memory because we often have to sort enormous files.
Our issue #123 [1] is about this possibility.  One suggestion is to use
the sorting recipe from [2], but the maximum file size that it can sort
depends on the number of file handles that are available to the process.
I never had the time either to research whether the limits are OK on all
operating systems of interest, or to rewrite the sort routine to make it
work hierarchically, to allow arbitrarily large files with a fixed
number of file handles.  Patches would be welcome :-)
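
A rough sketch of what such a hierarchical sort could look like: split
the input into sorted runs, then repeatedly merge at most max_open runs
at a time until one remains.  This is not the recipe from [2] and not
cvs2svn code; the function names and default sizes are invented for the
example.  It also orders lines by Python's string comparison, which may
differ from what GNU sort produces under a given locale.

```python
import heapq
import itertools
import os
import shutil
import tempfile


def _write_run(lines, tmpdir):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(lines)
    return path


def _merge_runs(paths, tmpdir):
    """Merge several sorted run files into one new run file."""
    files = [open(p) for p in paths]
    try:
        merged = _write_run(heapq.merge(*files), tmpdir)
    finally:
        for f in files:
            f.close()
    for p in paths:
        os.remove(p)
    return merged


def external_sort(in_path, out_path, tmpdir, chunk_lines=100000, max_open=16):
    """Sort the lines of in_path into out_path using bounded memory."""
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    # Pass 1: split the input into sorted runs of at most chunk_lines lines.
    runs = []
    with open(in_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"  # only the file's final line can lack one
            chunk.sort()
            runs.append(_write_run(chunk, tmpdir))
    # Pass 2: merge at most max_open runs at a time until one remains, so
    # no more than max_open inputs plus one output are open at once.
    while len(runs) > 1:
        runs = [
            _merge_runs(runs[i:i + max_open], tmpdir)
            for i in range(0, len(runs), max_open)
        ]
    if runs:
        shutil.move(runs[0], out_path)
    else:
        open(out_path, "w").close()  # empty input gives empty output
```

Each merge level rewrites all of the data once, so a smaller max_open
trades extra passes over the data for fewer open file handles.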

The other dependency that sometimes proves problematic for users is that
we require a reasonable DBM (of the anydbm flavor).  Our code explicitly
disallows dumbdbm, dbm, and older versions of bsddb (see
cvs2svn_lib/database.py for the DBM-selection code).  The DBM-selection
code was here before I started on the project, so I don't know the details.

But actually, I just checked and found out that the only remaining user
of the Database class is the checkout_internal code.  It could be that
we don't need to be so picky about DBMs anymore.

Michael

[1] http://cvs2svn.tigris.org/issues/show_bug.cgi?id=123
[2] http://code.activestate.com/recipes/466302/