Making diff fast (was Re: Some notes on distributed SCM)

Mon Apr 11 00:58:16 BST 2005

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Daniel Phillips wrote:
> On Sunday 10 April 2005 18:07, Aaron Bentley wrote:
>>Personally, I prefer the Arch approach, which is essentially to assign a
>> uuid to each file.
> 
> 
> Why bloat up the canonical log structure unnecessarily?

Are we talking about the files-modified list?  I can't imagine the size
of that list being at all significant.

> Uuids have their place, but not in every log entry, imho.  It is easy to have 
> a (possibly throwaway) table to translate between fileids and uuids.  This 
> is a "normalized" database model.

Mapping everyone's ids to everyone else's ids makes the common case more
complex.

>>With uuids, you get this correspondence automatically.  If the file came
>>from the same ultimate source, it's treated the same in every tree that
>>contains it.
> 
> 
> And if files didn't come from the same source but are the same regardless then 
> things start to get messy, and the uuid is just a confusing liability.

Actually, uuids are helpful there, too.  If the files in tree A have one
set of uuids and the files in tree B have a different set, then you can
publish a table asserting the equivalence of each id pair.  This means
that only one person has to establish the equivalence.  In fact, you can
publish the table as part of A or B, and it will Just Work for anyone
trying to merge between them.
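
As a rough sketch of what such a table buys you (Python; the ids, paths,
and function name are made up for illustration and are not Arch's actual
format or code), matching files across the two trees might look like:

```python
# Hypothetical sketch: pair up files in two trees using a published
# table of equivalent ids.  Nothing here is taken from Arch itself.

def match_files(tree_a, tree_b, equivalences):
    """Return {path_in_a: path_in_b} for files considered the same.

    tree_a, tree_b -- dicts mapping file id -> path within that tree
    equivalences   -- iterable of (id_in_a, id_in_b) pairs asserted equal
    """
    # Translate B's ids into A's ids where the table says they match.
    b_to_a = {id_b: id_a for id_a, id_b in equivalences}
    matches = {}
    for id_b, path_b in tree_b.items():
        # Ids with no table entry are used unchanged, so files that
        # already share an id still match without any table at all.
        id_a = b_to_a.get(id_b, id_b)
        if id_a in tree_a:
            matches[tree_a[id_a]] = path_b
    return matches
```

The table is only consulted for ids that differ between the trees; files
that already share an id never touch it.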

>  The 
> same with SHA1, which attempts to work around this, imho.

Using a hash of file info to represent identity means that two files
with the same info will get the same identity.  But if they're not
supposed to have the same identity, that's a worse situation than two
equivalent files being treated as different.
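
To make the difference concrete, here is a small sketch (Python; I'm
assuming "file info" means the raw file contents, and the helper names
are hypothetical) comparing a hash-derived identity with an assigned one:

```python
import hashlib
import uuid

def content_id(data: bytes) -> str:
    """Identity derived from the file's bytes (assumption: SHA1 of contents)."""
    return hashlib.sha1(data).hexdigest()

def assigned_id() -> str:
    """Identity assigned once, when the file is added, independent of contents."""
    return uuid.uuid4().hex

# Two unrelated files that happen to start out with identical bytes,
# e.g. the same boilerplate header copied into two places.
file_one = b"/* placeholder */\n"
file_two = b"/* placeholder */\n"

# Hash-derived ids collide: the two files cannot be told apart.
hashed_collide = content_id(file_one) == content_id(file_two)

# Assigned ids stay distinct no matter what the files contain.
assigned_differ = assigned_id() != assigned_id()
```

With the hash, the only way to keep the two files apart is to change
what gets hashed; with an assigned id, nothing needs to be done.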

>  Not to mention 
> that both are considerably bulkier than a simple sequence number.

Hey, let's not prematurely optimize our file size.  How many bytes do
they really cost?

>>In this case, the Arch model requires tables similar to the ones you
>>described earlier, to map one uuid to another.
> 
> 
> So since the tables are required anyway, let's rely on them and thereby 
> normalize and shrink the database a little.

They're not normally required.  Implementation can be deferred until
they are required.  In the life of Arch, this hasn't itched anyone
enough that they scratched it.

> Note my schizophrenic position with regards to micro-optimizing the canonical 
> verlog, vs other things.  I freely admit to that, this is the important one.

Well, as long as you admit it, it's okay.  :-)

>>As has been done so far in Arch :-)  This problem is only likely to
>>occur when multiple people import the same well-known project.

> This situation comes up _all the time_ for me, I don't know about you.

It never comes up for me.  The projects I work on all have a canonical
Arch archive, so I don't re-import anything.  Still, it would be quite
cheap to provide a merge --filenames option.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCWb2Y0F+nu1YWqI0RAitvAJ469c4bbpRdB7Er4W0CDfu0pfv2awCdFK/T
bW4aHMqpvLeMD3qtUulkkZk=
=FpYY
-----END PGP SIGNATURE-----