Signing snapshots

Aaron Bentley aaron.bentley at utoronto.ca
Wed Jun 22 18:41:24 BST 2005


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> On 21 Jun 2005, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
>>For example, if you sign a hash of
>>mbp at sourcefrog.net-20050309040815-13242001617e4a06, and the hash
>>algorithm is later broken, it should be possible to re-sign that
>>revision using a later hash, yet still be able verify it using the old
>>hash.  And it would also be nice to be able to remove the old hash
>>without disturbing the new hash.
> 
> 
> There are a few possibilities.  For one thing, we can just have
> multiple signatures.  For example if my signing key expires or is
> revoked, I might want to go through and add signatures using a
> different key.

Yes, multiple signatures make sense, but that wasn't really what I
meant.  I mean, first associate a new hash with the revision, and then
sign the new hash.  I'm pretty sure we already agreed that multiple
signatures are a good thing, but the mechanism of associating new data
with the revision seems to need tuning.

I guess I should point out that none of this data technically *has* to
be in the revision or the inventory.  We could introduce a "validation
store" to hold these hashes.

> If we decide we want to use a stronger hash algorithm then we'll
> probably want to not just add a new signature at the top level, but
> also regenerate inventories and revision records that use the stronger
> hash.  Since that changes the text of the revision the original
> signature will not be valid.

I was proposing that we not let this happen.  For trees at least, it's
nice to be able to verify that an independently-constructed tree (e.g.
using changesets delivered by a smart server) is in fact a true copy of
the revision FOO tree.  I think this is one of the advantages of
snapshot-orientation-- you don't care how a tree was produced, as long
as you can verify that it is what you expected it to be.

> One approach is to let that happen and
> just re-sign the revisions.   
> 
> Another is to keep the old form of the object with the old signature,
> and make a new form of the object with a new signature.  That means
> that mbp at sourcefrog-29381938123 will refer to more objects that have
> different texts and signatures, but supposedly equivalent meaning.
> I'm not completely sure that would be worth the possible confusion but
> it is an option.

I think the best thing is to not treat hashes as part of the inventory
identity, but as supplemental verification data.  That way, you're not
really creating a different object when you add new hashes.


> At the moment, when we generate a hash, we simply make a hash of the
> text; the code that generates a hash of the inventory doesn't know or
> care what kind of hash the inventory uses to identify the contained
> files.

I assume this means that two valid inventories with the same meaning
could differ textually, and have different hashes.  That seems really
unfortunate to me.  It would be nice to be able to say "If the inventory
sha-1 hash is not X, it is not a true copy of Y."

> I guess we could do what you say by pre-processing the inventory file
> to strip out all the hashes but those relevant to the one we're
> computing.

No, the way I'd do it is by not signing the inventory file-- sign the
inventory data instead.  As a straw man, you'd sort it by unicode
codepoint, then write out a space-delimited inventory summary with id,
name, parent, type, and contents-hash (if applicable) fields for each
entry.  The format doesn't need to be parseable, just unique for each tree.
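A minimal sketch of that straw man, assuming illustrative field names
rather than bzr's real inventory schema:

```python
import hashlib

# Canonical inventory summary: sort entries by id (plain unicode
# codepoint order), then emit one space-delimited line per entry.
# The output only needs to be unique per tree, not parseable.
def inventory_summary(entries):
    lines = []
    for entry in sorted(entries, key=lambda e: e["id"]):
        fields = [entry["id"], entry["name"], entry["parent"], entry["kind"]]
        if "text_sha1" in entry:  # contents hash, where applicable
            fields.append(entry["text_sha1"])
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"

def inventory_hash(entries):
    # Two inventories with the same meaning now hash identically,
    # however the inventory file itself happened to be written out.
    return hashlib.sha1(inventory_summary(entries).encode("utf-8")).hexdigest()
```

With this, "if the inventory hash is not X, it is not a true copy of Y"
holds regardless of how any particular inventory file was formatted.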

>>In light of this, I don't know what to make of the recently-added
>>"revision_sha1" attribute for parent revisions.  I thought the notion
>>was that we would sign the entire revision history.  This means that
>>creating a sha-160 signature for a revision requires adding sha-160s to
>>every ancestor revision.  I think this makes merge horizons impossible.
> 
> 
> I don't see why that follows; we could have a sequence of revisions
> where at some point we switch from using sha-1 to sha-160.

Oh, it follows if you assume you shouldn't hash data generated by
another hash algorithm.

>>Requiring people to commit in order to
>>produce changesets seems onerous.
> 
> 
> I'm not quite sure what you mean.  I think normally I would ask people
> to commit before e.g. submitting a changeset by mail, because
> otherwise we don't have any good identifier of what was submitted.

Right now, I make all my quickie patches to your tree, save them, and
revert the tree.  That's because, until recently, we didn't have enough
metadata to make merging painless, and we're still not using that data.

> Remember that gpg internally hashes the input data before computing a
> signature; the signature is actually the signature of a hash.  So arch
> is actually storing a signature of a hash of a file containing hashes.
> If we just make a detached-signature of the revision xml then we will
> avoid one extra step and just store the signature of the hash of that
> file.

For validating that the data is not malicious, that would be more than
adequate.
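To illustrate the extra step being discussed (digests only; the actual
signing is gpg's job, and the file format here is made up):

```python
import hashlib

revision_text = b"<revision id='demo'>...</revision>"

# A detached signature of the revision text is, internally, a
# signature over hash(revision_text):
direct_digest = hashlib.sha1(revision_text).hexdigest()

# The arch-style approach signs a file that itself contains hashes,
# so the signature covers hash(file-of-hashes) -- one level deeper.
hash_file = ("sha1 %s\n" % direct_digest).encode("ascii")
indirect_digest = hashlib.sha1(hash_file).hexdigest()
```

Either way a valid signature chains down to the revision contents; the
detached-signature route just drops the intermediate hash file.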

> Right, so we could say that only data signed by a trusted key gets
> considered at all, and then in a second round we check if it's
> authoritative.

Yes, I just wonder whether it's useful to make that distinction.  I
suppose it would reduce exposure if a trusted key was compromised.  But
in general, someone who wouldn't send us malicious data also wouldn't
attempt to impersonate the branch owner.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCuaLE0F+nu1YWqI0RAtHAAJsGoFP+9ZuNOXqTSRUtg0kv3OUxsQCcDq5h
LReEjCDk1liZgaUOfxzYaLo=
=fDfV
-----END PGP SIGNATURE-----




More information about the bazaar mailing list