Signing snapshots
Martin Pool
mbp at sourcefrog.net
Wed Jun 22 04:05:07 BST 2005
On 21 Jun 2005, John A Meinel <john at arbash-meinel.com> wrote:
> >My understanding is that we're signing the tree and revision state, not
> >the storage of the tree or revision. The point of signing is so that
> >you can verify that a given revision is or is not a true copy of
> >revision $foo. That way, it doesn't matter how a given revision was
> >produced, and you can produce it whatever way is most effective for the
> >context.
>
> Sure. You are signing the *output* format. Meaning that you can get the
> working tree into the state that I put it in, not the actual archive files.
> However, in the end, it comes down to signing a bunch of bytes. And at
> least the last proposal I read was to sign the bytes that make up the
> revision file. Which means that modifying that at all means you need to
> re-sign.
Right. All (mainstream?) signature algorithms just sign a stream of
bytes, so eventually we have to get into that form. There is a tension
here between wanting to sign something abstract (so we can change the
storage method), signing something concrete (so there is less scope
for attacks that change the semantics without changing the signature),
and keeping things simple. A related question is whether we should sign
the XML in the form it happens to be written by the library, or whether
we should try to canonicalize it first. The latter doesn't seem to buy
us anything and does add complication.
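To make the "sign the bytes as written" option concrete, something like
this is roughly what I have in mind (just a sketch; the function name
and the .sig suffix are illustrative, not anything bzr does today):

    import subprocess

    def sign_revision_file(revision_path, key_id=None):
        # Produce revision_path + '.sig', a detached ASCII-armoured
        # signature over the file's bytes exactly as they were written,
        # with no canonicalization step in between.
        cmd = ['gpg', '--armor', '--detach-sign',
               '--output', revision_path + '.sig']
        if key_id:
            cmd += ['--local-user', key_id]
        cmd.append(revision_path)
        subprocess.check_call(cmd)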
> >I think that's not great, because it allows someone to stuff bogus
> >recent revisions into a branch, and get away with it. I think designing
> >for re-signing is better.
Can you expand a bit on this attack?
> But what happens if *I* upgrade to sha-160, but I'm branching off of
> *you* and you haven't upgraded. I can't re-sign your signatures, I still
> have to trust sha1 for the first 600 revisions.
Yes.
It should be possible to upgrade and re-sign previous revisions, which
is one reason why I don't like using the hash as the main identifier.
But it should also be possible to keep referring to old revisions that
haven't been upgraded.
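Very roughly, this is what I mean by keying signatures on the
revision-id rather than on the hash (an invented store layout, purely
for illustration):

    # revision_id -> (digest algorithm used, armoured detached signature)
    signatures = {}

    def resign(revision_id, revision_bytes, sign_func):
        # Replace whatever signature we previously held for this revision;
        # anything that refers to it by revision_id is unaffected.
        signatures[revision_id] = ('sha256', sign_func(revision_bytes))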
> Plus I was trying to work out what the notation would be if you wanted
> to handle rollup changesets. (Give me a changeset for the last 10
> revisions). Mostly it was an interface issue, and how to translate what
> the user gave you into an appropriate set of revision ids. Some
> examples: for "bzr changeset -r 10", should it give you a rollup from
> 10->working tree or should it be 10->lastrev? For "bzr changeset", is
> that "give me the last commit" or the change relative to the working
> tree? I was just having trouble with when None meant no user input, and
> when None meant use last/first. So I punted for now.
I was wondering about that too. On the whole I'd say the "export
changeset" code shouldn't ever worry about the working tree. This does
make it a bit different from diff, but in some ways it would be nice to
unify them.
Perhaps we can have "bzr diff --format=changeset" which defaults to
head:working and produces similar output.
> >>The idea with a detached signature is that you don't actually have to
> >>parse a semantic meaning to the bytes. Just read in some bytes and
> >>compute the sha hash, then check the signature, if you have a
> >>convention for where the ---BEGIN and ---END lines exist.
> >
> >What I don't parse is "this approach flakes out of the more important
> >problem of evaluating whether the code is signed by a meaningful key".
> >It sounds like he's talking about *not* signing the hash, to me.
No, that wasn't what I meant.
"This approach" meant verifying the signature without also considering
whether the signing key is an authoritative one for what it signs.
> The issue, I think, is that after checking that the signature is valid,
> you want to go back and make sure the person who signed the signature is
> the person doing the committing. (So that <revision committer="foo">
> matches the name on the gpg key).
> The problem is that you still have to parse the <revision> xml in order
> to be able to match to the gpg key. Which means that someone who is
> being malicious (but has a valid key), can still require you to parse
> bogus data.
Right.
> I don't think there is any way to get around needing to try and
> understand the data they sent, in order to make sure it is valid. You
> could restrict which keys you are willing to look at more closely, though.
> So you check the signature, make sure it is valid, *and* that you trust
> it enough to read the revision contents to make sure that the
> committer="" tag matches the gpg key.
Right, and again while the default should be to use the web of trust I
can imagine some users wanting only specific identified keys to even be
considered.
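The flow I have in mind is roughly this (a simplified sketch; real gpg
status output is richer than what is parsed here):

    import subprocess

    def verify_and_identify(sig_path, data_path):
        # Step 1: let gpg verify the detached signature and report which
        # key made it.  Only after that key passes the trust check would
        # we parse the revision XML and compare committer="" with the
        # key's user id.
        proc = subprocess.run(
            ['gpg', '--status-fd', '1', '--verify', sig_path, data_path],
            capture_output=True, text=True)
        fingerprint, user_id = None, None
        for line in proc.stdout.splitlines():
            if line.startswith('[GNUPG:] VALIDSIG '):
                fingerprint = line.split()[2]
            elif line.startswith('[GNUPG:] GOODSIG '):
                user_id = line.split(None, 3)[3]
        if proc.returncode != 0 or fingerprint is None:
            raise ValueError('signature did not verify')
        return fingerprint, user_id

The caller would then check the fingerprint against the keys it trusts
for this branch before even looking at the revision XML.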
> >Yes, I'm considering the possibility of two layers of signing and two
> >levels of trust. One signature on the binary output format, to prove
> >that the data is not malicious, and one signature on the revision/tree,
> >to prove that the output is a true copy of a given revision.
> >
> >That way, you can make a revfile version of the bzr codebase, and I can
> >trust that revfile version.
> >
> It depends how you want to trust revfiles. Because it is certainly
> possible that chunks are added as you go, and some of them may not be
> used. Removing them does not invalidate the revfile.
>
> >Note however, that since we don't want to download the entire revfile,
> >we can't quickly validate them against a signature, and worse, their
> >hash will change with every commit. I guess we'll have to sign the
> >logical chunks contained within revfiles.
>
> The current method is to sign the revision-store file, with the idea
> being that if you started at the beginning, you could validate the
> revfile. Validating a full-text is easy, then you patch it, and can
> validate that against the next inventory sha, which is validated from
> the revision entry.
I don't see any harm in also signing the representation (revfiles or
gzipped files or what have you) as additional protection against malicious
formats. I do think the most important thing is to sign the logical
commits.
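For illustration, the validation chain might look roughly like this,
assuming each expected sha1 comes from the signed revision and inventory
entries (apply_patch and the argument layout are placeholders, not
actual bzr internals):

    import hashlib

    def validate_chain(base_text, patches, expected_sha1s, apply_patch):
        # base_text (bytes) has already been checked against the sha1 in
        # the signed revision entry; rebuild each later version by
        # patching, and check it against the sha1 recorded for it.
        text = base_text
        for patch, expected in zip(patches, expected_sha1s):
            text = apply_patch(text, patch)
            if hashlib.sha1(text).hexdigest() != expected:
                raise ValueError('reconstructed text does not match '
                                 'its recorded sha1')
        return text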
> >The issue is that I would like to be able to associate a given branch
> >with a given key, and not accept anyone else's signature on that branch,
> >even if they are trusted.
>
> Sure. You can do this with custom keyrings. So that you say "trust
> things on this branch using this keyring."
> I'm not really sure how to specify what keys can commit to what
> branches. Are you thinking to add something into the .bzr/ directory
> such as "x-allowed-keys"?
Yes, I think that can make sense as a client-side setting. So perhaps
we initially get the branch with
bzr branch http://fooo/bar --trust-key john at meinel
and that's remembered for all future pulls from that branch. If a
revision appears on the mainline that is not signed by that key then
either the branch is compromised, or John has got a new key or added a
new committer and needs to tell me.
This guards against Vlad having a key that is well-signed and trusted
for other branches, and using it to put a signed malicious commit onto a
branch he's not supposed to directly commit to.
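As a sketch of that client-side setting, assuming the trusted keys end
up in something like the x-allowed-keys file John mentioned (the
location and the signer lookup are illustrative):

    import os

    def load_trusted_keys(branch_dir):
        # Keys remembered when the branch was first fetched with --trust-key.
        path = os.path.join(branch_dir, '.bzr', 'x-allowed-keys')
        if not os.path.exists(path):
            return None                    # no restriction configured
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    def check_mainline(revision_ids, signer_of, trusted_keys):
        # signer_of(rev_id) -> key fingerprint, however that is obtained.
        for rev_id in revision_ids:
            if trusted_keys is not None and signer_of(rev_id) not in trusted_keys:
                raise ValueError('%s is signed by an unexpected key' % rev_id)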
We've discussed a bit the idea of having the upstream branch say which
keys are allowed to be present there, but it just goes in circles with
no trusted foundation. How do you know the attacker didn't add their
own key to the list? Perhaps in the example above we could look for a
message signed by John saying Alice is now also allowed to write to this
branch.
Consider also that Alice should be able to copy John's branch and start
adding on to it without needing permission from John; she needs to tell
people that both John's and her own key should be trusted.
As a UI feature we should check from time to time that keys have not
been revoked; I don't think GPG does that by default.
> >It seems to me that chrooting is the more paranoid option. If you focus
> >on validating data, you're saying: "This code has few bugs, but if they
> >are exploited, you can get in trouble". The chroot approach says "This
> >code may or may not have bugs, but the chance for damage if they are
> >exploited is minimal".
> >
> >Of course, the importance of being able to say the second is a value
> >judgement.
>
> The problem is that a chroot is inherently of limited functionality. So
> you have to make sure that everything you need is in that chroot. For
> instance, are you checking signatures and thus need gpg? Do you need to
> have access to the python standard libraries? Is it possible to exit the
> chroot()? I forget the specific steps, but I thought there were some
> ways to do it. Is the "StreamTree" class free from bugs, such that you
> can't exploit the remote conversation to cause the local bzr to do bad
> things?
>
> I agree, limiting the damage is nice, but it might be a considerable
> effort, which might be better off spent finding & eliminating bugs.
> If the effort is not very large, I'm certainly for adding the limitation.
I think you're right that it's better to just aim for correctness and
simplicity.
Rather than a chroot, perhaps we should use a Python jail, so that the
code that unpacks a changeset can't directly do any IO or touch the rest
of the library. It just passes out a changeset object which can be
examined before it's used.
chroot on standard unix has some limitations here: we'd need a setuid
helper to get into the jail, and it only covers filesystem access, not
e.g. ptracing another process.
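The jail idea, very roughly (the Changeset fields and the header/patch
layout are invented for illustration):

    class Changeset(object):
        def __init__(self, revision_id, committer, patch_text):
            self.revision_id = revision_id
            self.committer = committer
            self.patch_text = patch_text

    def parse_changeset(data):
        # Deliberately takes the raw text rather than a filename or URL,
        # so this code has nothing to do IO with; it only builds a plain
        # object the caller can look over before applying anything.
        # The "header, blank line, patch body" layout is invented here.
        header_text, _, patch_text = data.partition('\n\n')
        header = {}
        for line in header_text.splitlines():
            key, _, value = line.partition(': ')
            header[key] = value
        return Changeset(header.get('revision-id'),
                         header.get('committer'),
                         patch_text)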
--
Martin