millions of warnings per day, state DB grows by 4GB/day

Peter Grandi pg at juju.list.sabi.co.UK
Wed Aug 26 15:05:11 UTC 2015


Looking at a MAAS+Juju collection of 12 hosts, with 3 Juju
"state" nodes. The collection runs OpenStack and Ceph and is
apparently still working. It was mostly installed in May, 3
months ago, on Ubuntu 14.04 LTS with these package versions:

ii  juju-core          1.23.2-0ubuntu amd64          Juju is devops distilled - client
ii  juju-deployer      0.4.3-0ubuntu1 all            Deploy complex stacks of services using J
ii  juju-local         1.23.2-0ubuntu all            dependency package for the Juju local pro
ii  juju-mongodb       2.4.9-0ubuntu3 amd64          MongoDB object/document-oriented database
ii  juju-quickstart    2.0.1+bzr124+p all            Easy configuration of Juju environments
ii  python-jujuclient  0.17.5-0ubuntu all            Python API client for juju

The 2 "state" nodes are 'node01' (machine 0), 'node02' (machine
9), 'node09' (machine 10). There are some worrying symptoms:

*   The MongoDB database size is 204GiB on 'node01', 199GiB on
    'node02' and 208GiB on 'node09' (roughly the same, of course)
    and grows by around 4GiB per day. That is, the number of
    'juju.NNN' data files grows constantly and is currently
    around 800 (see the sketch after this list).

*   Probably correlated with this, there are several MB/s of
    traffic among the "state" nodes on the Juju port, currently
    mostly from 'node02' to 'node01' and 'node09'.

*   'jujud' consumes anywhere from 50% of a rather speedy recent
    Xeon CPU to 2-3 CPUs, except on 'node09'.

*   'mongod' consumes 1-3 CPUs on 'node01' and 'node02'.

*   The 'machine-0.log*', 'machine-9.log*' and 'machine-10.log*'
    files are large, and in particular they often contain
    millions of lines per day of this warning:

      juju.lease lease.go:301 A notification timed out after 1m0s

    sometimes at a frequency of thousands per second.
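
In case it is useful, here is a minimal sketch of how the on-disk
growth numbers above can be reproduced; it assumes the stock
juju-mongodb data directory '/var/lib/juju/db' (adjust DB_DIR if
your layout differs):

    #!/usr/bin/env python
    # Sketch: count the 'juju.NNN' MongoDB data files and report
    # their total size under the juju-mongodb data directory.
    # Assumes the default path /var/lib/juju/db.
    import os
    import re

    DB_DIR = "/var/lib/juju/db"            # assumed default data path
    DATAFILE = re.compile(r"^juju\.\d+$")  # data files: juju.0, juju.1, ...

    total_bytes = 0
    datafiles = 0
    for name in os.listdir(DB_DIR):
        path = os.path.join(DB_DIR, name)
        if DATAFILE.match(name) and os.path.isfile(path):
            datafiles += 1
            total_bytes += os.path.getsize(path)

    print("%d 'juju.NNN' files, %.1f GiB total under %s"
          % (datafiles, total_bytes / 1024.0 ** 3, DB_DIR))

Running it on consecutive days gives the growth rate directly.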

As to the logs, I have prepared these statistical summaries:

  Number of notification timeouts per day, worst days, per node:
    http://paste.ubuntu.com/12199755/ 'node01'
    http://paste.ubuntu.com/12199756/ 'node02'
    http://paste.ubuntu.com/12199757/ 'node09'

  Most popular non-timeout warnings and errors, by worst day,
  per node:

    http://paste.ubuntu.com/12199759/ 'node01'
    http://paste.ubuntu.com/12199760/ 'node02'
    http://paste.ubuntu.com/12199761/ 'node09'
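
For reference, counts of this kind can be tallied with a small
script along these lines (a sketch only; it assumes each log line
starts with an ISO 'YYYY-MM-DD' date field, so adjust the parsing
if the format differs):

    #!/usr/bin/env python
    # Sketch: count 'juju.lease ... notification timed out' warnings
    # per day in a machine agent log read from stdin. Assumes each
    # line starts with an ISO 'YYYY-MM-DD' date field.
    import sys
    from collections import Counter

    per_day = Counter()
    for line in sys.stdin:
        if "juju.lease" in line and "notification timed out" in line:
            day = line.split(" ", 1)[0]   # leading date field
            per_day[day] += 1

    # Print the worst days first, one 'YYYY-MM-DD count' pair per line.
    for day, count in sorted(per_day.items(),
                             key=lambda kv: kv[1], reverse=True):
        print("%s %d" % (day, count))

Fed the concatenated 'machine-0.log*' files on stdin, it prints
one line per day, worst days first.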

Apparently there have been a couple of operational mishaps, but
the dates on which they are reported to have happened are not
quite those on which I see the most logged errors, or later ones.
Some colleagues think that some endpoints are "dangling".

Please let me know what to look at and ideally where to find some
internals documentation, as I am not at all familiar with the
internals of the Juju state system.



