millions of warnings per day, state DB grows by 4GB/day
Peter Grandi
pg at juju.list.sabi.co.UK
Wed Aug 26 15:05:11 UTC 2015
Looking at a MAAS+Juju collection of 12 hosts, with 3 Juju
"state" nodes. The collection runs OpenStack and Ceph, and is
apparently still working. It was mostly installed in May, 3
months ago, on Ubuntu 14.04 LTS with these package versions:
ii juju-core 1.23.2-0ubuntu amd64 Juju is devops distilled - client
ii juju-deployer 0.4.3-0ubuntu1 all Deploy complex stacks of services using J
ii juju-local 1.23.2-0ubuntu all dependency package for the Juju local pro
ii juju-mongodb 2.4.9-0ubuntu3 amd64 MongoDB object/document-oriented database
ii juju-quickstart 2.0.1+bzr124+p all Easy configuration of Juju environments
ii python-jujuclient 0.17.5-0ubuntu all Python API client for juju
The 2 "state" nodes are 'node01' (machine 0), 'node02' (machine
9), 'node09' (machine 10). There are some worrying symptoms:
* The MongoDB database size is 204GiB on 'node01', 199GiB on
  'node02', 208GiB on 'node09' (roughly the same, of course),
  and grows by around 4GiB per day. That is, the number of
  'juju.NNN' files grows constantly and is currently around
  800 (see the command sketch after this list).
* Probably related to this, there is a transfer rate of several
  MB/s among the "state" nodes on the Juju port, currently
  mostly from 'node02' to 'node01' and 'node09'.
* 'jujud' consumes anywhere from 50% of one core of a rather
  speedy recent Xeon to 2-3 cores, except on 'node09'.
* 'mongod' consumes 1-3 cores on 'node01' and 'node02'.
* The 'machine-0.log*', 'machine-9.log*', 'machine-10.log*'
  files are large, and in particular there are often millions
  of lines per day of this warning:
    juju.lease lease.go:301 A notification timed out after 1m0s
  sometimes at a rate of thousands per second.
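For reference, the sort of commands behind the size and file-count
figures above, as a rough sketch only; that /var/lib/juju/db is the
juju-mongodb data directory is an assumption on my part:

  # Total on-disk size of the Juju state database files.
  sudo du -sh /var/lib/juju/db

  # Number of 'juju.NNN' extent files allocated so far.
  sudo ls /var/lib/juju/db | grep -c '^juju\.[0-9]'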
As to the logs, I have prepared these statistical summaries.
Number of notification timeouts per day, worst days, per node:
http://paste.ubuntu.com/12199755/ 'node01'
http://paste.ubuntu.com/12199756/ 'node02'
http://paste.ubuntu.com/12199757/ 'node09'
Most popular non-timeout warnings and errors, by worst day,
per node:
http://paste.ubuntu.com/12199759/ 'node01'
http://paste.ubuntu.com/12199760/ 'node02'
http://paste.ubuntu.com/12199761/ 'node09'
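For what it is worth, counts like those above can be produced with
nothing more than grep/awk over the machine agent logs on each
"state" node; a rough sketch (the /var/log/juju location and the
leading 'YYYY-MM-DD HH:MM:SS' timestamp on each line are
assumptions):

  # Per-day count of lease notification timeouts in one machine log;
  # the date is the first field of each log line. Repeat with
  # machine-9.log / machine-10.log on the other "state" nodes.
  grep 'A notification timed out' /var/log/juju/machine-0.log \
    | awk '{count[$1]++} END {for (d in count) print d, count[d]}' \
    | sort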
Apparently there have been a couple of operational mishaps, but
the dates on which they are reported to have happened are not
quite those on which I see the most logged errors, or are later.
Some colleagues think that some endpoints are "dangling".
Please let me know what to look at, and ideally where to find
some documentation of the internals, as I am not at all familiar
with the inner workings of the Juju state system.
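In case it helps, this is roughly how I would try to see which
collections account for the growth, assuming the usual juju-mongodb
arrangement; the port (37017), the 'admin' user, the 'oldpassword'
field in agent.conf and the paths below are all assumptions on my
part, and the 'mongo' shell used needs SSL support:

  # Fetch the admin password from the machine agent's configuration.
  PW=$(sudo awk '/^oldpassword:/ {print $2}' \
         /var/lib/juju/agents/machine-0/agent.conf)

  # Print document count and on-disk size (in MiB) per collection
  # of the 'juju' database.
  mongo --ssl -u admin -p "$PW" localhost:37017/admin --eval '
    var d = db.getSiblingDB("juju");
    d.getCollectionNames().forEach(function (c) {
      var s = d.getCollection(c).stats(1024 * 1024);
      print(c + "  count: " + s.count + "  storageSize MiB: " + s.storageSize);
    });
  '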