Last round of Scale testing

John Arbash Meinel john at arbash-meinel.com
Thu Nov 7 13:18:44 UTC 2013


I did one last round of scale testing before handing this over to Dave.

I was using trunk @2037 with 2 patches (one to set GOMAXPROCS, and one
to allow "juju add-unit -n X --to Y").

I started with:

juju bootstrap --constraints="mem=7G cpu-cores=4"
which started the state server on an m1.xlarge (4CPU 15GB RAM)

then I did

juju deploy -n 15 ubuntu --constraints="mem=7G cpu-cores=2"
which creates 15 m1.larges.

In this case, 1 of those machines just never came up (ended up showing
as Terminated as soon as I started it).

I increased the total number of agents to over 1500:

for i in `seq 15`; do juju add-unit ubuntu -n 100 --to $i & done; time wait
2min50s

I then brought the number of agents up to over 4000 with:
for i in `seq 15`; do juju add-unit ubuntu -n 200 --to $i & done; time wait
9min26s

It took another 3 minutes (12 total) for the log to become quiet,
indicating all agents had reached steady state.


I then triggered "sudo restart jujud-machine-0"

It took 6m41s before the log became quiet again, which is actually
pretty good.

I then ran another "add 3000 agents" with:
for i in `seq 15`; do juju add-unit ubuntu -n 200 --to $i & done; time wait
19min8s

It took 27 minutes for all units to come up.
We ended up with 2.6GB resident on the API server.

"juju status" shows 7029 started agents, 502 pending (the ones on the
dead machine), 0 down, and lsof shows 7048 open file handles on jujud.
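(Aside: instead of lsof, the file-handle count can also be read straight
out of /proc on Linux. The pgrep pattern below is a guess at the process
name; the snippet falls back to the current shell's pid so it runs anywhere.)

```shell
# Count open file descriptors for a process via /proc (Linux).
# For the API server you would use the jujud pid; pgrep -o picks the
# oldest matching process, and we fall back to this shell's own pid
# so the example is runnable even without jujud present.
pid=$(pgrep -o jujud || echo $$)
ls "/proc/$pid/fd" | wc -l
```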


I then tried adding nrpe-external-master to all the units.
juju deploy nrpe-external-master
juju add-relation nrpe-external-master ubuntu

This didn't take very long, but at the end of it things seemed jammed.
The API server responds when another agent tries to connect to it
(tested by restarting a unit agent on another machine and seeing the
login request in the local log). However, the Login request never
returns, and 0 CPU is being consumed on the machine.

So I restarted jujud again with "restart jujud-machine-0".

I saw it hit 14GB of resident memory (and this machine has no swap),
and jujud+mongodb were able to hit 400% CPU, but the agents showed as
"started" again.

So 7000 agents + nrpe-external-master has effectively killed the
machine, though at a somewhat smaller scale we'd probably be ok.

The big concern is that I do see the CPU go all the way down to 5%
while there are requests still waiting to be processed.

Some interesting bits: while spinning up the 4000 agents, this is the
summary of API requests made:
...
  1724 "Type":"Machiner","Request":"Life"
  4214 "Type":"Deployer","Request":"APIAddresses"
  4214 "Type":"Deployer","Request":"Life"
  4214 "Type":"Deployer","Request":"SetPasswords"
  4214 "Type":"Deployer","Request":"StateAddresses"
  4214 "Type":"StringsWatcher","Id":"6","Request":"Stop"
  4214 "Type":"Uniter","Request":"CharmArchiveSha256"
  4214 "Type":"Uniter","Request":"CharmArchiveURL"
  4214 "Type":"Uniter","Request":"CurrentEnvironUUID"
  4214 "Type":"Uniter","Request":"ProviderType"
  4214 "Type":"Uniter","Request":"SetCharmURL"
  4214 "Type":"Uniter","Request":"SetPrivateAddress"
  4214 "Type":"Uniter","Request":"SetPublicAddress"
  4214 "Type":"Uniter","Request":"WatchConfigSettings"
  4215 "Type":"NotifyWatcher","Id":"7","Request":"Next"
  4226 "Type":"NotifyWatcher","Id":"2","Request":"Next"
  4228 "Type":"Agent","Request":"SetPasswords"
  4228 "Type":"StringsWatcher","Id":"6","Request":"Next"
  4229 "Type":"Agent","Request":"GetEntities"
  4229 "Type":"Logger","Request":"WatchLoggingConfig"
  4229 "Type":"Upgrader","Request":"SetTools"
  4229 "Type":"Upgrader","Request":"WatchAPIVersion"
  4230 "Type":"StringsWatcher","Id":"8","Request":"Next"
  4230 "Type":"Upgrader","Request":"DesiredVersion"
  4243 "Type":"NotifyWatcher","Id":"3","Request":"Next"
  4260 "Type":"Admin","Request":"Login"
  8428 "Type":"Uniter","Request":"SetStatus"
  8428 "Type":"Uniter","Request":"Watch"
  8428 "Type":"Uniter","Request":"WatchServiceRelations"
  8459 "Type":"Logger","Request":"LoggingConfig"
  8780 "Type":"NotifyWatcher","Id":"4","Request":"Next"
 12642 "Type":"Uniter","Request":"APIAddresses"
 12642 "Type":"Uniter","Request":"PrivateAddress"
 12642 "Type":"Uniter","Request":"PublicAddress"
 13003 "Type":"Uniter","Request":"Resolved"
257797 "Type":"NotifyWatcher","Id":"5","Request":"Next"
268730 "Type":"Uniter","Request":"CharmURL"
294375 "Type":"Uniter","Request":"Life"
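(A summary in this shape can be produced with a grep/sort/uniq pipeline
over the API server log. The exact log format is an assumption here, so
the sample file below fabricates a few request lines just to show the idea.)

```shell
# Fabricated sample of apiserver request log lines (format assumed).
cat > /tmp/apiserver-sample.log <<'EOF'
2013-11-07 12:00:01 DEBUG juju.rpc <- {"Type":"Uniter","Request":"Life","Params":{}}
2013-11-07 12:00:02 DEBUG juju.rpc <- {"Type":"Uniter","Request":"Life","Params":{}}
2013-11-07 12:00:03 DEBUG juju.rpc <- {"Type":"NotifyWatcher","Id":"5","Request":"Next"}
EOF

# Extract the Type/Id/Request triple from each line and count
# occurrences, least frequent first -- the same shape as the summaries
# in this mail.
grep -oE '"Type":"[^"]*"(,"Id":"[^"]*")?,"Request":"[^"]*"' \
    /tmp/apiserver-sample.log | sort | uniq -c | sort -n
```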

So you can see we're doing better than N^2 (because I'm using
-n 100), but we are still spending a lot of time in this portion.
Everything else is quite constant time (1-3 calls per agent).
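(As a back-of-the-envelope check on that scaling claim -- a toy model,
not a measurement: if every addition fires the watcher of each unit
already deployed, then adding N units one at a time costs about N^2/2
watcher events, while adding them in batches of B costs roughly
N*(N-B)/(2B).)

```shell
# Toy model (assumption): each batch of additions wakes every
# already-deployed unit's watcher once, giving roughly N*(N-B)/(2*B)
# events to add N units in batches of B.
N=4000
for B in 1 100; do
    echo "B=$B events=$(( N * (N - B) / (2 * B) ))"
done
# B=1   -> 7998000 events (one at a time)
# B=100 -> 78000 events (batched)
```

The measured CharmURL/Life counts (~270k-290k for 4000 agents) fall
between those two figures: batching clearly cuts the worst-case N^2/2,
but there is still more fan-out than the ideal batched model predicts.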

On restarting jujud (triggering each agent to reconnect), the log looks
like:
...
    494 "Type":"NotifyWatcher","Id":"7","Request":"Next"
    508 "Type":"StringsWatcher","Id":"6","Request":"Next"
   3721 "Type":"NotifyWatcher","Id":"6","Request":"Next"
   3721 "Type":"StringsWatcher","Id":"7","Request":"Next"
   4214 "Type":"Uniter","Request":"APIAddresses"
   4214 "Type":"Uniter","Request":"CurrentEnvironUUID"
   4214 "Type":"Uniter","Request":"PrivateAddress"
   4214 "Type":"Uniter","Request":"ProviderType"
   4214 "Type":"Uniter","Request":"PublicAddress"
   4214 "Type":"Uniter","Request":"SetPrivateAddress"
   4214 "Type":"Uniter","Request":"SetPublicAddress"
   4214 "Type":"Uniter","Request":"SetStatus"
   4214 "Type":"Uniter","Request":"WatchConfigSettings"
   4214 "Type":"Uniter","Request":"WatchServiceRelations"
   4228 "Type":"NotifyWatcher","Id":"5","Request":"Next"
   4229 "Type":"Agent","Request":"GetEntities"
   4229 "Type":"Logger","Request":"WatchLoggingConfig"
   4229 "Type":"NotifyWatcher","Id":"2","Request":"Next"
   4229 "Type":"Upgrader","Request":"DesiredVersion"
   4229 "Type":"Upgrader","Request":"SetTools"
   4229 "Type":"Upgrader","Request":"WatchAPIVersion"
   4230 "Type":"Admin","Request":"Login"
   4940 "Type":"NotifyWatcher","Id":"4","Request":"Next"
   7715 "Type":"NotifyWatcher","Id":"3","Request":"Next"
   8428 "Type":"Deployer","Request":"Life"
   8428 "Type":"Uniter","Request":"Watch"
   8458 "Type":"Logger","Request":"LoggingConfig"
  12638 "Type":"Uniter","Request":"Resolved"
  12642 "Type":"Uniter","Request":"CharmURL"
  37922 "Type":"Uniter","Request":"Life"

Things look pretty sane there.

When adding another 3000 agents, we get:
  1490 "Type":"Machiner","Request":"Life"
  2799 "Type":"StringsWatcher","Id":"8","Request":"Next"
  2800 "Type":"Agent","Request":"GetEntities"
  2800 "Type":"Agent","Request":"SetPasswords"
  2800 "Type":"Deployer","Request":"APIAddresses"
  2800 "Type":"Deployer","Request":"Life"
  2800 "Type":"Deployer","Request":"SetPasswords"
  2800 "Type":"Deployer","Request":"StateAddresses"
  2800 "Type":"Logger","Request":"WatchLoggingConfig"
  2800 "Type":"StringsWatcher","Id":"6","Request":"Stop"
  2800 "Type":"Uniter","Request":"CharmArchiveSha256"
  2800 "Type":"Uniter","Request":"SetCharmURL"
  2800 "Type":"Uniter","Request":"WatchConfigSettings"
  2800 "Type":"Upgrader","Request":"DesiredVersion"
  2800 "Type":"Upgrader","Request":"SetTools"
  2800 "Type":"Upgrader","Request":"WatchAPIVersion"
  2801 "Type":"NotifyWatcher","Id":"7","Request":"Next"
  2801 "Type":"Uniter","Request":"CharmArchiveURL"
  2801 "Type":"Uniter","Request":"CurrentEnvironUUID"
  2801 "Type":"Uniter","Request":"ProviderType"
  2801 "Type":"Uniter","Request":"SetPrivateAddress"
  2801 "Type":"Uniter","Request":"SetPublicAddress"
  2803 "Type":"NotifyWatcher","Id":"2","Request":"Next"
  2804 "Type":"NotifyWatcher","Id":"3","Request":"Next"
  2815 "Type":"Admin","Request":"Login"
  2955 "Type":"StringsWatcher","Id":"6","Request":"Next"
  5600 "Type":"Logger","Request":"LoggingConfig"
  5600 "Type":"Uniter","Request":"SetStatus"
  5601 "Type":"Uniter","Request":"WatchServiceRelations"
  5602 "Type":"Uniter","Request":"Watch"
  8400 "Type":"Uniter","Request":"APIAddresses"
  8400 "Type":"Uniter","Request":"PrivateAddress"
  8400 "Type":"Uniter","Request":"PublicAddress"
  8687 "Type":"Uniter","Request":"Resolved"
210506 "Type":"NotifyWatcher","Id":"5","Request":"Next"
317213 "Type":"NotifyWatcher","Id":"4","Request":"Next"
528756 "Type":"Uniter","Request":"CharmURL"
545845 "Type":"Uniter","Request":"Life"

So if we could solve the CharmURL issue (http://pad.lv/1245649), then
we'd be looking pretty good on all the other requests.

There is still the issue of what the process was doing when it hit 0%
CPU. I uploaded the traceback (xz compressed) to Ubuntu One, and it
should be here:
* sigquit.log.xz (2.7 MB) hosted on Ubuntu One:
http://ubuntuone.com/6vKrM07H7BTHUa9Xqvy16V

Looking at that traceback, nothing is in [running], which is probably
why it is at 0% CPU :).
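(For anyone reading such a dump: the Go runtime prints each goroutine's
state in brackets in the stack header, e.g. [running], [chan receive],
[IO wait], so the states can be tallied with grep. The sample dump below
is fabricated just to show the shape.)

```shell
# Fabricated fragment of a Go SIGQUIT goroutine dump.
cat > /tmp/sigquit-sample.log <<'EOF'
goroutine 1 [chan receive]:
main.main()
goroutine 7 [IO wait]:
net.runtime_pollWait(...)
goroutine 9 [chan receive]:
main.worker()
EOF

# Tally goroutine states; a busy-but-healthy process would show some
# [running] entries, which the jammed jujud did not.
grep -oE '^goroutine [0-9]+ \[[^]]+\]' /tmp/sigquit-sample.log \
    | sed 's/.*\[/[/' | sort | uniq -c | sort -rn
```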

John
=:->


More information about the Juju-dev mailing list