Enabling GOMAXPROCS for jujud
John Arbash Meinel
john at arbash-meinel.com
Tue Oct 29 06:07:31 UTC 2013
Do we want to enable multiprocessing for Jujud? I have some evidence
that it would actually help things.
I'm soliciting feedback about this patch:
=== modified file 'cmd/jujud/main.go'
--- cmd/jujud/main.go	2013-09-13 14:48:13 +0000
+++ cmd/jujud/main.go	2013-10-28 17:47:52 +0000
@@ -8,6 +8,7 @@
 	"net/rpc"
 	"os"
 	"path/filepath"
+	"runtime"

 	"launchpad.net/juju-core/cmd"
 	"launchpad.net/juju-core/worker/uniter/jujuc"
@@ -107,6 +108,7 @@
 func Main(args []string) {
 	var code int = 1
 	var err error
+	runtime.GOMAXPROCS(runtime.NumCPU())
 	commandName := filepath.Base(args[0])
 	if commandName == "jujud" {
 		code, err = jujuDMain(args)
I'm not sure exactly how we want to spell it, but this *does* help
when scaling up jujud on machine-0.
While running my "create 5000 connections and then restart jujud"
test, it turns out that the time it takes to get back to a sane state
is actually CPU limited, and jujud is capable of using all 4 cores on my VM.
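(For context: the Go runtime at the time defaults GOMAXPROCS to 1 unless
you set it explicitly, so out of the box jujud only ever runs Go code on
one core. A quick, purely illustrative way to check what a given build is
actually allowed to use:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it.
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
}

On my VM that prints NumCPU=4 but GOMAXPROCS=1 without the patch.)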
I can see that we might only want this on state server nodes, because
on other machines agents might be competing for resources and we want
to make sure the agents aren't saturating the machine.
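If we did decide to only turn this on for the state server, a minimal
sketch of the shape it could take is below. Note this is hypothetical:
enableMultiprocessing and the JUJU_IS_STATE_SERVER check are made-up
names for illustration; the real decision would have to key off the
agent's jobs, not an environment variable.

package main

import (
	"os"
	"runtime"
)

// enableMultiprocessing is a hypothetical helper: it only raises GOMAXPROCS
// on the state server, so agents on other machines keep the single-core
// default and don't compete with the workloads running there.
func enableMultiprocessing(isStateServer bool) {
	if isStateServer {
		runtime.GOMAXPROCS(runtime.NumCPU())
	}
}

func main() {
	// Purely for illustration: pretend something tells us we're machine-0.
	enableMultiprocessing(os.Getenv("JUJU_IS_STATE_SERVER") == "true")
}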
FWIW, I re-ran the test I was doing in Burlingame; with the root
machine being an m1.xlarge and the above patch applied, it doesn't get
'hung' the way the m1.small did.
With 6000 units of Ubuntu-1 running, I did "restart jujud-machine-0"
and it took 23 minutes before the log went quiet again. During this
time, jujud generated 1.6M lines of log (285MB). I have a
machine-0.log.gz but it is 160MB compressed (2.4GB uncompressed).
So my current guess about my m1.small test is that we just saturated
the 1 CPU that the system had to work with, and that wasn't giving any
cycles to mongodb to actually answer the requests that were coming in.
We do end up at 2.2GB with 5,429 active connections. (Machine-2's
machine agent was down for a long time and I couldn't even ssh into
the machine; the terminal would just hang. It did come back after
another 30 minutes or so, but then it just spun indefinitely because
there was a corrupt file in the .git checkout:
error: object file
.git/objects/53/94dcc08c1ae1519b87bc994640e9f6c5c7295c is empty
fatal: loose object 5394dcc08c1ae1519b87bc994640e9f6c5c7295c (stored
in .git/objects/53/94dcc08c1ae1519b87bc994640e9f6c5c7295c) is corrupt
And it was using 7GB+ on disk, and there were a *lot* of
/var/log/juju/tools/unpacking-* directories.
I'm curious what the story is if you have a machine that is just
broken and how to bring it back to life, though I don't think a
standard use case for us is to have 800 units on one machine :).
Anyway, I feel a bit better that my scale testing was only really
failing because we were on an m1.small.
John
=:->