Enabling GOMAXPROCS for jujud

David Cheney david.cheney at canonical.com
Tue Oct 29 06:25:32 UTC 2013


Not LGTM. The number of CPU cores available on the default EC2
bootstrap machine is 1.
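
For context, a minimal standalone check (illustrative only, standard
library calls only) showing why the proposed call is effectively a no-op
on a single-core instance:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // On the default single-core bootstrap instance this typically
    // prints "1 1" (assuming GOMAXPROCS is not set in the environment):
    // NumCPU() reports one core, and GOMAXPROCS(0) reports the current
    // setting without changing it, so GOMAXPROCS(NumCPU()) would leave
    // things exactly as they are on such a machine.
    fmt.Println(runtime.NumCPU(), runtime.GOMAXPROCS(0))
}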

On Tue, Oct 29, 2013 at 5:07 PM, John Arbash Meinel
<john at arbash-meinel.com> wrote:
>
> Do we want to enable multiprocessing for Jujud? I have some evidence
> that it would actually help things.
>
> I'm soliciting feedback about this patch:
> === modified file 'cmd/jujud/main.go'
> --- cmd/jujud/main.go 2013-09-13 14:48:13 +0000
> +++ cmd/jujud/main.go   2013-10-28 17:47:52 +0000
> @@ -8,6 +8,7 @@
>         "net/rpc"
>         "os"
>         "path/filepath"
> +       "runtime"
>
>         "launchpad.net/juju-core/cmd"
>         "launchpad.net/juju-core/worker/uniter/jujuc"
> @@ -107,6 +108,7 @@
>  func Main(args []string) {
>         var code int = 1
>         var err error
> +       runtime.GOMAXPROCS(runtime.NumCPU())
>         commandName := filepath.Base(args[0])
>         if commandName == "jujud" {
>                 code, err = jujuDMain(args)
>
> I'm not sure exactly how we want to spell it, but this *does* help
> when scaling up jujud on machine-0.
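
One possible spelling, sketched below as a standalone program rather
than as part of the patch: it assumes we want an operator-set GOMAXPROCS
environment variable to keep precedence over the automatic value, since
the Go runtime already honours that variable at startup and an
unconditional runtime.GOMAXPROCS(runtime.NumCPU()) would silently
override it.

package main

import (
    "fmt"
    "os"
    "runtime"
)

func main() {
    // Only raise GOMAXPROCS automatically when the operator has not
    // set it explicitly in the environment.
    if os.Getenv("GOMAXPROCS") == "" {
        runtime.GOMAXPROCS(runtime.NumCPU())
    }
    fmt.Println("GOMAXPROCS is now", runtime.GOMAXPROCS(0))
}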
>
> While doing my "create 5000 connections and then restart jujud" test,
> it turns out that the time it takes to get back to a sane state is
> actually CPU limited, and jujud is capable of using all 4 cores on my VM.
>
> I can see that we might only want this on state server nodes, because
> on other machines the agents might be competing for resources, and we
> want to make sure the agents aren't saturating the machine.
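
A rough sketch of that gating, purely illustrative: maybeUseMultipleCPUs
and isStateServer are made-up names standing in for however the agent
would learn that it is running the state server jobs, nothing in the
current tree.

package main

import "runtime"

// maybeUseMultipleCPUs raises GOMAXPROCS only for the state server's
// machine agent and leaves other machine and unit agents at the default.
func maybeUseMultipleCPUs(isStateServer bool) {
    if isStateServer {
        runtime.GOMAXPROCS(runtime.NumCPU())
    }
}

func main() {
    // A unit-agent machine would pass false and keep the default.
    maybeUseMultipleCPUs(true)
    println(runtime.GOMAXPROCS(0))
}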
>
> FWIW, I re-ran the test I was doing in Burlingame with the root
> machine being an m1.xlarge and the above patch applied, and it doesn't
> get 'hung' the way the m1.small did.
>
> With 6000 Units of Ubuntu-1 running, I did "restart jujud-machine-0"
> and it took 23 minutes before the log went quiet again. During this
> time, jujud generated 1.6M lines of log file (285MB). I have a
> machine-0.log.gz but it is 160MB compressed (2.4GB uncompressed).
>
> So my current guess about my m1.small test is that we just saturated
> the 1 CPU that the system had to work with, which left no cycles for
> mongodb to actually answer the requests that were coming in.
>
> We do end up at 2.2GB with 5429 active connections (machine-2's
> machine agent was down for a long time and I couldn't even ssh into
> the machine [the terminal would just hang]; it did come back after
> another 30 minutes or so, but then it just spun indefinitely because
> there was a corrupt file in the .git checkout:
>
> error: object file
> .git/objects/53/94dcc08c1ae1519b87bc994640e9f6c5c7295c is empty
> fatal: loose object 5394dcc08c1ae1519b87bc994640e9f6c5c7295c (stored
> in .git/objects/53/94dcc08c1ae1519b87bc994640e9f6c5c7295c) is corrupt
>
> And it was using 7GB+ on disk, and there were a *lot* of
> /var/log/juju/tools/unpacking-* directories.
>
> I'm curious what the story is when you have a machine that is just
> broken, and how to bring it back to life, though I don't think having
> 800 units on one machine is a standard use case for us :).
>
> Anyway, I feel a bit better that my scale testing was only really
> failing because we were on an m1.small.
>
> John
> =:->
>


