[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Thu Feb 1 18:15:31 UTC 2018

@Jason,

Packet 90573 doesn't look to me like an indication of what you are
describing. What I see is this:

1. grub makes ~30 requests for its PXE config, grub.cfg-<mac>, after which it gives up because it didn't receive a response.
2. grub moves on and requests grub.cfg-default-amd64, and it receives a response from MAAS.

Now, the difference between the two is that 1 does *database*
lookups, while 2 does not. In other words, 1 causes a request to obtain
the 'node' object based on the MAC provided, and if grub is making 30+
requests, then this can definitely flood the db with requests.
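
To make the difference concrete, here is a rough sketch of how the
lookups add up. This is not MAAS code: the function names, the example
MAC and the count of four machines are made up for illustration, and the
30 retries are just the approximate figure from the capture.

```python
# Hypothetical model of the two request paths seen in the capture.
# None of these names come from MAAS; they only illustrate the counting.

def needs_db_lookup(filename):
    """Per-MAC config requests resolve a node by MAC, so they hit the db;
    the generic default config does not."""
    return (filename.startswith("grub.cfg-")
            and not filename.startswith("grub.cfg-default-"))


def db_lookups(requests):
    """Count how many requests in a sequence would trigger a node lookup."""
    return sum(1 for name in requests if needs_db_lookup(name))


# One machine: ~30 retries on the per-MAC file, then the default fallback.
# The MAC below is an invented example.
one_machine = ["grub.cfg-52:54:00:12:34:56"] * 30 + ["grub.cfg-default-amd64"]

print(db_lookups(one_machine))      # lookups caused by a single machine
print(db_lookups(one_machine) * 4)  # assumed: four machines booting at once
```

So every machine that retries like this adds another ~30 lookups on top
of whatever else the db is already doing.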

That said, based on my understanding of how your environment is
configured, you have 3 other VMs on the system PXE booting from MAAS,
plus other machines doing the same at the same time. Each VM is assigned
8 CPUs on a system that has 20 CPUs, so the VMs alone account for 24
CPUs; in other words, you are over-committing CPU. Combine that with the
other machines PXE booting off MAAS at the same time and the performance
implications of the recent kernels, and it does seem to me that all of
these things could be impacting MAAS through resource contention, when
we already know postgresql is running with degraded performance due to
the newer kernels.

That said, did you disable the spectre features and reboot your machine?
Did you test this by NOT running VMs on the same system as MAAS, or at
least by reducing the number of cores each VM has access to? (Since there
are 3 VMs with 8 cores each, that means 24 cores on a 20 core system.)
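
For reference, the over-commit arithmetic as a small sketch. The function
name is made up; the numbers are the ones from your setup as I understand
it.

```python
def overcommit_ratio(vm_count, vcpus_per_vm, host_cpus):
    """Ratio of vCPUs handed out to VMs versus CPUs on the host.
    Anything above 1.0 means the host is over-committed."""
    return (vm_count * vcpus_per_vm) / host_cpus


# 3 VMs with 8 cores each on a 20 core system.
ratio = overcommit_ratio(3, 8, 20)
print("vCPUs assigned:", 3 * 8)
print("over-commit ratio: %.2f" % ratio)
```

Note this counts the VMs only; MAAS and postgresql on the same host need
CPU time on top of that.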

Also, what does the CPU load look like at the time of failure?
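
If nothing is recording load on that host, something as simple as the
sketch below would do. It is only a suggestion: the helper name and the
"load above CPU count" threshold are my own choices, not anything MAAS
provides.

```python
import os


def is_saturated(load_1min, cpu_count):
    """Treat the host as saturated when the 1-minute load average
    exceeds the number of CPUs (an assumed rule of thumb)."""
    return load_1min > cpu_count


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    load_1min, load_5min, load_15min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print("load averages:", load_1min, load_5min, load_15min)
    print("cpus:", cpus, "saturated:", is_saturated(load_1min, cpus))
```

Running that in a loop on the MAAS host during a deployment would show
whether the timeouts line up with CPU saturation.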

** Changed in: maas
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

Status in MAAS:
  Incomplete
Status in grub2 package in Ubuntu:
  New

Bug description:
  A node failed to deploy after it failed to retrieve a grub.cfg from
  MAAS due to a timeout.  In the logs, it's clear that the server tried
  to retrieve the grub cfg many times, over about 30 seconds:

  http://paste.ubuntu.com/26387256/

  We see the same thing for other hosts around the same time:

  http://paste.ubuntu.com/26387262/

  It seems like MAAS is taking way too long to respond to these
  requests.

  This is very similar to bug 1724677, which was happening
  pre-meltdown/spectre. The only difference is we don't see "[critical]
  TFTP back-end failed" in the logs anymore.

  I connected to the console on this system and it had errors about
  timing out retrieving the grub-cfg, then it had an error message along
  the lines of "error not an ip" and then "double free".  After I
  connected but before I could get a screenshot the system rebooted and
  was directed by maas to power off, which it did successfully after
  booting to linux.

  Full logs are available here:
  https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar

  This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions