[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Jason Hobbs jason.hobbs at canonical.com
Mon Feb 5 21:30:12 UTC 2018


The packet dump (comment #35) shows MAAS failing to respond to grub's
request for the MAC-specific grub.cfg before grub times out, and then
responding immediately to the request for the generic-amd64 grub cfg.
That clearly shows a race condition in MAAS.

MAAS's design of dynamically generating the interface-specific grub
config only after it receives the TFTP request for it is susceptible to
a race condition where grub times out before MAAS can respond.
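
To make the race concrete, here is a toy model, not MAAS code. The
function names, the MAC address, and the delays are all made up for
illustration. The "server" only starts generating the config when the
request arrives, and the "client" gives up after a fixed timeout, so
whether the fetch succeeds depends purely on which of the two is faster:

```python
import concurrent.futures
import time


def generate_config(mac, delay):
    """Stand-in for on-demand generation; `delay` models a slow backend."""
    time.sleep(delay)
    return f"# grub.cfg for {mac}\n"


def fetch_with_timeout(mac, server_delay, client_timeout):
    """Return the config, or None if the client gives up first.

    Generation starts only when the request is made, which is the
    property that makes the outcome depend on timing.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(generate_config, mac, server_delay)
        try:
            return future.result(timeout=client_timeout)
        except concurrent.futures.TimeoutError:
            # The client has already moved on; a late answer is useless.
            return None


if __name__ == "__main__":
    mac = "52-54-00-12-34-56"  # made-up address
    print(fetch_with_timeout(mac, server_delay=0.3, client_timeout=0.05))
    print(fetch_with_timeout(mac, server_delay=0.0, client_timeout=1.0))
```

The same request succeeds or fails depending only on how long the
server takes, which is what the packet dump shows happening in
practice.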

That design is not the only possible design.  All the information
required for the interface specific grub.cfg is available before the
machine ever powers on, and could be made available on the rack
controllers at that time too.

Doing so would eliminate that race condition, or at least reduce the
opportunity greatly, as we see MAAS has no problems immediately
responding and serving files that it doesn't need to dynamically
generate at request time.
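
A minimal sketch of what I mean, assuming the rack controller can be
handed the per-interface data ahead of time. Everything here is
hypothetical: the file naming, the render function, and the config
contents are placeholders, not what MAAS actually does. The point is
only that the expensive step happens before power on, and the request
path is reduced to reading a static file:

```python
import os
import tempfile


def render_grub_cfg(mac, kernel_params):
    """Hypothetical renderer; the real template is not part of this report."""
    return (
        f"# grub.cfg for {mac}\n"
        "set default=0\n"
        f"# kernel parameters: {kernel_params}\n"
    )


def cfg_filename(mac):
    # Assumed naming scheme, for illustration only.
    return "grub.cfg-" + mac.lower()


def pregenerate(tftp_root, machines):
    """Write one config per MAC before any machine is powered on."""
    for mac, params in machines.items():
        final_path = os.path.join(tftp_root, cfg_filename(mac))
        # Write to a temp file and rename, so a concurrent reader never
        # sees a half-written config.
        fd, tmp_path = tempfile.mkstemp(dir=tftp_root)
        with os.fdopen(fd, "w") as f:
            f.write(render_grub_cfg(mac, params))
        os.replace(tmp_path, final_path)


def serve(tftp_root, mac):
    """Request-time path: a plain file read, no generation."""
    try:
        with open(os.path.join(tftp_root, cfg_filename(mac))) as f:
            return f.read()
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        pregenerate(root, {"52-54-00-12-34-56": "console=ttyS0"})
        print(serve(root, "52-54-00-12-34-56"))
```

With this shape, a slow database or busy region controller delays the
pre-generation step rather than the TFTP response, so grub's timeout is
no longer racing against config generation.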

There is still some question around what in the environment is
contributing to MAAS not responding faster, and what MAAS is doing while
it takes 60+ seconds to respond to the request, but that doesn't change
the fact that the current MAAS design is racy (and that's a bug).

Whatever we change in the environment to reduce the likelihood of
hitting this issue there does not solve the underlying race condition
in MAAS, and it leaves open the possibility of hitting the issue in
other places too.

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

Status in MAAS:
  New
Status in grub2 package in Ubuntu:
  In Progress

Bug description:
  A node failed to deploy after it failed to retrieve a grub.cfg from
  MAAS due to a timeout.  In the logs, it's clear that the server tried
  to retrieve the grub cfg many times, over about 30 seconds:

  http://paste.ubuntu.com/26387256/

  We see the same thing for other hosts around the same time:

  http://paste.ubuntu.com/26387262/

  It seems like MAAS is taking way too long to respond to these
  requests.

  This is very similar to bug 1724677, which was happening
  pre-meltdown/spectre. The only difference is we don't see "[critical]
  TFTP back-end failed" in the logs anymore.

  I connected to the console on this system and it had errors about
  timing out retrieving the grub cfg, then it had an error message along
  the lines of "error not an ip" and then "double free".  After I
  connected but before I could get a screenshot, the system rebooted and
  was directed by MAAS to power off, which it did successfully after
  booting to Linux.

  Full logs are available here:
  https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar

  This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions


