[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Mon Feb 5 22:15:19 UTC 2018

On Mon, Feb 05, 2018 at 09:27:15PM -0000, Andres Rodriguez wrote:
> MAAS already has a mechanism to collapse retries into the initial request.

Are we certain that this is working correctly?  If so, why are packet
captures showing that MAAS is sending stacked tftp OACK responses, 1:1 for
the duplicate incoming requests?

It's clear to me that MAAS's handling at the wire level is incorrect - 10
retries of the same tftp request should result in a single OACK, not 10 of
them (unless MAAS receives a retry *after* it has sent its OACK).  I don't
know if that also means that MAAS is inefficiently translating these into
database requests on the backend.  It had been suggested in this bug log and
on IRC that MAAS *was* sending duplicate db requests for each of these
packets; OTOH the timing of the stacked responses shows no latency in
between them that would imply additional db round-trips.
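
To make the expected wire behaviour concrete, here is a minimal sketch of
what collapsing retries should look like.  This is not MAAS's actual code;
the event model and the names (count_oacks, "rrq", "answered") are
hypothetical.  It also assumes that a client's retries arrive from the same
address and source port, so that (client, filename) identifies one
outstanding request.

```python
def count_oacks(events):
    """Count the OACKs a server should send for a stream of events.

    Each event is a (kind, client, filename) tuple, where kind is either
    "rrq" (a read request arrived) or "answered" (the backend lookup for
    that request finished).  All of this is an illustrative model, not
    MAAS's real API.
    """
    pending = set()   # requests still waiting on the backend
    oacks = 0
    for kind, client, filename in events:
        key = (client, filename)
        if kind == "rrq":
            # A retry of a request that is already pending collapses into
            # the existing set entry instead of queuing another reply.
            pending.add(key)
        elif kind == "answered" and key in pending:
            # Exactly one OACK per pending request, however many
            # duplicate RRQs arrived while we were waiting.
            pending.discard(key)
            oacks += 1
    return oacks
```

With this model, 10 identical requests followed by one backend answer yield
a single OACK.  A retry that arrives *after* the OACK has gone out starts a
new pending entry and so legitimately earns a second one.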

I think someone needs to directly inspect the behavior of a running MAAS
server in this scenario to be sure.

> In this case, it is the rack that grabs the requests and makes a request to
> the region. If retries come within the time that the rack is waiting for a
> response from the region, these request get "ignored" and the Rack will
> only answer the first request.

That is absolutely contradicted by the packet captures.  The rack does not
ignore the additional requests, it answers *ALL* of the requests.  It's only
the *client* that consolidates the duplicate responses from MAAS.  (And
then, because of a grub bug higher up the stack, re-requests the same file
that it has already received.)
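
For anyone who wants to check a capture themselves, the following is a rough
sketch of tallying TFTP packet types from raw UDP payloads.  The opcode
numbers come from the TFTP RFCs (RRQ is 1, OACK is 6); the function names
are hypothetical, and extracting the payloads from the capture file is left
to whatever tool you already use.

```python
import struct
from collections import Counter

# TFTP opcodes as defined in RFC 1350, plus OACK from RFC 2347.
OPCODES = {1: "RRQ", 2: "WRQ", 3: "DATA", 4: "ACK", 5: "ERROR", 6: "OACK"}


def tftp_opcode(payload):
    """Return the TFTP packet type name for one UDP payload."""
    if len(payload) < 2:
        return "UNKNOWN"
    (code,) = struct.unpack("!H", payload[:2])  # 16-bit big-endian opcode
    return OPCODES.get(code, "UNKNOWN")


def rrq_filename(payload):
    """Extract the requested filename from an RRQ payload.

    RRQ layout: opcode (2 bytes), filename, NUL, mode, NUL, then options.
    """
    return payload[2:].split(b"\x00", 1)[0].decode("ascii", errors="replace")


def tally(payloads):
    """Count packets by type across an iterable of UDP payloads."""
    return Counter(tftp_opcode(p) for p in payloads)
```

If the rack really were ignoring retries, a burst of identical RRQs from one
client would show an RRQ count well above the OACK count; equal counts would
match the stacked, 1:1 responses described above.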

-- 
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

Status in MAAS:
  New
Status in grub2 package in Ubuntu:
  In Progress

Bug description:
  A node failed to deploy after it failed to retrieve a grub.cfg from
  MAAS due to a timeout.  In the logs, it's clear that the server tried
  to retrieve the grub cfg many times, over about 30 seconds:

  http://paste.ubuntu.com/26387256/

  We see the same thing for other hosts around the same time:

  http://paste.ubuntu.com/26387262/

  It seems like MAAS is taking way too long to respond to these
  requests.

  This is very similar to bug 1724677, which was happening
  pre-Meltdown/Spectre. The only difference is we don't see "[critical]
  TFTP back-end failed" in the logs anymore.

  I connected to the console on this system and it had errors about
  timing out retrieving the grub.cfg, followed by an error message along
  the lines of "error not an ip" and then "double free".  After I
  connected, but before I could get a screenshot, the system rebooted
  and was directed by MAAS to power off, which it did successfully after
  booting to Linux.

  Full logs are available here:
  https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar

  This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
