[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Andres Rodriguez
andreserl at ubuntu-pe.org
Thu Feb 1 20:24:48 UTC 2018
@Steve,
On Thu, Feb 1, 2018 at 1:49 PM, Steve Langasek <steve.langasek at canonical.com> wrote:
> On Thu, Feb 01, 2018 at 06:15:31PM -0000, Andres Rodriguez wrote:
> > @Jason,
>
> > Packet 90573 doesn't seem to me as an indication of what you are
> > describing. What I see is this:
>
> > 1. grub makes ~30 requests for PXE config on grub.cfg-<mac>, after which
> > it gives up because it didn't receive a response.
> > 2. grub moves on and requests grub.cfg-default-amd64, and it receives a
> > response from MAAS.
>
> > Now, the difference between the above, is that 1 does *database*
> > lookups, while 2 does not. In other words, 1 causes a request to obtain
> > the 'node' object based on the MAC to provide, and if grub is making 30+
> > requests, then this can definitely flood the db with requests.
>
> Then as I've said on IRC, this is a bug in maas, because 30 udp retries
> should not generate 30 requests to the database.
>
> GRUB is *not* wrong to retransmit its udp packets when it doesn't get a
> response. If each of these increases the load in MAAS, then MAAS should be
> fixed.
> The case where GRUB retrieves the same file multiple times is a GRUB bug,
> but I don't see any evidence linking this GRUB bug to the timeout and
> fallback problem in Jason's latest trace.
I agree with you if we are only considering this 1 system.
Let's not forget that we have other systems booting at around the same
time, each of which may be making at least 4 requests (for those grub
systems) that may or may not be answered immediately after each request.
But if requests are still being served while more requests come in, I do
see how the repeated requests could be causing the degraded performance.
Especially now that we've learned that multiple VMs share the same host,
together consuming 18 CPUs on a 20-CPU system, while MAAS alone runs
5 processes, for each of which we typically recommend a dedicated CPU.
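To illustrate the point under discussion (that retransmitted requests need not each trigger a database lookup), here is a minimal sketch of in-flight request coalescing. It is not MAAS code. The class name, the `slow_lookup` stand-in for the MAC-to-node query, the MAC address, and the 50 ms latency are all hypothetical, chosen only for the example.

```python
import asyncio


class InFlightCoalescer:
    """Share one pending backend lookup among identical concurrent requests.

    Hypothetical helper for illustration only; `lookup` stands in for
    whatever query resolves a MAC address to a node.
    """

    def __init__(self, lookup):
        self._lookup = lookup      # async callable: key -> value
        self._pending = {}         # key -> task for a lookup still running
        self.backend_calls = 0     # how many times the backend was really hit

    async def get(self, key):
        task = self._pending.get(key)
        if task is None:
            # First request for this key: start exactly one backend lookup.
            self.backend_calls += 1
            task = asyncio.ensure_future(self._lookup(key))
            self._pending[key] = task
            # Forget the task once it finishes so later requests look again.
            task.add_done_callback(lambda _t: self._pending.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared task.
        return await asyncio.shield(task)


async def demo():
    async def slow_lookup(mac):
        await asyncio.sleep(0.05)  # pretend database latency (assumed value)
        return {"mac": mac}

    coalescer = InFlightCoalescer(slow_lookup)
    # Simulate ~30 retransmits of the same request arriving while the
    # first lookup is still pending.
    results = await asyncio.gather(
        *(coalescer.get("aa:bb:cc:dd:ee:ff") for _ in range(30))
    )
    return coalescer.backend_calls, len(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch, the 30 simulated retries are all answered from a single backend lookup. Whether something similar fits the MAAS TFTP path is a separate question; this only shows the general technique.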
> --
> You received this bug notification because you are subscribed to MAAS.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
> Failed Deployment after timeout trying to retrieve grub cfg
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>
> Launchpad-Notification-Type: bug
> Launchpad-Bug: product=maas; milestone=2.4.x; status=Incomplete;
> importance=Undecided; assignee=None;
> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
> status=New; importance=Undecided; assignee=None;
> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine
> Launchpad-Bug-Information-Type: Public
> Launchpad-Bug-Private: no
> Launchpad-Bug-Security-Vulnerability: no
> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan jason-hobbs vorlon
> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
> Launchpad-Bug-Modifier: Steve Langasek (vorlon)
> Launchpad-Message-Rationale: Subscriber (MAAS)
> Launchpad-Message-For: andreserl
>
--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249
Title:
Failed Deployment after timeout trying to retrieve grub cfg
Status in MAAS:
Incomplete
Status in grub2 package in Ubuntu:
New
Bug description:
A node failed to deploy after it failed to retrieve a grub.cfg from
MAAS due to a timeout. In the logs, it's clear that the server tried
to retrieve the grub cfg many times, over about 30 seconds:
http://paste.ubuntu.com/26387256/
We see the same thing for other hosts around the same time:
http://paste.ubuntu.com/26387262/
It seems like MAAS is taking way too long to respond to these
requests.
This is very similar to bug 1724677, which was happening pre-
Meltdown/Spectre. The only difference is that we no longer see
"[critical] TFTP back-end failed" in the logs.
I connected to the console on this system and it showed errors about
timing out retrieving the grub cfg, followed by an error message along
the lines of "error not an ip" and then "double free". After I
connected, but before I could get a screenshot, the system rebooted and
was directed by MAAS to power off, which it did successfully after
booting to Linux.
Full logs are available here:
https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.