[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Andres Rodriguez
andreserl at ubuntu-pe.org
Mon Feb 5 21:45:10 UTC 2018
@Jason,
On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs <jason.hobbs at canonical.com>
wrote:
> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
> <andreserl at ubuntu-pe.org> wrote:
> > No new data was provided to mark this New in MAAS:
> >
> > 1. Changes to the storage seem to have improved things
>
> Yes, it has. That doesn't change whether or not there is a bug in
> MAAS. Can you please address the critical log errors that I mentioned
> in comment #36? This seems like enough to establish something is
> going wrong in MAAS.
>
There is no evidence that the tftp issue is causing any boot failures. We
have seen this issue before and confirmed that it doesn't cause boot
issues; see [1]. If you want to try the fix, it is available in
ppa:maas/proposed.
[1] https://bugs.launchpad.net/maas/+bug/1376483
As for the postgresql logs with "maas at maasdb ERROR: could not serialize
access due to concurrent update", that is *not* a bug in MAAS or an issue.
Those are perfectly normal messages for the isolation level the MAAS DB is
running with. They basically mean that one transaction tried to update the
db while another was updating it, and MAAS already handles this by
doing retries.
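For illustration only, the retry pattern described above can be sketched as follows. This is not MAAS's actual code: `SerializationFailure` and `run_with_retries` are hypothetical names, with the exception standing in for whatever error the database driver raises for PostgreSQL's serialization failure (SQLSTATE 40001).

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver error raised on SQLSTATE 40001."""


def run_with_retries(txn, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run txn(); if it hits a serialization failure, back off and rerun it."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == attempts:
                raise  # the conflict did not clear within the allowed attempts
            # Jittered exponential backoff so competing writers spread out.
            sleep(base_delay * (2 ** (attempt - 1)) * random.random())
```

With a pattern like this, each failed attempt still produces an ERROR line in the postgresql log even though the caller eventually succeeds, which would be consistent with the messages being noise rather than a failure.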
> > 2. No tests have been run with fixed grub that have caused boot
> > failures.
>
> The comments from #56 were testing with the fixed grub - sorry if that
> wasn't clear.
>
> > 3. AFAIK, the VM config has not changed to use less CPU to compare
> > results and whether this config change causes the bugs in question.
>
> The CPU load data from comments #48 and #50 shows that CPU load is not
> the problem. The max load average was under 12 on a 20 thread system.
> That means there was lots of free CPU time, and that this workload is
> not CPU bound.
>
CPU load is not CPU utilization. We know that at the time there were 6 other
VMs with 150%+ CPU usage writing to the disk because they were being
deployed and/or configured (e.g. software installation). Correct me if I'm
wrong, but this can cause whatever is writing to disk to be prioritized
over anything else, such as the MAAS processes' access to resources.
That being said, because CPU load doesn't show high we are making the
*assumption* that it is not impacting MAAS, but again, this is an
assumption. Making the requested change to have at least 4 CPUs (ideally
6) would allow us to determine what the effects are and see whether
there's any difference in behavior, and would help identify any other
issues.
Without that comparison, we are making it more difficult to
isolate the problem.
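As a rough illustration of the load-versus-utilization distinction: on Linux, the load average counts tasks waiting on I/O in uninterruptible sleep as well as runnable tasks, so it does not map directly to CPU utilization. The sketch below computes the busy and iowait fractions from two samples of the aggregate "cpu" line in /proc/stat. The function names are hypothetical, and it assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import os


def parse_cpu_line(line):
    """Return (busy, iowait, total) jiffies from the aggregate 'cpu' line."""
    fields = [int(x) for x in line.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal [guest ...]
    idle, iowait = fields[3], fields[4]
    total = sum(fields[:8])  # guest time is already included in user
    busy = total - idle - iowait
    return busy, iowait, total


def utilization(before, after):
    """Fractions of time spent busy and waiting on I/O between two samples."""
    b0, w0, t0 = parse_cpu_line(before)
    b1, w1, t1 = parse_cpu_line(after)
    dt = t1 - t0
    if dt <= 0:
        return 0.0, 0.0
    return (b1 - b0) / dt, (w1 - w0) / dt


def read_cpu_line(path="/proc/stat"):
    """Read the first (aggregate) line of /proc/stat."""
    with open(path) as f:
        return f.readline()
```

Taking two `read_cpu_line()` samples a few seconds apart and comparing the result of `utilization()` with `os.getloadavg()` on the MAAS host during a deployment would show whether time is going to CPU work or to waiting on the disk.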
>
> Jason
>
>
> ** Changed in: maas
> Status: Incomplete => New
>
> --
> You received this bug notification because you are subscribed to MAAS.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
> Failed Deployment after timeout trying to retrieve grub cfg
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>
> Launchpad-Notification-Type: bug
> Launchpad-Bug: product=maas; milestone=2.4.x; status=New;
> importance=Undecided; assignee=None;
> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
> status=In Progress; importance=Medium; assignee=mathieu.tl at gmail.com;
> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
> Launchpad-Bug-Information-Type: Public
> Launchpad-Bug-Private: no
> Launchpad-Bug-Security-Vulnerability: no
> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan jason-hobbs vorlon
> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
> Launchpad-Bug-Modifier: Jason Hobbs (jason-hobbs)
> Launchpad-Message-Rationale: Subscriber (MAAS)
> Launchpad-Message-For: andreserl
>
--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249
Title:
Failed Deployment after timeout trying to retrieve grub cfg
Status in MAAS:
New
Status in grub2 package in Ubuntu:
In Progress
Bug description:
A node failed to deploy after it failed to retrieve a grub.cfg from
MAAS due to a timeout. In the logs, it's clear that the server tried
to retrieve the grub cfg many times, over about 30 seconds:
http://paste.ubuntu.com/26387256/
We see the same thing for other hosts around the same time:
http://paste.ubuntu.com/26387262/
It seems like MAAS is taking way too long to respond to these
requests.
This is very similar to bug 1724677, which was happening
pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
back-end failed" in the logs anymore.
I connected to the console on this system and it had errors about
timing out retrieving the grub cfg, then it had an error message along
the lines of "error not an ip" and then "double free". After I
connected, but before I could get a screenshot, the system rebooted and
was directed by MAAS to power off, which it did successfully after
booting to Linux.
Full logs are available here:
https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions