[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Jason Hobbs jason.hobbs at canonical.com
Tue Feb 6 16:39:25 UTC 2018


Andres,

I ran the test with VMs limited to 9 of 20 cores (cut the core limit
in half for VMs).  The first time range from this dump is with the
cores at their normal limit (18).

As you can see, the behavior didn't change much from one set to the
other.  Both sets had instances where grub started doing retries,
although in neither case did it take very long.

http://paste.ubuntu.com/26530737/

So it seems that changing the CPU limits for the VMs doesn't change
the results drastically, which lines up with the data showing CPU
utilization never gets over 50%.
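
For anyone who wants to repeat the comparison, here is a rough sketch of
how retries could be counted for each time range. The log line format
(ISO timestamp, client IP, requested file) and the names `SAMPLE` and
`count_retries` are made up for illustration; this is not the real MAAS
log format and the sample lines are not taken from the paste above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines: "<ISO timestamp> <client ip> <requested file>".
# This is NOT the real MAAS log format, only an illustration.
SAMPLE = """\
2018-02-06T15:00:01 192.0.2.5 grub.cfg
2018-02-06T15:00:03 192.0.2.5 grub.cfg
2018-02-06T15:00:07 192.0.2.5 grub.cfg
2018-02-06T15:00:02 192.0.2.6 grub.cfg
2018-02-06T16:00:01 192.0.2.5 grub.cfg
"""


def count_retries(log_text, start, end, filename="grub.cfg"):
    """Return {client: retries} for requests of `filename` in [start, end).

    A retry is any request from a client beyond its first one in the range.
    """
    requests = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        stamp, client, requested = parts
        when = datetime.fromisoformat(stamp)
        if requested == filename and start <= when < end:
            requests[client] += 1
    # Only report clients that asked more than once.
    return {client: n - 1 for client, n in requests.items() if n > 1}
```

Calling `count_retries` once per time range (normal core limit, halved
core limit) gives two dictionaries that can be compared side by side.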

Jason


On Mon, Feb 5, 2018 at 10:19 PM, Andres Rodriguez
<andreserl at ubuntu-pe.org> wrote:
>>
>>
>> > That being said, because CPU load doesn't show high we are making the
>> > *assumption* that it is not impacting MAAS, but again, this is an
>> > assumption. Making the requested change to have at least 4 CPUs
>> > (ideally 6) would allow us to determine what the effects are and see
>> > whether there's any difference in behavior, and would help identify
>> > any other issues.
>> >
>> > Without having the comparison then we are making it more difficult to
>> > isolate the problem.
>>
>> To improve performance the typical pattern is 1) identify the
>> bottleneck 2) eliminate that as the bottleneck 3) repeat.
>>
>> We have not identified CPU as a bottleneck.  The top data says it is
>> not!
>>
>
> Jason,
>
> That doesn't change the fact that we are requesting tests to be run with
> different CPU configurations for the VMs, so we can make a *comparison* and
> see if there is any material difference or none at all with the current
> conditions. While I agree with you that the data /seems/ to show that there
> is no issue with CPU, that doesn't change the fact that we don't have any
> data to compare with, as there could still be an impact even if it is
> minimal.
>
> Without the data, we cannot assert with certainty that there's no issue
> caused by CPU usage because we don't have a reference or point of
> comparison. So while all fingers seem to be pointing to storage, I strongly
> believe it is worth gathering the data now so we can fully discard CPU as
> a cause.
>
> If this is something that your environment is unable to do, I would
> appreciate that you clarify that instead of asserting that there's no
> performance impact in MAAS due to CPU usage, when we don't really know for
> sure (e.g. we don't know if MAAS behaves differently with less CPU usage in
> the current conditions, and that's data worth gathering to be able to
> better support you in the future).
>
> --
> Andres Rodriguez (RoAkSoAx)
> Ubuntu Server Developer
> MSc. Telecom & Networking
> Systems Engineer
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
>   Failed Deployment after timeout trying to retrieve grub cfg
>
> Status in MAAS:
>   New
> Status in grub2 package in Ubuntu:
>   In Progress
>
> Bug description:
>   A node failed to deploy after it failed to retrieve a grub.cfg from
>   MAAS due to a timeout.  In the logs, it's clear that the server tried
>   to retrieve the grub cfg many times, over about 30 seconds:
>
>   http://paste.ubuntu.com/26387256/
>
>   We see the same thing for other hosts around the same time:
>
>   http://paste.ubuntu.com/26387262/
>
>   It seems like MAAS is taking way too long to respond to these
>   requests.
>
>   This is very similar to bug 1724677, which was happening
>   pre-meltdown/spectre. The only difference is we don't see "[critical] TFTP
>   back-end failed" in the logs anymore.
>
>   I connected to the console on this system and it had errors about
>   timing out retrieving the grub.cfg, then it had an error message along
>   the lines of "error not an ip" and then "double free".  After I
>   connected but before I could get a screenshot the system rebooted and
>   was directed by MAAS to power off, which it did successfully after
>   booting to Linux.
>
>   Full logs are available here:
>   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>
>   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions


