[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Jason Hobbs
jason.hobbs at canonical.com
Tue Feb 6 23:09:04 UTC 2018
dm-delay looks very interesting along those lines.
https://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html
https://www.kernel.org/doc/Documentation/device-mapper/delay.txt
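
A minimal sketch of how a dm-delay setup could be scripted, assuming the
table format described in the kernel delay.txt document above
(<start> <length> delay <device> <offset> <delay_ms>). The device path
/dev/loop0, the mapping name "delayed", the sector count, and the 500 ms
delay are hypothetical placeholders, and the helper names are made up for
illustration. It only builds the command; it does not run anything:

```python
import shlex


def dm_delay_table(size_sectors: int, device: str, delay_ms: int, offset: int = 0) -> str:
    """Build a device-mapper table line for the 'delay' target.

    Layout assumed from the kernel delay.txt document:
        <start> <length> delay <device> <offset> <delay_ms>
    size_sectors is the device size in 512-byte sectors, as reported
    by `blockdev --getsz <device>`.
    """
    if size_sectors <= 0:
        raise ValueError("size_sectors must be positive")
    if delay_ms < 0:
        raise ValueError("delay_ms must not be negative")
    return f"0 {size_sectors} delay {device} {offset} {delay_ms}"


def dmsetup_create_command(name: str, size_sectors: int, device: str, delay_ms: int) -> list:
    """Return the argv for `dmsetup create <name> --table <table>`."""
    table = dm_delay_table(size_sectors, device, delay_ms)
    return ["dmsetup", "create", name, "--table", table]


if __name__ == "__main__":
    # Hypothetical values: a 1 MiB loop device delayed by 500 ms per I/O.
    argv = dmsetup_create_command("delayed", 2048, "/dev/loop0", 500)
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting command needs root and would be pointed at a scratch device,
not at a disk that holds live MAAS data. The delayed device then shows up
as /dev/mapper/<name>.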
On Tue, Feb 6, 2018 at 5:06 PM, Jason Hobbs <jason.hobbs at canonical.com> wrote:
> On Tue, Feb 6, 2018 at 4:50 PM, Andres Rodriguez
> <andreserl at ubuntu-pe.org> wrote:
>> I don't have logs anymore as I have since rebuilt my environment, but I can
>> confirm seeing improvements on a maas server running with high IO (note it
>> was a single region/rack).
>>
>> see inline:
>>
>>
>> On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs <jason.hobbs at canonical.com>
>> wrote:
>>
>>> Andres, it was a single test in both cases, and in both cases there was
>>> almost no delay from MAAS. It's not significant enough to call it
>>> positive results.
>>>
>>>
>> Comment #93 shows there are /some/ improvements when comparing those two
>> samples only, but as I have already said, we need data over time in both
>> scenarios to properly compare and determine whether the changes do make any
>> material performance improvements with the current conditions of the
>> samples (both samples are with a fixed io starvation on the environment).
>>
>>
>>> Since neither of you answered yes, I'll assume the answer was no to my
>>> question of whether there was anything in my logs or data that showed
>>> reading the template from disk on the rack controller was the culprit,
>>> and that this fix just represents a guess at what might be causing the
>>> delay.
>>>
>>
>> To be fair, your logs do not provide anything concrete to determine what's
>> the culprit of the issue on the MAAS side. It provides a lot of clues, and
>> we have since determined that those issues were a result of IO
>> starvation (from the VM's writing to disk). As such, the only way we can
>> *really* see if the patch brings any significant performance improvements
>> is to run tests in the environment where you were seeing the issues in the
>> first place.
>
> I didn't think my logs provided anything concrete! That's because the
> logging built into MAAS is not sufficient to do so.
>
> I can't break that environment to test anymore - we got it working
> thanks to you guys' help and it's a production environment that needs
> to keep running other tests.
>
> It might be possible to recreate this on another maas server, using
> 'stress' or a similar tool to cause disk contention.
>
> Jason
>
>> As such, if you are willing to test if these make any material difference,
>> I would unfix your environment and do two runs (one without the fix, and
>> one with the fix). That's the only way we can really compare and be certain
>> in *your* environment.
>>
>>>
>>> --
>>> You received this bug notification because you are subscribed to MAAS.
>>> https://bugs.launchpad.net/bugs/1743249
>>>
>>> Title:
>>> Failed Deployment after timeout trying to retrieve grub cfg
>>>
>>> To manage notifications about this bug go to:
>>> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>>>
>>> Launchpad-Notification-Type: bug
>>> Launchpad-Bug: product=maas; milestone=2.4.x; status=New;
>>> importance=Undecided; assignee=None;
>>> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
>>> status=Fix Released; importance=Medium; assignee=mathieu.tl at gmail.com;
>>> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
>>> Launchpad-Bug-Information-Type: Public
>>> Launchpad-Bug-Private: no
>>> Launchpad-Bug-Security-Vulnerability: no
>>> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan janitor
>>> jason-hobbs mpontillo vorlon
>>> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
>>> Launchpad-Bug-Modifier: Jason Hobbs (jason-hobbs)
>>> Launchpad-Message-Rationale: Subscriber (MAAS)
>>> Launchpad-Message-For: andreserl
>>>
>>
>>
>> --
>> Andres Rodriguez (RoAkSoAx)
>> Ubuntu Server Developer
>> MSc. Telecom & Networking
>> Systems Engineer
>>
>> --
>> You received this bug notification because you are subscribed to the bug
>> report.
>> https://bugs.launchpad.net/bugs/1743249
>>
>> Title:
>> Failed Deployment after timeout trying to retrieve grub cfg
>>
>> Status in MAAS:
>> New
>> Status in grub2 package in Ubuntu:
>> Fix Released
>>
>> Bug description:
>> A node failed to deploy after it failed to retrieve a grub.cfg from
>> MAAS due to a timeout. In the logs, it's clear that the server tried
>> to retrieve the grub cfg many times, over about 30 seconds:
>>
>> http://paste.ubuntu.com/26387256/
>>
>> We see the same thing for other hosts around the same time:
>>
>> http://paste.ubuntu.com/26387262/
>>
>> It seems like MAAS is taking way too long to respond to these
>> requests.
>>
>> This is very similar to bug 1724677, which was happening
>> pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
>> back-end failed" in the logs anymore.
>>
>> I connected to the console on this system and it had errors about
>> timing out retrieving the grub-cfg, then it had an error message along
>> the lines of "error not an ip" and then "double free". After I
>> connected but before I could get a screenshot the system rebooted and
>> was directed by maas to power off, which it did successfully after
>> booting to linux.
>>
>> Full logs are available here:
>> https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>>
>> This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions