[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Andres Rodriguez andreserl at ubuntu-pe.org
Tue Feb 6 17:29:33 UTC 2018


>
> >
> > Yes, it is not an unknown machine, but that doesn't change the fact that
> > this is working as designed. If the client didn't get a response for the
> > request it makes, and the client decides to move on and makes a different
> > request, then it is working as designed. Again, the bug here is not in
> > the client's behavior; the bug here is that the response is not being
> > sent in a timely manner.
>
> Yes, agreed 100%.  It's not a client bug, it's a server bug.
>
> >
> >>
> >> > So this is *not* a race condition in MAAS. This is working as designed
> >> > and is expected. The problem here is that MAAS takes too long to answer
> >> > the initial request, which causes grub to time out and move on to
> >> > request a different config file.
> >>
> >> Yes, because there is a race condition in the design - the MAC
> >> specific file has to be generated before grub times out.  It could
> >> instead be generated before the node ever starts booting, allowing it
> >> to be served just as fast as the -default-amd64 file is, eliminating
> >> that race condition.
> >>
> >
> > It is not a race condition. It is doing exactly what it was told to do.
> > It requested X, didn't get a response, then requested Y and got a
> > response. The fact that there's no response to X in a /timely/ manner is
> > not a race; it's a bug on the server side. So, if the machine were not
> > known to MAAS, it would work as expected. But since it is known and the
> > response doesn't come in a timely manner for grub, it moves on. This is
> > the same behavior pxe, uboot and other network bootloaders follow.
>
> Right - it's a bug on the server side!  That's what I've been saying.
>
> > And yes, you could argue that the config could be generated before the
> > node starts booting, but what you are not considering is that the node
> > can boot from any rack controller, really. That would require maas to
> > send the same file to all rack controllers in the same vlan the machine
> > is booting from and write files onto the disk dynamically, which, in
> > fact, can impact performance even more. The config is generated on the
> > fly because it is generated for the specific rack controller the machine
> > is booting from, and that's the intended design.
>
> I never suggested the files had to be written to disk, but yes, they
> would need to be sent to each rack controller that it could boot from.
>
> I know it's the intended design, but it has a race condition built in
> that could be eliminated with another design.  That's all I'm saying.
>
> It sounds like you agree and you point out there would be trade-offs,
> and that's fine.
>

Actually, we don't believe this is a good change. In fact, this will cause
booting issues and overall performance issues.

We already know of two areas where this can be improved. One is not
backportable to 2.3; the other is this:

https://paste.ubuntu.com/26530972/

Is there any chance you can test that patch, or do you want me to put a
patched package somewhere?
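
For reference, the fallback sequence discussed in the quoted text above can
be sketched as a toy model. This is only an illustration, not grub or MAAS
code: the function name, the file labels, the latencies and the 30-second
timeout are all hypothetical values chosen for the example.

```python
def fetch_first_config(candidates, latency_s, timeout_s):
    """Return the first config the server answers within timeout_s, else None.

    candidates: config names in the order the client requests them.
    latency_s:  hypothetical server response time per name, in seconds;
                a missing name means the server has no such file.
    """
    for name in candidates:
        latency = latency_s.get(name)
        if latency is not None and latency <= timeout_s:
            return name
        # No timely answer: the client gives up on this file and moves on.
    return None


# Hypothetical labels and timings, for illustration only.
candidates = ["mac-specific", "default-amd64"]
slow_server = {"mac-specific": 35.0, "default-amd64": 0.1}
fast_server = {"mac-specific": 0.5, "default-amd64": 0.1}

print(fetch_first_config(candidates, slow_server, timeout_s=30.0))
print(fetch_first_config(candidates, fast_server, timeout_s=30.0))
```

With the slow server the client ends up with the default config even though
the machine is known; with the fast server it gets the MAC-specific one.
Under this model, the outcome depends only on whether the first response
arrives before the client's timeout.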

>
> Jason


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

Status in MAAS:
  New
Status in grub2 package in Ubuntu:
  In Progress

Bug description:
  A node failed to deploy after it failed to retrieve a grub.cfg from
  MAAS due to a timeout.  In the logs, it's clear that the server tried
  to retrieve the grub cfg many times, over about 30 seconds:

  http://paste.ubuntu.com/26387256/

  We see the same thing for other hosts around the same time:

  http://paste.ubuntu.com/26387262/

  It seems like MAAS is taking way too long to respond to these
  requests.

  This is very similar to bug 1724677, which was happening
  pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
  back-end failed" in the logs anymore.

  I connected to the console on this system and it had errors about
  timing out retrieving the grub cfg, then it had an error message along
  the lines of "error not an ip" and then "double free".  After I
  connected but before I could get a screenshot, the system rebooted and
  was directed by MAAS to power off, which it did successfully after
  booting to Linux.

  Full logs are available here:
  https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar

  This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions


