[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Blake Rouse
blake.rouse at canonical.com
Tue Feb 6 21:12:20 UTC 2018
Actually, caching does make a difference. That method is not just caching
the reading of a file: it caches the search for the file based on the
purpose, the reading of that file from disk (which may well be served
from the kernel cache), and the parsing of the template by tempita.
All of that is redundant work that is being done on every single request.
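A minimal sketch of the kind of memoization described above, in Python. The function `load_template` and the `config.<purpose>.template` naming scheme are hypothetical, and `string.Template` stands in for tempita so the example has no third-party dependency; this is an illustration, not the actual MAAS method. The search, the read and the parse all happen once per (directory, purpose) pair.

```python
import functools
import os
from string import Template


def _candidate_names(purpose):
    # Hypothetical naming scheme: a purpose-specific template first,
    # then a generic fallback.
    return [f"config.{purpose}.template", "config.template"]


@functools.lru_cache(maxsize=None)
def load_template(template_dir, purpose):
    """Search, read and parse a template once per (template_dir, purpose)."""
    for name in _candidate_names(purpose):
        path = os.path.join(template_dir, name)
        if os.path.exists(path):  # filesystem search (syscalls)
            with open(path, encoding="utf-8") as f:  # disk/kernel-cache read
                # Parse step; string.Template stands in for tempita here.
                return Template(f.read())
    # lru_cache does not cache raised exceptions, so a missing template
    # is searched for again on the next call.
    raise FileNotFoundError(f"no template for purpose {purpose!r}")
```

Calling `load_template(d, "ephemeral")` twice returns the same parsed object; the second call does no filesystem search, read or parse. The trade-off of an unbounded cache like this is that edits to the template on disk are not seen until the cache is cleared, e.g. with `load_template.cache_clear()`.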
Searching the filesystem and reading the file are still syscalls, even
when the data comes from the kernel cache. Since MAAS is async-based,
the coroutine is put on hold while it waits for the result to be loaded
from the kernel into the memory of the process. That gives other
coroutines time to run, which means this coroutine doesn't get to execute
again until the others are done or blocked on their own async requests.
Caching this information greatly improves that, because the coroutine
no longer has to be pushed back into the event loop while it waits for
data from the kernel. Without this change, even once the data comes back
it still has to be processed by tempita, which takes time and blocks the
event loop from completing other work.
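To illustrate the event-loop point, here is a sketch using asyncio (Python 3.9+ for `asyncio.to_thread`); it is not MAAS's own event-loop code, and `TemplateCache` and its `read_template` callable are hypothetical. On a cache miss the coroutine awaits a worker thread for the blocking read, which suspends it and lets other coroutines run, and then parses on the event-loop thread, blocking it. On a hit it returns immediately without suspending at all.

```python
import asyncio
from string import Template


class TemplateCache:
    """Hypothetical async-friendly cache of parsed templates keyed by purpose."""

    def __init__(self, read_template):
        # read_template: blocking callable, purpose -> template text.
        self._read_template = read_template
        self._cache = {}
        self.disk_reads = 0

    async def get(self, purpose):
        cached = self._cache.get(purpose)
        if cached is not None:
            # Hit: nothing is awaited, so the coroutine is never suspended.
            return cached
        self.disk_reads += 1
        # Miss: the blocking read runs in a worker thread; awaiting it
        # suspends this coroutine until the data comes back.
        text = await asyncio.to_thread(self._read_template, purpose)
        # Parsing runs on the event-loop thread and blocks it meanwhile
        # (string.Template stands in for tempita).
        template = Template(text)
        self._cache[purpose] = template
        return template
```

Two concurrent cold lookups for the same purpose may both read the file; a fuller version could store the in-flight task in the cache so that later callers await the same result.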
So it's not simply a matter of letting the kernel cache reads from the
disk; there is a lot more involved here. We have noticed improvements
with this change on systems that are running large numbers of VMs,
because of the reduction in IO.
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1743249
Title:
Failed Deployment after timeout trying to retrieve grub cfg
Status in MAAS:
New
Status in grub2 package in Ubuntu:
Fix Released
Bug description:
A node failed to deploy after it failed to retrieve a grub.cfg from
MAAS due to a timeout. In the logs, it's clear that the server tried
to retrieve the grub cfg many times, over about 30 seconds:
http://paste.ubuntu.com/26387256/
We see the same thing for other hosts around the same time:
http://paste.ubuntu.com/26387262/
It seems like MAAS is taking way too long to respond to these
requests.
  This is very similar to bug 1724677, which was happening
  pre-Meltdown/Spectre. The only difference is that we no longer see
  "[critical] TFTP back-end failed" in the logs.
  I connected to the console on this system and it showed errors about
  timing out while retrieving the grub cfg, then an error message along
  the lines of "error not an ip" and then "double free". After I
  connected, but before I could get a screenshot, the system rebooted and
  was directed by MAAS to power off, which it did successfully after
  booting into Linux.
Full logs are available here:
  https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions