[Bug 1929860] Re: initrdfail can result in resuming with different initrd images and hanging resume

Francis Ginther 1929860 at bugs.launchpad.net
Wed Aug 18 12:37:45 UTC 2021


@juliank

This does not appear to be a complete fix for the issue seen on AWS
t2.nano. I see failures in about 5 - 10% of cases. There are two
different symptoms:

 * grub-initrd-fallback.service sometimes runs right before the system hibernates (it runs after the hibernation request was sent to the VM and before it fully hibernates).
 * grub-initrd-fallback.service sometimes runs after resume, but checking the status of `initrdfail` a few minutes later indicates it is still set (not sure if it was ever cleared).

Both of these were determined by checking the last active timestamp
reported by 'systemctl status grub-initrd-fallback.service' and
comparing this with timestamps generated by the hibernation test. If
either situation occurs, the next hibernation/resume will fail.
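
The timestamp comparison described above can be sketched as a small
shell helper. This is only an illustration: the function name
classify_service_run and its three arguments (all in seconds since the
epoch) are hypothetical, and how the hibernation test records its own
timestamps is not shown here.

```shell
# Hypothetical helper: classify when grub-initrd-fallback.service last ran
# relative to one hibernate/resume cycle. All arguments are epoch seconds.
#   $1 = last run of the service
#   $2 = time the hibernation request was sent
#   $3 = time the resume was detected
classify_service_run() {
    service_ts=$1
    hibernate_ts=$2
    resume_ts=$3
    if [ "$service_ts" -lt "$hibernate_ts" ]; then
        echo "before-request"      # ran before this cycle, so not on resume
    elif [ "$service_ts" -lt "$resume_ts" ]; then
        echo "before-hibernate"    # first symptom: ran after the request, before hibernating
    else
        echo "after-resume"        # expected timing; initrdfail must still be checked separately
    fi
}
```

For example, `classify_service_run 250 200 300` prints
`before-hibernate`, which corresponds to the first symptom.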

I'm trying to collect more information.

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1929860

Title:
  initrdfail can result in resuming with different initrd images and
  hanging resume

Status in grub2 package in Ubuntu:
  Fix Released
Status in grub2 source package in Focal:
  Fix Committed

Bug description:
  [Impact]
  Ubuntu Focal (and newer releases) on AWS will normally boot without an initrd image (just the microcode.cpio). There is a fallback mechanism to reboot with the full initrd image when the boot fails to complete. The grub environment variable "initrdfail" is used to track when a boot failed and switch between the optimized initrd-less boot path and the full initrd path.

  On a normal successful boot, the "initrdfail" variable is cleared by
  grub-initrd-fallback.service. However, this doesn't happen when
  resuming from hibernation. As a result, the initrd fallback will get
  triggered on the second hibernation / resume cycle despite the
  original boot using only the microcode.cpio. This switch in initrd
  images leads to the second resume hanging.

  We've been able to successfully avoid this issue by adding the
  following to the ec2-hibinit-agent resume handler:

  /usr/bin/grub-editenv - unset initrdfail
  /usr/bin/grub-editenv - unset recordfail

  (Note: clearing recordfail may not be necessary; we will need to try
  again without it.)
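
  To confirm after a resume that the variable was actually cleared, the
  grub environment can be listed and checked. A minimal sketch, assuming
  `grub-editenv - list` prints one name=value pair per line; the helper
  name grubenv_has_var is hypothetical:

```shell
# Hypothetical helper: succeed if variable $1 appears in a grubenv
# listing read from stdin (one name=value pair per line).
grubenv_has_var() {
    grep -q "^$1="
}

# Usage sketch (needs grub-editenv and read access to the grubenv file):
#   if grub-editenv - list | grubenv_has_var initrdfail; then
#       echo "initrdfail is still set"
#   fi
```

  Reading the listing from stdin keeps the check separate from the
  grub-editenv call, so it can be exercised without touching the real
  grubenv file.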

  This bug was filed against grub2 as it appears to own initrdfail.

  [Test plan]
  TBD w/ CPC

  [Regression potential]
  The services are changed to oneshot and wantedby=multi-user and sleep; we may have missed other places where they should run, or they may record the wrong thing on resume.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1929860/+subscriptions



