[Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs
Matthew Ruffell
2036467 at bugs.launchpad.net
Mon Oct 9 02:47:34 UTC 2023
** Summary changed:
- superblock checksum mismatch in resize2fs
+ Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs
** Description changed:
- Hi,
- We run ext4 on EBS volumes on EC2. During provisioning, cloud-init will occasionally report that resize2fs has failed due to a superblock checksum mismatch. We debugged this internally, and were able to come up with the following reproducer:
+ [Impact]
+
+ This is a long-running bug plaguing cloud-images, where on a rare
+ occasion resize2fs would fail and the image would not resize to fit the
+ entire disk.
+
+ Online resizes would fail due to a superblock checksum mismatch, where
+ the superblock in memory differs from what is currently on disk due to
+ changes made to the image.
+
+ Changing the read of the superblock to Direct I/O solves the issue.
+
+ [Testcase]
+
+ Start a c5.large instance on AWS, and attach a 60 GB gp3 volume for use
+ as a scratch disk.
+
+ Run the following script, courtesy of Krister Johansen and his team:
#!/usr/bin/bash
set -euxo pipefail
while true
do
    parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
    sleep .5
    mkfs.ext4 /dev/nvme1n1p1
    mount -t ext4 /dev/nvme1n1p1 /mnt
    stress-ng --temp-path /mnt -D 4 &
    STRESS_PID=$!
    sleep 1
    growpart /dev/nvme1n1 1
    resize2fs /dev/nvme1n1p1
    kill $STRESS_PID
    wait $STRESS_PID
    umount /mnt
    wipefs -a /dev/nvme1n1p1
    wipefs -a /dev/nvme1n1
done
- (This was on a 60gb gp3 volume attached to a c5.4xlarge)
+ Test packages are available in the following ppa:
- We were able to find a fix that works and get the patch accepted
- upstream. The short explanation is that by switching the superblock
- read to direct io, we no longer see the problem.
+ https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test
- The patch is available here, but hasn't been published in a released
- version of e2fsprogs:
+ If you install the test packages, the race no longer occurs.
- https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84
+ [Where problems could occur]
- A longer thread with the maintainer is available here:
+ We are changing how resize2fs reads the superblock from underlying
+ disks.
+ If a regression were to occur, resize2fs could fail to resize offline or
+ online volumes. As all cloud-images are online resized during their
+ initial boot, this could have a large impact on public and private
+ clouds should a regression occur.
+
+ [Other info]
+
+ Upstream mailing list discussion:
+ https://lore.kernel.org/linux-ext4/20230605225221.GA5737@templeofstupid.com/
https://lore.kernel.org/linux-ext4/20230609042239.GA1436857@mit.edu/
- This bug report is to request that Ubuntu backport this patch to the
- versions of e2fsprogs that are in releases that are available in images
- on AWS, preferably Focal and Jammy.
+ This was fixed in the below commit upstream:
+
+ commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
+ Author: Theodore Ts'o <tytso at mit.edu>
+ Date: Thu, 15 Jun 2023 00:17:01 -0400
+ Subject: resize2fs: use Direct I/O when reading the superblock for
+ online resizes
+ Link: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84
+
+ The commit has not yet been included in a tagged release. All supported
+ Ubuntu releases require this fix, and it needs to be published in the
+ standard non-ESM archives to be picked up in cloud images.
** Tags added: sts
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467
Title:
Resizing cloud-images occasionally fails due to superblock checksum
mismatch in resize2fs
Status in cloud-images:
New
Status in e2fsprogs package in Ubuntu:
In Progress
Status in e2fsprogs source package in Trusty:
In Progress
Status in e2fsprogs source package in Xenial:
In Progress
Status in e2fsprogs source package in Bionic:
In Progress
Status in e2fsprogs source package in Focal:
In Progress
Status in e2fsprogs source package in Jammy:
In Progress
Status in e2fsprogs source package in Lunar:
In Progress
Status in e2fsprogs source package in Mantic:
In Progress
Bug description:
[Impact]
This is a long-running bug plaguing cloud-images, where on a rare
occasion resize2fs would fail and the image would not resize to fit
the entire disk.
Online resizes would fail due to a superblock checksum mismatch, where
the superblock in memory differs from what is currently on disk due to
changes made to the image.
Changing the read of the superblock to Direct I/O solves the issue.
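To make the buffered versus Direct I/O distinction concrete, here is a minimal sketch that dumps the primary ext4 superblock (the 1024 bytes at byte offset 1024 of the device) either through the page cache or with O_DIRECT. It is an illustration only, not how resize2fs itself is implemented: `read_superblock` and `superblock_magic` are hypothetical helper names, and the sketch assumes GNU dd (for `iflag=direct`) and a logical block size of at most 4096 bytes.

```shell
#!/usr/bin/bash
# Hypothetical helpers for illustration; not part of e2fsprogs.

# Print the 1024-byte primary superblock of DEVICE on stdout.
# MODE is "buffered" (default, via the page cache) or "direct" (O_DIRECT).
read_superblock() {
    local dev=$1 mode=${2:-buffered}
    # O_DIRECT reads must be block-aligned, so fetch the first 4096 bytes
    # in one request and slice out bytes 1024..2047 afterwards.
    if [ "$mode" = direct ]; then
        dd if="$dev" bs=4096 count=1 iflag=direct 2>/dev/null
    else
        dd if="$dev" bs=4096 count=1 2>/dev/null
    fi | tail -c +1025 | head -c 1024
}

# Print the two magic bytes at offset 56 of the superblock as hex.
# A valid ext4 superblock stores 0xEF53 little-endian, i.e. "53ef".
superblock_magic() {
    read_superblock "$1" "${2:-buffered}" | od -An -tx1 -j56 -N2 | tr -d ' \n'
}
```

On the scratch volume from the test case, something like `cmp <(read_superblock /dev/nvme1n1p1) <(read_superblock /dev/nvme1n1p1 direct)` compares the two views while the filesystem is mounted and busy. Whether a difference is actually observed at any given moment depends on timing.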
[Testcase]
Start a c5.large instance on AWS, and attach a 60 GB gp3 volume for
use as a scratch disk.
Run the following script, courtesy of Krister Johansen and his team:
#!/usr/bin/bash
set -euxo pipefail
while true
do
    parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
    sleep .5
    mkfs.ext4 /dev/nvme1n1p1
    mount -t ext4 /dev/nvme1n1p1 /mnt
    stress-ng --temp-path /mnt -D 4 &
    STRESS_PID=$!
    sleep 1
    growpart /dev/nvme1n1 1
    resize2fs /dev/nvme1n1p1
    kill $STRESS_PID
    wait $STRESS_PID
    umount /mnt
    wipefs -a /dev/nvme1n1p1
    wipefs -a /dev/nvme1n1
done
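Because the race is rare, the loop above can run for a long time before `set -e` aborts it, and it never terminates on a fixed package. A minimal sketch of a bounded wrapper follows, assuming one pass of the loop body is saved as its own script. `run_until_failure` and `one-pass.sh` are hypothetical names, not part of the original reproducer.

```shell
#!/usr/bin/bash
# Hypothetical helper, not part of the original reproducer.
# Runs COMMAND up to MAX times and reports the first iteration that fails.
# Usage: run_until_failure MAX COMMAND [ARGS...]
run_until_failure() {
    local max=$1 i
    shift
    for ((i = 1; i <= max; i++)); do
        if ! "$@"; then
            echo "failed on iteration $i of $max"
            return 1
        fi
    done
    echo "no failure in $max iterations"
    return 0
}
```

A run would then look like `run_until_failure 1000 ./one-pass.sh`. Keeping the pass in a separate script matters: bash ignores `set -e` inside a shell function that is invoked as an `if` condition, so a function containing the same commands would not stop at the first failing step.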
Test packages are available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test
If you install the test packages, the race no longer occurs.
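For reference, installing the test packages might look like the sketch below. The `ppa:mruffell/lp2036467-test` shorthand is an assumption derived from the Launchpad URL above, and the function names are hypothetical. Depending on how the packages are split, related library packages from the same source may also need upgrading.

```shell
#!/usr/bin/bash
# Assumed PPA shorthand, derived from the Launchpad URL in the bug report.
PPA="ppa:mruffell/lp2036467-test"

# Hypothetical helper: enable the PPA and install e2fsprogs from it.
install_test_packages() {
    sudo add-apt-repository -y "$PPA"
    sudo apt-get update
    sudo apt-get install -y e2fsprogs
}

# Hypothetical helper: print the installed e2fsprogs version so it can be
# checked against the version published in the PPA.
show_e2fsprogs_version() {
    dpkg-query -W -f='${Version}\n' e2fsprogs
}
```

Calling `install_test_packages` followed by `show_e2fsprogs_version` before re-running the reproducer confirms that the test build is the one in use.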
[Where problems could occur]
We are changing how resize2fs reads the superblock from underlying
disks.
If a regression were to occur, resize2fs could fail to resize offline
or online volumes. As all cloud-images are online resized during their
initial boot, this could have a large impact on public and private
clouds should a regression occur.
[Other info]
Upstream mailing list discussion:
https://lore.kernel.org/linux-ext4/20230605225221.GA5737@templeofstupid.com/
https://lore.kernel.org/linux-ext4/20230609042239.GA1436857@mit.edu/
This was fixed in the below commit upstream:
commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
Author: Theodore Ts'o <tytso at mit.edu>
Date: Thu, 15 Jun 2023 00:17:01 -0400
Subject: resize2fs: use Direct I/O when reading the superblock for
online resizes
Link: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84
The commit has not yet been included in a tagged release. All supported
Ubuntu releases require this fix, and it needs to be published in the
standard non-ESM archives to be picked up in cloud images.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions