[Bug 2036467] Re: Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs
Matthew Ruffell
2036467 at bugs.launchpad.net
Fri Feb 2 05:56:06 UTC 2024
Hi Krister,
I have finally seen this occur in real life with my own two eyes!
You are absolutely correct, the 4-retry doesn't seem to be sufficient
sometimes.
The reproducer triggers on Focal and earlier releases in about 20 minutes, so
it's easy to see the issue there. But Focal and earlier releases don't
retry at all.
On Jammy, Mantic and Noble, it took about a week straight, but I managed
to get it to trigger on each of them.
Start
----------------------------
Tue Jan 16 01:57:20 UTC 2024
Tue Jan 16 02:18:53 UTC 2024
End
----------------------------
Tue Jan 23 20:12:28 UTC 2024
Tue Jan 23 14:32:08 UTC 2024
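As a rough check on the "about a week" figure, the timestamps above can be subtracted. This is only a sketch: it assumes the first Start line pairs with the first End line and the second with the second (the mail does not say which release each line belongs to), and the `elapsed` helper is a made-up name. It needs GNU date for the `-d` option.

```shell
#!/usr/bin/bash
# Print the elapsed time between two timestamps as days/hours/minutes/seconds.
# Relies on GNU date to parse the timestamps with -d.
elapsed() {
    local start end diff
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    diff=$((end - start))
    printf '%dd %02dh %02dm %02ds\n' \
        $((diff / 86400)) $((diff % 86400 / 3600)) \
        $((diff % 3600 / 60)) $((diff % 60))
}

# Assumed pairing: first Start with first End, second Start with second End.
elapsed "Tue Jan 16 01:57:20 UTC 2024" "Tue Jan 23 20:12:28 UTC 2024"
elapsed "Tue Jan 16 02:18:53 UTC 2024" "Tue Jan 23 14:32:08 UTC 2024"
```

Under that pairing the two runs come out at 7d 18h 15m 08s and 7d 12h 13m 15s.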
The 4-retry does help, and helps quite a lot really.
Anyway, I upgraded my test environment to the test packages, and I will
leave them running for a week.
If things look good then, I'll get these patches sponsored for SRU.
Sorry for the delay, but I really wanted to see it fail on Jammy, Mantic
and Noble before we go patching them.
Thanks,
Matthew
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467
Title:
Resizing cloud-images occasionally fails due to superblock checksum
mismatch in resize2fs
Status in cloud-images:
New
Status in e2fsprogs package in Ubuntu:
In Progress
Status in e2fsprogs source package in Trusty:
Won't Fix
Status in e2fsprogs source package in Xenial:
Won't Fix
Status in e2fsprogs source package in Bionic:
Won't Fix
Status in e2fsprogs source package in Focal:
In Progress
Status in e2fsprogs source package in Jammy:
In Progress
Status in e2fsprogs source package in Lunar:
Won't Fix
Status in e2fsprogs source package in Mantic:
In Progress
Bug description:
[Impact]
This is a long-running bug plaguing cloud-images, where on rare
occasions resize2fs fails and the image is not resized to fit
the entire disk.
Online resizes would fail due to a superblock checksum mismatch, where
the superblock in memory differs from what is currently on disk due to
changes made to the image.
$ resize2fs /dev/nvme1n1p1
resize2fs 1.47.0 (5-Feb-2023)
resize2fs: Superblock checksum does not match superblock while trying to open /dev/nvme1n1p1
Couldn't find valid filesystem superblock.
Changing the read of the superblock to Direct I/O solves the issue.
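To make the buffered versus Direct I/O distinction concrete, here is a minimal sketch that reads the primary ext4 superblock (1024 bytes, starting at byte offset 1024) either through the page cache or with O_DIRECT via `dd iflag=direct`. This is not how resize2fs does it internally; the helper names `read_superblock` and `sb_magic` are invented for illustration, and the sketch assumes GNU coreutils.

```shell
#!/usr/bin/bash
# Dump the 1024-byte primary ext4 superblock of a device or image file.
# Mode "direct" opens the input with O_DIRECT (bypassing the page cache);
# any other mode does an ordinary buffered read.
read_superblock() {
    local dev=$1 mode=${2:-buffered}
    local flags=()
    if [ "$mode" = direct ]; then
        flags=(iflag=direct)
    fi
    # Read one aligned 4096-byte block (O_DIRECT needs aligned I/O),
    # then keep only bytes 1024..2047, where the superblock lives.
    dd if="$dev" bs=4096 count=1 "${flags[@]}" 2>/dev/null |
        tail -c +1025 | head -c 1024
}

# Print the two magic bytes at offset 56 of the superblock as hex.
# A valid ext2/3/4 superblock prints 53ef (0xEF53 stored little-endian).
sb_magic() {
    read_superblock "$@" | od -An -tx1 -j56 -N2 | tr -d ' \n'
    echo
}
```

On the test instance, something like `cmp <(read_superblock /dev/nvme1n1p1) <(read_superblock /dev/nvme1n1p1 direct)` could be used to look for a difference between the two views while the filesystem is under load, though the two reads are not simultaneous, so a difference on its own does not prove the race.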
[Testcase]
Start a c5.large instance on AWS, and attach a 60 GB gp3 volume for
use as a scratch disk.
Run the following script, courtesy of Krister Johansen and his team:
#!/usr/bin/bash
set -euxo pipefail
while true
do
    parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
    sleep .5
    mkfs.ext4 /dev/nvme1n1p1
    mount -t ext4 /dev/nvme1n1p1 /mnt
    stress-ng --temp-path /mnt -D 4 &
    STRESS_PID=$!
    sleep 1
    growpart /dev/nvme1n1 1
    resize2fs /dev/nvme1n1p1
    kill $STRESS_PID
    wait $STRESS_PID
    umount /mnt
    wipefs -a /dev/nvme1n1p1
    wipefs -a /dev/nvme1n1
done
Test packages are available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test
If you install the test packages, the race no longer occurs.
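Installing the test packages might look like the sketch below. The PPA short name is derived from the URL above; the helper name `install_test_packages` is made up, and it is an assumption that upgrading the `e2fsprogs` binary package (which ships resize2fs) is enough to pick up the test build.

```shell
#!/usr/bin/bash
# PPA short name, derived from the Launchpad URL of the test PPA.
PPA="ppa:mruffell/lp2036467-test"

# Hypothetical helper: enable the PPA and install e2fsprogs from it.
install_test_packages() {
    sudo add-apt-repository -y "$PPA"
    sudo apt-get update
    sudo apt-get install -y e2fsprogs
    # Show which version ended up installed.
    dpkg-query -W -f='${Package} ${Version}\n' e2fsprogs
}
```

Calling `install_test_packages` on the scratch instance before starting the reproducer loop should leave the patched resize2fs in place for the test run.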
[Where problems could occur]
We are changing how resize2fs reads the superblock from underlying
disks.
If a regression were to occur, resize2fs could fail to resize offline
or online volumes. As all cloud-images are online resized during their
initial boot, this could have a large impact on public and private
clouds should a regression occur.
[Other info]
Upstream mailing list discussion:
https://lore.kernel.org/linux-ext4/20230605225221.GA5737@templeofstupid.com/
https://lore.kernel.org/linux-ext4/20230609042239.GA1436857@mit.edu/
This was fixed in the below commit upstream:
commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
Author: Theodore Ts'o <tytso at mit.edu>
Date: Thu, 15 Jun 2023 00:17:01 -0400
Subject: resize2fs: use Direct I/O when reading the superblock for
online resizes
Link: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84
The commit has not been tagged to any release. All supported Ubuntu
releases require this fix, and it needs to be published in the standard
non-ESM archives to be picked up in cloud images.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions