[SRU][B][F][G][H][PATCH 0/6] raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Tim Gardner
tim.gardner at canonical.com
Mon May 10 12:08:36 UTC 2021
On 5/9/21 7:25 PM, Matthew Ruffell wrote:
> Hi Tim,
>
> I appreciate your caution, and I too am happy to play the conservative game
> on this particular patchset. I _really_ don't want to cause another regression;
> the one in December caused enough trouble for everyone as it was.
>
> I'm happy to NACK the patchset for this cycle and resubmit in the next one,
> as long as we set out and agree on what additional regression testing should
> be performed, and that, once I have performed those tests, you will seriously
> consider accepting these patches.
>
> Testing that I have currently performed:
> - Testing against the testcase of mkfs.xfs / mkfs.ext4 / fstrim on a Raid10 md
> block device, as specified in LP #1896578 [1] and sketched below.
> - Testing against the disk corruption regression testcase, as reported in
> LP #1907262 [2]. All disks now fsck clean with this new revision of the
> patchset, which you can see in comment #15 [3] on LP #1896578.
> - Three customers have tested test kernels and haven't found any issues (yet).
>
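> For reference, the core of that testcase looks roughly like the following.
> This is a sketch only: the device names and the 4-disk geometry are examples,
> and the exact reproducer is in the bug.
>
> $ sudo mdadm --create /dev/md0 --level=10 --raid-devices=4 \
>       /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> # mkfs.xfs and mkfs.ext4 discard all blocks by default, so both exercise
> # the block discard path these patches change
> $ time sudo mkfs.xfs -f /dev/md0
> $ sudo mount /dev/md0 /mnt
> $ time sudo fstrim -v /mnt
>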
> Testing that I could go perform over the next week or two:
> - Run xfstests with the generic/*, xfs/* and ext4/* testsuites over two Raid10
> md block devices: one test block device and one scratch block device (see the
> local.config sketch below). I would do two runs, one on a released kernel
> without the patches and one on the test kernels in [4], to see if there are
> any regressions.
> - I could use a cloud instance with NVMe drives as my primary computer over the
> next two or so weeks, put my /home on a Raid10 md block device, change the
> fstrim systemd timer from weekly to every 30 minutes (see the override sketch
> below), and watch for any data corruption.
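>
> The xfstests runs would use a local.config along these lines. This is a
> sketch: the mount points are placeholders, and /dev/md0 and /dev/md1 stand
> in for the two Raid10 arrays.
>
> $ cat local.config
> export TEST_DEV=/dev/md0
> export TEST_DIR=/mnt/test
> export SCRATCH_DEV=/dev/md1
> export SCRATCH_MNT=/mnt/scratch
> $ sudo FSTYP=xfs ./check -g auto    # then a second pass with FSTYP=ext4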
>
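> Tightening the fstrim schedule would be a systemd drop-in along these lines
> (a sketch; *:0/30 is systemd calendar syntax for every 30 minutes):
>
> $ sudo systemctl edit fstrim.timer
> # in the editor, clear the default weekly schedule and replace it:
> [Timer]
> OnCalendar=
> OnCalendar=*:0/30
> $ sudo systemctl restart fstrim.timer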
I like this setup ^. It's a bit more live and random than focused
testing. Plus, you're more likely to notice regressions or delays in
file system access.
> Just so that you are aware, I have three different customers who would very much
> like these patches to land in the Ubuntu kernels. Two are deploying systems via
> MAAS or curtin, and are seeing deployment timeouts when deploying Raid10 to
> NVMe disks: their larger arrays take 2-3 hours to block discard on current
> kernels, while normal systems deploy in only 15-30 mins, and they don't want
> to increase the deployment timeout setting for this one outlier.
> The other customer doesn't want to spend 2-3 hours waiting for a filesystem to
> be created on their Raid10 md arrays when it could take 4 seconds instead.
>
I can see why they are pushing to get this fixed.
> I know this is a big change, and I know that this set has already caused grief
> with a regression in December, but customers are requesting this feature.
> Because of that, I'm willing to work with you to figure out the appropriate
> testing required, and hopefully get this landed in Ubuntu kernels safely in
> the near future.
>
> Let me know what additional testing you would like to see, and I will go and
> complete it.
>
> Thanks,
> Matthew
>
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578
> [2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262
> [3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578/comments/15
> [4] https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test
>
This patch set won't make the 2021.05.10 SRU cycle, so let's schedule it
for the next SRU cycle (beginning about June 1). That'll give another
few weeks of testing. If there are no regressions then I think we can get
it included. Be sure to ping me around then so I can annoy someone else
on the team into reviewing it as well. Watch for upstream Fix patches in
the meantime.
rtg
-----------
Tim Gardner
Canonical, Inc