[Lucid] SRU: Some testing on the larger writeback set

Fri Aug 20 13:20:29 UTC 2010

On 08/20/2010 03:59 AM, Stefan Bader wrote:
> The upstream discussion about which solution to take for the writeback umount
> regression does not seem close to a final conclusion. So I think we should
> decide to move forward with the bigger set (which also seems to have good
> effects on normal performance and responsiveness).
>
> I ran 2 tests of my own and the xfstests test suite on ext4 and saw no
> regression compared to before. Run-times were usually shorter with the patchset
> applied:
>
> mount-umount of tmpfs with other IO:     0.33s ->      0.02s
> mount-cp-umount of ext4            :     9.00s ->      8.00s
> xfstests on ext4                   : 24m30.00s ->  19m40.00s
>
> The xfstests run failed two aio testcases (239, 240) in both cases with very
> similar-looking errors. My kernels are based on the 2.6.32-24.41 release, so
> there may be ext4 fixes in upcoming stable updates.
>
> Then I tried the xfstests on xfs and got scared by this on the new kernel:
>
> INFO: task xfs_io:5764 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> xfs_io        D ffff880206439650     0  5764   3651 0x00000000
>   ffff8801f29dfd38 0000000000000082 0000000000015bc0 0000000000015bc0
>   ffff8801de759ab0 ffff8801f29dffd8 0000000000015bc0 ffff8801de7596f0
>   0000000000015bc0 ffff8801f29dffd8 0000000000015bc0 ffff8801de759ab0
> Call Trace:
>   [<ffffffff815595f7>] __mutex_lock_slowpath+0xe7/0x170
>   [<ffffffff8114f1e1>] ? path_put+0x31/0x40
>   [<ffffffff81559033>] mutex_lock+0x23/0x50
>   [<ffffffff81152b59>] do_filp_open+0x3d9/0xba0
>   [<ffffffff810f4487>] ? unlock_page+0x27/0x30
>   [<ffffffff81112a19>] ? __do_fault+0x439/0x500
>   [<ffffffff81115b78>] ? handle_mm_fault+0x1a8/0x3c0
>   [<ffffffff8115e4ca>] ? alloc_fd+0x10a/0x150
>   [<ffffffff81142219>] do_sys_open+0x69/0x170
>   [<ffffffff81142360>] sys_open+0x20/0x30
>   [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
>
> However, the test run completes after 118m30s (and fails 10 out of 146 tests:
> 017 109 194 198 225 229 232 238 239 240). I did not see the dump on the old
> kernel, but that might just be because writeback there is too slow to expose
> that race. I will re-run the test on the old kernel to get the run-time and the
> list of failing tests, though the run-time is enormous (that's why I forgot to
> note things down yesterday).
>
> All in all, though, I think it should be safe for wider regression testing in
> -proposed. If there is no veto (and enough oks), I will add the set to our
> master branch.
>
> Stefan
>

I think the bigger patch set makes the most sense. It's certainly had the
most testing; I've been running it for a couple of weeks now. Let's get
it into -proposed soon. Lots of folks are hating life because of this issue.

In the meantime, maybe you should rerun the XFS tests without all of the
dmesg hung_task noise (echo 0 > /proc/sys/kernel/hung_task_timeout_secs),
just to be sure.
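
A minimal sketch of such a rerun, assuming a root shell: it saves the current
hung_task timeout, writes 0 to silence the warnings, runs whatever command it
is given, and then restores the old value. The function name
with_hung_task_quiet and the KNOB/HUNG_TASK_KNOB variables are made up for
illustration; only the /proc path comes from the kernel message quoted above.

```shell
#!/bin/sh
# Sysctl file that controls the hung-task warning timeout. It can be
# overridden through the (hypothetical) HUNG_TASK_KNOB variable so the
# sketch can be tried against an ordinary file without root.
KNOB="${HUNG_TASK_KNOB:-/proc/sys/kernel/hung_task_timeout_secs}"

# Run a command with hung-task warnings disabled, then restore the old value.
with_hung_task_quiet() {
    old=$(cat "$KNOB") || return 1   # remember the current timeout
    echo 0 > "$KNOB" || return 1     # note the space: "0>" would redirect fd 0
    "$@"                             # run the test command as given
    status=$?
    echo "$old" > "$KNOB"            # put the previous timeout back
    return $status
}
```

From an xfstests checkout this might be invoked as "with_hung_task_quiet
./check" (the exact test invocation is an assumption here). Restoring the old
value afterwards keeps the warning available for later runs.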

rtg
-- 
Tim Gardner tim.gardner at canonical.com