[Lucid] SRU: Some testing on the larger writeback set

Stefan Bader stefan.bader at canonical.com
Fri Aug 20 13:26:43 UTC 2010


On 08/20/2010 03:20 PM, Tim Gardner wrote:
> On 08/20/2010 03:59 AM, Stefan Bader wrote:
>> The upstream discussion about which solution to take for the writeback
>> umount regression does not seem close to a final conclusion. So I think
>> we should decide to move forward with the bigger set (which also seems
>> to have good effects on normal performance and responsiveness).
>>
>> I ran two tests of my own, plus the xfstests suite on ext4, and saw no
>> regression compared to before. Run-times were usually shorter with the
>> patchset applied:
>>
>> mount-umount of tmpfs with other IO:     0.33s ->      0.02s
>> mount-cp-umount of ext4            :     9.00s ->      8.00s
>> xfstests on ext4                   : 24m30.00s ->  19m40.00s
>>
>> xfstests failed two aio test cases (239 and 240) on both kernels, with
>> very similar-looking errors. My kernels are based on the 2.6.32-24.41
>> release, so there may be ext4 fixes in upcoming stable updates.
>>
>> Then I tried xfstests on xfs and got scared by this on the new kernel:
>>
>> INFO: task xfs_io:5764 blocked for more than 120 seconds.
>> "echo 0>  /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> xfs_io        D ffff880206439650     0  5764   3651 0x00000000
>>   ffff8801f29dfd38 0000000000000082 0000000000015bc0 0000000000015bc0
>>   ffff8801de759ab0 ffff8801f29dffd8 0000000000015bc0 ffff8801de7596f0
>>   0000000000015bc0 ffff8801f29dffd8 0000000000015bc0 ffff8801de759ab0
>> Call Trace:
>>   [<ffffffff815595f7>] __mutex_lock_slowpath+0xe7/0x170
>>   [<ffffffff8114f1e1>] ? path_put+0x31/0x40
>>   [<ffffffff81559033>] mutex_lock+0x23/0x50
>>   [<ffffffff81152b59>] do_filp_open+0x3d9/0xba0
>>   [<ffffffff810f4487>] ? unlock_page+0x27/0x30
>>   [<ffffffff81112a19>] ? __do_fault+0x439/0x500
>>   [<ffffffff81115b78>] ? handle_mm_fault+0x1a8/0x3c0
>>   [<ffffffff8115e4ca>] ? alloc_fd+0x10a/0x150
>>   [<ffffffff81142219>] do_sys_open+0x69/0x170
>>   [<ffffffff81142360>] sys_open+0x20/0x30
>>   [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
>>
>> However, the test run completes after 118m30s and fails 10 out of 146
>> tests (017 109 194 198 225 229 232 238 239 240). I did not see the dump
>> on the old kernel, but that might just be because writeback there is too
>> slow to expose the race. I will re-run the test on the old kernel to get
>> its run-time and failing tests, though the run-time is tremendous
>> (that's why I forgot to note things down yesterday).
>>
>> All in all, I think the set should be safe for larger regression testing
>> in -proposed. If there is no veto (and enough acks), I will add it to
>> our master branch.
>>
>> Stefan
>>
> 
> I think the bigger patch set makes the most sense; it's certainly had
> the most testing, and I've been running it for a couple of weeks now.
> Let's get it into -proposed soon, as lots of folks are hating life
> because of this issue.
> 
> In the meantime, maybe you should rerun the XFS tests without all of the
> dmesg hung_task noise (echo 0 > /proc/sys/kernel/hung_task_timeout_secs),
> just to be sure.
> 
> rtg
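
As an aside, for anyone reproducing the quieter XFS runs: a minimal sketch
of toggling the knob Tim mentions (run as root; 120s is the usual default
timeout, but check your kernel's config):

    # silence "blocked for more than 120 seconds" reports during the run
    echo 0 > /proc/sys/kernel/hung_task_timeout_secs

    # ... run xfstests ...

    # restore the default watchdog timeout afterwards
    echo 120 > /proc/sys/kernel/hung_task_timeout_secs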

Actually (after some more testing... these runs take ages), it seems a
successful run depends on the phase of the moon or other weird factors,
regardless of which kernel I use. I have now seen hangs with both the old
and the new kernel, and the same messages sometimes appear and sometimes
do not. So I will go ahead later and queue up the writeback batch.
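
For completeness, a rough sketch of the timing runs from my first mail; the
device, mount points, and copy source below are hypothetical (the actual
scripts were not posted):

    # background IO so writeback has dirty data in flight
    dd if=/dev/zero of=/mnt/scratch/bigfile bs=1M count=2048 &

    # 1) mount-umount of tmpfs with other IO
    mkdir -p /mnt/tmp
    time sh -c 'mount -t tmpfs tmpfs /mnt/tmp && umount /mnt/tmp'

    # 2) mount-cp-umount of ext4 (hypothetical test partition /dev/sdb1)
    mkdir -p /mnt/test
    time sh -c 'mount /dev/sdb1 /mnt/test && cp -a /usr/src/linux /mnt/test && umount /mnt/test'

    wait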

Stefan



