[Bug 628530] [NEW] 10.04 LTS Servers Hang with task kjournald blocked for more than 120 seconds

Dewi 628530 at bugs.launchpad.net
Wed Sep 1 23:43:31 UTC 2010


Public bug reported:

We recently upgraded out server infrastructure to 10.04.  Our servers run as guests on XenServer 5.6 running on Dell M600 Blades processors are 2 quad core Intel Xeon E5430.
All disk for the guests is hosted on a Dell Equilogic SAN storage array. With 8.04 after our Xen5.6 upgrade we had no issues with the server stability. Now that we have upgraded to 10.04, we are experiencing daily hangs and crashes of our servers at random times. The symptoms seem related to intensive disk-io.
Other issues since the upgrade are that the system clocks are way out of alignment which may have some relation to the hanging errors.

We are booting the servers with clocksource=jiffies which was the
default we used for our 8.04 machines, I have tried a couple of machines
without the jiffies option and they come up with clocksource=xen. These
machines seem to keep time reasonably well, BUT when I tried this option
with our main filesevers, they crashed almost immediately and I had to
revert to using clocksource=jiffies

syslog output from the time of the crash on a samba fileserver :

Sep  1 09:53:41 bilbo kernel: [153120.060025] INFO: task kjournald:700 blocked for more than 120 seconds.
Sep  1 09:53:41 bilbo kernel: [153120.060040] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 09:53:41 bilbo kernel: [153120.060049] kjournald     D ffff880003a72bc0     0   700      2 0x00000000
Sep  1 09:53:41 bilbo kernel: [153120.060056]  ffff8800d767bc30 0000000000000246 0000000000015bc0 0000000000015bc0
Sep  1 09:53:41 bilbo kernel: [153120.060063]  ffff8800d94b1ab0 ffff8800d767bfd8 0000000000015bc0 ffff8800d94b16f0
Sep  1 09:53:41 bilbo kernel: [153120.060069]  0000000000015bc0 ffff8800d767bfd8 0000000000015bc0 ffff8800d94b1ab0
Sep  1 09:53:41 bilbo kernel: [153120.060076] Call Trace:
Sep  1 09:53:41 bilbo kernel: [153120.060095]  [<ffffffff8116e5d0>] ? sync_buffer+0x0/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060101]  [<ffffffff81558bb7>] io_schedule+0x47/0x70
Sep  1 09:53:41 bilbo kernel: [153120.060104]  [<ffffffff8116e615>] sync_buffer+0x45/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060107]  [<ffffffff815591df>] __wait_on_bit+0x5f/0x90
Sep  1 09:53:41 bilbo kernel: [153120.060110]  [<ffffffff8116d3b1>] ? submit_bh+0x111/0x140
Sep  1 09:53:41 bilbo kernel: [153120.060113]  [<ffffffff8116e5d0>] ? sync_buffer+0x0/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060116]  [<ffffffff81559288>] out_of_line_wait_on_bit+0x78/0x90
Sep  1 09:53:41 bilbo kernel: [153120.060120]  [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060123]  [<ffffffff8116e5c6>] __wait_on_buffer+0x26/0x30
Sep  1 09:53:41 bilbo kernel: [153120.060127]  [<ffffffff81214d4b>] journal_commit_transaction+0x31b/0xe40
Sep  1 09:53:41 bilbo kernel: [153120.060132]  [<ffffffff8100f37f>] ? xen_restore_fl_direct_end+0x0/0x1
Sep  1 09:53:41 bilbo kernel: [153120.060135]  [<ffffffff8155ab5e>] ? _spin_unlock_irqrestore+0x1e/0x30
Sep  1 09:53:41 bilbo kernel: [153120.060139]  [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
Sep  1 09:53:41 bilbo kernel: [153120.060142]  [<ffffffff8100f37f>] ? xen_restore_fl_direct_end+0x0/0x1
Sep  1 09:53:41 bilbo kernel: [153120.060145]  [<ffffffff8155ab5e>] ? _spin_unlock_irqrestore+0x1e/0x30
Sep  1 09:53:41 bilbo kernel: [153120.060148]  [<ffffffff81218c1d>] kjournald+0xed/0x250
Sep  1 09:53:41 bilbo kernel: [153120.060151]  [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060154]  [<ffffffff81218b30>] ? kjournald+0x0/0x250
Sep  1 09:53:41 bilbo kernel: [153120.060157]  [<ffffffff81084d16>] kthread+0x96/0xa0
Sep  1 09:53:41 bilbo kernel: [153120.060160]  [<ffffffff810141ea>] child_rip+0xa/0x20
Sep  1 09:53:41 bilbo kernel: [153120.060164]  [<ffffffff810133d1>] ? int_ret_from_sys_call+0x7/0x1b
Sep  1 09:53:41 bilbo kernel: [153120.060167]  [<ffffffff81013b5d>] ? retint_restore_args+0x5/0x6
Sep  1 09:53:41 bilbo kernel: [153120.060170]  [<ffffffff810141e0>] ? child_rip+0x0/0x20
Sep  1 09:53:41 bilbo kernel: [153120.060175] INFO: task flush-251:0:5836 blocked for more than 120 seconds.
Sep  1 09:53:41 bilbo kernel: [153120.060185] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 09:53:41 bilbo kernel: [153120.060191] flush-251:0   D ffff880003a72bc0     0  5836      2 0x00000000
Sep  1 09:53:41 bilbo kernel: [153120.060195]  ffff88003f1b78c0 0000000000000246 0000000000015bc0 0000000000015bc0
Sep  1 09:53:41 bilbo kernel: [153120.060200]  ffff88007a5dc890 ffff88003f1b7fd8 0000000000015bc0 ffff88007a5dc4d0
Sep  1 09:53:41 bilbo kernel: [153120.060205]  0000000000015bc0 ffff88003f1b7fd8 0000000000015bc0 ffff88007a5dc890
Sep  1 09:53:41 bilbo kernel: [153120.060209] Call Trace:
Sep  1 09:53:41 bilbo kernel: [153120.060212]  [<ffffffff8116e5d0>] ? sync_buffer+0x0/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060216]  [<ffffffff81558bb7>] io_schedule+0x47/0x70
Sep  1 09:53:41 bilbo kernel: [153120.060219]  [<ffffffff8116e615>] sync_buffer+0x45/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060222]  [<ffffffff8155908a>] __wait_on_bit_lock+0x5a/0xc0
Sep  1 09:53:41 bilbo kernel: [153120.060225]  [<ffffffff8116e5d0>] ? sync_buffer+0x0/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060228]  [<ffffffff8116eb60>] ? end_buffer_async_write+0x0/0x180
Sep  1 09:53:41 bilbo kernel: [153120.060231]  [<ffffffff81559168>] out_of_line_wait_on_bit_lock+0x78/0x90
Sep  1 09:53:41 bilbo kernel: [153120.060234]  [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060237]  [<ffffffff8116e796>] __lock_buffer+0x36/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060241]  [<ffffffff8116f473>] __block_write_full_page+0x383/0x3c0
Sep  1 09:53:41 bilbo kernel: [153120.060245]  [<ffffffff810f4444>] ? end_page_writeback+0x44/0x60
Sep  1 09:53:41 bilbo kernel: [153120.060248]  [<ffffffff8116eb60>] ? end_buffer_async_write+0x0/0x180
Sep  1 09:53:41 bilbo kernel: [153120.060251]  [<ffffffff8116fde0>] block_write_full_page_endio+0xe0/0x120
Sep  1 09:53:41 bilbo kernel: [153120.060255]  [<ffffffff811b70f0>] ? buffer_unmapped+0x0/0x20
Sep  1 09:53:41 bilbo kernel: [153120.060258]  [<ffffffff8116fe35>] block_write_full_page+0x15/0x20
Sep  1 09:53:41 bilbo kernel: [153120.060261]  [<ffffffff811b7bfd>] ext3_ordered_writepage+0x1dd/0x200
Sep  1 09:53:41 bilbo kernel: [153120.060264]  [<ffffffff810fc687>] __writepage+0x17/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060268]  [<ffffffff810fd83f>] write_cache_pages+0x21f/0x4d0
Sep  1 09:53:41 bilbo kernel: [153120.060271]  [<ffffffff810fc670>] ? __writepage+0x0/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060274]  [<ffffffff810fdb14>] generic_writepages+0x24/0x30
Sep  1 09:53:41 bilbo kernel: [153120.060277]  [<ffffffff810fdb55>] do_writepages+0x35/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060281]  [<ffffffff81166976>] writeback_single_inode+0xf6/0x3d0
Sep  1 09:53:41 bilbo kernel: [153120.060283]  [<ffffffff811675de>] writeback_inodes_wb+0x40e/0x5e0
Sep  1 09:53:41 bilbo kernel: [153120.060286]  [<ffffffff811678ba>] wb_writeback+0x10a/0x1d0
Sep  1 09:53:41 bilbo kernel: [153120.060290]  [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
Sep  1 09:53:41 bilbo kernel: [153120.060293]  [<ffffffff81558e6d>] ? schedule_timeout+0x19d/0x300
Sep  1 09:53:41 bilbo kernel: [153120.060296]  [<ffffffff81013b5d>] ? retint_restore_args+0x5/0x6
Sep  1 09:53:41 bilbo kernel: [153120.060299]  [<ffffffff81167bec>] wb_do_writeback+0x18c/0x1a0
Sep  1 09:53:41 bilbo kernel: [153120.060302]  [<ffffffff81167c53>] bdi_writeback_task+0x53/0xe0
Sep  1 09:53:41 bilbo kernel: [153120.060306]  [<ffffffff8110f546>] bdi_start_fn+0x86/0x100
Sep  1 09:53:41 bilbo kernel: [153120.060308]  [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100
Sep  1 09:53:41 bilbo kernel: [153120.060311]  [<ffffffff81084d16>] kthread+0x96/0xa0
Sep  1 09:53:41 bilbo kernel: [153120.060314]  [<ffffffff810141ea>] child_rip+0xa/0x20
Sep  1 09:53:41 bilbo kernel: [153120.060318]  [<ffffffff810133d1>] ? int_ret_from_sys_call+0x7/0x1b
Sep  1 09:53:41 bilbo kernel: [153120.060322]  [<ffffffff81013b5d>] ? retint_restore_args+0x5/0x6
Sep  1 09:53:41 bilbo kernel: [153120.060325]  [<ffffffff810141e0>] ? child_rip+0x0/0x20
Sep  1 09:53:41 bilbo kernel: [153120.060328] INFO: task smbd:30578 blocked for more than 120 seconds.
Sep  1 09:53:41 bilbo kernel: [153120.060334] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 09:53:41 bilbo kernel: [153120.060341] smbd          D ffff880003a72bc0     0 30578    727 0x00000000
Sep  1 09:53:41 bilbo kernel: [153120.060345]  ffff8800029ab8f8 0000000000000282 0000000000015bc0 0000000000015bc0
Sep  1 09:53:41 bilbo kernel: [153120.060350]  ffff880002b94890 ffff8800029abfd8 0000000000015bc0 ffff880002b944d0
Sep  1 09:53:41 bilbo kernel: [153120.060354]  0000000000015bc0 ffff8800029abfd8 0000000000015bc0 ffff880002b94890
Sep  1 09:53:41 bilbo kernel: [153120.060359] Call Trace:
Sep  1 09:53:41 bilbo kernel: [153120.060362]  [<ffffffff8116e5d0>] ? sync_buffer+0x0/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060365]  [<ffffffff81558bb7>] io_schedule+0x47/0x70
Sep  1 09:53:41 bilbo kernel: [153120.060368]  [<ffffffff8116e615>] sync_buffer+0x45/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060371]  [<ffffffff8155908a>] __wait_on_bit_lock+0x5a/0xc0
Sep  1 09:53:41 bilbo kernel: [153120.060374]  [<ffffffff8116e5d0>] ? sync_buffer+0x0/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060377]  [<ffffffff81559168>] out_of_line_wait_on_bit_lock+0x78/0x90
Sep  1 09:53:41 bilbo kernel: [153120.060381]  [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060384]  [<ffffffff8116e796>] __lock_buffer+0x36/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060387]  [<ffffffff8116f5ee>] sync_dirty_buffer+0xbe/0xe0
Sep  1 09:53:41 bilbo kernel: [153120.060390]  [<ffffffff81213b84>] journal_dirty_data+0x1d4/0x270
Sep  1 09:53:41 bilbo kernel: [153120.060393]  [<ffffffff8100f37f>] ? xen_restore_fl_direct_end+0x0/0x1
Sep  1 09:53:41 bilbo kernel: [153120.060396]  [<ffffffff811b7f50>] ext3_journal_dirty_data+0x20/0x50
Sep  1 09:53:41 bilbo kernel: [153120.060399]  [<ffffffff811b7fa5>] journal_dirty_data_fn+0x25/0x30
Sep  1 09:53:41 bilbo kernel: [153120.060401]  [<ffffffff811b7077>] walk_page_buffers+0x87/0xc0
Sep  1 09:53:41 bilbo kernel: [153120.060404]  [<ffffffff811b7f80>] ? journal_dirty_data_fn+0x0/0x30
Sep  1 09:53:41 bilbo kernel: [153120.060407]  [<ffffffff811bb623>] ext3_ordered_write_end+0x83/0x170
Sep  1 09:53:41 bilbo kernel: [153120.060411]  [<ffffffff810f3b22>] generic_perform_write+0x122/0x1d0
Sep  1 09:53:41 bilbo kernel: [153120.060414]  [<ffffffff811c2534>] ? __ext3_journal_stop+0x34/0x60
Sep  1 09:53:41 bilbo kernel: [153120.060417]  [<ffffffff810f4c53>] generic_file_buffered_write+0x73/0xd0
Sep  1 09:53:41 bilbo kernel: [153120.060421]  [<ffffffff810f6240>] __generic_file_aio_write+0x240/0x470
Sep  1 09:53:41 bilbo kernel: [153120.060424]  [<ffffffff8100f392>] ? check_events+0x12/0x20
Sep  1 09:53:41 bilbo kernel: [153120.060427]  [<ffffffff8100f37f>] ? xen_restore_fl_direct_end+0x0/0x1
Sep  1 09:53:41 bilbo kernel: [153120.060430]  [<ffffffff8113568b>] ? kmem_cache_free+0x9b/0x100
Sep  1 09:53:41 bilbo kernel: [153120.060433]  [<ffffffff810f64df>] generic_file_aio_write+0x6f/0xe0
Sep  1 09:53:41 bilbo kernel: [153120.060438]  [<ffffffff811443ea>] do_sync_write+0xfa/0x140
Sep  1 09:53:41 bilbo kernel: [153120.060441]  [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
Sep  1 09:53:41 bilbo kernel: [153120.060446]  [<ffffffff81252ea6>] ? security_file_permission+0x16/0x20
Sep  1 09:53:41 bilbo kernel: [153120.060449]  [<ffffffff811446e8>] vfs_write+0xb8/0x1a0
Sep  1 09:53:41 bilbo kernel: [153120.060452]  [<ffffffff8115418b>] ? do_fcntl+0x29b/0x3d0
Sep  1 09:53:41 bilbo kernel: [153120.060455]  [<ffffffff81145082>] sys_pwrite64+0x82/0xa0
Sep  1 09:53:41 bilbo kernel: [153120.060459]  [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b


On our mailserver :-
Filesystem suddenly went to readonly mode.

Check log files
Sep  1 15:04:15 palantir dovecot: IMAP(susan.quinton): Disconnected: Logged out bytes=536/8304
Sep  1 15:04:17 palantir dovecot: IMAP(carrb): read(mail, uid=50) failed: Input/output error
Sep  1 15:04:17 palantir kernel: [107643.450120] end_request: I/O error, dev xvdb, sector 846521103
Sep  1 15:04:17 palantir kernel: [107643.450120] end_request: I/O error, dev xvdb, sector 846521103
Sep  1 15:04:17 palantir kernel: [107643.450120] end_request: I/O error, dev xvdb, sector 846521103
Sep  1 15:04:17 palantir dovecot: IMAP(carrb): Disconnected: Internal error occurred. Refer to server log for more information. [2010-09-01 15:04:17] bytes=24614/89814044
Sep  1 15:04:17 palantir dovecot: IMAP(carrb): fdatasync(/var/qmail/maildirs/carrb/Maildir/.Faculty.John Garvey/dovecot-uidlist.tmp) failed: Input/output error
Sep  1 15:04:17 palantir dovecot: IMAP(carrb): unlink(/var/qmail/maildirs/carrb/Maildir/.Faculty.John Garvey/dovecot-uidlist.tmp) failed: Read-only file system
Sep  1 15:04:17 palantir dovecot: IMAP(carrb): unlink(/var/qmail/maildirs/carrb/Maildir/.Faculty.John Garvey/dovecot-uidlist.lock) failed: Read-only file system
Sep  1 15:04:17 palantir kernel: [107643.450623] end_request: I/O error, dev xvdb, sector 845873895
Sep  1 15:04:17 palantir kernel: [107643.450623] Buffer I/O error on device dm-0, logical block 105734181
Sep  1 15:04:17 palantir kernel: [107643.450623] lost page write due to I/O error on dm-0
Sep  1 15:04:17 palantir kernel: [107643.450623] end_request: I/O error, dev xvdb, sector 512799183
Sep  1 15:04:17 palantir kernel: [107643.450623] Buffer I/O error on device dm-0, logical block 64099842
Sep  1 15:04:17 palantir kernel: [107643.450623] lost page write due to I/O error on dm-0
Sep  1 15:04:17 palantir kernel: [107643.450623] end_request: I/O error, dev xvdb, sector 4235143
Sep  1 15:04:17 palantir kernel: [107643.450623] Buffer I/O error on device dm-0, logical block 529337
Sep  1 15:04:17 palantir kernel: [107643.450623] lost page write due to I/O error on dm-0
Sep  1 15:04:17 palantir kernel: [107643.450623] end_request: I/O error, dev xvdb, sector 512758263
Sep  1 15:04:17 palantir dovecot: imap-login: Login: user=<carrb>, method=PLAIN, rip=10.100.6.65, lip=192.168.100.14
Sep  1 15:04:17 palantir dovecot: IMAP(carrb): open() failed with file /var/qmail/maildirs/carrb/Maildir/.Faculty.John Garvey/dovecot.index.log: Read-only file system
Sep  1 15:04:17 palantir dovecot: IMAP(carrb): open(/var/qmail/maildirs/carrb/Maildir/.Faculty.John Garvey/dovecot-uidlist) failed: Read-only file system
Sep  1 15:04:17 palantir kernel: [107643.450623] Buffer I/O error on device dm-0, logical block 64094727
Sep  1 15:04:17 palantir kernel: [107643.450623] lost page write due to I/O error on dm-0

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

-- 
10.04 LTS Servers Hang with task kjournald blocked for more than 120 seconds
https://bugs.launchpad.net/bugs/628530
You received this bug notification because you are a member of Kernel
Bugs, which is subscribed to linux in ubuntu.




More information about the kernel-bugs mailing list