[ec2] Instances failing in a weird way

Tue Jul 14 00:38:59 BST 2009

Greetings,

I had originally written up a report of some odd behavior that I was
seeing, until this bug report was pointed out to me (my original
write-up is below for all the details):

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/276476

Basically, I'm seeing the behavior described in that bug.  My ec2 image
is based on the last 64-bit Intrepid beta AMI (unfortunately I didn't
write down the AMI id) and the kernel ID is aki-38c12651.

My questions are:

Is anyone else seeing this behavior?
Does anyone have a workaround?
Are there any other official kernels available on ec2, and if so is
there a list of them?
Does anyone know if/when this bug is going to be fixed?

Thanks!

Jeremy

Original report:
-----------------
I've had two instances go down in the last week in a weird way, and
although they were different, I think they might be related.  I was
hoping someone could guide me in further investigation of the cause,
so that I could prevent the issue from recurring.

Both of these are large instances.  I rebooted one to fix the problem;
the other I replaced with a new instance, leaving the old one running
for further investigation.

When the first instance stopped working correctly, here were the symptoms:

Couldn't ssh in (error on the remote was "connection reset by peer")
Log files stopped writing
Webserver still serving data
Existing ssh sessions still worked, but couldn't sudo
Touching new files worked
Disk was not full
lsof reported far fewer open files than the system max
DNS was working partially if started by hand, but refused to start via
the init script
Load was normal
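
For anyone who wants to repeat the disk and file handle checks from the
list above, here is a minimal sketch.  It assumes a Linux host where
/proc/sys/fs/file-nr is readable; the function names are made up for
illustration and this is not the exact procedure I followed at the time
(I used lsof and df by hand).

```python
import shutil


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated, unused and maximum file handles."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def handle_usage(text):
    """Fraction of the system-wide file handle limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def disk_usage_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    with open("/proc/sys/fs/file-nr") as f:
        print("file handles in use: {:.1%}".format(handle_usage(f.read())))
    print("root filesystem used: {:.1%}".format(disk_usage_fraction("/")))
```

Both numbers being well below 100% would match what I saw, which is
why resource exhaustion did not look like the cause.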

The console log showed this:

[3989783.026931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4076193.482201] INFO: task cron:11171 blocked for more than 120 seconds.
[4076193.482222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162603.921175] INFO: task cron:11171 blocked for more than 120 seconds.
[4162603.921193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4249014.372240] INFO: task cron:11171 blocked for more than 120 seconds.
[4249014.372258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4335424.821709] INFO: task cron:11171 blocked for more than 120 seconds.
[4335424.821749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4421835.272209] INFO: task cron:11171 blocked for more than 120 seconds.
[4421835.272230] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4508245.709979] INFO: task cron:11171 blocked for more than 120 seconds.
[4508245.710000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4594656.156967] INFO: task cron:11171 blocked for more than 120 seconds.
[4594656.156987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
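
The bracketed numbers are kernel timestamps in seconds since boot, so
the spacing between reports can be read straight out of the console
log.  A rough sketch for doing that follows; the regular expression is
written against the message format shown above, and the function names
are only illustrative.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [4076193.482201] INFO: task cron:11171 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_task_reports(log_text):
    """Map (command, pid) to the kernel timestamps at which it was reported."""
    reports = defaultdict(list)
    for match in HUNG_TASK.finditer(log_text):
        key = (match.group("comm"), int(match.group("pid")))
        reports[key].append(float(match.group("ts")))
    return dict(reports)


def gaps(timestamps):
    """Seconds between consecutive reports for one task."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
```

Run against the excerpt above, the gaps for cron:11171 come out at
roughly 86,410 seconds each, i.e. a little over a day between reports
for the same blocked process.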

The instance is still up.  I've since lost my shell, but perhaps I can
get some info from the console or the underlying OS (via Amazon support).

The second instance I rebooted a few days ago.  Its symptoms were similar:

Couldn't ssh in
Log files not writing
Disks not full
Filehandles not out of range

The difference is that on this machine, the load just kept climbing.
It reached 72 before we rebooted.
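
On Linux the load average counts tasks in uninterruptible sleep (state
D) as well as runnable ones, and the hung task messages are about tasks
stuck in that state, so listing D-state processes is one way to look
into a climbing load.  A minimal sketch, assuming a shell is still
available on the box and /proc is readable (helper names are
illustrative only):

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, command) for processes currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, command, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited (or was unreadable) while scanning
        if state == "D":
            blocked.append((pid, command))
    return blocked


if __name__ == "__main__":
    print("load averages:", os.getloadavg())
    for pid, command in uninterruptible_tasks():
        print(pid, command)
```

This only looks at processes, not individual threads, and I have not
run it on the affected instances.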

I saw a lot of entries like this in the console before rebooting:

[2333582.440859] INFO: task cron:17810 blocked for more than 120 seconds.
[2333582.440864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2333582.440926] INFO: task cron:18074 blocked for more than 120 seconds.
[2333582.440931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2333582.440984] INFO: task cron:18075 blocked for more than 120 seconds.
[2333582.440989] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2333582.441043] INFO: task cron:18076 blocked for more than 120 seconds.
[2333582.441048] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2333582.441102] INFO: task cron:18077 blocked for more than 120 seconds.
[2333582.441107] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2333582.441163] INFO: task cron:18081 blocked for more than 120 seconds.
[2333582.441168] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Just like the other instance.

So, my question is, do you know what causes this type of failure, and
how can I avoid it in the future?

Thanks!