Server increasing load due to increasing processes in D state

Eduardo Damato eduardo.damato at canonical.com
Mon Feb 25 14:39:30 UTC 2013


Hi Alessandro,

What's the node you're having problems with? Is this a compute node? Can
you give more information on the layout of your nova installation? I can
see that qemu and rabbit-mq are running on the same node. Do you use the
compute node as an MQ node as well?

The problem here seems more likely to be related to the kernel, since many
tasks are stuck in the same wait channel (WCHAN).
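
To put a number on "many tasks in the same wait channel", something like
the following could be used. It is only a sketch: count_d_by_wchan is a
name I made up, and it assumes the procps ps, whose state and wchan
columns are the first two fields of each row.

```shell
# Count D-state tasks per wait channel.
# Reads rows of the form "STATE WCHAN ..." on stdin; the ps header row
# is skipped automatically because its first field is "S", not "D".
count_d_by_wchan() {
    awk '$1 ~ /^D/ { n[$2]++ } END { for (w in n) print n[w], w }' | sort -rn
}

# Live usage (assumes procps ps):
#   ps -eo state,wchan:32,pid,comm | count_d_by_wchan
```

If one wait channel accounts for nearly all the D-state tasks, that is
the function to look for in the kernel traces.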

Ideally it would be good to have the output of sysrq-t from this system,
but this can cause the system to hang or crash depending on what the
status is, especially because we already know that there are many
task_structs blocked in the same place.

You could do:

# echo t > /proc/sysrq-trigger
(wait 5 s)
# echo t > /proc/sysrq-trigger
(wait 5 s)
# echo t > /proc/sysrq-trigger

And then we can have a look at the traces and see if they're moving or not.
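
The same sequence can be wrapped in a small function so the dumps are
evenly spaced. This is a sketch, not a tested procedure for your system:
dump_task_traces and its parameters are hypothetical names, and the
trigger path is a parameter only so the function can be dry-run against
a regular file first.

```shell
# Request a task dump (SysRq "t") several times, pausing between dumps.
#   $1  trigger file  (default: /proc/sysrq-trigger, needs root)
#   $2  number of dumps       (default: 3)
#   $3  seconds between dumps (default: 5)
dump_task_traces() {
    trigger=${1:-/proc/sysrq-trigger}
    count=${2:-3}
    interval=${3:-5}
    i=1
    while [ "$i" -le "$count" ]; do
        echo t > "$trigger" || return 1
        echo "dump $i of $count requested"
        # No need to wait after the last dump.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}

# Usage as root:
#   dump_task_traces
```

The traces end up in the kernel log, so afterwards they can be collected
with dmesg or from the syslog files, assuming the log buffer is large
enough to hold all three dumps.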

lsof is blocked reading the memory maps of process 1227. This could lead
to more information on the problem, but at the same time, because there
are so many blocked processes, it could be just another symptom of the
problem and not a hint at the reason why this is happening.
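
Since your strace shows that reading /proc/1227/stat still returns, the
state of a task can be checked without touching its maps file. A minimal
sketch, where stat_state is a name I made up:

```shell
# Print the state field from a /proc/<pid>/stat line given on stdin.
# Everything up to the last ") " is stripped first, so a command name
# containing spaces or parentheses cannot shift the fields.
stat_state() {
    sed 's/^.*) //' | cut -d' ' -f1
}

# Usage:
#   stat_state < /proc/1227/stat
```

On kernels that provide them, /proc/1227/wchan and (as root)
/proc/1227/stack may also show where the task is sleeping, although I
cannot say for sure that they will not block on this system.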

Without kernel traces (sysrq-t) or a vmcore it will be difficult to
understand what's happening. It doesn't seem to be I/O related.

Cheers,
Eduardo.

On 25/02/13 12:10, Alessandro Tagliapietra wrote:
> After running strace on lsof, I've seen that it hangs on:
>
> stat("/proc/1227/", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
> open("/proc/1227/stat", O_RDONLY)       = 4
> read(4, "1227 (nova-dhcpbridge) D 1224 25"..., 4096) = 242
> close(4)                                = 0
> readlink("/proc/1227/cwd", "/"..., 4096) = 1
> stat("/proc/1227/cwd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> readlink("/proc/1227/root", "/", 4096)  = 1
> stat("/proc/1227/root", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> readlink("/proc/1227/exe", "/usr/bin/python2.7"..., 4096) = 18
> stat("/proc/1227/exe", {st_mode=S_IFREG|0755, st_size=2989480, ...}) = 0
> open("/proc/1227/maps", O_RDONLY)       = 4
> read(4,
> Could it be a memory issue?
> I cannot run a memory test right now, maybe tomorrow. I just want to know if someone else has had the same issue.
> Thanks in advance
> --
>
> Alessandro Tagliapietra
> alexfu.it <http://www.alexfu.it>
>
> On Monday, 25 February 2013, at 12:29, Alessandro
> Tagliapietra wrote:
>
>> Hello guys,
>>
>> at work we have an OpenStack controller whose load, for some months
>> now, has started to increase after a few days of uptime.
>>
>> I've seen that the cause is that processes sometimes hang and remain
>> in D state.
>>
>> I've used some combination of ps args to get these outputs:
>>
>> http://pastebin.com/raw.php?i=LGGzGrWu
>> http://pastie.org/pastes/6332964/text
>> http://pastie.org/pastes/6332979/text
>>
>> The HDD is a soft RAID1 over 2 disks, whose SMART values are fine.
>>
>> Commands like lsof or strace on a D-state process don't return.
>>
>> Any idea on what could be the cause?
>>
>> Thanks in advance
>>
>> --
>>
>> Alessandro Tagliapietra
>> alexfu.it <http://www.alexfu.it>
>
>
>


More information about the ubuntu-server mailing list