[Bug 1412962] [NEW] Pacemaker (stonith) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it

Rafael David Tinoco inaddy at inaddy.org
Tue Jan 20 20:56:08 UTC 2015


Public bug reported:

It was brought to my attention that pacemaker (stonith) can seg fault under some conditions. The problem
came up while I was working on the following bug:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/

So you can check the problem here:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/34
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/35
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/36
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/37
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/38

And a possible explanation here:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/39
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/40

(Copy and pasting here):

So the cherry-pick (for version
trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on an upstream
commit) seems ok, since it makes lrmd (services, services_linux) avoid
repeating a timer when the source was already removed from the glib main
loop context:

example:

+ if (op->opaque->repeat_timer) {
+ g_source_remove(op->opaque->repeat_timer);
++ op->opaque->repeat_timer = 0;

etc...

This actually solved lrmd crashes I was getting with the testcase
(explained inside this bug summary).

===
Explanation:
g_source_remove -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
libglib2 changes -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
===

Analyzing your crash file (from stonith and not lrm), it looks like we
have the following scenario:

==============

exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
    |_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> calls g_source_remove 2 times
        |_> stonith_action_clear_tracking_data(action);
            |_> g_source_remove(action->timer_sigterm);
                |_> g_critical ("Source ID %u was not found when attempting to remove it", tag);

WHERE
==============

Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy 
"Helper that presents a RHCS-style interface for Linux-HA stonith plugins"

This is the script responsible for monitoring a stonith resource, and it
returned (triggering the monitor callback) with the following data:

------ data (begin) ------
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_timeout=20
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------

OBS: This means that fence_legacy returned, after checking that
st_kjpnode2 was ok, and its cleanup operation (callback) caused
the problem we faced.

As soon as it dies, the callback for this process is called:

    if (child->callback) {
        child->callback(child, child->pid, core, signo, exitcode);

In our case, callback is:

0x7f1f6189cec0 <stonith_action_async_done> which calls
0x7f1f6189af10 <stonith_action_destroy> and then
0x7f1f6189ae60 <stonith_action_clear_tracking_data> generating the 2nd removal (g_source_remove)

with the 2nd call to g_source_remove, after glib2.0 change explained
before this comment, we get a

g_critical ("Source ID %u was not found when attempting to remove it",
tag);

and this generates the crash (since g_log is called with a critical
log_level, causing crm_abort to be called).
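
To make the chain above easier to follow, here is a simplified model of
it in plain C. It is only a sketch: it does not use glib or Pacemaker
code, every model_* name is hypothetical, and the place where the report
says crm_abort is reached is replaced by a counter so the example can run
without dumping core.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical log levels, loosely modelled on glib's G_LOG_LEVEL_* flags. */
enum model_log_level { MODEL_LOG_WARNING, MODEL_LOG_CRITICAL };

typedef void (*model_log_handler)(enum model_log_level level,
                                  const char *message);

/* Currently installed handler (assumption: the daemon installs its own). */
model_log_handler model_handler;

/* How many times the handler asked for an abort. */
int model_abort_requests;

/* Stand-in for a daemon log handler that treats critical messages as fatal. */
void model_daemon_handler(enum model_log_level level, const char *message)
{
    fprintf(stderr, "log: %s\n", message);
    if (level == MODEL_LOG_CRITICAL) {
        /* Per the report, crm_abort is reached here; the model only counts. */
        model_abort_requests++;
    }
}

/* Stand-in for the library side: a failed removal is logged as critical. */
void model_report_missing_source(unsigned tag)
{
    char message[96];

    snprintf(message, sizeof message,
             "Source ID %u was not found when attempting to remove it", tag);
    assert(model_handler != NULL);
    model_handler(MODEL_LOG_CRITICAL, message);
}
```

Installing model_daemon_handler in model_handler and then calling
model_report_missing_source(13) increments model_abort_requests once,
which is the step where the real daemon would go down.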

POSSIBLE CAUSE:
==============

Under <stonith_action_async_done> we have:

stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
    }

Under <stonith_action_destroy> we have stonith_action_t *action = 0x7f1f639f5b50.
and a call to: stonith_action_clear_tracking_data(action);

Under stonith_action_clear_tracking_data(stonith_action_t * action) we
have AGAIN:

stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
        action->timer_sigterm = 0;
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
        action->timer_sigkill = 0;
    }

This logic probably triggered the same problem the cherry pick addressed
for lrmd, but now for stonith (calling g_source_remove 2 times for the
same source after glib2.0 was changed).
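
The double removal can be reproduced in a small stand-alone model. This
is a sketch under assumptions, not the Pacemaker or glib code: the source
table, struct model_action and all model_* functions are hypothetical
stand-ins, and the call from async_done into the clearing function is
shortened (in the stack above it goes through stonith_action_destroy()).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_SOURCES 32

bool model_attached[MODEL_MAX_SOURCES]; /* index = source ID, 0 is unused */
unsigned model_not_found;               /* count of "not found" removals */

struct model_action {
    unsigned timer_sigterm;
    unsigned timer_sigkill;
};

/* Attaches a fake source and returns its ID (0 means the table is full). */
unsigned model_source_add(void)
{
    for (unsigned id = 1; id < MODEL_MAX_SOURCES; id++) {
        if (!model_attached[id]) {
            model_attached[id] = true;
            return id;
        }
    }
    return 0;
}

/* Removing an ID that is not attached is counted instead of logged. */
bool model_source_remove(unsigned id)
{
    assert(id < MODEL_MAX_SOURCES);
    if (!model_attached[id]) {
        model_not_found++;
        return false;
    }
    model_attached[id] = false;
    return true;
}

/* Models stonith_action_clear_tracking_data(): remove, then reset. */
void model_clear_tracking_data(struct model_action *a)
{
    if (a->timer_sigterm > 0) {
        model_source_remove(a->timer_sigterm);
        a->timer_sigterm = 0;
    }
    if (a->timer_sigkill > 0) {
        model_source_remove(a->timer_sigkill);
        a->timer_sigkill = 0;
    }
}

/* Models stonith_action_async_done(); reset_ids selects the behaviour
 * without (false) or with (true) the reset of the stored IDs. */
void model_async_done(struct model_action *a, bool reset_ids)
{
    if (a->timer_sigterm > 0) {
        model_source_remove(a->timer_sigterm);
        if (reset_ids) {
            a->timer_sigterm = 0;
        }
    }
    if (a->timer_sigkill > 0) {
        model_source_remove(a->timer_sigkill);
        if (reset_ids) {
            a->timer_sigkill = 0;
        }
    }
    model_clear_tracking_data(a); /* reached via the destroy step */
}

/* Runs one monitor completion; returns the number of failed removals. */
unsigned model_run(bool reset_ids)
{
    struct model_action a = { model_source_add(), model_source_add() };

    model_not_found = 0;
    model_async_done(&a, reset_ids);
    return model_not_found;
}
```

In this model, model_run(false) returns 2 (one failed removal per timer)
and model_run(true) returns 0, because the zeroed IDs make the second
pass skip both removals.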

##############

commit 0326f05c9e26f39a394fa30830e31a76306f49c7
Author: Andrew Beekhof <andrew at beekhof.net>
Date: Thu Aug 7 13:49:24 2014 +1000

    Fix: stonith-ng: Reset mainloop source IDs after removing them

diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
index 64bd8f3..2837682 100644
--- a/lib/fencing/st_client.c
+++ b/lib/fencing/st_client.c
@@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,

     if (action->timer_sigterm > 0) {
         g_source_remove(action->timer_sigterm);
+        action->timer_sigterm = 0;
     }
     if (action->timer_sigkill > 0) {
         g_source_remove(action->timer_sigkill);
+        action->timer_sigkill = 0;
     }

     if (action->last_timeout_signo) {

##############

This change is under <stonith_action_async_done>.

I will provide a hotfix containing this fix and ask for feedback.

** Affects: pacemaker (Ubuntu)
     Importance: Undecided
     Assignee: Rafael David Tinoco (inaddy)
         Status: In Progress


** Tags: cts

** Changed in: pacemaker (Ubuntu)
     Assignee: (unassigned) => Rafael David Tinoco (inaddy)

** Summary changed:

- Stonith can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it
+ Pacemaker (stonith) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it

** Tags added: cts

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1412962
