[Bug 1865523] Re: [bionic] fence_scsi not working properly with Pacemaker 1.1.18-2ubuntu1.1
Rafael David Tinoco
rafaeldtinoco at ubuntu.com
Mon Mar 16 12:14:38 UTC 2020
** Description changed:
OBS: I have split this bug into 2 bugs:
- fence-agents (this) and pacemaker (LP: #1866119)
#### SRU: fence-agents
[Impact]
* fence_scsi is currently not working in a shared-disk environment
* all clusters relying on fence_scsi and/or fence_scsi + watchdog won't
be able to start the fencing agents OR, in worst-case scenarios, the
fence_scsi agent might start but won't make SCSI reservations on the
shared SCSI disk.
[Test Case]
* having a 3-node setup, nodes called "clubionic01, clubionic02,
clubionic03", with a shared scsi disk (fully supporting persistent
reservations) /dev/sda, one might try the following command:
sudo fence_scsi --verbose -n clubionic01 -d /dev/sda -k 3abe0000 -o off
from nodes "clubionic02 or clubionic03" and check if the reservation
worked:
(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there are NO registered reservation keys
(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist -r /dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there is NO reservation held
* having a 3-node setup, nodes called "clubionic01, clubionic02,
clubionic03", with a shared scsi disk (fully supporting persistent
reservations) /dev/sda, with corosync and pacemaker operational and
running, one might try:
rafaeldtinoco@clubionic01:~$ crm configure
crm(live)configure# property stonith-enabled=on
crm(live)configure# property stonith-action=off
crm(live)configure# property no-quorum-policy=stop
crm(live)configure# property have-watchdog=true
crm(live)configure# property symmetric-cluster=true
crm(live)configure# commit
crm(live)configure# end
crm(live)# end
rafaeldtinoco@clubionic01:~$ crm configure primitive fence_clubionic \
stonith:fence_scsi params \
pcmk_host_list="clubionic01 clubionic02 clubionic03" \
devices="/dev/sda" \
meta provides=unfencing
And see that crm_mon does not show the fence_clubionic resource as operational.
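The sg_persist check in the first test case can also be scripted. The following is a minimal sketch, not part of the fix: `registered_keys` is a hypothetical helper, the "working" sample output and its key values are assumptions about how sg_persist lists registered keys (one hex value per line), and only the "broken" sample is taken from this report.

```python
import re


def registered_keys(sg_persist_output):
    """Return reservation keys found in `sg_persist --in --read-keys` output."""
    if "NO registered reservation keys" in sg_persist_output:
        return []
    # Assumption: each registered key is printed alone on a line as 0x<hex>.
    return re.findall(r"^\s*(0x[0-9a-fA-F]+)\s*$", sg_persist_output,
                      flags=re.MULTILINE)


# Output copied from this bug report (fencing not working).
BROKEN = """\
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there are NO registered reservation keys
"""

# Assumed shape of the output once keys are registered; key values are made up.
WORKING = """\
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x2, 2 registered reservation keys follow:
    0x3abe0001
    0x3abe0002
"""

if __name__ == "__main__":
    print("broken :", registered_keys(BROKEN))
    print("working:", registered_keys(WORKING))
```

An empty list corresponds to the failure described in [Impact]: the agent may start, but no key is registered on the shared disk.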
[Regression Potential]
- * Fix involves adding new cmdline and stdin arguments to the fencing
+ * Fix involves adding new cmdline and stdin arguments to the fencing
agents. Both changes in that direction (normalizing "-" with "_" and
deprecating some commands in favor of others) keep the existing commands
working and allow the new commands to work as well (that part is the
fix, because of the integration with pacemaker).
* Comments #3 and #4 show this new version fully working.
- * This fix has a potential of breaking other "nowadays working" fencing
- agent. If that happens, I suggest that ones affected revert previous to
- previous package AND open a bug against either pacemaker and/or fence-
- agents.
+ * This is a quite complex change and I'd appreciate leaving it in -proposed for a
+ while longer (15 days?) for a higher chance to detect issues. Furthermore,
+ there has been no update since the bionic release, so users could, in the
+ worst case (and only then), report a bug and downgrade to the former version.
* Judging by this issue, it is very likely that any Ubuntu user who
has tried using fence_scsi has migrated to a newer version, because
the fence_scsi agent has been broken since its release.
* The way I fixed fence_scsi was this:
I packaged pacemaker in latest 1.1.X version and kept it "vanilla" so I
could bisect fence-agents. At that moment I realized that bisecting was
going to be hard because there were multiple issues, not only one. I
backported the latest fence-agents together with Pacemaker 1.1.19-0 and
saw that it worked.
From then on, I bisected the following intervals:
4.3.0 .. 4.4.0 (eoan - working)
4.2.0 .. 4.3.0
4.1.0 .. 4.2.0
4.0.25 .. 4.1.0 (bionic - not working)
In each of those intervals I discovered issues. For example, using 4.3.0
I faced problems, so I had to backport fixes from the 4.3.0 <-> 4.4.0
interval. Then, backporting 4.2.0, I faced issues, so I had to backport
fixes from the 4.2.0 <-> 4.3.0 interval. I did this until I reached
version 4.0.25, the current Bionic fence-agents version.
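The interval list above follows a simple pattern: adjacent releases are paired and walked from the newest (known working) back to the oldest (known broken). A small sketch of that bookkeeping, assuming plain dotted version strings; `bisect_intervals` is a hypothetical helper and not a tool used in this bug:

```python
def bisect_intervals(releases):
    """Pair adjacent releases and return the pairs, newest interval first."""
    ordered = sorted(releases, key=lambda v: tuple(int(p) for p in v.split(".")))
    pairs = list(zip(ordered, ordered[1:]))
    return list(reversed(pairs))


# fence-agents releases named in this report.
RELEASES = ["4.0.25", "4.1.0", "4.2.0", "4.3.0", "4.4.0"]

if __name__ == "__main__":
    for old, new in bisect_intervals(RELEASES):
        print(f"{old} .. {new}")
```

Each printed interval would then be bisected on its own, carrying back the fixes found in the newer intervals.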
[Other Info]
* Original Description:
Trying to set up a cluster with an iSCSI shared disk, using fence_scsi as
the fencing mechanism, I realized that fence_scsi is not working in
Ubuntu Bionic. I first thought it was related to the Azure environment (LP:
#1864419), where I was trying this setup, but then, trying
locally, I figured out that somehow pacemaker 1.1.18 is not fencing the
shared SCSI disk properly.
Note: I was able to "backport" vanilla 1.1.19 from upstream and
fence_scsi worked. I then tried 1.1.18 without any of the quilt patches
and it didn't work either. I think that bisecting 1.1.18 <-> 1.1.19
might tell us which commit fixed the behaviour needed by the
fence_scsi agent.
(k)rafaeldtinoco@clubionic01:~$ crm conf show
node 1: clubionic01.private
node 2: clubionic02.private
node 3: clubionic03.private
primitive fence_clubionic stonith:fence_scsi \
params pcmk_host_list="10.250.3.10 10.250.3.11 10.250.3.12" devices="/dev/sda" \
meta provides=unfencing
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.18-2b07d5c5a9 \
cluster-infrastructure=corosync \
cluster-name=clubionic \
stonith-enabled=on \
stonith-action=off \
no-quorum-policy=stop \
symmetric-cluster=true
----
(k)rafaeldtinoco@clubionic02:~$ sudo crm_mon -1
Stack: corosync
Current DC: clubionic01.private (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon Mar 2 15:55:30 2020
Last change: Mon Mar 2 15:45:33 2020 by root via cibadmin on clubionic01.private
3 nodes configured
1 resource configured
Online: [ clubionic01.private clubionic02.private clubionic03.private ]
Active resources:
fence_clubionic (stonith:fence_scsi): Started clubionic01.private
----
(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there are NO registered reservation keys
(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist -r /dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there is NO reservation held
** Description changed:
OBS: I have split this bug into 2 bugs:
- fence-agents (this) and pacemaker (LP: #1866119)
#### SRU: fence-agents
[Impact]
* fence_scsi is currently not working in a shared-disk environment
* all clusters relying on fence_scsi and/or fence_scsi + watchdog won't
be able to start the fencing agents OR, in worst-case scenarios, the
fence_scsi agent might start but won't make SCSI reservations on the
shared SCSI disk.
[Test Case]
* having a 3-node setup, nodes called "clubionic01, clubionic02,
clubionic03", with a shared scsi disk (fully supporting persistent
reservations) /dev/sda, one might try the following command:
sudo fence_scsi --verbose -n clubionic01 -d /dev/sda -k 3abe0000 -o off
from nodes "clubionic02 or clubionic03" and check if the reservation
worked:
(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there are NO registered reservation keys
(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist -r /dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there is NO reservation held
* having a 3-node setup, nodes called "clubionic01, clubionic02,
clubionic03", with a shared scsi disk (fully supporting persistent
reservations) /dev/sda, with corosync and pacemaker operational and
running, one might try:
rafaeldtinoco@clubionic01:~$ crm configure
crm(live)configure# property stonith-enabled=on
crm(live)configure# property stonith-action=off
crm(live)configure# property no-quorum-policy=stop
crm(live)configure# property have-watchdog=true
crm(live)configure# property symmetric-cluster=true
crm(live)configure# commit
crm(live)configure# end
crm(live)# end
rafaeldtinoco@clubionic01:~$ crm configure primitive fence_clubionic \
stonith:fence_scsi params \
pcmk_host_list="clubionic01 clubionic02 clubionic03" \
devices="/dev/sda" \
meta provides=unfencing
And see that crm_mon does not show the fence_clubionic resource as operational.
[Regression Potential]
* Fix involves adding new cmdline and stdin arguments to the fencing
agents. Both changes in that direction (normalizing "-" with "_" and
deprecating some commands in favor of others) keep the existing commands
working and allow the new commands to work as well (that part is the
fix, because of the integration with pacemaker).
* Comments #3 and #4 show this new version fully working.
- * This is a quite complex change and I'd appreciate leaving it in -proposed for a
+ * This is a quite complex change and I'd appreciate leaving it in -proposed for a
while longer (15 days?) for a higher chance to detect issues. Furthermore,
there has been no update since the bionic release, so users could, in the
worst case (and only then), report a bug and downgrade to the former version.
* Judging by this issue, it is very likely that any Ubuntu user who
has tried using fence_scsi has migrated to a newer version, because
the fence_scsi agent has been broken since its release.
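The idea of normalizing "-" with "_" while keeping existing commands working can be illustrated with a short sketch. This is not the fence-agents code: `parse_stdin_options` is a hypothetical helper and `some-option` is a made-up option name, used only to show both spellings resolving to the same key when options arrive as key=value lines on stdin.

```python
def parse_stdin_options(text):
    """Parse key=value lines, treating '-' and '_' in option names alike."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Only the option name is normalized; the value is kept as given.
        options[name.strip().replace("-", "_")] = value.strip()
    return options


if __name__ == "__main__":
    old_style = "action=off\nsome-option=/dev/sda\n"
    new_style = "action=off\nsome_option=/dev/sda\n"
    print(parse_stdin_options(old_style) == parse_stdin_options(new_style))
```

Because both spellings map to one key, callers using the older form keep working while the newer form is accepted as well.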
+
+ [Other Info]
* The way I fixed fence_scsi was this:
I packaged pacemaker in latest 1.1.X version and kept it "vanilla" so I
could bisect fence-agents. At that moment I realized that bisecting was
going to be hard because there were multiple issues, not only one. I
backported the latest fence-agents together with Pacemaker 1.1.19-0 and
saw that it worked.
From then on, I bisected the following intervals:
4.3.0 .. 4.4.0 (eoan - working)
4.2.0 .. 4.3.0
4.1.0 .. 4.2.0
4.0.25 .. 4.1.0 (bionic - not working)
In each of those intervals I discovered issues. For example, using 4.3.0
I faced problems, so I had to backport fixes from the 4.3.0 <-> 4.4.0
interval. Then, backporting 4.2.0, I faced issues, so I had to backport
fixes from the 4.2.0 <-> 4.3.0 interval. I did this until I reached
version 4.0.25, the current Bionic fence-agents version.
-
- [Other Info]
* Original Description:
Trying to set up a cluster with an iSCSI shared disk, using fence_scsi as
the fencing mechanism, I realized that fence_scsi is not working in
Ubuntu Bionic. I first thought it was related to the Azure environment (LP:
#1864419), where I was trying this setup, but then, trying
locally, I figured out that somehow pacemaker 1.1.18 is not fencing the
shared SCSI disk properly.
Note: I was able to "backport" vanilla 1.1.19 from upstream and
fence_scsi worked. I then tried 1.1.18 without any of the quilt patches
and it didn't work either. I think that bisecting 1.1.18 <-> 1.1.19
might tell us which commit fixed the behaviour needed by the
fence_scsi agent.
(k)rafaeldtinoco@clubionic01:~$ crm conf show
node 1: clubionic01.private
node 2: clubionic02.private
node 3: clubionic03.private
primitive fence_clubionic stonith:fence_scsi \
params pcmk_host_list="10.250.3.10 10.250.3.11 10.250.3.12" devices="/dev/sda" \
meta provides=unfencing
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.18-2b07d5c5a9 \
cluster-infrastructure=corosync \
cluster-name=clubionic \
stonith-enabled=on \
stonith-action=off \
no-quorum-policy=stop \
symmetric-cluster=true
----
(k)rafaeldtinoco@clubionic02:~$ sudo crm_mon -1
Stack: corosync
Current DC: clubionic01.private (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon Mar 2 15:55:30 2020
Last change: Mon Mar 2 15:45:33 2020 by root via cibadmin on clubionic01.private
3 nodes configured
1 resource configured
Online: [ clubionic01.private clubionic02.private clubionic03.private ]
Active resources:
fence_clubionic (stonith:fence_scsi): Started clubionic01.private
----
(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there are NO registered reservation keys
(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist -r /dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there is NO reservation held
--
You received this bug notification because you are a member of Ubuntu
Server, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1865523
Title:
[bionic] fence_scsi not working properly with Pacemaker
1.1.18-2ubuntu1.1
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/fence-agents/+bug/1865523/+subscriptions