[Bug 1865523] Re: [bionic] fence_scsi not working properly with Pacemaker 1.1.18-2ubuntu1.1

Rafael David Tinoco rafaeldtinoco at ubuntu.com
Mon Mar 16 12:14:38 UTC 2020


** Description changed:

  OBS: I have split this bug into 2 bugs:
       - fence-agents (this) and pacemaker (LP: #1866119)
  
  #### SRU: fence-agents
  
  [Impact]
  
   * fence_scsi is not currently working in a shared disk environment
  
   * all clusters relying on fence_scsi and/or fence_scsi + watchdog won't
  be able to start the fencing agents OR, in worst case scenarios, the
  fence_scsi agent might start but won't make scsi reservations in the
  shared scsi disk.
  
  [Test Case]
  
   * having a 3-node setup, nodes called "clubionic01, clubionic02,
  clubionic03", with a shared scsi disk (fully supporting persistent
  reservations) /dev/sda, one might try the following command:
  
  sudo fence_scsi --verbose -n clubionic01 -d /dev/sda -k 3abe0000 -o off
  
  from nodes "clubionic02 or clubionic03" and check if the reservation
  worked:
  
  (k)rafaeldtinoco at clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
    LIO-ORG   cluster.bionic.   4.0
    Peripheral device type: disk
    PR generation=0x0, there are NO registered reservation keys
  
  (k)rafaeldtinoco at clubionic02:~$ sudo sg_persist -r /dev/sda
    LIO-ORG   cluster.bionic.   4.0
    Peripheral device type: disk
    PR generation=0x0, there is NO reservation held
  
   * having a 3-node setup, nodes called "clubionic01, clubionic02,
  clubionic03", with a shared scsi disk (fully supporting persistent
  reservations) /dev/sda, with corosync and pacemaker operational and
  running, one might try:
  
  rafaeldtinoco at clubionic01:~$ crm configure
  crm(live)configure# property stonith-enabled=on
  crm(live)configure# property stonith-action=off
  crm(live)configure# property no-quorum-policy=stop
  crm(live)configure# property have-watchdog=true
  crm(live)configure# property symmetric-cluster=true
  crm(live)configure# commit
  crm(live)configure# end
  crm(live)# end
  
  rafaeldtinoco at clubionic01:~$ crm configure primitive fence_clubionic \
      stonith:fence_scsi params \
      pcmk_host_list="clubionic01 clubionic02 clubionic03" \
      devices="/dev/sda" \
      meta provides=unfencing
  
  And see that crm_mon won't show fence_clubionic resource operational.
  
  [Regression Potential]
  
-  * Fix involves adding new cmdline and stdin arguments to the fencing
+  * Fix involves adding new cmdline and stdin arguments to the fencing
  agents. Both changes in that direction (normalizing "-" with "_" and
  deprecating some commands in favor of others) keep the existing commands
  working and allow the new commands to work as well (that part is the
  fix, because of the integration with pacemaker).
  
   * Comments #3 and #4 show this new version fully working.
  
-  * This fix has a potential of breaking other "nowadays working" fencing
- agent. If that happens, I suggest that ones affected revert previous to
- previous package AND open a bug against either pacemaker and/or fence-
- agents.
+  * This is a quite complex change and I'd appreciate leaving it in -proposed for a
+ while longer (15 days ?) for a higher chance to detect issues. Furthermore there was no update since bionic release, so users could in the worst-case (and only then)
+ report a bug and downgrade to the former version.
  
   * Judging by this issue, it is very likely that any Ubuntu user who
  has tried using fence_scsi has migrated to a newer version
  because the fence_scsi agent has been broken since its release.
  
   * The way I fixed fence_scsi was this:
  
  I packaged pacemaker in latest 1.1.X version and kept it "vanilla" so I
  could bisect fence-agents. At that moment I realized that bisecting was
  going to be hard because there were multiple issues, not only one. I
  backported the latest fence-agents together with Pacemaker 1.1.19-0 and
  saw that it worked.
  
  From then on, I bisected the following intervals:
  
  4.3.0 .. 4.4.0 (eoan - working)
  4.2.0 .. 4.3.0
  4.1.0 .. 4.2.0
  4.0.25 .. 4.1.0 (bionic - not working)
  
  In each of those intervals I discovered issues. For example, using 4.3.0
  I faced problems so I had to backport fixes that were in between 4.4.0
  and 4.3.0. Then, backporting 4.2.0, I faced issues so I had to backport
  fixes from the 4.3.0 <-> 4.2.0 interval. I did this until I was at
  4.0.25 version, current Bionic fence-agents version.
  
  [Other Info]
  
   * Original Description:
  
  Trying to set up a cluster with an iscsi shared disk, using fence_scsi as
  the fencing mechanism, I realized that fence_scsi is not working in
  Ubuntu Bionic. I first thought it was related to Azure environment (LP:
  #1864419), where I was trying this environment, but then, trying
  locally, I figured out that somehow pacemaker 1.1.18 is not fencing the
  shared scsi disk properly.
  
  Note: I was able to "backport" vanilla 1.1.19 from upstream and
  fence_scsi worked. I have then tried 1.1.18 without all quilt patches
  and it didn't work either. I think that bisecting 1.1.18 <-> 1.1.19
  might tell us which commit has fixed the behaviour needed by the
  fence_scsi agent.
  
  (k)rafaeldtinoco at clubionic01:~$ crm conf show
  node 1: clubionic01.private
  node 2: clubionic02.private
  node 3: clubionic03.private
  primitive fence_clubionic stonith:fence_scsi \
          params pcmk_host_list="10.250.3.10 10.250.3.11 10.250.3.12" devices="/dev/sda" \
          meta provides=unfencing
  property cib-bootstrap-options: \
          have-watchdog=false \
          dc-version=1.1.18-2b07d5c5a9 \
          cluster-infrastructure=corosync \
          cluster-name=clubionic \
          stonith-enabled=on \
          stonith-action=off \
          no-quorum-policy=stop \
          symmetric-cluster=true
  
  ----
  
  (k)rafaeldtinoco at clubionic02:~$ sudo crm_mon -1
  Stack: corosync
  Current DC: clubionic01.private (version 1.1.18-2b07d5c5a9) - partition with quorum
  Last updated: Mon Mar  2 15:55:30 2020
  Last change: Mon Mar  2 15:45:33 2020 by root via cibadmin on clubionic01.private
  
  3 nodes configured
  1 resource configured
  
  Online: [ clubionic01.private clubionic02.private clubionic03.private ]
  
  Active resources:
  
   fence_clubionic        (stonith:fence_scsi):   Started clubionic01.private
  
  ----
  
  (k)rafaeldtinoco at clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
    LIO-ORG   cluster.bionic.   4.0
    Peripheral device type: disk
    PR generation=0x0, there are NO registered reservation keys
  
  (k)rafaeldtinoco at clubionic02:~$ sudo sg_persist -r /dev/sda
    LIO-ORG   cluster.bionic.   4.0
    Peripheral device type: disk
    PR generation=0x0, there is NO reservation held

** Description changed:

  OBS: I have split this bug into 2 bugs:
       - fence-agents (this) and pacemaker (LP: #1866119)
  
  #### SRU: fence-agents
  
  [Impact]
  
   * fence_scsi is not currently working in a shared disk environment
  
   * all clusters relying on fence_scsi and/or fence_scsi + watchdog won't
  be able to start the fencing agents OR, in worst case scenarios, the
  fence_scsi agent might start but won't make scsi reservations in the
  shared scsi disk.
  
  [Test Case]
  
   * having a 3-node setup, nodes called "clubionic01, clubionic02,
  clubionic03", with a shared scsi disk (fully supporting persistent
  reservations) /dev/sda, one might try the following command:
  
  sudo fence_scsi --verbose -n clubionic01 -d /dev/sda -k 3abe0000 -o off
  
  from nodes "clubionic02 or clubionic03" and check if the reservation
  worked:
  
  (k)rafaeldtinoco at clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
    LIO-ORG   cluster.bionic.   4.0
    Peripheral device type: disk
    PR generation=0x0, there are NO registered reservation keys
  
  (k)rafaeldtinoco at clubionic02:~$ sudo sg_persist -r /dev/sda
    LIO-ORG   cluster.bionic.   4.0
    Peripheral device type: disk
    PR generation=0x0, there is NO reservation held
  
   * having a 3-node setup, nodes called "clubionic01, clubionic02,
  clubionic03", with a shared scsi disk (fully supporting persistent
  reservations) /dev/sda, with corosync and pacemaker operational and
  running, one might try:
  
  rafaeldtinoco at clubionic01:~$ crm configure
  crm(live)configure# property stonith-enabled=on
  crm(live)configure# property stonith-action=off
  crm(live)configure# property no-quorum-policy=stop
  crm(live)configure# property have-watchdog=true
  crm(live)configure# property symmetric-cluster=true
  crm(live)configure# commit
  crm(live)configure# end
  crm(live)# end
  
  rafaeldtinoco at clubionic01:~$ crm configure primitive fence_clubionic \
      stonith:fence_scsi params \
      pcmk_host_list="clubionic01 clubionic02 clubionic03" \
      devices="/dev/sda" \
      meta provides=unfencing
  
  And see that crm_mon won't show fence_clubionic resource operational.
  
  [Regression Potential]
  
   * Fix involves adding new cmdline and stdin arguments to the fencing
  agents. Both changes in that direction (normalizing "-" with "_" and
  deprecating some commands in favor of others) keep the existing commands
  working and allow the new commands to work as well (that part is the
  fix, because of the integration with pacemaker).
  
   * Comments #3 and #4 show this new version fully working.
  
-  * This is a quite complex change and I'd appreciate leaving it in -proposed for a
+  * This is a quite complex change and I'd appreciate leaving it in -proposed for a
  while longer (15 days ?) for a higher chance to detect issues. Furthermore there was no update since bionic release, so users could in the worst-case (and only then)
  report a bug and downgrade to the former version.
  
   * Judging by this issue, it is very likely that any Ubuntu user who
  has tried using fence_scsi has migrated to a newer version
  because the fence_scsi agent has been broken since its release.
+ 
+ [Other Info]
  
   * The way I fixed fence_scsi was this:
  
  I packaged pacemaker in latest 1.1.X version and kept it "vanilla" so I
  could bisect fence-agents. At that moment I realized that bisecting was
  going to be hard because there were multiple issues, not only one. I
  backported the latest fence-agents together with Pacemaker 1.1.19-0 and
  saw that it worked.
  
  From then on, I bisected the following intervals:
  
  4.3.0 .. 4.4.0 (eoan - working)
  4.2.0 .. 4.3.0
  4.1.0 .. 4.2.0
  4.0.25 .. 4.1.0 (bionic - not working)
  
  In each of those intervals I discovered issues. For example, using 4.3.0
  I faced problems so I had to backport fixes that were in between 4.4.0
  and 4.3.0. Then, backporting 4.2.0, I faced issues so I had to backport
  fixes from the 4.3.0 <-> 4.2.0 interval. I did this until I was at
  4.0.25 version, current Bionic fence-agents version.
- 
- [Other Info]
  
   * Original Description:
  
  Trying to set up a cluster with an iscsi shared disk, using fence_scsi as
  the fencing mechanism, I realized that fence_scsi is not working in
  Ubuntu Bionic. I first thought it was related to Azure environment (LP:
  #1864419), where I was trying this environment, but then, trying
  locally, I figured out that somehow pacemaker 1.1.18 is not fencing the
  shared scsi disk properly.
  
  Note: I was able to "backport" vanilla 1.1.19 from upstream and
  fence_scsi worked. I have then tried 1.1.18 without all quilt patches
  and it didn't work either. I think that bisecting 1.1.18 <-> 1.1.19
  might tell us which commit has fixed the behaviour needed by the
  fence_scsi agent.
  
  (k)rafaeldtinoco at clubionic01:~$ crm conf show
  node 1: clubionic01.private
  node 2: clubionic02.private
  node 3: clubionic03.private
  primitive fence_clubionic stonith:fence_scsi \
          params pcmk_host_list="10.250.3.10 10.250.3.11 10.250.3.12" devices="/dev/sda" \
          meta provides=unfencing
  property cib-bootstrap-options: \
          have-watchdog=false \
          dc-version=1.1.18-2b07d5c5a9 \
          cluster-infrastructure=corosync \
          cluster-name=clubionic \
          stonith-enabled=on \
          stonith-action=off \
          no-quorum-policy=stop \
          symmetric-cluster=true
  
  ----
  
  (k)rafaeldtinoco at clubionic02:~$ sudo crm_mon -1
  Stack: corosync
  Current DC: clubionic01.private (version 1.1.18-2b07d5c5a9) - partition with quorum
  Last updated: Mon Mar  2 15:55:30 2020
  Last change: Mon Mar  2 15:45:33 2020 by root via cibadmin on clubionic01.private
  
  3 nodes configured
  1 resource configured
  
  Online: [ clubionic01.private clubionic02.private clubionic03.private ]
  
  Active resources:
  
   fence_clubionic        (stonith:fence_scsi):   Started clubionic01.private
  
  ----
  
  (k)rafaeldtinoco at clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
    LIO-ORG   cluster.bionic.   4.0
    Peripheral device type: disk
    PR generation=0x0, there are NO registered reservation keys
  
  (k)rafaeldtinoco at clubionic02:~$ sudo sg_persist -r /dev/sda
    LIO-ORG   cluster.bionic.   4.0
    Peripheral device type: disk
    PR generation=0x0, there is NO reservation held
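
The reservation check from the test case above can be scripted. What
follows is only a minimal sketch, not part of fence-agents or sg3-utils:
pr_keys_status is a hypothetical helper name, and the "registered
reservation keys follow" wording is an assumption about what sg_persist
prints once keys exist (this report only shows the "NO registered
reservation keys" case).

```shell
#!/bin/sh
# Classify the text printed by `sg_persist --in --read-keys`, read from stdin.
# Prints: none | registered | unknown
# pr_keys_status is a hypothetical name used for illustration only.
pr_keys_status() {
    out=$(cat)
    case "$out" in
        # Wording taken verbatim from the output quoted in this report.
        *"there are NO registered reservation keys"*) echo none ;;
        # Assumed wording for the working case; adjust if your sg_persist differs.
        *"registered reservation keys follow"*) echo registered ;;
        *) echo unknown ;;
    esac
}

# Self-contained demonstration using the output quoted in this report.
sample='  LIO-ORG   cluster.bionic.   4.0
  Peripheral device type: disk
  PR generation=0x0, there are NO registered reservation keys'
printf '%s\n' "$sample" | pr_keys_status
```

On a cluster node the helper would be fed from the real command, for
example: sudo sg_persist --in --read-keys --device=/dev/sda | pr_keys_status
(assuming /dev/sda is the shared disk, as in the test case). A result of
"none" after fence_clubionic is reported as started matches the broken
behaviour described in this bug.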

-- 
You received this bug notification because you are a member of Ubuntu
Server, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1865523

Title:
  [bionic] fence_scsi not working properly with Pacemaker
  1.1.18-2ubuntu1.1

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/fence-agents/+bug/1865523/+subscriptions
