[Bug 1969087] [NEW] failing to mount iSCSI path with first volume
DUFOUR Olivier
1969087@bugs.launchpad.net
Thu Apr 14 09:14:02 UTC 2022
Public bug reported:
On a customer deployment on focal-ussuri, with iSCSI backends and multipath enabled, we face an issue where iscsiadm fails to mount one of the paths of an iSCSI volume with the following error:
"iscsiadm: Could not make /etc/iscsi/nodes: File exists\niscsiadm: Error while adding record: encountered iSCSI database failure"
(see nova-compute.log for more details)
In terms of impact, every server mounting an iSCSI volume for the first
time will "silently" fail to mount the first path of the iSCSI target,
and the end user won't be aware of it.
I noticed the iSCSI database error happens solely on the first iSCSI volume to be mounted on each involved server in the deployment, such as units running the cinder-volume and nova-compute services.
After some investigation, this appears to be a race condition between os-brick and the iscsid daemon.
After a deployment or a reboot, iscsid isn't started, and os-brick tries too quickly to mount the first path of the iSCSI volume before iscsid has finished initialising, leading to the error we see in the cinder-volume or nova-compute logs.
If iscsid is manually started on the server beforehand, the error simply disappears and all the target paths are mounted properly on the first volume.
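A simple guard against this race can be sketched as follows (a hedged sketch, not part of os-brick; the function name and timeout are my own):

```shell
#!/bin/sh
# wait_for_proc NAME TIMEOUT: poll once per second until a process whose
# exact name is NAME exists, giving up after TIMEOUT seconds.
# Returns 0 when the process is found, 1 on timeout.
wait_for_proc() {
    name=$1
    timeout=$2
    t=0
    while [ "$t" -lt "$timeout" ]; do
        # pgrep -x exits 0 when at least one exact-name match exists
        if pgrep -x "$name" >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
        t=$((t + 1))
    done
    return 1
}

# Usage before the first volume attach (30 s is an arbitrary budget):
# wait_for_proc iscsid 30 || echo "iscsid still not up" >&2
```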
Here is an example of the processes running on a nova-compute unit:
# before an instance creation with the first iSCSI volume
ubuntu@nova-compute:~$ ps aux | grep iscsi
ubuntu 3705821 0.0 0.0 6304 2624 pts/1 S+ 14:18 0:00 grep --color=auto iscsi
# after the instance creation
ubuntu@nova-compute:~$ ps aux | grep iscsi
root 3707866 0.0 0.0 5108 248 ? Ss 14:21 0:00 /sbin/iscsid
root 3707867 0.0 0.0 5964 5816 ? S<Ls 14:21 0:00 /sbin/iscsid
root 3707869 0.0 0.0 0 0 ? I< 14:21 0:00 [iscsi_eh]
root 3707878 0.0 0.0 0 0 ? I< 14:21 0:00 [iscsi_q_1]
ubuntu 3708321 0.0 0.0 6436 2524 pts/1 S+ 14:21 0:00 grep --color=auto iscsi
To avoid iscsiadm encountering the database error, the workaround I
have found so far is simply to start and enable iscsid on every
cinder-volume and nova-compute unit before mounting any iSCSI volume.
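On systemd-based units (as on focal), the workaround above boils down to the following sketch, assuming the systemd unit is named iscsid:

```shell
# Start iscsid immediately and enable it at boot, on every
# cinder-volume and nova-compute unit, before any volume attach:
sudo systemctl enable --now iscsid

# Sanity check before creating the first instance with a volume;
# this should report "active" once the daemon is up:
systemctl is-active iscsid
```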
Looking more in depth, this issue is also mentioned in os-brick ticket
#1944474, where a retry mechanism was implemented to retry mounting
the path if iscsiadm returns the database failure error code (6).
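Outside os-brick, that retry idea can be approximated with a small wrapper (a sketch; the function name and the iscsiadm arguments in the trailing comment are illustrative, and exit code 6 is the database-failure code mentioned above):

```shell
#!/bin/sh
# retry_on_db_failure ATTEMPTS CMD...: run CMD, retrying up to ATTEMPTS
# times while it exits with status 6 (iscsiadm's "iSCSI database
# failure"), sleeping between attempts so iscsid can finish starting.
retry_on_db_failure() {
    attempts=$1
    shift
    rc=6
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@"
        rc=$?
        if [ "$rc" -ne 6 ]; then
            return "$rc"    # success, or an unrelated error: stop retrying
        fi
        sleep 1
        i=$((i + 1))
    done
    return "$rc"
}

# Example (target and portal values are placeholders):
# retry_on_db_failure 3 iscsiadm -m node -T iqn.2004-04.example:tgt \
#     -p 192.0.2.10:3260 --login
```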
Would it be possible either to backport the fix from #1944474 to the
package, and/or to see whether it's feasible to start iscsid
beforehand through a charm configuration?
** Affects: python-os-brick (Ubuntu)
Importance: Undecided
Status: New
** Attachment added: "nova-compute.log"
https://bugs.launchpad.net/bugs/1969087/+attachment/5580635/+files/nova-compute.log
** Summary changed:
- os-brick failing to mount iSCSI path
+ failing to mount iSCSI path with first volume