[Bug 2143377] Re: Dangling RADOS export index entry in remove_export() crashes NFS-Ganesha

Mon Mar 16 23:50:11 UTC 2026

** Patch added: "lp2143377_uca-caracal.debdiff"
   https://bugs.launchpad.net/manila/+bug/2143377/+attachment/5953358/+files/lp2143377_uca-caracal.debdiff

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to manila in Ubuntu.
https://bugs.launchpad.net/bugs/2143377

Title:
  Dangling RADOS export index entry in remove_export() crashes NFS-Ganesha

Status in Ubuntu Cloud Archive:
  In Progress
Status in Ubuntu Cloud Archive caracal series:
  In Progress
Status in Ubuntu Cloud Archive dalmatian series:
  In Progress
Status in Ubuntu Cloud Archive epoxy series:
  In Progress
Status in Ubuntu Cloud Archive flamingo series:
  In Progress
Status in Ubuntu Cloud Archive gazpacho series:
  New
Status in Ubuntu Cloud Archive yoga series:
  In Progress
Status in OpenStack Shared File Systems Service (Manila):
  Fix Released
Status in manila package in Ubuntu:
  New
Status in manila source package in Jammy:
  In Progress
Status in manila source package in Noble:
  In Progress
Status in manila source package in Questing:
  In Progress
Status in manila source package in Resolute:
  New

Bug description:
  [Impact]

  When manila deletes a CephFS NFS share, remove_export() deletes the
  RADOS export object before removing its URL from the export index.
  If manila-share is interrupted between these two operations, the index
  retains a reference to a non-existent object. On the next NFS-Ganesha
  restart, ganesha.nfsd hits ENOENT on the dangling reference and exits
  FATAL. This takes down the entire NFS gateway — all connected NFS
  clients get "server not responding" and all I/O hangs until manual
  intervention.
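
  The window described above can be illustrated with a toy in-memory
  model. This is only a sketch, not manila code: the set and list stand
  in for the RADOS pool and the export index, and every name in it
  (remove_export as a free function, CrashBetweenSteps, simulate) is
  hypothetical.

```python
# Toy model of the two cleanup steps; none of this is manila code.
class CrashBetweenSteps(Exception):
    """Stands in for manila-share being interrupted mid-cleanup."""


def remove_export(objects, index, name, index_first, crash=False):
    """Run the two cleanup steps in the chosen order.

    objects: set of object names (stand-in for the RADOS pool)
    index:   list of object names referenced by %url entries
    """
    def drop_object():
        objects.discard(name)

    def drop_index_entry():
        if name in index:
            index.remove(name)

    if index_first:
        steps = [drop_index_entry, drop_object]
    else:
        steps = [drop_object, drop_index_entry]

    steps[0]()
    if crash:
        # Interruption between the two operations.
        raise CrashBetweenSteps(name)
    steps[1]()


def dangling(objects, index):
    """Index entries whose object no longer exists."""
    return [n for n in index if n not in objects]


def orphans(objects, index):
    """Objects that no index entry references."""
    return sorted(objects - set(index))


def simulate(index_first):
    objects = {"export-a", "export-b"}
    index = ["export-a", "export-b"]
    try:
        remove_export(objects, index, "export-b", index_first, crash=True)
    except CrashBetweenSteps:
        pass
    return dangling(objects, index), orphans(objects, index)


if __name__ == "__main__":
    print("object first:", simulate(index_first=False))
    print("index first: ", simulate(index_first=True))
```

  In this model, deleting the object first leaves a dangling index
  entry after the interruption, while removing the index entry first
  leaves only an orphan object.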

  [Test Case]

  Tested in a Juju/MAAS environment.

  Prerequisites:
    - Juju model with ceph-mon, ceph-osd, ceph-fs, mysql-innodb-cluster,
      rabbitmq-server, keystone, manila, manila-ganesha
    - OpenStack 2024.1/stable, Ceph Quincy
    - NFS client node with nfs-common installed

  Test 1: Dangling index crashes NFS-Ganesha (reproduce bug)
    1. Create a fake RADOS export object in the manila-ganesha pool:
         rados --id manila-ganesha -p manila-ganesha put \
           ganesha-export-test-dangling /tmp/export_obj.conf
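    2. Add a %url entry pointing to it in ganesha-export-index:
         rados --id manila-ganesha -p manila-ganesha get \
           ganesha-export-index index
         echo '%url "rados://manila-ganesha/ganesha-export-test-dangling"' >> index
         rados --id manila-ganesha -p manila-ganesha put \
           ganesha-export-index index
    3. Delete the object (simulating crash after _delete_rados_object):
         rados --id manila-ganesha -p manila-ganesha rm \
           ganesha-export-test-dangling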
    4. Restart NFS-Ganesha:
         systemctl restart nfs-ganesha
    5. Observe: the service exits FATAL with "Unknown error -2".
    6. Clean up: remove the dangling entry from the index and restart
       ganesha.
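
  Before restarting ganesha it may help to check the index for
  dangling entries. The following is a minimal sketch, assuming the
  index text and the pool listing have already been fetched (for
  example with "rados get" and "rados ls") and that entries use the
  two-component rados://<pool>/<object> form shown in step 2. The
  function name find_dangling and the regular expression are
  illustrative, not part of manila or NFS-Ganesha.

```python
import re

# Matches lines such as:
#   %url "rados://manila-ganesha/ganesha-export-test-dangling"
# Assumption: no namespace component in the URL.
URL_LINE = re.compile(
    r'^\s*%url\s+"?rados://(?P<pool>[^/"\s]+)/(?P<obj>[^"\s]+)"?\s*$'
)


def find_dangling(index_text, existing_objects):
    """Return (pool, object) pairs the index references but the pool lacks."""
    missing = []
    for line in index_text.splitlines():
        match = URL_LINE.match(line)
        if match and match.group("obj") not in existing_objects:
            missing.append((match.group("pool"), match.group("obj")))
    return missing


if __name__ == "__main__":
    index_text = (
        '%url "rados://manila-ganesha/ganesha-export-1"\n'
        '%url "rados://manila-ganesha/ganesha-export-test-dangling"\n'
    )
    # Pretend only the first object still exists in the pool.
    print(find_dangling(index_text, {"ganesha-export-1"}))
```

  An empty result suggests every %url entry still has a backing
  object; any pair returned is a candidate for removal from the index.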

  Test 2: Orphan object is harmless (verify fix)
    1. Create a RADOS export object but do NOT add it to the index
       (simulating crash after _remove_rados_object_url_from_index
       but before _delete_rados_object).
    2. Restart NFS-Ganesha.
    3. Observe: the service starts normally and the orphan object is
       ignored.

  Test 3: NFS client impact (dangling reference)
    1. Create an NFS share via manila CLI:
         manila type-create cephfsnfstype false \
           --extra-specs share_backend_name=cephfsnfs1
         manila create --share-type cephfsnfstype --name test-nfs NFS 1
    2. Allow access and mount on a client node:
         manila access-allow test-nfs ip <client_ip>
         mount -t nfs <ganesha_ip>:<export_path> /mnt/test-nfs
    3. Verify I/O: touch /mnt/test-nfs/testfile
    4. Inject a dangling entry into the export index (same as Test 1).
    5. Restart NFS-Ganesha — service crashes.
    6. On client: observe "nfs: server <ip> not responding, timed out"
       in kern.log; ls on mount point hangs.
    7. Restore original index, restart ganesha — NFS I/O resumes.

  Test 4: NFS client unaffected after fix
    1. Apply the fix (swap the two calls in the finally block of
       remove_export()).
    2. Restart manila-share.
    3. Create an NFS share, mount on client, verify I/O.
    4. Drop orphan objects into the RADOS pool (no index entries).
    5. Restart NFS-Ganesha.
    6. Observe: ganesha starts normally, NFS I/O works, no
       "not responding" in dmesg.

  [Regression Potential]

  Low. The change only reorders two independent cleanup operations in
  the finally block of remove_export(). If
  _remove_rados_object_url_from_index() fails, the object deletion still
  proceeds as before. The only new failure mode is an orphan RADOS
  object, which is harmless (ganesha ignores objects not referenced in
  the index).

  [Other Info]

  Buggy order in manila/share/drivers/ganesha/manager.py remove_export():
    self._delete_rados_object(...)
    self._remove_rados_object_url_from_index(name)

  Fixed order:
    self._remove_rados_object_url_from_index(name)
    self._delete_rados_object(...)
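
  As a rough illustration of the fixed ordering inside a finally
  block, here is a self-contained sketch. The class
  ExportManagerSketch, the _unexport step, and the single "name"
  argument passed to _delete_rados_object are placeholders: the real
  arguments are elided in this report and the real method body is not
  reproduced here.

```python
class ExportManagerSketch:
    """Records the order of cleanup calls; not the manila implementation."""

    def __init__(self):
        self.calls = []

    def _unexport(self, name):
        # Hypothetical stand-in for whatever remove_export() does first.
        self.calls.append(("unexport", name))

    def _remove_rados_object_url_from_index(self, name):
        self.calls.append(("remove_url_from_index", name))

    def _delete_rados_object(self, name):
        # Placeholder signature; the report elides the real arguments.
        self.calls.append(("delete_object", name))

    def remove_export(self, name):
        try:
            self._unexport(name)
        finally:
            # Fixed order: drop the index reference before the object,
            # so an interruption here leaves an orphan object at worst.
            self._remove_rados_object_url_from_index(name)
            self._delete_rados_object(name)


if __name__ == "__main__":
    manager = ExportManagerSketch()
    manager.remove_export("ganesha-export-1")
    for call in manager.calls:
        print(call)
```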

  Reproduced on OpenStack 2024.1 (Caracal), Ceph Quincy 17.2.9,
  manila-ganesha charm, NFS4 hard mount clients.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2143377/+subscriptions