[Bug 2089565] Re: MON and MDS crash upgrading CEPH on ubuntu 24.04 LTS

Peter Sabaini 2089565 at bugs.launchpad.net
Fri Mar 14 18:00:06 UTC 2025


** Description changed:

- 
  [ Impact ]
  
  - In Ubuntu 24.04 LTS, upgrading Ceph from `19.2.0~git20240301.4c76c50-0ubuntu6` to `19.2.0-0ubuntu0.24.04.1` may cause the `ceph-mon` daemon to crash when it encounters mismatched serialization order (see stack trace below).
  - This can lead to monitor outages, preventing the entire Ceph cluster from reaching a healthy state.
  - Mitigation is only possible by downgrading to the previous version.
  - Monitor stacktrace:
  
  ```
-  ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
-  1: /lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x788409245320]
-  2: pthread_kill()
-  3: gsignal()
-  4: abort()
-  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5) [0x7884096a5ff5]
-  6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7884096bb0da]
-  7: (std::unexpected()+0) [0x7884096a5a55]
-  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7884096bb391]
-  9: (ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x193) [0x78840a293593]
-  10: (MDSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xca1) [0x78840a4c3ab1]
-  11: (Filesystem::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x1c3) [0x78840a4e4303]
-  12: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x280) [0x78840a4e6ef0]
-  13: (MDSMonitor::update_from_paxos(bool*)+0x291) [0x631ac5dea801]
-  14: (Monitor::refresh_from_paxos(bool*)+0x124) [0x631ac5b7a164]
-  15: (Monitor::preinit()+0x98e) [0x631ac5bb2fbe]
-  16: main()
-  17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x78840922a1ca]
-  18: __libc_start_main()
-  19: _start()
-  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
+  ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
+  1: /lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x788409245320]
+  2: pthread_kill()
+  3: gsignal()
+  4: abort()
+  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5) [0x7884096a5ff5]
+  6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7884096bb0da]
+  7: (std::unexpected()+0) [0x7884096a5a55]
+  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7884096bb391]
+  9: (ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x193) [0x78840a293593]
+  10: (MDSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xca1) [0x78840a4c3ab1]
+  11: (Filesystem::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x1c3) [0x78840a4e4303]
+  12: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x280) [0x78840a4e6ef0]
+  13: (MDSMonitor::update_from_paxos(bool*)+0x291) [0x631ac5dea801]
+  14: (Monitor::refresh_from_paxos(bool*)+0x124) [0x631ac5b7a164]
+  15: (Monitor::preinit()+0x98e) [0x631ac5bb2fbe]
+  16: main()
+  17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x78840922a1ca]
+  18: __libc_start_main()
+  19: _start()
+  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
  ```
  
  - The root cause of this bug is that the on-wire representation changed between the git snapshot 19.2.0~git20240301.4c76c50-0ubuntu6 and the squid release 19.2.0-0ubuntu0.24.04.1: the `bal_rank_mask` and `max_xattr_size` fields were swapped.
  - The proposed fix tries to detect the on-wire representation and adapts the decode order accordingly.
- 
  
  [ Test Plan ]
  
  To validate that the proposed fix addresses the crash and introduces no
  regressions, perform the following steps:
  
  1. Setup
  
  - Deploy a Juju model and add seven Ubuntu 24.04 VMs.
  - On each VM, add apt pinning to pin Ceph to the snapshot version.
  - Deploy 3x ceph-mon, 3x ceph-osd and 1x ceph-fs units to these machines.
  - Configure a mount point, mount cephfs via ceph-fuse and write some test data.
  
-    
  2. Baseline Check
-    - Verify the cluster is healthy (`ceph -s` shows `HEALTH_OK` or similar).
-    - Verify Ceph packages correspond to snapshot packages.
-    
- 3. Upgrade Ceph packages
-    - Remove the apt pin.
-    - Upgrade Ceph packages to the fixed version (which includes the new decode logic).
-    
- 4. Verification
+    - Verify the cluster is healthy (`ceph -s` shows `HEALTH_OK` or similar).
+    - Verify Ceph packages correspond to snapshot packages.
  
-    - Verify package versions.
-    - Restart Ceph services incl. MONs and verify services start correctly.
-    - Verify cephfs mounts can be mounted, and data can be read and written.
+ 3. Upgrade Ceph packages to the 19.2.0 release version
+    - Remove the apt pin on unit ceph-mon/0.
+    - Upgrade the Ceph packages to the release version (e.g. for noble, 19.2.0-0ubuntu0.24.04.2).
  
+ 
+ 4. Verification: the bug can be observed
+ 
+    - Verify package versions.
+    - Observe that the MON service on ceph-mon/0 crashes.
+    - Verify the log messages (compare with the stack trace above).
+ 
+ 5. Upgrade to SRU version
+ 
+    - Upgrade to SRU version 19.2.1
+    - Verify package versions
+    - Restart Ceph services incl. MONs and verify services start correctly.
+    - Verify cluster health
+    - Verify cephfs mounts can be mounted, and data can be read and written.
  
  [ Where problems could occur ]
  
  The fix changes how the `bal_rank_mask` and `max_xattr_size` fields are
  decoded, based on detecting the on-wire layout. The patch assumes that
  the maximum xattr size is less than 64 GiB; larger extended attributes
  are highly unlikely. A maximum xattr size larger than 64 GiB would
  probably be misdecoded and could result in a crash.
  
  The Reef release should have the same field order as the snapshot. Older
  releases should not be affected.
  
  [ Other Info / Original Description ]
  
  Also see bug https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2065515

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/2089565

Title:
  MON and MDS crash upgrading CEPH on ubuntu 24.04 LTS

Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Noble:
  Triaged
Status in ceph source package in Oracular:
  Triaged

Bug description:
  [ Impact ]

  - In Ubuntu 24.04 LTS, upgrading Ceph from `19.2.0~git20240301.4c76c50-0ubuntu6` to `19.2.0-0ubuntu0.24.04.1` may cause the `ceph-mon` daemon to crash when it encounters mismatched serialization order (see stack trace below).
  - This can lead to monitor outages, preventing the entire Ceph cluster from reaching a healthy state.
  - Mitigation is only possible by downgrading to the previous version.
  - Monitor stacktrace:

  ```
   ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
   1: /lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x788409245320]
   2: pthread_kill()
   3: gsignal()
   4: abort()
   5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5) [0x7884096a5ff5]
   6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7884096bb0da]
   7: (std::unexpected()+0) [0x7884096a5a55]
   8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7884096bb391]
   9: (ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x193) [0x78840a293593]
   10: (MDSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xca1) [0x78840a4c3ab1]
   11: (Filesystem::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x1c3) [0x78840a4e4303]
   12: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x280) [0x78840a4e6ef0]
   13: (MDSMonitor::update_from_paxos(bool*)+0x291) [0x631ac5dea801]
   14: (Monitor::refresh_from_paxos(bool*)+0x124) [0x631ac5b7a164]
   15: (Monitor::preinit()+0x98e) [0x631ac5bb2fbe]
   16: main()
   17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x78840922a1ca]
   18: __libc_start_main()
   19: _start()
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
  ```

  - The root cause of this bug is that the on-wire representation changed between the git snapshot 19.2.0~git20240301.4c76c50-0ubuntu6 and the squid release 19.2.0-0ubuntu0.24.04.1: the `bal_rank_mask` and `max_xattr_size` fields were swapped.
  - The proposed fix tries to detect the on-wire representation and adapts the decode order accordingly.
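
To illustrate what order detection on the wire can look like, here is a minimal Python sketch. It is not the actual patch. It assumes `max_xattr_size` is a little-endian 64-bit integer and `bal_rank_mask` is a printable ASCII string with a 32-bit length prefix; the function names and the two layout labels are hypothetical. The idea: peek at the next 8 bytes as a u64. If the value is below 64 GiB, treat the integer as coming first; otherwise treat the string as coming first, since a length prefix followed by printable characters sets bits above 2^36.

```python
import struct

# 64 GiB: the bound on max_xattr_size that the description says the patch assumes.
XATTR_LIMIT = 64 * 2**30


def encode_u64_first(max_xattr_size: int, bal_rank_mask: str) -> bytes:
    """Hypothetical layout A: u64 size, then length-prefixed string."""
    raw = bal_rank_mask.encode("ascii")
    return struct.pack("<QI", max_xattr_size, len(raw)) + raw


def encode_string_first(max_xattr_size: int, bal_rank_mask: str) -> bytes:
    """Hypothetical layout B: length-prefixed string, then u64 size."""
    raw = bal_rank_mask.encode("ascii")
    return struct.pack("<I", len(raw)) + raw + struct.pack("<Q", max_xattr_size)


def decode_either_order(buf: bytes) -> tuple[int, str]:
    """Return (max_xattr_size, bal_rank_mask) for either layout."""
    # Peek at the first 8 bytes as if they were the u64 field.
    (probe,) = struct.unpack_from("<Q", buf, 0)
    if probe < XATTR_LIMIT:
        # Plausible size value: assume layout A (integer first).
        (length,) = struct.unpack_from("<I", buf, 8)
        return probe, buf[12:12 + length].decode("ascii")
    # Otherwise assume layout B (string first).
    (length,) = struct.unpack_from("<I", buf, 0)
    mask = buf[4:4 + length].decode("ascii")
    (size,) = struct.unpack_from("<Q", buf, 4 + length)
    return size, mask
```

In this sketch both layouts decode to the same pair, e.g. `decode_either_order(encode_string_first(65536, "0x3"))` and `decode_either_order(encode_u64_first(65536, "0x3"))` both give `(65536, "0x3")`. A size of 64 GiB or more in the integer-first layout is misread as string-first and silently yields wrong values, which is the kind of misdecoding the size assumption guards against.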

  [ Test Plan ]

  To validate that the proposed fix addresses the crash and introduces
  no regressions, perform the following steps:

  1. Setup

  - Deploy a Juju model and add seven Ubuntu 24.04 VMs.
  - On each VM, add apt pinning to pin Ceph to the snapshot version.
  - Deploy 3x ceph-mon, 3x ceph-osd and 1x ceph-fs units to these machines.
  - Configure a mount point, mount cephfs via ceph-fuse and write some test data.

  2. Baseline Check
     - Verify the cluster is healthy (`ceph -s` shows `HEALTH_OK` or similar).
     - Verify Ceph packages correspond to snapshot packages.

  3. Upgrade Ceph packages to the 19.2.0 release version
     - Remove the apt pin on unit ceph-mon/0.
     - Upgrade the Ceph packages to the release version (e.g. for noble, 19.2.0-0ubuntu0.24.04.2).

  
  4. Verification: the bug can be observed

     - Verify package versions.
     - Observe that the MON service on ceph-mon/0 crashes.
     - Verify the log messages (compare with the stack trace above).

  5. Upgrade to SRU version

     - Upgrade to SRU version 19.2.1
     - Verify package versions
     - Restart Ceph services incl. MONs and verify services start correctly.
     - Verify cluster health
     - Verify cephfs mounts can be mounted, and data can be read and written.
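
Two of the steps above (apt pinning and the health check) could be scripted roughly as follows. This is a sketch under assumptions, not part of the bug's test tooling: the `ceph*` package glob, the priority of 1001 and both helper names are hypothetical choices, and the JSON layout (`health.status`) is assumed to match what `ceph -s --format json` prints. Other Ceph-related packages whose names do not start with `ceph` would need their own pin entries.

```python
import json

# Snapshot version named in the bug description.
SNAPSHOT_VERSION = "19.2.0~git20240301.4c76c50-0ubuntu6"


def render_apt_pin(version: str = SNAPSHOT_VERSION,
                   package_glob: str = "ceph*",
                   priority: int = 1001) -> str:
    """Render an apt preferences stanza that pins matching packages.

    The glob and priority are assumptions; adjust them for the deployment.
    """
    return (
        f"Package: {package_glob}\n"
        f"Pin: version {version}\n"
        f"Pin-Priority: {priority}\n"
    )


def health_status(status_json: str) -> str:
    """Extract the health string from `ceph -s --format json` output.

    Assumes the document has the shape {"health": {"status": "..."}}.
    """
    return json.loads(status_json)["health"]["status"]


if __name__ == "__main__":
    # The stanza would go into a file under /etc/apt/preferences.d/.
    print(render_apt_pin())
    # Hand-written sample input, not captured from a real cluster.
    print(health_status('{"health": {"status": "HEALTH_OK"}}'))
```

Keeping the health check as a pure function over the JSON text makes it easy to reuse for both the baseline check and the final verification.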

  [ Where problems could occur ]

  The fix changes how the `bal_rank_mask` and `max_xattr_size` fields are
  decoded, based on detecting the on-wire layout. The patch assumes that
  the maximum xattr size is less than 64 GiB; larger extended attributes
  are highly unlikely. A maximum xattr size larger than 64 GiB would
  probably be misdecoded and could result in a crash.

  The Reef release should have the same field order as the snapshot.
  Older releases should not be affected.

  [ Other Info / Original Description ]

  Also see bug
  https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2065515

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2089565/+subscriptions



