[Bug 2089565] Re: MON and MDS crash upgrading CEPH on ubuntu 24.04 LTS
Guillaume COEUGNET
2089565 at bugs.launchpad.net
Wed Jul 9 09:56:27 UTC 2025
Hello,
I've tested the Ceph upgrade again, this time from
19.2.0~git20240301.4c76c50-0ubuntu6 to 19.2.0-0ubuntu0.24.04.2.
Unfortunately, it is not working.
Ceph-Mon crashed after the upgrade to 19.2.0-0ubuntu0.24.04.2:
[root@ugfsicpd01 ~]# ceph --version
ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
[root@ugfsicpd01 ~]# journalctl -xeu ceph-mon@ugfsicpd01.service
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x76f49542a1ca]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 18: __libc_start_main()
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 19: _start()
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 0> 2025-07-09T11:50:17.641+0200 76f495ad0a80 -1 *** Caught signal (Aborted) **
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: in thread 76f495ad0a80 thread_name:ceph-mon
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x76f495445330]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 2: pthread_kill()
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 3: gsignal()
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 4: abort()
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5) [0x76f4958a5ff5]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x76f4958bb0da]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 7: (std::unexpected()+0) [0x76f4958a5a55]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x76f4958bb391]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 9: (ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, std::__cxx11::basic_string<char, std::char_t>
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 10: (MDSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xca1) [0x76f4966c3ab1]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 11: (Filesystem::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x1c3) [0x76f4966e4303]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 12: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x280) [0x76f4966e6ef0]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 13: (MDSMonitor::update_from_paxos(bool*)+0x291) [0x5eb27acc7801]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 14: (Monitor::refresh_from_paxos(bool*)+0x124) [0x5eb27aa57164]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 15: (Monitor::preinit()+0x98e) [0x5eb27aa8ffbe]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 16: main()
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x76f49542a1ca]
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 18: __libc_start_main()
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: 19: _start()
Jul 09 11:50:17 ugfsicpd01 ceph-mon[385507]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 09 11:50:17 ugfsicpd01 systemd[1]: ceph-mon@ugfsicpd01.service: Main process exited, code=killed, status=6/ABRT
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit ceph-mon@ugfsicpd01.service has exited.
░░
░░ The process' exit code is 'killed' and its exit status is 6.
It worked with 19.2.0-0ubuntu0.24.04.1, but it does not work with
19.2.0-0ubuntu0.24.04.2.
** Tags removed: verification-needed-noble
** Tags added: verification-failed-noble
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/2089565
Title:
MON and MDS crash upgrading CEPH on ubuntu 24.04 LTS
Status in ceph package in Ubuntu:
Fix Released
Status in ceph source package in Noble:
Fix Committed
Status in ceph source package in Oracular:
Fix Released
Bug description:
[ Impact ]
- In Ubuntu 24.04 LTS, upgrading Ceph from `19.2.0~git20240301.4c76c50-0ubuntu6` to `19.2.0-0ubuntu0.24.04.1` may cause the `ceph-mon` daemon to crash when it encounters mismatched serialization order (see stack trace below).
- This can lead to monitor outages, preventing the entire Ceph cluster from reaching a healthy state.
- Mitigation is only possible by downgrading to the previous version.
- Monitor stacktrace:
```
ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
1: /lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x788409245320]
2: pthread_kill()
3: gsignal()
4: abort()
5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5) [0x7884096a5ff5]
6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7884096bb0da]
7: (std::unexpected()+0) [0x7884096a5a55]
8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7884096bb391]
9: (ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x193) [0x78840a293593]
10: (MDSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xca1) [0x78840a4c3ab1]
11: (Filesystem::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x1c3) [0x78840a4e4303]
12: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x280) [0x78840a4e6ef0]
13: (MDSMonitor::update_from_paxos(bool*)+0x291) [0x631ac5dea801]
14: (Monitor::refresh_from_paxos(bool*)+0x124) [0x631ac5b7a164]
15: (Monitor::preinit()+0x98e) [0x631ac5bb2fbe]
16: main()
17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x78840922a1ca]
18: __libc_start_main()
19: _start()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
```
- The root cause of this bug is that the on-wire representation changed between the git snapshot 19.2.0~git20240301.4c76c50-0ubuntu6 and the squid release 19.2.0-0ubuntu0.24.04.1: the fields `bal_rank_mask` and `max_xattr_size` were swapped.
- The proposed fix tries to detect the on-wire representation and adapts the decode order accordingly.
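To illustrate the idea of detecting the field order while decoding, here is a minimal Python sketch. It is not the actual patch. It assumes that the string field is encoded as a little-endian 32-bit length followed by its bytes, and that the size field is a little-endian 64-bit integer. The function names and the peek-and-compare heuristic are hypothetical. The sketch deliberately does not say which layout belongs to which package version.

```python
import struct

# Threshold taken from the bug description: sizes are assumed to be < 64GiB.
SIXTY_FOUR_GIB = 64 * 2**30  # 2**36


def decode_u64(buf, off):
    """Read a little-endian unsigned 64-bit integer (assumed encoding)."""
    (value,) = struct.unpack_from("<Q", buf, off)
    return value, off + 8


def decode_string(buf, off):
    """Read a u32-length-prefixed ASCII string (assumed encoding)."""
    (length,) = struct.unpack_from("<I", buf, off)
    start = off + 4
    return buf[start:start + length].decode("ascii"), start + length


def decode_pair(buf, off=0):
    """Decode (bal_rank_mask, max_xattr_size) in whichever order they appear.

    Hypothetical heuristic: peek at the next 8 bytes as a u64. If the value
    is a plausible size (< 64GiB), treat the size field as coming first.
    Otherwise the bytes are taken to be a string length plus string data.
    """
    peek, _ = decode_u64(buf, off)
    if peek < SIXTY_FOUR_GIB:
        size, off = decode_u64(buf, off)
        mask, off = decode_string(buf, off)
    else:
        mask, off = decode_string(buf, off)
        size, off = decode_u64(buf, off)
    return mask, size, off


if __name__ == "__main__":
    mask, size = b"0x1", 65536
    string_first = struct.pack("<I", len(mask)) + mask + struct.pack("<Q", size)
    size_first = struct.pack("<Q", size) + struct.pack("<I", len(mask)) + mask
    print(decode_pair(string_first))
    print(decode_pair(size_first))
```

Under these assumptions, a non-empty ASCII string places a character byte in bits 32 to 39 of the peeked value, which pushes it above 2**36 (64GiB). This would be consistent with the 64GiB limit mentioned in this report. An empty string followed by a very small size would defeat this simple sketch, so real code would need additional checks.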
[ Test Plan ]
To validate that the proposed fix addresses the crash and introduces
no regressions, perform the following steps:
1. Setup
- Deploy a juju model and add 7 24.04 VMs.
- On each VM, add apt pinning to pin Ceph to the snapshot version.
- Deploy 3x ceph-mon, 3x ceph-osd and 1x ceph-fs units to these machines.
- Configure a mount point, mount cephfs via ceph-fuse and write some test data.
2. Baseline Check
- Verify the cluster is healthy (`ceph -s` shows `HEALTH_OK` or similar).
- Verify Ceph packages correspond to snapshot packages.
3. Upgrade Ceph packages to the upgrade version (19.2.0)
- Remove the apt pin on unit ceph-mon/0
- Upgrade Ceph packages to the upgrade version (e.g. for noble 19.2.0-0ubuntu0.24.04.2)
4. Verification: bug can be observed
- Verify package versions
- Observe: MON service on ceph-mon/0 crashes
- Verify log messages
5. Upgrade to SRU version
- Upgrade to SRU version 19.2.1
- Verify package versions
- Restart Ceph services incl. MONs and verify services start correctly.
- Verify cluster health
- Verify cephfs mounts can be mounted, and data can be read and written.
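The cluster health checks in the steps above could be scripted. The following is a small sketch, assuming that `ceph -s --format json` reports the overall state under `health.status`. The helper names are hypothetical, and the set of accepted states is a parameter because the plan allows `HEALTH_OK` "or similar".

```python
import json
import subprocess


def cluster_is_healthy(status_json, accepted=("HEALTH_OK",)):
    """Return True if the JSON status reports one of the accepted states.

    Assumes the overall state is found under health.status.
    """
    status = json.loads(status_json)
    return status.get("health", {}).get("status") in accepted


def fetch_status_json():
    """Run `ceph -s --format json` on a node with admin access.

    Not called at import time; requires a reachable cluster.
    """
    result = subprocess.run(
        ["ceph", "-s", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Illustrative sample input, not captured from a real cluster.
    sample = '{"health": {"status": "HEALTH_OK"}}'
    print(cluster_is_healthy(sample))
```

On a test machine, `cluster_is_healthy(fetch_status_json())` could be run before the upgrade (step 2) and again after the SRU packages are installed (step 5).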
[ Where problems could occur ]
The fix changes how the `bal_rank_mask` and `max_xattr_size` fields are
decoded, based on detection of the on-wire layout. The patch assumes that
the maximum xattr size is less than 64GiB; larger extended attributes are
highly unlikely. An xattr size larger than 64GiB would probably result in
mis-decoding the protocol and could result in a crash.
The Reef release should have the same field order as the snapshot.
Older releases should not be affected.
[ Other Info / Original Description ]
Also see bug
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2065515
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2089565/+subscriptions