[Bug 2089565] Re: MON and MDS crash upgrading CEPH on ubuntu 24.04 LTS

Maksym Medvied 2089565 at bugs.launchpad.net
Thu Feb 20 17:24:47 UTC 2025


(this description of the fix is added to the patch as well)

bal_rank_mask is stored as a text string holding the decimal
representation of a number. The encoding is the string length (4 bytes,
little endian) followed by the characters of the string, with no
trailing NUL byte.

max_xattr_size is stored as a little-endian uint64_t (8 bytes).
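
As an illustration of the two encodings described above, here is a
minimal standalone C++ sketch. It does not use Ceph's own encode/decode
helpers; the names Bytes, put_u32_le, put_u64_le, encode_bal_rank_mask
and encode_max_xattr_size are hypothetical and exist only for this
example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a 32-bit value, least significant byte first.
void put_u32_le(Bytes& out, std::uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
  }
}

// Append a 64-bit value, least significant byte first.
void put_u64_le(Bytes& out, std::uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
  }
}

// bal_rank_mask layout: 4-byte little-endian length, then the raw
// characters of the string, with no trailing NUL byte.
Bytes encode_bal_rank_mask(const std::string& mask) {
  Bytes out;
  put_u32_le(out, static_cast<std::uint32_t>(mask.size()));
  out.insert(out.end(), mask.begin(), mask.end());
  return out;
}

// max_xattr_size layout: a plain 8-byte little-endian integer.
Bytes encode_max_xattr_size(std::uint64_t size) {
  Bytes out;
  put_u64_le(out, size);
  return out;
}
```

With these helpers, encode_bal_rank_mask("-1") yields the 6 bytes
02 00 00 00 2d 31, and encode_max_xattr_size(65536) yields the 8 bytes
00 00 01 00 00 00 00 00.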

The following patch was used to get the hexdumps below:

diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc
index 76a57ac443..d36bed2257 100644
--- a/src/mon/MDSMonitor.cc
+++ b/src/mon/MDSMonitor.cc
@@ -143,6 +143,7 @@ void MDSMonitor::update_from_paxos(bool *need_bootstrap)

   ceph_assert(fsmap_bl.length() > 0);
   dout(10) << __func__ << " got " << version << dendl;
+  fsmap_bl.hexdump(std::cout);
   try {
     PaxosFSMap::decode(fsmap_bl);
   } catch (const ceph::buffer::malformed_input& e) {


This is what the relevant part of the bufferlist looks like for the squid snapshot 19.2.0~git20240301.4c76c50-0ubuntu6:

ceph-mon[...]: 00000620  00 00 00 00 00 00 00 00  00 00 00 00 01 07 00 00  |................|
ceph-mon[...]: 00000630  00 63 65 70 68 2d 66 73  00 00 00 00 00 00 00 00  |.ceph-fs........|
ceph-mon[...]: 00000640  ff ff ff ff 00 00 00 00  00 00 00 00 00 02 00 00  |................|
                                                                 ******** <<< bal_rank_mask
                                                                              string length
                                                                   (4 bytes, the value is 2)
the top byte of the                                                     
string length  >>>>>>>>> ** ##### <<< bal_rank_mask itself (it's string "-1" here)
ceph-mon[...]: 00000650  00 2d 31 00 00 01 00 00  00 00 00 00 00 00 00 00  |.-1.............|
                                 ^^^^^^^^^^^^^^^^^^^^^^^^ max_xattr_size
                                                         (the default value 65536)
ceph-mon[...]: 00000660  00 00 00 00 00 00 00 01  01 05 00 00 00 00 00 00  |................|
ceph-mon[...]: 00000670  00 00 01 00 00 00 b1 17  00 00 00 00 00 00 01 00  |................|
ceph-mon[...]: 00000680  00 00 00 00 00 00 00 00  00 00 01 a1 6e ad 67 53  |............n.gS|


And this is what the relevant part of the bufferlist looks like for
the squid release 19.2.0-0ubuntu0.24.04.1:

ceph-mon[...]: 000003a0  00 00 00 00 00 00 00 00  01 07 00 00 00 63 65 70  |.............cep|
ceph-mon[...]: 000003b0  68 2d 66 73 00 00 00 00  00 00 00 00 ff ff ff ff  |h-fs............|
ceph-mon[...]: 000003c0  00 00 00 00 00 00 00 00  00 00 00 01 00 00 00 00  |................|
                                                   ^^^^^^^^^^^^^^^^^^^^
                        vv <<<<<<<<<<<<<<<<<<<<<<<< max_xattr_size (8 bytes, the value is 65536)
ceph-mon[...]: 000003d0  00 02 00 00 00 2d 31 01  01 05 00 00 00 00 00 00  |.....-1.........|
                           ^^^^^^^^^^^ ##### <<< bal_rank_mask (string, the value is "-1")
       bal_rank_mask string length (2)

The fix looks at the byte 4 bytes ahead of the current decode position
(if the current position is 0x3C9, the code looks at the byte at
0x3CD). In the squid release that byte is the fifth byte of
max_xattr_size (bits 32 to 39), so it is almost always 0; it is non-zero
only when max_xattr_size is 4GiB or more, which is highly unlikely. In
the snapshot it is the first character of bal_rank_mask, which is either
"-" (0x2D) or a digit (0x30 to 0x39). The patch assumes that
max_xattr_size is less than 64GiB, compares the byte against 0x10, and
then decodes bal_rank_mask and max_xattr_size in the order that matches
the detected layout.
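
The detection described above can be sketched in standalone C++ as
follows. This is not the code from the attached patch and does not use
Ceph's bufferlist API; TailFields, read_le, read_string and decode_tail
are hypothetical names, and treating a byte >= 0x10 as "snapshot order"
is an assumption about how the comparison against 0x10 is applied. The
threshold itself follows from 0x10 << 32 = 2^36 bytes = 64GiB.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct TailFields {
  std::string bal_rank_mask;
  std::uint64_t max_xattr_size;
  bool snapshot_order;  // true if bal_rank_mask came first on disk
};

// Read an n-byte little-endian integer (n <= 8) and advance pos.
std::uint64_t read_le(const std::vector<std::uint8_t>& buf,
                      std::size_t& pos, std::size_t n) {
  if (pos > buf.size() || n > buf.size() - pos) {
    throw std::out_of_range("truncated buffer");
  }
  std::uint64_t v = 0;
  for (std::size_t i = 0; i < n; ++i) {
    v |= static_cast<std::uint64_t>(buf[pos + i]) << (8 * i);
  }
  pos += n;
  return v;
}

// Read a string stored as a 4-byte length followed by raw characters.
std::string read_string(const std::vector<std::uint8_t>& buf,
                        std::size_t& pos) {
  const std::size_t len = static_cast<std::size_t>(read_le(buf, pos, 4));
  if (len > buf.size() - pos) {
    throw std::out_of_range("truncated string");
  }
  std::string s(reinterpret_cast<const char*>(buf.data() + pos), len);
  pos += len;
  return s;
}

// Peek at the byte 4 positions ahead to pick the decoding order.
TailFields decode_tail(const std::vector<std::uint8_t>& buf,
                       std::size_t pos) {
  if (pos > buf.size() || buf.size() - pos < 5) {
    throw std::out_of_range("not enough bytes to peek");
  }
  TailFields f{};
  // Release: this is byte 4 of max_xattr_size (< 0x10 below 64GiB).
  // Snapshot: this is the first character of bal_rank_mask.
  f.snapshot_order = buf[pos + 4] >= 0x10;
  if (f.snapshot_order) {
    f.bal_rank_mask = read_string(buf, pos);
    f.max_xattr_size = read_le(buf, pos, 8);
  } else {
    f.max_xattr_size = read_le(buf, pos, 8);
    f.bal_rank_mask = read_string(buf, pos);
  }
  return f;
}
```

Fed the 14 bytes starting at 0x64D from the snapshot dump, or the 14
bytes starting at 0x3C9 from the release dump, this sketch returns
bal_rank_mask "-1" and max_xattr_size 65536 in both cases.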

** Patch added: "src/mds/MDSMap: decode max_xattr_size and bal_rank_mask in the right order"
   https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2089565/+attachment/5859087/+files/mds-MDSMap-decode-max_xattr_size-and-bal_rank_mask.patch

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/2089565

Title:
  MON and MDS crash upgrading CEPH on ubuntu 24.04 LTS

Status in ceph package in Ubuntu:
  In Progress

Bug description:
  This issue is a continuation of
  https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2065515

  
  On Ubuntu 24.04 LTS we upgraded Ceph to 19.2.0-0ubuntu0.24.04.1.

  The previous release was 19.2.0~git20240301.4c76c50-0ubuntu6.

  On every upgrade (tested on 2 different clusters) ceph-mon ends up
  crashing repeatedly with the stack trace below:

  ```
   ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
   1: /lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x788409245320]
   2: pthread_kill()
   3: gsignal()
   4: abort()
   5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5) [0x7884096a5ff5]
   6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da) [0x7884096bb0da]
   7: (std::unexpected()+0) [0x7884096a5a55]
   8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391) [0x7884096bb391]
   9: (ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+0x193) [0x78840a293593]
   10: (MDSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xca1) [0x78840a4c3ab1]
   11: (Filesystem::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x1c3) [0x78840a4e4303]
   12: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x280) [0x78840a4e6ef0]
   13: (MDSMonitor::update_from_paxos(bool*)+0x291) [0x631ac5dea801]
   14: (Monitor::refresh_from_paxos(bool*)+0x124) [0x631ac5b7a164]
   15: (Monitor::preinit()+0x98e) [0x631ac5bb2fbe]
   16: main()
   17: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x78840922a1ca]
   18: __libc_start_main()
   19: _start()
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

  ```

  
  Mitigation:
  A rollback to the previous release 19.2.0~git20240301.4c76c50-0ubuntu6 is still possible and restores service.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2089565/+subscriptions



