[Bug 1969000] Re: [SRU] bail from handle_command() if _generate_command_map() fails

nikhil kshirsagar 1969000 at bugs.launchpad.net
Fri Sep 15 09:40:10 UTC 2023


** Description changed:

  [Impact]
- If improper json data is passed to rados using a manual curl command, or invalid json data through a script like the python eg. shown, it can end up crashing the mon.
+ If improper json data is passed to rados using a manual curl command, or invalid json data through a script like the python example shown, it can end up crashing the mon. The crash itself is already fixed through https://github.com/ceph/ceph/pull/45891 and is already in the Ubuntu octopus point release. This fix is a performance improvement: it returns from Monitor::handle_command() immediately once the exception thrown by _generate_command_map() has been caught and dealt with, instead of continuing to execute the rest of the function.
  
  [Test Plan]
- Setup a ceph octopus cluster. A manual run of curl with malformed request like this results in the crash.
+ Set up a ceph octopus cluster. A manual run of curl with a malformed request like this results in the exception being thrown.
  
  curl -k -H "Authorization: Basic $TOKEN"
  "https://juju-3b3d82-10-lxd-0:8003/request" -X POST -d '{"prefix":"auth
  add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd
  '\''allow rw pool=testpool01'\''"}'
  
- The request status shows it is still in the queue if you check with curl
- -k -X GET "$endpoint/request"
- 
- [
-     {
-         "failed": [],
-         "finished": [],
-         "has_failed": false,
-         "id": "140576245092648",
-         "is_finished": false,
-         "is_waiting": false,
-         "running": [
-             {
-                 "command": "auth add entity=client.testuser02 caps=mon 'allow r' osd 'allow rw pool=testpool01'",
-                 "outb": "",
-                 "outs": ""
-             }
-         ],
-         "state": "pending",
-         "waiting": []
-     }
- ]
- 
  This reproduces without the restful API too.
  
- Use this python script to reproduce the issue. Run it on the mon node,
+ This python script, run on the mon node, also causes the exception to
+ be thrown, due to the malformed json:
  
- root at juju-8c5f4a-sts-stein-bionic-0:/root# cat testcrashnorest.py
+ root at focal-testing:/home/nikhil/Downloads/ceph_upstream/ceph/build# cat test_ceph.sh 
  #!/usr/bin/env python3
  import json
  import rados
- c = rados.Rados(conffile='/etc/ceph/ceph.conf')
+ c = rados.Rados(conffile='ceph.conf')
  c.connect()
  cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
+ #cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":["mon", "allow r", "osd", "allow rw pool=testpool01"]})
  print(c.mon_command(cmd, b''))
+ root at focal-testing:/home/nikhil/Downloads/ceph_upstream/ceph/build# ./test_ceph.sh 
+ (-22, b'', "bad or missing field 'caps'")
  
- root at juju-8c5f4a-sts-stein-bionic-0:/root# ceph -s
- cluster:
- id: 6123c916-a12a-11ec-bc02-fa163e9f86e0
- health: HEALTH_WARN
- mon is allowing insecure global_id reclaim
- 1 monitors have not enabled msgr2
- Reduced data availability: 69 pgs inactive
- 1921 daemons have recently crashed
+ Once this exception is caught correctly as above and the error message
+ is printed, and we bail out of the function due to this SRU, the
+ following code
+ https://github.com/ceph/ceph/blob/6a585618451421f0744745e4dd3636f10f678397/src/mon/Monitor.cc#L3349C1-L3358C6
+ should never be executed, because we return as soon as the exception
+ is caught and handled.
  
- services:
- mon: 1 daemons, quorum juju-8c5f4a-sts-stein-bionic-0 (age 92s)
- mgr: juju-8c5f4a-sts-stein-bionic-0(active, since 22m)
- osd: 3 osds: 3 up (since 22h), 3 in
+ Therefore, setting the debug level to 10 will validate that the
+ message at the end of _allowed_command() is never seen:
  
- data:
- pools: 4 pools, 69 pgs
- objects: 0 objects, 0 B
- usage: 0 B used, 0 B / 0 B avail
- pgs: 100.000% pgs unknown
- 69 unknown
+  dout(10) << __func__ << " " << (capable ? "" : "not ") << "capable" <<
+ dendl;
  
- root at juju-8c5f4a-sts-stein-bionic-0:/root# ./testcrashnorest.py
+ Pasting the function code for reference,
+ (https://github.com/ceph/ceph/blob/octopus/src/mon/Monitor.cc#L3061)
  
- ^C
+ bool Monitor::_allowed_command(MonSession *s, const string &module,
+ 			       const string &prefix, const cmdmap_t& cmdmap,
+                                const map<string,string>& param_str_map,
+                                const MonCommand *this_cmd) {
  
- (note the script hangs)
+   bool cmd_r = this_cmd->requires_perm('r');
+   bool cmd_w = this_cmd->requires_perm('w');
+   bool cmd_x = this_cmd->requires_perm('x');
  
- mon logs show - https://pastebin.com/Cuu9jkmu , the crash is seen, and
- then it seems like systemd restarts ceph, so ceph -s hangs for a while
- then we see the restart messages like.
+   bool capable = s->caps.is_capable(
+     g_ceph_context,
+     s->entity_name,
+     module, prefix, param_str_map,
+     cmd_r, cmd_w, cmd_x,
+     s->get_peer_socket_addr());
  
- --- end dump of recent events ---
- 2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 set uid:gid to 64045:64045 (ceph:ceph)
- 2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable), process ceph-mon, pid 490328
- 2022-03-16T05:35:30.111+0000 7ffaf0e3b540 0 pidfile_write: ignore empty --pid-file
- 2022-03-16T05:35:30.139+0000 7ffaf0e3b540 0 load: jerasure load: lrc load: isa
- 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
- 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
- 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
- 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option compression = kNoCompression
- 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option level_compaction_dynamic_level_bytes = true
- 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 0 set rocksdb option write_buffer_size = 33554432
- 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 1 rocksdb: do_open column families: [default]
- 2022-03-16T05:35:30.143+0000 7ffaf0e3b540 4 rocksdb: RocksDB version: 6.1.2
+   dout(10) << __func__ << " " << (capable ? "" : "not ") << "capable" << dendl;  
+   return capable;
+ }
  
- While the fix to catch the exception is already part of the Octopus 15.2.17 point release, (PR https://github.com/ceph/ceph/pull/45891),
- we need a cleanup fix that has now been also merged upstream - https://github.com/ceph/ceph/pull/45547
- 
- The cleanup fix bails out of the function if the exception is
- thrown, therefore avoiding continuing in the function 
- void Monitor::handle_command(MonOpRequestRef op) in this
- error situation.
+ So it would be reasonable to test the SRU by verifying that at log
+ level 10, we do not see the
+ https://github.com/ceph/ceph/blob/octopus/src/mon/Monitor.cc#L3077
+ debug message.
  
  [Where problems could occur]
  The only potential problem with this cleanup fix is if
  some additional code in the Monitor::handle_command(MonOpRequestRef op) function needs to run before returning. I have looked for such potential conditions and not found any.
  
  [Other Info]
+ While the fix to catch the exception is already part of the Octopus 15.2.17 point release (PR https://github.com/ceph/ceph/pull/45891),
+ we need this cleanup fix, which has now also been merged to master upstream through https://github.com/ceph/ceph/pull/48044
+ 
+ The cleanup fix bails out of Monitor::handle_command(MonOpRequestRef
+ op) as soon as the exception is thrown, avoiding any further
+ execution of the function in this error situation.
+ 
  Upstream tracker - https://tracker.ceph.com/issues/57859
  Fixed in ceph main through https://github.com/ceph/ceph/pull/48044

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1969000

Title:
  [SRU] bail from handle_command() if _generate_command_map() fails

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive ussuri series:
  In Progress
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Focal:
  In Progress
Status in ceph source package in Impish:
  Won't Fix
Status in ceph source package in Jammy:
  Fix Released
Status in ceph source package in Kinetic:
  Fix Released
Status in ceph source package in Lunar:
  Fix Released
Status in ceph source package in Mantic:
  Fix Released

Bug description:
  [Impact]
  If improper json data is passed to rados using a manual curl command, or invalid json data through a script like the python example shown, it can end up crashing the mon. The crash itself is already fixed through https://github.com/ceph/ceph/pull/45891 and is already in the Ubuntu octopus point release. This fix is a performance improvement: it returns from Monitor::handle_command() immediately once the exception thrown by _generate_command_map() has been caught and dealt with, instead of continuing to execute the rest of the function.

  [Test Plan]
  Set up a ceph octopus cluster. A manual run of curl with a malformed request like this results in the exception being thrown.

  curl -k -H "Authorization: Basic $TOKEN"
  "https://juju-3b3d82-10-lxd-0:8003/request" -X POST -d
  '{"prefix":"auth add","entity":"client.testuser02","caps":"mon
  '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"}'

  This reproduces without the restful API too.

  This python script, run on the mon node, also causes the exception to
  be thrown, due to the malformed json:

  root at focal-testing:/home/nikhil/Downloads/ceph_upstream/ceph/build# cat test_ceph.sh 
  #!/usr/bin/env python3
  import json
  import rados
  c = rados.Rados(conffile='ceph.conf')
  c.connect()
  cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":"mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"})
  #cmd = json.dumps({"prefix":"auth add","entity":"client.testuser02","caps":["mon", "allow r", "osd", "allow rw pool=testpool01"]})
  print(c.mon_command(cmd, b''))
  root at focal-testing:/home/nikhil/Downloads/ceph_upstream/ceph/build# ./test_ceph.sh 
  (-22, b'', "bad or missing field 'caps'")
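
  For illustration, here is a minimal client-side sketch of why the first form fails: the mon expects "caps" to be an array of alternating service/capability strings, not one quoted string. The caps_shape_ok helper is hypothetical, not part of librados:

```python
# The malformed "caps" from the reproducer: a single quoted string.
bad = {"prefix": "auth add",
       "entity": "client.testuser02",
       "caps": "mon '\''allow r'\'' osd '\''allow rw pool=testpool01'\''"}

# The well-formed variant (the commented-out line in the script above):
good = {"prefix": "auth add",
        "entity": "client.testuser02",
        "caps": ["mon", "allow r", "osd", "allow rw pool=testpool01"]}

def caps_shape_ok(cmd):
    """Rough client-side sanity check (hypothetical helper): caps must be
    a list of strings with an even number of entries (service, cap pairs)."""
    caps = cmd.get("caps")
    return (isinstance(caps, list)
            and len(caps) % 2 == 0
            and all(isinstance(c, str) for c in caps))

print(caps_shape_ok(bad))   # False: the mon replies "bad or missing field 'caps'"
print(caps_shape_ok(good))  # True
```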

  Once this exception is caught correctly as above and the error
  message is printed, and we bail out of the function due to this SRU,
  the following code
  https://github.com/ceph/ceph/blob/6a585618451421f0744745e4dd3636f10f678397/src/mon/Monitor.cc#L3349C1-L3358C6
  should never be executed, because we return as soon as the exception
  is caught and handled.

  Therefore, setting the debug level to 10 will validate that the
  message at the end of _allowed_command() is never seen:

   dout(10) << __func__ << " " << (capable ? "" : "not ") << "capable"
  << dendl;

  Pasting the function code for reference,
  (https://github.com/ceph/ceph/blob/octopus/src/mon/Monitor.cc#L3061)

  bool Monitor::_allowed_command(MonSession *s, const string &module,
  			       const string &prefix, const cmdmap_t& cmdmap,
                                 const map<string,string>& param_str_map,
                                 const MonCommand *this_cmd) {

    bool cmd_r = this_cmd->requires_perm('r');
    bool cmd_w = this_cmd->requires_perm('w');
    bool cmd_x = this_cmd->requires_perm('x');

    bool capable = s->caps.is_capable(
      g_ceph_context,
      s->entity_name,
      module, prefix, param_str_map,
      cmd_r, cmd_w, cmd_x,
      s->get_peer_socket_addr());

    dout(10) << __func__ << " " << (capable ? "" : "not ") << "capable" << dendl;  
    return capable;
  }

  So it would be reasonable to test the SRU by verifying that at log
  level 10, we do not see the
  https://github.com/ceph/ceph/blob/octopus/src/mon/Monitor.cc#L3077
  debug message.
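
  A sketch of that verification, assuming mon id "a" and the default log location (adjust both for your deployment):

```shell
# Raise the mon debug level to 10 (config change; run against the cluster).
ceph tell mon.a injectargs '--debug_mon 10/10'
# Re-run the malformed reproducer above, then check that _allowed_command()
# never logged its "capable" message for that request:
grep '_allowed_command' /var/log/ceph/ceph-mon.a.log
```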

  [Where problems could occur]
  The only potential problem with this cleanup fix is if
  some additional code in the Monitor::handle_command(MonOpRequestRef op) function needs to run before returning. I have looked for such potential conditions and not found any.

  [Other Info]
  While the fix to catch the exception is already part of the Octopus 15.2.17 point release (PR https://github.com/ceph/ceph/pull/45891),
  we need this cleanup fix, which has now also been merged to master upstream through https://github.com/ceph/ceph/pull/48044

  The cleanup fix bails out of Monitor::handle_command(MonOpRequestRef
  op) as soon as the exception is thrown, avoiding any further
  execution of the function in this error situation.

  Upstream tracker - https://tracker.ceph.com/issues/57859
  Fixed in ceph main through https://github.com/ceph/ceph/pull/48044

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1969000/+subscriptions




More information about the Ubuntu-openstack-bugs mailing list