[Bug 1921749] Re: nautilus: ceph radosgw beast frontend coroutine stack corruption

Mauricio Faria de Oliveira 1921749@bugs.launchpad.net
Mon Mar 29 14:00:05 UTC 2021


coredump #5

	Oct 23 16:41:01 HOSTNAME radosgw[1616]: tcmalloc: large alloc 94195528343552 bytes == (nil) @  0x7f97ec494887 0x7f97ec1cb1b2 0x7f97ec1ff948 0x7f97ec1d08be 0x7f97ec1db01e 0x7f97ec1dd433 0x7f97ec1e843d 0x7f97ec1e8680 0x7f97ec1afe3a 0x7f97ec1857ec 0x55ab96ae9dbc 0x55ab967d8a22 0x55ab96b60e7d 0x55ab967d4bcb 0x55ab96ae76c3 0x55ab96af598a 0x55ab967c5a1a 0x55ab9667e1cf 0x55ab96801118 0x55ab96801c87 0x55ab96766a0b 0x55ab966bd660 0x55ab966be86d 0x55ab96bfad5f
	Oct 23 16:41:01 HOSTNAME radosgw[1616]: terminate called without an active exception
	Oct 23 16:41:01 HOSTNAME radosgw[1616]: *** Caught signal (Aborted) **

	#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
	#1  0x000055ab967466b0 in reraise_fatal (signum=6) at ./src/global/signal_handler.cc:81
	#2  handle_fatal_signal (signum=6) at ./src/global/signal_handler.cc:326
	#3  <signal handler called>
	#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
	#5  0x00007f97e09c18b1 in __GI_abort () at abort.c:79
	#6  0x00007f97e13b4957 in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
	#7  0x00007f97e13baae6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:47
	#8  0x00007f97e13bab21 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:57
	#9  0x000055ab96801eb0 in rgw::auth::Strategy::apply (dpp=0x55ab9a627000, auth_strategy=..., s=<optimized out>) at ./src/rgw/rgw_auth.cc:273
	#10 0x000055ab96766a0b in process_request (store=0x55ab99917800, rest=0x7fff342285a0, req=0x55ab9b020930, frontend_prefix=..., auth_registry=..., client_io=client_io@entry=0x55ab9b0209c0, olog=0x0, yield=..., scheduler=0x55ab9ac1f928, http_ret=0x0) at ./src/rgw/rgw_process.cc:251
	#11 0x000055ab966bd660 in (anonymous namespace)::handle_connection<boost::asio::basic_stream_socket<boost::asio::ip::tcp> > (context=..., env=..., stream=..., buffer=..., pause_mutex=..., scheduler=<optimized out>, ec=..., yield=..., is_ssl=false) at ./src/rgw/rgw_asio_frontend.cc:167
	#12 0x000055ab966be86d in (anonymous namespace)::AsioFrontend::<lambda(boost::asio::yield_context)>::operator() (yield=..., __closure=0x55ab9a8aa1e8) at ./src/rgw/rgw_asio_frontend.cc:638
	#13 boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::executor_type> >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(boost::asio::yield_context)> >::operator() (ca=..., this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/asio/impl/spawn.hpp:337
	#14 boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::executor_type> >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(boost::asio::yield_context)> >&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> >::run (this=0x55ab9b021f60) at ./obj-x86_64-linux-gnu/boost/include/boost/coroutine/detail/push_coroutine_object.hpp:302
	#15 boost::coroutines::detail::trampoline_push_void<boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::executor_type> >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::<lambda(boost::asio::yield_context)> >&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> > >(boost::context::detail::transfer_t) (t=...) at ./obj-x86_64-linux-gnu/boost/include/boost/coroutine/detail/trampoline_push.hpp:70
	#16 0x000055ab96bfad5f in make_fcontext ()
	#17 0x000055ab9703fcd0 in vtable for boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::executor_type> >, (anonymous namespace)::AsioFrontend::accept((anonymous namespace)::AsioFrontend::Listener&, boost::system::error_code)::{lambda(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)#4}>&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> > ()
	#18 0x0000000000000026 in ?? ()
	#19 0x0000000000000000 in ?? ()


The ceph error message (the tcmalloc large-alloc line above) provides more stack frames than GDB does; each address can be resolved with:

        (gdb) info symbol <address>

	0x7f97ec494887 tc_newarray + 455 in section google_malloc of /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
	0x7f97ec1cb1b2 void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag)
	0x7f97ec1ff948 std::vector<OSDOp, std::allocator<OSDOp> >::operator=(std::vector<OSDOp, std::allocator<OSDOp> > const&)
	0x7f97ec1d08be Objecter::_prepare_osd_op
	0x7f97ec1db01e Objecter::_send_op
	0x7f97ec1dd433 Objecter::_op_submit
	0x7f97ec1e843d Objecter::_op_submit_with_budget
	0x7f97ec1e8680 Objecter::op_submit
	0x7f97ec1afe3a librados::IoCtxImpl::operate_read
	0x7f97ec1857ec librados::v14_2_0::IoCtx::operate
	0x55ab96ae9dbc rgw_rados_operate
	0x55ab967d8a22 RGWSI_SysObj_Core::read
	0x55ab96b60e7d RGWSI_SysObj_Cache::read
	0x55ab967d4bcb RGWSI_SysObj::Obj::ROp::read
	0x55ab96ae76c3 rgw_get_system_obj
	0x55ab96af598a rgw_get_user_info_from_index
	0x55ab967c5a1a rgw::auth::swift::SignedTokenEngine::authenticate
	0x55ab9667e1cf rgw::auth::swift::SignedTokenEngine::authenticate
	0x55ab96801118 rgw::auth::Strategy::authenticate // ::apply from stack trace gets here, it calls auth_strategy.authenticate()
	0x55ab96801c87 process_request
	0x55ab96766a0b f10 // GDB frame #10: process_request
	0x55ab966bd660 f11 // GDB frame #11: handle_connection
	0x55ab966be86d f12 // GDB frame #12: AsioFrontend lambda / coro_entry_point
	0x55ab96bfad5f f16 // GDB frame #16: make_fcontext

	So this is apparently a normal memory allocation path; but what about that allocation size?

	Oct 23 16:41:01 HOSTNAME radosgw[1616]: tcmalloc: large alloc 94195528343552 bytes == (nil) @

	94195528343552 = 0x55AB9B01A000
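
	A quick standalone conversion check (plain C++; the value is copied from the log, nothing ceph-specific is assumed):

		#include <cstdio>

		int main() {
		    // The "large alloc" size reported by tcmalloc, in decimal.
		    unsigned long long alloc_size = 94195528343552ULL;
		    // Prints 0x55ab9b01a000 -- a plausible userspace data/stack pointer,
		    // not a wrapped-around or negative size_t.
		    std::printf("0x%llx\n", alloc_size);
		    return 0;
		}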

	It is not an overflow or a negative value, but a pointer in the range of the data pointers seen in the stack trace,
	apparently similar to what is seen in the other case of corruption (next coredump):
	
		req=0x55ab9b020930
		client_io=client_io@entry=0x55ab9b0209c0
		this=0x55ab9b021f60
		
	(gdb) | info files | grep 55ab9b	// covers the 0x55ab9b01..., 0x55ab9b02... pointers above
        0x000055ab98fe4000 - 0x000055ab9b1a2000 is load5

	The address also lies within the live stack, between the stack pointers of frames 0 and 4:
	
	(gdb) frame 4
	(gdb) info reg $rsp
	rsp            0x55ab9b01ecc0      0x55ab9b01ecc0
	
	(gdb) frame 0
	(gdb) info reg $rsp
	rsp            0x55ab9b017cb0      0x55ab9b017cb0
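
	So the bogus allocation size 0x55ab9b01a000 sits between frame 0's rsp and frame 4's rsp:
	an address inside the coroutine's own stack was used as an allocation size.
	A minimal sketch of that range check (plain C++; addresses copied from the GDB output above):

		#include <cstdio>

		int main() {
		    unsigned long long rsp_frame0 = 0x55ab9b017cb0ULL;  // innermost frame (stack grows down)
		    unsigned long long rsp_frame4 = 0x55ab9b01ecc0ULL;  // outermost of frames 0-4
		    unsigned long long alloc_size = 0x55ab9b01a000ULL;  // tcmalloc "large alloc" size
		    // Prints 1: the bogus size lies inside the active stack region.
		    std::printf("%d\n", rsp_frame0 < alloc_size && alloc_size < rsp_frame4);
		    return 0;
		}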

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1921749

Title:
  nautilus: ceph radosgw beast frontend coroutine stack corruption

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Confirmed
Status in ceph package in Ubuntu:
  Fix Released

Bug description:
  [Impact]

  The radosgw beast frontend in ceph nautilus might hit coroutine stack
  corruption on startup and while handling requests.

  This is usually observed right at the startup of the ceph-radosgw systemd unit, sometimes about a minute later;
  but it might occur at any time while handling requests, depending on the coroutine/request's function path and stack usage.

  The symptoms are usually a crash with a stack trace listing TCMalloc (de)allocate/release-to-central-cache functions,
  but less frequent signs are large allocs in the _terabytes_ range (a pointer into the stack used as the allocation size)
  and stack traces showing function return addresses (RIP) that are actually pointers to a stack address.

  This is not widely hit in Ubuntu, as most deployments use the ceph-radosgw charm, which hardcodes 'civetweb'
  as the rgw frontend (civetweb is _not_ affected); custom/cephadm deployments that choose 'beast' might hit this.

    @ charm-ceph-radosgw/templates/ceph.conf
          rgw frontends = civetweb port={{ port }}

  Let's report this LP bug for documentation and tracking purposes until
  UCA gets the fixes.

  [Fix]

  This was reported by an Ubuntu Advantage user, and by another user in ceph tracker #47910 [1].
  It had already been reported and fixed in Octopus [2] (confirmed by the UA user: no longer affected).

  The Nautilus backport has recently been merged [3, 4] and should be
  available in v14.2.19.

  [Test Case]

  The conditions to trigger the bug aren't clear, but they appear to be related to EC pools with very large buckets,
  and of course the radosgw beast frontend must be enabled (civetweb is not affected).
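
  For reference, a minimal sketch of enabling the beast frontend in ceph.conf (section name and
  port are illustrative, not taken from this deployment):

    @ /etc/ceph/ceph.conf
          [client.rgw.HOSTNAME]
          rgw frontends = beast port=80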

  [Where problems could occur]

  The fixes are restricted to the beast frontend, specifically to the coroutines used to handle requests,
  so problems would probably be seen only in request handling with the beast frontend.
  Workarounds thus include switching back to the civetweb frontend.
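
  For instance, mirroring the charm template quoted above (port illustrative):

          rgw frontends = civetweb port=80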

  The fixes change core/base parts of the RGW beast frontend code, but these changes have been in place since the Octopus release.
  The other user/reporter in the ceph tracker has been using the patches for weeks with no regressions;
  the ceph tests have passed, and serious issues would likely be caught by upstream ceph CI.

  [1] https://tracker.ceph.com/issues/47910 report tracker (nautilus)
  [2] https://tracker.ceph.com/issues/43739 master tracker (octopus)
  [3] https://tracker.ceph.com/issues/43921 backport tracker (nautilus)
  [4] https://github.com/ceph/ceph/pull/39947 github PR

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1921749/+subscriptions


