[Bug 2154304] Re: ceph radosgw segv with keystone auth
Peter Sabaini
2154304 at bugs.launchpad.net
Sat May 30 07:08:54 UTC 2026
Some attempt at bot-backed analysis courtesy of pi/gpt-5.5
----
## Executive summary
`radosgw` on Ubuntu Resolute / Ceph Tentacle crashes during
Keystone-backed S3 authentication because Boost.Asio's
`call_stack<thread_context, thread_info_base>` state is left partially
initialized when shared libraries built with different Asio TLS
configurations are loaded into the same process.
The immediate crash is:
```text
RGW http_manager thread
-> ceph::async::detail::CompletionImpl<...>::destroy_post()
-> Boost.Asio small-block recycling path
-> pthread_getspecific(0)
-> returns GnuTLS RNG/crypto state beginning "expand 32-byte k"
-> Asio treats bytes at +8 ("2-byte k") as thread_info_base*
-> SIGSEGV
```
The root cause is not GnuTLS. GnuTLS legitimately owns pthread key `0`.
The bug is that Ceph/RGW's Boost.Asio `thread_context` `top_` object is
marked initialized while its pthread key field remains zero.
Tracer v3 identifies the missing writer: `libboost_process.so.1.90.0`
exports an unversioned Boost.Asio guard symbol for
`call_stack<thread_context, thread_info_base>::top_`, but it does not
export or initialize the matching pthread-TSS `top_` key object. Its GOT
relocation for the guard resolves to `radosgw`'s GNU-unique guard. It
therefore sets `radosgw`'s guard to `1` before Ceph's pthread-TSS
constructor runs, causing Ceph to skip `pthread_key_create()` and leave
`radosgw`'s `top_.tss_key_ == 0`.
This is a mixed Boost.Asio TLS-model / DSO symbol-preemption bug:
```text
Ceph/radosgw: BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION => pthread TSS
libboost_process: distro/default Boost.Asio compiler TLS behavior
```
## Affected / observed versions
Confirmed affected:
```text
Ubuntu resolute / 26.04
Ceph Tentacle radosgw 20.2.0-0ubuntu2
Ceph Tentacle PPA radosgw 20.2.1-0ubuntu1~bpo26.04.1~ppa202605042247
Ceph Tentacle PPA radosgw 20.2.1-0ubuntu1~bpo26.04.1~ppa202605272015
```
Vulnerable PPA `/usr/bin/radosgw` identity used for v3 proof:
```text
SHA256: 122a3f8640fed3d75d88d80d0b33676e9f1ae338f1d92400b04958ff1a7fd3b7
```
Validated compiler-TLS candidate `/usr/bin/radosgw`:
```text
SHA256: 63c88ad26eae42d4ee9793b8e887ee9ede5c107469aa011fa3679aa429f85aa1
Build ID: 46acf7cd0a5615bd1d1eae805a75377c61cf8538
```
## Background: why Ceph disabled Asio compiler TLS
Ceph upstream commit:
```text
29ee772263c7ab3c3bf33038bd989336ae3064ad
librados: workaround for boost::asio use of static member variables
```
added the global compile definition:
```text
-DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION
```
and exported selected Asio `call_stack<...>::top_`/guard symbols from
`src/librados/librados.map`.
The commit addressed a real historical problem: Boost.Asio is header-
only and uses static member variables for thread-local call-stack state.
If Asio appears in multiple DSOs, such as `librados` and `librbd`, each
DSO can get separate static state unless the dynamic linker coalesces
the symbols correctly. The Ceph workaround forced Asio to use pthread
TSS and manually exported selected static variables so the loader could
consolidate them.
The current RGW crash shows that workaround is no longer safe as a
blanket process-wide assumption when the same process also loads distro
Boost libraries built with the default Asio compiler-TLS configuration.
## Root cause details
### Asio TLS model switch
Boost.Asio chooses its thread-specific pointer implementation
approximately like this:
```cpp
#if defined(BOOST_ASIO_HAS_THREAD_KEYWORD_EXTENSION)
keyword_tss_ptr<T> // compiler TLS, e.g. __thread/thread_local storage
#elif defined(BOOST_ASIO_HAS_PTHREADS)
posix_tss_ptr<T> // pthread_key_create/get/set
#endif
```
`BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION` disables compiler TLS and
forces the pthread path.
In Ceph's forced pthread mode, `call_stack<thread_context,
thread_info_base>::top_` is a static object containing a
`pthread_key_t`. In compiler-TLS mode, the state is represented by TLS
symbols such as:
```text
boost::asio::detail::keyword_tss_ptr<...>::value_
```
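As a rough illustration of the difference between the two storage strategies, the sketch below uses two hypothetical classes, `PosixSlot` and `KeywordSlot`. They are simplified stand-ins, not Asio's real `posix_tss_ptr` / `keyword_tss_ptr`, and the sketch assumes a POSIX threads platform. The point it tries to show is that only the pthread variant has a runtime key-creation step, so only that variant can end up with an uncreated key if its constructor is skipped.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical stand-in for the pthread-backed slot: the per-thread value is
// reached through a process-wide pthread_key_t that must be created first.
class PosixSlot {
public:
  PosixSlot() {
    int rc = pthread_key_create(&key_, nullptr);
    assert(rc == 0 && "pthread_key_create failed");
    (void)rc;
  }
  ~PosixSlot() { pthread_key_delete(key_); }
  PosixSlot(const PosixSlot&) = delete;
  PosixSlot& operator=(const PosixSlot&) = delete;

  void set(void* p) { pthread_setspecific(key_, p); }
  void* get() const { return pthread_getspecific(key_); }

private:
  pthread_key_t key_;
};

// Hypothetical stand-in for the compiler-TLS slot: the value is a
// thread_local variable, so there is no key and no runtime creation step.
class KeywordSlot {
public:
  void set(void* p) { value_ = p; }
  void* get() const { return value_; }

private:
  static thread_local void* value_;
};
thread_local void* KeywordSlot::value_ = nullptr;

// Store and read back one pointer through each kind of slot.
bool slots_round_trip() {
  int a = 1;
  int b = 2;
  PosixSlot posix;
  KeywordSlot keyword;
  posix.set(&a);
  keyword.set(&b);
  return posix.get() == &a && keyword.get() == &b;
}
```

Both slots behave the same when their setup runs normally; the failure described in this report only arises on the `PosixSlot`-style path, where the key is separate state that has to be initialized.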
### Proven failure sequence
1. `libgnutls.so.30` creates pthread key `0` during process startup.
2. `libboost_process.so.1.90.0` static initialization runs.
3. `libboost_process` has an unversioned Asio guard relocation for `call_stack<thread_context, thread_info_base>::top_`.
4. The relocation resolves to `radosgw`'s GNU-unique guard address.
5. `libboost_process` writes the guard byte to `1` but has no corresponding pthread-TSS `top_` key object and does not call `pthread_key_create()` for it.
6. Ceph/radosgw constructors later observe the guard as already initialized and skip `posix_tss_ptr_create()` for `radosgw`'s `thread_context` `top_`.
7. `radosgw`'s `thread_context` `top_` key remains zero.
8. Runtime Asio code calls `pthread_getspecific(0)` and reads GnuTLS state as if it were Asio call-stack state.
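
The sequence above can be modelled in a single translation unit. The following is a minimal sketch under stated assumptions, not the code path in Ceph or Boost: `TopState`, `guard_only_init`, `guarded_key_init`, `Outcome`, and `run_sequence` are hypothetical names, the guard is modelled as a plain byte, and `pthread_key_t` is assumed to be an integer type as on glibc. Calling `pthread_getspecific()` with a key that was never created is undefined by POSIX; the sketch relies on glibc returning whatever is associated with that key number.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical model of one call_stack<...>::top_ instance in pthread-TSS
// mode: a guard byte plus the key object the guard is supposed to protect.
struct TopState {
  unsigned char guard = 0;   // 0 means the guarded constructor has not run
  pthread_key_t key{};       // zero-initialized, like an object in static storage
  bool key_created = false;  // bookkeeping for this sketch only
};

// Steps 3-5: a writer that shares the guard but has no key object of its own.
void guard_only_init(TopState& shared) {
  if (shared.guard == 0) shared.guard = 1;
}

// Step 6: the guarded constructor creates the key only if the guard is clear.
void guarded_key_init(TopState& top) {
  if (top.guard != 0) return;  // "already initialized": key creation is skipped
  int rc = pthread_key_create(&top.key, nullptr);
  assert(rc == 0);
  (void)rc;
  top.key_created = true;
  top.guard = 1;
}

struct Outcome {
  bool key_created;          // did the guarded constructor create a key?
  bool aliases_foreign_key;  // does top.key equal the unrelated key?
  bool read_foreign_data;    // did the read return the unrelated owner's data?
};

Outcome run_sequence(bool guard_only_writer_runs_first) {
  static char foreign_state[] = "expand 32-byte k";

  // Step 1: an unrelated library creates a key first and stores its own state.
  pthread_key_t foreign_key;
  int rc = pthread_key_create(&foreign_key, nullptr);
  assert(rc == 0);
  (void)rc;
  pthread_setspecific(foreign_key, foreign_state);

  TopState top;
  if (guard_only_writer_runs_first) guard_only_init(top);
  guarded_key_init(top);

  // Step 8: runtime code reads through whatever key value top holds.
  void* seen = pthread_getspecific(top.key);
  Outcome out{top.key_created, top.key == foreign_key, seen == foreign_state};

  if (top.key_created) pthread_key_delete(top.key);
  pthread_key_delete(foreign_key);
  return out;
}
```

With `run_sequence(true)` the key is never created. Whether the read then returns the foreign state depends on the foreign key having the value 0, as the GnuTLS key did in the traced process.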
## Key evidence
Artifact root:
```text
/home/ubuntu/rgw-s3-crash-bug/artifacts/20260529T_asio_guard_tss_v3
```
Important files:
```text
RESULT.md
unit-logs/asio_guard_tls_tracer_v3.after-crash.log.gz
unit-logs/asio_guard_tls_tracer_v3.after-crash.summary.txt
unit-logs/live-top-inspection-v3.txt
unit-logs/v3-loaded-asio-symbols.txt
unit-logs/libboost_process-asio-thread-analysis.txt
unit-logs/live-libboost-process-got-inspection-v3.txt
static-analysis/static-init-and-relocation-report.md
reproducers/ceph-like-reproducer-summary.md
```
Tracer v3 key creation evidence:
```text
GnuTLS: key_ptr=/usr/lib/.../libgnutls.so.30 key=0
libceph-common: key_ptr=/usr/bin/radosgw:...strand...top_ key=1
radosgw: key_ptr=/usr/bin/radosgw:...await...top_ key=3
```
No `pthread_key_create()` was seen for `radosgw`'s
`thread_context/thread_info_base` `top_`.
Crash-path evidence:
```text
op=key_get key=0 ... caller=/usr/bin/radosgw:CompletionImpl<...>::destroy_post(...)+0x578
bytes16=657870616e642033322d62797465206b ascii=expand 32-byte k
```
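The `bytes16` value can be decoded by hand. The sketch below is an illustrative helper, with hypothetical function names, that converts the hex dump to bytes and reads the eight bytes at offset 8 as a little-endian integer, which is how an x86-64 pointer load would see them. It also checks whether the result is a canonical address under the assumption of 48-bit virtual addressing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Convert a hex string such as the tracer's bytes16 field into raw bytes.
std::vector<std::uint8_t> from_hex(const std::string& hex) {
  if (hex.size() % 2 != 0) throw std::invalid_argument("odd-length hex string");
  auto nibble = [](char c) -> int {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("non-hex character");
  };
  std::vector<std::uint8_t> out;
  out.reserve(hex.size() / 2);
  for (std::size_t i = 0; i < hex.size(); i += 2) {
    out.push_back(static_cast<std::uint8_t>(nibble(hex[i]) * 16 + nibble(hex[i + 1])));
  }
  return out;
}

// Read 8 bytes starting at `offset` as a little-endian 64-bit integer.
std::uint64_t load_le64(const std::vector<std::uint8_t>& bytes, std::size_t offset) {
  assert(offset + 8 <= bytes.size());
  std::uint64_t v = 0;
  for (int i = 7; i >= 0; --i) {
    v = (v << 8) | bytes[offset + static_cast<std::size_t>(i)];
  }
  return v;
}

// Assumes 48-bit virtual addresses: bits 63..47 must be all zeros or all ones.
bool is_canonical_48(std::uint64_t addr) {
  std::uint64_t top = addr >> 47;
  return top == 0 || top == 0x1FFFF;
}
```

For the dump above, the bytes decode to the ASCII text `expand 32-byte k`, and the eight bytes at offset 8 (`2-byte k`) load as `0x6b20657479622d32`. That value is not a canonical x86-64 address, which is consistent with the dereference faulting rather than reading unrelated memory.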
Live memory after vulnerable restart:
```text
radosgw thread guard: 0x1
radosgw thread top_: 0x0
radosgw strand top_: key 1
radosgw await top_: key 3
```
`libboost_process` evidence:
```text
/usr/lib/x86_64-linux-gnu/libboost_process.so.1.90.0
exports guard variable for boost::asio::detail::call_stack<...thread_context...>::top_
has R_X86_64_GLOB_DAT relocation for that guard
does not export/create the matching pthread-TSS top_ key object
```
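A guard-only export of this kind can be looked for mechanically. The sketch below is a hypothetical audit helper, not part of the investigation's tooling. It assumes the input is a list of demangled, defined dynamic symbol names for one DSO (for example extracted from a demangling symbol dump such as `nm -D -C --defined-only`) and that guards appear with the usual demangled prefix `guard variable for `. It reports every guarded object whose guard is defined but whose object is not.

```cpp
#include <set>
#include <string>
#include <vector>

// Return the names of objects whose guard variable is defined in `demangled`
// while the guarded object itself is missing from the same list.
std::vector<std::string> unpaired_guards(const std::vector<std::string>& demangled) {
  static const std::string prefix = "guard variable for ";
  const std::set<std::string> defined(demangled.begin(), demangled.end());

  std::vector<std::string> out;
  for (const auto& sym : demangled) {
    if (sym.compare(0, prefix.size(), prefix) != 0) continue;  // not a guard
    const std::string object = sym.substr(prefix.size());
    if (defined.count(object) == 0) out.push_back(object);
  }
  return out;
}
```

A non-empty result for a DSO is only a hint that it may write a guard without owning the matching object; relocation and GOT inspection, as in the evidence here, is still needed to see which definition the guard actually binds to at run time.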
Disassembly shows guard-only initialization:
```asm
5fd2: mov GOT(guard for call_stack<thread_context,...>::top_), %r14
5fd9: cmpb $0x0,(%r14)
5fdf: movb $0x1,(%r14) # if zero
```
Live GOT inspection proves the preemption target:
```text
libboost_process GOT_guard_thread_top -> 0x5f92865c5240
radosgw_thread_guard = 0x5f92865c5240 data=01...
libboost_process own_guard symbol data=00...
radosgw_thread_top data=00...
```
## Validated mitigations / candidates
### 1. Preferred root-cause fix: use compiler TLS consistently in Ceph
Change:
```text
stop defining BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION globally
```
Effect:
- Ceph uses the same Asio compiler-TLS path as distro Boost libraries.
- `radosgw`, `libceph-common`, and `librados` expose/use `keyword_tss_ptr<...>::value_` TLS symbols rather than pthread-key objects for this state.
- The vulnerable `pthread_getspecific(0)` path for Asio `thread_context` is eliminated.
Validation so far:
```text
25/25 Keystone-backed S3 attempts passed
5/5 wrapper attempts passed
post-v3 restore check: 3/3 attempts passed
service active, NRestarts=0
```
Caveat:
This could reintroduce the issue that commit `29ee772...` worked
around: Boost.Asio static/TLS state split across `librados`, `librbd`,
and other DSOs. A package-quality fix must audit and test for this
across Ceph, not just in RGW.
Recommended package-quality follow-up:
- build full Debian packages, not only focused binaries;
- verify compile commands contain neither `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION` nor `BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING` unless deliberately scoped;
- audit symbols and relocations for `keyword_tss_ptr<...>::value_` and related guards across `radosgw`, `libceph-common`, `librados`, `librbd`, and any other Ceph Asio users;
- test librados/librbd Asio use cases specifically, including in-process multi-DSO scenarios;
- consider updating symbol maps if compiler-TLS Asio symbols must be exported/coalesced for the original librados/librbd use case.
### 2. Targeted containment: rebuild/vendor Boost.Process with Ceph's Asio pthread-TSS macro
Idea:
Build a private `libboost_process.so.1.90.0` with:
```text
-DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION
```
and ensure `radosgw` loads it instead of the distro `libboost_process`.
Why it may help:
- It aligns Boost.Process with Ceph's pthread-TSS Asio mode.
- The guard-only writer should become a real pthread-TSS `top_` initializer that calls `pthread_key_create()`.
- It may preserve the intent of Ceph's old `librados` workaround while avoiding this specific mixed-mode crash.
Required validation:
- confirm `radosgw` actually loads the vendored Boost.Process;
- confirm no second system Boost.Process copy is loaded;
- inspect symbols: both guard and matching pthread-TSS `top_` state should be present;
- rerun tracer v3: `radosgw` thread guard must not be `1` while key remains `0`;
- rerun Keystone-backed S3 reproducer and broader RGW tests.
Pros:
- likely targeted fix for this exact `libboost_process` guard-preemption problem;
- avoids immediately changing Ceph's global Asio TLS model.
Cons:
- vendoring distro Boost shared libraries is high-maintenance;
- security updates and ABI compatibility become harder;
- only addresses `libboost_process`; any other Asio-using distro DSO with similar symbol behavior could still trigger a related problem;
- less attractive for Ubuntu/Debian packaging and upstream Ceph.
Status: plausible but not yet built/tested in this investigation.
### 3. Symbol isolation / visibility hardening for Ceph Asio detail symbols
Idea:
Prevent external distro Boost DSOs from partially preempting Ceph/RGW's
Asio detail symbols, especially guard-only preemption.
Possible forms:
- hide Ceph's Boost.Asio detail symbols from the dynamic symbol table where safe;
- use version scripts to keep guard/top pairs local or consistently versioned;
- namespace or otherwise isolate Ceph's bundled/header-only Asio detail instantiations;
- ensure guard and top objects cannot be split across different TLS implementations.
Pros:
- attacks the symbol-preemption class directly;
- may preserve Ceph's pthread-TSS workaround without vendoring Boost libraries.
Cons:
- high risk: the original `librados` workaround intentionally exported some Asio symbols for cross-DSO coalescing;
- hiding them naively can reintroduce librados/librbd split-state bugs;
- requires careful ELF/version-script design and broader Ceph regression testing.
Status: conceptually valid, but riskier than aligning on compiler TLS.
### 4. RGW/radosgw-scoped `BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING`
Idea:
Build RGW/radosgw with:
```text
BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING
```
Effect:
- avoids the specific Asio small-block recycling deallocation path that dereferences the corrupted `thread_info_base` pointer;
- does not fix the underlying `guard=1, key=0` state.
Validation so far:
```text
fresh model baseline crashed
RGW/radosgw-scoped no-recycling candidate passed 25/25 Keystone-backed S3 attempts
```
Pros:
- proven containment for the observed crash path;
- smaller behavioral change than a global Asio TLS-model change;
- potentially SRU-friendly as an emergency workaround.
Cons:
- not root cause;
- leaves invalid Asio state in the process;
- another Asio code path could still read the bad `thread_context` state later;
- may affect allocation performance.
Status: validated fallback containment, not preferred final fix.
### Root-cause package fix
Preferred path:
1. Produce a full Debian Ceph build that removes global `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION`.
2. Do not include the RGW no-recycling workaround in the root-fix build unless explicitly choosing fallback containment.
3. Audit Asio symbols/relocations across all Ceph DSOs, especially `librados` and `librbd`, to ensure the old multi-DSO issue is not reintroduced.
4. Add/execute librados/librbd Asio regression tests in addition to RGW Keystone-backed S3 tests.
5. Rerun RGW Keystone S3 reproducer at scale and broader RGW/Ceph smoke tests.
Alternative if compiler TLS is rejected or regresses:
1. Prototype vendored/rebuilt Boost.Process with `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION`.
2. Validate with tracer v3 that it initializes the matching pthread-TSS key and no longer leaves `radosgw` guard/key split.
3. Audit the process for any other loaded Asio-using distro DSOs with similar guard-only symbol behavior.
## Validation checklist for any final fix
A final package-quality fix should pass:
- Keystone-backed S3 CreateBucket/ListBuckets/DeleteBucket loop, at least 25/25;
- repeated wrapper run with clean exit;
- service remains `ActiveState=active`, `NRestarts=0`;
- no `SIGSEGV`, `core-dump`, `destroy_post()+0x586`, or `pthread_getspecific(0)` Asio crash in journal;
- compile command audit for intended Asio macros;
- symbol audit showing no mixed guard/top state across `radosgw`, `libceph-common`, `librados`, `librbd`, `libboost_process`;
- librados/librbd Asio regression coverage for the original `29ee772...` scenario.
## Bottom line
The crash is caused by a mixed Boost.Asio TLS model in one process.
Ceph's historical pthread-TSS workaround collides with distro
Boost.Process's compiler-TLS Asio symbols, allowing Boost.Process to set
`radosgw`'s Asio guard without creating `radosgw`'s pthread key.
The most robust long-term fix is to make Ceph and distro Boost libraries
use the same Asio TLS model, preferably compiler TLS, while explicitly
validating the old librados/librbd multi-DSO concern. The safest
already-validated containment is RGW/radosgw-scoped
`BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING`, but that should remain a
fallback because it does not repair the corrupted Asio state.
--
https://bugs.launchpad.net/bugs/2154304
Title:
ceph radosgw segv with keystone auth
Status in ceph package in Ubuntu:
New
Bug description:
On Ubuntu resolute, radosgw 20.2.0-0ubuntu2 reproducibly segfaults
when an S3 request is authenticated with Keystone EC2 credentials. The
client sees "502 Bad Gateway" and systemd restarts radosgw.
This is the Keystone-backed S3 auth path. Local RGW users can work
when local auth is tried first, but a request using Keystone EC2
credentials still crashes RGW.
This can be reproduced using the CI bundle, with adaptations for
resolute, and the zaza tests.
Stack trace
```
Thread 602 "http_manager" received signal SIGSEGV, Segmentation fault.
0x00005e52632bf3b6 in ceph::async::detail::CompletionImpl<boost::asio::any_io_executor, boost::asio::detail::spawn_handler<boost::asio::any_io_executor, void (boost::system::error_code), void>, void, boost::system::error_code>::destroy_post(std::tuple<boost::system::error_code>&&) ()
Thread 602 (Thread 0x753657aea6c0 (LWP 63226) "http_manager"):
#0 0x00005e52632bf3b6 in ceph::async::detail::CompletionImpl<boost::asio::any_io_executor, boost::asio::detail::spawn_handler<boost::asio::any_io_executor, void (boost::system::error_code), void>, void, boost::system::error_code>::destroy_post(std::tuple<boost::system::error_code>&&) ()
#1 0x00005e5263345d18 in RGWHTTPManager::finish_request(rgw_http_req_data*, int, long) ()
#2 0x00005e5263346c5a in RGWHTTPManager::reqs_thread_entry() ()
#3 0x00005e52633475e1 in RGWHTTPManager::ReqsThread::entry() ()
#4 0x000075365eca40da in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:454
#5 0x000075365ed377ac in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
```
Juju status
```
# juju status
Model              Controller  Cloud/Region         Version  SLA          Timestamp
zaza-9bef958d671a  lxd         localhost/localhost  3.6.23   unsupported  11:39:43Z

App           Version          Status  Scale  Charm         Channel      Rev  Exposed  Message
ceph-mon      20.2.0           active  3      ceph-mon                   0    no       Unit is ready and clustered
ceph-osd      20.2.0           active  3      ceph-osd                   15   no       Unit is ready (1 OSD)
ceph-radosgw  20.2.0           active  1      ceph-radosgw               26   no       Unit is ready
keystone      25.0.0           active  1      keystone      latest/edge  774  no       Application Ready
mysql         8.0.44-0ubun...  active  1      mysql         8.0/stable   444  no
vault         1.8.8            active  1      vault         latest/edge  363  no       Unit is ready (active: true, mlock: disabled)

Unit             Workload  Agent  Machine  Public address                          Ports           Message
ceph-mon/0*      active    idle   4        10.33.157.168                                           Unit is ready and clustered
ceph-mon/1       active    idle   5        10.33.157.25                                            Unit is ready and clustered
ceph-mon/2       active    idle   6        fd42:34da:d2af:4f10:216:3eff:fea3:1a7b                  Unit is ready and clustered
ceph-osd/0*      active    idle   1        fd42:34da:d2af:4f10:216:3eff:fe5d:8e95                  Unit is ready (1 OSD)
ceph-osd/1       active    idle   2        10.33.157.197                                           Unit is ready (1 OSD)
ceph-osd/2       active    idle   3        10.33.157.131                                           Unit is ready (1 OSD)
ceph-radosgw/0*  active    idle   0        fd42:34da:d2af:4f10:216:3eff:fece:c073  80/tcp          Unit is ready
keystone/0*      active    idle   8        10.33.157.200                           5000/tcp        Application Ready
mysql/0*         active    idle   7        10.33.157.76                            3306,33060/tcp  Primary
vault/0*         active    idle   9        fd42:34da:d2af:4f10:216:3eff:fe01:376d  8200/tcp        Unit is ready (active: true, mlock: disabled)

Machine  State    Address                                 Inst id        Base          AZ       Message
0        started  fd42:34da:d2af:4f10:216:3eff:fece:c073  juju-a24339-0  ubuntu@26.04  sunkern  Running
1        started  fd42:34da:d2af:4f10:216:3eff:fe5d:8e95  juju-a24339-1  ubuntu@26.04  sunkern  Running
2        started  10.33.157.197                           juju-a24339-2  ubuntu@26.04  sunkern  Running
3        started  10.33.157.131                           juju-a24339-3  ubuntu@26.04  sunkern  Running
4        started  10.33.157.168                           juju-a24339-4  ubuntu@26.04  sunkern  Running
5        started  10.33.157.25                            juju-a24339-5  ubuntu@26.04  sunkern  Running
6        started  fd42:34da:d2af:4f10:216:3eff:fea3:1a7b  juju-a24339-6  ubuntu@26.04  sunkern  Running
7        started  10.33.157.76                            juju-a24339-7  ubuntu@22.04  sunkern  Running
8        started  10.33.157.200                           juju-a24339-8  ubuntu@24.04  sunkern  Running
9        started  fd42:34da:d2af:4f10:216:3eff:fe01:376d  juju-a24339-9  ubuntu@24.04  sunkern  Running
```
Versions
```
Ubuntu 26.04 LTS (resolute)
radosgw: 20.2.0-0ubuntu2
ceph-common: 20.2.0-0ubuntu2
librados2: 20.2.0-0ubuntu2
radosgw --version:
ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle (stable - RelWithDebInfo)
```