[Bug 2154304] Re: ceph radosgw segv with keystone auth

Peter Sabaini 2154304 at bugs.launchpad.net
Sat May 30 07:08:54 UTC 2026


Some attempt at bot-backed analysis courtesy of pi/gpt-5.5

----

## Executive summary

`radosgw` on Ubuntu Resolute / Ceph Tentacle crashes during Keystone-
backed S3 authentication because Boost.Asio's
`call_stack<thread_context, thread_info_base>` state is left partially
initialized when shared libraries built with different Asio TLS
configurations are loaded into one process.

The immediate crash is:

```text
RGW http_manager thread
  -> ceph::async::detail::CompletionImpl<...>::destroy_post()
  -> Boost.Asio small-block recycling path
  -> pthread_getspecific(0)
  -> returns GnuTLS RNG/crypto state beginning "expand 32-byte k"
  -> Asio treats bytes at +8 ("2-byte k") as thread_info_base*
  -> SIGSEGV
```

The root cause is not GnuTLS. GnuTLS legitimately owns pthread key `0`.
The bug is that Ceph/RGW's Boost.Asio `thread_context` `top_` object is
marked initialized while its pthread key field remains zero.

Tracer v3 identifies the writer responsible: `libboost_process.so.1.90.0`
exports an unversioned Boost.Asio guard symbol for
`call_stack<thread_context, thread_info_base>::top_`, but it neither
exports nor initializes the matching pthread-TSS `top_` key object. Its
GOT relocation for the guard resolves to `radosgw`'s GNU-unique guard, so
`libboost_process` sets `radosgw`'s guard to `1` before Ceph's pthread-TSS
constructor runs. Ceph then skips `pthread_key_create()` and leaves
`radosgw`'s `top_.tss_key_ == 0`.

This is a mixed Boost.Asio TLS-model / DSO symbol-preemption bug:

```text
Ceph/radosgw:       BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION => pthread TSS
libboost_process:  distro/default Boost.Asio compiler TLS behavior
```

## Affected / observed versions

Confirmed affected:

```text
Ubuntu resolute / 26.04
Ceph Tentacle radosgw 20.2.0-0ubuntu2
Ceph Tentacle PPA radosgw 20.2.1-0ubuntu1~bpo26.04.1~ppa202605042247
Ceph Tentacle PPA radosgw 20.2.1-0ubuntu1~bpo26.04.1~ppa202605272015
```

Vulnerable PPA `/usr/bin/radosgw` identity used for v3 proof:

```text
SHA256: 122a3f8640fed3d75d88d80d0b33676e9f1ae338f1d92400b04958ff1a7fd3b7
```

Validated compiler-TLS candidate `/usr/bin/radosgw`:

```text
SHA256: 63c88ad26eae42d4ee9793b8e887ee9ede5c107469aa011fa3679aa429f85aa1
Build ID: 46acf7cd0a5615bd1d1eae805a75377c61cf8538
```

## Background: why Ceph disabled Asio compiler TLS

Ceph upstream commit:

```text
29ee772263c7ab3c3bf33038bd989336ae3064ad
librados: workaround for boost::asio use of static member variables
```

added the global compile definition:

```text
-DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION
```

and exported selected Asio `call_stack<...>::top_` and guard symbols
through `src/librados/librados.map`.

The commit addressed a real historical problem: Boost.Asio is header-
only and uses static member variables for thread-local call-stack state.
If Asio appears in multiple DSOs, such as `librados` and `librbd`, each
DSO can get separate static state unless the dynamic linker coalesces
the symbols correctly. The Ceph workaround forced Asio to use pthread
TSS and manually exported selected static variables so the loader could
consolidate them.

The current RGW crash shows that this workaround is no longer safe as a
blanket process-wide assumption when the same process also loads distro
Boost libraries built with the default Asio compiler-TLS configuration.

## Root cause details

### Asio TLS model switch

Boost.Asio chooses its thread-specific pointer implementation
approximately like this:

```cpp
#if defined(BOOST_ASIO_HAS_THREAD_KEYWORD_EXTENSION)
  // selects keyword_tss_ptr<T>: compiler TLS, e.g. __thread/thread_local storage
#elif defined(BOOST_ASIO_HAS_PTHREADS)
  // selects posix_tss_ptr<T>: pthread_key_create / pthread_getspecific / pthread_setspecific
#endif
```

`BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION` disables compiler TLS and
forces the pthread path.

In Ceph's forced pthread mode, `call_stack<thread_context,
thread_info_base>::top_` is a static object containing a
`pthread_key_t`. In compiler-TLS mode, the state is represented by TLS
symbols such as:

```text
boost::asio::detail::keyword_tss_ptr<...>::value_
```
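The difference between the two models can be sketched in a few lines. This is a minimal illustration, not Boost.Asio's code: the class names `sketch_keyword_tss_ptr` and `sketch_posix_tss_ptr` and the helper `both_models_round_trip` are hypothetical and exist only to show where the per-thread pointer lives in each model.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical stand-in for the compiler-TLS flavour: the pointer lives in
// a thread_local variable, so no pthread key is ever allocated.
template <typename T>
class sketch_keyword_tss_ptr {
public:
  T* get() const { return value_; }
  void set(T* v) { value_ = v; }
private:
  static thread_local T* value_;
};
template <typename T>
thread_local T* sketch_keyword_tss_ptr<T>::value_ = nullptr;

// Hypothetical stand-in for the pthread-TSS flavour: the object owns a
// pthread_key_t that must be created before get()/set() are meaningful.
template <typename T>
class sketch_posix_tss_ptr {
public:
  sketch_posix_tss_ptr() {
    int rc = pthread_key_create(&key_, nullptr);
    assert(rc == 0);
    (void)rc;
  }
  ~sketch_posix_tss_ptr() { pthread_key_delete(key_); }
  sketch_posix_tss_ptr(const sketch_posix_tss_ptr&) = delete;
  sketch_posix_tss_ptr& operator=(const sketch_posix_tss_ptr&) = delete;
  T* get() const { return static_cast<T*>(pthread_getspecific(key_)); }
  void set(T* v) { pthread_setspecific(key_, v); }
private:
  pthread_key_t key_;
};

// Returns true if both flavours round-trip a pointer on the calling thread.
inline bool both_models_round_trip() {
  int a = 1, b = 2;
  sketch_keyword_tss_ptr<int> kw;
  sketch_posix_tss_ptr<int> px;
  kw.set(&a);
  px.set(&b);
  const bool ok = kw.get() == &a && px.get() == &b;
  kw.set(nullptr);  // do not leave a dangling pointer in thread-local storage
  return ok;
}
```

The point of the sketch is that only the pthread flavour has a construction step that can be skipped. If its constructor never runs, `key_` keeps whatever value it had, and `get()` reads whichever pthread key happens to have that number.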

### Proven failure sequence

1. `libgnutls.so.30` creates pthread key `0` during process startup.
2. `libboost_process.so.1.90.0` static initialization runs.
3. `libboost_process` has an unversioned Asio guard relocation for `call_stack<thread_context, thread_info_base>::top_`.
4. The relocation resolves to `radosgw`'s GNU-unique guard address.
5. `libboost_process` sets the guard byte to `1` but has no corresponding pthread-TSS `top_` key object and does not call `pthread_key_create()` for it.
6. Ceph/radosgw constructors later observe the guard as already initialized and skip `posix_tss_ptr_create()` for `radosgw`'s `thread_context` `top_`.
7. `radosgw`'s `thread_context` `top_` key remains zero.
8. Runtime Asio code calls `pthread_getspecific(0)` and reads GnuTLS state as if it were Asio call-stack state.
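
Steps 5 to 7 can be modelled in isolation. The sketch below is a hypothetical single-file model, not the real initializers: `guarded_tss_model`, `pthread_tss_initializer`, `guard_only_initializer`, and `is_split` are illustrative names, and the `key_created` flag exists only so the sketch can report the state.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical model of one guarded static: a guard byte plus the
// pthread key that the guarded constructor is supposed to create.
struct guarded_tss_model {
  unsigned char guard = 0;   // 0 = not constructed, 1 = constructed
  pthread_key_t key{};       // stays zero until pthread_key_create() runs
  bool key_created = false;  // bookkeeping for this sketch only
};

// What a pthread-TSS build does on first use: create the key, then set
// the guard. If the guard is already set, the key creation is skipped.
inline void pthread_tss_initializer(guarded_tss_model& s) {
  if (s.guard != 0) return;
  if (pthread_key_create(&s.key, nullptr) == 0) s.key_created = true;
  s.guard = 1;
}

// What step 5 describes: a writer that shares the guard but has no
// key object to initialize.
inline void guard_only_initializer(guarded_tss_model& s) {
  if (s.guard == 0) s.guard = 1;
}

// The broken state from step 7: guard set, key never created.
inline bool is_split(const guarded_tss_model& s) {
  return s.guard == 1 && !s.key_created;
}
```

Running `guard_only_initializer` before `pthread_tss_initializer` on the same object leaves `is_split` true and `key` at its zero value, which corresponds to the `guard=1, key=0` state described above. Running only `pthread_tss_initializer` produces a consistent object.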

## Key evidence

Artifact root:

```text
/home/ubuntu/rgw-s3-crash-bug/artifacts/20260529T_asio_guard_tss_v3
```

Important files:

```text
RESULT.md
unit-logs/asio_guard_tls_tracer_v3.after-crash.log.gz
unit-logs/asio_guard_tls_tracer_v3.after-crash.summary.txt
unit-logs/live-top-inspection-v3.txt
unit-logs/v3-loaded-asio-symbols.txt
unit-logs/libboost_process-asio-thread-analysis.txt
unit-logs/live-libboost-process-got-inspection-v3.txt
static-analysis/static-init-and-relocation-report.md
reproducers/ceph-like-reproducer-summary.md
```

Tracer v3 key creation evidence:

```text
GnuTLS:          key_ptr=/usr/lib/.../libgnutls.so.30     key=0
libceph-common: key_ptr=/usr/bin/radosgw:...strand...top_ key=1
radosgw:        key_ptr=/usr/bin/radosgw:...await...top_  key=3
```

No `pthread_key_create()` was seen for `radosgw`'s
`thread_context/thread_info_base` `top_`.

Crash-path evidence:

```text
op=key_get key=0 ... caller=/usr/bin/radosgw:CompletionImpl<...>::destroy_post(...)+0x578
bytes16=657870616e642033322d62797465206b ascii=expand 32-byte k
```
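
The `bytes16` field can be decoded to check the "+8" claim from the crash chain. The sketch below uses hypothetical helper names (`hex_to_bytes`, `load_le64`, `is_canonical_48`) and assumes a little-endian x86-64 pointer load with 48-bit virtual addressing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode a hex string such as the tracer's bytes16 field into raw bytes.
inline std::string hex_to_bytes(const std::string& hex) {
  std::string out;
  for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
    out.push_back(static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16)));
  return out;
}

// Read the 8 bytes at `offset` the way a little-endian pointer load
// would, which is how the text at +8 becomes a bogus pointer value.
inline std::uint64_t load_le64(const std::string& bytes, std::size_t offset) {
  std::uint64_t v = 0;
  for (int i = 7; i >= 0; --i)
    v = (v << 8) | static_cast<unsigned char>(bytes.at(offset + i));
  return v;
}

// Assumption: 48-bit virtual addresses, where bits 63..47 must all be equal.
inline bool is_canonical_48(std::uint64_t addr) {
  const std::uint64_t top = addr >> 47;  // the upper 17 bits
  return top == 0 || top == 0x1ffff;
}
```

Decoding `657870616e642033322d62797465206b` gives `expand 32-byte k`, and the eight bytes at offset 8 are `2-byte k`. Loaded as a little-endian pointer they form `0x6b20657479622d32`, which is not a canonical address under the 48-bit assumption, so a fault on dereference is consistent with the observed `SIGSEGV`.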

Live memory after vulnerable restart:

```text
radosgw thread guard: 0x1
radosgw thread top_: 0x0
radosgw strand top_: key 1
radosgw await top_:  key 3
```

`libboost_process` evidence:

```text
/usr/lib/x86_64-linux-gnu/libboost_process.so.1.90.0
  exports guard variable for boost::asio::detail::call_stack<...thread_context...>::top_
  has R_X86_64_GLOB_DAT relocation for that guard
  does not export/create the matching pthread-TSS top_ key object
```

Disassembly shows guard-only initialization:

```asm
5fd2: mov GOT(guard for call_stack<thread_context,...>::top_), %r14
5fd9: cmpb $0x0,(%r14)
5fdf: movb $0x1,(%r14)   # if zero
```

Live GOT inspection proves the preemption target:

```text
libboost_process GOT_guard_thread_top -> 0x5f92865c5240
radosgw_thread_guard                  = 0x5f92865c5240 data=01...
libboost_process own_guard symbol     data=00...
radosgw_thread_top                    data=00...
```

## Validated mitigations / candidates

### 1. Preferred root-cause fix: use compiler TLS consistently in Ceph

Change:

```text
stop defining BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION globally
```

Effect:

- Ceph uses the same Asio compiler-TLS path as distro Boost libraries.
- `radosgw`, `libceph-common`, and `librados` expose/use `keyword_tss_ptr<...>::value_` TLS symbols rather than pthread-key objects for this state.
- The vulnerable `pthread_getspecific(0)` path for Asio `thread_context` is eliminated.

Validation so far:

```text
25/25 Keystone-backed S3 attempts passed
5/5 wrapper attempts passed
post-v3 restore check: 3/3 attempts passed
service active, NRestarts=0
```

Caveat:

This can potentially reintroduce the old issue that commit `29ee772...`
worked around: split Boost.Asio static/TLS state across `librados`,
`librbd`, and other DSOs. A package-quality fix must audit and test
this, not just RGW.

Recommended package-quality follow-up:

- build full Debian packages, not only focused binaries;
- verify compile commands contain neither `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION` nor `BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING` unless deliberately scoped;
- audit symbols and relocations for `keyword_tss_ptr<...>::value_` and related guards across `radosgw`, `libceph-common`, `librados`, `librbd`, and any other Ceph Asio users;
- test librados/librbd Asio use cases specifically, including in-process multi-DSO scenarios;
- consider updating symbol maps if compiler-TLS Asio symbols must be exported/coalesced for the original librados/librbd use case.

### 2. Targeted containment: rebuild/vendor Boost.Process with Ceph's Asio pthread-TSS macro

Idea:

Build a private `libboost_process.so.1.90.0` with:

```text
-DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION
```

and ensure `radosgw` loads it instead of the distro `libboost_process`.

Why it may help:

- It aligns Boost.Process with Ceph's pthread-TSS Asio mode.
- The guard-only writer should become a real pthread-TSS `top_` initializer that calls `pthread_key_create()`.
- It may preserve the intent of Ceph's old `librados` workaround while avoiding this specific mixed-mode crash.

Required validation:

- confirm `radosgw` actually loads the vendored Boost.Process;
- confirm no second system Boost.Process copy is loaded;
- inspect symbols: both guard and matching pthread-TSS `top_` state should be present;
- rerun tracer v3: `radosgw` thread guard must not be `1` while key remains `0`;
- rerun Keystone-backed S3 reproducer and broader RGW tests.

Pros:

- likely targeted fix for this exact `libboost_process` guard-preemption problem;
- avoids immediately changing Ceph's global Asio TLS model.

Cons:

- vendoring distro Boost shared libraries is high-maintenance;
- security updates and ABI compatibility become harder;
- only addresses `libboost_process`; any other Asio-using distro DSO with similar symbol behavior could still trigger a related problem;
- less attractive for Ubuntu/Debian packaging and upstream Ceph.

Status: plausible but not yet built/tested in this investigation.

### 3. Symbol isolation / visibility hardening for Ceph Asio detail symbols

Idea:

Prevent external distro Boost DSOs from partially preempting Ceph/RGW's
Asio detail symbols, especially guard-only preemption.

Possible forms:

- hide Ceph's Boost.Asio detail symbols from the dynamic symbol table where safe;
- use version scripts to keep guard/top pairs local or consistently versioned;
- namespace or otherwise isolate Ceph's bundled/header-only Asio detail instantiations;
- ensure guard and top objects cannot be split across different TLS implementations.

Pros:

- attacks the symbol-preemption class directly;
- may preserve Ceph's pthread-TSS workaround without vendoring Boost libraries.

Cons:

- high risk: the original `librados` workaround intentionally exported some Asio symbols for cross-DSO coalescing;
- hiding them naively can reintroduce librados/librbd split-state bugs;
- requires careful ELF/version-script design and broader Ceph regression testing.

Status: conceptually valid, but riskier than aligning on compiler TLS.

### 4. RGW/radosgw-scoped `BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING`

Idea:

Build RGW/radosgw with:

```text
BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING
```

Effect:

- avoids the specific Asio small-block recycling deallocation path that dereferences the corrupted `thread_info_base` pointer;
- does not fix the underlying `guard=1, key=0` state.
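
Why disabling recycling sidesteps the crash can be shown with a simplified model. This is not Boost.Asio's implementation: `sketch_thread_info`, `sketch_allocate`, and `sketch_deallocate` are hypothetical, the cache has a single slot, and all blocks are assumed to be the same size.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical, heavily simplified per-thread cache with one slot.
struct sketch_thread_info {
  void* reusable_block = nullptr;
};

// Simplified deallocation path. With recycling enabled the function
// dereferences `this_thread`, so a garbage pointer there would fault.
// With recycling disabled the pointer is never touched.
inline void sketch_deallocate(sketch_thread_info* this_thread, void* block,
                              bool recycling_enabled) {
  if (recycling_enabled && this_thread != nullptr &&
      this_thread->reusable_block == nullptr) {
    this_thread->reusable_block = block;  // keep for the next allocation
    return;
  }
  ::operator delete(block);
}

// Simplified allocation path: reuse the cached block if there is one.
// Assumes every request in this sketch has the same size `n`.
inline void* sketch_allocate(sketch_thread_info* this_thread, std::size_t n,
                             bool recycling_enabled) {
  if (recycling_enabled && this_thread != nullptr &&
      this_thread->reusable_block != nullptr) {
    void* block = this_thread->reusable_block;
    this_thread->reusable_block = nullptr;
    return block;
  }
  return ::operator new(n);
}
```

In this model the crash corresponds to `this_thread` being the bogus value read through `pthread_getspecific(0)`. Passing `recycling_enabled = false` never reads through that pointer, which matches the containment behaviour described above, but the pointer itself is still wrong.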

Validation so far:

```text
fresh model baseline crashed
RGW/radosgw-scoped no-recycling candidate passed 25/25 Keystone-backed S3 attempts
```

Pros:

- proven containment for the observed crash path;
- smaller behavioral change than a global Asio TLS-model change;
- potentially SRU-friendly as an emergency workaround.

Cons:

- not root cause;
- leaves invalid Asio state in the process;
- another Asio code path could still read the bad `thread_context` state later;
- may affect allocation performance.

Status: validated fallback containment, not preferred final fix.


### Root-cause package fix

Preferred path:

1. Produce a full Debian Ceph build that removes global `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION`.
2. Do not include the RGW no-recycling workaround in the root-fix build unless explicitly choosing fallback containment.
3. Audit Asio symbols/relocations across all Ceph DSOs, especially `librados` and `librbd`, to ensure the old multi-DSO issue is not reintroduced.
4. Add/execute librados/librbd Asio regression tests in addition to RGW Keystone-backed S3 tests.
5. Rerun RGW Keystone S3 reproducer at scale and broader RGW/Ceph smoke tests.

Alternative if compiler TLS is rejected or regresses:

1. Prototype vendored/rebuilt Boost.Process with `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION`.
2. Validate with tracer v3 that it initializes the matching pthread-TSS key and no longer leaves `radosgw` guard/key split.
3. Audit the process for any other loaded Asio-using distro DSOs with similar guard-only symbol behavior.

## Validation checklist for any final fix

A final package-quality fix should pass:

- Keystone-backed S3 CreateBucket/ListBuckets/DeleteBucket loop, at least 25/25;
- repeated wrapper run with clean exit;
- service remains `ActiveState=active`, `NRestarts=0`;
- no `SIGSEGV`, `core-dump`, `destroy_post()+0x586`, or `pthread_getspecific(0)` Asio crash in journal;
- compile command audit for intended Asio macros;
- symbol audit showing no mixed guard/top state across `radosgw`, `libceph-common`, `librados`, `librbd`, `libboost_process`;
- librados/librbd Asio regression coverage for the original `29ee772...` scenario.
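
The journal check in this list could be automated with a small scanner. The function below is a hypothetical sketch: `find_crash_lines` and its marker list are assumptions derived from the checklist item, and the caller is expected to supply journal lines it has already collected.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the lines that contain any of the crash markers named in the
// checklist. The marker list is an assumption drawn from that item.
inline std::vector<std::string> find_crash_lines(
    const std::vector<std::string>& journal_lines) {
  static const char* const markers[] = {
      "SIGSEGV", "core-dump", "destroy_post", "pthread_getspecific"};
  std::vector<std::string> hits;
  for (const std::string& line : journal_lines) {
    for (const char* m : markers) {
      if (line.find(m) != std::string::npos) {
        hits.push_back(line);  // record each matching line once
        break;
      }
    }
  }
  return hits;
}
```

An empty result is necessary but not sufficient: the scanner only covers the journal item and says nothing about the symbol audit or the librados/librbd regression coverage.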

## Bottom line

The crash is caused by a mixed Boost.Asio TLS model in one process.
Ceph's historical pthread-TSS workaround collides with distro
Boost.Process's compiler-TLS Asio symbols, allowing Boost.Process to set
`radosgw`'s Asio guard without creating `radosgw`'s pthread key.

The most robust long-term fix is to make Ceph and distro Boost libraries
use the same Asio TLS model, preferably compiler TLS, while explicitly
validating the old librados/librbd multi-DSO concern. The safest
already-validated containment is RGW/radosgw-scoped
`BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING`, but that should remain a
fallback because it does not repair the corrupted Asio state.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/2154304

Title:
  ceph radosgw segv with keystone auth

Status in ceph package in Ubuntu:
  New

Bug description:
  On Ubuntu resolute, radosgw from 20.2.0-0ubuntu2 reproducibly segfaults
  when an S3 request is authenticated with Keystone EC2 credentials. The
  client sees "502 Bad Gateway" and systemd restarts radosgw.

  This is the Keystone-backed S3 auth path. Local RGW users can work
  when local auth is tried first, but a request using Keystone EC2
  credentials still crashes RGW.

  This can be reproduced using the CI bundle with adaptation for
  resolute and zaza tests.

  Stack trace

  ```
  Thread 602 "http_manager" received signal SIGSEGV, Segmentation fault.
  0x00005e52632bf3b6 in ceph::async::detail::CompletionImpl<boost::asio::any_io_executor, boost::asio::detail::spawn_handler<boost::asio::any_io_executor, void (boost::system::error_code), void>, void, boost::system::error_code>::destroy_post(std::tuple<boost::system::error_code>&&) ()

  Thread 602 (Thread 0x753657aea6c0 (LWP 63226) "http_manager"):
  #0  0x00005e52632bf3b6 in ceph::async::detail::CompletionImpl<boost::asio::any_io_executor, boost::asio::detail::spawn_handler<boost::asio::any_io_executor, void (boost::system::error_code), void>, void, boost::system::error_code>::destroy_post(std::tuple<boost::system::error_code>&&) ()
  #1  0x00005e5263345d18 in RGWHTTPManager::finish_request(rgw_http_req_data*, int, long) ()
  #2  0x00005e5263346c5a in RGWHTTPManager::reqs_thread_entry() ()
  #3  0x00005e52633475e1 in RGWHTTPManager::ReqsThread::entry() ()
  #4  0x000075365eca40da in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:454
  #5  0x000075365ed377ac in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
  ```

  Juju status

  ```
  # juju status
  Model              Controller  Cloud/Region         Version  SLA          Timestamp
  zaza-9bef958d671a  lxd         localhost/localhost  3.6.23   unsupported  11:39:43Z

  App           Version          Status  Scale  Charm         Channel      Rev  Exposed  Message
  ceph-mon      20.2.0           active      3  ceph-mon                     0  no       Unit is ready and clustered
  ceph-osd      20.2.0           active      3  ceph-osd                    15  no       Unit is ready (1 OSD)
  ceph-radosgw  20.2.0           active      1  ceph-radosgw                26  no       Unit is ready
  keystone      25.0.0           active      1  keystone      latest/edge  774  no       Application Ready
  mysql         8.0.44-0ubun...  active      1  mysql         8.0/stable   444  no       
  vault         1.8.8            active      1  vault         latest/edge  363  no       Unit is ready (active: true, mlock: disabled)

  Unit             Workload  Agent  Machine  Public address                          Ports           Message
  ceph-mon/0*      active    idle   4        10.33.157.168                                           Unit is ready and clustered
  ceph-mon/1       active    idle   5        10.33.157.25                                            Unit is ready and clustered
  ceph-mon/2       active    idle   6        fd42:34da:d2af:4f10:216:3eff:fea3:1a7b                  Unit is ready and clustered
  ceph-osd/0*      active    idle   1        fd42:34da:d2af:4f10:216:3eff:fe5d:8e95                  Unit is ready (1 OSD)
  ceph-osd/1       active    idle   2        10.33.157.197                                           Unit is ready (1 OSD)
  ceph-osd/2       active    idle   3        10.33.157.131                                           Unit is ready (1 OSD)
  ceph-radosgw/0*  active    idle   0        fd42:34da:d2af:4f10:216:3eff:fece:c073  80/tcp          Unit is ready
  keystone/0*      active    idle   8        10.33.157.200                           5000/tcp        Unit is ready
  mysql/0*         active    idle   7        10.33.157.76                            3306,33060/tcp  Primary
  vault/0*         active    idle   9        fd42:34da:d2af:4f10:216:3eff:fe01:376d  8200/tcp        Unit is ready (active: true, mlock: disabled)

  Machine  State    Address                                 Inst id        Base          AZ       Message
  0        started  fd42:34da:d2af:4f10:216:3eff:fece:c073  juju-a24339-0  ubuntu@26.04  sunkern  Running
  1        started  fd42:34da:d2af:4f10:216:3eff:fe5d:8e95  juju-a24339-1  ubuntu@26.04  sunkern  Running
  2        started  10.33.157.197                           juju-a24339-2  ubuntu@26.04  sunkern  Running
  3        started  10.33.157.131                           juju-a24339-3  ubuntu@26.04  sunkern  Running
  4        started  10.33.157.168                           juju-a24339-4  ubuntu@26.04  sunkern  Running
  5        started  10.33.157.25                            juju-a24339-5  ubuntu@26.04  sunkern  Running
  6        started  fd42:34da:d2af:4f10:216:3eff:fea3:1a7b  juju-a24339-6  ubuntu@26.04  sunkern  Running
  7        started  10.33.157.76                            juju-a24339-7  ubuntu@22.04  sunkern  Running
  8        started  10.33.157.200                           juju-a24339-8  ubuntu@24.04  sunkern  Running
  9        started  fd42:34da:d2af:4f10:216:3eff:fe01:376d  juju-a24339-9  ubuntu@24.04  sunkern  Running
  ```

  versions

  ```
  Ubuntu 26.04 LTS (resolute)
  radosgw:      20.2.0-0ubuntu2
  ceph-common:  20.2.0-0ubuntu2
  librados2:    20.2.0-0ubuntu2
  radosgw --version:
    ceph version 20.2.0 (69f84cc2651aa259a15bc192ddaabd3baba07489) tentacle (stable - RelWithDebInfo)
  ```

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2154304/+subscriptions



