[SRU][N][PATCH 0/2] noble ubuntu_ftrace_smoke_test:mmiotrace timeout on aws:r5.metal (LP: #2121673)
Juerg Haefliger
juerg.haefliger at canonical.com
Wed Sep 17 14:19:26 UTC 2025
BugLink: https://bugs.launchpad.net/bugs/2121673
[Impact]
The ubuntu_ftrace_smoke_test mmiotrace test times out on the 6.8.0-80.80 (2025.08.11) generic kernel, and only on aws:r5.metal instances. The 6.12 kernel works fine. Juerg found the offending commit to be:
memcg: drain obj stock on cpu hotplug teardown
BugLink: https://bugs.launchpad.net/bugs/2119458
commit 9f01b4954490d4ccdbcc2b9be34a9921ceee9cbb upstream.
Currently on cpu hotplug teardown, only memcg stock is drained but we
need to drain the obj stock as well otherwise we will miss the stats
accumulated on the target cpu as well as the nr_bytes cached. The stats
include MEMCG_KMEM, NR_SLAB_RECLAIMABLE_B & NR_SLAB_UNRECLAIMABLE_B. In
addition we are leaking reference to struct obj_cgroup object.
Because nothing in the upstream patchset depends on this commit, we decided to delay applying this patch until the next SRU cycle.
INFO | START ubuntu_ftrace_smoke_test.ftrace-smoke-test ubuntu_ftrace_smoke_test.ftrace-smoke-test timeout=900 timestamp=1756180477 localtime=Aug 26 03:54:37
DEBUG| Persistent state client._record_indent now set to 2
DEBUG| Persistent state client.unexpected_reboot now set to ('ubuntu_ftrace_smoke_test.ftrace-smoke-test', 'ubuntu_ftrace_smoke_test.ftrace-smoke-test')
DEBUG| Waiting for pid 3906 for 900 seconds
WARNI| System python is too old, crash handling disabled
DEBUG| Running '/home/ubuntu/autotest/client/tests/ubuntu_ftrace_smoke_test/ubuntu_ftrace_smoke_test.sh'
DEBUG| [stdout] PASSED (CONFIG_FUNCTION_TRACER=y in /boot/config-6.8.0-80-generic)
DEBUG| [stdout] PASSED (CONFIG_FUNCTION_GRAPH_TRACER=y in /boot/config-6.8.0-80-generic)
DEBUG| [stdout] PASSED (CONFIG_STACK_TRACER=y in /boot/config-6.8.0-80-generic)
DEBUG| [stdout] PASSED (CONFIG_DYNAMIC_FTRACE=y in /boot/config-6.8.0-80-generic)
DEBUG| [stdout] PASSED all expected /sys/kernel/debug/tracing files exist
DEBUG| [stdout] PASSED (function_graph in /sys/kernel/debug/tracing/available_tracers)
DEBUG| [stdout] PASSED (function in /sys/kernel/debug/tracing/available_tracers)
DEBUG| [stdout] PASSED (nop in /sys/kernel/debug/tracing/available_tracers)
DEBUG| [stdout] PASSED (tracer function can be enabled)
DEBUG| [stdout] PASSED (tracer function_graph can be enabled)
ERROR| [stderr] grep: /tmp/ftrace-kernel-trace-3910.tmp.log: binary file matches
DEBUG| [stdout] - tracer function_graph got enough data
DEBUG| [stdout] - tracer function_graph completed
DEBUG| [stdout] - tracer function_graph being turned off
ERROR| [stderr] grep: /tmp/ftrace-kernel-trace-3910.tmp.log: binary file matches
DEBUG| [stdout] - tracer got 231 irq events
DEBUG| [stdout] - tracer timerlat got enough data
DEBUG| [stdout] - tracer timerlat completed
DEBUG| [stdout] - tracer timerlat being turned off
DEBUG| [stdout] - tracer nop being set as current tracer
DEBUG| [stdout] PASSED (tracer timerlat can be enabled (got 660 lines of tracing output))
DEBUG| [stdout] - tracer osnoise got enough data
DEBUG| [stdout] - tracer osnoise completed
DEBUG| [stdout] - tracer osnoise being turned off
DEBUG| [stdout] - tracer nop being set as current tracer
DEBUG| [stdout] PASSED (tracer osnoise can be enabled (got 11 lines of tracing output))
DEBUG| [stdout] - tracer hwlat got enough data
DEBUG| [stdout] - tracer hwlat completed
DEBUG| [stdout] - tracer hwlat being turned off
DEBUG| [stdout] - tracer nop being set as current tracer
DEBUG| [stdout] PASSED (tracer hwlat can be enabled (got 13 lines of tracing output))
DEBUG| [stdout] - tracer blk got enough data
DEBUG| [stdout] - tracer blk completed
DEBUG| [stdout] - tracer blk being turned off
DEBUG| [stdout] - tracer nop being set as current tracer
DEBUG| [stdout] PASSED (tracer blk can be enabled (got 2 lines of tracing output))
DEBUG| [stdout] TIMER END Tue Aug 26 03:58:59 UTC 2025
DEBUG| [stdout] TIMEOUT
DEBUG| [stdout] FAILED: aborting, timeout, took way too long to complete
INFO | Timer expired (900 sec.), nuking pid 3906
INFO | ERROR ubuntu_ftrace_smoke_test.ftrace-smoke-test ubuntu_ftrace_smoke_test.ftrace-smoke-test timestamp=1756181377 localtime=Aug 26 04:09:37 Test timeout expired, rc=15
INFO | END ERROR ubuntu_ftrace_smoke_test.ftrace-smoke-test ubuntu_ftrace_smoke_test.ftrace-smoke-test timestamp=1756181377 localtime=Aug 26 04:09:37
Running 'sudo chcpu -d 1-95' results in:
[ 82.891707] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 82.891959] #PF: supervisor read access in kernel mode
[ 82.891959] #PF: error_code(0x0000) - not-present page
[ 82.891959] PGD 0 P4D 0
[ 82.891959] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 82.891959] CPU: 0 PID: 593 Comm: kworker/0:2 Not tainted 6.8.0-80-generic #80-Ubuntu
[ 82.891959] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 82.891959] Workqueue: events work_for_cpu_fn
[ 82.891959] RIP: 0010:memcg_hotplug_cpu_dead+0x65/0xc0
[ 82.891959] Code: 44 00 00 48 89 df e8 5a ef ff ff 48 89 c3 41 f7 c5 00 02 00 00 74 06 fb 0f 1f 44 00 00 4c 89 e7 e8 f0 cd ff ff e8 6b d9 d0 ff <48> 8b 03 a8 03 75 1e 65 48 ff 08 e8 ab 35 d1 ff 31 c0 5b 41 5c 41
[ 82.891959] RSP: 0018:ffffbd548170bd10 EFLAGS: 00000246
[ 82.891959] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 82.891959] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 82.891959] RBP: ffffbd548170bd28 R08: 0000000000000000 R09: 0000000000000000
[ 82.891959] R10: 000000000000001c R11: 0000000000000000 R12: ffff99183bcb0c00
[ 82.891959] R13: 0000000000000286 R14: 0000000000000001 R15: 0000000000000000
[ 82.891959] FS: 0000000000000000(0000) GS:ffff99183bc00000(0000) knlGS:0000000000000000
[ 82.891959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 82.891959] CR2: 0000000000000000 CR3: 000000001c43c000 CR4: 00000000000006f0
[ 82.891959] Call Trace:
[ 82.891959] <TASK>
[ 82.891959] ? show_regs+0x6d/0x80
[ 82.891959] ? __die+0x24/0x80
[ 82.891959] ? page_fault_oops+0x99/0x1b0
[ 82.891959] ? kernelmode_fixup_or_oops.isra.0+0x69/0x90
[ 82.891959] ? __bad_area_nosemaphore+0x19e/0x2c0
[ 82.891959] ? bad_area_nosemaphore+0x16/0x30
[ 82.891959] ? do_user_addr_fault+0x29d/0x670
[ 82.891959] ? exc_page_fault+0x83/0x1b0
[ 82.891959] ? asm_exc_page_fault+0x27/0x30
[ 82.891959] ? memcg_hotplug_cpu_dead+0x65/0xc0
[ 82.891959] ? __pfx_memcg_hotplug_cpu_dead+0x10/0x10
[ 82.891959] cpuhp_invoke_callback+0x348/0x530
[ 82.891959] __cpuhp_invoke_callback_range+0x80/0x100
[ 82.891959] _cpu_down+0xfb/0x280
[ 82.891959] __cpu_down_maps_locked+0x15/0x30
[ 82.891959] work_for_cpu_fn+0x1a/0x30
[ 82.891959] process_one_work+0x184/0x3a0
[ 82.891959] worker_thread+0x306/0x440
[ 82.891959] ? _raw_spin_lock_irqsave+0xe/0x20
[ 82.891959] ? __pfx_worker_thread+0x10/0x10
[ 82.891959] kthread+0xf2/0x120
[ 82.891959] ? __pfx_kthread+0x10/0x10
[ 82.891959] ret_from_fork+0x47/0x70
[ 82.891959] ? __pfx_kthread+0x10/0x10
[ 82.891959] ret_from_fork_asm+0x1b/0x30
[ 82.891959] </TASK>
[ 82.891959] Modules linked in: kvm_amd ccp kvm irqbypass input_leds psmouse ahci libahci serio_raw overlay 9pnet_virtio virtiofs 9p 9pnet netfs
[ 82.891959] CR2: 0000000000000000
[ 82.891959] ---[ end trace 0000000000000000 ]---
[ 82.891959] RIP: 0010:memcg_hotplug_cpu_dead+0x65/0xc0
[ 82.891959] Code: 44 00 00 48 89 df e8 5a ef ff ff 48 89 c3 41 f7 c5 00 02 00 00 74 06 fb 0f 1f 44 00 00 4c 89 e7 e8 f0 cd ff ff e8 6b d9 d0 ff <48> 8b 03 a8 03 75 1e 65 48 ff 08 e8 ab 35 d1 ff 31 c0 5b 41 5c 41
[ 82.891959] RSP: 0018:ffffbd548170bd10 EFLAGS: 00000246
[ 82.891959] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 82.891959] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 82.891959] RBP: ffffbd548170bd28 R08: 0000000000000000 R09: 0000000000000000
[ 82.891959] R10: 000000000000001c R11: 0000000000000000 R12: ffff99183bcb0c00
[ 82.891959] R13: 0000000000000286 R14: 0000000000000001 R15: 0000000000000000
[ 82.891959] FS: 0000000000000000(0000) GS:ffff99183bc00000(0000) knlGS:0000000000000000
[ 82.891959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 82.891959] CR2: 0000000000000000 CR3: 000000001c43c000 CR4: 00000000000006f0
[ 82.891959] note: kworker/0:2[593] exited with irqs disabled
[Fix]
The offending commit relies on a NULL check introduced by an earlier commit that the noble kernel does not carry. Pull that commit in:
91b71e78b8e4 ("mm: memcg: add NULL check to obj_cgroup_put()")
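
For illustration only, here is a minimal userspace sketch of the pattern involved, assuming the teardown path drains a per-CPU stock that may have nothing cached and then drops the reference on whatever was drained. The names objcg_model, stock_model, drain_obj_stock_model, objcg_put_model and hotplug_cpu_dead_model are hypothetical stand-ins, not the kernel's types or functions, and the plain counter only approximates the kernel's reference counting. In the oops above, the faulting bytes <48> 8b 03 decode to a load through RBX, which is 0, and that is consistent with a put on a NULL pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted object (not struct obj_cgroup). */
struct objcg_model {
	long refcnt;
};

/* Hypothetical stand-in for a per-CPU stock that may or may not cache an object. */
struct stock_model {
	struct objcg_model *cached_objcg;
};

/* Detach the cached object and return it; returns NULL if nothing was cached. */
static struct objcg_model *drain_obj_stock_model(struct stock_model *stock)
{
	struct objcg_model *old = stock->cached_objcg;

	stock->cached_objcg = NULL;
	return old;
}

/*
 * NULL-tolerant put. Without the check, a caller passing the NULL result
 * of a drain on an empty stock would dereference address 0.
 */
static void objcg_put_model(struct objcg_model *objcg)
{
	if (objcg)
		objcg->refcnt--;
}

/* Teardown for a dead CPU: drain the stock, then drop the drained reference. */
static void hotplug_cpu_dead_model(struct stock_model *stock)
{
	struct objcg_model *old = drain_obj_stock_model(stock);

	objcg_put_model(old);
}
```

Calling hotplug_cpu_dead_model() on a stock whose cached_objcg is NULL is a no-op in this sketch. With the check removed from objcg_put_model(), the same call would fault, which mirrors the situation described above.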
[Test Case]
Running 'sudo chcpu -d 1-95' should not trigger a kernel BUG.
[Where Problems Could Occur]
This touches the CPU hotplug code path, so any onlining or offlining of CPUs could be affected.
Shakeel Butt (1):
memcg: drain obj stock on cpu hotplug teardown
Yosry Ahmed (1):
mm: memcg: add NULL check to obj_cgroup_put()
include/linux/memcontrol.h | 3 ++-
kernel/bpf/memalloc.c | 6 ++----
mm/memcontrol.c | 27 +++++++++++++++------------
mm/zswap.c | 3 +--
4 files changed, 20 insertions(+), 19 deletions(-)
--
2.48.1