APPLIED: [SRU][N][PATCH 0/2] KVM bug causes Firecracker crash when it runs the vCPU for the first time

Stefan Bader stefan.bader at canonical.com
Thu May 15 16:13:34 UTC 2025


On 03.05.25 00:46, Magali Lemes wrote:
> BugLink: https://bugs.launchpad.net/bugs/2109859
> 
> [Impact]
> The Firecracker process crashes with an "out of memory" error when it
> attempts to run the vCPU for the first time, even if the system has enough
> available memory:
> ```
> 2025-05-02T16:31:21.850912998 [daf77128-f177-4a01-9b97-a88dd9faa78f:fc_vcpu 0] Failure during vcpu run: Out of memory (os error 12)
> ```
> 
> The issue is triggered by a race condition: the VMM thread sends SIGRTMIN to
> the vCPU thread while the vCPU thread is starting the
> nx_huge_page_recovery_thread. This makes the thread creation fail, but due to
> a bug in the kernel, the failure is reported as ENOMEM instead of
> ERESTARTNOINTR, which should be retried.
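
The misclassification described above can be modelled in a few lines of Python. This is only an illustrative sketch, not kernel code: the function names are hypothetical, and the value 513 for ERESTARTNOINTR is taken from the kernel-internal include/linux/errno.h because Python's errno module does not expose it.

```python
import errno

# Kernel-internal restart code (include/linux/errno.h); not in Python's errno.
ERESTARTNOINTR = 513


def create_task(signal_pending: bool):
    """Hypothetical stand-in for task creation: returns (task, err)."""
    if signal_pending:
        # A pending signal aborts creation with a "please restart" code.
        return None, -ERESTARTNOINTR
    return "recovery-thread", 0


def create_task_returning_null(signal_pending: bool):
    """Model of the pre-fix helper: every failure collapses to None (NULL)."""
    task, err = create_task(signal_pending)
    if err:
        return None  # the real reason for the failure is lost here
    return task


def first_vcpu_run(signal_pending: bool) -> int:
    """Model of the caller: with only None to go on, it guesses ENOMEM."""
    task = create_task_returning_null(signal_pending)
    if task is None:
        return -errno.ENOMEM
    return 0
```

In this model, `first_vcpu_run(True)` returns `-errno.ENOMEM` even though the underlying failure was the restartable `-ERESTARTNOINTR`, which mirrors the "os error 12" seen in the Firecracker log.
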
> 
> This only affects 6.8 kernels, since the bug was introduced by the following
> commits, which were backported to the noble:linux 6.8.0-58.60 kernel as part
> of the upstream stable updates (LP: #2101915):
> - 43fb96ae7855 ("KVM: x86/mmu: Ensure NX huge page recovery thread is alive before waking")
> - 931656b9e2ff ("kvm: defer huge page recovery vhost task to later")
> - d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
> 
> [Fix]
> Cherry-pick cb380909ae3b ("vhost: return task creation error instead of NULL")
> and 916b7f42b3b3 ("kvm: retry nx_huge_page_recovery_thread creation").
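
As a rough illustration of the idea behind these two commits, the Python sketch below models a creation helper that reports the specific error, plus a once-style guard that latches only after success so a failed attempt can be retried. It is a simplified model under stated assumptions, not the kernel implementation: `CallOnce`, `make_task_creator` and `run_vcpu` are hypothetical names, the explicit loop stands in for the restart of the interrupted call, and 513 is the kernel-internal value of ERESTARTNOINTR.

```python
# Kernel-internal restart code (include/linux/errno.h); not in Python's errno.
ERESTARTNOINTR = 513


class CallOnce:
    """Model of a once-guard that only latches after the callback succeeds."""

    def __init__(self):
        self.done = False

    def run(self, callback) -> int:
        if self.done:
            return 0  # already initialised, nothing to do
        err = callback()
        if err == 0:
            self.done = True  # latch only on success so failures can retry
        return err


def make_task_creator(pending_signals: int):
    """Hypothetical creator that fails once per pending signal."""
    state = {"pending": pending_signals, "attempts": 0}

    def create() -> int:
        state["attempts"] += 1
        if state["pending"] > 0:
            state["pending"] -= 1
            return -ERESTARTNOINTR  # specific error instead of a bare NULL
        return 0

    return create, state


def run_vcpu(once: CallOnce, create, max_restarts: int = 8) -> int:
    """Model of the restarted call: retry while the error is restartable."""
    for _ in range(max_restarts):
        err = once.run(create)
        if err != -ERESTARTNOINTR:
            return err
    return -ERESTARTNOINTR
```

With one pending signal, `run_vcpu` fails the first attempt, retries, and succeeds on the second; later calls return immediately because the guard has latched.
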
> 
> [Test Case]
> 1) Launch a Noble c5.metal instance on AWS
> 2) Install and boot into the linux-generic 6.8 kernel
> 3) Install docker and aws-cli
> 4) git clone https://github.com/firecracker-microvm/firecracker.git
> 5) Go to the firecracker directory and run `./tools/devtool test -- -n16 integration_tests/functional/test_snapshot_basic.py::test_cycled_snapshot_restore`
> 6) With this patchset, observe that all tests pass. Without it, a couple
> of tests will fail with an out-of-memory error.
> 
> [Where problems could occur]
> This patchset touches vhost_task_create(), making it return specific error
> pointers instead of just NULL. Problems could occur if its callers
> mishandle the return value.
> More broadly, it also touches code responsible for memory management of KVM
> VMs, and issues could appear as these VMs failing to initialize.
> 
> [Other info]
> SF #00410184
> 
> Keith Busch (2):
>    vhost: return task creation error instead of NULL
>    kvm: retry nx_huge_page_recovery_thread creation
> 
>   arch/x86/kvm/mmu/mmu.c    | 12 +++++-----
>   drivers/vhost/vhost.c     |  2 +-
>   include/linux/call_once.h | 47 ++++++++++++++++++++++++++++-----------
>   kernel/vhost_task.c       |  4 ++--
>   4 files changed, 42 insertions(+), 23 deletions(-)
> 

Applied to noble:linux/master-next. Thanks.

-Stefan
