ACK: [SRU][Bionic][PATCH 0/2] Fix ThunderX boot issues

dann frazier dann.frazier at canonical.com
Wed Jan 15 20:40:57 UTC 2020


On Wed, Jan 15, 2020 at 10:37 AM Juerg Haefliger
<juerg.haefliger at canonical.com> wrote:
>
> Two trivial KPTI patches to fix nasty ThunderX boot issues.
>
> The 'arm64: add sentinel to kpti_safe_list' patch might not be a
> contributor here, but it certainly fixes a real issue: on ThunderX the
> list is traversed all the way to the end, and without the sentinel the
> traversal does not stop there.
>
> The real problem is fixed by 'arm64: Check for errata before evaluating
> cpu features'. The kernel has a check in place for a certain Cavium
> erratum. If that erratum is enabled, KPTI is turned off (the code comment
> mentions I-cache clobbering if both are enabled at the same time). Some
> earlier KPTI commit changed the order in which features and errata are
> evaluated and enabled, resulting in a short period of time where both KPTI
> and that erratum are enabled. The fix reverses that order.
>
> Dirk Mueller (1):
>   arm64: Check for errata before evaluating cpu features
>
> Mark Rutland (1):
>   arm64: add sentinel to kpti_safe_list
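
For anyone skimming the thread, the sentinel issue can be sketched in
user-space C roughly as follows. This is only an illustration, not the
kernel code: struct cpu_model_entry, safe_list, model_in_list() and the
model IDs are hypothetical stand-ins for the kernel's MIDR range list.
The point is that a lookup for a CPU that is not in the list only stops
when it reaches the zeroed terminating entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a MIDR range entry; not the kernel's struct. */
struct cpu_model_entry {
    uint32_t model;   /* 0 marks the end of the list */
};

/* Illustrative model IDs only; these are not real MIDR values. */
#define MODEL_A 0x0a1u
#define MODEL_B 0x0b2u

static const struct cpu_model_entry safe_list[] = {
    { MODEL_A },
    { MODEL_B },
    { 0 },  /* sentinel: without it the loop below reads past the array */
};

/* Walk the list until the sentinel; return true if the model is listed. */
static bool model_in_list(uint32_t model, const struct cpu_model_entry *list)
{
    for (; list->model != 0; list++) {
        if (list->model == model)
            return true;
    }
    return false;
}
```

A model that is absent from the list (the ThunderX case described above)
makes the loop visit every entry, so the terminating entry is the only
thing keeping the walk inside the array.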

Both fixes look correct to me. Indeed, I had noticed that the "kernel
page table isolation" message was printed very early in unaffected
kernels' logs:

[    0.000000] CPU features: kernel page table isolation forced OFF by ARM64_WORKAROUND_CAVIUM_27456

While affected kernels weren't printing it until much later, just
before enabling non-boot CPUs:

[    0.035287] smp: Bringing up secondary CPUs ...
[    0.035637] CPU features: kernel page table isolation forced OFF by ARM64_WORKAROUND_CAVIUM_27456

With these fixes, these messages are now again emitted at [    0.000000].
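
The timing difference fits the ordering problem Juerg describes. A
simplified sketch of why the order matters, in user-space C, is below.
All names here (cpu_state, detect_errata(), detect_features(),
kpti_on_after_boot(), the CAP_* bits) are hypothetical and the logic is
deliberately reduced; it is not the arm64 capability code.

```c
#include <stdbool.h>

/* Hypothetical capability bits; the names are illustrative only. */
enum {
    CAP_ERRATUM_27456 = 1 << 0,
    CAP_KPTI          = 1 << 1,
};

struct cpu_state {
    bool has_erratum;   /* hardware is affected by the erratum */
    unsigned int caps;  /* capability bits detected so far */
};

/* Record the erratum workaround if the hardware needs it. */
static void detect_errata(struct cpu_state *cpu)
{
    if (cpu->has_erratum)
        cpu->caps |= CAP_ERRATUM_27456;
}

/* KPTI is only kept off if the erratum bit has already been recorded. */
static void detect_features(struct cpu_state *cpu)
{
    if (!(cpu->caps & CAP_ERRATUM_27456))
        cpu->caps |= CAP_KPTI;
}

/* Run both detection passes in the given order on an affected CPU. */
static bool kpti_on_after_boot(bool errata_first)
{
    struct cpu_state cpu = { .has_erratum = true, .caps = 0 };

    if (errata_first) {
        detect_errata(&cpu);
        detect_features(&cpu);
    } else {
        detect_features(&cpu);
        detect_errata(&cpu);
    }
    return (cpu.caps & CAP_KPTI) != 0;
}
```

In this toy model, evaluating features first leaves KPTI on for an
affected CPU, while evaluating errata first keeps it off, which is the
order the fix restores.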

When I saw crashes before, they were always pretty early - either
during boot, or immediately after, when I started a kernel build. I ran
tests overnight on 3 machines (build kernel & reboot loop) with
4.15.0-74.84 + these patches and some other candidates. 1 machine
survived the night, another survived for several hours until it was
released by automation, and a 3rd failed - but due to an SEA
(synchronous external abort), which might be an actual hardware failure.

I've restarted the same testing on 4 machines today using latest
bionic/master-next + just these 2 patches.  They've all survived at
least 2 iterations so far. I'll leave it running but, after surviving >16
passes, I believe either the bug is squashed, or Juerg has scared it
back into hiding :)

Acked-by: dann frazier <dann.frazier at canonical.com>


