[Bug 1904585] Re: opal-prd: Have a worker process handle page offlining (Fixes "PlatServices: dyndealloc memory_error() failed" is getting reported in error log (opal-prd))
Łukasz Zemczak
1904585 at bugs.launchpad.net
Mon Jan 11 10:02:23 UTC 2021
I'm not entirely happy with the regression potential section, so let me
work a bit on that.
"Hopefully not much" is not a valid regression potential analysis, as it
gives us, the SRU team, no insight into what the actual situation is
like. We all have wishful thinking about our changes! This section
requires looking at and understanding the code changes, looking at which
parts of the code are modified and doing a thought exercise: hmmm, if
this code is changed, what other parts can it affect? Where should we
expect any problems to appear, just in case?
Only then can we, as the SRU team, make a decision whether it is safe
enough to go forward with the SRU or not. Also, it gives everyone a
better understanding of what additional dogfooding we might want to
perform to minimize regression risk. Lastly, it also makes bisecting
easier whenever a regression is found later and we are trying to
identify which change is responsible.
Another thing that I'm not entirely happy with is the lack of any DEP-3
headers on any of the patches in the actual uploads. Because of that I
don't know where the patches come from. Have they been upstreamed?
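For illustration, a minimal DEP-3 header for the main patch might look
like the sketch below. The Description, Origin and Bug-Ubuntu values are
taken from the bug description further down; the Author line assumes the
upstream sign-off identifies the author, and the Last-Update date is a
placeholder that the uploader would need to fill in.

```
Description: opal-prd: Have a worker process handle page offlining
Origin: upstream, https://github.com/open-power/skiboot/commit/8cbd0de88d162e387f11569eee1bdecef8fad2e3
Bug-Ubuntu: https://bugs.launchpad.net/bugs/1904585
Author: Oliver O'Halloran <oohall at gmail.com>
Last-Update: <YYYY-MM-DD>
```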
I will gladly re-visit those once these issues are resolved.
** Changed in: skiboot (Ubuntu Groovy)
Status: In Progress => Incomplete
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to skiboot in Ubuntu.
Matching subscriptions: foundations-bugs-skiboot
https://bugs.launchpad.net/bugs/1904585
Title:
opal-prd: Have a worker process handle page offlining (Fixes
"PlatServices: dyndealloc memory_error() failed" is getting reported
in error log (opal-prd))
Status in The Ubuntu-power-systems project:
In Progress
Status in skiboot package in Ubuntu:
Fix Released
Status in skiboot source package in Xenial:
In Progress
Status in skiboot source package in Bionic:
In Progress
Status in skiboot source package in Focal:
In Progress
Status in skiboot source package in Groovy:
Incomplete
Status in skiboot source package in Hirsute:
Fix Released
Bug description:
[Impact]
This impacts the opal-prd userspace command from the skiboot package.
The memory_error() hservice interface expects the memory_error() call to
just accept the offline request and return without actually offlining the
memory. Currently we will attempt to offline the marked pages before
returning to HBRT, which can result in an excessively long time spent in
the memory_error() hservice call, which blocks HBRT from processing other
errors.
[Test Case]
Unfortunately, due to the specific hardware requirement, I wasn't able
to reproduce this problem and provide a test case for it. However, I
was able to build this package in a PPA and got the IBM team to
confirm that the problem was resolved for groovy, focal, bionic and
xenial; see comments #4 and #6.
[What could go wrong]
Hopefully not much. The initial fix was prepared back in September and
I would think regression could have been discovered by now.
[Original Description]
https://github.com/open-power/skiboot/commit/8cbd0de88d162e387f11569eee1bdecef8fad2e3
opal-prd: Have a worker process handle page offlining
The memory_error() hservice interface expects the memory_error() call to
just accept the offline request and return without actually offlining the
memory. Currently we will attempt to offline the marked pages before
returning to HBRT which can result in an excessively long time spent in the
memory_error() hservice call which blocks HBRT from processing other
errors. Fix this by adding a worker process which performs the page
offlining via the sysfs memory error interfaces.
Reviewed-by: Vasant Hegde <hegdevasant at linux.vnet.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall at gmail.com>
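The hand-off described in the commit message can be sketched roughly as
follows. This is not the upstream opal-prd code: the struct and function
names (offline_req, offline_range, start_worker, queue_offline,
stop_worker) are hypothetical, and the request format over the pipe is
an assumption made for illustration. The path argument is assumed to be
one of the kernel's memory error sysfs files, such as
/sys/devices/system/memory/soft_offline_page.

```c
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical request passed from the daemon to the worker over a pipe. */
struct offline_req {
	uint64_t start;	/* first physical address to offline */
	uint64_t end;	/* last physical address, inclusive */
};

/* Write one page address at a time to the given sysfs-style file.
 * Returns the number of pages written, or -1 on error. */
static int offline_range(const char *path, uint64_t start, uint64_t end,
			 uint64_t page_size)
{
	int pages = 0;

	if (page_size == 0 || end < start)
		return -1;

	for (uint64_t addr = start; ; addr += page_size) {
		char buf[32];
		int len = snprintf(buf, sizeof(buf), "0x%016" PRIx64 "\n", addr);
		int fd = open(path, O_WRONLY);
		ssize_t rc;

		if (fd < 0)
			return -1;
		rc = write(fd, buf, (size_t)len);
		close(fd);
		if (rc != len)
			return -1;
		pages++;
		/* Stop before addr + page_size could pass end (or wrap). */
		if (end - addr < page_size)
			break;
	}
	return pages;
}

/* Fork a worker that services requests until the pipe is closed.
 * Returns the worker pid and stores the write end of the pipe in *wfd. */
static pid_t start_worker(const char *path, uint64_t page_size, int *wfd)
{
	int fds[2];
	pid_t pid;

	if (pipe(fds) < 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		struct offline_req req;
		ssize_t n;

		close(fds[1]);
		while ((n = read(fds[0], &req, sizeof(req))) == (ssize_t)sizeof(req))
			offline_range(path, req.start, req.end, page_size);
		_exit(n == 0 ? 0 : 1);	/* 0 only on a clean end-of-file */
	}
	close(fds[0]);
	*wfd = fds[1];
	return pid;
}

/* Called from the memory_error() path: queue the request and return
 * without doing the slow offlining work in the caller. */
static int queue_offline(int wfd, uint64_t start, uint64_t end)
{
	struct offline_req req = { .start = start, .end = end };

	return write(wfd, &req, sizeof(req)) == (ssize_t)sizeof(req) ? 0 : -1;
}

/* Close the pipe and reap the worker; returns its exit status or -1. */
static int stop_worker(pid_t pid, int wfd)
{
	int status;

	close(wfd);
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this shape, the caller of queue_offline() only pays for a small
pipe write, while the slow sysfs writes happen in the child. Note that
the pipe write can still block if the worker falls far enough behind to
fill the pipe, which a real implementation would need to consider.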
Thanks in advance for your support.
Machine Type = Power8 and Power9 OPAL systems
---Steps to Reproduce---
* Inject memory error (UE)
* Verify that opal-prd doesn't return asynchronously to the platform after requesting the memory offlining operation
Userspace tool common name: opal-prd
We need this fix for the 16.04.x and 18.04.x LTS releases.
The fix is also needed for 20.04 and 20.10.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1904585/+subscriptions