[SRU][jammy][PATCH 1/1] percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing
Philip Cox
philip.cox at canonical.com
Wed Feb 14 18:07:35 UTC 2024
From: Yu Ma <yu.ma at intel.com>
BugLink: https://bugs.launchpad.net/bugs/2053152
When running the UnixBench/Execl throughput case, false sharing is observed
due to frequent reads of base_addr and writes to free_bytes and chunk_md.

UnixBench/Execl represents a class of workloads in which bash scripts are
spawned frequently to do short jobs. It issues the execl system call
frequently, and execl calls mm_init to initialize the mm_struct of the
process. mm_init calls __percpu_counter_init to initialize the percpu
counters, which in turn calls pcpu_alloc; pcpu_alloc reads the base_addr of
a pcpu_chunk for memory allocation. Inside pcpu_alloc, pcpu_alloc_area is
called to allocate memory from a specified chunk, and it updates free_bytes
and chunk_md to record the remaining free bytes and other metadata for that
chunk. Correspondingly, pcpu_free_area also updates these two members when
freeing memory.
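As an aside for readers unfamiliar with the pattern (this sketch is not part
of the patch): the problem can be reproduced in userspace, assuming 64-byte
cache lines. chunk_md_pad below is a hypothetical stand-in for
struct pcpu_block_md:

#include <stddef.h>
#include <stdio.h>

/* Simplified model of the old pcpu_chunk layout: the read-mostly
 * base_addr shares a 64-byte cache line with the write-hot fields,
 * so every update to free_bytes/chunk_md invalidates the line that
 * base_addr readers hold (false sharing). */
struct chunk_old {
	int free_bytes;		/* written by pcpu_alloc_area/pcpu_free_area */
	long chunk_md_pad[4];	/* hypothetical stand-in for chunk_md */
	void *base_addr;	/* read on every pcpu_alloc */
};

int main(void)
{
	printf("free_bytes @ %zu, base_addr @ %zu -> same 64-byte line\n",
	       offsetof(struct chunk_old, free_bytes),
	       offsetof(struct chunk_old, base_addr));
	return 0;
}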
The call trace from perf is as follows:
  +   57.15%  0.01%  execl  [kernel.kallsyms]  [k] __percpu_counter_init
  +   57.13%  0.91%  execl  [kernel.kallsyms]  [k] pcpu_alloc
  -   55.27% 54.51%  execl  [kernel.kallsyms]  [k] osq_lock
     - 53.54% 0x654278696e552f34
          main
          __execve
          entry_SYSCALL_64_after_hwframe
          do_syscall_64
          __x64_sys_execve
          do_execveat_common.isra.47
          alloc_bprm
          mm_init
          __percpu_counter_init
          pcpu_alloc
        - __mutex_lock.isra.17
In the current pcpu_chunk layout, `base_addr' shares a cache line with
`free_bytes' and `chunk_md', occupying the line's last 8 bytes. This patch
moves `bound_map' up to where `base_addr' was, so that `base_addr' starts a
new cache line.
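For illustration (again, not part of the patch): on SMP builds,
____cacheline_aligned_in_smp expands to a cache-line alignment attribute,
so a userspace analogue of the new layout, assuming 64-byte lines, looks
like this (chunk_md_pad remains a hypothetical stand-in):

#include <stddef.h>
#include <stdio.h>

/* Model of the re-laid-out chunk: bound_map fills the slot base_addr
 * used to occupy, and the alignment attribute pushes base_addr onto
 * a fresh cache line, away from the write-hot fields. */
struct chunk_new {
	int free_bytes;
	long chunk_md_pad[4];		/* hypothetical stand-in for chunk_md */
	unsigned long *bound_map;	/* moved up from below base_addr */
	void *base_addr __attribute__((aligned(64)));
};

int main(void)
{
	printf("free_bytes @ %zu, base_addr @ %zu -> separate lines\n",
	       offsetof(struct chunk_new, free_bytes),
	       offsetof(struct chunk_new, base_addr));
	return 0;
}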
With this change, on an Intel Sapphire Rapids 112C/224T platform, based on
v6.4-rc4, the 160-parallel score improves by 24%.
The pcpu_chunk struct is a per-chunk backing data structure, so the
additional memory should not be dramatic. A chunk covers roughly
between 64KB and 512KB of memory, depending on config and boot-time
parameters, so I believe the additional memory used here is nominal at best.
Working the #s on my desktop:
Percpu: 58624 kB
28 cores -> ~2.1MB of percpu memory.
At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB
of overhead?
I believe we can do a little better to avoid eating that full padding,
so likely less than that.
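Checking that arithmetic explicitly (a back-of-envelope sketch; the chunk
counts come from the estimate above):

#include <stdio.h>

int main(void)
{
	unsigned pad = 64;	/* worst case: one extra cache line per chunk */

	printf("33 chunks -> %u bytes (~2 KB)\n", 33 * pad);	/* 2112 */
	printf("40 chunks -> %u bytes (~2.5 KB)\n", 40 * pad);	/* 2560 */
	return 0;
}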
[dennis at kernel.org: changelog details]
Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@intel.com
Signed-off-by: Yu Ma <yu.ma at intel.com>
Reviewed-by: Tim Chen <tim.c.chen at linux.intel.com>
Acked-by: Dennis Zhou <dennis at kernel.org>
Cc: Dan Williams <dan.j.williams at intel.com>
Cc: Dave Hansen <dave.hansen at intel.com>
Cc: Liam R. Howlett <Liam.Howlett at oracle.com>
Cc: Shakeel Butt <shakeelb at google.com>
Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
(cherry picked from commit 3a6358c0dbe6a286a4f4504ba392a6039a9fbd12)
Signed-off-by: Philip Cox <philip.cox at canonical.com>
---
mm/percpu-internal.h | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 639662c20c82..0bc4c2eac808 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -40,10 +40,17 @@ struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
-	void			*base_addr;	/* base address of this chunk */
+	unsigned long		*bound_map;	/* boundary map */
+
+	/*
+	 * base_addr is the base address of this chunk.
+	 * To reduce false sharing, current layout is optimized to make sure
+	 * base_addr locate in the different cacheline with free_bytes and
+	 * chunk_md.
+	 */
+	void			*base_addr ____cacheline_aligned_in_smp;
 
 	unsigned long		*alloc_map;	/* allocation map */
-	unsigned long		*bound_map;	/* boundary map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */
--
2.34.1