[Bug 2030515] Re: Terrible memcpy performance on Zen 3 when using rep movsb
Bug Watch Updater
2030515 at bugs.launchpad.net
Wed Nov 29 23:28:01 UTC 2023
Launchpad has imported 13 comments from the remote bug at
https://sourceware.org/bugzilla/show_bug.cgi?id=30994.
If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.
------------------------------------------------------------------------
On 2023-10-24T06:18:38+00:00 Bruce Merry wrote:
When (dst-src)&0xFFF is small (but non-zero), the REP MOVSB path in
memcpy performs extremely poorly (as much as 25x slower than the
alternative path). I'm observing this on Zen 4 (Epyc 9374F). I'm running
Ubuntu 22.04 with a glibc hand-built from
glibc-2.38.9000-185-g2aa0974d25.
To reproduce:
1. Download the microbench at https://github.com/ska-sa/katgpucbf/blob/6176ed2e1f5eccf7f2acc97e4779141ac794cc01/scratch/memcpy_loop.cpp
2. Compile it with the adjacent Makefile (tl;dr: g++ -std=c++17 -O3 -pthread -o memcpy_loop memcpy_loop.cpp)
3. Run ./memcpy_loop -t mmap -f memcpy -b 8192 -p 100000 -D 1 -r 5
4. Run GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=10000 ./memcpy_loop -t mmap -f memcpy -b 8192 -p 100000 -D 1 -r 5
Step 3 reports a rate of 4.2 GB/s, while step 4 (which disables the
rep_movsb path) reports a rate of 111 GB/s. The test uses 8192-byte
memory copies, where the source is page-aligned and the destination
starts 1 byte into a page.
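For reference, the access pattern being measured boils down to the minimal standalone sketch below. This is not the attached benchmark; the buffer setup, iteration count and timing are simplified for illustration.

// Minimal sketch of the measured pattern: page-aligned source, destination
// starting 1 byte into a page, 8192-byte copies. Sizes and counts are
// illustrative; error handling is omitted.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main() {
    const std::size_t page = 4096, len = 8192, iters = 100000;
    // Two separate anonymous mappings: both are page-aligned, so offsetting
    // the destination by 1 gives (dst - src) & 0xFFF == 1.
    char *src = static_cast<char *>(mmap(nullptr, len + page, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    char *dst = static_cast<char *>(mmap(nullptr, len + page, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)) + 1;
    std::memset(src, 1, len + page);
    std::memset(dst, 2, len);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        std::memcpy(dst, src, len);
        asm volatile("" ::: "memory");   // keep the copies from being optimised away
    }
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.3f GB/s\n", iters * len / secs / 1e9);
    return 0;
}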
I'll also attach the bench-memcpy-large.out, which shows similar
results.
I've previously filed this as an Ubuntu bug
(https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515) but it
doesn't seem to have received much attention.
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/6
------------------------------------------------------------------------
On 2023-10-24T06:19:48+00:00 Bruce Merry wrote:
Created attachment 15193
Glibc's memcpy benchmark results
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/7
------------------------------------------------------------------------
On 2023-10-24T06:20:33+00:00 Bruce Merry wrote:
Created attachment 15194
Output of ld-linux.so.2 --list-tunables
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/8
------------------------------------------------------------------------
On 2023-10-24T06:21:12+00:00 Bruce Merry wrote:
Created attachment 15195
Output of ld-linux.so.2 --list-diagnostics
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/9
------------------------------------------------------------------------
On 2023-10-24T06:32:39+00:00 Bruce Merry wrote:
This issue also affects Zen 3. Zen 2 doesn't advertise ERMS so memcpy
isn't affected.
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/11
------------------------------------------------------------------------
On 2023-10-25T13:37:58+00:00 Bruce Merry wrote:
FWIW, backwards REP MOVSB (std; rep movsb; cld) is still horribly slow
on Zen 4 (4 GB/s even when the data is nicely aligned and cached).
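For clarity, the "backwards" copy referred to here is along the lines of the following sketch (illustrative only, not glibc code; it assumes n > 0):

// With the direction flag set, MOVSB walks the buffers downwards, so RSI/RDI
// must start at the last byte of each buffer.
#include <cstddef>

static void *copy_backward_rep_movsb(void *dst, const void *src, std::size_t n) {
    void *d = static_cast<char *>(dst) + n - 1;              // last destination byte
    const void *s = static_cast<const char *>(src) + n - 1;  // last source byte
    asm volatile("std\n\t"
                 "rep movsb\n\t"
                 "cld"                                        // restore normal direction
                 : "+D"(d), "+S"(s), "+c"(n)
                 :
                 : "memory", "cc");
    return dst;
}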
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/12
------------------------------------------------------------------------
On 2023-10-27T12:39:09+00:00 Adhemerval Zanella wrote:
I have access to a Zen3 machine (5900X) and I can confirm that using REP
MOVSB seems to be always worse than vector instructions. ERMS is used
for sizes between 2112 (rep_movsb_threshold) and 524288
(rep_movsb_stop_threshold, i.e. the L2 size for Zen3), and the '-S 0 -D 1'
performance really seems to be a microcode issue, since I don't see a
similar performance difference with other alignments.
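In other words, the selection boils down to roughly the sketch below. This is a simplified illustration with the values above hard-coded; the helper functions are placeholder stand-ins, and the real logic in glibc's memmove-vec-unaligned-erms.S handles overlap, copy direction and several other cases.

#include <cstddef>
#include <cstring>

// Placeholders standing in for glibc's internal copy routines; in this sketch
// they all fall back to plain memcpy so the example compiles.
static void *vector_copy(void *d, const void *s, std::size_t n)       { return std::memcpy(d, s, n); }
static void *erms_copy(void *d, const void *s, std::size_t n)         { return std::memcpy(d, s, n); }
static void *non_temporal_copy(void *d, const void *s, std::size_t n) { return std::memcpy(d, s, n); }

void *memcpy_dispatch_sketch(void *dst, const void *src, std::size_t n) {
    const std::size_t rep_movsb_threshold      = 2112;     // x86_rep_movsb_threshold
    const std::size_t rep_movsb_stop_threshold = 524288;   // ~L2 size on Zen3
    if (n <= rep_movsb_threshold)
        return vector_copy(dst, src, n);        // AVX2/AVX-512 loads and stores
    if (n < rep_movsb_stop_threshold)
        return erms_copy(dst, src, n);          // REP MOVSB (the path at issue here)
    return non_temporal_copy(dst, src, n);      // streaming stores for very large copies
}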
On Zen3 with REP MOVSB I see:
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
84.2448 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
506.099 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
990.845 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
57.1122 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
325.409 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
510.87 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
4.43104 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.4551 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
40.4088 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
4.34671 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.0829 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`
While with vectorized instructions I see:
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
124.183 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
773.696 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
1413.02 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
58.3212 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
322.583 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
506.116 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
121.872 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
717.717 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
1318.17 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
58.5352 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
325.996 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`
498.552 GB/s
So it seems there is no gain in using REP MOVSB on Zen3/Zen4, especially
in the size range where it was supposed to be better. glibc 2.34 added a
fix from AMD (6e02b3e9327b7dbb063958d2b124b64fcb4bbe3f), where the
assumption is that ERMS performs poorly on data above the L2 cache size,
so REP MOVSB is limited to the L2 cache size (from 2113 to 524287), but I
think the AMD engineers did not really evaluate whether ERMS is indeed
better than the vectorized instructions.
And I think BZ#30995 is the same issue, since
__memcpy_avx512_unaligned_erms uses the same tunable to decide whether
to use ERMS. I have created a patch that just disables ERMS usage on AMD
cores [1]; can you check if it improves performance on Zen4 as well?
Also, I have noticed that memset shows subpar performance with ERMS as
well, so I have also disabled it on my branch.
[1]
https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/bz30944-memcpy-zen
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/13
------------------------------------------------------------------------
On 2023-10-27T13:04:12+00:00 Bruce Merry wrote:
Here's what I get on the Zen 4 system with the same parameters. I
haven't had a chance to look at what it all means:
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
80.6649 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
954.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1883.1 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
48.7753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
570.385 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
676.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
3.54696 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
42.5706 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
85.0753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
3.50689 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
41.5237 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
81.8951 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
102.05 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
1206.81 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2415.47 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
49.4859 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
583.279 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1066.54 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
97.1753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
991.128 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2257.42 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
49.3362 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
571.026 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1075.03 GB/s
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/14
------------------------------------------------------------------------
On 2023-10-27T13:16:01+00:00 Bruce Merry wrote:
Ah, it looks like the GLIBC_TUNABLES environment variable didn't appear
in the output. Let me try again:
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
80.6649 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
954.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1883.1 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
48.7753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
570.385 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
676.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
3.54696 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
42.5706 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
85.0753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
3.50689 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
41.5237 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
81.8951 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
102.05 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
1206.81 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2415.47 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
49.4859 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
583.279 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1066.54 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
97.1753 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
991.128 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2257.42 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
49.3362 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
571.026 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1075.03 GB/s
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/15
------------------------------------------------------------------------
On 2023-10-30T08:21:16+00:00 Bruce Merry wrote:
So in those cases REP MOVSB seems to be a slowdown, but there do also
seem to be cases where REP MOVSB is much faster (this is on Zen 4), e.g.
$ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
94.5295 GB/s
94.3382 GB/s
94.474 GB/s
94.2385 GB/s
94.5105 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
56.5062 GB/s
55.3669 GB/s
56.4723 GB/s
55.857 GB/s
56.5396 GB/s
When not using huge pages, the vectorised memcpy hits 115.5 GB/s. I'm
seeing a lot of cases on Zen 4 where huge pages actually make things
worse; maybe it's related to hardware prefetch reading past the end of
the buffer?
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/16
------------------------------------------------------------------------
On 2023-10-30T13:30:58+00:00 Adhemerval Zanella wrote:
On Zen3 I am not seeing such a slowdown using vectorized instructions.
With a glibc patched to disable REP MOVSB I see:
$ ./testrun.sh ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000
146.593 GB/s
# Force REP MOVSB
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_stop_threshold=4097 ./testrun.sh ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000
116.298 GB/s
And I don't see difference between mmap and mmap_huge.
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/17
------------------------------------------------------------------------
On 2023-10-30T14:21:56+00:00 Bruce Merry wrote:
> On Zen3 I am not seeing such slowdown using vectorized instructions.
Agreed, I'm also not seeing this huge-page slowdown on our Zen 3 servers
(this is with Ubuntu 22.04's glibc 2.35; I haven't got a hand-built
glibc handy on that server):
$ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
90.065 GB/s
89.9096 GB/s
89.9131 GB/s
89.8207 GB/s
89.952 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
116.997 GB/s
116.874 GB/s
116.937 GB/s
117.029 GB/s
117.007 GB/s
On the other hand, there seem to be other cases where REP MOVSB is
faster on Zen 3:
$ ./memcpy_loop -D 512 -f memcpy_rep_movsb -r 5 -t mmap 0
Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
Using function memcpy_rep_movsb
22.045 GB/s
22.3135 GB/s
22.1144 GB/s
22.8571 GB/s
22.2688 GB/s
$ ./memcpy_loop -D 512 -f memcpy -r 5 -t mmap 0
Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
Using function memcpy
7.66155 GB/s
7.71314 GB/s
7.72952 GB/s
7.72505 GB/s
7.74309 GB/s
But overall it does seem like the vectorised copy performs better than
REP MOVSB on Zen 3.
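(For reference, the memcpy_rep_movsb variant used above is presumably just a plain forward REP MOVSB, along the lines of this sketch rather than the benchmark's exact code:

#include <cstddef>

// Hedged sketch of a forward REP MOVSB copy; not the benchmark's actual
// implementation, just the expected shape of it.
static void *memcpy_rep_movsb_sketch(void *dst, const void *src, std::size_t n) {
    void *d = dst;
    const void *s = src;
    asm volatile("rep movsb"
                 : "+D"(d), "+S"(s), "+c"(n)
                 :
                 : "memory");
    return dst;
}
)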
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/18
------------------------------------------------------------------------
On 2023-10-30T16:27:35+00:00 Adhemerval Zanella wrote:
(In reply to Bruce Merry from comment #11)
> > On Zen3 I am not seeing such slowdown using vectorized instructions.
>
> Agreed, I'm also not seeing this huge-page slowdown on our Zen 3 servers
> (this is with Ubuntu 22.04's glibc 2.32; I haven't got a hand-built glibc
> handy on that server):
>
> $ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
> Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
> Using function memcpy
> 90.065 GB/s
> 89.9096 GB/s
> 89.9131 GB/s
> 89.8207 GB/s
> 89.952 GB/s
>
> $ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D
> 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
> Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
> Using function memcpy
> 116.997 GB/s
> 116.874 GB/s
> 116.937 GB/s
> 117.029 GB/s
> 117.007 GB/s
>
> On the other hand, there seem to be other cases where REP MOVSB is faster on
> Zen 3:
>
> $ ./memcpy_loop -D 512 -f memcpy_rep_movsb -r 5 -t mmap 0
> Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
> Using function memcpy_rep_movsb
> 22.045 GB/s
> 22.3135 GB/s
> 22.1144 GB/s
> 22.8571 GB/s
> 22.2688 GB/s
>
> $ ./memcpy_loop -D 512 -f memcpy -r 5 -t mmap 0
> Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
> Using function memcpy
> 7.66155 GB/s
> 7.71314 GB/s
> 7.72952 GB/s
> 7.72505 GB/s
> 7.74309 GB/s
>
> But overall it does seem like the vectorised copy performs better than REP
> MOVSB on Zen 3.
The main issue seems to be defining when ERMS is better than the
vectorized path based on the arguments. Current glibc only takes the
input size into consideration, whereas from this discussion it seems we
also need to take the alignment of the arguments (both of them) into
account.
Also, it seems that on Zen3 ERMS is slightly better than non-temporal
instructions, which is another tuning heuristic where, again, only the
size is used to decide when to use it (currently
x86_non_temporal_threshold).
In any case, I think that at least for the sizes where ERMS is currently
being used it would be better to use the vectorized path. Most likely
some further tuning to switch to ERMS at large sizes would be profitable
for Zen cores.
Does AMD provide any tuning manual describing such characteristics for
instruction and memory operations?
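To make the alignment point concrete, a heuristic of that kind might look roughly like the following sketch. This is purely hypothetical and illustrative, not a proposed glibc change; the helper name and the exact test are assumptions.

#include <cstdint>

// Hypothetical helper: true when source and destination share the same offset
// within a 4 KiB page, which avoids the (dst - src) & 0xFFF != 0 pattern
// reported as pathological in this bug.
static bool rep_movsb_alignment_ok(const void *dst, const void *src) {
    std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
    std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
    return ((d - s) & 0xFFF) == 0;
}
// A dispatcher would then require both that the size falls inside the ERMS
// window and that rep_movsb_alignment_ok(dst, src) holds before picking the
// REP MOVSB path.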
Reply at:
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515/comments/19
** Changed in: glibc
Status: Unknown => New
** Changed in: glibc
Importance: Unknown => Low
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/2030515
Title:
Terrible memcpy performance on Zen 3 when using rep movsb
Status in GLibC:
New
Status in glibc package in Ubuntu:
New
Bug description:
On CPUs that advertise FSRM (fast short rep movsb), glibc 2.35 uses
REP MOVSB for memcpy for sizes above 2112 (up to some threshold that
depends on the cache size). Unfortunately, it seems that Zen 3 (at
least in the microcode we're running) is extremely slow at REP MOVSB
when the data are not well-aligned.
I've found this using a memcpy benchmark at
https://github.com/ska-sa/katgpucbf/blob/69752be58fb8ab0668ada806e0fd809e782cc58b/scratch/memcpy_loop.cpp
(compiled with the adjacent Makefile). To demonstrate the issue, run
./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0
This runs:
- 2113-byte memory copies
- 1,000,000 times per timing measurement
- in memory allocated with mmap
- with the source 0 bytes from the start of the page
- with the destination 1 byte from the start of the page
- on core 0.
It reports about 3.2 GB/s. Change the -b argument to 2111 and it
reports over 100 GB/s. So the REP MOVSB case is about 30× slower!
This will most likely need to be reported and fixed upstream, but I'm
reporting it to Ubuntu first since I don't know if Ubuntu has modified
glibc in any way that would be significant.
See also: https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: libc6 2.35-0ubuntu3.1
ProcVersionSignature: Ubuntu 5.19.0-46.47~22.04.1-generic 5.19.17
Uname: Linux 5.19.0-46-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: unknown
Date: Mon Aug 7 14:02:28 2023
RebootRequiredPkgs: Error: path contained symlinks.
SourcePackage: glibc
UpgradeStatus: No upgrade log present (probably fresh install)
To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/2030515/+subscriptions