[Bug 2101084] Re: GCC produces wrong code for arm64+sve in some cases

Thu Mar 6 17:56:17 UTC 2025

Launchpad has imported 22 comments from the remote bug at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118976.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2025-02-21T15:56:58+00:00 Luke Robison wrote:

Created attachment 60555
Standalone Reproducer

Hello Team,

A customer came to me with a sha1 implementation that was producing
corrupt values on Graviton4 with -O3.

I isolated the problem to the generation of the trailing byte count in
big-endian order, which is then included in the checksum.  The original
code snippet is below; several variants of it can be found online with
some googling:

    for (i = 0; i < 8; i++) {
        finalcount[i] = (unsigned char)((context->count[(i >= 4 ? 0 : 1)]
         >> ((3-(i & 3)) * 8) ) & 255);  /* Endian independent */
    }


I've attached a stand-alone reproducer in which the problematic function is called finalcount_av.  I have found that gcc 11 and previous don't vectorize and don't have the issue, while gcc 12.4 through gcc 14.2 produce corrupt results.  Although trunk doesn't exhibit the problem, I believe this is because of changed optimization weights rather than because the error was fixed.
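
For readers without the attachment, the loop above can be sketched as a standalone scalar function. This is a minimal illustration, not the attached reproducer: the struct name `sha1_ctx_sketch` and the function name `finalcount_scalar` are made up here, and only the `count` field from the snippet is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SHA-1 context. Only the message length,
   held as two 32-bit words, matters for this loop. */
struct sha1_ctx_sketch {
    uint32_t count[2];   /* count[0] = low word, count[1] = high word */
};

/* Store the length big-endian: bytes 0..3 come from count[1], bytes 4..7
   from count[0], most significant byte of each word first. */
void finalcount_scalar(const struct sha1_ctx_sketch *context,
                       unsigned char finalcount[8])
{
    assert(context != NULL && finalcount != NULL);
    for (unsigned i = 0; i < 8; i++) {
        uint32_t word = context->count[i >= 4 ? 0 : 1];
        unsigned shift = (3 - (i & 3)) * 8;   /* 24, 16, 8, 0, 24, ... */
        finalcount[i] = (unsigned char)((word >> shift) & 255);
    }
}
```

With `count[0] = 0x00aaaaaa` and `count[1] = 0x00bbbbbb` (values picked to echo the output later in this thread), the eight bytes should come out as 00 bb bb bb 00 aa aa aa.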

It is also worth noting that the corruption only occurs in hardware with
128-bit SVE vectors.  On Graviton3 with 256-bit vectors the generated
machine code can exit early and not execute the problematic second half.

Here is a link to Compiler Explorer with the same function
https://godbolt.org/z/c99bMjene

Note that the value of NCOUNT can be set to either 2 or 4, with 4
preventing the compiler from simply using the `rev` instruction on
trunk.  Notably though setting NCOUNT to 4 generates correct code in all
versions I tested.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/0

------------------------------------------------------------------------
On 2025-02-21T15:59:43+00:00 Sjames-j wrote:

Does -fno-strict-aliasing work (the uint32_t* cast)?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/1

------------------------------------------------------------------------
On 2025-02-21T16:03:34+00:00 Luke Robison wrote:

In particular I believe the error occurs because of the following
sequence of instructions.  Looking at line numbers from the Compiler
Explorer output of 14.2:

In the first block line 8:

        index   z31.s, #-1, #-1

This generates a vector of {-1, -2, -3, -4, -5, -6, -7, -8} on 256-bit
vector machines, and only {-1, -2, -3, -4} on 128-bits.

Then for 128-bit machines, the vector is generated again for the second
half on line 20, and then manipulated into negative values {-4, -5, -6,
-7}

        index   z29.s, w3, #-1

But clearly the values should be {-5, -6, -7, -8}, and hence the
resulting data is shifted by one.
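
The arithmetic behind this can be checked with a small scalar model of `index`. It is only a sketch: `index_lanes` and `next_base` are hypothetical helper names, and the model assumes lane i receives base + i * step, with four 32-bit lanes per 128-bit vector.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of an INDEX-style series: lane i receives base + i * step. */
void index_lanes(int32_t *out, unsigned lanes, int32_t base, int32_t step)
{
    assert(out != 0 || lanes == 0);
    for (unsigned i = 0; i < lanes; i++)
        out[i] = base + (int32_t)i * step;
}

/* Base value for the next vector when the series continues where the
   previous vector of `lanes` elements stopped. */
int32_t next_base(int32_t base, int32_t step, unsigned lanes)
{
    return base + (int32_t)lanes * step;
}
```

For base -1, step -1 and four lanes, the first vector is {-1, -2, -3, -4} and `next_base` returns -5, so the second vector should start at -5 rather than at -4.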

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/2

------------------------------------------------------------------------
On 2025-02-21T16:05:01+00:00 Luke Robison wrote:

Sam,

No, -fno-strict-aliasing still produces incorrect results.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/3

------------------------------------------------------------------------
On 2025-02-21T16:06:44+00:00 Luke Robison wrote:

Apologies, I forgot to include the compile line and output:

gcc -fno-inline -O3 -Wall -fno-strict-aliasing -mcpu=neoverse-v1 -o final final.c

gcc:9 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
gcc:10 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
gcc:11 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
gcc:12.4 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:13.3 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:14.2 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/4

------------------------------------------------------------------------
On 2025-02-21T17:41:18+00:00 Pinskia wrote:

-mcpu=neoverse-v1 sets the number of SVE bits to 256.

tuning_models/neoversev1.h:  SVE_256, /* sve_width  */


So you can only run it on targets which have 256 bits.

If you want to override that, you can use the `-msve-vector-bits=scalable`
option to request scalable code.

But since you are specifying which cpu you are compiling for, it won't
run on a cpu which has a different vector width.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/5

------------------------------------------------------------------------
On 2025-02-21T18:04:06+00:00 Luke Robison wrote:

Andrew,

Thanks for taking a look.  I actually had not realized that
-msve-vector-bits=scalable is the only option guaranteed to produce
correct execution on machines with other vector sizes.  I need to make
sure I include that in a few places, thank you.

However, you and the documentation suggest that
-msve-vector-bits=scalable should take precedence over the value in
neoversev1.h, yet I'm still seeing the problem:


gcc -Wall -Wextra -O3 -fno-strict-aliasing -mcpu=neoverse-v1 -msve-vector-bits=scalable final.c

gcc:9 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
gcc:10 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
gcc:11 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
gcc:12.4 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:13.3 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:14.2 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/6

------------------------------------------------------------------------
On 2025-02-21T18:38:33+00:00 Luke Robison wrote:

Andrew,

Perhaps you mean that setting -mcpu=neoverse-v1 overrides the
-msve-vector-bits=scalable argument.  So I tried with
`-march=armv9-a+sve -msve-vector-bits=scalable`.  I still observe the
same erroneous output, so I still think there is an error here.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/7

------------------------------------------------------------------------
On 2025-02-21T18:42:51+00:00 Pinskia wrote:

.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/8

------------------------------------------------------------------------
On 2025-02-24T08:56:55+00:00 Rguenth wrote:

But with -mcpu=neoverse-v1 you are still specifying the number of SVE
bits to 256, so don't do that if the machine you run on behaves
differently.

IMO this bug should be closed as INVALID (on x86 with AVX10 we might now
run into similar issues with 256bit vs 512bit implementations and models
not only being taken as a set of features).

So case in point this makes -mcpu a lot more fragile to users.

As I read it -msve-vector-bits=scalable is supposed to wipe any
knowledge of SVE vector bits?  If that doesn't work then it should be
fixed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/9

------------------------------------------------------------------------
On 2025-02-24T09:07:50+00:00 Ktkachov wrote:

-mcpu=neoverse-v1 will not produce 256-bit-specific code. By default GCC will always generate scalable VLA code unless -msve-vector-bits= is used.
The SVE_256 in the Neoverse V1 tuning model is only used for vectorisation cost modeling heuristics and not for any correctness decisions.
So I'd expect the code from the user to work correctly

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/10

------------------------------------------------------------------------
On 2025-02-24T15:09:12+00:00 Tnfchris wrote:

Confirmed.

As Kyrill mentioned, -mcpu=<core|native> always keeps SVE code VLA.

I think there are two bugs here. I think it's tree-cunroll that's
transforming the VLA code into the broken VLS code.

Could you try with  -fdisable-tree-cunroll?

PS. It still reproduces on trunk if you use -fno-vect-cost-model

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/11

------------------------------------------------------------------------
On 2025-02-24T17:56:27+00:00 Luke Robison wrote:

Tamar,

I'm happy to test as many flags as you can think of, just send them my
way.

See below for detailed results, but I see that -fdisable-tree-cunroll
does not fix the problem, and I suspect that -march=armv8.4-a+sve must
cause a similar code path to -fno-vect-cost-model, since without a CPU
to target, no cost model is available.

In fact, with those flags, this problem affects gcc down to 8 as well.

CFLAGS="-fno-inline -O3 -Wall -fno-strict-aliasing  -march=armv8.4-a+sve"

gcc:8 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:9 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:10 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:11 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:12.4 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:13.3 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
gcc:14.2 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00


CFLAGS="-fno-inline -O3 -Wall -fno-strict-aliasing  -march=armv8.4-a+sve -fdisable-tree-cunroll"
cc1: note: disable pass tree-cunroll for functions in the range of [0, 4294967295]
gcc:8 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
cc1: note: disable pass tree-cunroll for functions in the range of [0, 4294967295]
gcc:9 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
cc1: note: disable pass tree-cunroll for functions in the range of [0, 4294967295]
gcc:10 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
cc1: note: disable pass tree-cunroll for functions in the range of [0, 4294967295]
gcc:11 gives PASS: got 0x00bbbbbb 0x00aaaaaa as expected
cc1: note: disable pass tree-cunroll for functions in the range of [0, 4294967295]
gcc:12.4 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
cc1: note: disable pass tree-cunroll for functions in the range of [0, 4294967295]
gcc:13.3 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00
cc1: note: disable pass tree-cunroll for functions in the range of [0, 4294967295]
gcc:14.2 gives ERROR: expected 0x00bbbbbb 0x00aaaaaa but got 0x00bbbbbb 0xaaaaaa00

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/12

------------------------------------------------------------------------
On 2025-02-26T11:00:18+00:00 Avieira-e wrote:

Looked a bit into this and fre4 takes:

_30 = { POLY_INT_CST [4, 4], POLY_INT_CST [5, 4], POLY_INT_CST [6, 4], ... };
...
vect__3.13_46 = ~_30;

and simplifies that ~_30 to a constant:
{ POLY_INT_CST [-4, -4], POLY_INT_CST [-5, -4], POLY_INT_CST [-6, -4], ... }

Which seems wrong to me, as ~(4) = -5, not -4

Looking at const_unop in fold-const.cc I see:
    case BIT_NOT_EXPR:
      if (TREE_CODE (arg0) == INTEGER_CST)
        return fold_not_const (arg0, type);
      else if (POLY_INT_CST_P (arg0))
        return wide_int_to_tree (type, -poly_int_cst_value (arg0));
      /* Perform BIT_NOT_EXPR on each element individually.  */
      else if (TREE_CODE (arg0) == VECTOR_CST)
...

The VECTOR_CST case just goes over the elements in the VECTOR_CST and calls this recursively.  Making the change:

--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -1964,7 +1964,7 @@ const_unop (enum tree_code code, tree type, tree arg0)
       if (TREE_CODE (arg0) == INTEGER_CST)
        return fold_not_const (arg0, type);
       else if (POLY_INT_CST_P (arg0))
-       return wide_int_to_tree (type, -poly_int_cst_value (arg0));
+       return wide_int_to_tree (type, ~poly_int_cst_value (arg0));
       /* Perform BIT_NOT_EXPR on each element individually.  */
       else if (TREE_CODE (arg0) == VECTOR_CST)
        {

fixes it for me.
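
The difference between the two operators can be illustrated with a toy two-coefficient model. This is a sketch under the assumption that POLY_INT_CST [c0, c1] stands for c0 + c1 * x with a runtime value x; `struct poly2` and its helpers are invented for illustration and are not GCC's poly_int API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of POLY_INT_CST [c0, c1], read as c0 + c1 * x. */
struct poly2 {
    int64_t c0;
    int64_t c1;
};

/* Evaluate the polynomial for a concrete runtime value x. */
int64_t poly2_eval(struct poly2 p, int64_t x)
{
    assert(x >= 0);
    return p.c0 + p.c1 * x;
}

/* Negation: -(c0 + c1*x) = (-c0) + (-c1)*x.  This is what the old code did. */
struct poly2 poly2_neg(struct poly2 p)
{
    struct poly2 r = { -p.c0, -p.c1 };
    return r;
}

/* Bitwise NOT: ~v = -1 - v, so ~(c0 + c1*x) = (-1 - c0) + (-c1)*x. */
struct poly2 poly2_not(struct poly2 p)
{
    struct poly2 r = { -1 - p.c0, -p.c1 };
    return r;
}
```

In this model, applying `poly2_not` to [4, 4] gives [-5, -4], while `poly2_neg` gives the [-4, -4] seen in the fre4 output above, which is off by one for every x.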

I'm going on holidays tomorrow, so I'll leave this for someone with a
bit more time to pick up. Richard S it seems to have been your change
in:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=36fd64086542ed734aded849304723218fa4d6fd
so I'll wait for others to confirm my thinking here, I always get
nervous around bit fiddling ;)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/13

------------------------------------------------------------------------
On 2025-02-27T01:21:01+00:00 Pinskia wrote:

(In reply to avieira from comment #13)

> 
> the VECTOR_CST just goes over the elements in the VECTOR_CST and calls this
> recursively, making the change:
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -1964,7 +1964,7 @@ const_unop (enum tree_code code, tree type, tree arg0)
>        if (TREE_CODE (arg0) == INTEGER_CST)
>         return fold_not_const (arg0, type);
>        else if (POLY_INT_CST_P (arg0))
> -       return wide_int_to_tree (type, -poly_int_cst_value (arg0));
> +       return wide_int_to_tree (type, ~poly_int_cst_value (arg0));
>        /* Perform BIT_NOT_EXPR on each element individually.  */
>        else if (TREE_CODE (arg0) == VECTOR_CST)
>         {
> 
> fixes it for me.
> 
> I'm going on holidays tomorrow, so I'll leave this for someone with a bit
> more time to pick up. Richard S it seems to have been your change in:
> https://gcc.gnu.org/git/?p=gcc.git;a=commit;
> h=36fd64086542ed734aded849304723218fa4d6fd so I'll wait for others to
> confirm my thinking here, I always get nervous around bit fiddling ;)

Yes, that does look like the fix; it looks obvious and like it was just
a typo.

Though I wonder if we could change fold_not_const to handle it instead.
But other than that looks correct.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/14

------------------------------------------------------------------------
On 2025-03-03T12:12:41+00:00 Rsandifo-gcc wrote:

Oops, yes, a typo indeed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/15

------------------------------------------------------------------------
On 2025-03-04T10:45:00+00:00 Cvs-commit wrote:

The trunk branch has been updated by Richard Sandiford
<rsandifo at gcc.gnu.org>:

https://gcc.gnu.org/g:78380fd7f743e23dfdf013d68a2f0347e1511550

commit r15-7807-g78380fd7f743e23dfdf013d68a2f0347e1511550
Author: Richard Sandiford <richard.sandiford at arm.com>
Date:   Tue Mar 4 10:44:35 2025 +0000

    Fix folding of BIT_NOT_EXPR for POLY_INT_CST [PR118976]
    
    There was an embarrassing typo in the folding of BIT_NOT_EXPR for
    POLY_INT_CSTs: it used - rather than ~ on the poly_int.  Not sure
    how that happened, but it might have been due to the way that
    ~x is implemented as -1 - x internally.
    
    gcc/
            PR tree-optimization/118976
            * fold-const.cc (const_unop): Use ~ rather than - for BIT_NOT_EXPR.
            * config/aarch64/aarch64.cc (aarch64_test_sve_folding): New function.
            (aarch64_run_selftests): Run it.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/16

------------------------------------------------------------------------
On 2025-03-04T10:49:58+00:00 Rsandifo-gcc wrote:

Fixed on trunk, will backport.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/17

------------------------------------------------------------------------
On 2025-03-04T17:49:53+00:00 Cvs-commit wrote:

The releases/gcc-14 branch has been updated by Richard Sandiford
<rsandifo at gcc.gnu.org>:

https://gcc.gnu.org/g:aa8793daa4ec110ae1e8fa240614651711b93fe4

commit r14-11380-gaa8793daa4ec110ae1e8fa240614651711b93fe4
Author: Richard Sandiford <richard.sandiford at arm.com>
Date:   Tue Mar 4 17:49:30 2025 +0000

    Fix folding of BIT_NOT_EXPR for POLY_INT_CST [PR118976]
    
    There was an embarrassing typo in the folding of BIT_NOT_EXPR for
    POLY_INT_CSTs: it used - rather than ~ on the poly_int.  Not sure
    how that happened, but it might have been due to the way that
    ~x is implemented as -1 - x internally.
    
    gcc/
            PR tree-optimization/118976
            * fold-const.cc (const_unop): Use ~ rather than - for BIT_NOT_EXPR.
            * config/aarch64/aarch64.cc (aarch64_test_sve_folding): New function.
            (aarch64_run_selftests): Run it.
    
    (cherry picked from commit 78380fd7f743e23dfdf013d68a2f0347e1511550)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/18

------------------------------------------------------------------------
On 2025-03-05T08:26:26+00:00 Cvs-commit wrote:

The releases/gcc-13 branch has been updated by Richard Sandiford
<rsandifo at gcc.gnu.org>:

https://gcc.gnu.org/g:7995713012fcc0e0e098157d87fe5ff9d85c820b

commit r13-9412-g7995713012fcc0e0e098157d87fe5ff9d85c820b
Author: Richard Sandiford <richard.sandiford at arm.com>
Date:   Wed Mar 5 08:25:55 2025 +0000

    Fix folding of BIT_NOT_EXPR for POLY_INT_CST [PR118976]
    
    There was an embarrassing typo in the folding of BIT_NOT_EXPR for
    POLY_INT_CSTs: it used - rather than ~ on the poly_int.  Not sure
    how that happened, but it might have been due to the way that
    ~x is implemented as -1 - x internally.
    
    gcc/
            PR tree-optimization/118976
            * fold-const.cc (const_unop): Use ~ rather than - for BIT_NOT_EXPR.
            * config/aarch64/aarch64.cc (aarch64_test_sve_folding): New function.
            (aarch64_run_selftests): Run it.
    
    (cherry picked from commit 78380fd7f743e23dfdf013d68a2f0347e1511550)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/19

------------------------------------------------------------------------
On 2025-03-06T15:10:23+00:00 Luke Robison wrote:

Richard,

Thank you for getting this merged and backported.  Although I initially
didn't observe this problem in gcc 11, I have since confirmed that with
the right flags  (-march=armv8.4-a+sve) it can be exposed as far back as
gcc-8.  My understanding is that versions 11 and 12 should still expect
at least one more release, should those branches receive a backport as
well?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/20

------------------------------------------------------------------------
On 2025-03-06T15:11:32+00:00 Sjames-j wrote:

(In reply to Luke Robison from comment #20)

11 is EOL (you can ask your distributor to handle it there if they're
still shipping it), but a backport to 12 is planned, as the bug remains open.

Thanks for the report.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/2101084/comments/21


** Changed in: gcc
       Status: Unknown => In Progress

** Changed in: gcc
   Importance: Unknown => Medium

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to gcc-10 in Ubuntu.
https://bugs.launchpad.net/bugs/2101084

Title:
  GCC produces wrong code for arm64+sve in some cases

Status in gcc:
  In Progress
Status in gcc-10 package in Ubuntu:
  New
Status in gcc-11 package in Ubuntu:
  New
Status in gcc-8 package in Ubuntu:
  New
Status in gcc-9 package in Ubuntu:
  New

Bug description:
  This bug-report is to request patching of the GCC bug 118976 in the
  Ubuntu gcc packages to avoid correctness issues, especially in 24.04
  and 22.04 LTS releases.

  This issue affects SVE vectorization which involves bitwise-not during
  optimization on arm64 platforms.  It was reported and fixed in
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118976.

  For gcc 8-11 there will be no minor releases as they are EOL upstream
  in GCC.  For gcc 12 through trunk the fix will be included in the next
  minor version, but my understanding is that Ubuntu LTS releases are
  unlikely to upgrade minor versions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/gcc/+bug/2101084/+subscriptions