[Bug 1799397] Re: [dpdk]rte_memcpy() moves data incorrectly on Ubuntu 18.04 on Intel Skylake.

Launchpad Bug Tracker 1799397 at bugs.launchpad.net
Tue Apr 2 17:16:58 UTC 2019


This bug was fixed in the package dpdk - 17.11.5-0~ubuntu18.10.1

---------------
dpdk (17.11.5-0~ubuntu18.10.1) cosmic; urgency=medium

  * New upstream release 17.11.5; for a full list of changes see:
    https://doc.dpdk.org/guides-17.11/rel_notes/release_17_11.html#id4
    https://doc.dpdk.org/guides-17.11/rel_notes/release_17_11.html#id5
    Among many other fixes this closes the following bugs:
    - request to merge 17.11.5 (LP: #1817675)
    - issues with -mavx512f on recent Skylake chips (LP: #1799397)
    - Drop d/p/net-mlx5-fix-build-with-rdma-core-v19.patch which is part of
      17.11.4
  * d/p/*kni-fix-build*: fix build with kernel 5.0 (LP: #1814919)
    as preparation for a HWE kernel based on the 5.0 version of 19.04

 -- Christian Ehrhardt <christian.ehrhardt at canonical.com>  Tue, 26 Feb 2019 12:34:12 +0100

** Changed in: dpdk (Ubuntu Cosmic)
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to gcc-7 in Ubuntu.
https://bugs.launchpad.net/bugs/1799397

Title:
  [dpdk]rte_memcpy() moves data incorrectly on Ubuntu 18.04 on Intel
  Skylake.

Status in DPDK:
  Fix Released
Status in dpdk package in Ubuntu:
  Fix Released
Status in gcc-7 package in Ubuntu:
  Invalid
Status in dpdk source package in Bionic:
  Fix Committed
Status in dpdk source package in Cosmic:
  Fix Released

Bug description:
  [Impact]

   * DPDK applications crash on certain Skylake chips

   * Follow upstream in disabling one of the gcc options (AVX512F code
     generation)

  [Test Case]

   * Part of the MRE bug 1817675, following the MRE verification process as
     defined there.

  [Regression Potential]

   * Rebuilds that use the DPDK headers with the new code will be slightly
     slower (the feature is no longer used) but avoid the crash. The
     slowdown should be negligible in most cases, and avoiding the crash
     outweighs it.

  [Other Info]
   
   * n/a

  ---

  Hi, Christian

  We've recently encountered a weird issue with Ubuntu 18.04 on the Skylake
  server. I can always reproduce this crash and I have narrowed it down. I guess
  it could be a GCC issue.

  [1] How to reproduce
  - ConnectX-4Lx/ConnectX-5 with mlx5 PMD in DPDK 18.02.1
  - Ubuntu 18.04 on Intel Skylake server
  - gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
  - Testpmd crashes when it starts to forward traffic. Easy to reproduce.
  - Only happens on the Skylake server.
  - DPDK 18.05 and later don't have this issue. git-bisect gives no clue.

  This is because I enabled MEMPOOL_DEBUG and MLX5_DEBUG. As mempool/rte_memcpy is
  an inlined function, it should be affected. Now I can see the crash regardless
  of the version: 18.02, 18.05 and 18.08.

  [2] Failure point

  The attached patch gives an insight of why it crashes. The following is the
  result of the patch and the GDB commands.

  In summary, rte_memcpy() doesn't work as expected. In __mempool_generic_put(),
  rte_memcpy() moves the array of objects to the lcore cache. If I run
  memcmp() right after rte_memcpy(dst, src, n), the data in dst differs from the
  data in src. It looks like some of the data got shifted by a few bytes, as you
  can see below.

   [GDB command]
   $dst = 0x7ffff4e09ea8
   $src = 0x7fffce3fb970
   $n = 256
   x/32gx 0x7ffff4e09ea8
   x/32gx 0x7fffce3fb970
   testpmd: /home/mlnxtest/dpdk/build/include/rte_mempool.h:1140: __mempool_generic_put: Assertion `0' failed.

   Thread 4 "lcore-slave-1" received signal SIGABRT, Aborted.
   [Switching to Thread 0x7fffce3ff700 (LWP 69913)]
   (gdb) x/32gx 0x7ffff4e09ea8
   0x7ffff4e09ea8: 0x00007fffaac38ec0      0x00007fffaac38500
   0x7ffff4e09eb8: 0x00007fffaac37b40      0x00007fffaac37180
   0x7ffff4e09ec8: 0x850000007fffaac3      0x7b4000007fffaac3
   0x7ffff4e09ed8: 0x00007fffaac35440      0x00007fffaac34a80
   0x7ffff4e09ee8: 0xaac3850000007fff      0xaac37b4000007fff
   0x7ffff4e09ef8: 0x00007fffaac32d40      0x00007fffaac32380
   0x7ffff4e09f08: 0x7fffaac385000000      0x7fffaac37b400000
   0x7ffff4e09f18: 0x00007fffaac30640      0x00007fffaac2fc80
   0x7ffff4e09f28: 0x00007fffaac2f2c0      0x00007fffaac2e900
   0x7ffff4e09f38: 0x00007fffaac2df40      0x00007fffaac2d580
   0x7ffff4e09f48: 0x00007fffaac2cbc0      0x00007fffaac2c200
   0x7ffff4e09f58: 0x00007fffaac2b840      0x00007fffaac2ae80
   0x7ffff4e09f68: 0x00007fffaac2a4c0      0x00007fffaac29b00
   0x7ffff4e09f78: 0x00007fffaac29140      0x00007fffaac28780
   0x7ffff4e09f88: 0x00007fffaac27dc0      0x00007fffaac27400
   0x7ffff4e09f98: 0x00007fffaac26a40      0x00007fffaac26080
   (gdb) x/32gx 0x7fffce3fb970
   0x7fffce3fb970: 0x00007fffaac38ec0      0x00007fffaac38500
   0x7fffce3fb980: 0x00007fffaac37b40      0x00007fffaac37180
   0x7fffce3fb990: 0x00007fffaac367c0      0x00007fffaac35e00
   0x7fffce3fb9a0: 0x00007fffaac35440      0x00007fffaac34a80
   0x7fffce3fb9b0: 0x00007fffaac340c0      0x00007fffaac33700
   0x7fffce3fb9c0: 0x00007fffaac32d40      0x00007fffaac32380
   0x7fffce3fb9d0: 0x00007fffaac319c0      0x00007fffaac31000
   0x7fffce3fb9e0: 0x00007fffaac30640      0x00007fffaac2fc80
   0x7fffce3fb9f0: 0x00007fffaac2f2c0      0x00007fffaac2e900
   0x7fffce3fba00: 0x00007fffaac2df40      0x00007fffaac2d580
   0x7fffce3fba10: 0x00007fffaac2cbc0      0x00007fffaac2c200
   0x7fffce3fba20: 0x00007fffaac2b840      0x00007fffaac2ae80
   0x7fffce3fba30: 0x00007fffaac2a4c0      0x00007fffaac29b00
   0x7fffce3fba40: 0x00007fffaac29140      0x00007fffaac28780
   0x7fffce3fba50: 0x00007fffaac27dc0      0x00007fffaac27400
   0x7fffce3fba60: 0x00007fffaac26a40      0x00007fffaac26080
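
One way to read the two dumps above: only six of the 32 words differ, and each
corrupted word in dst appears to equal src word 1 or src word 2 rotated right by
a whole number of bytes. The C sketch below checks that reading against the
values printed above. The names (dump_dst, dump_src, rotr64, count_mismatches,
find_rotated_source) are made up for this illustration and are not part of DPDK.

```c
#include <stddef.h>
#include <stdint.h>

/* First 16 of the 32 words from each dump; words 16..31 are identical in
 * both dumps and are omitted here. */
static const uint64_t dump_dst[16] = {
    0x00007fffaac38ec0, 0x00007fffaac38500,
    0x00007fffaac37b40, 0x00007fffaac37180,
    0x850000007fffaac3, 0x7b4000007fffaac3,
    0x00007fffaac35440, 0x00007fffaac34a80,
    0xaac3850000007fff, 0xaac37b4000007fff,
    0x00007fffaac32d40, 0x00007fffaac32380,
    0x7fffaac385000000, 0x7fffaac37b400000,
    0x00007fffaac30640, 0x00007fffaac2fc80,
};

static const uint64_t dump_src[16] = {
    0x00007fffaac38ec0, 0x00007fffaac38500,
    0x00007fffaac37b40, 0x00007fffaac37180,
    0x00007fffaac367c0, 0x00007fffaac35e00,
    0x00007fffaac35440, 0x00007fffaac34a80,
    0x00007fffaac340c0, 0x00007fffaac33700,
    0x00007fffaac32d40, 0x00007fffaac32380,
    0x00007fffaac319c0, 0x00007fffaac31000,
    0x00007fffaac30640, 0x00007fffaac2fc80,
};

/* Rotate a 64-bit value right by `bits` (taken modulo 64). */
static uint64_t rotr64(uint64_t v, unsigned bits)
{
    bits &= 63;
    return bits ? (v >> bits) | (v << (64 - bits)) : v;
}

/* Count the words that differ between two arrays of n words. */
static size_t count_mismatches(const uint64_t *dst, const uint64_t *src,
                               size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (dst[i] != src[i])
            count++;
    return count;
}

/* If dst_word equals some src[i] rotated right by 1..7 whole bytes, store
 * the index and byte count and return 1; otherwise return 0. */
static int find_rotated_source(uint64_t dst_word, const uint64_t *src,
                               size_t n, size_t *idx, unsigned *bytes)
{
    for (size_t i = 0; i < n; i++) {
        for (unsigned b = 1; b < 8; b++) {
            if (rotr64(src[i], 8 * b) == dst_word) {
                *idx = i;
                *bytes = b;
                return 1;
            }
        }
    }
    return 0;
}
```

Calling find_rotated_source() on dst words 4, 8 and 12 should report src word 1
rotated by 2, 4 and 6 bytes respectively; words 5, 9 and 13 relate to src word 2
in the same way. This only describes the pattern in the dump and does not by
itself identify the faulty instruction.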

  AFAIK, AVX512F support is disabled by default in DPDK as it is still
  experimental (CONFIG_RTE_ENABLE_AVX512=n). But with gcc optimization, the AVX2
  version of rte_memcpy() seems to be compiled with 512-bit instructions. If I
  disable them by adding EXTRA_CFLAGS="-mno-avx512f", then it works fine and
  doesn't crash.
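
To confirm whether a given build was actually compiled with AVX512F code
generation, a small check like the following may help. It is a minimal sketch,
assuming GCC or Clang: both define the __AVX512F__ macro when AVX512F is enabled
(for example through -mavx512f or a -march value that implies it) and leave it
undefined under -mno-avx512f. The helper names are hypothetical and not part of
DPDK.

```c
#include <stdbool.h>

/* 1 if this translation unit is compiled with AVX512F code generation
 * enabled, 0 otherwise (for example when built with -mno-avx512f). */
#ifdef __AVX512F__
#define BUILT_WITH_AVX512F 1
#else
#define BUILT_WITH_AVX512F 0
#endif

/* Compile-time view: may the compiler emit AVX512F instructions here? */
static bool compiled_with_avx512f(void)
{
    return BUILT_WITH_AVX512F != 0;
}

/* Run-time view: does the CPU executing this code report AVX512F?
 * __builtin_cpu_supports() is a GCC/Clang builtin for x86 targets; on
 * other compilers or architectures this sketch simply returns false. */
static bool cpu_has_avx512f(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;
#endif
}
```

On an affected Skylake server one would expect cpu_has_avx512f() to return true
in both builds, while compiled_with_avx512f() should flip from true to false
once -mno-avx512f is added to EXTRA_CFLAGS.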

  Do you have any idea regarding this issue or are you already aware of
  it?

  Thanks,
  Yongseok

  $ git diff
  diff --git a/config/common_base b/config/common_base
  index ad03cf433..f512b5a88 100644
  --- a/config/common_base
  +++ b/config/common_base
  @@ -275,8 +275,8 @@ CONFIG_RTE_LIBRTE_MLX4_TX_MP_CACHE=8
   #
   # Compile burst-oriented Mellanox ConnectX-4 & ConnectX-5 (MLX5) PMD
   #
  -CONFIG_RTE_LIBRTE_MLX5_PMD=n
  -CONFIG_RTE_LIBRTE_MLX5_DEBUG=n
  +CONFIG_RTE_LIBRTE_MLX5_PMD=y
  +CONFIG_RTE_LIBRTE_MLX5_DEBUG=y
   CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS=n
   CONFIG_RTE_LIBRTE_MLX5_TX_MP_CACHE=8

  @@ -597,7 +597,7 @@ CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
   #
   CONFIG_RTE_LIBRTE_MEMPOOL=y
   CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE=512
  -CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
  +CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=y

   #
   # Compile Mempool drivers
  diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
  index 8b1b7f7ed..9f48028d9 100644
  --- a/lib/librte_mempool/rte_mempool.h
  +++ b/lib/librte_mempool/rte_mempool.h
  @@ -39,6 +39,7 @@
   #include <errno.h>
   #include <inttypes.h>
   #include <sys/queue.h>
  +#include <assert.h>

   #include <rte_config.h>
   #include <rte_spinlock.h>
  @@ -1123,6 +1124,22 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
          /* Add elements back into the cache */
          rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);

  +       if(memcmp(&cache_objs[0], obj_table, sizeof(void *) * n)) {
  +               printf("[GDB command] \n"
  +                      "$dst = %p\n"
  +                      "$src = %p\n"
  +                      "$n = %zu\n"
  +                      "x/%zugx %p\n"
  +                      "x/%zugx %p\n",
  +                      (void *)&cache_objs[0],
  +                      (const void *)obj_table,
  +                      sizeof(void *) * n,
  +                      sizeof(void *) * n / 8, (void *)&cache_objs[0],
  +                      sizeof(void *) * n / 8, (const void *)obj_table
  +                      );
  +               assert(0);
  +       }
  +
          cache->len += n;

          if (cache->len >= cache->flushthresh) {
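
The copy-then-compare check added by the patch above can also be written as a
standalone helper, which may be handy for testing a copy routine outside of
testpmd. This is a sketch only: copy_and_verify(), plain_memcpy() and
faulty_copy() are hypothetical names, and the standard memcpy() stands in for
rte_memcpy(), which is assumed to have a compatible signature.

```c
#include <stddef.h>
#include <string.h>

/* Signature shared by the copy routines under test. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Reference copy: a thin wrapper around the standard memcpy(). */
static void *plain_memcpy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Deliberately wrong copy, for demonstration only: copies correctly and
 * then corrupts the byte at offset 32 when the buffer is long enough. */
static void *faulty_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
    if (n > 32)
        ((unsigned char *)dst)[32] ^= 0xff;
    return dst;
}

/* Copy n bytes with copy_fn, then compare dst against src.
 * Returns -1 if they match, otherwise the offset of the first byte that
 * differs. */
static ptrdiff_t copy_and_verify(copy_fn_t copy_fn, void *dst,
                                 const void *src, size_t n)
{
    const unsigned char *d = dst;
    const unsigned char *s = src;

    copy_fn(dst, src, n);
    if (memcmp(dst, src, n) == 0)
        return -1;

    for (size_t i = 0; i < n; i++)
        if (d[i] != s[i])
            return (ptrdiff_t)i;

    return -1; /* not reached: memcmp() already reported a difference */
}
```

Reporting the first differing offset instead of asserting keeps the process
alive, so the same check could be repeated for several sizes and alignments.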

To manage notifications about this bug go to:
https://bugs.launchpad.net/dpdk/+bug/1799397/+subscriptions


