[Bug 2031912] Re: glibc 2.38 causes hangs on some openMPI-using packages

Sergio Durigan Junior 2031912 at bugs.launchpad.net
Wed Aug 23 04:23:41 UTC 2023


Athos and I had a fun conversation about this bug, which prompted me to
look more deeply into what's going on.  I think I found why the bug is
happening.  It's an interesting race involving multithreading,
semaphores and "open(2)" flags.

After some GDB/strace analysis, and having the gut feeling that this is
one of those "race-condition between runs due to something happening in
the filesystem", I initially found that the problem happens because, on
the first run (and then on subsequent runs that are odd-numbered), the
semaphore created by openmpi (called /dev/shm/sem.OMPIO_aaaa) doesn't
exist.  This causes sem_open[0] to fail to open the file (because of the
O_CREAT flag used here[1]).  Note that this open(2) is performed
concurrently by the two threads created by mpirun.  Also note that this
first failure is expected, because sem_open is being invoked with
O_CREAT (this time a sem_open flag!) by openmpi.

As can be seen, when sem_open fails to open the file at the location
mentioned above, it cleans things up and jumps to the label "try_again".
At that point we're inside a section of the code which expects the
semaphore file to exist.  As such, O_CREAT (the "open" flag) needs to be
removed from open_flags, but it isn't, because the line at [2] sits
above the label and is not executed again after the jump.

I'm still building glibc with the proposed change to test the fix, but
I'm pretty sure that line needs to be moved below the label, so I
submitted [3].  I'll report back on the results of the test tomorrow.
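
To make the control flow concrete, here is a simplified model of it in
C.  This is a sketch, not the glibc source: the function names
(existing_file_flags, flags_used_on_retry) and the boolean switch are
hypothetical, only a subset of the flags is modelled, and it assumes the
create path reuses open_flags with O_CREAT | O_EXCL set, as [1]
suggests.  It returns the flags the second open(2) would receive after
the create path loses the race and jumps back to "try_again".

```c
#include <fcntl.h>
#include <stdbool.h>

/* Flags for the "semaphore file must already exist" open: the caller's
   oflag with O_CREAT and the access mode stripped out.  (Simplified:
   the other flags a real implementation would add are left out.)  */
static int existing_file_flags(int oflag)
{
    return O_RDWR | (oflag & ~(O_CREAT | O_ACCMODE));
}

/* Hypothetical model of the retry.  Returns the flags the second
   open(2) would be called with.
   assign_inside_label == false: assignment sits above "try_again".
   assign_inside_label == true:  assignment sits below "try_again".  */
int flags_used_on_retry(int oflag, bool assign_inside_label)
{
    int open_flags = 0;
    bool retried = false;

    if (!assign_inside_label)
        open_flags = existing_file_flags(oflag);

try_again:
    if (assign_inside_label)
        open_flags = existing_file_flags(oflag);

    if (retried)
        return open_flags;  /* what the retry would pass to open(2) */

    /* First attempt: the file does not exist yet, so we fall through
       to the create path, which (by assumption) overwrites open_flags
       for the temporary file.  */
    open_flags = O_RDWR | O_CREAT | O_EXCL;

    /* Another thread published its file first, so we clean up and
       retry the plain open.  */
    retried = true;
    goto try_again;
}
```

With assign_inside_label set to false, the returned flags still carry
O_CREAT (and O_EXCL); with it set to true, the assignment runs again
after the jump and the retry sees plain O_RDWR.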

[0]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e
[1]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e#n138
[2]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e#n69
[3]: https://inbox.sourceware.org/libc-alpha/20230823042129.3955131-1-sergiodj@sergiodj.net/T/#u

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/2031912

Title:
  glibc 2.38 causes hangs on some openMPI-using packages

Status in dolfin package in Ubuntu:
  New
Status in glibc package in Ubuntu:
  New
Status in h5py package in Ubuntu:
  New
Status in mpi4py package in Ubuntu:
  New
Status in openmpi package in Ubuntu:
  New

Bug description:
  This occurs on amd64 and armhf.

  Relevant logs:

  706s tests/test_file.py::TestPathlibSupport::test_pathlib_name_match PASSED   [ 65%]
  706s tests/test_file.py::TestPickle::test_dump_error PASSED                   [ 65%]
  10706s tests/test_file.py::TestMPI::test_mpio autopkgtest [16:36:57]: ERROR: timed out on command "su -s /bin/bash ubuntu -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true;  . ~/.profile >/dev/null 2>&1 || true; buildtree="/tmp/autopkgtest.izAQWQ/build.Q5d/src"; mkdir -p -m 1777 -- "/tmp/autopkgtest.izAQWQ/python3-mpi-artifacts"; export AUTOPKGTEST_ARTIFACTS="/tmp/autopkgtest.izAQWQ/python3-mpi-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest.izAQWQ/autopkgtest_tmp"; export AUTOPKGTEST_TMP="/tmp/autopkgtest.izAQWQ/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=2; unset LANGUAGE LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE   LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS   LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pid; set -C; echo $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; export 'ADT_TEST_TRIGGERS=glibc/2.38-1ubuntu3'; chmod +x /tmp/autopkgtest.izAQWQ/build.Q5d/src/debian/tests/python3-mpi; touch /tmp/autopkgtest.izAQWQ/python3-mpi-stdout /tmp/autopkgtest.izAQWQ/python3-mpi-stderr; /tmp/autopkgtest.izAQWQ/build.Q5d/src/debian/tests/python3-mpi 2> >(tee -a /tmp/autopkgtest.izAQWQ/python3-mpi-stderr >&2) > >(tee -a /tmp/autopkgtest.izAQWQ/python3-mpi-stdout);" (kind: test)

  Full logs:
  https://autopkgtest.ubuntu.com/results/autopkgtest-mantic/mantic/amd64/h/h5py/20230817_163711_290b0@/log.gz

  Marking Critical as this blocks the glibc transition.




