[Bug 2031912] Re: glibc 2.38 causes hangs on some openMPI-using packages
Sergio Durigan Junior
2031912 at bugs.launchpad.net
Wed Aug 23 04:23:41 UTC 2023
Athos and I had a fun conversation about this bug, which prompted me to
look more deeply into what's going on. I think I found why the bug is
happening. It's an interesting race involving multithreading,
semaphores and "open(2)" flags.
After some GDB/strace analysis, and with a gut feeling that this is
one of those races between runs caused by something happening in
the filesystem, I found that the problem happens because, on
the first run (and then on subsequent odd-numbered runs), the
semaphore created by openmpi (called /dev/shm/sem.OMPIO_aaaa) doesn't
exist. This causes sem_open[0] to fail to open the file (because of the
O_CREAT flag used here[1]). Note that this open(2) is performed
concurrently by the two threads created by mpirun. Also note that this
first failure is expected, because sem_open is being invoked with
O_CREAT (this time a sem_open flag!) by openmpi.
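To make the open(2) side of this easier to picture, here is a minimal sketch of the general flag semantics involved. It is not a trace of glibc: it uses Python's os.open as a thin wrapper over open(2), runs the steps sequentially so the outcome is deterministic, and the file name and the helper names (try_open, demo) are made up for illustration.

```python
import errno
import os
import tempfile


def try_open(path, flags, mode=0o600):
    """Return 0 on success, or the errno reported by open(2)."""
    try:
        os.close(os.open(path, flags, mode))
        return 0
    except OSError as exc:
        return exc.errno


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for /dev/shm/sem.OMPIO_aaaa.
        path = os.path.join(tmp, "sem.demo")
        plain = os.O_RDWR | os.O_CLOEXEC
        exclusive = plain | os.O_CREAT | os.O_EXCL
        return [
            try_open(path, plain),      # file missing, no O_CREAT
            try_open(path, exclusive),  # first creator wins
            try_open(path, exclusive),  # second creator loses the race
            try_open(path, plain),      # file now exists
        ]


if __name__ == "__main__":
    print([errno.errorcode.get(e, "OK") for e in demo()])
```

The point of the sketch is only that the result of each open(2) depends on whether the file already exists, which is why two callers racing on the same semaphore name can take different code paths.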
As can be seen, when sem_open fails to open the file at the location
mentioned above, it cleans things up and jumps to the label "try_again".
At that point we're inside a section of the code which expects the
semaphore file to exist. As such, O_CREAT (the "open" flag) needs to be
removed from open_flags, but it isn't, because the assignment at [2] sits
above the label and so is not re-executed after the jump.
I'm still building glibc with the proposed change to test the fix, but
I'm pretty sure that line needs to be moved below the label, so I
submitted [3]. I'll report back on the results of the test tomorrow.
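The stale-flags problem can be sketched with a toy model of the retry flow. This is an illustration only, written in Python rather than C, and it is not glibc's code: the flag sets, the loop standing in for "goto try_again", and the names model_sem_open, rival_creates and demo are all assumptions made for the example. The reset_after_label switch stands for the proposed change of moving the open_flags assignment below the label.

```python
import os
import tempfile

# Assumed flag sets for this model (not copied from glibc).
EXISTING_FLAGS = os.O_RDWR | os.O_NOFOLLOW | os.O_CLOEXEC
CREATE_FLAGS = os.O_RDWR | os.O_CREAT | os.O_EXCL | os.O_CLOEXEC


def model_sem_open(path, reset_after_label, rival):
    """Toy retry loop. Returns (flags seen at the label, success)."""
    seen = []
    open_flags = EXISTING_FLAGS  # original placement: above the label
    for _ in range(2):
        # "try_again:" label
        if reset_after_label:
            open_flags = EXISTING_FLAGS  # proposed placement: below the label
        seen.append(open_flags)
        try:
            os.close(os.open(path, open_flags))
            return seen, True
        except FileNotFoundError:
            pass  # expected first failure: go and create the file
        except FileExistsError:
            return seen, False  # stale O_CREAT | O_EXCL reached this open

        # Creation path: overwrites open_flags, then loses the race.
        open_flags = CREATE_FLAGS
        rival(path)
        try:
            os.close(os.open(path, open_flags, 0o600))
            return seen, True
        except FileExistsError:
            continue  # "goto try_again"
    return seen, False


def rival_creates(path):
    """Stand-in for the other process creating the file first."""
    os.close(os.open(path, CREATE_FLAGS, 0o600))


def demo(reset_after_label):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sem.demo")  # hypothetical name
        return model_sem_open(path, reset_after_label, rival_creates)


if __name__ == "__main__":
    for fixed in (False, True):
        seen, ok = demo(fixed)
        stale = bool(seen[-1] & os.O_CREAT)
        print(f"reset_after_label={fixed}: retry has O_CREAT={stale}, opened={ok}")
```

In this model, the retry carries O_CREAT only when the assignment stays above the label. How the stale flag then misbehaves inside the real sem_open may well differ from the EEXIST failure the model produces.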
[0]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e
[1]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e#n138
[2]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e#n69
[3]: https://inbox.sourceware.org/libc-alpha/20230823042129.3955131-1-sergiodj@sergiodj.net/T/#u
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/2031912
Title:
glibc 2.38 causes hangs on some openMPI-using packages
Status in dolfin package in Ubuntu:
New
Status in glibc package in Ubuntu:
New
Status in h5py package in Ubuntu:
New
Status in mpi4py package in Ubuntu:
New
Status in openmpi package in Ubuntu:
New
Bug description:
This occurs on amd64 and armhf.
Relevant logs:
706s tests/test_file.py::TestPathlibSupport::test_pathlib_name_match PASSED [ 65%]
706s tests/test_file.py::TestPickle::test_dump_error PASSED [ 65%]
10706s tests/test_file.py::TestMPI::test_mpio autopkgtest [16:36:57]: ERROR: timed out on command "su -s /bin/bash ubuntu -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true; . ~/.profile >/dev/null 2>&1 || true; buildtree="/tmp/autopkgtest.izAQWQ/build.Q5d/src"; mkdir -p -m 1777 -- "/tmp/autopkgtest.izAQWQ/python3-mpi-artifacts"; export AUTOPKGTEST_ARTIFACTS="/tmp/autopkgtest.izAQWQ/python3-mpi-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest.izAQWQ/autopkgtest_tmp"; export AUTOPKGTEST_TMP="/tmp/autopkgtest.izAQWQ/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=2; unset LANGUAGE LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pid; set -C; echo $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; export 'ADT_TEST_TRIGGERS=glibc/2.38-1ubuntu3'; chmod +x /tmp/autopkgtest.izAQWQ/build.Q5d/src/debian/tests/python3-mpi; touch /tmp/autopkgtest.izAQWQ/python3-mpi-stdout /tmp/autopkgtest.izAQWQ/python3-mpi-stderr; /tmp/autopkgtest.izAQWQ/build.Q5d/src/debian/tests/python3-mpi 2> >(tee -a /tmp/autopkgtest.izAQWQ/python3-mpi-stderr >&2) > >(tee -a /tmp/autopkgtest.izAQWQ/python3-mpi-stdout);" (kind: test)
Full logs: https://autopkgtest.ubuntu.com/results/autopkgtest-mantic/mantic/amd64/h/h5py/20230817_163711_290b0@/log.gz
Marking Critical as this blocks the glibc transition.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/dolfin/+bug/2031912/+subscriptions