[Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04
Mikko Rantalainen
2003851 at bugs.launchpad.net
Tue May 7 08:36:21 UTC 2024
This sounds like a race condition that leads to live lock, considering
that both gpgv and apt are waiting for the other end to speak at the
same time.
I assume the protocol between these is custom made for this connection.
How the protocol is supposed to prevent this live lock situation from
happening? If the protocol cannot avoid this situation, then the bug is
in the protocol design and the least invasive fix is to add timeout for
the waiting time and consider the execution as failed if timeout is
triggered.
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851
Title:
occasional hanging 'apt-get update' from daily cronjob since Jammy
22.04
Status in apt package in Ubuntu:
Confirmed
Bug description:
Hi!
Yesterday I spotted several machines of ours where a period `apt-get
update` was stalled. The `http` children were hanging in `WaitFd`
(waiting for parent instructions/queue). The parent was looping in
`AcquireUpdate` every 500ms.
We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks.
Example:
```
# apt-get update
Reading package lists... Done
E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get)
N: Be aware that removing the lock file is not a solution and may break your system.
E: Unable to lock directory /var/lib/apt/lists/
```
Task listing:
```
root 153929 \_ /usr/sbin/CRON -f -P
root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron
root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
root 154026 \_ apt-get update
_apt 154029 \_ /usr/lib/apt/methods/http
_apt 154030 \_ /usr/lib/apt/methods/http
_apt 154031 \_ /usr/lib/apt/methods/http
_apt 154033 \_ /usr/lib/apt/methods/gpgv
```
Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
```
# netstat -apn | grep -E '154026|154029|154030|154031|154033'
tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http
tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http
tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http
```
All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL).
The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
NULL, {tv_sec=0, tv_nsec=500000000}, NULL).
The http sockets in the children were at fd=3.
Parent lsof:
```
# lsof -p 154026 +E
...
apt-get 154026 root 4uW REG 8,1 0 262281 /var/lib/apt/lists/lock
apt-get 154026 root 5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w
apt-get 154026 root 6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w
apt-get 154026 root 7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w
apt-get 154026 root 8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r
apt-get 154026 root 9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w
apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r
apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r
apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r
http 154029 _apt 0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w
http 154029 _apt 1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r
http 154030 _apt 0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w
http 154030 _apt 1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r
http 154031 _apt 0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w
http 154031 _apt 1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r
gpgv 154033 _apt 0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w
gpgv 154033 _apt 1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r
```
So:
- apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9)
- http and gpgv are waiting for any data written by their parent (at their respective fd 0)
Parent backtrace:
```
#0 0x00007f420116a74d in select ()
from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f420153fb5d in pkgAcquire::Run(int) ()
from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2 0x00007f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#3 0x00007f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) ()
from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#4 0x00007f42016d127b in DoUpdate (CmdL=...)
at ./apt-private/private-update.cc:73
#5 0x00007f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) ()
from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#6 0x00007f420169fa97 in DispatchCommandLine (CmdL=...,
Cmds=std::vector of length 27, capacity 32 = {...})
at ./apt-private/private-cmndline.cc:588
#7 0x0000561fc06bafbd in main (argc=2, argv=0x7ffc0e14aa38)
at ./cmdline/apt-get.cc:447
```
Child backtrace:
```
#0 0x00007f58b1c9b74d in select ()
from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f58b237cb43 in WaitFd(int, bool, unsigned long) ()
from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2 0x00005643e37bc7b3 in BaseHttpMethod::Loop (
this=0x7ffdeaed56c0) at ./methods/basehttp.cc:611
#3 main (argv=<optimized out>) at ./methods/http.cc:1052
```
Where this is:
```
(gdb) print *this
$1 = {<aptAuthConfMethod> = {<aptMethod> = {<pkgAcqMethod> = {<No data fields>}, <aptConfigWrapperForMethods> = {
methodNames = std::vector of length 2, capacity 2 = {
"http", "http::XXXXXX.nl"}}, Binary = "http",
SeccompFlags = 6},
authconfs = std::vector of length 1, capacity 1 = {
std::unique_ptr<FileFd> = {get() = 0x5643e5343100}}},
Server = std::unique_ptr<ServerState> = {
get() = 0x5643e534fba0}, NextURI = "",
AllowRedirect = true, Debug = false, PipelineDepth = 10,
static FailFile = {static npos = 18446744073709551615,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x5643e534d650 "/var/lib/apt/lists/partial/apt.osso.nl_ubuntu-security_dists_jammy-security_InRelease"},
_M_string_length = 85, {
_M_local_buf = "\226", '\000' <repeats 14 times>,
_M_allocated_capacity = 150}}, static FailFd = -1,
static FailTime = 1674071760}
```
Relevant code in parent:
```
pkgAcquire::RunResult pkgAcquire::Run(int PulseInterval)
{
...
// Run till all things have been acquired
struct timeval tv = SteadyDurationToTimeVal(std::chrono::microseconds(PulseInterval));
while (ToFetch > 0)
...
int Res;
do
{
Res = select(Highest+1,&RFds,&WFds,0,&tv);
}
while (Res < 0 && errno == EINTR);
```
Relevant code in child:
```
int BaseHttpMethod::Loop()
{
signal(SIGTERM,SigTerm);
signal(SIGINT,SigTerm);
Server = 0;
int FailCounter = 0;
while (1)
{
// We have no commands, wait for some to arrive
if (Queue == 0)
{
if (WaitFd(STDIN_FILENO) == false)
return 0;
```
Versions:
```
Ubuntu 22.04.1 LTS (Jammy Jellyfish)
apt 2.4.8
```
Preliminiary conclusion/notes:
- I don't know why it stalls, but it does. And this was not a single
incident. All the stalling apt processes had stopped at a different
time/day.
- Apparently there is a situation possible where there are no commands
queued from the parent, and the parent isn't planning on sending any
either.
- Our apt-get is invoked by a cron job which calls either `update-
notifier-common` or `apt-get update`:
https://github.com/ossobv/zabbix-agent-
osso/blob/c61aee6087c0d03e66d03013c61acd9f65d0eaab/scripts/dpkg.updates#L14-L31
-- on the machines where apt-get was hanging, there was no `update-
notifier-common`.
- I don't know if this is 2.4.x related (on Focal we have apt 2.0.x),
but it sure looks like it. I did check a few other Focal machines that
lacked `update-notifier-common`, but they were not having a stalled
apt-get (although _not_ seeing a hung process proves nothing)
Let me know if there's anything I can get you. Should this be filed elsewhere? (Debian Salsa?)
Cheers,
Walter Doekes
OSSO B.V.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851/+subscriptions
More information about the foundations-bugs
mailing list