[Bug 1872118] Re: [SRU] DHCP Cluster crashes after a few hours
Jorge Niedbalski
1872118 at bugs.launchpad.net
Tue Aug 11 19:38:25 UTC 2020
** Description changed:
+ [Description]
- I have a pair of DHCP serevrs running in a cluster on ubuntu 20.04, All worked perfectly until recently, when they started stopping with code=killed, status=6/ABRT.
- This is being fixed by
+ isc-dhcp-server uses libisc-export (coming from bin9-libs package) for handling the socket event(s) when configured in peer mode (master/secondary). It's possible that a sequence of messages dispatched by the master that requires acknowledgment from its peers holds a socket
+ in a pending to send state, a timer or a subsequent write request can be scheduled into this socket and the !sock->pending_send assertion
+ will be raised when trying to write again at the time data hasn't been flushed entirely and the pending_send flag hasn't been reset to 0 state.
- https://bugs.launchpad.net/bugs/1870729
+ If this race condition happens, the following stacktrace will be
+ hit:
- However now one stops after a few hours with the following errors. One
- can stay on line but not both.
+ (gdb) info threads
+ Id Target Id Frame
+ * 1 Thread 0x7fb4ddecb700 (LWP 3170) __GI_raise (sig=sig at entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
+ 2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex at entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52
+ 3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364
+ 4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183
+
+ (gdb) frame 2
+ #2 0x00007fb4dec85985 in isc_assertion_failed (file=file at entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line at entry=3361, type=type at entry=isc_assertiontype_insist,
+ cond=cond at entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
+ (gdb) bt
+ #1 0x00007fb4deaa7859 in __GI_abort () at abort.c:79
+ #2 0x00007fb4dec85985 in isc_assertion_failed (file=file at entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line at entry=3361, type=type at entry=isc_assertiontype_insist,
+ cond=cond at entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
+ #3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
+ #4 process_fd (writeable=<optimized out>, readable=<optimized out>, fd=11, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4054
+ #5 process_fds (writefds=<optimized out>, readfds=0x7fb4de6d1090, maxfd=13, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4211
+ #6 watcher (uap=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4397
+ #7 0x00007fb4dea68609 in start_thread (arg=<optimized out>) at pthread_create.c:477
+ #8 0x00007fb4deba4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
+
+ (gdb) frame 3
+ #3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
+ 4041 in ../../../../lib/isc/unix/socket.c
+ (gdb) p sock->pending_send
+ $2 = 1
+
+ [TEST CASE]
+
+ 1) Install isc-dhcp-server in 2 focal machine(s).
+ 2) Configure peer/cluster mode as follows:
+ Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/
+ Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/
+ 2) Run dhcpd as follows in both machine(s)
+
+ # dhcpd -f -d -4 -cf /etc/dhcp/dhcpd.conf --no-pid ens4
+
+ 3) Leave the cluster running for a long (2h) period until the crash/race
+ condition is reproduced.
+ [REGRESSION POTENTIAL]
- Syslog shows
- Apr 10 17:20:15 dhcp-primary sh[6828]: ../../../../lib/isc/unix/socket.c:3361: INSIST(!sock->pending_send) failed, back trace
- Apr 10 17:20:15 dhcp-primary sh[6828]: #0 0x7fbe78702a4a in ??
- Apr 10 17:20:15 dhcp-primary sh[6828]: #1 0x7fbe78702980 in ??
- Apr 10 17:20:15 dhcp-primary sh[6828]: #2 0x7fbe7873e7e1 in ??
- Apr 10 17:20:15 dhcp-primary sh[6828]: #3 0x7fbe784e5609 in ??
- Apr 10 17:20:15 dhcp-primary sh[6828]: #4 0x7fbe78621103 in ??
-
-
- nothing in kern.log
-
-
- apport.log shows
- ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: called for pid 6828, signal 6, core limit 0, dump mode 2
- ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: not creating core for pid with dump mode of 2
- ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: executable: /usr/sbin/dhcpd (command line "dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf")
- ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: is_closing_session(): no DBUS_SESSION_BUS_ADDRESS in environment
- ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: wrote report /var/crash/_usr_sbin_dhcpd.0.crash
-
-
- /var/crash/_usr_sbin_dhcpd.0.crash shows
-
- ProblemType: Crash
- Architecture: amd64
- CrashCounter: 1
- Date: Fri Apr 10 17:20:15 2020
- DistroRelease: Ubuntu 20.04
- ExecutablePath: /usr/sbin/dhcpd
- ExecutableTimestamp: 1586210315
- ProcCmdline: dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf
- ProcEnviron: Error: [Errno 13] Permission denied: 'environ'
- ProcMaps: Error: [Errno 13] Permission denied: 'maps'
- ProcStatus:
- Name: dhcpd
- Umask: 0022
- State: D (disk sleep)
- Tgid: 6828
- Ngid: 0
- Pid: 6828
- PPid: 1
- TracerPid: 0
- Uid: 113 113 113 113
- Gid: 118 118 118 118
- FDSize: 128
- Groups:
- NStgid: 6828
- NSpid: 6828
- NSpgid: 6828
- NSsid: 6828
- VmPeak: 236244 kB
- VmSize: 170764 kB
- VmLck: 0 kB
- VmPin: 0 kB
- VmHWM: 12064 kB
- VmRSS: 12064 kB
- RssAnon: 5940 kB
- RssFile: 6124 kB
- RssShmem: 0 kB
- VmData: 30792 kB
- VmStk: 132 kB
- VmExe: 592 kB
- VmLib: 5424 kB
- VmPTE: 76 kB
- VmSwap: 0 kB
- HugetlbPages: 0 kB
- CoreDumping: 1
- THP_enabled: 1
- Threads: 4
- SigQ: 0/7609
- SigPnd: 0000000000000000
- ShdPnd: 0000000000000000
- SigBlk: 0000000000000000
- SigIgn: 0000000000001000
- SigCgt: 0000000180000000
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000003fffffffff
- CapAmb: 0000000000000000
- NoNewPrivs: 0
- Seccomp: 0
- Speculation_Store_Bypass: thread vulnerable
- Cpus_allowed: 3
- Cpus_allowed_list: 0-1
- Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000
- 0000,00000000,00000000,00000001
- Mems_allowed_list: 0
- voluntary_ctxt_switches: 111
- nonvoluntary_ctxt_switches: 144
- Signal: 6
- Uname: Linux 5.4.0-21-generic x86_64
- UserGroups:
+ * The fix will prevent the assertion to happen in the dispatch_send
+ path, later versions of isch-dhcp upstream lack this logic and entirely removed the existence of this flag. Therefore, removing the need for this
+ assertion at process_fd shouldn't be problematic.
** Description changed:
[Description]
- isc-dhcp-server uses libisc-export (coming from bin9-libs package) for handling the socket event(s) when configured in peer mode (master/secondary). It's possible that a sequence of messages dispatched by the master that requires acknowledgment from its peers holds a socket
+ isc-dhcp-server uses libisc-export (coming from bind9-libs package) for handling the socket event(s) when configured in peer mode (master/secondary). It's possible that a sequence of messages dispatched by the master that requires acknowledgment from its peers holds a socket
in a pending to send state, a timer or a subsequent write request can be scheduled into this socket and the !sock->pending_send assertion
will be raised when trying to write again at the time data hasn't been flushed entirely and the pending_send flag hasn't been reset to 0 state.
If this race condition happens, the following stacktrace will be
hit:
(gdb) info threads
- Id Target Id Frame
+ Id Target Id Frame
* 1 Thread 0x7fb4ddecb700 (LWP 3170) __GI_raise (sig=sig at entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
- 2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex at entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52
- 3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364
- 4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183
+ 2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex at entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52
+ 3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364
+ 4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183
(gdb) frame 2
#2 0x00007fb4dec85985 in isc_assertion_failed (file=file at entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line at entry=3361, type=type at entry=isc_assertiontype_insist,
- cond=cond at entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
+ cond=cond at entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
(gdb) bt
#1 0x00007fb4deaa7859 in __GI_abort () at abort.c:79
#2 0x00007fb4dec85985 in isc_assertion_failed (file=file at entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line at entry=3361, type=type at entry=isc_assertiontype_insist,
- cond=cond at entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
+ cond=cond at entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
#3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
#4 process_fd (writeable=<optimized out>, readable=<optimized out>, fd=11, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4054
#5 process_fds (writefds=<optimized out>, readfds=0x7fb4de6d1090, maxfd=13, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4211
#6 watcher (uap=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4397
#7 0x00007fb4dea68609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fb4deba4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) frame 3
#3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
4041 in ../../../../lib/isc/unix/socket.c
(gdb) p sock->pending_send
$2 = 1
[TEST CASE]
1) Install isc-dhcp-server in 2 focal machine(s).
2) Configure peer/cluster mode as follows:
- Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/
- Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/
+ Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/
+ Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/
2) Run dhcpd as follows in both machine(s)
# dhcpd -f -d -4 -cf /etc/dhcp/dhcpd.conf --no-pid ens4
3) Leave the cluster running for a long (2h) period until the crash/race
condition is reproduced.
-
[REGRESSION POTENTIAL]
* The fix will prevent the assertion to happen in the dispatch_send
path, later versions of isch-dhcp upstream lack this logic and entirely removed the existence of this flag. Therefore, removing the need for this
assertion at process_fd shouldn't be problematic.
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1872118
Title:
[SRU] DHCP Cluster crashes after a few hours
Status in DHCP:
New
Status in bind9-libs package in Ubuntu:
In Progress
Status in isc-dhcp package in Ubuntu:
In Progress
Status in bind9-libs source package in Focal:
In Progress
Status in isc-dhcp source package in Focal:
In Progress
Status in bind9-libs source package in Groovy:
In Progress
Status in isc-dhcp source package in Groovy:
In Progress
Bug description:
[Description]
isc-dhcp-server uses libisc-export (coming from bind9-libs package) for handling the socket event(s) when configured in peer mode (master/secondary). It's possible that a sequence of messages dispatched by the master that requires acknowledgment from its peers holds a socket
in a pending to send state, a timer or a subsequent write request can be scheduled into this socket and the !sock->pending_send assertion
will be raised when trying to write again at the time data hasn't been flushed entirely and the pending_send flag hasn't been reset to 0 state.
If this race condition happens, the following stacktrace will be
hit:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7fb4ddecb700 (LWP 3170) __GI_raise (sig=sig at entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex at entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52
3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364
4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183
(gdb) frame 2
#2 0x00007fb4dec85985 in isc_assertion_failed (file=file at entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line at entry=3361, type=type at entry=isc_assertiontype_insist,
cond=cond at entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
(gdb) bt
#1 0x00007fb4deaa7859 in __GI_abort () at abort.c:79
#2 0x00007fb4dec85985 in isc_assertion_failed (file=file at entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line at entry=3361, type=type at entry=isc_assertiontype_insist,
cond=cond at entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52
#3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
#4 process_fd (writeable=<optimized out>, readable=<optimized out>, fd=11, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4054
#5 process_fds (writefds=<optimized out>, readfds=0x7fb4de6d1090, maxfd=13, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4211
#6 watcher (uap=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4397
#7 0x00007fb4dea68609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fb4deba4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) frame 3
#3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041
4041 in ../../../../lib/isc/unix/socket.c
(gdb) p sock->pending_send
$2 = 1
[TEST CASE]
1) Install isc-dhcp-server in 2 focal machine(s).
2) Configure peer/cluster mode as follows:
Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/
Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/
2) Run dhcpd as follows in both machine(s)
# dhcpd -f -d -4 -cf /etc/dhcp/dhcpd.conf --no-pid ens4
3) Leave the cluster running for a long (2h) period until the
crash/race condition is reproduced.
[REGRESSION POTENTIAL]
* The fix will prevent the assertion to happen in the dispatch_send
path, later versions of isch-dhcp upstream lack this logic and entirely removed the existence of this flag. Therefore, removing the need for this
assertion at process_fd shouldn't be problematic.
To manage notifications about this bug go to:
https://bugs.launchpad.net/dhcp/+bug/1872118/+subscriptions
More information about the foundations-bugs
mailing list