[Bug 1815101] Re: [master] Restarting systemd-networkd breaks keepalived clusters
Rafael David Tinoco
rafaeldtinoco at kernelpath.com
Wed Sep 25 13:23:50 UTC 2019
Alright,
As this is a problem that affects not only keepalived but any cluster-like
software dealing with aliases on an existing interface, managed or not by
systemd, I have run the same test case on a pacemaker-based cluster with 3
nodes, having 1 virtual IP + a lighttpd instance running in the same
resource group:
----
(k)inaddy@kcluster01:~$ crm config show
node 1: kcluster01
node 2: kcluster02
node 3: kcluster03
primitive fence_kcluster01 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster01 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=2 \
        op monitor interval=60s
primitive fence_kcluster02 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster02 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=4 \
        op monitor interval=60s
primitive fence_kcluster03 stonith:fence_virsh \
        params ipaddr=192.168.100.205 plug=kcluster03 action=off login=stonithmgr passwd=xxxx use_sudo=true delay=6 \
        op monitor interval=60s
primitive virtual_ip IPaddr2 \
        params ip=10.0.3.1 nic=eth3 \
        op monitor interval=10s
primitive webserver systemd:lighttpd \
        op monitor interval=10 timeout=60
group webserver_virtual_ip webserver virtual_ip
location l_fence_kcluster01 fence_kcluster01 -inf: kcluster01
location l_fence_kcluster02 fence_kcluster02 -inf: kcluster02
location l_fence_kcluster03 fence_kcluster03 -inf: kcluster03
property cib-bootstrap-options: \
        have-watchdog=true \
        dc-version=2.0.1-9e909a5bdd \
        cluster-infrastructure=corosync \
        cluster-name=debian \
        stonith-enabled=true \
        stonith-action=off \
        no-quorum-policy=stop
----
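For anyone wanting to reproduce this, the webserver + virtual IP resource
group above can be built with crmsh commands roughly equivalent to the
following (a sketch derived from the configuration shown, not the exact
commands I typed):
----
sudo crm configure primitive virtual_ip ocf:heartbeat:IPaddr2 \
    params ip=10.0.3.1 nic=eth3 op monitor interval=10s
sudo crm configure primitive webserver systemd:lighttpd \
    op monitor interval=10 timeout=60
sudo crm configure group webserver_virtual_ip webserver virtual_ip
----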
(k)inaddy@kcluster01:~$ cat /etc/netplan/cluster.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.1.2/24]
    eth2:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.2.2/24]
    eth3:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.3.2/24]
    eth4:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.4.2/24]
    eth5:
      dhcp4: no
      dhcp6: no
      addresses: [10.0.5.2/24]
----
And the virtual IP failed right after netplan acted on the
systemd-networkd-managed interface:
(k)inaddy@kcluster03:~$ sudo netplan apply
(k)inaddy@kcluster03:~$ ping 10.0.3.1
PING 10.0.3.1 (10.0.3.1) 56(84) bytes of data.
From 10.0.3.4 icmp_seq=1 Destination Host Unreachable
From 10.0.3.4 icmp_seq=2 Destination Host Unreachable
From 10.0.3.4 icmp_seq=3 Destination Host Unreachable
From 10.0.3.4 icmp_seq=4 Destination Host Unreachable
From 10.0.3.4 icmp_seq=5 Destination Host Unreachable
From 10.0.3.4 icmp_seq=6 Destination Host Unreachable
64 bytes from 10.0.3.1: icmp_seq=7 ttl=64 time=0.088 ms
64 bytes from 10.0.3.1: icmp_seq=8 ttl=64 time=0.076 ms
--- 10.0.3.1 ping statistics ---
8 packets transmitted, 2 received, +6 errors, 75% packet loss, time 7128ms
rtt min/avg/max/mdev = 0.076/0.082/0.088/0.006 ms, pipe 4
Exactly like explained in this bug description. With that, pacemaker's
virtual_ip monitor realized the virtual IP was gone and restarted it on
the same node, as the two "crm status" outputs below show.
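For context, the IPaddr2 monitor operation essentially checks that the
address is still assigned to the interface and reports "not running"
(rc=7) when it is not, which is what triggers the restart. A rough manual
equivalent of that check, using my test addresses (a sketch, not the
agent's actual code):
----
ip -4 addr show dev eth3 | grep -q 'inet 10\.0\.3\.1/' \
    && echo "virtual IP present" \
    || echo "virtual IP missing: monitor would report rc=7 (not running)"
----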
----
(k)inaddy@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Sep 25 13:11:05 2019
Last change: Wed Sep 25 12:49:56 2019 by root via cibadmin on kcluster01
3 nodes configured
5 resources configured
Online: [ kcluster01 kcluster02 kcluster03 ]
Full list of resources:
fence_kcluster01 (stonith:fence_virsh): Started kcluster02
fence_kcluster02 (stonith:fence_virsh): Started kcluster01
fence_kcluster03 (stonith:fence_virsh): Started kcluster01
Resource Group: webserver_virtual_ip
webserver (systemd:lighttpd): Started kcluster03
virtual_ip (ocf::heartbeat:IPaddr2): FAILED kcluster03
Failed Resource Actions:
* virtual_ip_monitor_10000 on kcluster03 'not running' (7): call=100, status=complete, exitreason='',
last-rc-change='Wed Sep 25 13:11:05 2019', queued=0ms, exec=0ms
----
(k)inaddy@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Sep 25 13:11:07 2019
Last change: Wed Sep 25 12:49:56 2019 by root via cibadmin on kcluster01
3 nodes configured
5 resources configured
Online: [ kcluster01 kcluster02 kcluster03 ]
Full list of resources:
fence_kcluster01 (stonith:fence_virsh): Started kcluster02
fence_kcluster02 (stonith:fence_virsh): Started kcluster01
fence_kcluster03 (stonith:fence_virsh): Started kcluster01
Resource Group: webserver_virtual_ip
webserver (systemd:lighttpd): Started kcluster03
virtual_ip (ocf::heartbeat:IPaddr2): Started kcluster03
Failed Resource Actions:
* virtual_ip_monitor_10000 on kcluster03 'not running' (7): call=100, status=complete, exitreason='',
last-rc-change='Wed Sep 25 13:11:05 2019', queued=0ms, exec=0ms
----
And, if I want, I can query how many times that particular resource (the
virtual_ip monitored by pacemaker) has failed on that node, to check
whether the cluster was about to migrate it to another node on the
assumption that this is a real failure (and is it?):
(k)inaddy@kcluster01:~$ sudo crm_failcount --query -r virtual_ip -N kcluster03
scope=status name=fail-count-virtual_ip value=5
So this resource has already failed 5 times on that node, and a "netplan
apply" could have caused the resource to migrate, for example.
----
For pacemaker, the issue is not *that big* if the cluster is configured
correctly - with a resource monitor - as the cluster will always try to
restart the virtual IP associated with the managed resource - lighttpd in
my case. Nevertheless, resource migrations and possible downtime could
happen in the event of multiple resource monitor failures.
I'll now check why keepalived can't simply re-establish the virtual IPs
after such a failure, like pacemaker does, and whether systemd-networkd
should be changed to leave aliases alone when a specific flag is set, or
whether things are good the way they are.
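For reference, recent systemd-networkd (v243, if I'm not mistaken) already
has a setting in that direction, KeepConfiguration=, which could be dropped
into the netplan-generated .network unit; whether it also preserves
addresses added externally by keepalived/IPaddr2 is exactly what needs
checking. A minimal sketch (the drop-in path and generated file name are
assumptions based on how netplan writes its output under
/run/systemd/network/):
----
# /etc/systemd/network/10-netplan-eth3.network.d/override.conf  (assumed path)
[Network]
# ask networkd to keep existing addresses/routes when it reconfigures the link
KeepConfiguration=static
----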
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to systemd in Ubuntu.
https://bugs.launchpad.net/bugs/1815101
Title:
[master] Restarting systemd-networkd breaks keepalived clusters
Status in netplan:
Confirmed
Status in heartbeat package in Ubuntu:
Triaged
Status in keepalived package in Ubuntu:
In Progress
Status in systemd package in Ubuntu:
In Progress
Status in heartbeat source package in Bionic:
Triaged
Status in keepalived source package in Bionic:
Confirmed
Status in systemd source package in Bionic:
Confirmed
Status in heartbeat source package in Disco:
Triaged
Status in keepalived source package in Disco:
Confirmed
Status in systemd source package in Disco:
Confirmed
Status in heartbeat source package in Eoan:
Triaged
Status in keepalived source package in Eoan:
In Progress
Status in systemd source package in Eoan:
In Progress
Bug description:
Configure netplan for interfaces, for example (a working config with
IP addresses obfuscated)
network:
  ethernets:
    eth0:
      addresses: [192.168.0.5/24]
      dhcp4: false
      nameservers:
        search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
        addresses: [10.22.11.1]
    eth2:
      addresses:
        - 12.13.14.18/29
        - 12.13.14.19/29
      gateway4: 12.13.14.17
      dhcp4: false
      nameservers:
        search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
        addresses: [10.22.11.1]
    eth3:
      addresses: [10.22.11.6/24]
      dhcp4: false
      nameservers:
        search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
        addresses: [10.22.11.1]
    eth4:
      addresses: [10.22.14.6/24]
      dhcp4: false
      nameservers:
        search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
        addresses: [10.22.11.1]
    eth7:
      addresses: [9.5.17.34/29]
      dhcp4: false
      optional: true
      nameservers:
        search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
        addresses: [10.22.11.1]
  version: 2
Configure keepalived (again, a working config with IP addresses
obfuscated)
global_defs  # Block id
{
    notification_email {
        sysadmins@blah.com
    }
    notification_email_from keepalived@system3.hq.blah.com
    smtp_server 10.22.11.7           # IP
    smtp_connect_timeout 30          # integer, seconds
    router_id system3                # string identifying the machine,
                                     # (doesn't have to be hostname).
    vrrp_mcast_group4 224.0.0.18     # optional, default 224.0.0.18
    vrrp_mcast_group6 ff02::12       # optional, default ff02::12
    enable_traps                     # enable SNMP traps
}

vrrp_sync_group collection {
    group {
        wan
        lan
        phone
    }
}

vrrp_instance wan {
    state MASTER
    interface eth2
    virtual_router_id 77
    priority 150
    advert_int 1
    smtp_alert
    authentication {
        auth_type PASS
        auth_pass BlahBlah
    }
    virtual_ipaddress {
        12.13.14.20
    }
}

vrrp_instance lan {
    state MASTER
    interface eth3
    virtual_router_id 78
    priority 150
    advert_int 1
    smtp_alert
    authentication {
        auth_type PASS
        auth_pass MoreBlah
    }
    virtual_ipaddress {
        10.22.11.13/24
    }
}

vrrp_instance phone {
    state MASTER
    interface eth4
    virtual_router_id 79
    priority 150
    advert_int 1
    smtp_alert
    authentication {
        auth_type PASS
        auth_pass MostBlah
    }
    virtual_ipaddress {
        10.22.14.3/24
    }
}
At boot the affected interfaces have:
5: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ab:cd:ef:90:c0:e3 brd ff:ff:ff:ff:ff:ff
inet 10.22.14.6/24 brd 10.22.14.255 scope global eth4
valid_lft forever preferred_lft forever
inet 10.22.14.3/24 scope global secondary eth4
valid_lft forever preferred_lft forever
inet6 fe80::ae1f:6bff:fe90:c0e3/64 scope link
valid_lft forever preferred_lft forever
7: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ab:cd:ef:b0:26:29 brd ff:ff:ff:ff:ff:ff
inet 10.22.11.6/24 brd 10.22.11.255 scope global eth3
valid_lft forever preferred_lft forever
inet 10.22.11.13/24 scope global secondary eth3
valid_lft forever preferred_lft forever
inet6 fe80::ae1f:6bff:feb0:2629/64 scope link
valid_lft forever preferred_lft forever
9: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ab:cd:ef:b0:26:2b brd ff:ff:ff:ff:ff:ff
inet 12.13.14.18/29 brd 12.13.14.23 scope global eth2
valid_lft forever preferred_lft forever
inet 12.13.14.20/32 scope global eth2
valid_lft forever preferred_lft forever
inet 12.33.89.19/29 brd 12.13.14.23 scope global secondary eth2
valid_lft forever preferred_lft forever
inet6 fe80::ae1f:6bff:feb0:262b/64 scope link
valid_lft forever preferred_lft forever
Run 'netplan try' (without even making any changes to the configuration) and the keepalived addresses disappear, never to return; the affected interfaces then have:
5: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ab:cd:ef:90:c0:e3 brd ff:ff:ff:ff:ff:ff
inet 10.22.14.6/24 brd 10.22.14.255 scope global eth4
valid_lft forever preferred_lft forever
inet6 fe80::ae1f:6bff:fe90:c0e3/64 scope link
valid_lft forever preferred_lft forever
7: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ab:cd:ef:b0:26:29 brd ff:ff:ff:ff:ff:ff
inet 10.22.11.6/24 brd 10.22.11.255 scope global eth3
valid_lft forever preferred_lft forever
inet6 fe80::ae1f:6bff:feb0:2629/64 scope link
valid_lft forever preferred_lft forever
9: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ab:cd:ef:b0:26:2b brd ff:ff:ff:ff:ff:ff
inet 12.13.14.18/29 brd 12.13.14.23 scope global eth2
valid_lft forever preferred_lft forever
inet 12.33.89.19/29 brd 12.13.14.23 scope global secondary eth2
valid_lft forever preferred_lft forever
inet6 fe80::ae1f:6bff:feb0:262b/64 scope link
valid_lft forever preferred_lft forever
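(A manual interim recovery, until keepalived is restarted or fails over,
is simply to re-add the lost addresses; the commands below assume the
obfuscated addresses above:)
----
sudo ip addr add 12.13.14.20/32 dev eth2
sudo ip addr add 10.22.11.13/24 dev eth3
sudo ip addr add 10.22.14.3/24 dev eth4
# or restart keepalived, which should re-install the addresses when the
# instances come back up as MASTER:
sudo systemctl restart keepalived
----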
To manage notifications about this bug go to:
https://bugs.launchpad.net/netplan/+bug/1815101/+subscriptions