[Bug 2039148] [NEW] lxd-installer can race or temp-fail and then block itself

Christian Ehrhardt  2039148 at bugs.launchpad.net
Thu Oct 12 08:18:13 UTC 2023


Public bug reported:

Hey,
while checking another issue I realized that the pre-installed LXD does not always work.
It is fully pre-installed on cloud images and server installs to give users quick access to a great feature.

Minimal images do not install it (fine, they are meant to be minimal), yet it is not fully gone either, and what is left can fail without a clear indication to an unsuspecting user.
What is left there is `lxd-installer`.

Normal image:

```
$ snap list | grep lxd
lxd     5.0.2-838e1b2  24322  5.0/stable/…   canonical**  -
$ which lxc
/snap/bin/lxc
$ lxc list
If this is your first time running LXD on this machine, you should also run: lxd init
To start your first container, try: lxc launch ubuntu:22.04
Or for a virtual machine: lxc launch ubuntu:22.04 --vm

+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+
```


A minimal image, on the other hand, has this:

```
$ which lxc
/usr/sbin/lxc

$ dpkg -S /usr/sbin/lxc
dpkg-query: no path found matching pattern /usr/sbin/lxc

$ cat /usr/sbin/lxc
#!/bin/sh
SNAP_BIN="/snap/bin/$(basename $0)"
if [ ! -f ${SNAP_BIN} ]; then
    python3 -c 'import socket; s=socket.socket(socket.AF_UNIX); s.connect("/run/lxd-installer.socket"); s.send(b"x"); s.recv(1)'
fi
exec $SNAP_BIN "$@"

$ snap list
No snaps are installed yet. Try 'snap install hello-world'.
```

AFAICS the wrapper relies on lxd-installer, which is a package, so I am
filing this against that package and against images in general.

The wrapper holds the connection open until the installer has brought in
lxd, and then hands over to the real binary.
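
For readability, here is the Python one-liner from the wrapper spelled out as a function. It is only a sketch of what the one-liner appears to do; the function name `request_install` and the optional `timeout` parameter are assumptions added for illustration and are not part of lxd-installer.

```python
import socket


def request_install(socket_path="/run/lxd-installer.socket", timeout=None):
    """Connect to the installer socket, send one byte, wait for one byte back.

    Returns the byte received, or b"" if the peer closed without replying.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)    # None keeps the wrapper's blocking behaviour
        s.connect(socket_path)   # raises if the socket is missing or not listening
        s.send(b"x")             # the wrapper sends a single byte
        return s.recv(1)         # blocks until a byte arrives or the peer hangs up
```

Note that the wrapper runs `exec $SNAP_BIN` unconditionally afterwards, so any exception here is followed by the "not found" error seen in the failing runs.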

# cat /lib/systemd/system/lxd-installer\@.service 
[Unit]
Description=Helper to install lxd snap on demand

[Service]
ExecStart=/bin/sh -eux /usr/share/lxd-installer/lxd-installer-service
StandardInput=socket
StandardOutput=socket
StandardError=journal
Restart=no


The socket is up after start, as one would expect,
and usually this works fine:

$ lxc launch ubuntu-minimal-daily:j j-test --ephemeral --vm
Creating j-test
Starting j-test
$ lxc exec j-test bash
root@j-test:~# systemctl status lxd-installer.socket
● lxd-installer.socket - Helper to install lxd snap on demand
     Loaded: loaded (/lib/systemd/system/lxd-installer.socket; enabled; vendor preset: enabled)
     Active: active (listening) since Thu 2023-10-12 08:00:29 UTC; 4s ago
     Listen: /run/lxd-installer.socket (Stream)
   Accepted: 0; Connected: 0;
      Tasks: 0 (limit: 1171)
     Memory: 0B
        CPU: 374us
     CGroup: /system.slice/lxd-installer.socket

Oct 12 08:00:29 j-test systemd[1]: Starting Helper to install lxd snap on demand...
Oct 12 08:00:29 j-test systemd[1]: Listening on Helper to install lxd snap on demand.
root@j-test:~# lxc list
If this is your first time running LXD on this machine, you should also run: lxd init
To start your first container, try: lxc launch ubuntu:22.04
Or for a virtual machine: lxc launch ubuntu:22.04 --vm

+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+


But if the install has ever failed, it looks like this:

root@m:~# systemctl status lxd-installer.socket
● lxd-installer.socket - Helper to install lxd snap on demand
     Loaded: loaded (/lib/systemd/system/lxd-installer.socket; enabled; preset: enabled)
     Active: active (listening) since Thu 2023-09-28 09:47:48 UTC; 1 week 6 days ago
   Triggers: ● lxd-installer@3-13717-0.service
             ● lxd-installer@1-13484-0.service
             ● lxd-installer@2-13655-0.service
             ● lxd-installer@0-13372-0.service
     Listen: /run/lxd-installer.socket (Stream)
   Accepted: 4; Connected: 0;
      Tasks: 0 (limit: 38254)
     Memory: 0B
        CPU: 580us
     CGroup: /system.slice/lxd-installer.socket

Sep 28 09:47:48 m systemd[1]: Starting lxd-installer.socket - Helper to install lxd snap on demand...
Sep 28 09:47:48 m systemd[1]: Listening on lxd-installer.socket - Helper to install lxd snap on demand.
root@m:~# lxc list
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ConnectionResetError: [Errno 104] Connection reset by peer
/usr/sbin/lxc: 6: exec: /snap/bin/lxc: not found


My initial case hit this through a failure whose cause was not obvious at first:

Oct 12 07:31:35 m systemd[1]: lxd-installer@0-13372-0.service: Failed with result 'exit-code'.
Oct 12 07:32:55 m systemd[1]: Started lxd-installer@1-13484-0.service - Helper to install lxd snap on demand (PID 13484/UID 0).
Oct 12 07:32:55 m systemd[1]: lxd-installer@1-13484-0.service: Main process exited, code=exited, status=1/FAILURE

That turned out to be due to:
snap install lxd
error: system does not fully support snapd: The "fuse" filesystem is required on this system but
       not available. Please try to install the fuse package.

You do not have to re-create that failure, though.
The problem is easy to reproduce with the impatience simulator (Ctrl+C):

$ lxc launch ubuntu-minimal-daily:j j-test --ephemeral --vm
Creating j-test
Starting j-test
$ lxc exec j-test bash

# abort this first time to simulate any reason it might fail
root@j-test:~# lxc list
^CTraceback (most recent call last):
  File "<string>", line 1, in <module>
KeyboardInterrupt

# Now see it never coming back to life
root@j-test:~# lxc list
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ConnectionResetError: [Errno 104] Connection reset by peer
/usr/sbin/lxc: 6: exec: /snap/bin/lxc: not found


This happens because the service counts as started and the interrupted run keeps going in the background:
$ ps axlf | grep installer
0     0     753     240  20   0   4020  2092 ?      S+   pts/0      0:00      \_ grep --color=auto installer
4     0     556       1  20   0   2888   948 ?      Ss   ?          0:00 /bin/sh -eux /usr/share/lxd-installer/lxd-installer-service

If you wait long enough, it recovers (presumably once the background install has finished).
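
The "wait long enough" behaviour can be made explicit by polling for the snap binary with a deadline. This is a minimal sketch, not something lxd-installer ships; the function name `wait_for_file` and the default timeout and interval values are assumptions chosen for illustration.

```python
import os
import time


def wait_for_file(path="/snap/bin/lxc", timeout=300.0, interval=1.0):
    """Poll until `path` exists; return True if it appeared before the deadline."""
    deadline = time.monotonic() + timeout
    while True:
        # Same check as the wrapper's `[ -f ... ]`: a regular file, symlinks followed.
        if os.path.isfile(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A wrapper could call this after an interrupted or reset connection instead of going straight to `exec`.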


There are a few scenarios I can think of:
1. A boot race: the socket is not yet up, which triggers the same issue.
2. A transient issue occurred and lxd-installer-service is still running in the background.
3. There is a permanent problem, i.e. lxd-installer will always fail.


In cases #1 and #2 the wrapper should detect this and wait for the job that is about to start or already running.
But it needs to be able to tell these apart from #3, in which case it has to give up at some point.

Maybe something simple like retry + timeout logic would provide all of
that?
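
To make the retry + timeout idea concrete, here is one possible shape for it. It is a sketch under assumptions, not a proposed patch: the name `ensure_installed`, the set of errors treated as transient, and the default timeout and interval are all made up for illustration. It retries while the socket is missing, not listening, or resets the connection (scenarios #1 and #2) and gives up once the deadline passes (scenario #3).

```python
import os
import socket
import time

# Errors treated as "try again later" (an assumption): socket path missing,
# nobody listening, peer hung up, or the wait on the socket timed out.
TRANSIENT = (FileNotFoundError, ConnectionRefusedError, ConnectionResetError,
             BrokenPipeError, socket.timeout)


def ensure_installed(snap_bin="/snap/bin/lxc",
                     socket_path="/run/lxd-installer.socket",
                     timeout=300.0, interval=2.0):
    """Return True once snap_bin exists, False if the deadline passes first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(snap_bin):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False              # scenario #3: stop waiting at some point
        try:
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
                s.settimeout(remaining)
                s.connect(socket_path)
                s.send(b"x")
                s.recv(1)             # wait for the installer to answer or hang up
        except TRANSIENT:
            pass                      # scenarios #1 and #2: retry after a pause
        if os.path.isfile(snap_bin):
            return True
        # Pause before the next attempt, but never sleep past the deadline.
        time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
```

On a `False` result the wrapper could print a clear message pointing at the journal of the lxd-installer units instead of the bare traceback and "not found" error.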

** Affects: lxd-installer (Ubuntu)
     Importance: Undecided
         Status: New

** Also affects: lxd-installer (Ubuntu)
   Importance: Undecided
       Status: New

** No longer affects: cloud-images

** Summary changed:

- lxd-installer is not idempotent
+ lxd-installer can race or temp-fail and then block itself

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to lxd-installer in Ubuntu.
https://bugs.launchpad.net/bugs/2039148

Title:
  lxd-installer can race or temp-fail and then block itself

Status in lxd-installer package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lxd-installer/+bug/2039148/+subscriptions



