Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various software refuses to bind to it's listenaddress #71106

Closed
wucke13 opened this issue Oct 14, 2019 · 19 comments
Closed

Various software refuses to bind to it's listenaddress #71106

wucke13 opened this issue Oct 14, 2019 · 19 comments

Comments

@wucke13
Copy link

wucke13 commented Oct 14, 2019

This is a weird bug. When a service is changed during nixos-rebuild and therefore restarted, it refuses to bind to it's given port. For Example:

starting the following units: accounts-daemon.service, cups.socket, mpd.service, nix-daemon.socket, systemd-modules-load.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, tlp.service, udisks2.service, zerotierone.service
Job failed. See "journalctl -xe" for details.
warning: the following units failed: cups.socket

● cups.socket - CUPS Scheduler
   Loaded: loaded (/nix/store/nkihl7vbzbqz990hq15q42h69c06pijg-cups-2.2.12/lib/systemd/system/cups.socket; enabled; vendor preset: enabled)
  Drop-In: /nix/store/za3k732cia0g9y21xc5sdfflvgfsxd9j-system-units/cups.socket.d
           └─overrides.conf
   Active: failed (Result: resources) since Mon 2019-10-14 10:17:56 CEST; 3s ago
   Listen: /run/cups/cups.sock (Stream)
           /run/cups/cups.sock (Stream)
           127.0.0.1:631 (Stream)
       IP: 2.1K in, 2.2K out
      CPU: 645us

Oct 14 10:17:58 zornpad systemd[1]: /nix/store/nkihl7vbzbqz990hq15q42h69c06pijg-cups-2.2.12/lib/systemd/system/cups.socket:6: ListenStream= references a path below legacy directory /var/run/, updating /var/run/cups/cups.sock → /run/cups/cups.sock; please update the unit file accordingly....
Oct 14 10:17:58 zornpad systemd[1]: cups.socket: Failed to receive listening socket (127.0.0.1:631): Input/output error
Oct 14 10:17:58 zornpad systemd[1]: cups.socket: Failed to listen on sockets: Input/output error
Oct 14 10:17:58 zornpad systemd[1]: cups.socket: Failed with result 'resources'.
Oct 14 10:17:58 zornpad systemd[1]: Failed to listen on CUPS Scheduler.
Oct 14 10:17:58 zornpad systemd[2307]: cups.socket: Failed to create listening socket (127.0.0.1:631): Cannot assign requested address
Oct 14 10:17:58 zornpad systemd[1]: cups.socket: Failed to receive listening socket (127.0.0.1:631): Input/output error
Oct 14 10:17:58 zornpad systemd[1]: cups.socket: Failed to listen on sockets: Input/output error
Oct 14 10:17:58 zornpad systemd[1]: cups.socket: Failed with result 'resources'.
Oct 14 10:17:58 zornpad systemd[1]: Failed to listen on CUPS Scheduler.
Hint: Some lines were ellipsized, use -l to show in full.
warning: error(s) occurred while switching to the new configuration
sudo nixos-rebuild test  18.28s user 4.83s system 7% cpu 5:23.89 total

Though nothing is listening on that port:

$ sudo ss -tlpn
State    Recv-Q   Send-Q            Local Address:Port        Peer Address:Port
LISTEN   0        128               127.0.0.53%lo:53               0.0.0.0:*       users:(("systemd-resolve",pid=1003,fd=17))
LISTEN   0        128                     0.0.0.0:22               0.0.0.0:*       users:(("sshd",pid=1028,fd=3))
LISTEN   0        128         139.174.76.3%wlp3s0:21915            0.0.0.0:*       users:(("zerotier-one",pid=2313,fd=17))
LISTEN   0        128         139.174.76.3%wlp3s0:21916            0.0.0.0:*       users:(("zerotier-one",pid=2313,fd=19))
LISTEN   0        5                       0.0.0.0:6600             0.0.0.0:*       users:(("mpd",pid=2301,fd=7))
LISTEN   0        128         139.174.76.3%wlp3s0:9993             0.0.0.0:*       users:(("zerotier-one",pid=2313,fd=15))
LISTEN   0        5                       0.0.0.0:4713             0.0.0.0:*       users:(("pulseaudio",pid=1418,fd=32))
LISTEN   0        128                     0.0.0.0:5355             0.0.0.0:*       users:(("systemd-resolve",pid=1003,fd=12))
LISTEN   0        128                        [::]:22                  [::]:*       users:(("sshd",pid=1028,fd=4))
LISTEN   0        5                          [::]:4713                [::]:*       users:(("pulseaudio",pid=1418,fd=10))
LISTEN   0        128                        [::]:5355                [::]:*       users:(("systemd-resolve",pid=1003,fd=14))

I have this problem with all kind of services, for example mpd. Even a local build of zola (a static site generator with a preview function) doesn't start at this point:

$ zola serve
No port available.

Changing the port in question doesn't change anything.
The symptomatics is always the same: according to ss the port is not in use yet the software claims that it cannot bind to the port. I'm desperate for input on how to debug this.

Edit 1:
After a fresh restart, things (zola serve) work! Then, after nixos-rebuild test, things (zola serve) still work. So, I still don't know how to reproduce this reliable.

Edit 2:
After a systemctl suspend, things (zola serve) won't work.

Expected behavior
Service starts listening on it's assigned and free port

Additional context
I've encountered this on two of my machines, one with linux 4 and one with linux 5

Metadata

  • system: "x86_64-linux"
  • host os: Linux 5.3.5, NixOS, 19.09.789.7952807791d (Loris)
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.3
  • channels(root): "nixos-19.09.789.7952807791d, nixos-unstable-20.03pre197080.dd050270c78"
  • channels(wucke13): "nixos-unstable-20.03pre196480.8b46dcb3db5"
  • nixpkgs: /nix/var/nix/profiles/per-user/root/channels/nixos
@arianvp
Copy link
Member

arianvp commented Oct 14, 2019

are you running any large server workloads on this box? First thing I was thining of is that perhaps your file descriptor pool is exhausted?

@arianvp
Copy link
Member

arianvp commented Oct 14, 2019

Could you show the output of $ strace zola serve (Or any other cli tool that tries to bind to a port?)
I want to know what syscall is failing and with what error code

@wucke13
Copy link
Author

wucke13 commented Oct 14, 2019

are you running any large server workloads on this box? First thing I was thining of is that perhaps your file descriptor pool is exhausted?

Nope, one of the machines is my laptop with few services.

Could you show the output of $ strace zola serve (Or any other cli tool that tries to bind to a port?)
I want to know what syscall is failing and with what error code

Sure! trace.txt

At least for zola serve, the systemctl suspend seems to be a reliable way to reproduce this behavior.

@arianvp
Copy link
Member

arianvp commented Oct 14, 2019

Thanks. could you also show the output of $ ip link when this issue happens?

From the logs it seems like your loopback interface is not available (which is odd)

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(8995), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
close(3)                                = 0

``

@wucke13
Copy link
Author

wucke13 commented Oct 14, 2019

Sure 😄

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s25: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether <one of my mac addresses> brd ff:ff:ff:ff:ff:ff
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DORMANT group default qlen 1000
    link/ether <one of my mac addresses> brd ff:ff:ff:ff:ff:ff
4: zt<zerotier id>: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether <one of my mac addresses> brd ff:ff:ff:ff:ff:ff
5: zt<zerotier id>: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether <one of my mac addresses> brd ff:ff:ff:ff:ff:ff
$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 <hostname>
::1 localhost

Worth noting, I do use systemd-networkd to configure my interfaces.

@arianvp
Copy link
Member

arianvp commented Oct 14, 2019

In that case, a networkctl status -a when the issue is manifested will also be useful ;)

@wucke13
Copy link
Author

wucke13 commented Oct 14, 2019

As requested. Note that I removed the zerotier interfaces from the list, as I do not expect the issue to be related to them.

$ networkctl status -a
● 1: lo
             Link File: /nix/store/6snycpaz9zrs5m7xz6dixl1nl0ngdrma-systemd-243/lib/systemd/network/99-default.link
          Network File: /etc/systemd/network/99-main.network
                  Type: loopback
                 State: carrier (configured)
                   MTU: 65536
  Queue Length (Tx/Rx): 1/1

● 2: enp0s25
             Link File: /nix/store/6snycpaz9zrs5m7xz6dixl1nl0ngdrma-systemd-243/lib/systemd/network/99-default.link
          Network File: /etc/systemd/network/99-main.network
                  Type: ether
                 State: no-carrier (configuring)
                  Path: pci-0000:00:19.0
                Driver: e1000e
                Vendor: Intel Corporation
                 Model: Ethernet Connection (3) I218-LM
            HW Address: <my mac address> (LCFC(HeFei) Electronics Technology co., ltd)
                   MTU: 1500 (min: 68, max: 9000)
  Queue Length (Tx/Rx): 1/1
      Auto negotiation: yes
                 Speed: n/a
                  Port: tp

● 3: wlp3s0
             Link File: /nix/store/6snycpaz9zrs5m7xz6dixl1nl0ngdrma-systemd-243/lib/systemd/network/99-default.link
          Network File: /etc/systemd/network/99-main.network
                  Type: wlan
                 State: routable (configured)
                  Path: pci-0000:03:00.0
                Driver: iwlwifi
                Vendor: Intel Corporation
                 Model: Wireless 7265 (Dual Band Wireless-AC 7265)
            HW Address: <my mac address> (Intel Corporate)
                   MTU: 1500 (min: 256, max: 2304)
  Queue Length (Tx/Rx): 1/1
               Address: 10.0.0.17
                        fe80::6257:18ff:fe6e:65d1
               Gateway: 10.0.0.1 (<brand name of my gateway>)
                   DNS: 10.0.0.1

@alexbakker
Copy link
Member

Looks like the loopback interface has lost its configuration as it doesn't have any addresses. This sounds like the same issue as I reported here.

@wucke13
Copy link
Author

wucke13 commented Oct 14, 2019

@alexbakker Ah yes!

@arianvp
Copy link
Member

arianvp commented Oct 14, 2019

That networkd is managing the lo interface by default seems undesirable to me. We should probably exclude it in the generated config. For completeness sake, could you post

$ cat /etc/systemd/network/99-main.network

here?

networkd managing the lo interface should be supported as far as I understand. That stuff goes bananas after suspend sounds like an upstream systemd-networkd issue to me. and should probably be filed as a bug with the systemd project. I'd like to try to reproduce this on a non-NixOS distro using your 99-main.network and then see if we can file an upstream issue for it.

For you to get further and not have a buggy experience, I advice to make lo explicitly unmanaged by default like in the issue @alexbakker linked. We should also probably make this the default; and backport it to 19.09.

@wucke13
Copy link
Author

wucke13 commented Oct 14, 2019

@arianvp Sure!

[Network]
DHCP=yes
IPv6AcceptRA=no
DNSSEC=no

Many thanks for the information, this makes sence.

@arianvp
Copy link
Member

arianvp commented Oct 14, 2019

Interesting. Could you show your networking config section of your nixos configuration? I'm kind of surprised it generated a file with DHCP=yes on all interfaces. I thought we removed this from 19.09. It should've given you an evaluation error. I wonder if this is also what is causing the issues. Asking for a DHCP lease on loopback is a bit weird.

as per release notes:

The networking.useDHCP option is unsupported in combination with networking.useNetworkd in anticipation of defaulting to it by default. It has to be set to false and enabled per interface with networking.interfaces..useDHCP = true;

The origianl PR:
#69302

@globin
Copy link
Member

globin commented Oct 15, 2019

I have experienced the lo interface losing its IP addresses when managed by systemd, as far as I could tell at suspend/wakeup. Since, I have removed the 99-main for me, which we will remove soon for 20.03, too.

cc @fpletz

@wucke13
Copy link
Author

wucke13 commented Oct 15, 2019

Could you show your networking config section of your nixos configuration?

Sure!

networking = {
    useNetworkd = true;
    useDHCP = false;
    firewall. enable = false;
};
systemd.services."systemd-networkd-wait-online".enable = false;
systemd.network = {
  enable = true;
  networks = {
    "10-unmanaged" = {
      matchConfig.Name = "docker* nlmon* tun* virbr* zt*";
      linkConfig.Unmanaged = true;
    };
    "99-main" = {
      DHCP = "yes";
      extraConfig = ''
        IPv6AcceptRA=no
        DNSSEC=no
      '';
    };
  };
};

So looks like it is my fault, as I have DHCP=yes. Would you mind sharing a basic idiomatic sample for a systemd-networkd config?

@arianvp
Copy link
Member

arianvp commented Oct 15, 2019

    "99-main" = {
      MatchConfig.Name = "en* wl*"; 
      DHCP = "yes";
      extraConfig = ''
        IPv6AcceptRA=no
        DNSSEC=no
      '';
    };

Should do the job in your case. Explicitly only match interfaces for which you want to actually do DHCP.

Or you could add lo to your Umanaged list:

  networks = {
    "10-unmanaged" = {
      matchConfig.Name = "lo docker* nlmon* tun* virbr* zt*";
      linkConfig.Unmanaged = true;
    };
    "99-main" = {
      matchConfig.Name = "*";
      DHCP = "yes";
      extraConfig = ''
        IPv6AcceptRA=no
        DNSSEC=no
      '';
    };
  };

Both will work. One is blacklist based, the other whitelist based. Note that we'll probably move to a model where all interfaces are unmanaged by default in 20.03, so I would say method 1 is a bit more idiomatic.

@arianvp
Copy link
Member

arianvp commented Oct 15, 2019

I was looking to see if this is a systemd bug, but this behaviour is documented. It's not a bug. From the systemd-networkd man page (emphasis mine):

systemd-networkd will manage network addresses and routes for any link for which it finds a .network file with an appropriate [Match] section, see systemd.network(5). For those links, it will flush existing network addresses and routes when bringing up the device.

The linux kernel brings lo up. So when systemd-network starts, there is nothing for it to do. However when you suspend, and unsuspend, lo will be down, and systemd-networkd tries to configure it, which will flush the addresses before configuring, making lo lose the 127.0.0.1 address. However this might be a buggy part. the lo interface isn't down when you unsuspend, I think networkd merely thinks it is down and then flushes its config. It's a bit of a weird edge case that we could classify as a bug perhaps?

Another way to fix this would be to have an explicit configuration for loopback, and make it a very high presedence. This will cause the loopback interface to always be correctly configured. But I think explicitly not managing the lo interface is probably better.

networks."00-loopback" = {  matchConfig.Name = "lo";   Address = "127.0.0.1"; };

@alexbakker
Copy link
Member

I would personally much rather have the option to have networkd not touch any network interfaces by default, except for the ones that have a config in configuration.nix.

@arianvp
Copy link
Member

arianvp commented Oct 15, 2019

Agreed. That is the option we're planning to implement soon. (See @globin 's reaciton)

@wucke13
Copy link
Author

wucke13 commented Oct 15, 2019

Issue is solved by #71106 (comment), thank you all!

@wucke13 wucke13 closed this as completed Oct 15, 2019
flokli added a commit to flokli/nixpkgs that referenced this issue Oct 23, 2019
Just maching all network interfaces caused many breakages, see NixOS#18962
and NixOS#71106.

We already don't support the global networking.useDHCP,
networking.defaultGateway(6) options if networking.useNetworkd is
enabled, but direct users to configure the per-device
networking.interfaces.<name?>.… options.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants