[Bug]: Suddenly unable to work #3183

Open
1 task done
wuwo1952368901 opened this issue Nov 7, 2024 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@wuwo1952368901

Contact Details

No response

What happened?

Suddenly unable to ping between nodes.

Version

v0.24.2

What OS are you using?

No response

Relevant log output

No response

Contributing guidelines

  • Yes, I did.
wuwo1952368901 added the bug label Nov 7, 2024
@wuwo1952368901
Author

After running normally for a while, some nodes can no longer be pinged. Restarting the netclient service restores connectivity, but the problem comes back some time after recovery. How can we investigate the specific cause? @afeiszli

@abhishek9686
Member

Can you provide more information about your environment?

  1. Which OS are the clients running on?
  2. Are they behind NAT?

@wuwo1952368901
Author

wuwo1952368901 commented Nov 8, 2024

They are not behind NAT.

OS:

  Debian
  debian_version:12.7
  kernel: Linux  6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

  Debian
  debian_version:12.4
  kernel: Linux  6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

@yabinma
Collaborator

yabinma commented Nov 8, 2024

When the issue happens, there are usually several places to check:

  1. Run wg to check whether the target host's IP is in the peer list.
  2. Run journalctl -u netclient > ./netclient.log to export the netclient log, then check it for errors or for what netclient was doing when the issue occurred.
  3. It may also be worth checking the system log for anything unusual around that time.
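The first check can be scripted. The following is a minimal sketch, not part of netclient: it assumes the human-readable `wg show` output has been captured as text, and the function name `peers_allowing` is made up for illustration. It reports which "allowed ips" entries cover a given target IP.

```python
import ipaddress

def peers_allowing(wg_show_text: str, target_ip: str) -> list:
    """Return the 'allowed ips' entries in `wg show` output that cover target_ip."""
    target = ipaddress.ip_address(target_ip)
    matches = []
    for line in wg_show_text.splitlines():
        # Split on the first colon only, so IPv6 values stay intact.
        key, sep, value = line.strip().partition(":")
        if not sep or key != "allowed ips":
            continue
        for cidr in value.split(","):
            cidr = cidr.strip()
            try:
                network = ipaddress.ip_network(cidr, strict=False)
            except ValueError:
                continue  # e.g. "(none)" when a peer has no allowed IPs
            if target in network:
                matches.append(cidr)
    return matches

# Sample text in the same layout as `wg show` output (keys are placeholders).
SAMPLE = """\
peer: publickey
  endpoint: 10.42.6.133:51821
  allowed ips: 10.103.0.6/32

peer: publickey
  endpoint: 10.42.9.197:51821
  allowed ips: 10.103.0.9/32
"""

print(peers_allowing(SAMPLE, "10.103.0.6"))  # ['10.103.0.6/32']
print(peers_allowing(SAMPLE, "10.103.0.7"))  # []
```

An empty result means the target host is missing from the peer list on that node.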

@wuwo1952368901
Author

Using the wg command, I found that the peers' endpoint IPs are incorrect. They were automatically set to addresses from my k8s cluster network.

peer: publickey
  endpoint: 10.42.6.133:51821
  allowed ips: 10.103.0.6/32
  transfer: 0 B received, 4.47 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: 10.42.9.197:51821
  allowed ips: 10.103.0.9/32
  transfer: 0 B received, 4.60 MiB sent
  persistent keepalive: every 20 seconds
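A quick way to spot this kind of wrong endpoint is to test each endpoint against the cluster's network range. This is only a sketch under assumptions: `10.42.0.0/16` is a guess at the pod network based on the addresses above (substitute the real CIDR of your cluster), `203.0.113.7` is a documentation address standing in for a correct public endpoint, and the function names are made up for illustration.

```python
import ipaddress

# Assumption: the k8s pod network is 10.42.0.0/16; replace with your cluster's CIDR.
POD_CIDR = ipaddress.ip_network("10.42.0.0/16")

def endpoint_host(endpoint: str) -> str:
    """Strip the port from 'host:port' or '[v6-address]:port'."""
    host, _, _port = endpoint.rpartition(":")
    return host.strip("[]")

def endpoints_in_network(endpoints, network=POD_CIDR) -> list:
    """Return the endpoints whose address falls inside the given network."""
    flagged = []
    for endpoint in endpoints:
        try:
            address = ipaddress.ip_address(endpoint_host(endpoint))
        except ValueError:
            continue  # hostname or redacted value, cannot be checked
        if address in network:
            flagged.append(endpoint)
    return flagged

endpoints = ["10.42.6.133:51821", "10.42.9.197:51821", "203.0.113.7:51821"]
print(endpoints_in_network(endpoints))
# ['10.42.6.133:51821', '10.42.9.197:51821']
```

Any endpoint this flags points into the cluster network rather than at an address the other nodes can reach.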

@yabinma
Collaborator

yabinma commented Nov 9, 2024

Auto endpoint detection is enabled by default, so that hosts can communicate with each other over their internal IPs when they are in the same subnet.

In your setup, the hosts cannot communicate with each other over the k8s cluster network IPs.
You can disable auto endpoint detection: in netmaker.env, set ENDPOINT_DETECTION=false and restart the containers with docker compose down && docker compose up -d

@wuwo1952368901
Author

After syncing the configuration with "netclient pull", the node still cannot be pinged. The "wg show" command outputs the following:

interface: netmaker
  public key: publickey
  private key: (hidden)
  listening port: 51821

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.4/32
  latest handshake: 1 minute, 3 seconds ago
  transfer: 209.23 KiB received, 143.68 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.3/32
  latest handshake: 1 minute, 35 seconds ago
  transfer: 5.31 MiB received, 958.77 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.5/32
  transfer: 0 B received, 39.17 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.2/32
  transfer: 0 B received, 39.31 KiB sent
  persistent keepalive: every 20 seconds

The last two nodes cannot be pinged. The wg show output has no "latest handshake" entry for the problematic nodes.

@yabinma @afeiszli
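Peers that have never completed a handshake can be picked out of the output automatically. The sketch below is illustrative only: it assumes the human-readable `wg show` layout shown above, with blocks separated by blank lines, and `peers_without_handshake` is a made-up name.

```python
def peers_without_handshake(wg_show_text: str) -> list:
    """Return the 'allowed ips' value of each peer block lacking a 'latest handshake' line."""
    missing = []
    # Interface and peer blocks are separated by blank lines.
    for block in wg_show_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition(":")
            if sep:
                fields[key] = value.strip()
        if "peer" in fields and "latest handshake" not in fields:
            missing.append(fields.get("allowed ips", "(unknown)"))
    return missing

# Shortened copy of the output above (keys and endpoints are placeholders).
SAMPLE = """\
interface: netmaker
  public key: publickey
  private key: (hidden)
  listening port: 51821

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.4/32
  latest handshake: 1 minute, 3 seconds ago
  transfer: 209.23 KiB received, 143.68 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.5/32
  transfer: 0 B received, 39.17 KiB sent
  persistent keepalive: every 20 seconds
"""

print(peers_without_handshake(SAMPLE))  # ['10.104.0.5/32']
```

Running this on both ends of a broken link shows whether the handshake is missing in one direction or in both.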

@abhishek9686
Member

Hi,
can you share the output of wg show from the peer 10.104.0.5?

@wuwo1952368901
Author

This is the "wg show" output on 10.104.0.5:

interface: netmaker
  public key: publickey
  private key: (hidden)
  listening port: 51821

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.4/32
  latest handshake: 1 minute, 11 seconds ago
  transfer: 11.06 MiB received, 56.23 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.3/32
  latest handshake: 1 minute, 21 seconds ago
  transfer: 368.95 MiB received, 321.13 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.2/32
  transfer: 0 B received, 489.67 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.1/32
  transfer: 0 B received, 465.68 KiB sent
  persistent keepalive: every 20 seconds

4 participants