[kube-up][scale-out] coredns pod not able to list resources from TP master #1359

Open
h-w-chen opened this issue Feb 14, 2022 · 5 comments

@h-w-chen
Collaborator

What happened:
coredns is in Running state; however, its ready container count is 0/1. The coredns pod log has the following error records:

...
Failed to list *v1.Endpoints: Get \"https://10.40.0.2:443/api/v1/endpoints?limit=500\u0026resourceVersion=0\": dial tcp 10.40.0.2:443: connect: no route to host\n","stream":"stderr","time":"2022-02-14T19:07:10.303482004Z"}
Failed to list *v1.Service: Get \"https://10.40.0.2:443/api/v1/services?limit=500\u0026resourceVersion=0\": dial tcp 10.40.0.2:443: connect: no route to host\n","stream":"stderr","time":"2022-02-14T19:07:20.515095501Z"}

What you expected to happen:
coredns should be able to list resources from the TP master.

How to reproduce it (as minimally and precisely as possible):
Using the poc-2022-01-30 code, run kube-up.sh to start a scale-out cluster with 1 TP, 1 RP, and 1 worker.

Anything else we need to know?:
10.40.0.2 is the physical node IP address of the TP master.
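
For reference, a minimal connectivity check from the worker node hosting the failing coredns pod (a sketch; curl availability on the node is an assumption, and since the error comes from inside the pod, the same check is best repeated from the pod's network namespace):

# on the worker node hosting the failing coredns pod
ping -c 3 10.40.0.2                     # TP master node IP
curl -k https://10.40.0.2:443/healthz   # any HTTP response (even 401) proves the port is reachable
ip route get 10.40.0.2                  # confirm a route to the TP master exists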

Environment:

  • Arktos version (use kubectl version): poc-2022-01-30, commit 2b6855
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools: kube-up scale-out 1+1x1
  • Network plugin and version (if this is a network-related bug):
  • Others:
@sonyafenge
Collaborator

Adding the previous investigation with flannel for reference.
For scale-up:

  1. COS (Container-Optimized OS):

The coredns pod starts successfully.

Pods can ping each other across nodes.

ip neigh on master

# ip neigh
10.40.0.1 dev eth0 lladdr 42:01:0a:28:00:01 REACHABLE

ip neigh on minion node

# ip neigh
10.64.1.5 dev cni0 lladdr ba:6c:7a:2e:ff:e7 REACHABLE
10.64.1.10 dev cni0 lladdr e6:b6:78:9c:0d:97 REACHABLE
10.64.1.19 dev cni0 lladdr 16:e4:f7:91:d0:77 REACHABLE
10.64.1.4 dev cni0 lladdr 4e:ca:cd:9e:a5:1d REACHABLE
10.64.1.9 dev cni0  FAILED
10.40.0.1 dev eth0 lladdr 42:01:0a:28:00:01 REACHABLE
10.64.1.7 dev cni0  FAILED
10.64.1.3 dev cni0 lladdr 92:48:35:fd:06:25 DELAY
10.64.1.6 dev cni0 lladdr a2:c1:25:4c:a8:db REACHABLE
10.64.1.11 dev cni0 lladdr 8a:d9:ac:5d:09:7c REACHABLE
10.64.1.2 dev cni0 lladdr 22:4f:8b:43:8d:8c REACHABLE
  2. Ubuntu 20.04:

coredns failed to start, with the same error as reported in this issue.

Pods cannot ping each other across nodes (additional checks are sketched after the ip neigh output below).

ip neigh on master

# ip neigh
10.40.0.1 dev ens4 lladdr 42:01:0a:28:00:01 REACHABLE

ip neigh on minion nodes

# ip neigh
10.40.0.1 dev ens4 lladdr 42:01:0a:28:00:01 REACHABLE
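
For the Ubuntu case, a few extra checks can help narrow down whether flannel is programming neighbor/FDB entries at all (a sketch; the flannel.1 device and cni0 bridge are assumptions based on a default VXLAN-backed flannel deployment, and flannel.1 will not exist with the host-gw backend):

# on a minion node (Ubuntu case)
ip route | grep -E 'cni0|flannel'       # pod CIDR routes via cni0/flannel.1
ip -d link show flannel.1               # VXLAN interface details
bridge fdb show dev flannel.1           # forwarding entries for remote nodes
ping -c 3 <pod IP on another node>      # cross-node pod connectivity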

@sonyafenge
Collaborator

Checked scale-out with mizar: running "ip neigh" on the minion nodes also returns only the 10.40.0.1 entry, with no pod IPs listed.
Suspect the same issue as with flannel:
flannel-io/flannel#1155

@Sindica
Collaborator

Sindica commented Feb 15, 2022

This is why coredns and kube-dns are crashing in scale-out (both local and kube-up) and in scale-up (kube-up only, since the local setup deploys the coredns/kube-dns pod to the master).

kube-dns:
2022-02-15T22:42:35.697490309Z stderr F E0215 22:42:35.697343       1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:189: Failed to list *v1.Endpoints: Get https://10.0.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
2022-02-15T22:42:35.6975123Z stderr F E0215 22:42:35.697348       1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:192: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
...
2022-02-15T22:43:03.197113883Z stderr F I0215 22:43:03.196855       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:03.697056566Z stderr F I0215 22:43:03.696870       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:04.19701365Z stderr F I0215 22:43:04.196833       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:04.697068732Z stderr F I0215 22:43:04.696904       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:05.197105713Z stderr F I0215 22:43:05.196876       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:05.697058396Z stderr F F0215 22:43:05.696845       1 dns.go:209] Timeout waiting for initialization
coredns:
2022-02-15T22:47:59.821381993Z stderr F E0215 22:47:59.821145       1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: Get "https://172.30.0.14:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.14:6443: i/o timeout
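
For the 10.0.0.1:443 timeouts, a quick way to check whether the in-cluster apiserver service VIP is being translated at all (a sketch; assumes an iptables-based kube-proxy and should be run with the appropriate kubeconfig for the partition):

# on the node hosting the failing dns pod
kubectl get svc kubernetes -o wide      # clusterIP should match the IP in the logs above
kubectl get endpoints kubernetes        # should list the apiserver host endpoint(s)
sudo iptables-save | grep 10.0.0.1      # kube-proxy DNAT rules for the service VIP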

@Sindica Sindica added this to the 0.10 milestone Feb 17, 2022
@Sindica
Collaborator

Sindica commented Feb 17, 2022

After adding the mizar daemon to the TP master and adding the TP master to the mizar droplets, the kube-up scale-out 1x1 cluster still has the service-crashing issue.

sindica2000@ying-scaleout-tp-1-master:/var/log$ cat kube-controller-manager.log | grep mizar-node-controller.go | grep mizar | grep Create | grep successfully
I0217 06:33:40.553971       1 mizar-node-controller.go:239] Mizar handled request successfully for mizar_node. key ying-scaleout-tp-1-master, eventType Create
I0217 06:34:20.331171       1 mizar-node-controller.go:239] Mizar handled request successfully for mizar_node. key ying-scaleout-rp-1-minion-group-2gt2, eventType Create
I0217 06:34:24.335101       1 mizar-node-controller.go:239] Mizar handled request successfully for mizar_node. key ying-scaleout-rp-1-master, eventType Create
I0217 06:34:24.943984       1 mizar-node-controller.go:239] Mizar handled request successfully for mizar_node. key ying-scaleout-rp-1-minion-group-w5f3, eventType Create

sindica2000@ying-dev1:~/go/src/sindica-arktos$ kubectl --kubeconfig cluster/kubeconfig.tp-1 get eps | grep host
ying-scaleout-rp-1-master-default--hostep-13576365                            host     4a:f7:5b:5c:20:54   10.40.0.3      89.225.0.1   32       Provisioned   aaa-default-network-subnet      aaa-default-network      13576365   ying-scaleout-rp-1-master              ehost-13576365   vehost-13576365           10.40.0.3   42:01:0a:28:00:03   2022-02-17T06:44:35.409457   0.522525                    
ying-scaleout-rp-1-minion-group-2gt2-default--hostep-13576365                 host     1a:3d:d2:71:73:a8   10.40.0.4      89.225.0.1   32       Init          aaa-default-network-subnet      aaa-default-network      1          ying-scaleout-rp-1-minion-group-2gt2   ehost-13576365   vehost-13576365           10.40.0.4   42:01:0a:28:00:04   2022-02-17T06:44:35.091425                               
ying-scaleout-rp-1-minion-group-w5f3-default--hostep-13576365                 host     12:1c:8d:2a:01:74   10.40.0.5      89.225.0.1   32       Init          aaa-default-network-subnet      aaa-default-network      1          ying-scaleout-rp-1-minion-group-w5f3   ehost-13576365   vehost-13576365           10.40.0.5   42:01:0a:28:00:05   2022-02-17T06:44:35.831306                               
ying-scaleout-tp-1-master-default--hostep-1                                   host     b2:30:08:53:36:7d   10.40.0.2      20.0.0.1     32       Provisioned   net0                            vpc0                     1          ying-scaleout-tp-1-master              ehost-1          vehost-1                  10.40.0.2   42:01:0a:28:00:02   2022-02-17T06:34:08.829725   1.330807                    
ying-scaleout-tp-1-master-default--hostep-13576365                            host     9a:f9:43:fb:8f:ec   10.40.0.2      89.225.0.1   32       Provisioned   aaa-default-network-subnet      aaa-default-network      13576365   ying-scaleout-tp-1-master              ehost-13576365   vehost-13576365           10.40.0.2   42:01:0a:28:00:02   2022-02-17T06:44:34.931540   0.842169                    
ying-scaleout-tp-1-master-default--hostep-13961987                            host     0e:4f:f1:bc:d9:a9   10.40.0.2      1.36.0.1     32       Provisioned   system-default-network-subnet   system-default-network   13961987   ying-scaleout-tp-1-master              ehost-13961987   vehost-13961987           10.40.0.2   42:01:0a:28:00:02   2022-02-17T06:34:08.555134   0.998453     

sindica2000@ying-dev1:~/go/src/sindica-arktos$ kubectl --kubeconfig cluster/kubeconfig.tp-1 get pods -o wide -AT | grep ying-scaleout-rp-1-minion-group-w5f3 | grep -v netpod
system   default       mizar-daemon-jbc5b                                    4045097148037907557   1/1     Running            0          9h    10.40.0.5    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   coredns-default-ying-scaleout-tp-1-7545f94d7c-qzz4z   4574940730222740260   0/1     CrashLoopBackOff   135        9h    1.36.0.14    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-48bdq                              1457492913046417354   1/1     Running            0          9h    10.40.0.5    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   heapster-v1.6.0-beta.1-7c546f8546-twxbl               8803644565539269132   2/2     Running            98         9h    1.36.0.19    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   kube-proxy-ying-scaleout-rp-1-minion-group-w5f3       5903977805958703516   1/1     Running            0          9h    10.40.0.5    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   metrics-server-v0.3.3-5f994fcb77-6dzf6                523873102943355845    1/2     CrashLoopBackOff   116        9h    1.36.0.34    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>

Log from kubernetes-dashboard-848965699-jx5vx:

{"log":"2022/02/17 16:23:14 Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service account's configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get https://10.0.0.1:443/version: dial tcp 10.0.0.1:443: connect: no route to host\n","stream":"stdout","time":"2022-02-17T16:23:14.830337903Z"}

Log from kube-dns-autoscaler:

{"log":"E0217 16:26:46.478019       1 reflector.go:190] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:94: Failed to list *v1.Node: Get https://10.0.0.1:443/api/v1/nodes: dial tcp 10.0.0.1:443: getsockopt: no route to host\n","stream":"stderr","time":"2022-02-17T16:26:46.478272251Z"}

Log from metrics server:

{"log":"panic: Get https://10.0.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.0.0.1:443: connect: no route to host\n","stream":"stderr","time":"2022-02-17T16:27:49.286259419Z"}

Note that the host eps for the minion nodes are still in the Init state.
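
To watch whether those host eps ever move out of Init (a sketch, reusing the kubeconfig and query from above):

watch -n 10 'kubectl --kubeconfig cluster/kubeconfig.tp-1 get eps | grep host'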

@Sindica
Collaborator

Sindica commented Mar 3, 2022

I still see issues in the new tenant's coredns pod. It is not 100% reproducible, but it should be reproducible by creating multiple tenants in the same cluster.

{"log":"E0303 17:47:03.438944       1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Get \"https://10.40.0.2:443/api/v1/endpoints?limit=500\u0026resourceVersion=0\": dial tcp 10.40.0.2:443: i/o timeout\n","stream":"stderr","time":"2022-03-03T17:47:03.439111869Z"}
{"log":"[INFO] SIGTERM: Shutting down servers then terminating\n","stream":"stdout","time":"2022-03-03T17:47:09.34679178Z"}

This may not be an issue, since the new VPC started with 127.
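
A rough way to keep an eye on the intermittent failure across tenants (a sketch, reusing the kubeconfig and the -AT flag shown earlier):

# re-check every 30s whether any tenant's coredns pod is crashing or restarting
while true; do
  kubectl --kubeconfig cluster/kubeconfig.tp-1 get pods -o wide -AT | grep coredns
  sleep 30
done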
