Networking (Flannel) issues

Cross-node pod connectivity issues

Sometimes pods running on different worker nodes cannot reach each other.

To check where each pod is running:

$ kubectl get pods -o wide

NAME                                   READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
kubernetes-bootcamp-6c5cfd894b-mnq5f   1/1     Running   0          32m     10.111.37.154   pa1-r2-s01     <none>           <none>
kubernetes-bootcamp-6c5cfd894b-qd2tb   1/1     Running   0          32m     10.111.12.208   pa1-r3-gpu01   <none>           <none>
kubernetes-bootcamp-6c5cfd894b-t2g2n   1/1     Running   0          47m     10.111.15.202   pa1-r3-s14     <none>           <none>

Log in to one node and try to ping the pods running on the others. Note that the pod IPs are on different subnets (10.111.37.0, 10.111.12.0, etc.): each subnet belongs to a specific node.
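
For example, from pa1-r2-s01 try to reach the pod hosted on pa1-r3-gpu01 (pod IP taken from the listing above):

$ ping -c 3 10.111.12.208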

Check the flannel service on the worker nodes and restart it if needed:

$ juju run --application kubernetes-worker "sudo service flannel status"
$ juju run --application kubernetes-worker "sudo service flannel restart"

This should be enough to solve the connectivity issues.
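
To verify that flannel came back up, check the vxlan device (typically named flannel.1 for the vxlan backend configured below) and the subnet lease on every worker:

$ juju run --application kubernetes-worker "sudo ip -4 a show flannel.1"
$ juju run --application kubernetes-worker "sudo cat /var/run/flannel/subnet.env"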

Flannel plugin disappears from /opt/cni/bin

After replacing the Juju relation between kubernetes-worker and flannel (with juju remove-relation followed by juju add-relation), containers created on the worker failed with the following error:

network: failed to find plugin "flannel" in path [/opt/cni/bin]

The cause is probably a bug in the flannel charm: it removes the CNI flannel plugin when the relation is removed, but does not reinstall it when the relation is added back.
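
To confirm, check whether the plugin binary is actually present on the workers (the path is the one from the error message):

$ juju run --application kubernetes-worker "ls -l /opt/cni/bin/flannel"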

We solved the issue by issuing:

juju upgrade-charm kubernetes-worker

(rolling back to the previous revision afterwards, if needed). Indeed, it is the kubernetes-worker charm that installs the plugin during installation!
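
If you need to roll back explicitly, something like this should work (the revision number is a placeholder; juju status shows the revision you were on):

$ juju upgrade-charm kubernetes-worker --revision 123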

Configure flannel through etcdctl

Example session:
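
# the etcd leader unit is marked with an asterisk in juju status (e.g. etcd/6*)
$ juju status etcd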

$ juju ssh etcd/6     # choose the etcd leader

$ etcdctl ls /coreos.com/network
/coreos.com/network/config
/coreos.com/network/subnets

$ etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/10.111.71.0-24
/coreos.com/network/subnets/10.111.82.0-24
/coreos.com/network/subnets/10.111.24.0-24
/coreos.com/network/subnets/10.111.11.0-24
/coreos.com/network/subnets/10.111.39.0-24

$ etcdctl get /coreos.com/network/config
{"Network": "10.111.0.0/16", "Backend": {"Type": "vxlan"}}


# specify explicitly that the subnets should be /24 with SubnetLen
$ etcdctl set /coreos.com/network/config '{"Network": "10.111.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}'
{"Network": "10.111.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}


# reconfigure the subnet on a worker (10.111.82.0/24 -> 10.111.42.0/24)
$ etcdctl get /coreos.com/network/subnets/10.111.82.0-24
{"PublicIP":"10.2.8.123","BackendType":"vxlan","BackendData":{"VtepMAC":"56:25:26:42:5b:00"}}

$ etcdctl set /coreos.com/network/subnets/10.111.42.0-24 '{"PublicIP":"10.2.8.123","BackendType":"vxlan","BackendData":{"VtepMAC":"56:25:26:42:5b:00"}}'
{"PublicIP":"10.2.8.123","BackendType":"vxlan","BackendData":{"VtepMAC":"56:25:26:42:5b:00"}}

$ etcdctl rm /coreos.com/network/subnets/10.111.82.0-24
PrevNode.Value: {"PublicIP":"10.2.8.123","BackendType":"vxlan","BackendData":{"VtepMAC":"56:25:26:42:5b:00"}}
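
# verify that the new subnet key is present and the old one is gone
$ etcdctl ls /coreos.com/network/subnets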


# restart flannel on all nodes
$ juju run --application kubernetes-worker "sudo service flannel restart"

# check
$ juju run --application kubernetes-worker "sudo cat /var/run/flannel/subnet.env"
$ juju run --application kubernetes-worker "sudo ip -4 a | grep 111"
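
Pods created before the change keep their old addresses: they only get an IP from the new subnet when recreated. For pods managed by a Deployment, deleting them is enough (the pod name below is from the example at the top):

$ kubectl delete pod kubernetes-bootcamp-6c5cfd894b-mnq5f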