This is an expansion of a talk I gave to some fellow OpenShift developers about how I got to the bottom of a particularly nasty bug. Of course, it only showed up in integration tests. The actual bug is quite simple, but to understand it requires some knowledge of most of the underlying Kubernetes node components. So, it’s a useful tour of some of the lower-level parts of Kubernetes networking.
Many thanks to Tyler Neely for his detailed review of this post.
This post assumes you know some basic Kubernetes components, like Pods, Nodes, and containers. If you’d like to learn more about these, check out the documentation for Nodes and Pods, as well as a useful intro to containers.
Some time ago, my colleague merged a minor networking PR into an upstream project used by OpenShift Container Platform. OpenShift has an extensive set of CI test jobs, and this PR merged without any trouble. However, a day later, another team complained that one of their jobs had gone red and it “seemed to be networking…”
This all was a bit odd - their job didn’t have anything to do with networking, and we couldn’t see an obvious connection. And no other CI jobs, of which there are hundreds, were affected. Fortunately, our amazing colleagues (using the sophisticated OpenShift build system) quickly bisected the failure, and the networking PR was somehow to blame. So, however unclear the connection, we knew where to look.
All OpenShift CI jobs start by creating a Kubernetes cluster then executing tests on it. We also assert on certain invariants during the entire CI process. Because this is just a Kubernetes cluster, we can use all the usual tools for inspecting cluster status.
Digging into the test results, we see that one of the cluster nodes has marked itself as NotReady. Let’s see why by retrieving the Kubernetes Node object:
$ kubectl get node <<nodename>> -o yaml
... (snip) ...
  type: Ready
  status: False
  reason: KubeletNotReady
  message: runtime network not ready: NetworkReady=false
    reason:NetworkPluginNotReady message:Network plugin returns error:
    Missing CNI default network
This tells us that:
- The kubelet called the Status function of the CRI API
- The container runtime (in this case, CRI-O) has scanned its CNI configuration directory and found no files¹.
- The runtime returned a failing status
- The kubelet, accordingly, set a NotReady status on its Node object
So the next place we should look is the networking daemon on the node.
Inspecting the pod
In OpenShift, there is a daemon that manages pod networking. Since it’s a normal DaemonSet, we can use the usual tools to investigate it. Looking at the pod, we see that it’s crashing².
$ kubectl get -n openshift-sdn pods -o wide | grep <<nodename>>
sdn-6qckt   1/1   CrashLoopBackoff   10   1h15m   10.0.152.60
Let’s find out what’s going on:
$ kubectl -n openshift-sdn describe pod sdn-6qckt
<<snip>>
Events:
  Type     Reason     Age                   From                   Message
  ----     ------     ----                  ----                   -------
  <<snip>>
  Warning  Unhealthy  37s (x10 over 3m38s)  kubelet, <<nodename>>  Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Okay, the liveness probe is failing. Let’s look at the logs for the process.
$ kubectl -n openshift-sdn logs --tail=5 sdn-6qckt
I0921 07:07:16.378919   12152 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete
I0921 07:07:16.438034   12152 proxier.go:367] userspace proxy: processing 0 service events
I0921 07:07:16.438055   12152 proxier.go:346] userspace syncProxyRules took 59.113113ms
I0921 07:07:16.438066   12152 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete
interrupt: Gracefully shutting down ...
That’s odd - nothing exciting, just a sudden signal “out of the blue,” so to speak. And the liveness probe is a simple HTTP server; it’s pretty unlikely it would fail without other obvious signals. Given that:
- It’s very unlikely that the healthz http server is failing
- Resource usage hasn’t changed; i.e. we’re not hitting memory or file descriptor limits
- The logs would say something
- The offending change doesn’t touch any of this code
I’ll assume the openshift-sdn process is live, and the liveness probe is failing for a different reason.
A Failing Probe
So we have a liveness probe failing. Let’s look at the kubelet logs:
prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
What does this mean? It means the kubelet is trying to connect to the IP specified in Pod.Status.PodIP and issue an HTTP request. That connection times out. Since this is a liveness probe, the failure will cause the kubelet to kill and restart the process. Bummer.
In ordinary circumstances, this means that (more likely) the process is truly hung, or (less likely) something has gone wrong with pod networking such that, for whatever reason, packets aren’t flowing to the pod³.
This case is more curious: openshift-sdn is a hostNetwork Pod. That is to say, it does not have a separate network namespace allocated, and thus shares the networking environment of the host. If pod networking is broken, that won’t break us.

This lack of isolation can mean that other processes – outside the openshift-sdn Pod – can interfere with us. So, a few possible hypotheses for what is happening:
- Maybe another process is listening on the port somehow? netstat -ltnp | grep 10256 returns nothing, so this is wrong.
- What about something wrong with iptables? iptables-save | grep 10256 returns nothing and disproves this.
A Double Take
10.0.152.60 is the node’s primary IP address - the one we are SSH’d to! In other words, we’re not able to connect to our own local address, one we know is up. What the heck is going on!?
Suddenly, I have a recollection from when I was hacking on some low-level namespace silliness for the CNI test suite: loopback is special.
Loopback is special
As you probably know, there is an interface, lo, of type loopback, on every Linux node; it hosts the localhost IPs 127.0.0.1 and ::1. The loopback device is somewhat magical, because it doesn’t really use the full networking stack. Rather, packets are instantly delivered to their destination. This makes it much faster. Using loopback is good.
So, Linux does something clever. If you’re talking to yourself, even if you don’t use the loopback address, you still use the loopback device. Looking at my laptop, wl0 has IP 192.168.0.150:

$ ip addr show dev wl0
3: wl0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.0.150/24 brd 192.168.0.255 scope global dynamic noprefixroute wl0
And a route to the outside world uses the default route:

$ ip route get 1.1.1.1
1.1.1.1 via 192.168.0.1 dev wl0 src 192.168.0.150 uid 1000
    cache
But if we were to connect to the address on wl0, we’d use the loopback device instead:

$ ip route get 192.168.0.150
local 192.168.0.150 dev lo src 192.168.0.150 uid 1000
    cache <local>
I’m (not) talking to the man in the mirror…
So, let’s check our bad node’s loopback interface:

$ ip -brief link show dev lo
lo    DOWN    00:00:00:00:00:00 <LOOPBACK>
Down! We’re a vampire! We can’t see ourselves in the mirror. Bring the interface up, and everything comes back to normal.
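As an aside, the same check can be done from Go’s standard library; this sketch just reports what ip -brief link show dev lo showed us, and is purely illustrative:

```go
package main

import (
	"fmt"
	"net"
)

// loopbackUp reports whether this machine's loopback interface carries
// the UP flag - the flag that was missing on our broken node.
func loopbackUp() (bool, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return false, err
	}
	for _, ifc := range ifaces {
		if ifc.Flags&net.FlagLoopback != 0 {
			return ifc.Flags&net.FlagUp != 0, nil
		}
	}
	return false, fmt.Errorf("no loopback interface found")
}

func main() {
	up, err := loopbackUp()
	if err != nil {
		panic(err)
	}
	fmt.Println("loopback up:", up) // "false" means a vampire node
}
```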
How did we get here?
Let’s review the facts:
- We merged a change that fixed CNI deletion.
- The CNI loopback plugin sets the interface down.
- Nothing else on the host would ever touch loopback.
- Sometimes loopback is down on the host, soon after a reboot.
- This doesn’t happen all the time.
To me, it seems that the CNI loopback plugin is taking down the host’s loopback. How could this happen?
CNI & Kubernetes
Let’s go over the usual lifecycle of a pod, and how CNI fits into it:
- A Pod object is created on the apiserver, and is scheduled to a node
- Kubelet reconciles (more on this later) the Pod’s desired state with the running state, and notices that (among other things⁴) it has no PodSandbox (since it is brand new)
- It issues a RunPodSandboxRequest over the CRI to the container runtime on the node. In this case, the runtime is CRI-O
- CRI-O creates the namespaces, including the network namespace, by starting a process (the “infrastructure container”)
- CRI-O issues a CNI ADD command to the CNI plugin(s), passing them the path to the network namespace
- Once the PodSandbox is configured, Kubelet creates individual containers via CreateContainer and StartContainer CRI methods
Then, when the pod is deleted
- Kubelet notices that the Pod object is being deleted
- It executes any PreStop hooks on running containers (via CRI Exec)
- Kubelet issues any relevant StopContainer CRI requests
- Kubelet issues StopPodSandbox and RemovePodSandbox CRI requests
- CRI-O issues a CNI DEL command to the CNI plugin(s), again passing them the network namespace
- Kubelet updates the Pod object as deleted
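To make the ordering concrete, here’s a toy Go sketch that records the call sequence for both flows. Only the recorded method names correspond to real CRI RPCs (plus the CNI commands the runtime issues alongside them); the fakeRuntime type and everything else is invented for illustration:

```go
package main

import "fmt"

// fakeRuntime is an invented stand-in for the container runtime; it just
// records the order in which things happen.
type fakeRuntime struct{ calls []string }

func (r *fakeRuntime) call(name string) { r.calls = append(r.calls, name) }

// createPod mirrors the sandbox-then-containers creation flow above.
func createPod(r *fakeRuntime, containers []string) {
	r.call("RunPodSandbox") // runtime starts the infra container...
	r.call("CNI ADD")       // ...and invokes the CNI plugins on its netns
	for range containers {
		r.call("CreateContainer")
		r.call("StartContainer")
	}
}

// deletePod mirrors the teardown flow (PreStop hooks elided).
func deletePod(r *fakeRuntime, containers []string) {
	for range containers {
		r.call("StopContainer")
	}
	r.call("StopPodSandbox")
	r.call("CNI DEL") // the runtime passes the netns path here, too
	r.call("RemovePodSandbox")
}

func main() {
	r := &fakeRuntime{}
	createPod(r, []string{"sdn"})
	deletePod(r, []string{"sdn"})
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```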
All well and good, then? But this bug is only triggered after a reboot.
Reboot / reconciliation
An interesting subtlety here is the separate notion of a Pod and PodSandbox object. To Kubelet, a Pod is the object in the Kubernetes API. The Platonic Ideal, as it were. The Form of the Pod. Separately, the PodSandbox represents the “real” machine state needed to run a pod.
Kubelet is, essentially, a giant reconciliation loop that ensures every Pod - the ideal state of the world - is correctly represented on the machine. It will create and delete PodSandboxes as necessary in an attempt to reach the desired state⁵.
Linux containers⁶ do not survive reboots. Thus, Kubelet will notice that the PodSandbox is in a bad state, delete it, and request a new one. All the while, the Pod object itself remains in the apiserver. Kubelet then re-reconciles and re-creates the PodSandbox.
The interesting thing here is that kubelet will issue the StopPodSandbox and RemovePodSandbox CRI calls even if the underlying container is gone. It does this so other resources – whatever they may be – can be cleaned up. For example, many CNI plugins rely on this to release allocated IPs.
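A caricature of that reconciliation loop, in Go. It is hugely simplified and all the names are invented; kubelet’s real sync loop is far more involved:

```go
package main

import "fmt"

type sandboxState int

const (
	ready    sandboxState = iota
	notReady // e.g. its processes died in a reboot
)

// reconcile sketches kubelet's sync loop: for every Pod in the apiserver,
// make the machine match, returning the CRI-level actions it would take.
func reconcile(pods []string, sandboxes map[string]sandboxState) []string {
	var actions []string
	for _, pod := range pods {
		if state, ok := sandboxes[pod]; ok && state != ready {
			// Tear down the dead sandbox first. Crucially, this still
			// triggers a CNI DEL even though the container is long gone.
			actions = append(actions, "StopPodSandbox/RemovePodSandbox "+pod)
			delete(sandboxes, pod)
		}
		if _, ok := sandboxes[pod]; !ok {
			actions = append(actions, "RunPodSandbox (CNI ADD) "+pod)
			sandboxes[pod] = ready
		}
	}
	return actions
}

func main() {
	// Post-reboot: the Pod still exists, but its sandbox is dead.
	actions := reconcile(
		[]string{"sdn-6qckt"},
		map[string]sandboxState{"sdn-6qckt": notReady},
	)
	for _, a := range actions {
		fmt.Println(a)
	}
}
```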
Don’t CRI-O for me
So, we know that CRI-O is calling CNI DEL after a reboot, in response to kubelet recreating PodSandboxes. Let’s look at the CRI-O logs and see what we see:
$ journalctl -b -u crio
...snip...
Error deleting network: failed to Statfs "/proc/5029/ns/net": no such file or directory
After staring at this line for a few minutes, I realized we had our smoking gun. But, to understand why this innocent-looking error message is so bad, let’s take a quick detour into how namespaces are referenced in Linux.
Namespaces in Linux
First of all, read the LWN article series on namespaces, which covers this complicated topic better than I ever could.
Namespaces in Linux are created with the clone() and unshare() system calls. The end effect of both of these calls is a process (or thread) running in a new namespace. By default, namespaces are internal kernel objects (like PIDs or file descriptors). Like many other process attributes, they are exposed via the /proc filesystem. You can see which namespaces a process is referencing with a simple ls:
$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 cgroup -> 'cgroup:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 ipc -> 'ipc:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 mnt -> 'mnt:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 net -> 'net:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 pid -> 'pid:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 pid_for_children -> 'pid:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 time -> 'time:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 time_for_children -> 'time:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 user -> 'user:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 uts -> 'uts:'
Normally, a namespace will disappear when the last process in it exits. This is one of the reasons why Kubernetes needed to use “pause containers”: processes that idle forever, just holding on to the namespaces. Namespaces can be persisted, however, by holding open a file descriptor or bind-mounting the file from proc:
# mount -o bind /proc/1234/ns/net /run/netns/my-cool-container
The log line above references the namespace by its PID. This means CRI-O is passing an “ephemeral” namespace path, rather than a bind-mounted one, to the CNI DEL. That leaves us vulnerable to PID reuse.
Putting it all together
So, the timeline looks like this:
- Kubelet decides to create a PodSandbox
- CRI-O does a CNI ADD with /proc/1234/ns/net being passed to the plugin.
- The node reboots.
- Kubelet does housekeeping, and deletes the now-dead PodSandbox.
- A non-containerized process with pid 1234 starts up.
- CRI-O does a CNI DEL, again passing /proc/1234/ns/net.
- The CNI plugin switches to the namespace at /proc/1234/ns/net, which is actually the host’s network namespace.
- Thinking it’s in the container, the CNI plugin sets the loopback interface down.
I filed this issue on CRI-O, and the upstream team immediately started working on it. They needed to do a decent amount of refactoring. Ultimately, they added the option manage_ns_lifecycle, which enabled the bind-mounting behavior. Later, this became the default, and even later, the config knob was removed.
The immediate fix was to make our CNI plugin do nothing on teardown, since the network namespace is going away anyway. That got CI green.
This is one of the awkward edges of Kubernetes: the only way for the networking provider to say “yes, the node’s network is configured and ready” is to write a configuration file. This is something we’d like to fix in CNI 2.0, as part of a transition to GRPC. ↩︎
The daemon writes its CNI configuration file to disk after it successfully initializes. This causes, ultimately, the node to go Ready. A PreStop hook removes the file on teardown, which results in a NotReady node. ↩︎
Obviously this isn’t supposed to happen. But software has bugs, and some CNI plugins are pretty complicated beasts. Openshift-SDN, for example, relies on OpenVSwitch, which is an external process that can theoretically take down networking if it misbehaves. In reality, it is pretty reliable. ↩︎
It also sets up volumes, secrets, ensures images are pulled, etc. ↩︎
The Linux kernel doesn’t actually have “containers” per se. Rather, what we users call a container is a combination of several independent kernel features, including namespaces, cgroups, and seccomp. ↩︎