This is an expansion of a talk I gave to some fellow OpenShift developers about how I got to the bottom of a particularly nasty bug. Of course, it only showed up in integration tests. The actual bug is quite simple, but to understand it requires some knowledge of most of the underlying Kubernetes node components. So, it’s a useful tour of some of the lower-level parts of Kubernetes networking.
Many thanks to Tyler Neely for his detailed review of this post.
This post assumes you know some basic Kubernetes components, like Pods, Nodes, and containers. If you’d like to learn more about these, check out the documentation for Nodes and Pods, as well as a useful intro to containers.
Some time ago, my colleague merged a minor networking PR into an upstream project used by OpenShift Container Platform. OpenShift has an extensive set of CI test jobs, and this PR merged without any trouble. However, a day later, another team complained that one of their jobs had gone red and it “seemed to be networking…”
This all was a bit odd - their job didn’t have anything to do with networking, and we couldn’t see an obvious connection. And no other CI jobs, of which there are hundreds, were affected. Fortunately, our amazing colleagues (using the sophisticated OpenShift build system) quickly bisected the failure, and the networking PR was somehow to blame. So, however unclear the connection, we knew where to look.
All OpenShift CI jobs start by creating a Kubernetes cluster then executing tests on it. We also assert on certain invariants during the entire CI process. Because this is just a Kubernetes cluster, we can use all the usual tools for inspecting cluster status.
Digging into the test results, we see that one of the cluster nodes has marked itself as NotReady. Let’s see why by retrieving the Kubernetes Node object:
$ kubectl get node <<nodename>> -o yaml
... (snip) ...
  type: Ready
  status: False
  reason: KubeletNotReady
  message: runtime network not ready: NetworkReady=false
    reason:NetworkPluginNotReady message:Network plugin returns error:
    Missing CNI default network
This tells us that:
- The kubelet called the Status function of the CRI API
- The container runtime (in this case, CRI-O) has scanned its CNI configuration directory and found no files¹.
- The runtime returned a failing status
- The kubelet, accordingly, set a NotReady status on its Node object
So the next place we should look is the networking daemon on the node.
Inspecting the pod
In OpenShift, there is a daemon that manages pod networking. Since it’s a normal DaemonSet, we can use the usual tools to investigate it. Looking at the pod, we see that it’s crashing².
$ kubectl get -n openshift-sdn pods -o wide | grep <<nodename>>
sdn-6qckt   1/1   CrashLoopBackoff   10   1h15m   10.0.152.60
Let’s find out what’s going on:
$ kubectl -n openshift-sdn describe pod sdn-6qckt
<<snip>>
Events:
  Type     Reason     Age                   From                   Message
  ----     ------     ----                  ----                   -------
  <<snip>>
  Warning  Unhealthy  37s (x10 over 3m38s)  kubelet, <<nodename>>  Liveness probe failed: Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Okay, the liveness probe is failing. Let’s look at the logs for the process.
$ kubectl -n openshift-sdn logs --tail=5 sdn-6qckt
I0921 07:07:16.378919   12152 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete
I0921 07:07:16.438034   12152 proxier.go:367] userspace proxy: processing 0 service events
I0921 07:07:16.438055   12152 proxier.go:346] userspace syncProxyRules took 59.113113ms
I0921 07:07:16.438066   12152 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete
interrupt: Gracefully shutting down ...
That’s odd - nothing exciting, just a sudden signal “out of the blue,” so to speak. And the liveness probe is a simple HTTP server; it’s pretty unlikely it would fail without other obvious signals. Given that:
- It’s very unlikely that the healthz http server is failing
- Resource usage hasn’t changed; i.e. we’re not hitting memory or file descriptor limits
- The logs would say something
- The offending change doesn’t touch any of this code
I’ll assume the openshift-sdn process is live, and the liveness probe is failing for a different reason.
A Failing Probe
So we have a liveness probe failing. Let’s look at the kubelet logs:
prober.go:112] Liveness probe for "sdn-6qckt_openshift-sdn(8be9db67-dc35-11e9-a4c5-12d5b2c68656):sdn" failed (failure): Get http://10.0.152.60:10256/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
What does this mean? It means the kubelet is trying to connect to the IP specified in Pod.Status.PodIP and issue an HTTP request. That connection times out. Since this is a liveness probe, the failure will cause the kubelet to kill and restart the process. Bummer.
In ordinary circumstances, this means that (more likely) the process is truly hung, or (less likely) something has gone wrong with pod networking such that, for whatever reason, packets aren’t flowing to the pod³.
This case is more curious: openshift-sdn is a hostNetwork Pod. That is to say, it does not have a separate network namespace allocated, and thus shares the networking environment of the host. If pod networking is broken, that won’t break us.

This lack of isolation can mean that other processes – outside the openshift-sdn Pod – can interfere with us. So, a few possible hypotheses for what is happening:
- Maybe another process is listening on the port somehow? netstat -ltnp | grep 10256 returns nothing, so this is wrong.
- What about something wrong with iptables? iptables-save | grep 10256 returns nothing and disproves this.
A Double Take
10.0.152.60 is the node’s primary IP address - the one we are SSH’d to! In other words, we’re not able to connect to our own local address, one we know is up. What the heck is going on!?
Suddenly, I have a recollection from when I was hacking on some low-level namespace silliness for the CNI test suite: loopback is special.
Loopback is special
As you probably know, there is an interface, lo, of type loopback, on every Linux node; it hosts the localhost IPs 127.0.0.1 and ::1. The loopback device is somewhat magical, because it doesn’t really use the full networking stack. Rather, packets are instantly delivered to their destination. This makes it much faster. Using loopback is good.
So, Linux does something clever. If you’re talking to yourself, even if you don’t use the loopback address, you still use the loopback device. Looking at my laptop, wl0 has IP 192.168.0.150:

$ ip addr show dev wl0
3: wl0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.0.150/24 brd 192.168.0.255 scope global dynamic noprefixroute wl0
And a route to the outside world uses the default route:

$ ip route get 1.1.1.1
1.1.1.1 via 192.168.0.1 dev wl0 src 192.168.0.150 uid 1000
    cache
But if we were to connect to the address on wl0, we’d use the loopback device instead:

$ ip route get 192.168.0.150
local 192.168.0.150 dev lo src 192.168.0.150 uid 1000
    cache <local>
I’m (not) talking to the man in the mirror…
So, let’s check our bad node’s loopback interface:

$ ip -brief link show dev lo
lo    DOWN    00:00:00:00:00:00 <LOOPBACK>
Down! We’re a vampire! We can’t see ourselves in the mirror. Bring the interface up, and everything comes back to normal.
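As an aside, the same check can be done from Go’s standard library; this sketch just reports what ip -brief link show dev lo showed us, and is purely illustrative:

```go
package main

import (
	"fmt"
	"net"
)

// loopbackUp reports whether this machine's loopback interface carries
// the UP flag - the flag that was missing on our broken node.
func loopbackUp() (bool, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return false, err
	}
	for _, ifc := range ifaces {
		if ifc.Flags&net.FlagLoopback != 0 {
			return ifc.Flags&net.FlagUp != 0, nil
		}
	}
	return false, fmt.Errorf("no loopback interface found")
}

func main() {
	up, err := loopbackUp()
	if err != nil {
		panic(err)
	}
	fmt.Println("loopback up:", up) // "false" means a vampire node
}
```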
How did we get here?
Let’s review the facts:
- We merged a change that fixed CNI deletion.
- The CNI loopback plugin sets the interface down.
- Nothing else on the host would ever touch loopback.
- Sometimes loopback is down on the host, soon after a reboot.
- This doesn’t happen all the time.
To me, it seems that the CNI loopback plugin is taking down the host’s loopback. How could this happen?
CNI & Kubernetes
Let’s go over the usual lifecycle of a pod, and how CNI fits into it:
- A Pod object is created on the apiserver, and is scheduled to a node
- Kubelet reconciles (more on this later) the Pod’s desired state with the running state, and notices that (among other things⁴) it has no PodSandbox (since it is brand new)
- It issues a RunPodSandboxRequest over the CRI to the container runtime on the node. In this case, the runtime is CRI-O
- CRI-O creates the namespaces, including the network namespace, by starting a process (the “infrastructure container”)
- CRI-O issues a CNI ADD command to the CNI plugin(s), passing them the path to the network namespace
- Once the PodSandbox is configured, Kubelet creates individual containers via CreateContainer and StartContainer CRI methods
Then, when the pod is deleted
- Kubelet notices that the Pod object is being deleted
- It executes any PreStop hooks on running containers (via CRI Exec)
- Kubelet issues any relevant StopContainer CRI requests
- Kubelet issues StopPodSandbox and RemovePodSandbox CRI requests
- CRI-O issues a CNI DEL command to the CNI plugin(s), again passing them the network namespace
- Kubelet updates the Pod object as deleted
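To make the ordering concrete, here’s a toy Go sketch that records the call sequence for both flows. Only the recorded method names correspond to real CRI RPCs (plus the CNI commands the runtime issues alongside them); the fakeRuntime type and everything else is invented for illustration:

```go
package main

import "fmt"

// fakeRuntime is an invented stand-in for the container runtime; it just
// records the order in which things happen.
type fakeRuntime struct{ calls []string }

func (r *fakeRuntime) call(name string) { r.calls = append(r.calls, name) }

// createPod mirrors the sandbox-then-containers creation flow above.
func createPod(r *fakeRuntime, containers []string) {
	r.call("RunPodSandbox") // runtime starts the infra container...
	r.call("CNI ADD")       // ...and invokes the CNI plugins on its netns
	for range containers {
		r.call("CreateContainer")
		r.call("StartContainer")
	}
}

// deletePod mirrors the teardown flow (PreStop hooks elided).
func deletePod(r *fakeRuntime, containers []string) {
	for range containers {
		r.call("StopContainer")
	}
	r.call("StopPodSandbox")
	r.call("CNI DEL") // the runtime passes the netns path here, too
	r.call("RemovePodSandbox")
}

func main() {
	r := &fakeRuntime{}
	createPod(r, []string{"sdn"})
	deletePod(r, []string{"sdn"})
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```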
All well and good, then? But this bug is only triggered after a reboot.
Reboot / reconciliation
An interesting subtlety here is the separate notion of a Pod and PodSandbox object. To Kubelet, a Pod is the object in the Kubernetes API. The Platonic Ideal, as it were. The Form of the Pod. Separately, the PodSandbox represents the “real” machine state needed to run a pod.
Kubelet is, essentially, a giant reconciliation loop that ensures every Pod - the ideal state of the world - is correctly represented on the machine. It will create and delete PodSandboxes as necessary in an attempt to reach the desired state⁵.
Linux containers⁶ do not survive reboots. Thus, Kubelet will notice that the PodSandbox is in a bad state, delete it, and request a new one. All the while, the Pod object itself remains in the apiserver. Kubelet then re-reconciles and re-creates the PodSandbox.
The interesting thing here is that kubelet will issue the StopPodSandbox and RemovePodSandbox CRI calls even if the underlying container is gone. It does this so other resources – whatever they may be – can be cleaned up. For example, many CNI plugins rely on this to release allocated IPs.
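A caricature of that reconciliation loop, in Go. It is hugely simplified and all the names are invented; kubelet’s real sync loop is far more involved:

```go
package main

import "fmt"

type sandboxState int

const (
	ready    sandboxState = iota
	notReady // e.g. its processes died in a reboot
)

// reconcile sketches kubelet's sync loop: for every Pod in the apiserver,
// make the machine match, returning the CRI-level actions it would take.
func reconcile(pods []string, sandboxes map[string]sandboxState) []string {
	var actions []string
	for _, pod := range pods {
		if state, ok := sandboxes[pod]; ok && state != ready {
			// Tear down the dead sandbox first. Crucially, this still
			// triggers a CNI DEL even though the container is long gone.
			actions = append(actions, "StopPodSandbox/RemovePodSandbox "+pod)
			delete(sandboxes, pod)
		}
		if _, ok := sandboxes[pod]; !ok {
			actions = append(actions, "RunPodSandbox (CNI ADD) "+pod)
			sandboxes[pod] = ready
		}
	}
	return actions
}

func main() {
	// Post-reboot: the Pod still exists, but its sandbox is dead.
	actions := reconcile(
		[]string{"sdn-6qckt"},
		map[string]sandboxState{"sdn-6qckt": notReady},
	)
	for _, a := range actions {
		fmt.Println(a)
	}
}
```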
Don’t CRI-O for me
So, we know that CRI-O is calling CNI DEL after a reboot, in response to kubelet recreating PodSandboxes. Let’s look at the CRI-O logs and see what we see:
$ journalctl -b -u crio
...snip...
Error deleting network: failed to Statfs "/proc/5029/ns/net": no such file or directory
After staring at this line for a few minutes, I realized we had our smoking gun. But, to understand why this innocent-looking error message is so bad, let’s take a quick detour into how namespaces are referenced in Linux.
Namespaces in Linux
First of all, read the LWN article series on namespaces, which covers this complicated topic better than I ever could.
Namespaces in Linux are created with the clone() and unshare() system calls. The end effect of both of these calls is a process (or thread) running in a new namespace. By default, namespaces are internal kernel objects (like PIDs or file descriptors). Like many other process attributes, they are exposed via the /proc filesystem. You can see which namespaces a process is referencing with a simple ls:
$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 cgroup -> 'cgroup:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 ipc -> 'ipc:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 mnt -> 'mnt:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 net -> 'net:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 pid -> 'pid:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 pid_for_children -> 'pid:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 time -> 'time:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 time_for_children -> 'time:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 user -> 'user:'
lrwxrwxrwx. 1 c1 c1 0 Jan 15 13:59 uts -> 'uts:'
Normally, a namespace will disappear when the last process in it exits. This is one of the reasons why Kubernetes needed to use “pause containers”: processes that idle forever, just holding on to the namespaces. Namespaces can be persisted, however, by holding open a file descriptor or bind-mounting the file from proc:
# mount -o bind /proc/1234/ns/net /run/netns/my-cool-container
The log line above references the namespace by its PID. This means CRI-O is passing an “ephemeral” namespace path, rather than a bind-mounted one, to the CNI DEL. That leaves us vulnerable to PID reuse.
Putting it all together
So, the timeline looks like this:
- Kubelet decides to create a PodSandbox
- CRI-O does a CNI ADD with /proc/1234/ns/net being passed to the plugin.
- The node reboots.
- Kubelet does housekeeping, and deletes the now-dead PodSandbox.
- A non-containerized process with pid 1234 starts up.
- CRI-O does a CNI DEL, again passing /proc/1234/ns/net.
- The CNI plugin switches to the namespace at /proc/1234/ns/net, which is actually the host’s network namespace.
- Thinking it’s in the container, the CNI plugin sets the loopback interface down.
I filed this issue on CRI-O, and the upstream team immediately started working on it. They needed to do a decent amount of refactoring. Ultimately, they added the option manage_ns_lifecycle, which enabled the bind-mounting behavior. Later, this became the default, and even later, the config knob was removed.
The immediate fix was to make our CNI plugin do nothing on teardown, since the network namespace is going away anyway. That got CI green.
This is one of the awkward edges of Kubernetes: the only way for the networking provider to say “yes, the node’s network is configured and ready” is to write a configuration file. This is something we’d like to fix in CNI 2.0, as part of a transition to GRPC. ↩︎
The daemon writes its CNI configuration file to disk after it successfully initializes. This causes, ultimately, the node to go Ready. A PreStop hook removes the file on teardown, which results in a NotReady node. ↩︎
Obviously this isn’t supposed to happen. But software has bugs, and some CNI plugins are pretty complicated beasts. Openshift-SDN, for example, relies on OpenVSwitch, which is an external process that can theoretically take down networking if it misbehaves. In reality, it is pretty reliable. ↩︎
It also sets up volumes, secrets, ensures images are pulled, etc. ↩︎
The Linux kernel doesn’t actually have “containers” per se. Rather, what we users call a container is a combination of several independent kernel features, including namespaces, cgroups, and seccomp. ↩︎