
Kubernetes Eviction of Ephemeral Storage: Troubleshoot OOMKilled Pods That Won't Restart

Introduction

According to the Kubernetes documentation, when a container in a pod consumes more memory than its limit, the container is killed. A Pod killed by OOM (an OOMKilled Pod) will be restarted if the restart policy allows:1

If the Container continues to consume memory beyond its limit, the Container is terminated. If a terminated Container can be restarted, the kubelet restarts it, as with any other type of runtime failure.

However, sometimes an OOMKilled Pod gets stuck and the Kubernetes scheduler starts another pod without restarting the OOMKilled one. This article records a scenario in which a pod's ephemeral-storage usage can cause a killed pod not to be restarted.

The Scenario

In most cases, when a pod is killed by OOM, the pod will be restarted. We can see this behaviour in the following kubectl get pods output after running the example in the Kubernetes documentation:1

$ kubectl get pods -w
NAME                            READY   STATUS      RESTARTS     AGE
oom-demo-766c8d556d-2vm85       0/1     OOMKilled          11 (5m10s ago)   31m
oom-demo-766c8d556d-2vm85       0/1     CrashLoopBackOff   11 (15s ago)     31m
oom-demo-766c8d556d-2vm85       0/1     OOMKilled          12 (5m9s ago)    36m
oom-demo-766c8d556d-2vm85       0/1     CrashLoopBackOff   12 (15s ago)     36m
oom-demo-766c8d556d-2vm85       0/1     OOMKilled          13 (5m7s ago)    41m
oom-demo-766c8d556d-2vm85       0/1     CrashLoopBackOff   13 (14s ago)     41m
oom-demo-766c8d556d-2vm85       0/1     OOMKilled          14 (5m5s ago)    46m

However, there are situations where an OOMKilled pod is not restarted and stays stuck in the OOMKilled or Error status; instead, the Kubernetes scheduler starts a new pod:

$ kubectl get pods -o wide
NAME                          READY   STATUS        RESTARTS       AGE     IP               NODE
oom-demo-6f7747887b-7vx6v     0/1     Error               1               68s     192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-m8bh5     1/1     Running             0               1s      192.167.28.227   ip-192-167-19-202.ec2.internal

But why is the killed pod not restarted, when the RestartPolicy of a deployment's pods should be Always2 and the kubelet should automatically restart the container after any termination?3
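As a quick sanity check, we can read the restart policy directly from the pod spec (pod name taken from the example above); for a Deployment-managed pod this should print Always:

$ kubectl get pod oom-demo-6f7747887b-7vx6v -o jsonpath='{.spec.restartPolicy}'
# prints: Always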

Explanation

This can happen when the pod is evicted due to reaching the node’s ephemeral storage limit. Ephemeral storage is the temporary storage used by a pod for storing logs, data, and other temporary files. Each node has a limit on the amount of ephemeral storage that can be used by pods running on that node. When a pod’s ephemeral storage usage exceeds this limit, the pod is evicted from the node.
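Note that, besides the node-level threshold, Kubernetes also lets us declare ephemeral-storage requests and limits per container, which bounds a single pod's usage before it can put pressure on the whole node. A minimal sketch (the pod name, image, and sizes below are illustrative, not taken from this scenario):

# Illustrative pod with explicit ephemeral-storage requests and limits
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-demo          # example name
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          ephemeral-storage: "1Gi"       # scheduler places the pod on a node with this much free
        limits:
          ephemeral-storage: "2Gi"       # exceeding this gets the pod evicted by the kubelet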

When a pod is evicted due to exceeding the ephemeral storage limit, the kubelet marks the pod as “Failed” with the reason “Evicted.” The scheduler then assumes that the pod cannot be restarted on the same node and starts a new pod instead.
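Because such pods stay in the Failed phase, they can be listed across the cluster with a field selector, for example (a generic command, not tied to this scenario):

$ kubectl get pods -A --field-selector=status.phase=Failed \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason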

While the killed pod still exists in the cluster, we can check it with kubectl describe pod:

$ kubectl describe pod oom-demo-6f7747887b-7vx6v
...
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 8583014118, available: 554332Ki.
...
  Warning  Evicted              8m18s                  kubelet            The node was low on resource: ephemeral-storage. Threshold quantity: 8583014118, available: 554332Ki.
  Normal   Killing              8m18s                  kubelet            Stopping container memory-demo-2-ctr
  Warning  ExceededGracePeriod  8m8s                   kubelet            Container runtime did not kill the pod within specified grace period.

Here we can see that the pod was actually evicted because the node's ephemeral-storage usage reached the eviction threshold. The threshold is 8583014118 bytes (about 8 GB) on a node with an 80 GB disk, i.e. 10% of the node's disk size.
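That 10% figure matches the kubelet's default hard eviction threshold for the node filesystem (nodefs.available<10%). On self-managed nodes these thresholds can be tuned in the kubelet configuration; a sketch, assuming the documented defaults (the values shown are the defaults, adjust as needed):

# KubeletConfiguration fragment: hard eviction thresholds (documented defaults)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"        # node filesystem free space; the ephemeral-storage threshold seen above
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"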

If the pod is running on EKS with control plane logging enabled,4 we can also query the control plane logs in CloudWatch Logs Insights to see the message:

fields @timestamp, verb, requestURI, objectRef.namespace, objectRef.resource, objectRef.name , userAgent, requestObject.status.message
    | filter verb not in ["get", "list", "watch"]
    | filter objectRef.name  like "oom-demo-6f7747887b-7vx6v" or objectRef.name like "oom-demo-6f7747887b-m8bh5"
@timestamp | verb | requestURI | objectRef.resource | objectRef.name | userAgent | requestObject.status.message
2024-08-27 14:51:04.952 | patch | /api/v1/namespaces/default/pods/oom-demo-6f7747887b-m8bh5/status | pods | oom-demo-6f7747887b-m8bh5 | kubelet/v1.29.5 (linux/amd64) kubernetes/1109419 |
2024-08-27 14:51:04.951 | create | /api/v1/namespaces/default/pods/oom-demo-6f7747887b-m8bh5/binding | pods | oom-demo-6f7747887b-m8bh5 | kube-scheduler/v1.29.6 (linux/arm64) kubernetes/c978c80/scheduler |
2024-08-27 14:51:04.950 | patch | /api/v1/namespaces/default/pods/oom-demo-6f7747887b-7vx6v/status | pods | oom-demo-6f7747887b-7vx6v | kubelet/v1.29.5 (linux/amd64) kubernetes/1109419 | The node was low on resource: ephemeral-storage. Threshold quantity: 8583014118, available: 554332Ki.

Lab

To reproduce this scenario, we can create a deployment whose container consumes a large amount of ephemeral storage and memory.

  1. Create a deployment with a stress container:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oom-demo
  template:
    metadata:
      labels:
        app: oom-demo
    spec:
      containers:
        - name: memory-demo
          image: polinux/stress
          resources:
            requests:
              memory: "50Mi"
            limits:
              memory: "100Mi"
          command: ["sleep"]
          args: ["3600"]
  2. To consume the ephemeral storage and cause the pod to be killed by OOM, exec into the container, create a large file with fallocate to fill the ephemeral storage, and then trigger the OOM with stress, as follows:
$ kubectl exec -it oom-demo-6f7747887b-7vx6v -- /bin/bash

bash-5.0# df -h; fallocate -l 75G storage.log; df -h; stress --vm 1 --vm-bytes 250M --vm-hang 1
Filesystem                Size      Used Available Use% Mounted on
overlay                  79.9G      4.4G     75.5G   6% /

Filesystem                Size      Used Available Use% Mounted on
overlay                  79.9G     79.4G    549.2M  99% /

stress: info: [16] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
command terminated with exit code 137

At the same time, we can see that the pod oom-demo-6f7747887b-7vx6v was OOM killed; in this case it was restarted once, then entered the Error status without any further restarts, and a new pod was created:

$ kubectl get pods -o wide -w
NAME                                                 READY   STATUS        RESTARTS       AGE     IP               NODE
oom-demo-6f7747887b-7vx6v                            1/1     Running       0              6s      192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-7vx6v                            0/1     OOMKilled     0              29s     192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-7vx6v                            1/1     Running       1 (1s ago)      30s     192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-7vx6v                            0/1     Error         1 (39s ago)     68s     <none>           ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-m8bh5                            0/1     Pending       0               0s      <none>           <none>
oom-demo-6f7747887b-m8bh5                            0/1     ContainerCreating   0               0s      <none>           ip-192-167-19-202.ec2.internal
oom-demo-6f7747887b-7vx6v                            0/1     Error               1               68s     192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-m8bh5                            1/1     Running             0               1s      192.167.28.227   ip-192-167-19-202.ec2.internal

Summary

In summary, when a Kubernetes pod is evicted due to exceeding the node’s ephemeral storage limit, the kubelet marks the pod as “Failed” with the reason “Evicted.” The scheduler then assumes that the pod cannot be restarted on the same node and starts a new pod instead of restarting the evicted pod. To prevent this issue and ensure that the node has sufficient ephemeral storage capacity, we can monitor pod disk usage, rotate the pod’s logs/data, or even use extra storage such as a PV/PVC.
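For instance, pointing the container's scratch directory at a generic ephemeral volume backed by a PVC keeps large temporary files off the node's ephemeral storage. A sketch (the pod name, mount path, storage class, and size are illustrative):

# Illustrative Pod using a generic ephemeral volume (PVC created and deleted with the Pod)
# so scratch data does not count against the node's ephemeral storage.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-volume-demo            # example name
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      volumeMounts:
        - name: scratch
          mountPath: /scratch          # write large temporary files here instead of the container rootfs
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: gp3      # assumed storage class; adjust for the cluster
            resources:
              requests:
                storage: 20Gi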

In the past, I thought a pod killed by OOM would always be restarted by the kubelet. From this scenario I learned that a pod that keeps restarting and then triggers an eviction continues to exist in the cluster while the scheduler has already started a new pod, and the evicted pod needs to be deleted manually.
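A quick way to clean up such leftover pods is to delete everything in the Failed phase, for example (a generic command; review what it matches before running it in a real cluster):

# Delete pods stuck in the Failed phase (e.g. Evicted pods) in the current namespace
$ kubectl delete pods --field-selector=status.phase=Failed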

References

This post is licensed under CC BY 4.0 by the author.