Before proceeding, the Descheduler Operator must be installed.
WARNING
Do not run this in a LIVE cluster. The cluster should be dedicated to these specific tests, as the Descheduler will EVICT running Pods every 1 minute once they are older than 5m.
WARNING
Per the documentation:
RemovePodsHavingTooManyRestarts: removes pods whose containers have been restarted too many times, i.e. Pods where the sum of restarts over all containers (including Init Containers) is more than 100.
We have to wait for 100 restarts. It is not configurable through the operator.
Backoff times are hardcoded in Kubernetes, as described in kubernetes/kubernetes#57291, and cannot be overridden.
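Since the CrashLoopBackOff delay is fixed (10s initial delay, doubling, capped at 5m), you can roughly estimate how long it takes to accumulate 100 restarts. A small sketch (it ignores the brief run time of each failing container, so the real number skews higher):

```shell
#!/bin/sh
# Estimate wall-clock time to accumulate 100 restarts under the kubelet's
# hardcoded CrashLoopBackOff: 10s initial delay, doubling, capped at 300s.
delay=10
total=0
n=0
while [ "$n" -lt 100 ]; do
  total=$((total + delay))
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
  n=$((n + 1))
done
echo "${total}s (~$((total / 3600))h) to reach 100 restarts"
```

This prints roughly 8 hours of pure backoff time, which lines up with the ~10 hrs observed below once container run time and scheduling overhead are added.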
There are essentially two tests included:
- spec.containers restart 100 times and are evicted
- spec.initContainers restart 100 times and are evicted.
This test takes a LONG time: approximately 10 hrs for each case. Feel free to RUN them in parallel with slight modifications to the deployment names.
- Update the LifecycleAndUtilization Policy to cover spec.containers failures.
$ oc apply -n openshift-kube-descheduler-operator -f files/3_LifecycleAndUtilization_RemovePodsHavingTooManyRestarts.yml
kubedescheduler.operator.openshift.io/cluster created
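The referenced file is not reproduced here. A minimal sketch of what it likely contains, assuming the operator's KubeDescheduler CR with the LifecycleAndUtilization profile, a 60-second descheduling interval (matching the 1-minute eviction cadence in the warning above), and a 5m podLifetime (field names are assumptions from the Descheduler Operator's CRD):

```yaml
# Hypothetical sketch of 3_LifecycleAndUtilization_RemovePodsHavingTooManyRestarts.yml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 60   # evaluate evictions every 1 minute
  profiles:
  - LifecycleAndUtilization
  profileCustomizations:
    podLifetime: 5m                 # evict pods older than 5m
```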
- Check the configmap to see the Descheduler Policy.
$ oc -n openshift-kube-descheduler-operator get cm cluster -o=yaml
This ConfigMap shows the excluded namespaces and the podsHavingTooManyRestarts.podRestartThreshold: 100 setting.
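The exact shape of the rendered policy depends on the operator version; a hedged sketch of the relevant fragment, assuming the descheduler's v1alpha1 strategies format (the key names beyond podRestartThreshold are assumptions from the upstream descheduler's parameters):

```yaml
# Sketch of the relevant portion of the rendered DeschedulerPolicy
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100
        includingInitContainers: true   # needed for the second test below
```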
- Check the descheduler cluster pod's logs
$ oc -n openshift-kube-descheduler-operator logs -l app=descheduler
This log should show a started Descheduler.
- Create a test namespace
$ oc get namespace test || oc create namespace test
namespace/test created
- Create a Deployment that fails.
$ oc -n test apply -f files/3_LifecycleAndUtilization_RemovePodsHavingTooManyRestarts_dp.yml
deployment.apps/demo created
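The deployment manifest is not reproduced here. A plausible stand-in is a small Deployment whose only container exits non-zero immediately, driving the restart count toward the 100-restart threshold (the image and names below are assumptions):

```yaml
# Hypothetical sketch of 3_LifecycleAndUtilization_RemovePodsHavingTooManyRestarts_dp.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: crash
        image: registry.access.redhat.com/ubi9/ubi-minimal
        command: ["/bin/sh", "-c", "exit 1"]   # always fails, forcing restarts
```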
- Double check that it is scheduled to nodes:
$ oc -n test get pods
NAME READY STATUS RESTARTS AGE
demo-8z7lh 0/1 Error 2 (23s ago) 44s
demo-t9tps 0/1 Error 2 (23s ago) 44s
- You need to monitor the Pods over the next few hours to watch the STATUS and RESTARTS columns.
$ oc -n test get pods
NAME READY STATUS RESTARTS AGE
demo-64d9c84b75-665nv 0/1 CrashLoopBackOff 9 (4m52s ago) 26m
demo-64d9c84b75-7hxdv 0/1 CrashLoopBackOff 9 (5m2s ago) 26m
The CrashLoopBackOff delay is not tunable, so it will take some time to hit 100 restarts.
- Once you see a new set of pods (for instance, demo-64d9c84b75-665nv replaced by demo-64d9c84b75-dgp5b), the Eviction has happened, and it should show up in the logs. Wait for the logs to be updated.
$ oc -n openshift-kube-descheduler-operator logs -l app=descheduler --since=10h --tail=20000 > out.log
- Scan out.log for the following lines:
I0511 10:23:30.306748 1 evictions.go:160] "Evicted pod" pod="test/demo-64d9c84b75-665nv" reason="TooManyRestarts"
I0511 10:25:30.384381 1 evictions.go:160] "Evicted pod" pod="test/demo-64d9c84b75-7hxdv" reason="TooManyRestarts"
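A quick way to confirm the evictions is to grep the saved log for the eviction reason. The heredoc below reproduces the two lines captured above so the command can be tried without a cluster; point grep at your real out.log instead:

```shell
#!/bin/sh
# Count TooManyRestarts evictions in the saved descheduler log.
cat > out.log <<'EOF'
I0511 10:23:30.306748 1 evictions.go:160] "Evicted pod" pod="test/demo-64d9c84b75-665nv" reason="TooManyRestarts"
I0511 10:25:30.384381 1 evictions.go:160] "Evicted pod" pod="test/demo-64d9c84b75-7hxdv" reason="TooManyRestarts"
EOF
grep -c 'reason="TooManyRestarts"' out.log   # prints the eviction count
```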
You've now seen the TooManyRestarts policy take action on spec.containers failures.
- Delete the Deployment demo (we're going to re-use the name)
$ oc -n test delete deployment.apps/demo
deployment.apps "demo" deleted
- Exercise the LifecycleAndUtilization Policy with spec.initContainers failures by creating a Deployment whose init container fails.
$ oc -n test apply -f files/3_LifecycleAndUtilization_RemovePodsHavingTooManyRestarts_init.yml
deployment.apps/demo created
- Double check that it is started:
$ oc -n test get pods
NAME READY STATUS RESTARTS AGE
demo-64d9c84b75-dgp5b 0/1 Error 2 (23s ago) 44s
demo-64d9c84b75-mrwd2 0/1 Error 2 (23s ago) 44s
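The init-container variant likely differs from the first Deployment only in where the failure happens. A hedged sketch (image and names are assumptions; restarts of init containers also count toward the 100-restart threshold):

```yaml
# Hypothetical sketch of 3_LifecycleAndUtilization_RemovePodsHavingTooManyRestarts_init.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      initContainers:
      - name: crash-init
        image: registry.access.redhat.com/ubi9/ubi-minimal
        command: ["/bin/sh", "-c", "exit 1"]        # init container always fails
      containers:
      - name: app
        image: registry.access.redhat.com/ubi9/ubi-minimal
        command: ["/bin/sh", "-c", "sleep infinity"] # never reached
```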
- Wait for ~10 hrs; monitor the Pods and follow the descheduler cluster pod's logs.
$ oc -n test get pods
NAME READY STATUS RESTARTS AGE
demo-64d9c84b75-dgp5b 0/1 Init:CrashLoopBackOff 27 (63s ago) 114m
demo-64d9c84b75-mrwd2 0/1 Init:CrashLoopBackOff 26 (4m53s ago) 112m
- Once you see a new set of pods created, the Eviction has happened, and it should show up in the logs. Wait on the logs to be updated.
$ oc -n openshift-kube-descheduler-operator logs -l app=descheduler --since=10h --tail=20000 > out.log
- Scan out.log for the following lines:
I0511 10:23:30.306748 1 evictions.go:160] "Evicted pod" pod="test/demo-64d9c84b75-zlw5r" reason="TooManyRestarts"
I0511 10:25:30.384381 1 evictions.go:160] "Evicted pod" pod="test/demo-64d9c84b75-v46kq" reason="TooManyRestarts"
- Delete the Deployment demo (we're going to re-use the name)
$ oc -n test delete deployment.apps/demo
deployment.apps "demo" deleted
The RemovePodsHavingTooManyRestarts strategy has now been exercised with both configurations that force TooManyRestarts evictions: spec.containers and spec.initContainers failures.
I did look into using a custom backoff; it is hardcoded in Kubernetes and OKD. If you want to learn more, see kubernetes/kubernetes#57291, referenced above.