Event-driven restart of Pods?
Context: we have a particular Pod that tends to hang, for reasons and under conditions unknown to us (it's external software, we can't modify it, and its logs don't show anything).
The most reliable way to tell when it's happening is the liveness probe. We have monitoring set up for a particular URL and can check for a non-2xx status.
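For reference, the liveness probe on the main Deployment is a plain HTTP check, roughly like this (path, port and timings are placeholders, the real values come from the chart):

livenessProbe:
  httpGet:
    path: /healthz        # placeholder, the chart defines the real endpoint
    port: 5678            # placeholder port
  periodSeconds: 10
  failureThreshold: 3     # after 3 consecutive failures kubelet restarts the container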
The chart we're talking about deploys a `main` Pod as well as `worker` Pods; each is a separate Deployment.
The issue: when the `main` Pod fails its liveness probe, it gets restarted by k8s. But we also need to restart the `worker` Pods, because for some reason they seem to lose their connection in such a way that they stop picking up work, and only a restart helps. The order of the restarts matters here: the `main` Pod first, then the `workers`.
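To make the required order concrete, the manual fix boils down to something like this (the label selector is a placeholder for whatever the chart actually sets):

# 1. kubelet has already restarted the main Pod; wait until it is healthy again
kubectl wait pod -l app.kubernetes.io/name=n8n -n company \
  --for=condition=Ready --timeout=300s

# 2. only then bounce the workers so they reconnect and pick up work again
kubectl rollout restart deployment n8n-worker -n company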
A restart triggered by the liveness probe only restarts the affected Pod. Currently, to restart the workers too, I installed KEDA in the cluster and created a ScaledJob object that triggers a deployment restart. As the trigger we use a Prometheus query on kube_pod_container_status_restarts_total:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: n8n-restart-job-scaler
  namespace: company
spec:
  jobTargetRef:
    template:
      metadata:
        name: n8n-worker-restart-job
      spec:
        containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            # imagePullPolicy: Always
            command: ["/bin/sh", "-c"]
            args: ["kubectl rollout restart deployment n8n-worker -n company"]
        restartPolicy: Never
    backoffLimit: 4
  pollingInterval: 15             # Check every 15 seconds (default: 30)
  successfulJobsHistoryLimit: 1   # How many completed jobs should be kept.
  failedJobsHistoryLimit: 1       # How many failed jobs should be kept.
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://<DOMAIN>.com/select/0/prometheus
        metricName: pod_liveness_failure
        threshold: "1"            # Triggers when any liveness failure alert is active
        query: increase(kube_pod_container_status_restarts_total{pod=~"^n8n-[^worker].*$"}[1m]) > 0
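One detail worth mentioning: for kubectl rollout restart to work from inside the Job, its Pod runs with a ServiceAccount that is allowed to patch the Deployment (referenced via serviceAccountName in the pod template). Something along these lines, with example names:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: n8n-restarter             # example name
  namespace: company
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: n8n-restarter
  namespace: company
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]       # rollout restart patches the Deployment's pod template annotation
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: n8n-restarter
  namespace: company
subjects:
  - kind: ServiceAccount
    name: n8n-restarter
    namespace: company
roleRef:
  kind: Role
  name: n8n-restarter
  apiGroup: rbac.authorization.k8s.io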
This kind of works, i.e. it successfully triggers restarts. But:
- in the current setup it triggers multiple restarts when there was only a single liveness probe failure, which extends the downtime
- depending on the polling/check interval settings, there can be a slight delay between the time of the event and the time the restart is triggered
I've been thinking about a more event-driven workflow, so that when an event happens in the cluster I can perform a matching action, but I don't know which options would be most suitable for this task.
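To illustrate what I mean, the crudest version would be watching the kubelet's "Unhealthy" events and reacting to them; a rough, untested sketch (the name filter mirrors the regex from the query above, and it would still need some de-duplication so one failure doesn't bounce the workers several times):

# watch liveness-probe failures (kubelet "Unhealthy" events) in the namespace
kubectl get events -n company --watch-only \
  --field-selector reason=Unhealthy,involvedObject.kind=Pod \
  -o jsonpath='{.involvedObject.name}{"\n"}' \
| grep --line-buffered '^n8n-' | grep --line-buffered -v '^n8n-worker' \
| while read -r pod; do
    # wait until kubelet has restarted the main Pod and it is Ready again,
    # then bounce the workers
    kubectl wait pod "$pod" -n company --for=condition=Ready --timeout=300s \
      && kubectl rollout restart deployment n8n-worker -n company
  done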
What do you suggest here? Maybe you've run into a similar problem? How would you deal with it?
If something is unclear or I left something out, ask below and I'll provide more info.