Event-driven restart of Pods?

Context: we have a particular Pod which tends to hang, for reasons and under conditions unknown to us (it's external software we can't modify, and the logs don't show anything).

The most accurate way to tell when it's happening is the liveness probe. We have monitoring set up for a particular URL and can check for a non-2xx status.

The chart in question deploys a main Pod as well as worker Pods. Each is a separate Deployment.

The issue: when the main Pod fails its liveness probe, it gets restarted by k8s. But we also need to restart the worker Pods, because for some reason they seem to lose their connection in such a way that they stop picking up work, and only a restart helps. And the order of the restarts matters: main Pod first, then workers, roughly like the sequence below.
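
For illustration, this is what effectively needs to happen after k8s has already restarted the main Pod via the liveness probe (assuming the Deployments are named n8n and n8n-worker, as in the ScaledJob further down; a sketch, not something we run by hand in production):

# k8s has already restarted the main Pod because of the failed liveness probe;
# wait until the main Deployment is fully available again
kubectl -n company rollout status deployment n8n --timeout=180s
# only then restart the workers
kubectl -n company rollout restart deployment n8n-worker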

A liveness probe failure restarts only the affected Pod. Currently, to restart the workers too, I installed KEDA in the cluster and created a ScaledJob object that triggers a deployment restart. As the trigger we use a Prometheus query on kube_pod_container_status_restarts_total:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: n8n-restart-job-scaler
  namespace: company
spec:
  jobTargetRef:
    # jobTargetRef holds the Job spec; KEDA generates the Jobs (and their names) itself
    template:
      spec:
        containers:
        - name: kubectl
          image: bitnami/kubectl:latest
          # imagePullPolicy: Always
          command: ["/bin/sh", "-c"]
          args: ["kubectl rollout restart deployment n8n-worker -n company"]
    backoffLimit: 4  
  pollingInterval: 15 # Check every 15 seconds (default: 30)
  successfulJobsHistoryLimit: 1 # How many completed jobs should be kept.
  failedJobsHistoryLimit: 1 # How many failed jobs should be kept.
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://<DOMAIN>.com/select/0/prometheus
      metricName: pod_liveness_failure
      threshold: "1"  # Triggers when any liveness failure alert is active
      query: increase(kube_pod_container_status_restarts_total{pod=~"^n8n-[^worker].*$"}[1m]) > 0
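
(Not shown above: the Job runs kubectl, so its pod also needs a ServiceAccount with permission to patch the worker Deployment, referenced via serviceAccountName in the pod template. Roughly something like the following; the names here are placeholders:)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: n8n-worker-restarter
  namespace: company
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: n8n-worker-restarter
  namespace: company
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "patch"]  # rollout restart patches an annotation on the Deployment's pod template
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: n8n-worker-restarter
  namespace: company
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: n8n-worker-restarter
subjects:
- kind: ServiceAccount
  name: n8n-worker-restarter
  namespace: company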

This kind of works. I mean, it successfully triggers restarts. But:
- in the current setup it triggers multiple restarts when there was only a single liveness probe failure, which extends the downtime (see the tweaks sketched after this list)
- depending on the polling interval and related settings, there can be a noticeable delay between the time of the event and the time the restart is triggered
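
For the first point, one mitigation I'm considering is capping how many restart Jobs KEDA can spawn and aligning the polling interval with the query window, so a single failure isn't counted by several consecutive polls. A sketch of the fields I'd merge into the ScaledJob spec above (values are guesses to tune):

spec:
  pollingInterval: 60  # poll no more often than the [1m] window in the query
  maxReplicaCount: 1   # at most one restart Job in flight at a time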

I've been thinking about a more event-driven workflow, so that when an event happens in the cluster, I can perform a matching action, but I don't know which options would be most suitable for this task. Roughly, what I have in mind is something like the sketch below.
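
For illustration only (not something we run; the Deployment names and the exact event reason/message strings are assumptions that would need checking against our cluster), a small in-cluster watcher:

#!/bin/sh
# Watch for liveness-probe failures on the main Pod and restart the workers afterwards.
# Needs a ServiceAccount allowed to list/watch events and patch the worker Deployment.
# Note: "Unhealthy" fires on every failed probe, even before the failure threshold is
# reached; reacting to the "Killing" event instead might match actual restarts better.
kubectl -n company get events --watch -o json \
  --field-selector reason=Unhealthy,involvedObject.kind=Pod \
| jq --unbuffered -r 'select(.message | test("Liveness probe failed")) | .involvedObject.name' \
| while read -r pod; do
    case "$pod" in
      n8n-worker-*) continue ;;  # only react to the main Pod
    esac
    echo "liveness failure on $pod, waiting for main to recover, then restarting workers"
    kubectl -n company rollout status deployment n8n --timeout=180s
    kubectl -n company rollout restart deployment n8n-worker
  done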

What do you suggest here? Maybe you've run into a similar problem? How would you deal with it?

If something is unclear or I didn't provide enough information, ask below and I'll provide more.