Add liveness probe to FRR container to detect unrecoverable daemon failures#455
Add liveness probe to FRR container to detect unrecoverable daemon failures#455andreaskaris wants to merge 3 commits into
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a liveness probe to the FRR container to improve system reliability. By monitoring the health of the FRR daemon, Kubernetes can now automatically detect and recover from persistent failures that were previously left in a broken state, effectively addressing issues where the container remained running despite daemon crashes. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
6f74dc9 to
bdc89b4
Compare
There was a problem hiding this comment.
Code Review
This pull request implements liveness probes for the FRR container across multiple deployment configurations and Helm templates to ensure the container restarts if the routing daemon becomes unhealthy. The reviewer feedback identifies that the 1-second timeout for the /healthz endpoint is likely too restrictive, as the underlying health check relies on commands that can take up to 10 seconds under heavy load, and suggests increasing the timeout to 10 seconds to prevent premature container restarts.
| port: 9080 | ||
| initialDelaySeconds: 15 | ||
| periodSeconds: 20 | ||
| timeoutSeconds: 1 |
There was a problem hiding this comment.
The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml (line 66), vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.
timeoutSeconds: 10References
- Ensure correctness and proper handling of edge cases, such as slow health checks under load.
There was a problem hiding this comment.
I agree with the bot that 1 second is way too optimistic.
| port: 9080 | ||
| initialDelaySeconds: 15 | ||
| periodSeconds: 20 | ||
| timeoutSeconds: 1 |
There was a problem hiding this comment.
The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.
timeoutSeconds: 10References
- Ensure correctness and proper handling of edge cases, such as slow health checks under load.
| port: 9080 | ||
| initialDelaySeconds: 15 | ||
| periodSeconds: 20 | ||
| timeoutSeconds: 1 |
There was a problem hiding this comment.
The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.
timeoutSeconds: 10References
- Ensure correctness and proper handling of edge cases, such as slow health checks under load.
| port: 9080 | ||
| initialDelaySeconds: 15 | ||
| periodSeconds: 20 | ||
| timeoutSeconds: 1 |
There was a problem hiding this comment.
The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.
timeoutSeconds: 10References
- Ensure correctness and proper handling of edge cases, such as slow health checks under load.
| port: 9080 | ||
| initialDelaySeconds: 15 | ||
| periodSeconds: 20 | ||
| timeoutSeconds: 1 |
There was a problem hiding this comment.
The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.
timeoutSeconds: 10References
- Ensure correctness and proper handling of edge cases, such as slow health checks under load.
maiqueb
left a comment
There was a problem hiding this comment.
thoughts @andreaskaris / @fedepaol ?
| port: 9080 | ||
| initialDelaySeconds: 15 | ||
| periodSeconds: 20 | ||
| timeoutSeconds: 1 |
There was a problem hiding this comment.
I agree with the bot that 1 second is way too optimistic.
|
from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ? @fedepaol do you think that would make sense ? |
When an FRR daemon such as zebra crashes in a way that watchfrr cannot recover by itself, the FRR container keeps running but remains in a broken state. Add an HTTP liveness probe to the FRR container that queries the reloader's /healthz endpoint (port 9080) so that Kubernetes detects the persistent failure and restarts the pod, recovering all FRR daemons. Fixes openperouter#454 Signed-off-by: Andreas Karis <ak.karis@gmail.com>
bdc89b4 to
5d4edcc
Compare
|
Regardless of exposing or not values to CI, I checked the data on my test pr with the crash, and I can see the recovery. But Fede also added |
yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too. |
+1 on realistic defaults. Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low. |
I'd look for a run to see how long vtysh takes normally and double that. The comment in values referred by gemini is probably related to an old frr version |
from some logs of a CI run, I am seeing always less than a second. Let's put 5 to be on the safe side. |
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
@andreaskaris this is pending. |
Is this a BUG FIX or a FEATURE ?:
/kind bug
What this PR does / why we need it:
When an FRR daemon such as zebra crashes in a way that watchfrr cannot recover by itself, the FRR container keeps running but remains in a broken state. This adds an HTTP liveness probe to the FRR container that queries the reloader's
/healthzendpoint (port 9080) so that Kubernetes detects the persistent failure and restarts the pod, recovering all FRR daemons.Fixes #454
Special notes for your reviewer:
The FRR container reuses the same
/healthzprobe endpoint already served by the reloader container. If FRR's health is not o.k., both the frr and the reloader container will be restarted.Release note:
AI Guidelines Acknowledgment: