Skip to content

Add liveness probe to FRR container to detect unrecoverable daemon failures#455

Open
andreaskaris wants to merge 3 commits into
openperouter:mainfrom
andreaskaris:frr-liveness
Open

Add liveness probe to FRR container to detect unrecoverable daemon failures#455
andreaskaris wants to merge 3 commits into
openperouter:mainfrom
andreaskaris:frr-liveness

Conversation

@andreaskaris
Copy link
Copy Markdown
Contributor

Is this a BUG FIX or a FEATURE ?:

/kind bug

What this PR does / why we need it:

When an FRR daemon such as zebra crashes in a way that watchfrr cannot recover by itself, the FRR container keeps running but remains in a broken state. This adds an HTTP liveness probe to the FRR container that queries the reloader's /healthz endpoint (port 9080) so that Kubernetes detects the persistent failure and restarts the pod, recovering all FRR daemons.

Fixes #454

Special notes for your reviewer:

The FRR container reuses the same /healthz probe endpoint already served by the reloader container. If FRR's health is not o.k., both the frr and the reloader container will be restarted.

Release note:

Add liveness probe to FRR container to restart the pod when FRR daemons crash and cannot be recovered by watchfrr.

AI Guidelines Acknowledgment:

  • I have reviewed all changes in this PR, including any AI-generated content, and I take full responsibility for its accuracy and correctness.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a liveness probe to the FRR container to improve system reliability. By monitoring the health of the FRR daemon, Kubernetes can now automatically detect and recover from persistent failures that were previously left in a broken state, effectively addressing issues where the container remained running despite daemon crashes.

Highlights

  • Liveness Probe Implementation: Added an HTTP liveness probe to the FRR container to monitor the /healthz endpoint on port 9080.
  • Failure Recovery: Configured the probe to trigger a pod restart if the FRR daemon enters an unrecoverable state, ensuring both the FRR and reloader containers are reset.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements liveness probes for the FRR container across multiple deployment configurations and Helm templates to ensure the container restarts if the routing daemon becomes unhealthy. The reviewer feedback identifies that the 1-second timeout for the /healthz endpoint is likely too restrictive, as the underlying health check relies on commands that can take up to 10 seconds under heavy load, and suggests increasing the timeout to 10 seconds to prevent premature container restarts.

port: 9080
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml (line 66), vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

          timeoutSeconds: 10
References
  1. Ensure correctness and proper handling of edge cases, such as slow health checks under load.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the bot that 1 second is way too optimistic.

port: 9080
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

          timeoutSeconds: 10
References
  1. Ensure correctness and proper handling of edge cases, such as slow health checks under load.

port: 9080
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

          timeoutSeconds: 10
References
  1. Ensure correctness and proper handling of edge cases, such as slow health checks under load.

Comment thread config/pods/perouter.yaml
port: 9080
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

          timeoutSeconds: 10
References
  1. Ensure correctness and proper handling of edge cases, such as slow health checks under load.

port: 9080
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

          timeoutSeconds: 10
References
  1. Ensure correctness and proper handling of edge cases, such as slow health checks under load.

@andreaskaris
Copy link
Copy Markdown
Contributor Author

@fedepaol for context see #454 - I'll also add a means to collect coredumps in another PR

Copy link
Copy Markdown
Contributor

@maiqueb maiqueb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts @andreaskaris / @fedepaol ?

port: 9080
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the bot that 1 second is way too optimistic.

@maiqueb
Copy link
Copy Markdown
Contributor

maiqueb commented May 25, 2026

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?

@fedepaol do you think that would make sense ?

When an FRR daemon such as zebra crashes in a way that watchfrr cannot
recover by itself, the FRR container keeps running but remains in a
broken state. Add an HTTP liveness probe to the FRR container that
queries the reloader's /healthz endpoint (port 9080) so that
Kubernetes detects the persistent failure and restarts the pod,
recovering all FRR daemons.

Fixes openperouter#454

Signed-off-by: Andreas Karis <ak.karis@gmail.com>
@andreaskaris
Copy link
Copy Markdown
Contributor Author

Regardless of exposing or not values to CI, I checked the data on my test pr with the crash, and I can see the recovery. But Fede also added Dumping previous logs for pod openperouter-system-router-8229j-frr, so we're actually covered even after recovery. Be it as it may, I think this one here needs some further discussion

@fedepaol
Copy link
Copy Markdown
Contributor

fedepaol commented May 25, 2026

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?

@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

@maiqueb
Copy link
Copy Markdown
Contributor

maiqueb commented May 25, 2026

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?
@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

+1 on realistic defaults.

Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low.

@fedepaol
Copy link
Copy Markdown
Contributor

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?
@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

+1 on realistic defaults.

Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low.

I'd look for a run to see how long vtysh takes normally and double that. The comment in values referred by gemini is probably related to an old frr version

@fedepaol
Copy link
Copy Markdown
Contributor

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?
@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

+1 on realistic defaults.
Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low.

I'd look for a run to see how long vtysh takes normally and double that. The comment in values referred by gemini is probably related to an old frr version

from some logs of a CI run, I am seeing always less than a second. Let's put 5 to be on the safe side.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
@fedepaol
Copy link
Copy Markdown
Contributor

fedepaol commented Jun 4, 2026

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?
@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

+1 on realistic defaults.
Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low.

I'd look for a run to see how long vtysh takes normally and double that. The comment in values referred by gemini is probably related to an old frr version

from some logs of a CI run, I am seeing always less than a second. Let's put 5 to be on the safe side.

@andreaskaris this is pending.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OpenPERouter: When zebra process crashes or is stopped, the FRR container keeps running

3 participants