Add liveness probe to FRR container to detect unrecoverable daemon failures by andreaskaris · Pull Request #455 · openperouter/openperouter

andreaskaris · 2026-05-23T21:10:34Z

Is this a BUG FIX or a FEATURE ?:

/kind bug

What this PR does / why we need it:

When an FRR daemon such as zebra crashes in a way that watchfrr cannot recover by itself, the FRR container keeps running but remains in a broken state. This adds an HTTP liveness probe to the FRR container that queries the reloader's /healthz endpoint (port 9080) so that Kubernetes detects the persistent failure and restarts the pod, recovering all FRR daemons.

Fixes #454

Special notes for your reviewer:

The FRR container reuses the same /healthz probe endpoint already served by the reloader container. If FRR's health is not o.k., both the frr and the reloader container will be restarted.

Release note:

Add liveness probe to FRR container to restart the pod when FRR daemons crash and cannot be recovered by watchfrr.

AI Guidelines Acknowledgment:

I have reviewed all changes in this PR, including any AI-generated content, and I take full responsibility for its accuracy and correctness.

gemini-code-assist · 2026-05-23T21:10:43Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a liveness probe to the FRR container to improve system reliability. By monitoring the health of the FRR daemon, Kubernetes can now automatically detect and recover from persistent failures that were previously left in a broken state, effectively addressing issues where the container remained running despite daemon crashes.

Highlights

Liveness Probe Implementation: Added an HTTP liveness probe to the FRR container to monitor the /healthz endpoint on port 9080.
Failure Recovery: Configured the probe to trigger a pod restart if the FRR daemon enters an unrecoverable state, ensuring both the FRR and reloader containers are reset.

New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request implements liveness probes for the FRR container across multiple deployment configurations and Helm templates to ensure the container restarts if the routing daemon becomes unhealthy. The reviewer feedback identifies that the 1-second timeout for the /healthz endpoint is likely too restrictive, as the underlying health check relies on commands that can take up to 10 seconds under heavy load, and suggests increasing the timeout to 10 seconds to prevent premature container restarts.

gemini-code-assist · 2026-05-23T21:12:04Z

+            port: 9080
+          initialDelaySeconds: 15
+          periodSeconds: 20
+          timeoutSeconds: 1


The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml (line 66), vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

timeoutSeconds: 10

References

Ensure correctness and proper handling of edge cases, such as slow health checks under load.

I agree with the bot that 1 second is way too optimistic.

gemini-code-assist · 2026-05-23T21:12:04Z

+            port: 9080
+          initialDelaySeconds: 15
+          periodSeconds: 20
+          timeoutSeconds: 1


The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

timeoutSeconds: 10

References

Ensure correctness and proper handling of edge cases, such as slow health checks under load.

gemini-code-assist · 2026-05-23T21:12:04Z

+            port: 9080
+          initialDelaySeconds: 15
+          periodSeconds: 20
+          timeoutSeconds: 1


The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

timeoutSeconds: 10

References

Ensure correctness and proper handling of edge cases, such as slow health checks under load.

gemini-code-assist · 2026-05-23T21:12:04Z

+            port: 9080
+          initialDelaySeconds: 15
+          periodSeconds: 20
+          timeoutSeconds: 1


The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

timeoutSeconds: 10

References

Ensure correctness and proper handling of edge cases, such as slow health checks under load.

gemini-code-assist · 2026-05-23T21:12:04Z

+            port: 9080
+          initialDelaySeconds: 15
+          periodSeconds: 20
+          timeoutSeconds: 1


The timeoutSeconds is set to 1, which may be too restrictive. The health check endpoint /healthz likely triggers a call to vtysh to verify daemon status. As noted in values.yaml, vtysh commands can take up to 10 seconds under heavy load. A 1-second timeout could cause the liveness probe to fail prematurely if the underlying check is slow, leading to unnecessary container restarts. Consider increasing this value to at least 10 seconds to align with the expected vtysh timeout.

timeoutSeconds: 10

References

Ensure correctness and proper handling of edge cases, such as slow health checks under load.

andreaskaris · 2026-05-24T09:36:10Z

@fedepaol for context see #454 - I'll also add a means to collect coredumps in another PR

maiqueb

thoughts @andreaskaris / @fedepaol ?

maiqueb · 2026-05-25T09:39:26Z

+            port: 9080
+          initialDelaySeconds: 15
+          periodSeconds: 20
+          timeoutSeconds: 1


I agree with the bot that 1 second is way too optimistic.

maiqueb · 2026-05-25T10:13:43Z

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?

@fedepaol do you think that would make sense ?

When an FRR daemon such as zebra crashes in a way that watchfrr cannot recover by itself, the FRR container keeps running but remains in a broken state. Add an HTTP liveness probe to the FRR container that queries the reloader's /healthz endpoint (port 9080) so that Kubernetes detects the persistent failure and restarts the pod, recovering all FRR daemons. Fixes openperouter#454 Signed-off-by: Andreas Karis <ak.karis@gmail.com>

andreaskaris · 2026-05-25T10:48:14Z

Regardless of exposing or not values to CI, I checked the data on my test pr with the crash, and I can see the recovery. But Fede also added Dumping previous logs for pod openperouter-system-router-8229j-frr, so we're actually covered even after recovery. Be it as it may, I think this one here needs some further discussion

fedepaol · 2026-05-25T13:29:39Z

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?

@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

maiqueb · 2026-05-25T13:36:55Z

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?
@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

+1 on realistic defaults.

Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low.

fedepaol · 2026-05-25T13:57:21Z

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?
@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

+1 on realistic defaults.

Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low.

I'd look for a run to see how long vtysh takes normally and double that. The comment in values referred by gemini is probably related to an old frr version

fedepaol · 2026-05-25T21:40:35Z

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?
@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

+1 on realistic defaults.
Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low.

I'd look for a run to see how long vtysh takes normally and double that. The comment in values referred by gemini is probably related to an old frr version

from some logs of a CI run, I am seeing always less than a second. Let's put 5 to be on the safe side.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>

fedepaol · 2026-06-04T12:29:15Z

from our discussion on slack; maybe the right thing to do is to expose this value as configuration via helm / the operator ?
@fedepaol do you think that would make sense ?

yep, but it also makes sense to come up with realistic default values. If we relax to 10 seconds, maybe we should relax the period parameter too.

+1 on realistic defaults.
Any suggestion ? I think we're mostly guessing tbh - I just feel 1 second is way too low.

I'd look for a run to see how long vtysh takes normally and double that. The comment in values referred by gemini is probably related to an old frr version

from some logs of a CI run, I am seeing always less than a second. Let's put 5 to be on the safe side.

@andreaskaris this is pending.

github-actions Bot added the kind/bug label May 23, 2026

andreaskaris force-pushed the frr-liveness branch from 6f74dc9 to bdc89b4 Compare May 23, 2026 21:11

gemini-code-assist Bot reviewed May 23, 2026

View reviewed changes

maiqueb requested changes May 25, 2026

View reviewed changes

andreaskaris force-pushed the frr-liveness branch from bdc89b4 to 5d4edcc Compare May 25, 2026 10:18

Merge branch 'main' into frr-liveness

e6ac5d4

Merge branch 'main' into frr-liveness

6b11763

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>

Conversation

andreaskaris commented May 23, 2026

Uh oh!

gemini-code-assist Bot commented May 23, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

maiqueb May 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

andreaskaris commented May 24, 2026

Uh oh!

maiqueb left a comment

Choose a reason for hiding this comment

Uh oh!

maiqueb May 25, 2026

Choose a reason for hiding this comment

Uh oh!

maiqueb commented May 25, 2026

Uh oh!

andreaskaris commented May 25, 2026

Uh oh!

fedepaol commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maiqueb commented May 25, 2026

Uh oh!

fedepaol commented May 25, 2026

Uh oh!

fedepaol commented May 25, 2026

Uh oh!

fedepaol commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fedepaol commented May 25, 2026 •

edited

Loading