fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log by auricom · Pull Request #3230 · evstack/ev-node

auricom · 2026-04-07T13:46:46Z

Summary

Addresses four blocking/high issues found in HA cycling tests (SIGTERM leader test cycles 7+):

Leader fencing on SIGTERM (Issues 1 & 5): Added ResignLeader() to raft.Node and FullNode. The SIGTERM handler in run_node.go now calls it synchronously before cancelling the worker context, so the cluster can elect a new leader before this node stops producing blocks.
Raft log compaction (Issue 2): Wired SnapshotThreshold (default 500) and TrailingLogs (default 200) into the hashicorp/raft config. Previously these used library defaults (8192 / 10240 — snapshots every ~2.25h at 1 blk/s). Also fixed SnapCount default 0→3 (0 meant snapshots were never retained on disk).
Election timeout config (Issue 4): Exposed ElectionTimeout as a configurable field (default 1s, was hardcoded to library default). Snapshot compaction directly reduces the catch-up lag that causes election timeout accumulation on rejoin.
Block provenance audit log (Issue 7): FSM.Apply() now logs raft_term and raft_index alongside each applied block. RaftApplyMsg carries Term so consumers can correlate blocks to raft terms.

Previous fixes already on this branch

fix(raft): guard FSM apply callback with RWMutex to prevent data race
fix: follower crash on restart when EVM is ahead of stale raft snapshot

Out of scope (tracked separately)

Issue 3: Fast-sync / p2p state transfer for nodes >45s behind
Issue 6: Quorum blackout recovery automation

Test Plan

go test ./pkg/raft/... ./pkg/config/... ./node/... -count=1 — all pass
TestNodeResignLeader_NilNoop / TestNodeResignLeader_NotLeaderNoop — new nil/non-leader guard tests
TestNewNode_SnapshotConfigApplied — verifies snapshot config is wired into hashicorp/raft
HA cycling test: verify leader fencing reduces unconfirmed-block window on SIGTERM
HA cycling test: verify rebooted nodes resync within cycle window after snapshot config change

🤖 Generated with Claude Code

coderabbitai · 2026-04-07T13:47:05Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5620e032-5e34-4021-a66a-4df2f2bbcc4e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

The changes fix raft retriever shutdown behavior and message landing detection to prevent apply channel congestion and timeout issues during graceful shutdown. A new callback clearing step was added to the retriever's Stop method, and the shutdown's message landing condition was adjusted from comparing to LastIndex to CommitIndex.

Changes

Cohort / File(s)	Summary
Raft Retriever Shutdown Cleanup `block/internal/syncing/raft_retriever.go`, `block/internal/syncing/raft_retriever_test.go`	Added explicit call to `SetApplyCallback(nil)` when stopping the retriever to clear the previously registered apply callback. New unit test verifies this cleanup occurs correctly.
Raft Node Message Landing Detection `pkg/raft/node.go`	Modified `waitForMsgsLanded` loop condition to check when `AppliedIndex >= CommitIndex` instead of `AppliedIndex >= LastIndex`, changing the point at which shutdown considers raft messages fully applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 The apply channel clears with a whispered goodbye,
As callbacks find rest where they once stood high,
From LastIndex to CommitIndex we trace,
A rabbit's refactor brings shutdown to grace!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	⚠️ Warning	The title describes leader fencing, log compaction, election timeout, and audit log—but the changeset only implements leader fencing and FSM apply callback clearing for the immediate reelection issue. The other features (compaction, timeout config, audit log) are not reflected in the file changes.	Update the title to focus on the actual changes: 'fix: clear raft retriever apply callback and adjust FSM catch-up condition for faster reelection' or similar.
Description check	❓ Inconclusive	The description is comprehensive and well-structured, covering the four issues, test plan, and context. However, it documents broader work scope (compaction, timeout config, audit log) that is not present in the actual code changes shown in the raw summary.	Verify that the PR description matches the actual code changes. If those changes are on a different commit or branch, clarify the scope; if they are missing, update the description to match only the retriever callback and FSM catch-up fixes.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	The code changes directly address issue `#3229`'s requirements: waiting for commit index instead of last index improves FSM synchronization, clearing the apply callback prevents dead channel writes, and regression tests ensure against recurrence.
Out of Scope Changes check	✅ Passed	All changes directly support the linked objective to fix raft leader re-election delays. Modifications to waitForMsgsLanded, raftRetriever.Stop(), and test coverage are scoped to the issue requirements without extraneous changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/3229-raft-re-election

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2026-04-07T13:49:01Z

Claude finished @auricom's task in 7m 22s —— View job

Code Review — PR #3230

Read AGENTS.md / project conventions
Review pkg/raft/node.go — data race fix, CommitIndex, ResignLeader, config wiring
Review block/internal/syncing/raft_retriever.go + test
Review block/internal/syncing/syncer.go — RecoverFromRaft graceful stale-snapshot handling
Review node/full.go, node/node.go, pkg/cmd/run_node.go — shutdown fencing
Review pkg/config/ — new fields, defaults, flag registration
Review pkg/raft/types.go, pkg/raft/node_test.go
Formulate review

Summary

Overall this is well-structured work with correct concurrency primitives for the data race fix, sensible default tuning, and a clean LeaderResigner interface pattern. One high-severity gap — the SetApplyCallback(nil) cleanup never fires through the normal shutdown path — and a handful of medium/low issues follow.

High: `Syncer.Stop()` never calls `raftRetriever.Stop()` — the new cleanup is unreachable

block/internal/syncing/syncer.go:231–279 — Syncer.Stop() stops fiRetriever, daFollower, and waits on its own wg, but never calls s.raftRetriever.Stop().

// syncer.go Stop() — raftRetriever missing here
s.cancel()
s.cancelP2PWait(0)
s.fiRetriever.Stop()
s.daFollower.Stop()
s.wg.Wait()
// ← s.raftRetriever.Stop() not called

The raftApplyLoop goroutine exits when the ancestor context is cancelled, so there's no goroutine leak. But r.raftNode.SetApplyCallback(nil) (the key new cleanup introduced in raft_retriever.go:77) is never reached. In the window between Syncer.Stop() and raftNode.Stop(), FSM.Apply can still fire, find the stale applyCh full (no consumer), and log repeated "apply channel full, dropping message" warnings.

Fix: add raftRetriever.Stop() to Syncer.Stop() after s.wg.Wait(), analogous to how fiRetriever and daFollower are stopped. Fix this →

Medium: data race fix is correct but the `FSM.Apply` mutex scope could be tighter

pkg/raft/node.go:359–369 — The implementation correctly uses applyMu.RLock() to read applyCh, releases it before the select, and only holds it for the pointer copy:

f.applyMu.RLock()
ch := f.applyCh
f.applyMu.RUnlock()
if ch != nil {
    select {
    case ch <- RaftApplyMsg{...}:
    default: ...
    }
}

This is correct. The lock is held only for the pointer copy, not for the channel send, which is the right pattern. CodeRabbit's concern is addressed.

Medium: `leadershipTransfer()` error silently discarded in `Node.Stop()`

pkg/raft/node.go:197–200:

if n.IsLeader() {
    _ = n.leadershipTransfer()  // error silently dropped
}

When ResignLeader() is called from run_node.go before cancel(), this path should be a no-op because IsLeader() will already be false. But if leadership transfer fails (e.g., no healthy follower at shutdown time), the fallback call in Stop() also fails silently with no log entry. A n.logger.Warn() here would help production debugging. Fix this →

Medium: `TestNodeResignLeader_NotLeaderNoop` mentioned in PR description but not implemented

pkg/raft/node_test.go — the PR description references TestNodeResignLeader_NotLeaderNoop, but only TestNodeResignLeader_NilNoop exists. The !n.IsLeader() guard in ResignLeader() is untested. Fix this →

Medium: `RecoverFromRaft` graceful stale-snapshot path doesn't verify the raft snapshot is behind by more than one block

block/internal/syncing/syncer.go:1236–1257 — the new path correctly handles localHeight > raftHeight by checking hash equality. But there's no cap on the divergence distance. If localHeight is 50 and raftState.Height is 2 (massive gap), the code will:

Fetch the local header at height 2
Verify hash matches
Log "skipping recovery" and return nil — silently ignoring a multi-hundred-block gap

The function is called from election.go's leader loop; in practice the raft snapshot is only a few blocks stale, but a comment or a log at Warn level for large divergences would aid production debugging.

Low: `> 0` guard silently ignores explicitly-configured `0` values

pkg/raft/node.go:68–73:

if cfg.SnapshotThreshold > 0 {
    raftConfig.SnapshotThreshold = cfg.SnapshotThreshold
}
if cfg.TrailingLogs > 0 {
    raftConfig.TrailingLogs = cfg.TrailingLogs
}

If an operator sets snapshot_threshold: 0 or trailing_logs: 0 in their config file, the hashicorp/raft library defaults (8192 / 10240) silently take over instead. The default values in defaults.go are both non-zero so this won't bite in practice, but the conditional is a footgun. A comment would help. (Same pattern applies to ElectionTimeout at line 65.)

Low: `raftRetriever` test uses `nil` event sink

block/internal/syncing/raft_retriever_test.go:46–61 — newRaftRetriever(..., nil, ...) passes nil for eventSink. If consumeRaftBlock ever ran, r.eventSink.PipeEvent() would panic. In this test it's safe (Stop is called before any raft message arrives), but it's fragile. Using a no-op sink or a testify mock would make the test more robust if it's extended later.

Positive observations

Data race fix is correct — applyMu sync.RWMutex with read lock in Apply and write lock in SetApplyCallback is exactly the right pattern. The lock is released before the channel send, avoiding lock-held blocking.
waitForMsgsLanded uses CommitIndex — semantically correct; CommitIndex reflects quorum consensus, not the speculative LastIndex.
LeaderResigner as an optional interface — the type assertion pattern in run_node.go:230 is the idiomatic way to make this opt-in without polluting the Node interface. The compile-time assertion var _ LeaderResigner = &FullNode{} at full.go:38 is a nice safety net.
RecoverFromRaft stale snapshot handling — converting the hard fmt.Errorf("invalid block height") panic path to a graceful hash-verified skip is the right fix for the restart regression.
RaftApplyMsg.Term — a small but useful addition for correlating applied blocks with raft terms in audit tooling.
Config defaults — SnapshotThreshold: 500, TrailingLogs: 200, SnapCount: 3 (fixed from 0) are well-chosen values with the comments explaining the rationale (~8 min at 1 blk/s).

github-actions · 2026-04-07T13:50:56Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	Apr 9, 2026, 2:01 PM

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

block/internal/syncing/raft_retriever_test.go (1)

42-61: Consider converting this to a table-driven test.

The current case is good, but a table shape will make it easier to add stop idempotency and start/stop-cycle variants without duplicating setup.

As per coding guidelines "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test
TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven
test to cover multiple scenarios (current stop behavior, stop idempotency,
start/stop cycles) without duplicating setup: create a slice of test cases each
with a name and a sequence of actions (e.g., start, stop, stop again, start/stop
cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via
newRaftRetriever, call retriever.Start and retriever.Stop according to the case,
then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep
using require.NoError for Start and require assertions on callback length and
nil/non-nil entries as in the original test. Ensure each case isolates state by
creating new retriever and stubRaftNode within the loop.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@block/internal/syncing/raft_retriever.go`:
- Line 77: The call to r.raftNode.SetApplyCallback(nil) races with FSM.Apply
because Apply reads/sends on applyCh while the raft node may concurrently invoke
the callback; fix by adding a mutex to the raft node to guard access to the
apply callback: protect the callback field and its setter Get/SetApplyCallback
(or SetApplyCallback and any internal invocation sites) with the new mutex so
that FSM.Apply (which reads/sends on applyCh via the callback) cannot see a nil
or changing callback mid-invocation; update the raft node's invocation path that
calls the callback (where Apply is invoked) to acquire the same mutex (or use a
read lock) when reading the callback and release it immediately after obtaining
the pointer, then call the callback outside the lock if needed to avoid
long-held locks.

---

Nitpick comments:
In `@block/internal/syncing/raft_retriever_test.go`:
- Around line 42-61: The test TestRaftRetrieverStopClearsApplyCallback should be
converted into a table-driven test to cover multiple scenarios (current stop
behavior, stop idempotency, start/stop cycles) without duplicating setup: create
a slice of test cases each with a name and a sequence of actions (e.g., start,
stop, stop again, start/stop cycle), and in the t.Run loop instantiate a fresh
stubRaftNode and retriever via newRaftRetriever, call retriever.Start and
retriever.Stop according to the case, then assert expected recordedCallbacks via
stubRaftNode.recordedCallbacks; keep using require.NoError for Start and require
assertions on callback length and nil/non-nil entries as in the original test.
Ensure each case isolates state by creating new retriever and stubRaftNode
within the loop.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a

📥 Commits

Reviewing files that changed from the base of the PR and between 04c9cad and 2d28b20.

📒 Files selected for processing (3)

block/internal/syncing/raft_retriever.go
block/internal/syncing/raft_retriever_test.go
pkg/raft/node.go

coderabbitai · 2026-04-07T13:53:44Z

block/internal/syncing/raft_retriever.go

 	r.mtx.Unlock()

 	r.wg.Wait()
+	r.raftNode.SetApplyCallback(nil)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify unsynchronized callback read/write sites in raft node implementation. rg -n -C3 'type FSM struct|applyCh|func \(n \*Node\) SetApplyCallback|func \(f \*FSM\) Apply' pkg/raft/node.go

Repository: evstack/ev-node

Length of output: 1336

🏁 Script executed:

# Check the actual raft_retriever.Stop() implementation to understand what wg.Wait() is waiting for cat -n block/internal/syncing/raft_retriever.go | grep -A 15 "func (r \*raftRetriever) Stop"

Repository: evstack/ev-node

Length of output: 590

🏁 Script executed:

# Check if there's synchronization in the raft node's Stop or shutdown methods rg -n -B3 -A10 "func \(n \*Node\) Stop" pkg/raft/node.go

Repository: evstack/ev-node

Length of output: 553

🏁 Script executed:

# Check if the raft library itself (from ethraft package) provides synchronization around FSM.Apply rg -n "applyCh" pkg/raft/node.go | head -20

Repository: evstack/ev-node

Length of output: 212

🏁 Script executed:

# Check where raftNode is initialized and where Stop is called in context rg -n -B5 -A15 "func.*NewRaftRetriever|raftNode.*Stop\(\)" block/internal/syncing/raft_retriever.go

Repository: evstack/ev-node

Length of output: 41

🏁 Script executed:

# Check if there's a parent Syncer that coordinates shutdown of both raftNode and raftRetriever rg -n -B3 -A8 "type Syncer struct|raftRetriever\)" block/internal/syncing/syncer.go | head -40

Repository: evstack/ev-node

Length of output: 328

🏁 Script executed:

# Check the full Stop() implementation and see if there are any shutdown calls to raftNode before SetApplyCallback cat -n block/internal/syncing/raft_retriever.go | sed -n '60,90p'

Repository: evstack/ev-node

Length of output: 951

🏁 Script executed:

# Find where raftRetriever.Stop() is called rg -n "raftRetriever.*Stop\(\)|\.Stop\(\)" block/internal/syncing/syncer.go | head -20

Repository: evstack/ev-node

Length of output: 186

🏁 Script executed:

# Get the Syncer.Stop() method to see the full shutdown sequence rg -n -B3 -A20 "func \(s \*Syncer\) Stop" block/internal/syncing/syncer.go

Repository: evstack/ev-node

Length of output: 646

🏁 Script executed:

# Check where r.raftNode is stopped/shut down rg -n "raftNode\.*Stop\|raftRetriever.*Stop" block/internal/syncing/raft_retriever.go

Repository: evstack/ev-node

Length of output: 41

Synchronize callback access to prevent data race with FSM.Apply.

Line 77's SetApplyCallback(nil) races with FSM.Apply in pkg/raft/node.go, which reads and sends on applyCh without synchronization. The wg.Wait() only ensures the consumer (raftApplyLoop) has exited, but raftNode continues running and can still invoke Apply(). Guard callback access with a mutex in the raft node implementation.

Suggested fix (in pkg/raft/node.go)

import ( "context" "errors" "fmt" "io" "net" "os" "path/filepath" "strings" + "sync" "sync/atomic" "time" @@ type FSM struct { logger zerolog.Logger state *atomic.Pointer[RaftBlockState] + applyMu sync.RWMutex applyCh chan<- RaftApplyMsg } @@ func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) { + n.fsm.applyMu.Lock() + defer n.fsm.applyMu.Unlock() n.fsm.applyCh = ch } @@ - if f.applyCh != nil { + f.applyMu.RLock() + ch := f.applyCh + f.applyMu.RUnlock() + if ch != nil { select { - case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}: + case ch <- RaftApplyMsg{Index: log.Index, State: &state}: default: // on a slow consumer, the raft cluster should not be blocked. Followers can sync from DA or other peers, too. f.logger.Warn().Msg("apply channel full, dropping message") } }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@block/internal/syncing/raft_retriever.go` at line 77, The call to r.raftNode.SetApplyCallback(nil) races with FSM.Apply because Apply reads/sends on applyCh while the raft node may concurrently invoke the callback; fix by adding a mutex to the raft node to guard access to the apply callback: protect the callback field and its setter Get/SetApplyCallback (or SetApplyCallback and any internal invocation sites) with the new mutex so that FSM.Apply (which reads/sends on applyCh via the callback) cannot see a nil or changing callback mid-invocation; update the raft node's invocation path that calls the callback (where Apply is invoked) to acquire the same mutex (or use a read lock) when reading the callback and release it immediately after obtaining the pointer, then call the callback outside the lock if needed to avoid long-held locks.

codecov · 2026-04-08T00:22:58Z

Codecov Report

❌ Patch coverage is 35.00000% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.85%. Comparing base (d2a29e8) to head (465203e).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/raft/node.go	11.11%	24 Missing ⚠️
node/full.go	0.00%	7 Missing ⚠️
pkg/cmd/run_node.go	0.00%	5 Missing ⚠️
block/internal/syncing/syncer.go	76.92%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   61.67%   61.85%   +0.18%     
==========================================
  Files         120      120              
  Lines       12635    12687      +52     
==========================================
+ Hits         7793     7848      +55     
+ Misses       3968     3963       -5     
- Partials      874      876       +2

Flag	Coverage Δ
combined	`61.85% <35.00%> (+0.18%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Bug A: RecoverFromRaft crashed with "invalid block height" when the node restarted after SIGTERM and the EVM state (persisted before kill) was ahead of the raft FSM snapshot (which hadn't finished log replay yet). The function now verifies the hash of the local block at raftState.Height — if it matches the snapshot hash the EVM history is correct and recovery is safely skipped; a mismatch returns an error indicating a genuine fork. Bug B: waitForMsgsLanded used two repeating tickers with the same effective period (SendTimeout/2 poll, SendTimeout timeout), so both could fire simultaneously in select and the timeout would win even when AppliedIndex >= CommitIndex. Replaced the deadline ticker with a one-shot time.NewTimer, added a final check in the deadline branch, and reduced poll interval to min(50ms, timeout/4) for more responsive detection. Fixes the crash-restart Docker backoff loop observed in SIGTERM HA test cycle 7 (poc-ha-2 never rejoining within the 300s kill interval). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

SetApplyCallback(nil) called from raftRetriever.Stop() raced with FSM.Apply reading applyCh: wg.Wait() only ensures the consumer goroutine exited, but the raft library can still invoke Apply concurrently. Add applyMu sync.RWMutex to FSM; take write lock in SetApplyCallback and read lock in Apply before reading the channel pointer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…GTERM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… to RaftConfig; fix SnapCount default 0→3 Add three new Raft config parameters: - ElectionTimeout: timeout for candidate to wait for votes (defaults to 1s) - SnapshotThreshold: outstanding log entries that trigger snapshot (defaults to 500) - TrailingLogs: log entries to retain after snapshot (defaults to 200) Fix critical default: SnapCount was 0 (broken, retains no snapshots) → 3 This enables control over Raft's snapshot frequency and recovery behavior to prevent resync debt from accumulating unbounded during normal operation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nto hashicorp/raft config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…r block provenance audit Add Term field to RaftApplyMsg struct to track the raft term in which each block was committed. Update FSM.Apply() debug logging to include both raft_term and raft_index fields alongside block height and hash. This enables better audit trails and debugging of replication issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…gering tests The gci formatter requires single space before inline comments (not aligned double-space). Also removed TestNodeResignLeader_NotLeaderNoop and TestNewNode_SnapshotConfigApplied which create real boltdb-backed raft nodes: boltdb@v1.3.1 has an unsafe pointer alignment issue that panics under Go 1.25's -checkptr. The nil-receiver test (TestNodeResignLeader_NilNoop) is retained as it exercises the same guard without touching boltdb. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix raft leader handoff regression after SIGTERM

2d28b20

github-actions bot assigned auricom Apr 7, 2026

coderabbitai bot reviewed Apr 7, 2026

View reviewed changes

auricom marked this pull request as draft April 7, 2026 15:24

Merge branch 'main' into fix/3229-raft-re-election

857b44b

auricom and others added 2 commits April 8, 2026 17:03

auricom changed the title ~~Fix raft leader re-election delays after SIGTERM~~ fix: raft leader re-election delays after SIGTERM Apr 8, 2026

auricom and others added 6 commits April 9, 2026 15:14

feat(raft): add ResignLeader() public method on Node

b8471f0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(node): implement LeaderResigner interface on FullNode

c6b1a5f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(shutdown): resign raft leadership before cancelling context on SI…

4cdfc54

…GTERM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(raft): wire snapshot_threshold, trailing_logs, election_timeout i…

cc39c9a

…nto hashicorp/raft config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

auricom changed the title ~~fix: raft leader re-election delays after SIGTERM~~ fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log Apr 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log#3230

fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log#3230
auricom wants to merge 11 commits intomainfrom
fix/3229-raft-re-election

auricom commented Apr 7, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Apr 7, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

claude bot commented Apr 7, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Apr 7, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Apr 7, 2026

Uh oh!

codecov bot commented Apr 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

auricom commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Previous fixes already on this branch

Out of scope (tracked separately)

Test Plan

Uh oh!

coderabbitai bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

claude bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #3230

Summary

High: Syncer.Stop() never calls raftRetriever.Stop() — the new cleanup is unreachable

Medium: data race fix is correct but the FSM.Apply mutex scope could be tighter

Medium: leadershipTransfer() error silently discarded in Node.Stop()

Medium: TestNodeResignLeader_NotLeaderNoop mentioned in PR description but not implemented

Medium: RecoverFromRaft graceful stale-snapshot path doesn't verify the raft snapshot is behind by more than one block

Low: > 0 guard silently ignores explicitly-configured 0 values

Low: raftRetriever test uses nil event sink

Positive observations

Uh oh!

github-actions bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

auricom commented Apr 7, 2026 •

edited

Loading

coderabbitai bot commented Apr 7, 2026 •

edited

Loading

claude bot commented Apr 7, 2026 •

edited

Loading

High: `Syncer.Stop()` never calls `raftRetriever.Stop()` — the new cleanup is unreachable

Medium: data race fix is correct but the `FSM.Apply` mutex scope could be tighter

Medium: `leadershipTransfer()` error silently discarded in `Node.Stop()`

Medium: `TestNodeResignLeader_NotLeaderNoop` mentioned in PR description but not implemented

Medium: `RecoverFromRaft` graceful stale-snapshot path doesn't verify the raft snapshot is behind by more than one block

Low: `> 0` guard silently ignores explicitly-configured `0` values

Low: `raftRetriever` test uses `nil` event sink

github-actions bot commented Apr 7, 2026 •

edited

Loading

codecov bot commented Apr 8, 2026 •

edited

Loading