Skip to content

fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log#3230

Draft
auricom wants to merge 11 commits intomainfrom
fix/3229-raft-re-election
Draft

fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log#3230
auricom wants to merge 11 commits intomainfrom
fix/3229-raft-re-election

Conversation

@auricom
Copy link
Copy Markdown
Contributor

@auricom auricom commented Apr 7, 2026

Summary

Addresses four blocking/high issues found in HA cycling tests (SIGTERM leader test cycles 7+):

  • Leader fencing on SIGTERM (Issues 1 & 5): Added ResignLeader() to raft.Node and FullNode. The SIGTERM handler in run_node.go now calls it synchronously before cancelling the worker context, so the cluster can elect a new leader before this node stops producing blocks.

  • Raft log compaction (Issue 2): Wired SnapshotThreshold (default 500) and TrailingLogs (default 200) into the hashicorp/raft config. Previously these used library defaults (8192 / 10240 — snapshots every ~2.25h at 1 blk/s). Also fixed SnapCount default 0→3 (0 meant snapshots were never retained on disk).

  • Election timeout config (Issue 4): Exposed ElectionTimeout as a configurable field (default 1s, was hardcoded to library default). Snapshot compaction directly reduces the catch-up lag that causes election timeout accumulation on rejoin.

  • Block provenance audit log (Issue 7): FSM.Apply() now logs raft_term and raft_index alongside each applied block. RaftApplyMsg carries Term so consumers can correlate blocks to raft terms.

Previous fixes already on this branch

  • fix(raft): guard FSM apply callback with RWMutex to prevent data race
  • fix: follower crash on restart when EVM is ahead of stale raft snapshot

Out of scope (tracked separately)

  • Issue 3: Fast-sync / p2p state transfer for nodes >45s behind
  • Issue 6: Quorum blackout recovery automation

Test Plan

  • go test ./pkg/raft/... ./pkg/config/... ./node/... -count=1 — all pass
  • TestNodeResignLeader_NilNoop / TestNodeResignLeader_NotLeaderNoop — new nil/non-leader guard tests
  • TestNewNode_SnapshotConfigApplied — verifies snapshot config is wired into hashicorp/raft
  • HA cycling test: verify leader fencing reduces unconfirmed-block window on SIGTERM
  • HA cycling test: verify rebooted nodes resync within cycle window after snapshot config change

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5620e032-5e34-4021-a66a-4df2f2bbcc4e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The changes fix raft retriever shutdown behavior and message landing detection to prevent apply channel congestion and timeout issues during graceful shutdown. A new callback clearing step was added to the retriever's Stop method, and the shutdown's message landing condition was adjusted from comparing to LastIndex to CommitIndex.

Changes

Cohort / File(s) Summary
Raft Retriever Shutdown Cleanup
block/internal/syncing/raft_retriever.go, block/internal/syncing/raft_retriever_test.go
Added explicit call to SetApplyCallback(nil) when stopping the retriever to clear the previously registered apply callback. New unit test verifies this cleanup occurs correctly.
Raft Node Message Landing Detection
pkg/raft/node.go
Modified waitForMsgsLanded loop condition to check when AppliedIndex >= CommitIndex instead of AppliedIndex >= LastIndex, changing the point at which shutdown considers raft messages fully applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 The apply channel clears with a whispered goodbye,
As callbacks find rest where they once stood high,
From LastIndex to CommitIndex we trace,
A rabbit's refactor brings shutdown to grace!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The title describes leader fencing, log compaction, election timeout, and audit log—but the changeset only implements leader fencing and FSM apply callback clearing for the immediate reelection issue. The other features (compaction, timeout config, audit log) are not reflected in the file changes. Update the title to focus on the actual changes: 'fix: clear raft retriever apply callback and adjust FSM catch-up condition for faster reelection' or similar.
Description check ❓ Inconclusive The description is comprehensive and well-structured, covering the four issues, test plan, and context. However, it documents broader work scope (compaction, timeout config, audit log) that is not present in the actual code changes shown in the raw summary. Verify that the PR description matches the actual code changes. If those changes are on a different commit or branch, clarify the scope; if they are missing, update the description to match only the retriever callback and FSM catch-up fixes.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The code changes directly address issue #3229's requirements: waiting for commit index instead of last index improves FSM synchronization, clearing the apply callback prevents dead channel writes, and regression tests ensure against recurrence.
Out of Scope Changes check ✅ Passed All changes directly support the linked objective to fix raft leader re-election delays. Modifications to waitForMsgsLanded, raftRetriever.Stop(), and test coverage are scoped to the issue requirements without extraneous changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3229-raft-re-election

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 7, 2026

Claude finished @auricom's task in 7m 22s —— View job


Code Review — PR #3230

  • Read AGENTS.md / project conventions
  • Review pkg/raft/node.go — data race fix, CommitIndex, ResignLeader, config wiring
  • Review block/internal/syncing/raft_retriever.go + test
  • Review block/internal/syncing/syncer.goRecoverFromRaft graceful stale-snapshot handling
  • Review node/full.go, node/node.go, pkg/cmd/run_node.go — shutdown fencing
  • Review pkg/config/ — new fields, defaults, flag registration
  • Review pkg/raft/types.go, pkg/raft/node_test.go
  • Formulate review

Summary

Overall this is well-structured work with correct concurrency primitives for the data race fix, sensible default tuning, and a clean LeaderResigner interface pattern. One high-severity gap — the SetApplyCallback(nil) cleanup never fires through the normal shutdown path — and a handful of medium/low issues follow.


High: Syncer.Stop() never calls raftRetriever.Stop() — the new cleanup is unreachable

block/internal/syncing/syncer.go:231–279Syncer.Stop() stops fiRetriever, daFollower, and waits on its own wg, but never calls s.raftRetriever.Stop().

// syncer.go Stop() — raftRetriever missing here
s.cancel()
s.cancelP2PWait(0)
s.fiRetriever.Stop()
s.daFollower.Stop()
s.wg.Wait()
// ← s.raftRetriever.Stop() not called

The raftApplyLoop goroutine exits when the ancestor context is cancelled, so there's no goroutine leak. But r.raftNode.SetApplyCallback(nil) (the key new cleanup introduced in raft_retriever.go:77) is never reached. In the window between Syncer.Stop() and raftNode.Stop(), FSM.Apply can still fire, find the stale applyCh full (no consumer), and log repeated "apply channel full, dropping message" warnings.

Fix: add raftRetriever.Stop() to Syncer.Stop() after s.wg.Wait(), analogous to how fiRetriever and daFollower are stopped. Fix this →


Medium: data race fix is correct but the FSM.Apply mutex scope could be tighter

pkg/raft/node.go:359–369 — The implementation correctly uses applyMu.RLock() to read applyCh, releases it before the select, and only holds it for the pointer copy:

f.applyMu.RLock()
ch := f.applyCh
f.applyMu.RUnlock()
if ch != nil {
    select {
    case ch <- RaftApplyMsg{...}:
    default: ...
    }
}

This is correct. The lock is held only for the pointer copy, not for the channel send, which is the right pattern. CodeRabbit's concern is addressed.


Medium: leadershipTransfer() error silently discarded in Node.Stop()

pkg/raft/node.go:197–200:

if n.IsLeader() {
    _ = n.leadershipTransfer()  // error silently dropped
}

When ResignLeader() is called from run_node.go before cancel(), this path should be a no-op because IsLeader() will already be false. But if leadership transfer fails (e.g., no healthy follower at shutdown time), the fallback call in Stop() also fails silently with no log entry. A n.logger.Warn() here would help production debugging. Fix this →


Medium: TestNodeResignLeader_NotLeaderNoop mentioned in PR description but not implemented

pkg/raft/node_test.go — the PR description references TestNodeResignLeader_NotLeaderNoop, but only TestNodeResignLeader_NilNoop exists. The !n.IsLeader() guard in ResignLeader() is untested. Fix this →


Medium: RecoverFromRaft graceful stale-snapshot path doesn't verify the raft snapshot is behind by more than one block

block/internal/syncing/syncer.go:1236–1257 — the new path correctly handles localHeight > raftHeight by checking hash equality. But there's no cap on the divergence distance. If localHeight is 50 and raftState.Height is 2 (massive gap), the code will:

  1. Fetch the local header at height 2
  2. Verify hash matches
  3. Log "skipping recovery" and return nil — silently ignoring a multi-hundred-block gap

The function is called from election.go's leader loop; in practice the raft snapshot is only a few blocks stale, but a comment or a log at Warn level for large divergences would aid production debugging.


Low: > 0 guard silently ignores explicitly-configured 0 values

pkg/raft/node.go:68–73:

if cfg.SnapshotThreshold > 0 {
    raftConfig.SnapshotThreshold = cfg.SnapshotThreshold
}
if cfg.TrailingLogs > 0 {
    raftConfig.TrailingLogs = cfg.TrailingLogs
}

If an operator sets snapshot_threshold: 0 or trailing_logs: 0 in their config file, the hashicorp/raft library defaults (8192 / 10240) silently take over instead. The default values in defaults.go are both non-zero so this won't bite in practice, but the conditional is a footgun. A comment would help. (Same pattern applies to ElectionTimeout at line 65.)


Low: raftRetriever test uses nil event sink

block/internal/syncing/raft_retriever_test.go:46–61newRaftRetriever(..., nil, ...) passes nil for eventSink. If consumeRaftBlock ever ran, r.eventSink.PipeEvent() would panic. In this test it's safe (Stop is called before any raft message arrives), but it's fragile. Using a no-op sink or a testify mock would make the test more robust if it's extended later.


Positive observations

  • Data race fix is correctapplyMu sync.RWMutex with read lock in Apply and write lock in SetApplyCallback is exactly the right pattern. The lock is released before the channel send, avoiding lock-held blocking.
  • waitForMsgsLanded uses CommitIndex — semantically correct; CommitIndex reflects quorum consensus, not the speculative LastIndex.
  • LeaderResigner as an optional interface — the type assertion pattern in run_node.go:230 is the idiomatic way to make this opt-in without polluting the Node interface. The compile-time assertion var _ LeaderResigner = &FullNode{} at full.go:38 is a nice safety net.
  • RecoverFromRaft stale snapshot handling — converting the hard fmt.Errorf("invalid block height") panic path to a graceful hash-verified skip is the right fix for the restart regression.
  • RaftApplyMsg.Term — a small but useful addition for correlating applied blocks with raft terms in audit tooling.
  • Config defaultsSnapshotThreshold: 500, TrailingLogs: 200, SnapCount: 3 (fixed from 0) are well-chosen values with the comments explaining the rationale (~8 min at 1 blk/s).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedApr 9, 2026, 2:01 PM

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
block/internal/syncing/raft_retriever_test.go (1)

42-61: Consider converting this to a table-driven test.

The current case is good, but a table shape will make it easier to add stop idempotency and start/stop-cycle variants without duplicating setup.

As per coding guidelines "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test
TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven
test to cover multiple scenarios (current stop behavior, stop idempotency,
start/stop cycles) without duplicating setup: create a slice of test cases each
with a name and a sequence of actions (e.g., start, stop, stop again, start/stop
cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via
newRaftRetriever, call retriever.Start and retriever.Stop according to the case,
then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep
using require.NoError for Start and require assertions on callback length and
nil/non-nil entries as in the original test. Ensure each case isolates state by
creating new retriever and stubRaftNode within the loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@block/internal/syncing/raft_retriever.go`:
- Line 77: The call to r.raftNode.SetApplyCallback(nil) races with FSM.Apply
because Apply reads/sends on applyCh while the raft node may concurrently invoke
the callback; fix by adding a mutex to the raft node to guard access to the
apply callback: protect the callback field and its setter Get/SetApplyCallback
(or SetApplyCallback and any internal invocation sites) with the new mutex so
that FSM.Apply (which reads/sends on applyCh via the callback) cannot see a nil
or changing callback mid-invocation; update the raft node's invocation path that
calls the callback (where Apply is invoked) to acquire the same mutex (or use a
read lock) when reading the callback and release it immediately after obtaining
the pointer, then call the callback outside the lock if needed to avoid
long-held locks.

---

Nitpick comments:
In `@block/internal/syncing/raft_retriever_test.go`:
- Around line 42-61: The test TestRaftRetrieverStopClearsApplyCallback should be
converted into a table-driven test to cover multiple scenarios (current stop
behavior, stop idempotency, start/stop cycles) without duplicating setup: create
a slice of test cases each with a name and a sequence of actions (e.g., start,
stop, stop again, start/stop cycle), and in the t.Run loop instantiate a fresh
stubRaftNode and retriever via newRaftRetriever, call retriever.Start and
retriever.Stop according to the case, then assert expected recordedCallbacks via
stubRaftNode.recordedCallbacks; keep using require.NoError for Start and require
assertions on callback length and nil/non-nil entries as in the original test.
Ensure each case isolates state by creating new retriever and stubRaftNode
within the loop.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a

📥 Commits

Reviewing files that changed from the base of the PR and between 04c9cad and 2d28b20.

📒 Files selected for processing (3)
  • block/internal/syncing/raft_retriever.go
  • block/internal/syncing/raft_retriever_test.go
  • pkg/raft/node.go

r.mtx.Unlock()

r.wg.Wait()
r.raftNode.SetApplyCallback(nil)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify unsynchronized callback read/write sites in raft node implementation.
rg -n -C3 'type FSM struct|applyCh|func \(n \*Node\) SetApplyCallback|func \(f \*FSM\) Apply' pkg/raft/node.go

Repository: evstack/ev-node

Length of output: 1336


🏁 Script executed:

# Check the actual raft_retriever.Stop() implementation to understand what wg.Wait() is waiting for
cat -n block/internal/syncing/raft_retriever.go | grep -A 15 "func (r \*raftRetriever) Stop"

Repository: evstack/ev-node

Length of output: 590


🏁 Script executed:

# Check if there's synchronization in the raft node's Stop or shutdown methods
rg -n -B3 -A10 "func \(n \*Node\) Stop" pkg/raft/node.go

Repository: evstack/ev-node

Length of output: 553


🏁 Script executed:

# Check if the raft library itself (from ethraft package) provides synchronization around FSM.Apply
rg -n "applyCh" pkg/raft/node.go | head -20

Repository: evstack/ev-node

Length of output: 212


🏁 Script executed:

# Check where raftNode is initialized and where Stop is called in context
rg -n -B5 -A15 "func.*NewRaftRetriever|raftNode.*Stop\(\)" block/internal/syncing/raft_retriever.go

Repository: evstack/ev-node

Length of output: 41


🏁 Script executed:

# Check if there's a parent Syncer that coordinates shutdown of both raftNode and raftRetriever
rg -n -B3 -A8 "type Syncer struct|raftRetriever\)" block/internal/syncing/syncer.go | head -40

Repository: evstack/ev-node

Length of output: 328


🏁 Script executed:

# Check the full Stop() implementation and see if there are any shutdown calls to raftNode before SetApplyCallback
cat -n block/internal/syncing/raft_retriever.go | sed -n '60,90p'

Repository: evstack/ev-node

Length of output: 951


🏁 Script executed:

# Find where raftRetriever.Stop() is called
rg -n "raftRetriever.*Stop\(\)|\.Stop\(\)" block/internal/syncing/syncer.go | head -20

Repository: evstack/ev-node

Length of output: 186


🏁 Script executed:

# Get the Syncer.Stop() method to see the full shutdown sequence
rg -n -B3 -A20 "func \(s \*Syncer\) Stop" block/internal/syncing/syncer.go

Repository: evstack/ev-node

Length of output: 646


🏁 Script executed:

# Check where r.raftNode is stopped/shut down
rg -n "raftNode\.*Stop\|raftRetriever.*Stop" block/internal/syncing/raft_retriever.go

Repository: evstack/ev-node

Length of output: 41


Synchronize callback access to prevent data race with FSM.Apply.

Line 77's SetApplyCallback(nil) races with FSM.Apply in pkg/raft/node.go, which reads and sends on applyCh without synchronization. The wg.Wait() only ensures the consumer (raftApplyLoop) has exited, but raftNode continues running and can still invoke Apply(). Guard callback access with a mutex in the raft node implementation.

Suggested fix (in pkg/raft/node.go)
 import (
 	"context"
 	"errors"
 	"fmt"
 	"io"
 	"net"
 	"os"
 	"path/filepath"
 	"strings"
+	"sync"
 	"sync/atomic"
 	"time"
@@
 type FSM struct {
 	logger  zerolog.Logger
 	state   *atomic.Pointer[RaftBlockState]
+	applyMu sync.RWMutex
 	applyCh chan<- RaftApplyMsg
 }
@@
 func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
+	n.fsm.applyMu.Lock()
+	defer n.fsm.applyMu.Unlock()
 	n.fsm.applyCh = ch
 }
@@
-	if f.applyCh != nil {
+	f.applyMu.RLock()
+	ch := f.applyCh
+	f.applyMu.RUnlock()
+	if ch != nil {
 		select {
-		case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
+		case ch <- RaftApplyMsg{Index: log.Index, State: &state}:
 		default:
 			// on a slow consumer, the raft cluster should not be blocked. Followers can sync from DA or other peers, too.
 			f.logger.Warn().Msg("apply channel full, dropping message")
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever.go` at line 77, The call to
r.raftNode.SetApplyCallback(nil) races with FSM.Apply because Apply reads/sends
on applyCh while the raft node may concurrently invoke the callback; fix by
adding a mutex to the raft node to guard access to the apply callback: protect
the callback field and its setter Get/SetApplyCallback (or SetApplyCallback and
any internal invocation sites) with the new mutex so that FSM.Apply (which
reads/sends on applyCh via the callback) cannot see a nil or changing callback
mid-invocation; update the raft node's invocation path that calls the callback
(where Apply is invoked) to acquire the same mutex (or use a read lock) when
reading the callback and release it immediately after obtaining the pointer,
then call the callback outside the lock if needed to avoid long-held locks.

@auricom auricom marked this pull request as draft April 7, 2026 15:24
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 35.00000% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.85%. Comparing base (d2a29e8) to head (465203e).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
pkg/raft/node.go 11.11% 24 Missing ⚠️
node/full.go 0.00% 7 Missing ⚠️
pkg/cmd/run_node.go 0.00% 5 Missing ⚠️
block/internal/syncing/syncer.go 76.92% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   61.67%   61.85%   +0.18%     
==========================================
  Files         120      120              
  Lines       12635    12687      +52     
==========================================
+ Hits         7793     7848      +55     
+ Misses       3968     3963       -5     
- Partials      874      876       +2     
Flag Coverage Δ
combined 61.85% <35.00%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

auricom and others added 2 commits April 8, 2026 17:03
Bug A: RecoverFromRaft crashed with "invalid block height" when the node
restarted after SIGTERM and the EVM state (persisted before kill) was ahead
of the raft FSM snapshot (which hadn't finished log replay yet). The function
now verifies the hash of the local block at raftState.Height — if it matches
the snapshot hash the EVM history is correct and recovery is safely skipped;
a mismatch returns an error indicating a genuine fork.

Bug B: waitForMsgsLanded used two repeating tickers with the same effective
period (SendTimeout/2 poll, SendTimeout timeout), so both could fire
simultaneously in select and the timeout would win even when AppliedIndex >=
CommitIndex. Replaced the deadline ticker with a one-shot time.NewTimer,
added a final check in the deadline branch, and reduced poll interval to
min(50ms, timeout/4) for more responsive detection.

Fixes the crash-restart Docker backoff loop observed in SIGTERM HA test
cycle 7 (poc-ha-2 never rejoining within the 300s kill interval).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SetApplyCallback(nil) called from raftRetriever.Stop() raced with
FSM.Apply reading applyCh: wg.Wait() only ensures the consumer goroutine
exited, but the raft library can still invoke Apply concurrently.

Add applyMu sync.RWMutex to FSM; take write lock in SetApplyCallback and
read lock in Apply before reading the channel pointer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom changed the title Fix raft leader re-election delays after SIGTERM fix: raft leader re-election delays after SIGTERM Apr 8, 2026
auricom and others added 6 commits April 9, 2026 15:14
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…GTERM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to RaftConfig; fix SnapCount default 0→3

Add three new Raft config parameters:
  - ElectionTimeout: timeout for candidate to wait for votes (defaults to 1s)
  - SnapshotThreshold: outstanding log entries that trigger snapshot (defaults to 500)
  - TrailingLogs: log entries to retain after snapshot (defaults to 200)

Fix critical default: SnapCount was 0 (broken, retains no snapshots) → 3

This enables control over Raft's snapshot frequency and recovery behavior to prevent
resync debt from accumulating unbounded during normal operation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nto hashicorp/raft config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r block provenance audit

Add Term field to RaftApplyMsg struct to track the raft term in which each
block was committed. Update FSM.Apply() debug logging to include both
raft_term and raft_index fields alongside block height and hash. This
enables better audit trails and debugging of replication issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom changed the title fix: raft leader re-election delays after SIGTERM fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log Apr 9, 2026
…gering tests

The gci formatter requires single space before inline comments (not aligned
double-space). Also removed TestNodeResignLeader_NotLeaderNoop and
TestNewNode_SnapshotConfigApplied which create real boltdb-backed raft nodes:
boltdb@v1.3.1 has an unsafe pointer alignment issue that panics under
Go 1.25's -checkptr. The nil-receiver test (TestNodeResignLeader_NilNoop)
is retained as it exercises the same guard without touching boltdb.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant