Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v24.3.x] Fix stepping down on timeout #24773

Open
wants to merge 8 commits into
base: v24.3.x
Choose a base branch
from

Conversation

mmaslankaprv
Copy link
Member

Backport of PR #24590

The `raft::reply_result::follower_busy` is indicating that the follower
was unable to process the heartbeat fast enough to generate a response.
Renaming the reply from `timeout` will make it less confusing for the
reader and differentiate the error code from an RPC timeout.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 6a1e34b)
Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 95a29db)
Wired raft RPC service handler into Raft fixture to make the tests more
accurate and cover the service code with tests.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 5f69d9b)
@mmaslankaprv mmaslankaprv added this to the v24.3.x-next milestone Jan 10, 2025
@mmaslankaprv mmaslankaprv added the kind/backport PRs targeting a stable branch label Jan 10, 2025
@mmaslankaprv mmaslankaprv requested a review from bharathv January 10, 2025 14:17
Propagating timeout to the node sending RPC request is crucial for
accurate testing of Raft implementation.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 7d33bb5)
Added a wrapper around the `storage::log` allowing us to inject storage
layer failures in Raft fixture tests.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit f04995a)
When follower is busy it may fail fast processing full heartbeat
requests sent by the leader. In this case a follower RPC handler sets
the `follower_busy` result in heartbeat_reply. Leader should still treat
a follower replica as online in this case. The replica hosting node must
be online to reply with the `follower_busy` error.

This way we prevent to eager leader step downs when follower replicas
are slow.
Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 8b57b42)
Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 67e7c6e)
Fixed previously unstable test. Now the test simply blocks append entry
requests from leader instead of relying on uncertain timeouts. Added
waiting for enqueue of replicate requests to make sure the requests
landed in the buffer before the leadership changed.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv force-pushed the manual-backport-24590-v24.3.x-659 branch from f056ac4 to db89ec7 Compare January 10, 2025 16:14
@mmaslankaprv mmaslankaprv marked this pull request as ready for review January 10, 2025 16:28
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#60601
test_id test_kind job_url test_status passed
gtest_raft_rpunit.gtest_raft_rpunit unit https://buildkite.com/redpanda/redpanda/builds/60601#0194510f-fb2f-46bb-9b31-fd2ff5692505 FLAKY 1/2

@@ -91,20 +91,31 @@ TEST_P_CORO(monitor_test_fixture, truncation_detection) {

for (auto& [id, node] : nodes()) {
if (id == leader) {
node->on_dispatch(
[](model::node_id, raft::msg_type) { return ss::sleep(3s); });
node->on_dispatch([](model::node_id, raft::msg_type mt) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this diff be in the dev too? This is not specific to 24.3.x, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/build area/redpanda kind/backport PRs targeting a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants