Conversation

Contributor

@apoorvmittal10 apoorvmittal10 commented Aug 21, 2025

The PR fixes the batch alignment issue when partitions are re-assigned.
During the initial read of state the batches can be broken arbitrarily.
Say the start offset is 10 and the cache contains the batch [15-18] after
initialization. When a fetch happens at offset 10 and the fetched batch
contains 10 records, i.e. [10-19], then correct batches will be created
as long as maxFetchRecords is greater than 10. But if maxFetchRecords is
less than 10, the last offset of the batch is determined, which will be 19.
Hence the acquire method will incorrectly create a batch of [10-19] while
[15-18] already exists. The check below is required to resolve the issue:

if (isInitialReadGapOffsetWindowActive() && lastAcquiredOffset > lastOffset) {
    lastAcquiredOffset = lastOffset;
}
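
For illustration, here is a minimal, self-contained sketch of the scenario and the clamp, assuming a simplified model of the acquisition path; the class name and the concrete value of lastOffset are illustrative and not taken from the actual SharePartition code.

// Minimal sketch of the scenario above (illustrative only, not the actual
// SharePartition.acquire code). The cached state already holds the batch
// [15-18] from the initial persister read, the fetch at offset 10 returns
// the batch [10-19], and maxFetchRecords is smaller than the batch size.
public class InitialReadGapClampSketch {

    public static void main(String[] args) {
        long batchLastOffset = 19;   // last offset of the fetched batch [10-19]
        long lastOffset = 14;        // assumed last offset for acquisition, stopping
                                     // just before the cached [15-18] batch
        boolean initialReadGapOffsetWindowActive = true;

        // Without the clamp the acquisition falls back to the batch's last offset,
        // creating an acquired batch [10-19] that overlaps the cached [15-18] batch.
        long lastAcquiredOffset = batchLastOffset;

        // The fix: while the initial read gap window is active, never acquire past
        // the last offset computed for this acquisition.
        if (initialReadGapOffsetWindowActive && lastAcquiredOffset > lastOffset) {
            lastAcquiredOffset = lastOffset;
        }

        System.out.println("Acquired batch: [10-" + lastAcquiredOffset + "]"); // prints [10-14]
    }
}

With the clamp in place the acquired batch ends at 14 and the cached [15-18] batch is left untouched.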

While testing other cases, further issues were found with updating the
gap offset, acquiring records prior to the share partition's end offset,
and determining the next fetch offset with compacted topics. All of these
issues can arise mainly during the initial read window after a partition
re-assignment.

Reviewers: Andrew Schofield aschofield@confluent.io, Abhinav Dixit adixit@confluent.io, Chirag Wadhwa cwadhwa@confluent.io

@github-actions bot added the triage (PRs from the community), core (Kafka Broker) and KIP-932 (Queues for Kafka) labels Aug 21, 2025
@apoorvmittal10 added the ci-approved label and removed the triage (PRs from the community) label Aug 21, 2025
Member

@AndrewJSchofield AndrewJSchofield left a comment

Thanks for the PR. I've done an initial review and the testing looks comprehensive. I want to take another pass on the SharePartition code.


// Create a single batch record that covers the entire range from 10 to 30 of initial read gap.
// The records in the batch are from 10 to 49.
MemoryRecords records = memoryRecords(40, 10);
Member

nit: One thing that makes this a bit harder than necessary to review is the inconsistency in the conventions about the offset ranges. For example, this could read memoryRecords(10,49) which would align with the firstOffset, lastOffset convention used in the persister. Not something that needs to be fixed on this PR, but potentially something to refactor later on.

Contributor Author

Sure, I'll refactor in a subsequent PR.

assertEquals(RecordState.AVAILABLE, sharePartition.cachedState().get(26L).offsetState().get(29L).state());
assertEquals(RecordState.AVAILABLE, sharePartition.cachedState().get(26L).offsetState().get(30L).state());
assertEquals(30L, sharePartition.endOffset());
assertNotNull(sharePartition.initialReadGapOffset());
Contributor

Maybe I'm missing something, but we have filled the existing gaps, whatever is left is not actually a gap, right? So ideally the code should have been done in a way that this is null, wdyt?

Collaborator

The initialReadGapOffset is not intelligent enough to detect all individual gaps in the cachedState. Instead, it remains active until the cached state has been checked for gaps up to the offset initialReadGapOffset.endOffset. Additionally, its purpose is only to track the gaps introduced by the persister.readState request. So, once we acquire new batches past initialReadGapOffset.endOffset, which makes initialReadGapOffset null, new gaps may still be introduced to the cached state as per the natural gaps in the underlying partition, but initialReadGapOffset is not responsible for them.

So as per the example above, in the subsequent fetch requests, when the remaining records 29 and 30 are acquired again, the code will have reached initialReadGapOffset.endOffset (which is 30 in this case), and initialReadGapOffset will then be set to null.
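
To make this life-cycle concrete, here is a minimal sketch of the behaviour described above, assuming a simplified GapWindow-style holder; the field and method names are illustrative and differ from the real SharePartition code (see the rename to persisterReadResultGapWindow/GapWindow discussed later in this thread).

// Illustrative model of the initial read gap window's life-cycle (not the
// actual SharePartition code). The window only tracks the range that still
// has to be checked for gaps left by the initial persister.readState response.
public class GapWindowLifecycleSketch {

    // Hypothetical holder: gapStartOffset is the next offset to check, endOffset
    // is the last offset covered by the persister read at initialization time.
    static class GapWindow {
        long gapStartOffset;
        final long endOffset;

        GapWindow(long gapStartOffset, long endOffset) {
            this.gapStartOffset = gapStartOffset;
            this.endOffset = endOffset;
        }
    }

    private GapWindow initialReadGapWindow = new GapWindow(10, 30);

    // Called after records up to lastAcquiredOffset have been acquired and added
    // to the cached state. Once endOffset has been checked the window is dropped;
    // later gaps are natural gaps in the partition and are not tracked here.
    void maybeUpdateGapWindow(long lastAcquiredOffset) {
        if (initialReadGapWindow == null) {
            return;
        }
        if (lastAcquiredOffset >= initialReadGapWindow.endOffset) {
            initialReadGapWindow = null;  // window fully checked, stop tracking
        } else {
            initialReadGapWindow.gapStartOffset = lastAcquiredOffset + 1;
        }
    }
}

In this model, acquiring the remaining records 29 and 30 moves the acquisition to endOffset (30), at which point the window is set to null, matching the behaviour described above.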

Contributor Author

Yeah, the endOffset is 30 but the gapOffset is still 28. Also, it will only happen when the endOffset is past the gap's endOffset. So when anything further is newly acquired, the gap tracking will become null.

Contributor

@adixitconfluent adixitconfluent Aug 26, 2025

Perhaps the variable initialGapOffsetReadWindow is a better variable name, if it makes sense @apoorvmittal10

Contributor

As discussed offline, we are going to make the variable name change in a different PR.

Contributor Author

As discussed, I'll open the next PR with the following changes:

`persisterReadResultGapWindow` as the variable name and `GapWindow` as the class name.

Collaborator

@chirag-wadhwa5 chirag-wadhwa5 left a comment

Thanks for the PR. Left some minor comments.

// If the initial read gap offset window is active then it's not guaranteed that the
// batches align on batch boundaries. Hence, reset to last offset itself if the batch's
// last offset is greater than the last offset for acquisition, else there could be
// a situation where the batch overlaps with the initial read gap offset window batch.
Collaborator

Thanks for the PR. I think a small example here would be better, like the one provided above for the gapStartOffset update change in nextFetchOffset.

Contributor Author

Good suggestion, done.

MEMBER_ID,
BATCH_SIZE,
1,
DEFAULT_FETCH_OFFSET,
Collaborator

nit: I believe it should not be DEFAULT_FETCH_OFFSET, but 10, since that is the nextFetchOffset in this case.

Contributor Author

Yeah, good to set it to the offset where the fetch happened, done.

MEMBER_ID,
BATCH_SIZE,
10,
DEFAULT_FETCH_OFFSET,
Collaborator

Likewise, here it should be 15 as per the new nextFetchOffset value.

Contributor Author

Done.

MEMBER_ID,
BATCH_SIZE,
1,
DEFAULT_FETCH_OFFSET,
Collaborator

And here it should be 31.

Collaborator

@chirag-wadhwa5 chirag-wadhwa5 left a comment

LGTM

@apoorvmittal10 apoorvmittal10 merged commit 49ee1fb into apache:trunk Aug 26, 2025
25 checks passed
@apoorvmittal10 apoorvmittal10 deleted the KAFKA-19632 branch August 26, 2025 12:51
apoorvmittal10 added a commit that referenced this pull request Aug 27, 2025
As per the suggestion by @adixitconfluent and @chirag-wadhwa5 [here](#20395 (comment)), I have refactored the code with updated variable and method names.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Chirag Wadhwa <cwadhwa@confluent.io>