
feat: delete Kafka consumer group on drop #20065

Open · xxchan wants to merge 9 commits into main

Conversation

xxchan (Member) commented Jan 8, 2025

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

fix #18416

Delete the Kafka consumer group when deleting source fragments (backfill is handled in the next PR), i.e., on:

  • DROP (shared) SOURCE
  • DROP MV on (non-shared) source

Implementation details (a minimal sketch follows this list):

  • The deletion is done in meta, when handling source changes.
  • The source manager sends a command to the worker (which owns the SplitEnumerator).
  • If an error happens during deletion, it is ignored.
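
For illustration, a minimal sketch of that flow. This is not the actual meta-side code: SourceWorkerCommand::DropFragments and the surrounding names are assumptions; only SourceWorkerCommand::Tick and on_drop_fragments appear in this PR's diff.

use std::collections::HashMap;
use tokio::sync::mpsc::UnboundedSender;

type SourceId = u32;

enum SourceWorkerCommand {
    /// Ask the worker's SplitEnumerator to clean up the dropped fragments
    /// (for Kafka: delete the per-fragment consumer groups).
    DropFragments(Vec<u32>),
}

/// Called while the source manager handles a source change
/// (DROP SOURCE, or DROP MV on a non-shared source).
fn drop_source_fragments(
    worker_txs: &HashMap<SourceId, UnboundedSender<SourceWorkerCommand>>,
    dropped: HashMap<SourceId, Vec<u32>>,
) {
    for (source_id, fragment_ids) in dropped {
        if let Some(tx) = worker_txs.get(&source_id) {
            // Best effort: consumer-group cleanup must never fail the DDL,
            // so a failure here is only logged and then ignored.
            if let Err(e) = tx.send(SourceWorkerCommand::DropFragments(fragment_ids)) {
                tracing::warn!(source_id, error = %e, "failed to send drop-fragments command");
            }
        }
    }
}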

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

Kafka consumer groups created by the RisingWave Kafka source are now deleted when the corresponding sources or related materialized views are dropped.

xxchan force-pushed the xxchan/split_source branch from 7d1afa8 to 7cbe311 on January 8, 2025 06:23
xxchan force-pushed the xxchan/split_source branch from 7cbe311 to 1dc89f3 on January 8, 2025 10:00
xxchan force-pushed the xxchan/split_source branch from 1dc89f3 to 00d90c8 on January 8, 2025 13:04
Base automatically changed from xxchan/split_source to main on January 8, 2025 13:51
xxchan marked this pull request as ready for review on January 8, 2025 13:57
xxchan requested a review from a team as a code owner on January 9, 2025 06:34
xxchan requested a review from fuyufjh on January 9, 2025 06:34

stdrc (Member) left a comment

LGTM

xxchan (Member Author) commented Jan 9, 2025

2025-01-09T06:54:23.898047083Z DEBUG risingwave_connector::source::kafka::enumerator: delete groups result: [Ok("my_group_non_shared-543")] topic="test_consumer_group_non_shared" fragment_ids=[543]

failed to run `e2e_test/source_inline/kafka/consumer_group_non_shared.slt.serial`

Caused by:
  system command stdout mismatch:
  [command] ./e2e_test/source_inline/kafka/consumer_group.mjs count-groups | grep my_group_non_shared
  [Diff] (-expected|+actual)
  -   my_group_non_shared: 1
  +   my_group_non_shared: 2
  at e2e_test/source_inline/kafka/consumer_group_non_shared.slt.serial:115

strange
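
As an aside, the per-fragment group id in the debug log above appears to be composed as <group.id.prefix>-<fragment_id>. A tiny illustrative helper (the function name is hypothetical, and the format is inferred from that log line rather than taken from this diff):

fn consumer_group_id(group_id_prefix: &str, fragment_id: u32) -> String {
    // Inferred format: ("my_group_non_shared", 543) -> "my_group_non_shared-543",
    // matching the group deleted in the log above.
    format!("{group_id_prefix}-{fragment_id}")
}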

xxchan added 2 commits January 9, 2025 18:44
xxchan added 7 commits January 9, 2025 18:44

tabVersion (Contributor) left a comment

One more thing to confirm: we do not remove the source backfill's consumer group until the relevant streaming job is dropped.

@@ -196,6 +168,70 @@ impl SplitEnumerator for KafkaSplitEnumerator {

        Ok(ret)
    }

    async fn on_drop_fragments(&mut self, fragment_ids: Vec<u32>) -> ConnectorResult<()> {
        let admin = build_kafka_admin(&self.config, &self.properties).await?;

Contributor:

Does this mean we need to create an admin client each time we drop a consumer group? Shall we reuse the client, just like we reuse the one used for the normal split listing?

xxchan (Member Author):

I'm worried that this increases unnecessary idle clients.

xxchan (Member Author):

A user also once asked why we don't close the connection after fetching metadata: #18949 (comment)

"If that is the case, why not close the consumer after fetching the metadata?"

Contributor:

This seems to be a limitation of the Kafka library: even just for fetching metadata, we need to create consumers, and the consumers will immediately connect to all brokers (instead of only the bootstrap server).

"If that is the case, why not close the consumer after fetching the metadata?"

I think the user mistakenly believes we do nothing after the first fetch. But actually, we keep querying the metadata, and we cannot afford to recreate a client for each tick.

"I'm worried that this increases unnecessary idle clients."

It is just one additional client per cluster, per source. I think that should be acceptable. If you still have concerns, we could run a benchmark against MSK to see how much creating multiple clients simultaneously impacts the broker and the local server.

xxchan (Member Author):

"we cannot afford to recreate a client for each tick"

For tick, yes; but the admin client used for drop_fragments here is by nature short-lived.

Contributor:

As I mentioned before,

"I am afraid that iterating over all sources and creating an admin client for each one during migration will put too much pressure on the Kafka broker."

The concern is about creating many clients in a short period of time, rather than how long each client lives.

And again, a benchmark can resolve our gap here; I hope it corrects my misunderstanding of the overhead of creating Kafka clients.
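
To make the shape of the change concrete, here is a minimal, self-contained sketch (assuming the rdkafka crate's admin API) of deleting per-fragment consumer groups with a short-lived admin client. The helper name and the group-id format are not taken from this PR's diff; the format is only inferred from the debug log earlier in the conversation.

use rdkafka::admin::{AdminClient, AdminOptions};
use rdkafka::client::DefaultClientContext;
use rdkafka::error::KafkaResult;
use rdkafka::ClientConfig;

async fn delete_fragment_groups(
    brokers: &str,
    group_id_prefix: &str,
    fragment_ids: &[u32],
) -> KafkaResult<()> {
    // A fresh admin client per call: it lives only for the duration of the DROP,
    // at the cost of one extra broker connection while the deletion runs.
    let admin: AdminClient<DefaultClientContext> = ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .create()?;

    let group_ids: Vec<String> = fragment_ids
        .iter()
        .map(|id| format!("{group_id_prefix}-{id}"))
        .collect();
    let group_refs: Vec<&str> = group_ids.iter().map(String::as_str).collect();

    // Each entry is Ok(group) or Err((group, error code)); the PR ignores failures,
    // so a caller would just log this result.
    let results = admin.delete_groups(&group_refs, &AdminOptions::new()).await?;
    tracing::debug!(?results, "delete groups result");
    Ok(())
}

Whether this per-drop client creation is acceptable at migration-scale fan-out is exactly the open question above; the suggested MSK benchmark would settle it.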

            }
        }
    } else {
        for (source_id, fragment_ids) in &mut self.source_fragments {

Contributor:

I am afraid that iterating over all sources and creating an admin client for each one during migration will put too much pressure on the Kafka broker.

src/connector/src/source/kafka/enumerator.rs

Comment on lines +346 to +354

    /// Force [`ConnectorSourceWorker::tick()`] to be called.
    pub async fn force_tick(&self) -> MetaResult<()> {
        let (tx, rx) = oneshot::channel();
        self.send_command(SourceWorkerCommand::Tick(tx))?;
        rx.await
            .context("failed to receive tick command response from source worker")?
            .context("source worker tick failed")?;
        Ok(())
    }

Contributor:

I don't fully agree with making force_tick rely on the command channel. In the previous implementation, each enumerator's force_tick was independent of the others, but in this implementation each request to force_tick must be handled sequentially.

Consider this case: the channel contains (drop_fragment(x), force_tick(y)). While handling drop_fragment(x), the network to x's broker is unstable and the call may hang for a while, or forever, which blocks the call to force_tick(y).

A real-world case: we call force_tick when creating a source. To the user it looks like everything is fine but the SQL never returns, even though the root cause is irrelevant to the source being created. That can cause a lot of trouble during a POC.

xxchan (Member Author):

"A real-world case: we call force_tick when creating a source"

No, it will only be called in rare cases, when splits is None. When handling CREATE SOURCE, we call create_source_worker, which ticks before creating the source worker.

xxchan (Member Author):

I don't think it's a big deal to handle force_tick separately here. If drop_fragment hangs, tick is also unlikely to succeed.

Contributor:

"If drop_fragment hangs, tick is also unlikely to succeed"

I don't quite follow the logic here. drop_fragment and force_tick can happen to different sources; why would they tend to fail together?

xxchan (Member Author) commented Jan 10, 2025:

One ConnectorSourceWorker corresponds to one SplitEnumerator.

"In the previous implementation, each enumerator's force_tick was independent of the others"

Enumerators are still independent of each other now; this has not changed.
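
To illustrate the sequencing concern from this thread, here is a minimal, self-contained sketch (the names are illustrative, not the PR's): commands on one worker's channel are drained one at a time, so a slow DropFragments delays a later Tick on the same worker, while workers for other sources, each with their own channel and enumerator, are unaffected.

use std::time::Duration;
use tokio::sync::{mpsc, oneshot};

enum Cmd {
    DropFragments(Vec<u32>),
    Tick(oneshot::Sender<()>),
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();

    // One worker per source: it drains its own command queue sequentially.
    tokio::spawn(async move {
        while let Some(cmd) = rx.recv().await {
            match cmd {
                Cmd::DropFragments(ids) => {
                    // Simulate a slow or unresponsive broker during group deletion.
                    println!("dropping fragments {ids:?} (slow)...");
                    tokio::time::sleep(Duration::from_secs(2)).await;
                }
                Cmd::Tick(done) => {
                    println!("tick");
                    let _ = done.send(());
                }
            }
        }
    });

    tx.send(Cmd::DropFragments(vec![543])).unwrap();
    let (done_tx, done_rx) = oneshot::channel();
    tx.send(Cmd::Tick(done_tx)).unwrap();
    // The tick only completes after the slow drop finishes (about 2 seconds later).
    done_rx.await.unwrap();
}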

Successfully merging this pull request may close these issues.

delete Kafka consumer group when job is dropped