Admin/XMover: Add CrateDB shard analyzer and movement tool #523

Draft: wants to merge 17 commits into main

Conversation


@amotl amotl commented Aug 19, 2025

About

@WalBeh contributed a powerful utility that uses CrateDB's system tables to detect cluster imbalances in shard count and shard size distribution across the whole cluster. Thanks!

After analysing the situation, the program presents solutions in the form of SQL commands that bring the cluster back into a more balanced state.
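
To give a flavour of the output, the emitted statements use CrateDB's `ALTER TABLE ... REROUTE MOVE SHARD` syntax, the same form that appears in the query gallery reviewed below. A minimal sketch of the `to_sql()` helper mentioned in the walkthrough, with field names that are assumptions rather than the PR's exact model, could look like this:

```python
from dataclasses import dataclass


@dataclass
class ShardRelocationRequest:
    """Illustrative sketch only; the real model in the PR may differ."""

    schema_name: str
    table_name: str
    shard_id: int
    from_node: str
    to_node: str

    def to_sql(self) -> str:
        # CrateDB reroute syntax, as used in the XMover docs added by this PR.
        return (
            f'ALTER TABLE "{self.schema_name}"."{self.table_name}" '
            f"REROUTE MOVE SHARD {self.shard_id} "
            f"FROM '{self.from_node}' TO '{self.to_node}';"
        )


print(ShardRelocationRequest("curvo", "bottlefielddata", 21, "data-hot-2", "data-hot-3").to_sql())
```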

Install

uv pip install --upgrade 'cratedb-toolkit @ git+https://github.com/crate/cratedb-toolkit.git@xmover'

Documentation

https://cratedb-toolkit--523.org.readthedocs.build/admin/xmover/

References


coderabbitai bot commented Aug 19, 2025

Walkthrough

Adds XMover: a new administrative tool for CrateDB that analyzes shard distribution, recommends and optionally executes safe shard relocations, validates moves, monitors recoveries, and exposes a Click CLI, data models, analysis/operational modules, utilities, tests, docs, and a script entry point.
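
For orientation, the command wiring follows the usual Click pattern of a group with subcommands. The sketch below is an assumption about the general shape, not the PR's actual code; only the command names come from the summary above:

```python
from typing import Optional

import click


@click.group()
@click.version_option(version="0.0")  # placeholder version for this sketch
@click.pass_context
def main(ctx: click.Context) -> None:
    """XMover: analyze and rebalance CrateDB shard distribution."""
    ctx.ensure_object(dict)


@main.command()
@click.option("--table", "-t", help="Analyze a specific table only.")
def analyze(table: Optional[str]) -> None:
    """Show shard distribution across nodes and zones."""
    click.echo(f"Analyzing table={table or 'all tables'}")


if __name__ == "__main__":
    main()
```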

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Package & CLI Integration: CHANGES.md, pyproject.toml, cratedb_toolkit/cli.py, cratedb_toolkit/admin/xmover/__init__.py | Announce XMover in changelog; add rich dependency and script entrypoint scripts.xmover; register top-level xmover CLI subcommand; add package metadata for xmover. |
| Top-level CLI Implementation: cratedb_toolkit/admin/xmover/cli.py | New Click-based CLI wiring to initialize CrateDB client, expose analyze/find-candidates/recommend/validate/monitor/check-balance/explain commands, render via Rich, support dry-run/auto-exec and connection testing. |
| Core Models: cratedb_toolkit/admin/xmover/model.py | New dataclasses: NodeInfo, ShardInfo, RecoveryInfo, ShardRelocationRequest/Response, DistributionStats, SizeCriteria, ShardRelocationConstraints; includes helpers like to_sql() and safety_score. |
| Database Client: cratedb_toolkit/admin/xmover/util/database.py | New CrateDBClient for HTTP /_sql queries and helpers: execute_query, get_nodes_info, get_shards_info, summaries, recovery queries, watermarks, and recovery parsing. |
| Formatting & Error Helpers: cratedb_toolkit/admin/xmover/util/format.py, .../util/error.py | Add size/percentage/translog formatters and an explain_cratedb_error helper with pattern-matching diagnostics and Rich output. |
| Shard Analysis & Reporting: cratedb_toolkit/admin/xmover/analysis/shard.py, .../analysis/table.py, .../analysis/zone.py | Implement ShardAnalyzer, ShardReporter, DistributionAnalyzer and ZoneReport: cluster/table/zone distribution analysis, imbalance/anomaly detection, move validation, decommission planning, and Rich rendering. |
| Operational Modules: cratedb_toolkit/admin/xmover/operational/candidates.py, .../operational/recommend.py, .../operational/monitor.py | CandidateFinder, ShardRelocationRecommender, and RecoveryMonitor: find candidate shards, generate/validate/execute relocation plans with recovery-aware sequencing, and monitor ongoing recoveries (watch mode). |
| Attic / Skeleton: cratedb_toolkit/admin/xmover/attic.py | Commented skeleton for a decommission command (non-active documentation/skeleton). |
| Documentation: doc/admin/index.md, doc/admin/xmover/* (handbook.md, index.md, queries.md, troubleshooting.md), doc/index.md | Add admin docs section and comprehensive XMover docs: handbook, query gallery, troubleshooting, and index updates. |
| Tests: tests/admin/test_cli.py | Add CLI tests invoking multiple xmover subcommands and validate-move success/error scenarios. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as xmover CLI
  participant Client as CrateDBClient
  participant Analyzer as ShardAnalyzer
  participant Recommender as ShardRelocationRecommender
  participant DB as CrateDB

  User->>CLI: xmover recommend [options]
  CLI->>Client: init + test_connection()
  CLI->>Analyzer: init(client)
  CLI->>Recommender: execute(constraints, auto_execute, validate, dry_run)
  Recommender->>Analyzer: generate_rebalancing_recommendations(constraints)
  Analyzer->>Client: get_nodes_info()/get_shards_info()
  Client-->>Analyzer: nodes, shards
  Analyzer-->>Recommender: recommendations
  alt auto_execute and not dry_run
    Recommender->>Client: execute_query(ALTER TABLE ... REROUTE MOVE SHARD ...)
    Client->>DB: POST /_sql
    DB-->>Client: result
    Client-->>Recommender: success/failure
    Recommender->>Recommender: _wait_for_recovery_capacity()
  else dry_run
    Recommender-->>CLI: render recommendations (no execution)
  end
  CLI-->>User: output table, SQL, status
sequenceDiagram
  autonumber
  actor User
  participant CLI as xmover monitor-recovery
  participant Client as CrateDBClient
  participant Monitor as RecoveryMonitor
  participant DB as CrateDB

  User->>CLI: xmover monitor-recovery [--watch]
  CLI->>Monitor: start(watch)
  loop refresh_interval (watch mode)
    Monitor->>Client: get_all_recovering_shards(filters)
    Client->>DB: query sys.allocations/sys.shards
    DB-->>Client: rows
    Client-->>Monitor: RecoveryInfo[]
    Monitor-->>CLI: formatted table + deltas
  end
  CLI-->>User: summary
sequenceDiagram
  autonumber
  actor User
  participant CLI as xmover validate-move
  participant Client as CrateDBClient
  participant Analyzer as ShardAnalyzer

  User->>CLI: xmover validate-move <schema.table> <shard> <from> <to>
  CLI->>Analyzer: init(client)
  CLI->>Analyzer: validate_move_safety(recommendation, max_disk_usage)
  Analyzer->>Client: lookups (nodes/shards)
  Client-->>Analyzer: details
  Analyzer-->>CLI: (is_safe, reason)
  CLI-->>User: verdict + SQL command

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nudge the shards with twitchy care,
Across the zones, through balanced air.
I map their sizes, I plot their hops,
I wait on recoveries — no reckless drops.
Thump, thump, I nudge them home — then eat my carrot. 🥕

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

🧹 Nitpick comments (36)
CHANGES.md (1)

4-5: Unify changelog style and grammar; keep “Thanks” on the same bullet

Across the file, “Thanks, …” is typically on the same line as the bullet. Merge it for consistency and tighten casing.

- - Admin: Added XMover - CrateDB Shard Analyzer and Movement Tool.
-   Thanks, @WalBeh.
+ - Admin: Added XMover - CrateDB shard analyzer and movement tool. Thanks, @WalBeh.
doc/admin/xmover/queries.md (2)

88-93: Validate identifier casing and quoting in REROUTE statement

CrateDB lowercases unquoted identifiers. Ensure the quoted identifiers match the actual table casing, or drop quotes and use lowercase consistently to avoid “relation does not exist”.

-alter table "curvo"."bottlefieldData" reroute move shard 21 from 'data-hot-2' to 'data-hot-3';
+ALTER TABLE curvo.bottlefielddata REROUTE MOVE SHARD 21 FROM 'data-hot-2' TO 'data-hot-3';
+-- or keep quotes if the table was created with mixed-case identifiers:
+-- ALTER TABLE "curvo"."bottlefieldData" REROUTE MOVE SHARD 21 FROM 'data-hot-2' TO 'data-hot-3';

1-3: Minor: Fix top-of-page anchor formatting

The explicit anchor looks fine, but LanguageTool flagged it. If you see Sphinx warnings, consider moving the anchor to the title line:

-(xmover-queries)=
-# XMover Query Gallery
+# XMover Query Gallery
+(xmover-queries)=
cratedb_toolkit/admin/xmover/database.py (3)

373-388: Subquery for node filter may return multiple rows; prefer IN or LIMIT 1

If duplicate node names existed (misconfiguration, but possible), “=” with a multi-row subquery would error. Safer to use IN.

-            where_conditions.append("node_id = (SELECT id FROM sys.nodes WHERE name = ?)")
+            where_conditions.append("node_id IN (SELECT id FROM sys.nodes WHERE name = ?)")

144-151: Preserve server error details in exceptions

Wrapping exceptions loses useful diagnostics from CrateDB (JSON “error” body). Include response text where available to speed up troubleshooting.

-        except requests.exceptions.RequestException as e:
-            raise Exception(f"Failed to execute query: {e}") from e
+        except requests.exceptions.RequestException as e:
+            detail = ""
+            if getattr(e, "response", None) is not None:
+                try:
+                    detail = f" | body={e.response.text}"
+                except Exception:
+                    pass
+            raise Exception(f"Failed to execute query: {e}{detail}") from e

118-133: Optional: support credentials from the URL as a fallback

If the connection string contains userinfo (https://user:pass@host), parse and use it when env vars are absent. Not critical, but improves ergonomics.

+        # Optionally parse credentials from the connection string if not provided via env
+        if not (self.username and self.password):
+            try:
+                from urllib.parse import urlparse
+                parsed = urlparse(self.connection_string)
+                if parsed.username and parsed.password:
+                    self.username = parsed.username
+                    self.password = parsed.password
+            except Exception:
+                pass
doc/admin/index.md (1)

3-5: Add maxdepth to the toctree for consistent navigation

Small UX improvement for the docs sidebar depth.

Apply this diff:

 ```{toctree}
-xmover/index
+:maxdepth: 1
+
+xmover/index

pyproject.toml (1)

`113-113`: **Align rich version constraints across base and extras**

The base dependency pins `rich<14` (pyproject.toml line 113), but both `optional-dependencies.docs-api` (line 158) and `optional-dependencies.dynamodb` (line 211) declare `rich>=3.3.2,<15`. While the resolver will still converge to `<14`, keeping these in sync will reduce confusion for future maintainers.  

• pyproject.toml 113: `"rich<14"`  
• pyproject.toml 158: `"rich>=3.3.2,<15"`  
• pyproject.toml 211: `"rich>=3.3.2,<15"`  

Recommendation:  
- Update the extras to `"rich>=3.3.2,<14"`, or  
- Add a comment explaining why the extras are permitted to allow 14.x if that was intentional.

doc/admin/xmover/index.md (2)

`7-10`: **Tighten intro phrasing (optional)**

Minor style: remove “looking-glass” idiom for clarity.


Apply this diff:

-A comprehensive looking-glass utility for analyzing CrateDB shard
-distribution across nodes and availability zones. It generates safe
+A comprehensive utility for analyzing CrateDB shard distribution across nodes
+and availability zones. It generates safe
 SQL commands for shard rebalancing and node decommissioning.

26-29: Use consistent title casing in toctree entries

“Query gallery” → “Query Gallery” for consistency with “Handbook” and “Troubleshooting”.

-Query gallery <queries>
+Query Gallery <queries>
doc/admin/xmover/handbook.md (6)

68-76: Avoid duplicate section titles (“Advanced Troubleshooting”)

This page also uses “Advanced Troubleshooting” later; consider renaming this one to avoid duplicated anchors and improve scanability.

-### Advanced Troubleshooting
+### Advanced Operations

371-375: Specify a language for fenced code block

Helps markdown linters and improves rendering.

-```
+```text
 https://hostname:port

---

`410-412`: **Specify language for error snippets**

Use text for fenced error examples.


-   ```
+   ```text
    Error: "NO(a copy of this shard is already allocated to this node)"
    ```

418-420: Specify language for error snippets

Use text for fenced error examples.

-   ```
+   ```text
    Error: "too many copies of the shard allocated to nodes with attribute [zone]"
    ```

426-428: Specify language for error snippets

Use text for fenced error examples.

-   ```
+   ```text
    Error: "not enough disk space"
    ```

434-436: Specify language for error snippets

Use text for fenced error examples.

-   ```
+   ```text
    Error: "Target node disk usage too high (85.3%)"
    ```
doc/admin/xmover/troubleshooting.md (5)

286-289: Unify validate-move argument names for consistency with CLI help.

Elsewhere you use FROM_NODE/TO_NODE. Here it says FROM TO. Recommend using the explicit names to avoid ambiguity.

-   xmover validate-move SCHEMA.TABLE SHARD_ID FROM TO
+   xmover validate-move SCHEMA.TABLE SHARD_ID FROM_NODE TO_NODE

400-403: Replace bare URLs with Markdown links to satisfy MD034.

Markdownlint flags bare URLs. Wrap them as proper links.

-- **CrateDB Documentation**: https://crate.io/docs/
-- **Shard Allocation Guide**: https://crate.io/docs/crate/reference/en/latest/admin/system-information.html
-- **Cluster Settings**: https://crate.io/docs/crate/reference/en/latest/config/cluster.html
+- **CrateDB Documentation**: [https://crate.io/docs/](https://crate.io/docs/)
+- **Shard Allocation Guide**: [https://crate.io/docs/crate/reference/en/latest/admin/system-information.html](https://crate.io/docs/crate/reference/en/latest/admin/system-information.html)
+- **Cluster Settings**: [https://crate.io/docs/crate/reference/en/latest/config/cluster.html](https://crate.io/docs/crate/reference/en/latest/config/cluster.html)

40-49: Consider using headings instead of emphasized “Step” labels (MD036).

Multiple sections use bold emphasis for steps. Converting them to level-4 headings improves structure and fixes the linter warning.

Example:

-**Step 1: Analyze Current Distribution**
+#### Step 1: Analyze Current Distribution

Apply similarly for other “Step X” occurrences flagged by the linter.


214-236: Harden the curl example with explicit content type and quoting.

Small doc hardening to encourage safe/portable usage across shells.

-curl -u username:password https://your-cluster:4200/_sql -d '{"stmt":"SELECT 1"}'
+curl -u 'username:password' \
+  -H 'Content-Type: application/json' \
+  'https://your-cluster:4200/_sql' \
+  -d '{"stmt":"SELECT 1"}'

64-67: Clarify that --recommend dry-run prevents SQL emission, not execution.

The CLI generates SQL only when --execute is used and never auto-executes unless --auto-execute is provided with confirmations. A short clarification will reduce confusion.

-- Enable dry-run mode by default: `xmover recommend --dry-run`
+- Enable dry-run mode by default: `xmover recommend --dry-run` (no SQL is emitted)
cratedb_toolkit/admin/xmover/cli.py (10)

21-29: Use consistent binary units (GiB/TiB) to match DB-side 1024-based calculations.

The formatter uses 1000 while size math in database/analyzer uses 1024**3. Consider aligning to 1024 for consistency.

-    if size_gb >= 1000:
-        return f"{size_gb / 1000:.1f}TB"
+    if size_gb >= 1024:
+        return f"{size_gb / 1024:.1f}TiB"
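
For illustration, a 1024-based variant of the formatter could look like the following sketch (not the PR's implementation):

```python
def format_size(size_gb: float) -> str:
    """Render a size given in GiB, switching to TiB at 1024 (sketch only)."""
    if size_gb >= 1024:
        return f"{size_gb / 1024:.1f}TiB"
    return f"{size_gb:.1f}GiB"
```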

79-90: Prefer click.Abort for CLI termination over sys.exit for consistency.

click.Abort integrates better with Click and avoids stack traces if debug flags are added later.

-            sys.exit(1)
+            raise click.Abort()
...
-        sys.exit(1)
+        raise click.Abort()

357-367: Surface analyzer progress via Rich instead of print to keep output styling consistent.

analyzer.generate_rebalancing_recommendations prints to stdout. Consider allowing a callback/logger to route progress through rich.

I can provide a patch once we agree on the preferred mechanism (e.g., pass console or a logging callback).
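
One possible shape for that mechanism, sketched here as a free function with an optional callback (names are placeholders, not the PR's API):

```python
from typing import Callable, List, Optional


def generate_rebalancing_recommendations(
    constraints: object,
    progress: Optional[Callable[[str], None]] = None,
) -> List[object]:
    """Sketch: report progress through a callback instead of printing directly."""

    def report(message: str) -> None:
        if progress is not None:
            progress(message)

    report("Collecting node and shard information ...")
    # ... the actual analysis would happen here ...
    report("Scoring candidate moves ...")
    return []
```

The CLI could then pass `progress=console.print` to keep Rich styling, while library callers stay quiet by default.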


468-499: Ensure skip counters are defined in both branches and report them consistently.

You already define safe_moves/zone_conflicts in both dry-run and execute branches. Good. Consider adding a consistent summary in both modes to reduce confusion.


511-530: Mask connection errors cleanly and exit via click.Abort.

Minor polish: stick to click.Abort for a uniform exit path.

-            sys.exit(1)
+            raise click.Abort()
...
-        sys.exit(1)
+        raise click.Abort()

692-709: Example table name uses mixed case that may need quoting in CrateDB.

CUROV.maddoxxFormfactor with mixed case could mislead users if their identifiers aren’t quoted. Consider using lowercase examples or explicitly mention quoting rules.


973-1001: Watch loop: consider clearing or compacting output to avoid unbounded scroll.

The watch mode prints a new status line every interval. For long sessions, consider an option to refresh in place (e.g., live panel) or throttle unchanged updates.

I can prototype a Live-based Rich table that re-renders in place.
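
A rough sketch of such a Live-based loop, where `fetch_rows` stands in for the monitor's query (an assumed helper, not part of the PR):

```python
import time
from typing import Callable, Iterable, Tuple

from rich.live import Live
from rich.table import Table


def _render(rows: Iterable[Tuple[str, float]]) -> Table:
    table = Table(title="Recovering shards")
    table.add_column("Shard")
    table.add_column("Progress")
    for shard, percent in rows:
        table.add_row(shard, f"{percent:.0f}%")
    return table


def watch(fetch_rows: Callable[[], Iterable[Tuple[str, float]]], refresh_interval: int = 10) -> None:
    """Sketch: re-render one table in place instead of appending new output."""
    with Live(_render(fetch_rows()), refresh_per_second=4) as live:
        while True:
            time.sleep(refresh_interval)
            live.update(_render(fetch_rows()))
```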


1237-1262: Busy-wait loop could be parameterized and interruptible.

_wait_for_recovery_capacity polls every 10s. Consider allowing a timeout and honoring KeyboardInterrupt from the caller to exit promptly.

-def _wait_for_recovery_capacity(client, max_concurrent_recoveries: int = 5):
+def _wait_for_recovery_capacity(client, max_concurrent_recoveries: int = 5, timeout_seconds: int = 0):
@@
-    while True:
+    start = time.time()
+    while True:
@@
-        sleep(10)  # Check every 10 seconds
+        sleep(10)  # Check every 10 seconds
+        if timeout_seconds and (time.time() - start) >= timeout_seconds:
+            console.print("    [yellow]⏳ Timeout waiting for recovery capacity[/yellow]")
+            break

1264-1316: Auto-exec validation uses a hard-coded 95% disk usage threshold; make it configurable.

Hard-coding 95% diverges from the user-provided --max-disk-usage. Consider threading the CLI value through to this function.

-def _execute_recommendations_safely(client, recommendations, validate: bool):
+def _execute_recommendations_safely(client, recommendations, validate: bool, max_disk_usage_percent: float = 95.0):
@@
-            is_safe, safety_msg = analyzer.validate_move_safety(rec, max_disk_usage_percent=95.0)
+            is_safe, safety_msg = analyzer.validate_move_safety(rec, max_disk_usage_percent=max_disk_usage_percent)

And pass the CLI’s max_disk_usage when calling:

-            _execute_recommendations_safely(client, recommendations, validate)
+            _execute_recommendations_safely(client, recommendations, validate, max_disk_usage_percent=max_disk_usage)

1317-1366: Consider batching with a ceiling on concurrent active recoveries, not just sequential gating.

Right now, you serialize based on capacity checks. You could optionally issue up to N moves at once, then wait until active falls below threshold. This can speed up large plans on healthy clusters.

If useful, I can draft a batched executor that respects a max_active_recoveries cap.
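
A minimal sketch of that idea, with `run_sql` and `count_active_recoveries` as assumed callables rather than the PR's actual helpers:

```python
import time
from typing import Callable, List


def execute_in_batches(
    moves: List[str],
    run_sql: Callable[[str], None],
    count_active_recoveries: Callable[[], int],
    max_active_recoveries: int = 5,
    poll_seconds: int = 10,
) -> None:
    """Sketch: keep issuing moves while under the recovery cap, otherwise wait."""
    pending = list(moves)
    while pending:
        if count_active_recoveries() < max_active_recoveries:
            run_sql(pending.pop(0))
            # Brief pause so the newly started recovery shows up in the next count.
            time.sleep(1)
        else:
            time.sleep(poll_seconds)
```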

cratedb_toolkit/admin/xmover/analyzer.py (5)

425-431: Make the 50GB buffer configurable to adapt to environment variability.

A fixed 50GB buffer may be too conservative or too lax depending on shard sizes and cluster IO characteristics.

  • Add a parameter buffer_gb to validate_move_safety (default 50).
  • Thread it from CLI options for advanced users, or make it a module-level constant.
-    def validate_move_safety(
-        self, recommendation: MoveRecommendation, max_disk_usage_percent: float = 90.0
-    ) -> Tuple[bool, str]:
+    def validate_move_safety(
+        self, recommendation: MoveRecommendation, max_disk_usage_percent: float = 90.0, buffer_gb: float = 50.0
+    ) -> Tuple[bool, str]:
@@
-        required_space_gb = recommendation.size_gb + 50  # 50GB buffer
+        required_space_gb = recommendation.size_gb + buffer_gb

248-257: Prefer structured logging over print for long-running analyses.

Printing from library code couples output to stdout and complicates CLI/UI reuse. Consider injecting a logger or progress callback.

I can refactor generate_rebalancing_recommendations to accept an optional progress callable.


183-225: find_nodes_with_capacity: defaults differ from CLI; consider alignment or docstring note.

Default max_disk_usage_percent is 85 here, while CLI uses 90 for recommendations/validation. Either align defaults or document why they differ.


514-647: Zone-conflict checks are comprehensive; consider replication awareness.

You guard against node/zone duplicates and insufficient healthy copies. For clusters allowing multiple replicas per zone by design, the “target zone already has a copy” and “would have X copies” checks might be too strict. If intentional, consider documenting or making this behavior configurable.

Would you like me to add a toggle (e.g., allow_same_zone_copies) that relaxes checks for environments without strict zone-awareness?
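
As a sketch of what such a toggle could look like (all names here are assumptions):

```python
from typing import Optional, Set


def check_target_zone(
    target_zone: str,
    zones_with_copies: Set[str],
    allow_same_zone_copies: bool = False,
) -> Optional[str]:
    """Sketch: return a conflict message, or None if the move looks acceptable."""
    if target_zone in zones_with_copies and not allow_same_zone_copies:
        return f"target zone '{target_zone}' already holds a copy of this shard"
    return None
```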


737-873: Decommission plan: pick safest targets deterministically.

Because find_nodes_with_capacity returns nodes sorted by available space desc, selecting safe_targets[0] achieves “most space” preference. Consider adding a tie-breaker on zone underload if available to aid balancing during decommission.
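
A possible tie-breaker, sketched with assumed attribute names (`available_space_gb` is mentioned elsewhere in this review; `zone` is an assumption):

```python
from typing import Dict, List


def pick_decommission_target(safe_targets: List[object], zone_shard_counts: Dict[str, int]) -> object:
    """Sketch: prefer the node with the most free space, break ties by the least-loaded zone."""
    return min(
        safe_targets,
        key=lambda node: (-node.available_space_gb, zone_shard_counts.get(node.zone, 0)),
    )
```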

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a8c76f and 445d908.

📒 Files selected for processing (13)
  • CHANGES.md (1 hunks)
  • cratedb_toolkit/admin/xmover/__init__.py (1 hunks)
  • cratedb_toolkit/admin/xmover/analyzer.py (1 hunks)
  • cratedb_toolkit/admin/xmover/cli.py (1 hunks)
  • cratedb_toolkit/admin/xmover/database.py (1 hunks)
  • cratedb_toolkit/cli.py (2 hunks)
  • doc/admin/index.md (1 hunks)
  • doc/admin/xmover/handbook.md (1 hunks)
  • doc/admin/xmover/index.md (1 hunks)
  • doc/admin/xmover/queries.md (1 hunks)
  • doc/admin/xmover/troubleshooting.md (1 hunks)
  • doc/index.md (1 hunks)
  • pyproject.toml (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
cratedb_toolkit/cli.py (1)
cratedb_toolkit/admin/xmover/cli.py (1)
  • main (71-89)
cratedb_toolkit/admin/xmover/cli.py (2)
cratedb_toolkit/admin/xmover/analyzer.py (13)
  • MoveRecommendation (14-50)
  • RecoveryMonitor (875-1019)
  • ShardAnalyzer (65-872)
  • get_cluster_overview (654-695)
  • analyze_distribution (88-120)
  • find_moveable_shards (141-162)
  • generate_rebalancing_recommendations (226-407)
  • validate_move_safety (409-437)
  • to_sql (28-34)
  • check_zone_balance (164-181)
  • get_cluster_recovery_status (881-897)
  • get_recovery_summary (899-940)
  • format_recovery_display (942-974)
cratedb_toolkit/admin/xmover/database.py (11)
  • CrateDBClient (115-585)
  • size_gb (91-93)
  • translog_size_gb (105-107)
  • test_connection (331-337)
  • available_space_gb (38-39)
  • shard_type (60-61)
  • shard_type (96-97)
  • get_nodes_info (153-187)
  • get_shards_info (189-269)
  • overall_progress (86-88)
  • execute_query (133-151)
cratedb_toolkit/admin/xmover/database.py (2)
cratedb_toolkit/model.py (2)
  • username (179-183)
  • password (186-190)
cratedb_toolkit/admin/xmover/cli.py (1)
  • test_connection (511-533)
cratedb_toolkit/admin/xmover/analyzer.py (1)
cratedb_toolkit/admin/xmover/database.py (17)
  • CrateDBClient (115-585)
  • NodeInfo (17-39)
  • RecoveryInfo (65-112)
  • ShardInfo (43-61)
  • shard_type (60-61)
  • shard_type (96-97)
  • size_gb (91-93)
  • get_nodes_info (153-187)
  • get_shards_info (189-269)
  • available_space_gb (38-39)
  • disk_usage_percent (34-35)
  • execute_query (133-151)
  • get_cluster_watermarks (339-358)
  • heap_usage_percent (30-31)
  • get_all_recovering_shards (451-482)
  • overall_progress (86-88)
  • total_time_seconds (100-102)
🪛 LanguageTool
doc/admin/xmover/index.md

[grammar] ~8-~8: There might be a mistake here.
Context: ...nd availability zones. It generates safe SQL commands for shard rebalancing and n...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...hard distribution across nodes and zones - Shard Movement Recommendations: Intell...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...s for rebalancing with safety validation - Recovery Monitoring: Track ongoing sha...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...ecovery operations with progress details - Zone Conflict Detection: Prevents move...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...t would violate CrateDB's zone awareness - Node Decommissioning: Plan safe node r...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ... removal with automated shard relocation - Dry Run Mode: Test recommendations wit...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...s without generating actual SQL commands - Safety Validation: Comprehensive check...

(QB_NEW_EN)

doc/admin/xmover/queries.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-queries)= # XMover Query Gallery ## Shard Distribut...

(QB_NEW_EN)

CHANGES.md

[grammar] ~4-~4: There might be a mistake here.
Context: ...rateDB Shard Analyzer and Movement Tool. Thanks, @WalBeh. ## 2025/08/19 v0.0.4...

(QB_NEW_EN)

doc/admin/xmover/troubleshooting.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-troubleshooting)= # Troubleshooting CrateDB using XMover Th...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ...ns ### 1. Zone Conflicts #### Symptoms - Error: `NO(a copy of this shard is alrea...

(QB_NEW_EN)


[grammar] ~33-~33: There might be a mistake here.
Context: ...s in safety validation #### Root Causes - Target node already has a copy of the sh...

(QB_NEW_EN)


[grammar] ~64-~64: There might be a mistake here.
Context: ...D FROM_NODE TO_NODE ``` #### Prevention - Always use xmover recommend instead of...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: ...Insufficient Space Issues #### Symptoms - Error: not enough disk space - Safety ...

(QB_NEW_EN)


[grammar] ~72-~72: There might be a mistake here.
Context: ...nt Space Issues #### Symptoms - Error: not enough disk space - Safety validation fails with space warni...

(QB_NEW_EN)


[grammar] ~76-~76: There might be a mistake here.
Context: ...es in cluster analysis #### Root Causes - Target node doesn't have enough free spa...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...max-size 50 ``` Step 3: Free Up Space - Delete old snapshots and unused data - M...

(QB_NEW_EN)


[grammar] ~106-~106: There might be a mistake here.
Context: ...ng nodes to the cluster #### Prevention - Monitor disk usage regularly with `xmove...

(QB_NEW_EN)


[grammar] ~113-~113: There might be a mistake here.
Context: .... Node Performance Issues #### Symptoms - Error: shard recovery limit - High hea...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...formance Issues #### Symptoms - Error: shard recovery limit - High heap usage warnings - Slow shard mo...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...covery limit` - High heap usage warnings - Slow shard movement operations #### Roo...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...rd movement operations #### Root Causes - Too many concurrent shard movements - Hi...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...es - Too many concurrent shard movements - High heap usage on target nodes (>80%) -...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...- High heap usage on target nodes (>80%) - Resource contention during moves #### S...

(QB_NEW_EN)


[grammar] ~149-~149: There might be a mistake here.
Context: ... --prioritize-space ``` #### Prevention - Move shards gradually (5-10 at a time) -...

(QB_NEW_EN)


[grammar] ~161-~161: There might be a mistake here.
Context: ...nificantly more shards #### Root Causes - Historical data distribution patterns - ...

(QB_NEW_EN)


[grammar] ~162-~162: There might be a mistake here.
Context: ... - Historical data distribution patterns - Node additions/removals without rebalanc...

(QB_NEW_EN)


[grammar] ~163-~163: There might be a mistake here.
Context: ...e additions/removals without rebalancing - Tables created with poor initial distrib...

(QB_NEW_EN)


[grammar] ~194-~194: There might be a mistake here.
Context: ...progress and repeat ``` #### Prevention - Run regular balance checks: `xmover chec...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ...## Symptoms - "Connection failed" errors - Authentication failures - SSL/TLS errors...

(QB_NEW_EN)


[grammar] ~203-~203: There might be a mistake here.
Context: ...failed" errors - Authentication failures - SSL/TLS errors #### Root Causes - Incor...

(QB_NEW_EN)


[grammar] ~206-~206: There might be a mistake here.
Context: ...lures - SSL/TLS errors #### Root Causes - Incorrect connection string in .env - ...

(QB_NEW_EN)


[grammar] ~208-~208: There might be a mistake here.
Context: ...ion string in .env - Wrong credentials - Network connectivity issues - SSL certif...

(QB_NEW_EN)


[grammar] ~209-~209: There might be a mistake here.
Context: ...redentials - Network connectivity issues - SSL certificate problems #### Solutions...

(QB_NEW_EN)


[grammar] ~238-~238: There might be a mistake here.
Context: ..."stmt":"SELECT 1"}' ``` #### Prevention - Use .env.example as a template - Verif...

(QB_NEW_EN)


[grammar] ~241-~241: There might be a mistake here.
Context: ... with CrateDB admin - Test connectivity from deployment environment ## Error Messag...

(QB_NEW_EN)


[grammar] ~259-~259: There might be a mistake here.
Context: ... | Error Pattern | Meaning | Quick Fix | |---------------|---------|-----------| ...

(QB_NEW_EN)


[grammar] ~260-~260: There might be a mistake here.
Context: ... |---------------|---------|-----------| | `copy of this shard is already allocat...

(QB_NEW_EN)


[grammar] ~261-~261: There might be a mistake here.
Context: ...s shard | Choose different target node | | `too many copies...with attribute [zon...

(QB_NEW_EN)


[grammar] ~262-~262: There might be a mistake here.
Context: ...imit exceeded | Move to different zone | | not enough disk space | Insufficient...

(QB_NEW_EN)


[grammar] ~263-~263: There might be a mistake here.
Context: ... | Free space or choose different node | | shard recovery limit | Too many conc...

(QB_NEW_EN)


[grammar] ~264-~264: There might be a mistake here.
Context: ...oves | Wait and retry with fewer moves | | allocation is disabled | Cluster all...

(QB_NEW_EN)


[grammar] ~328-~328: There might be a mistake here.
Context: ... performance** - Query response times - Resource utilization - Error rates ## Emerge...

(QB_NEW_EN)


[grammar] ~329-~329: There might be a mistake here.
Context: ...response times - Resource utilization - Error rates ## Emergency Procedures ### Stu...

(QB_NEW_EN)


[grammar] ~400-~400: There might be a mistake here.
Context: ... Documentation**: https://crate.io/docs/ - Shard Allocation Guide: https://crate....

(QB_NEW_EN)


[grammar] ~401-~401: There might be a mistake here.
Context: .../en/latest/admin/system-information.html - Cluster Settings: https://crate.io/doc...

(QB_NEW_EN)


[grammar] ~408-~408: There might be a mistake here.
Context: ...e: 1. XMover version and command used 2. Complete error message 3. **Cluster in...

(QB_NEW_EN)


[grammar] ~409-~409: There might be a mistake here.
Context: ...mmand used** 2. Complete error message 3. Cluster information (xmover analyze ...

(QB_NEW_EN)


[grammar] ~410-~410: There might be a mistake here.
Context: ... information** (xmover analyze output) 4. Zone analysis (xmover zone-analysis ...

(QB_NEW_EN)


[grammar] ~411-~411: There might be a mistake here.
Context: ...alysis** (xmover zone-analysis output) 5. CrateDB version and configuration ###...

(QB_NEW_EN)

doc/admin/xmover/handbook.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-handbook)= # XMover Handbook ## Installation Instal...

(QB_NEW_EN)


[grammar] ~82-~82: There might be a mistake here.
Context: ...tion across nodes and zones. Options: - --table, -t: Analyze specific table only **Example...

(QB_NEW_EN)


[grammar] ~93-~93: There might be a mistake here.
Context: ...on size and health criteria. Options: - --table, -t: Find candidates in specific table only...

(QB_NEW_EN)


[grammar] ~94-~94: There might be a mistake here.
Context: ...: Find candidates in specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~95-~95: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~96-~96: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --node: Only show candidates from this specifi...

(QB_NEW_EN)


[grammar] ~111-~111: There might be a mistake here.
Context: ...ons for cluster rebalancing. Options: - --table, -t: Generate recommendations for specific ...

(QB_NEW_EN)


[grammar] ~112-~112: There might be a mistake here.
Context: ... recommendations for specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~113-~113: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --zone-tolerance: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...lance tolerance percentage (default: 10) - --min-free-space: Minimum free space required on target ...

(QB_NEW_EN)


[grammar] ~116-~116: There might be a mistake here.
Context: ...red on target nodes in GB (default: 100) - --max-moves: Maximum number of move recommendations...

(QB_NEW_EN)


[grammar] ~117-~117: There might be a mistake here.
Context: ...er of move recommendations (default: 10) - --max-disk-usage: Maximum disk usage percentage for targ...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...ercentage for target nodes (default: 85) - --validate/--no-validate: Validate move safety (default: True) -...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...e: Validate move safety (default: True) - --prioritize-space/--prioritize-zones`: Prioritize available space over zone b...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...ace over zone balancing (default: False) - --dry-run/--execute: Show what would be done without genera...

(QB_NEW_EN)


[grammar] ~121-~121: There might be a mistake here.
Context: ... generating SQL commands (default: True) - --node: Only recommend moves from this specifi...

(QB_NEW_EN)


[grammar] ~145-~145: There might be a mistake here.
Context: ...ion and potential conflicts. Options: - --table, -t: Analyze zones for specific table only ...

(QB_NEW_EN)


[grammar] ~146-~146: There might be a mistake here.
Context: ...: Analyze zones for specific table only - --show-shards/--no-show-shards`: Show individual shard details (default...

(QB_NEW_EN)


[grammar] ~157-~157: There might be a mistake here.
Context: ...with configurable tolerance. Options: - --table, -t: Check balance for specific table only ...

(QB_NEW_EN)


[grammar] ~158-~158: There might be a mistake here.
Context: ...: Check balance for specific table only - --tolerance`: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~171-~171: There might be a mistake here.
Context: ...ecution to prevent errors. Arguments: - SCHEMA_TABLE: Schema and table name (format: schema....

(QB_NEW_EN)


[grammar] ~172-~172: There might be a mistake here.
Context: ...ma and table name (format: schema.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source...

(QB_NEW_EN)


[grammar] ~173-~173: There might be a mistake here.
Context: ...ma.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source node name - TO_NODE: Target n...

(QB_NEW_EN)


[grammar] ~174-~174: There might be a mistake here.
Context: ... to move - FROM_NODE: Source node name - TO_NODE: Target node name Examples: ```bas...

(QB_NEW_EN)


[grammar] ~189-~189: There might be a mistake here.
Context: ... troubleshooting guidance. Arguments: - ERROR_MESSAGE: The CrateDB error message to analyze (...

(QB_NEW_EN)


[grammar] ~204-~204: There might be a mistake here.
Context: ...y operations on the cluster. Options: - --table, -t: Monitor recovery for specific table on...

(QB_NEW_EN)


[grammar] ~205-~205: There might be a mistake here.
Context: ...Monitor recovery for specific table only - --node, -n: Monitor recovery on specific node only...

(QB_NEW_EN)


[grammar] ~206-~206: There might be a mistake here.
Context: ...: Monitor recovery on specific node only - --watch, -w: Continuously monitor (refresh every 10...

(QB_NEW_EN)


[grammar] ~207-~207: There might be a mistake here.
Context: ...Continuously monitor (refresh every 10s) - --refresh-interval: Refresh interval for watch mode in sec...

(QB_NEW_EN)


[grammar] ~208-~208: There might be a mistake here.
Context: ... for watch mode in seconds (default: 10) - --recovery-type: Filter by recovery type - PEER, DISK, ...

(QB_NEW_EN)


[grammar] ~209-~209: There might be a mistake here.
Context: ...type - PEER, DISK, or all (default: all) - --include-transitioning: Include recently completed recoveries ...

(QB_NEW_EN)


[grammar] ~230-~230: There might be a mistake here.
Context: ...ude-transitioning ``` Recovery Types: - PEER: Copying shard data from another ...

(QB_NEW_EN)


[grammar] ~231-~231: There might be a mistake here.
Context: ...om another node (replication/relocation) - DISK: Rebuilding shard from local data...

(QB_NEW_EN)


[grammar] ~364-~364: There might be a mistake here.
Context: ...TRING: CrateDB HTTP endpoint (required) - CRATE_USERNAME`: Username for authentication (optional)...

(QB_NEW_EN)


[grammar] ~365-~365: There might be a mistake here.
Context: ...: Username for authentication (optional) - CRATE_PASSWORD: Password for authentication (optional)...

(QB_NEW_EN)


[grammar] ~366-~366: There might be a mistake here.
Context: ...: Password for authentication (optional) - CRATE_SSL_VERIFY: Enable SSL certificate verification (d...

(QB_NEW_EN)


[grammar] ~421-~421: There might be a mistake here.
Context: ...zone awareness prevents too many copies in same zone - Solution: Move shard...

(QB_NEW_EN)


[grammar] ~422-~422: There might be a mistake here.
Context: ... copies in same zone - Solution: Move shard to a different availability zone ...

(QB_NEW_EN)


[grammar] ~430-~430: There might be a mistake here.
Context: ...ufficient free space - Solution: Choose node with more capacity or free up spac...

(QB_NEW_EN)


[grammar] ~441-~441: There might be a mistake here.
Context: ...pace` 5. No Recommendations Generated - Cause: Cluster may already be well bal...

(QB_NEW_EN)


[grammar] ~471-~471: There might be a mistake here.
Context: ...age 95 ``` When to Adjust Thresholds: - Emergency situations: Increase to 90-9...

(QB_NEW_EN)


[style] ~472-~472: ‘Emergency situations’ might be wordy. Consider a shorter alternative.
Context: ...``` When to Adjust Thresholds: - Emergency situations: Increase to 90-95% for critical spac...

(EN_WORDINESS_PREMIUM_EMERGENCY_SITUATIONS)


[grammar] ~474-~474: There might be a mistake here.
Context: ...onments**: Can be more aggressive (90%+) - Production: Keep conservative (80-85%)...

(QB_NEW_EN)


[grammar] ~479-~479: There might be a mistake here.
Context: ...f}xmover-troubleshooting which covers: - Step-by-step diagnostic procedures - Eme...

(QB_NEW_EN)


[grammar] ~480-~480: There might be a mistake here.
Context: ...rs: - Step-by-step diagnostic procedures - Emergency recovery procedures - Best pra...

(QB_NEW_EN)


[grammar] ~481-~481: There might be a mistake here.
Context: ...ocedures - Emergency recovery procedures - Best practices for safe operations - Com...

(QB_NEW_EN)


[grammar] ~482-~482: There might be a mistake here.
Context: ...res - Best practices for safe operations - Complete error reference guide ### Debu...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
doc/admin/xmover/troubleshooting.md

40-40: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


49-49: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


58-58: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


83-83: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


92-92: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


101-101: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


125-125: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


134-134: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


143-143: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


168-168: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


177-177: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


186-186: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


214-214: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


220-220: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


232-232: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


400-400: Bare URL used

(MD034, no-bare-urls)


401-401: Bare URL used

(MD034, no-bare-urls)


402-402: Bare URL used

(MD034, no-bare-urls)

doc/admin/xmover/handbook.md

371-371: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


410-410: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


418-418: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


426-426: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


434-434: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


462-462: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🔇 Additional comments (7)
doc/index.md (1)

42-42: Admin docs entry verified – no broken links risk

Confirmed that doc/admin/index.md exists and contains a top-level heading (# Administrative Utilities), ensuring the Sphinx build won’t break. Approving these docs changes.

cratedb_toolkit/admin/xmover/__init__.py (1)

1-10: Module metadata is clear and minimal—LGTM

Docstring + version/author/description are appropriate for a package entrypoint. No issues.

pyproject.toml (2)

327-333: Scoped T201 (print) ignore looks fine

Allowing print only in analyzer.py (and keeping existing other exceptions) is sensible given CLI tool ergonomics. No further action.


267-267: Console script entrypoint validated

Verified that cratedb_toolkit.admin.xmover.cli:main exists and is correctly decorated as a Click group:

  • cratedb_toolkit/admin/xmover/cli.py
    • @click.group() on line 66
    • @click.version_option() on line 69
    • @click.pass_context on line 70
    • def main(ctx) on line 71

No further changes needed.

cratedb_toolkit/cli.py (1)

7-7: Wiring XMover into the root CLI is correct

Import alias and add_command are consistent with existing patterns in this file. Nice addition.

Also applies to: 31-31

cratedb_toolkit/admin/xmover/analyzer.py (2)

122-140: Balance score formula looks good and stable.

The coefficient-of-variation-based scoring provides an intuitive 0–100 metric and gracefully handles edge cases.
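
For readers unfamiliar with the approach, a coefficient-of-variation score can be mapped onto a 0-100 scale roughly like this (a sketch of the general idea, not necessarily the exact formula used in the PR):

```python
from statistics import mean, pstdev
from typing import Sequence


def balance_score(per_node_values: Sequence[float]) -> float:
    """Sketch: 100 means perfectly even; the score drops as the spread grows."""
    if not per_node_values or mean(per_node_values) == 0:
        return 100.0
    cv = pstdev(per_node_values) / mean(per_node_values)  # coefficient of variation
    return max(0.0, 100.0 * (1.0 - cv))


print(balance_score([40, 42, 38, 41]))  # close to 100 for an evenly loaded cluster
```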


654-695: Good cluster overview shaping with watermark-derived capacity metrics.

Using sys.cluster watermarks and per-node deltas adds actionable context to the analyzer output.

Comment on lines 456 to 479
    def _check_zone_conflict_cached(self, recommendation: MoveRecommendation) -> Optional[str]:
        """Check zone conflicts with caching"""
        # Create cache key: table, shard, target zone
        target_zone = self._get_node_zone(recommendation.to_node)
        cache_key = (recommendation.table_name, recommendation.shard_id, target_zone)

        if cache_key in self._zone_conflict_cache:
            self._cache_hits += 1
            return self._zone_conflict_cache[cache_key]

        # Cache miss - do expensive check
        self._cache_misses += 1
        result = self._check_zone_conflict(recommendation)
        self._zone_conflict_cache[cache_key] = result
        return result


⚠️ Potential issue

Fix cache-key collision in zone conflict cache (missing schema_name).

The cache key omits schema_name. Two tables with the same name across different schemas will collide, returning incorrect cached results and potentially allowing unsafe moves.

-        target_zone = self._get_node_zone(recommendation.to_node)
-        cache_key = (recommendation.table_name, recommendation.shard_id, target_zone)
+        target_zone = self._get_node_zone(recommendation.to_node)
+        cache_key = (recommendation.schema_name, recommendation.table_name, recommendation.shard_id, target_zone)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def _check_zone_conflict_cached(self, recommendation: MoveRecommendation) -> Optional[str]:
        """Check zone conflicts with caching"""
        # Create cache key: table, shard, target zone
        target_zone = self._get_node_zone(recommendation.to_node)
        cache_key = (recommendation.table_name, recommendation.shard_id, target_zone)
        if cache_key in self._zone_conflict_cache:
            self._cache_hits += 1
            return self._zone_conflict_cache[cache_key]
        # Cache miss - do expensive check
        self._cache_misses += 1
        result = self._check_zone_conflict(recommendation)
        self._zone_conflict_cache[cache_key] = result
        return result

    def _check_zone_conflict_cached(self, recommendation: MoveRecommendation) -> Optional[str]:
        """Check zone conflicts with caching"""
        # Create cache key: schema, table, shard, target zone
        target_zone = self._get_node_zone(recommendation.to_node)
        cache_key = (recommendation.schema_name, recommendation.table_name, recommendation.shard_id, target_zone)
        if cache_key in self._zone_conflict_cache:
            self._cache_hits += 1
            return self._zone_conflict_cache[cache_key]
        # Cache miss - do expensive check
        self._cache_misses += 1
        result = self._check_zone_conflict(recommendation)
        self._zone_conflict_cache[cache_key] = result
        return result
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analyzer.py around lines 456 to 471, the cache
key for zone conflict caching omits the schema name which causes collisions for
identical table names in different schemas; update the cache_key to include the
recommendation.schema_name (e.g., (recommendation.schema_name,
recommendation.table_name, recommendation.shard_id, target_zone)) so cached
results are partitioned by schema, and ensure recommendation.schema_name exists
(or fall back to a default) before using it in the key.

Comment on lines 206 to 216
        where_conditions = []
        if not for_analysis:
            # For operations, only include healthy shards
            where_conditions.extend(["s.routing_state = 'STARTED'", "s.recovery['files']['percent'] = 100.0"])
        parameters = []

        if table_name:
            where_conditions.append("s.table_name = ?")
            parameters.append(table_name)

🛠️ Refactor suggestion

“Healthy shards” filter may exclude STARTED shards due to NULL recovery percent

On STARTED shards, recovery['files']['percent'] may be NULL. Equality to 100.0 would exclude those rows. Use COALESCE and >= for robustness.

-        if not for_analysis:
-            # For operations, only include healthy shards
-            where_conditions.extend(["s.routing_state = 'STARTED'", "s.recovery['files']['percent'] = 100.0"])
+        if not for_analysis:
+            # For operations, only include healthy shards
+            where_conditions.extend([
+                "s.routing_state = 'STARTED'",
+                "coalesce(s.recovery['files']['percent'], 100.0) >= 100.0",
+            ])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
where_conditions = []
if not for_analysis:
# For operations, only include healthy shards
where_conditions.extend(["s.routing_state = 'STARTED'", "s.recovery['files']['percent'] = 100.0"])
parameters = []
if table_name:
where_conditions.append("s.table_name = ?")
parameters.append(table_name)
where_conditions = []
if not for_analysis:
# For operations, only include healthy shards
where_conditions.extend([
"s.routing_state = 'STARTED'",
"coalesce(s.recovery['files']['percent'], 100.0) >= 100.0",
])
parameters = []
if table_name:
where_conditions.append("s.table_name = ?")
parameters.append(table_name)
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/database.py around lines 206 to 214, the WHERE
clause currently uses "s.recovery['files']['percent'] = 100.0" which will
exclude STARTED shards where that value is NULL; replace the equality check with
a COALESCE + >= check (e.g. COALESCE(s.recovery['files']['percent'], 100.0) >=
100.0) so NULL is treated as fully recovered and use >= instead of = to be
robust.

Comment on lines 407 to 437
    def get_recovery_details(self, schema_name: str, table_name: str, shard_id: int) -> Optional[Dict[str, Any]]:
        """Get detailed recovery information for a specific shard from sys.shards"""

        # Query for shards that are actively recovering (not completed)
        query = """
        SELECT
            s.table_name,
            s.schema_name,
            s.id as shard_id,
            s.node['name'] as node_name,
            s.node['id'] as node_id,
            s.routing_state,
            s.state,
            s.recovery,
            s.size,
            s."primary",
            s.translog_stats['size'] as translog_size
        FROM sys.shards s
        WHERE s.table_name = ? AND s.id = ?
        AND (s.state = 'RECOVERING' OR s.routing_state IN ('INITIALIZING', 'RELOCATING'))
        ORDER BY s.schema_name
        LIMIT 1
        """

        result = self.execute_query(query, [table_name, shard_id])

🛠️ Refactor suggestion

Schema parameter is ignored in get_recovery_details; can return wrong shard

The method accepts schema_name but doesn’t filter by it. This can pick the wrong table if identical names exist across schemas.

-        WHERE s.table_name = ? AND s.id = ?
+        WHERE s.schema_name = ? AND s.table_name = ? AND s.id = ?
...
-        result = self.execute_query(query, [table_name, shard_id])
+        result = self.execute_query(query, [schema_name, table_name, shard_id])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def get_recovery_details(self, schema_name: str, table_name: str, shard_id: int) -> Optional[Dict[str, Any]]:
        """Get detailed recovery information for a specific shard from sys.shards"""
        # Query for shards that are actively recovering (not completed)
        query = """
        SELECT
            s.table_name,
            s.schema_name,
            s.id as shard_id,
            s.node['name'] as node_name,
            s.node['id'] as node_id,
            s.routing_state,
            s.state,
            s.recovery,
            s.size,
            s."primary",
            s.translog_stats['size'] as translog_size
        FROM sys.shards s
        WHERE s.table_name = ? AND s.id = ?
        AND (s.state = 'RECOVERING' OR s.routing_state IN ('INITIALIZING', 'RELOCATING'))
        ORDER BY s.schema_name
        LIMIT 1
        """
        result = self.execute_query(query, [table_name, shard_id])

    def get_recovery_details(self, schema_name: str, table_name: str, shard_id: int) -> Optional[Dict[str, Any]]:
        """Get detailed recovery information for a specific shard from sys.shards"""
        # Query for shards that are actively recovering (not completed)
        query = """
        SELECT
            s.table_name,
            s.schema_name,
            s.id as shard_id,
            s.node['name'] as node_name,
            s.node['id'] as node_id,
            s.routing_state,
            s.state,
            s.recovery,
            s.size,
            s."primary",
            s.translog_stats['size'] as translog_size
        FROM sys.shards s
        WHERE s.schema_name = ? AND s.table_name = ? AND s.id = ?
        AND (s.state = 'RECOVERING' OR s.routing_state IN ('INITIALIZING', 'RELOCATING'))
        ORDER BY s.schema_name
        LIMIT 1
        """
        result = self.execute_query(query, [schema_name, table_name, shard_id])
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/database.py around lines 407 to 431, the
get_recovery_details method accepts schema_name but does not filter by it, which
can return the wrong shard when identical table names exist in multiple schemas;
update the SQL WHERE clause to include s.schema_name = ? and add schema_name as
the first parameter in the execute_query parameter list (so params become
[schema_name, table_name, shard_id]) ensuring parameter order matches the query
placeholders, and adjust any related docs/comments if needed.

Comment on lines +65 to +93
## List biggest SHARDS on a particular Nodes

```sql
select node['name'], table_name, schema_name, id, sum(size) / 1024^3 from sys.shards
where node['name'] = 'data-hot-2'
AND routing_state = 'STARTED'
AND recovery['files']['percent'] = 0
group by 1,2,3,4 order by 5 desc limit 8;
+--------------+-----------------------+-------------+----+-----------------------------+
| node['name'] | table_name | schema_name | id | (sum(size) / 1.073741824E9) |
+--------------+-----------------------+-------------+----+-----------------------------+
| data-hot-2 | bottleFieldData | curvo | 5 | 135.568662205711 |
| data-hot-2 | bottleFieldData | curvo | 8 | 134.813782049343 |
| data-hot-2 | bottleFieldData | curvo | 3 | 133.43549298401922 |
| data-hot-2 | bottleFieldData | curvo | 11 | 130.10448653809726 |
| data-hot-2 | turtleFieldData | curvo | 31 | 54.642812703736126 |
| data-hot-2 | turtleFieldData | curvo | 29 | 54.06101848650724 |
| data-hot-2 | turtleFieldData | curvo | 5 | 53.96749582327902 |
| data-hot-2 | turtleFieldData | curvo | 21 | 53.72262619435787 |
+--------------+-----------------------+-------------+----+-----------------------------+
SELECT 8 rows in set (0.062 sec)
```

⚠️ Potential issue

Recovery predicate likely inverted; “percent = 0” excludes completed shards

To filter fully healthy shards, require percent = 100 (or >= 100), not 0. Also, fix the section heading grammar (“Node”, singular) and use power() for GiB conversion.

-## List biggest SHARDS on a particular Nodes
+## List biggest shards on a particular node
...
-select node['name'], table_name, schema_name, id,  sum(size) / 1024^3 from sys.shards
+select node['name'], table_name, schema_name, id,  sum(size) / power(1024, 3) as size_gb
+from sys.shards
     where node['name'] = 'data-hot-2'
     AND routing_state = 'STARTED'
-    AND recovery['files']['percent'] = 0
+    AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     group by 1,2,3,4  order by 5  desc limit 8;
📝 Committable suggestion

Suggested change
## List biggest SHARDS on a particular Nodes
```sql
select node['name'], table_name, schema_name, id, sum(size) / 1024^3 from sys.shards
where node['name'] = 'data-hot-2'
AND routing_state = 'STARTED'
AND recovery['files']['percent'] = 0
group by 1,2,3,4 order by 5 desc limit 8;
+--------------+-----------------------+-------------+----+-----------------------------+
| node['name'] | table_name | schema_name | id | (sum(size) / 1.073741824E9) |
+--------------+-----------------------+-------------+----+-----------------------------+
| data-hot-2 | bottleFieldData | curvo | 5 | 135.568662205711 |
| data-hot-2 | bottleFieldData | curvo | 8 | 134.813782049343 |
| data-hot-2 | bottleFieldData | curvo | 3 | 133.43549298401922 |
| data-hot-2 | bottleFieldData | curvo | 11 | 130.10448653809726 |
| data-hot-2 | turtleFieldData | curvo | 31 | 54.642812703736126 |
| data-hot-2 | turtleFieldData | curvo | 29 | 54.06101848650724 |
| data-hot-2 | turtleFieldData | curvo | 5 | 53.96749582327902 |
| data-hot-2 | turtleFieldData | curvo | 21 | 53.72262619435787 |
+--------------+-----------------------+-------------+----+-----------------------------+
SELECT 8 rows in set (0.062 sec)
```
## List biggest shards on a particular node
🤖 Prompt for AI Agents
In doc/admin/xmover/queries.md around lines 65 to 86, the query and heading are
wrong: the heading should be singular "Node", the recovery predicate is inverted
(percent = 0 excludes fully recovered shards) so change it to
recovery['files']['percent'] >= 100 (or = 100) to select fully healthy shards,
and replace the manual 1024^3 GiB conversion with power(1024,3) (e.g., sum(size)
/ power(1024,3)) for clarity and SQL compatibility; update the heading text and
the SQL predicate/size expression accordingly.

Comment on lines +97 to +130
WITH shard_summary AS (
SELECT
node['name'] AS node_name,
table_name,
schema_name,
CASE
WHEN "primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
COUNT(*) AS shard_count,
SUM(size) / 1024^3 AS total_size_gb
FROM sys.shards
WHERE table_name = 'orderffD'
AND routing_state = 'STARTED'
AND recovery['files']['percent'] = 0
GROUP BY node['name'], table_name, schema_name, "primary"
)
SELECT
node_name,
table_name,
schema_name,
shard_type,
shard_count,
ROUND(total_size_gb, 2) AS total_size_gb,
ROUND(total_size_gb / shard_count, 2) AS avg_shard_size_gb
FROM shard_summary
ORDER BY node_name, shard_type DESC, total_size_gb DESC;
```

⚠️ Potential issue

CTE uses recovery percent 0; should be completed shards and power() for units

For summarizing shard sizes, filter to completed recoveries and use power() to compute GiB. Consider using ROUND on the division rather than the alias, to avoid rounding cascades.

-        SUM(size) / 1024^3 AS total_size_gb
+        SUM(size) / power(1024, 3) AS total_size_gb
...
-    WHERE table_name = 'orderffD'
-        AND routing_state = 'STARTED'
-        AND recovery['files']['percent'] = 0
+    WHERE table_name = 'orderffd'
+        AND routing_state = 'STARTED'
+        AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
...
-    ROUND(total_size_gb, 2) AS total_size_gb,
-    ROUND(total_size_gb / shard_count, 2) AS avg_shard_size_gb
+    ROUND(total_size_gb, 2) AS total_size_gb,
+    ROUND(total_size_gb / NULLIF(shard_count, 0), 2) AS avg_shard_size_gb
📝 Committable suggestion

Suggested change
WITH shard_summary AS (
SELECT
node['name'] AS node_name,
table_name,
schema_name,
CASE
WHEN "primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
COUNT(*) AS shard_count,
SUM(size) / 1024^3 AS total_size_gb
FROM sys.shards
WHERE table_name = 'orderffD'
AND routing_state = 'STARTED'
AND recovery['files']['percent'] = 0
GROUP BY node['name'], table_name, schema_name, "primary"
)
SELECT
node_name,
table_name,
schema_name,
shard_type,
shard_count,
ROUND(total_size_gb, 2) AS total_size_gb,
ROUND(total_size_gb / shard_count, 2) AS avg_shard_size_gb
FROM shard_summary
ORDER BY node_name, shard_type DESC, total_size_gb DESC;
```
WITH shard_summary AS (
SELECT
node['name'] AS node_name,
table_name,
schema_name,
CASE
WHEN "primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
COUNT(*) AS shard_count,
SUM(size) / power(1024, 3) AS total_size_gb
FROM sys.shards
WHERE table_name = 'orderffd'
AND routing_state = 'STARTED'
AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
GROUP BY node['name'], table_name, schema_name, "primary"
)
SELECT
node_name,
table_name,
schema_name,
shard_type,
shard_count,
ROUND(total_size_gb, 2) AS total_size_gb,
ROUND(total_size_gb / NULLIF(shard_count, 0), 2) AS avg_shard_size_gb
FROM shard_summary
ORDER BY node_name, shard_type DESC, total_size_gb DESC;
🤖 Prompt for AI Agents
In doc/admin/xmover/queries.md around lines 97-124, update the CTE to filter for
completed recoveries (use recovery['files']['percent'] = 100 instead of 0) and
replace the 1024^3 unit expression with power(1024, 3) when converting bytes to
GiB; compute total_size_gb and avg_shard_size_gb using the raw SUM()/COUNT()
expressions so you can apply ROUND directly to each numeric division (e.g.,
ROUND(SUM(size)/power(1024,3),2) and ROUND(SUM(size)/power(1024,3)/COUNT(*),2)),
and keep the same GROUP BY columns.

Comment on lines +126 to +177
```sql
-- Comprehensive shard distribution showing both node and zone details
SELECT
n.attributes['zone'] AS zone,
s.node['name'] AS node_name,
s.table_name,
s.schema_name,
CASE
WHEN s."primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
s.id AS shard_id,
s.size / 1024^3 AS shard_size_gb,
s.num_docs,
s.state
FROM sys.shards s
JOIN sys.nodes n ON s.node['id'] = n.id
WHERE s.table_name = 'your_table_name' -- Replace with your specific table name
AND s.routing_state = 'STARTED'
AND s.recovery['files']['percent'] = 0
ORDER BY
n.attributes['zone'],
s.node['name'],
s."primary" DESC, -- Primary shards first
s.id;

-- Summary by zone and shard type
SELECT
n.attributes['zone'] AS zone,
CASE
WHEN s."primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
COUNT(*) AS shard_count,
COUNT(DISTINCT s.node['name']) AS nodes_with_shards,
ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
SUM(s.num_docs) AS total_documents
FROM sys.shards s
JOIN sys.nodes n ON s.node['id'] = n.id
WHERE s.table_name = 'orderffD' -- Replace with your specific table name
AND s.routing_state = 'STARTED'
AND s.recovery['files']['percent'] = 0
GROUP BY n.attributes['zone'], s."primary"
ORDER BY zone, shard_type DESC;


⚠️ Potential issue

Fix unit conversion and clarify placeholder usage

  • Replace “^” with power() to compute GiB.
  • Keep the placeholder comment, but show an example to reduce copy/paste errors.
-    s.size / 1024^3 AS shard_size_gb,
+    s.size / power(1024, 3) AS shard_size_gb,
...
-WHERE s.table_name = 'your_table_name'  -- Replace with your specific table name
+WHERE s.table_name = 'your_table_name'  -- Replace with your table name, e.g. 'orders'
     AND s.routing_state = 'STARTED'
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0

And in the summary query:

-    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
-    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
+    ROUND(SUM(s.size) / power(1024, 3), 2) AS total_size_gb,
+    ROUND(AVG(s.size) / power(1024, 3), 3) AS avg_shard_size_gb,
📝 Committable suggestion

Suggested change
```sql
-- Comprehensive shard distribution showing both node and zone details
SELECT
n.attributes['zone'] AS zone,
s.node['name'] AS node_name,
s.table_name,
s.schema_name,
CASE
WHEN s."primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
s.id AS shard_id,
s.size / 1024^3 AS shard_size_gb,
s.num_docs,
s.state
FROM sys.shards s
JOIN sys.nodes n ON s.node['id'] = n.id
WHERE s.table_name = 'your_table_name' -- Replace with your specific table name
AND s.routing_state = 'STARTED'
AND s.recovery['files']['percent'] = 0
ORDER BY
n.attributes['zone'],
s.node['name'],
s."primary" DESC, -- Primary shards first
s.id;
-- Summary by zone and shard type
SELECT
n.attributes['zone'] AS zone,
CASE
WHEN s."primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
COUNT(*) AS shard_count,
COUNT(DISTINCT s.node['name']) AS nodes_with_shards,
ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
SUM(s.num_docs) AS total_documents
FROM sys.shards s
JOIN sys.nodes n ON s.node['id'] = n.id
WHERE s.table_name = 'orderffD' -- Replace with your specific table name
AND s.routing_state = 'STARTED'
AND s.recovery['files']['percent'] = 0
GROUP BY n.attributes['zone'], s."primary"
ORDER BY zone, shard_type DESC;
-- Comprehensive shard distribution showing both node and zone details
SELECT
n.attributes['zone'] AS zone,
s.node['name'] AS node_name,
s.table_name,
s.schema_name,
CASE
WHEN s."primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
s.id AS shard_id,
s.size / power(1024, 3) AS shard_size_gb,
s.num_docs,
s.state
FROM sys.shards s
JOIN sys.nodes n ON s.node['id'] = n.id
WHERE s.table_name = 'your_table_name' -- Replace with your table name, e.g. 'orders'
AND s.routing_state = 'STARTED'
AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0
ORDER BY
n.attributes['zone'],
s.node['name'],
s."primary" DESC, -- Primary shards first
s.id;
-- Summary by zone and shard type
SELECT
n.attributes['zone'] AS zone,
CASE
WHEN s."primary" = true THEN 'PRIMARY'
ELSE 'REPLICA'
END AS shard_type,
COUNT(*) AS shard_count,
COUNT(DISTINCT s.node['name']) AS nodes_with_shards,
ROUND(SUM(s.size) / power(1024, 3), 2) AS total_size_gb,
ROUND(AVG(s.size) / power(1024, 3), 3) AS avg_shard_size_gb,
SUM(s.num_docs) AS total_documents
FROM sys.shards s
JOIN sys.nodes n ON s.node['id'] = n.id
WHERE s.table_name = 'orderffD' -- Replace with your specific table name
AND s.routing_state = 'STARTED'
AND s.recovery['files']['percent'] = 0
GROUP BY n.attributes['zone'], s."primary"
ORDER BY zone, shard_type DESC;

Comment on lines 118 to 133
def __init__(self, connection_string: Optional[str] = None):
load_dotenv()

self.connection_string = connection_string or os.getenv("CRATE_CONNECTION_STRING")
if not self.connection_string:
raise ValueError("CRATE_CONNECTION_STRING not found in environment or provided")

self.username = os.getenv("CRATE_USERNAME")
self.password = os.getenv("CRATE_PASSWORD")
self.ssl_verify = os.getenv("CRATE_SSL_VERIFY", "true").lower() == "true"

# Ensure connection string ends with _sql endpoint
if not self.connection_string.endswith("/_sql"):
self.connection_string = self.connection_string.rstrip("/") + "/_sql"
Member Author

We'll probably connect XMover to the existing CLI subsystem that has been established to take care of relevant details for the many modules accruing here.

Member Author
@amotl amotl Aug 20, 2025

@WalBeh asked for configuration variables suitable for port-forwarding scenarios:

CRATE_CONNECTION_STRING=https://localhost:4200/
CRATE_USERNAME="crate"
CRATE_PASSWORD=""
CRATE_SSL_VERIFY="false"
PYTHONHTTPSVERIFY=0
PYTHONWARNINGS="ignore:Unverified HTTPS request"
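
For illustration, a minimal sketch of how a client could honor `CRATE_SSL_VERIFY=false` directly instead of relying on `PYTHONWARNINGS`; treating `urllib3` as the underlying HTTP layer is an assumption here, not something this PR prescribes:

```python
# Sketch only: suppress the "Unverified HTTPS request" warning when SSL
# verification is intentionally disabled, e.g. for a localhost port-forward.
import os

import urllib3

ssl_verify = os.getenv("CRATE_SSL_VERIFY", "true").lower() == "true"
if not ssl_verify:
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```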

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (6)
doc/admin/xmover/handbook.md (1)

462-469: Fix validate-move placeholders to match prior examples

Keep placeholder names consistent: use FROM_NODE and TO_NODE (not FROM/TO). Retain emphasis styling per project preference.

 **Disk Usage Threshold (default: 85%)**
 ```bash
 # Allow moves to nodes with higher disk usage
 xmover recommend --max-disk-usage 90 --prioritize-space

 # For urgent space relief
-xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM> <TO> --max-disk-usage 95
+xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM_NODE> <TO_NODE> --max-disk-usage 95

doc/admin/xmover/queries.md (4)

`51-70`: **Fix mismatch between query and example output table**

The SQL selects zone and available_gb from fs['total']['available'], but the sample output header/values show attributes['zone'] heading and fs[1]['disks'] indexing. Update the example block to mirror the query.



```diff
-```text
-+------------+--------------------+-----------------------------------------------+
-| name       | attributes['zone'] | (fs[1]['disks']['available'] / 1.073741824E9) |
-+------------+--------------------+-----------------------------------------------+
-| data-hot-5 | us-west-2a         |                            142.3342628479004  |
-...
-+------------+--------------------+-----------------------------------------------+
-```
+```text
++------------+--------------------+---------------+
+| name       | zone               | available_gb  |
++------------+--------------------+---------------+
+| data-hot-5 | us-west-2a         | 142.33        |
+| data-hot-0 | us-west-2a         | 142.03        |
+| data-hot-6 | us-west-2b         | 159.69        |
+| data-hot-3 | us-west-2b         | 798.81        |
+| data-hot-2 | us-west-2b         | 156.79        |
+| data-hot-1 | us-west-2c         | 145.74        |
+| data-hot-4 | us-west-2c         | 148.40        |
++------------+--------------------+---------------+
+```

72-93: Fix heading grammar and select only fully healthy shards

Use singular “node”. To list biggest shards, prefer STARTED and fully recovered shards. Keep 1024^3.

-## List biggest SHARDS on a particular Nodes
+## List biggest shards on a particular node
@@
-select node['name'], table_name, schema_name, id,  sum(size) / 1024^3 from sys.shards
+select
+  node['name'],
+  table_name,
+  schema_name,
+  id,
+  sum(size) / 1024^3 as size_gb
+from sys.shards
     where node['name'] = 'data-hot-2'
     AND routing_state = 'STARTED'
-    AND recovery['files']['percent'] = 0
+    AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     group by 1,2,3,4  order by 5  desc limit 8;

101-130: CTE: use completed recoveries and robust averages; normalize placeholders

Prefer a neutral placeholder table name, filter to healthy shards, and guard against division by zero. Keep 1024^3.

 WITH shard_summary AS (
     SELECT
         node['name'] AS node_name,
         table_name,
         schema_name,
         CASE
             WHEN "primary" = true THEN 'PRIMARY'
             ELSE 'REPLICA'
         END AS shard_type,
         COUNT(*) AS shard_count,
-        SUM(size) / 1024^3 AS total_size_gb
+        SUM(size) / 1024^3 AS total_size_gb
     FROM sys.shards
-    WHERE table_name = 'orderffD'
-        AND routing_state = 'STARTED'
-        AND recovery['files']['percent'] = 0
+    WHERE table_name = 'your_table_name'  -- e.g., 'orders'
+        AND routing_state = 'STARTED'
+        AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     GROUP BY node['name'], table_name, schema_name, "primary"
 )
 SELECT
     node_name,
     table_name,
     schema_name,
     shard_type,
     shard_count,
-    ROUND(total_size_gb, 2) AS total_size_gb,
-    ROUND(total_size_gb / shard_count, 2) AS avg_shard_size_gb
+    ROUND(total_size_gb, 2) AS total_size_gb,
+    ROUND(total_size_gb / NULLIF(shard_count, 0), 2) AS avg_shard_size_gb
 FROM shard_summary
 ORDER BY node_name, shard_type DESC, total_size_gb DESC;

132-175: Comprehensive distribution: prefer healthy shards and clearer placeholder guidance

Use coalesce(..., 100) >= 100 to focus on fully recovered shards and clarify placeholders. Keep 1024^3.

-    s.size / 1024^3 AS shard_size_gb,
+    s.size / 1024^3 AS shard_size_gb,
@@
-WHERE s.table_name = 'your_table_name'  -- Replace with your specific table name
+WHERE s.table_name = 'your_table_name'  -- Replace with your table name, e.g. 'orders'
     AND s.routing_state = 'STARTED'
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0
@@
-    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
-    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
+    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
+    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
@@
-WHERE s.table_name = 'orderffD'  -- Replace with your specific table name
+WHERE s.table_name = 'your_table_name'  -- Replace with your table name, e.g. 'orders'
     AND s.routing_state = 'STARTED'
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0
cratedb_toolkit/admin/xmover/analyzer.py (1)

76-82: Fix cache-key collision in zone-conflict cache by including schema_name

Two tables with the same name in different schemas collide on (table_name, shard_id, zone). This can return stale/conflicting results and allow unsafe moves. Include schema_name in the cache key and adjust the declared type accordingly.

-        self._zone_conflict_cache: Dict[Tuple[str, int, str], Union[str, None]] = {}
+        self._zone_conflict_cache: Dict[Tuple[str, str, int, str], Union[str, None]] = {}
@@
-        target_zone = self._get_node_zone(recommendation.to_node)
-        cache_key = (recommendation.table_name, recommendation.shard_id, target_zone)
+        target_zone = self._get_node_zone(recommendation.to_node)
+        cache_key = (recommendation.schema_name, recommendation.table_name, recommendation.shard_id, target_zone)

Also applies to: 464-469

🧹 Nitpick comments (15)
doc/admin/xmover/handbook.md (1)

171-176: Unify placeholder naming: use SCHEMA.TABLE, not SCHEMA_TABLE

Elsewhere in the docs and tests you consistently use SCHEMA.TABLE. Align the arguments list accordingly for consistency and to match the CLI usage examples.

-**Arguments:**
-- `SCHEMA_TABLE`: Schema and table name (format: schema.table)
-- `SHARD_ID`: Shard ID to move
-- `FROM_NODE`: Source node name
-- `TO_NODE`: Target node name
+**Arguments:**
+- `SCHEMA.TABLE`: Schema and table name (format: schema.table)
+- `SHARD_ID`: Shard ID to move
+- `FROM_NODE`: Source node name
+- `TO_NODE`: Target node name
doc/admin/xmover/troubleshooting.md (3)

291-292: Consistent placeholders: FROM_NODE and TO_NODE

Use FROM_NODE/TO_NODE to match the rest of the docs and CLI usage.

-   xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM> <TO>
+   xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM_NODE> <TO_NODE>

398-399: Update placeholders in Built-in Help example

Align with the standardized placeholders used across the docs.

- xmover validate-move SCHEMA.TABLE SHARD_ID FROM TO
+ xmover validate-move SCHEMA.TABLE SHARD_ID FROM_NODE TO_NODE

403-406: Replace bare URLs with proper Markdown links

Avoid MD034 by using Markdown link syntax for consistency with the rest of the docs.

-- **CrateDB Documentation**: https://crate.io/docs/
-- **Shard Allocation Guide**: https://crate.io/docs/crate/reference/en/latest/admin/system-information.html
-- **Cluster Settings**: https://crate.io/docs/crate/reference/en/latest/config/cluster.html
+- **CrateDB Documentation**: [crate.io/docs](https://crate.io/docs/)
+- **Shard Allocation Guide**: [System Information](https://crate.io/docs/crate/reference/en/latest/admin/system-information.html)
+- **Cluster Settings**: [Cluster Configuration](https://crate.io/docs/crate/reference/en/latest/config/cluster.html)
tests/admin/test_cli.py (1)

7-19: Optional: parametrize full argv to avoid interactive hangs for explain-error

If explain-error can prompt interactively when no message is provided, this test may hang on some environments. Parametrize the full argument list per subcommand to keep the test fully non-interactive.

-@pytest.mark.parametrize(
-    "subcommand",
-    [
-        "analyze",
-        "check-balance",
-        "explain-error",
-        "find-candidates",
-        "monitor-recovery",
-        "recommend",
-        "test-connection",
-        "zone-analysis",
-    ],
-)
-def test_xmover_all(cratedb, subcommand):
+@pytest.mark.parametrize(
+    "argv",
+    [
+        ["analyze"],
+        ["check-balance"],
+        ["explain-error", "NO(a copy of this shard is already allocated to this node)"],
+        ["find-candidates"],
+        ["monitor-recovery"],
+        ["recommend"],
+        ["test-connection"],
+        ["zone-analysis"],
+    ],
+)
+def test_xmover_all(cratedb, argv):
@@
-    result = runner.invoke(
-        cli,
-        args=[subcommand],
+    result = runner.invoke(
+        cli,
+        args=argv,
         env={"CRATE_CONNECTION_STRING": http_url},
         catch_exceptions=False,
     )

Would you like me to extend coverage with a smoke test for recommend --execute that asserts the output contains SQL (without applying it)?

Also applies to: 20-20, 27-29

cratedb_toolkit/admin/xmover/attic.py (3)

8-18: Align parameter naming with analyzer API (min_free_space_gb)

The analyzer exposes plan_node_decommission(node_name, min_free_space_gb=...). The CLI stub uses min_free_space. Aligning names reduces confusion and avoids mismatches when wiring this up.

-# def decommission(ctx, node_name: str, min_free_space: float, dry_run: bool):
+# def decommission(ctx, node_name: str, min_free_space_gb: float, dry_run: bool):
@@
-#     plan = analyzer.plan_node_decommission(node_name, min_free_space)
+#     plan = analyzer.plan_node_decommission(node_name, min_free_space_gb)
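
When the command gets activated, the aligned option wiring could look roughly like the following sketch (option name and default are illustrative, not the PR's final API):

```python
# Sketch only: keep the Click option name aligned with the analyzer's
# plan_node_decommission(node_name, min_free_space_gb=...) parameter.
import click


@click.command()
@click.argument("node_name")
@click.option(
    "--min-free-space-gb",
    type=float,
    default=100.0,
    help="Minimum free space required on target nodes (GB).",
)
def decommission(node_name: str, min_free_space_gb: float) -> None:
    """Plan a safe decommission of NODE_NAME."""
    click.echo(
        f"Planning decommission of {node_name} "
        f"(min free space: {min_free_space_gb} GB)"
    )
```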

16-43: Commented CLI stub references undefined symbols; keep notes or add imports when activating

console, Panel, Table, box, and format_size are used but not imported. Fine while commented; when enabling, import from rich and any helpers to avoid NameError.

Example imports to add when activating:

+# from rich.console import Console
+# from rich.panel import Panel
+# from rich.table import Table
+# from rich import box
+# from .cli_helpers import format_size  # or wherever it is defined
+# console = Console()

111-119: Exit status messages look good; consider exposing actionable next steps

The three terminal states cover infeasible, zero shards, and pending moves. Consider printing a one-liner hint to re-run with --execute when feasible, or a pointer to the “XMover Query Gallery” section for manual checks.

doc/admin/xmover/queries.md (2)

6-21: Add aliases and (optionally) filter to healthy shards for clarity

Readable column names help, and most users want STARTED shards in such summaries. Keep 1024^3 as-is; CrateDB’s caret is exponentiation.

-select node['name'], sum(size) / 1024^3, count(id)  from sys.shards  group by 1  order by 1 asc;
+select
+  node['name'],
+  sum(size) / 1024^3 as total_size_gb,
+  count(id) as shard_count
+from sys.shards
+-- optionally focus on healthy shards:
+-- where routing_state = 'STARTED'
+group by 1
+order by 1 asc;

24-47: PRIMARY/REPLICA breakdown query is fine; consider aliasing columns

Query is correct and idiomatic; adding aliases improves readability in docs.

-select node['name'], primary,  sum(size) / 1024^3, count(id)  from sys.shards  group by 1,2  order by 1 asc;
+select
+  node['name'],
+  primary,
+  sum(size) / 1024^3 as total_size_gb,
+  count(id) as shard_count
+from sys.shards
+group by 1,2
+order by 1 asc;
cratedb_toolkit/admin/xmover/analyzer.py (5)

259-279: Use mean-based target per zone and pass table_name explicitly

target_per_zone uses integer division (//), which biases thresholds downward and can misclassify zones around the boundary. Use the mean to compute thresholds. Also pass table_name by name for clarity (check_zone_balance ignores tolerance).

-        zone_stats = self.check_zone_balance(table_name, zone_tolerance_percent)
+        zone_stats = self.check_zone_balance(table_name=table_name)
@@
-        total_shards = sum(stats["TOTAL"] for stats in zone_stats.values())
-        zones = list(zone_stats.keys())
-        target_per_zone = total_shards // len(zones) if zones else 0
+        total_shards = sum(stats["TOTAL"] for stats in zone_stats.values())
+        zones = list(zone_stats.keys())
+        target_per_zone = (total_shards / len(zones)) if zones else 0.0
@@
-            if current_count > threshold_high:
+            if current_count > threshold_high:
                 overloaded_zones.append(zone)
-            elif current_count < threshold_low:
+            elif current_count < threshold_low:
                 underloaded_zones.append(zone)

254-256: Prefer logger over print for library code

Using logger with appropriate levels makes output configurable for CLI vs. library use and integrates with existing logging.

-        print(f"Analyzing {len(moveable_shards)} candidate shards in size range {min_size_gb}-{max_size_gb}GB...")
+        logger.info(
+            "Analyzing %d candidate shards in size range %.1f-%.1f GB...",
+            len(moveable_shards), min_size_gb, max_size_gb
+        )
@@
-        print(f"Generated {len(recommendations)} move recommendations (evaluated {total_evaluated} shards)")
-        print(f"Performance: {self.get_cache_stats()}")
+        logger.info(
+            "Generated %d move recommendations (evaluated %d shards). %s",
+            len(recommendations), total_evaluated, self.get_cache_stats()
+        )

Also applies to: 411-415


433-445: Avoid hard-coded 50GB buffer; make it configurable

Hard-coding an extra 50GB may be too conservative or too lax depending on clusters. Consider making buffer space a parameter (propagated from CLI), or derive from watermarks and shard size (e.g., max(5% of shard, 10GB)).

Example refactor (signature-only sketch, no diff applied elsewhere):

  • Add parameter: validate_move_safety(..., extra_buffer_gb: float = 50.0)
  • Pass through from generate_rebalancing_recommendations and CLI options.
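
A rough sketch of what such a size-derived default could look like (the helper name and the 5%/10 GB rule are illustrative, not the PR's API):

```python
# Sketch only: replace the fixed 50 GB headroom with a configurable buffer
# that falls back to a size-derived default (the larger of 5% of the shard
# size or 10 GB).
from typing import Optional


def required_free_space_gb(shard_size_gb: float, extra_buffer_gb: Optional[float] = None) -> float:
    """Free space a target node should offer before accepting this shard."""
    if extra_buffer_gb is None:
        extra_buffer_gb = max(shard_size_gb * 0.05, 10.0)
    return shard_size_gb + extra_buffer_gb
```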

765-779: Decommission plan uses STARTED shards but includes analysis-time data; consider using fresh healthy-shard snapshot

self.shards is fetched with for_analysis=True (includes non-healthy states). The filter checks routing_state == 'STARTED', but doesn’t enforce completed recovery. Consider re-fetching via client.get_shards_info(for_analysis=False) scoped to the node to ensure strictly healthy shards for move planning.

-        node_shards = [s for s in self.shards if s.node_name == node_name and s.routing_state == "STARTED"]
+        # Prefer an operational snapshot limited to the node
+        node_shards = [
+            s for s in self.client.get_shards_info(for_analysis=False)
+            if s.node_name == node_name
+        ]

610-655: Zone-conflict checks are solid; consider deduping the two similar checks

You check for “zone already has a copy” twice (once via zones_with_copies, once via current_zone_counts). Not harmful, but you can collapse to a single check to reduce query load.
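
A possible shape for the collapsed check, sketched with illustrative names (the real data structures in analyzer.py may differ):

```python
# Sketch only: answer "does the target zone already hold a copy?" from a
# single per-zone copy count instead of two separate lookups.
from typing import Dict, Optional


def zone_conflict_reason(target_zone: str, zone_copy_counts: Dict[str, int]) -> Optional[str]:
    """Return a human-readable conflict reason, or None when the move is zone-safe."""
    if zone_copy_counts.get(target_zone, 0) > 0:
        return f"zone '{target_zone}' already holds a copy of this shard"
    return None
```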

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 445d908 and 6ab37b7.

📒 Files selected for processing (15)
  • CHANGES.md (1 hunks)
  • cratedb_toolkit/admin/xmover/__init__.py (1 hunks)
  • cratedb_toolkit/admin/xmover/analyzer.py (1 hunks)
  • cratedb_toolkit/admin/xmover/attic.py (1 hunks)
  • cratedb_toolkit/admin/xmover/cli.py (1 hunks)
  • cratedb_toolkit/admin/xmover/database.py (1 hunks)
  • cratedb_toolkit/cli.py (2 hunks)
  • doc/admin/index.md (1 hunks)
  • doc/admin/xmover/handbook.md (1 hunks)
  • doc/admin/xmover/index.md (1 hunks)
  • doc/admin/xmover/queries.md (1 hunks)
  • doc/admin/xmover/troubleshooting.md (1 hunks)
  • doc/index.md (1 hunks)
  • pyproject.toml (3 hunks)
  • tests/admin/test_cli.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • cratedb_toolkit/admin/xmover/__init__.py
  • CHANGES.md
  • cratedb_toolkit/cli.py
  • pyproject.toml
  • doc/admin/index.md
  • cratedb_toolkit/admin/xmover/cli.py
  • cratedb_toolkit/admin/xmover/database.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-20T09:20:33.795Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/handbook.md:462-469
Timestamp: 2025-08-20T09:20:33.795Z
Learning: In large documentation files with many existing headings, be less strict about flagging emphasis-as-heading (MD036) issues, as emphasis can serve valid purposes for minor subsections and clarifications rather than requiring full heading treatment.

Applied to files:

  • doc/admin/xmover/handbook.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, following PostgreSQL conventions. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries. Do not flag this as incorrect.

Applied to files:

  • doc/admin/xmover/queries.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, similar to PostgreSQL. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries.

Applied to files:

  • doc/admin/xmover/queries.md
🧬 Code Graph Analysis (2)
cratedb_toolkit/admin/xmover/analyzer.py (1)
cratedb_toolkit/admin/xmover/database.py (16)
  • NodeInfo (17-39)
  • RecoveryInfo (65-112)
  • ShardInfo (43-61)
  • shard_type (60-61)
  • shard_type (96-97)
  • size_gb (91-93)
  • get_nodes_info (155-189)
  • get_shards_info (191-271)
  • available_space_gb (38-39)
  • disk_usage_percent (34-35)
  • execute_query (135-153)
  • get_cluster_watermarks (345-364)
  • heap_usage_percent (30-31)
  • get_all_recovering_shards (457-488)
  • overall_progress (86-88)
  • total_time_seconds (100-102)
tests/admin/test_cli.py (2)
tests/conftest.py (1)
  • cratedb (133-138)
cratedb_toolkit/testing/testcontainers/cratedb.py (2)
  • http_url (216-222)
  • get_http_url (209-213)
🪛 LanguageTool
doc/admin/xmover/handbook.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-handbook)= # XMover Handbook ## Installation Instal...

(QB_NEW_EN)


[grammar] ~82-~82: There might be a mistake here.
Context: ...tion across nodes and zones. Options: - --table, -t: Analyze specific table only **Example...

(QB_NEW_EN)


[grammar] ~93-~93: There might be a mistake here.
Context: ...on size and health criteria. Options: - --table, -t: Find candidates in specific table only...

(QB_NEW_EN)


[grammar] ~94-~94: There might be a mistake here.
Context: ...: Find candidates in specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~95-~95: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~96-~96: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --node: Only show candidates from this specifi...

(QB_NEW_EN)


[grammar] ~111-~111: There might be a mistake here.
Context: ...ons for cluster rebalancing. Options: - --table, -t: Generate recommendations for specific ...

(QB_NEW_EN)


[grammar] ~112-~112: There might be a mistake here.
Context: ... recommendations for specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~113-~113: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --zone-tolerance: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...lance tolerance percentage (default: 10) - --min-free-space: Minimum free space required on target ...

(QB_NEW_EN)


[grammar] ~116-~116: There might be a mistake here.
Context: ...red on target nodes in GB (default: 100) - --max-moves: Maximum number of move recommendations...

(QB_NEW_EN)


[grammar] ~117-~117: There might be a mistake here.
Context: ...er of move recommendations (default: 10) - --max-disk-usage: Maximum disk usage percentage for targ...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...ercentage for target nodes (default: 85) - --validate/--no-validate: Validate move safety (default: True) -...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...e: Validate move safety (default: True) - --prioritize-space/--prioritize-zones`: Prioritize available space over zone b...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...ace over zone balancing (default: False) - --dry-run/--execute: Show what would be done without genera...

(QB_NEW_EN)


[grammar] ~121-~121: There might be a mistake here.
Context: ... generating SQL commands (default: True) - --node: Only recommend moves from this specifi...

(QB_NEW_EN)


[grammar] ~145-~145: There might be a mistake here.
Context: ...ion and potential conflicts. Options: - --table, -t: Analyze zones for specific table only ...

(QB_NEW_EN)


[grammar] ~146-~146: There might be a mistake here.
Context: ...: Analyze zones for specific table only - --show-shards/--no-show-shards`: Show individual shard details (default...

(QB_NEW_EN)


[grammar] ~157-~157: There might be a mistake here.
Context: ...with configurable tolerance. Options: - --table, -t: Check balance for specific table only ...

(QB_NEW_EN)


[grammar] ~158-~158: There might be a mistake here.
Context: ...: Check balance for specific table only - --tolerance`: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~171-~171: There might be a mistake here.
Context: ...ecution to prevent errors. Arguments: - SCHEMA_TABLE: Schema and table name (format: schema....

(QB_NEW_EN)


[grammar] ~172-~172: There might be a mistake here.
Context: ...ma and table name (format: schema.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source...

(QB_NEW_EN)


[grammar] ~173-~173: There might be a mistake here.
Context: ...ma.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source node name - TO_NODE: Target n...

(QB_NEW_EN)


[grammar] ~174-~174: There might be a mistake here.
Context: ... to move - FROM_NODE: Source node name - TO_NODE: Target node name Examples: ```bas...

(QB_NEW_EN)


[grammar] ~189-~189: There might be a mistake here.
Context: ... troubleshooting guidance. Arguments: - ERROR_MESSAGE: The CrateDB error message to analyze (...

(QB_NEW_EN)


[grammar] ~204-~204: There might be a mistake here.
Context: ...y operations on the cluster. Options: - --table, -t: Monitor recovery for specific table on...

(QB_NEW_EN)


[grammar] ~205-~205: There might be a mistake here.
Context: ...Monitor recovery for specific table only - --node, -n: Monitor recovery on specific node only...

(QB_NEW_EN)


[grammar] ~206-~206: There might be a mistake here.
Context: ...: Monitor recovery on specific node only - --watch, -w: Continuously monitor (refresh every 10...

(QB_NEW_EN)


[grammar] ~207-~207: There might be a mistake here.
Context: ...Continuously monitor (refresh every 10s) - --refresh-interval: Refresh interval for watch mode in sec...

(QB_NEW_EN)


[grammar] ~208-~208: There might be a mistake here.
Context: ... for watch mode in seconds (default: 10) - --recovery-type: Filter by recovery type - PEER, DISK, ...

(QB_NEW_EN)


[grammar] ~209-~209: There might be a mistake here.
Context: ...type - PEER, DISK, or all (default: all) - --include-transitioning: Include recently completed recoveries ...

(QB_NEW_EN)


[grammar] ~230-~230: There might be a mistake here.
Context: ...ude-transitioning ``` Recovery Types: - PEER: Copying shard data from another ...

(QB_NEW_EN)


[grammar] ~231-~231: There might be a mistake here.
Context: ...om another node (replication/relocation) - DISK: Rebuilding shard from local data...

(QB_NEW_EN)


[grammar] ~364-~364: There might be a mistake here.
Context: ...TRING: CrateDB HTTP endpoint (required) - CRATE_USERNAME`: Username for authentication (optional)...

(QB_NEW_EN)


[grammar] ~365-~365: There might be a mistake here.
Context: ...: Username for authentication (optional) - CRATE_PASSWORD: Password for authentication (optional)...

(QB_NEW_EN)


[grammar] ~366-~366: There might be a mistake here.
Context: ...: Password for authentication (optional) - CRATE_SSL_VERIFY: Enable SSL certificate verification (d...

(QB_NEW_EN)


[grammar] ~421-~421: There might be a mistake here.
Context: ...zone awareness prevents too many copies in same zone - Solution: Move shard...

(QB_NEW_EN)


[grammar] ~422-~422: There might be a mistake here.
Context: ... copies in same zone - Solution: Move shard to a different availability zone ...

(QB_NEW_EN)


[grammar] ~430-~430: There might be a mistake here.
Context: ...ufficient free space - Solution: Choose node with more capacity or free up spac...

(QB_NEW_EN)


[grammar] ~441-~441: There might be a mistake here.
Context: ...pace` 5. No Recommendations Generated - Cause: Cluster may already be well bal...

(QB_NEW_EN)


[grammar] ~471-~471: There might be a mistake here.
Context: ...age 95 ``` When to Adjust Thresholds: - Emergency situations: Increase to 90-9...

(QB_NEW_EN)


[style] ~472-~472: ‘Emergency situations’ might be wordy. Consider a shorter alternative.
Context: ...``` When to Adjust Thresholds: - Emergency situations: Increase to 90-95% for critical spac...

(EN_WORDINESS_PREMIUM_EMERGENCY_SITUATIONS)


[grammar] ~474-~474: There might be a mistake here.
Context: ...onments**: Can be more aggressive (90%+) - Production: Keep conservative (80-85%)...

(QB_NEW_EN)


[grammar] ~479-~479: There might be a mistake here.
Context: ...f}xmover-troubleshooting which covers: - Step-by-step diagnostic procedures - Eme...

(QB_NEW_EN)


[grammar] ~480-~480: There might be a mistake here.
Context: ...rs: - Step-by-step diagnostic procedures - Emergency recovery procedures - Best pra...

(QB_NEW_EN)


[grammar] ~481-~481: There might be a mistake here.
Context: ...ocedures - Emergency recovery procedures - Best practices for safe operations - Com...

(QB_NEW_EN)


[grammar] ~482-~482: There might be a mistake here.
Context: ...res - Best practices for safe operations - Complete error reference guide ### Debu...

(QB_NEW_EN)

doc/admin/xmover/index.md

[grammar] ~8-~8: There might be a mistake here.
Context: ...nd availability zones. It generates safe SQL commands for shard rebalancing and n...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...hard distribution across nodes and zones - Shard Movement Recommendations: Intell...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...s for rebalancing with safety validation - Recovery Monitoring: Track ongoing sha...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...ecovery operations with progress details - Zone Conflict Detection: Prevents move...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...t would violate CrateDB's zone awareness - Node Decommissioning: Plan safe node r...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ... removal with automated shard relocation - Dry Run Mode: Test recommendations wit...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...s without generating actual SQL commands - Safety Validation: Comprehensive check...

(QB_NEW_EN)

doc/admin/xmover/queries.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-queries)= # XMover Query Gallery ## Shard Distribut...

(QB_NEW_EN)

doc/admin/xmover/troubleshooting.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-troubleshooting)= # Troubleshooting CrateDB using XMover Th...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ...ns ### 1. Zone Conflicts #### Symptoms - Error: `NO(a copy of this shard is alrea...

(QB_NEW_EN)


[grammar] ~33-~33: There might be a mistake here.
Context: ...s in safety validation #### Root Causes - Target node already has a copy of the sh...

(QB_NEW_EN)


[grammar] ~64-~64: There might be a mistake here.
Context: ...D FROM_NODE TO_NODE ``` #### Prevention - Always use xmover recommend instead of...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: ...Insufficient Space Issues #### Symptoms - Error: not enough disk space - Safety ...

(QB_NEW_EN)


[grammar] ~72-~72: There might be a mistake here.
Context: ...nt Space Issues #### Symptoms - Error: not enough disk space - Safety validation fails with space warni...

(QB_NEW_EN)


[grammar] ~76-~76: There might be a mistake here.
Context: ...es in cluster analysis #### Root Causes - Target node doesn't have enough free spa...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...max-size 50 ``` Step 3: Free Up Space - Delete old snapshots and unused data - M...

(QB_NEW_EN)


[grammar] ~106-~106: There might be a mistake here.
Context: ...ng nodes to the cluster #### Prevention - Monitor disk usage regularly with `xmove...

(QB_NEW_EN)


[grammar] ~107-~107: There might be a mistake here.
Context: ...ion - Monitor disk usage regularly with xmover analyze - Set conservative --min-free-space valu...

(QB_NEW_EN)


[grammar] ~113-~113: There might be a mistake here.
Context: .... Node Performance Issues #### Symptoms - Error: shard recovery limit - High hea...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...formance Issues #### Symptoms - Error: shard recovery limit - High heap usage warnings - Slow shard mo...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...covery limit` - High heap usage warnings - Slow shard movement operations #### Roo...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...rd movement operations #### Root Causes - Too many concurrent shard movements - Hi...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...es - Too many concurrent shard movements - High heap usage on target nodes (>80%) -...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...- High heap usage on target nodes (>80%) - Resource contention during moves #### S...

(QB_NEW_EN)


[grammar] ~149-~149: There might be a mistake here.
Context: ... --prioritize-space ``` #### Prevention - Move shards gradually (5-10 at a time) -...

(QB_NEW_EN)


[grammar] ~161-~161: There might be a mistake here.
Context: ...nificantly more shards #### Root Causes - Historical data distribution patterns - ...

(QB_NEW_EN)


[grammar] ~162-~162: There might be a mistake here.
Context: ... - Historical data distribution patterns - Node additions/removals without rebalanc...

(QB_NEW_EN)


[grammar] ~163-~163: There might be a mistake here.
Context: ...e additions/removals without rebalancing - Tables created with poor initial distrib...

(QB_NEW_EN)


[grammar] ~194-~194: There might be a mistake here.
Context: ...progress and repeat ``` #### Prevention - Run regular balance checks: `xmover chec...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ...## Symptoms - "Connection failed" errors - Authentication failures - SSL/TLS errors...

(QB_NEW_EN)


[grammar] ~203-~203: There might be a mistake here.
Context: ...failed" errors - Authentication failures - SSL/TLS errors #### Root Causes - Incor...

(QB_NEW_EN)


[grammar] ~206-~206: There might be a mistake here.
Context: ...lures - SSL/TLS errors #### Root Causes - Incorrect connection string in .env - ...

(QB_NEW_EN)


[grammar] ~208-~208: There might be a mistake here.
Context: ...ion string in .env - Wrong credentials - Network connectivity issues - SSL certif...

(QB_NEW_EN)


[grammar] ~209-~209: There might be a mistake here.
Context: ...redentials - Network connectivity issues - SSL certificate problems #### Solutions...

(QB_NEW_EN)


[grammar] ~241-~241: There might be a mistake here.
Context: ..."stmt":"SELECT 1"}' ``` #### Prevention - Use .env.example as a template - Verif...

(QB_NEW_EN)


[grammar] ~244-~244: There might be a mistake here.
Context: ... with CrateDB admin - Test connectivity from deployment environment ## Error Messag...

(QB_NEW_EN)


[grammar] ~262-~262: There might be a mistake here.
Context: ... | Error Pattern | Meaning | Quick Fix | |---------------|---------|-----------| ...

(QB_NEW_EN)


[grammar] ~263-~263: There might be a mistake here.
Context: ... |---------------|---------|-----------| | `copy of this shard is already allocat...

(QB_NEW_EN)


[grammar] ~264-~264: There might be a mistake here.
Context: ...s shard | Choose different target node | | `too many copies...with attribute [zon...

(QB_NEW_EN)


[grammar] ~265-~265: There might be a mistake here.
Context: ...imit exceeded | Move to different zone | | not enough disk space | Insufficient...

(QB_NEW_EN)


[grammar] ~266-~266: There might be a mistake here.
Context: ... | Free space or choose different node | | shard recovery limit | Too many conc...

(QB_NEW_EN)


[grammar] ~267-~267: There might be a mistake here.
Context: ...oves | Wait and retry with fewer moves | | allocation is disabled | Cluster all...

(QB_NEW_EN)


[grammar] ~331-~331: There might be a mistake here.
Context: ... performance** - Query response times - Resource utilization - Error rates ## Emerge...

(QB_NEW_EN)


[grammar] ~332-~332: There might be a mistake here.
Context: ...response times - Resource utilization - Error rates ## Emergency Procedures ### Stu...

(QB_NEW_EN)


[grammar] ~403-~403: There might be a mistake here.
Context: ... Documentation**: https://crate.io/docs/ - Shard Allocation Guide: https://crate....

(QB_NEW_EN)


[grammar] ~404-~404: There might be a mistake here.
Context: .../en/latest/admin/system-information.html - Cluster Settings: https://crate.io/doc...

(QB_NEW_EN)


[grammar] ~411-~411: There might be a mistake here.
Context: ...e: 1. XMover version and command used 2. Complete error message 3. **Cluster in...

(QB_NEW_EN)


[grammar] ~412-~412: There might be a mistake here.
Context: ...mmand used** 2. Complete error message 3. Cluster information (xmover analyze ...

(QB_NEW_EN)


[grammar] ~413-~413: There might be a mistake here.
Context: ... information** (xmover analyze output) 4. Zone analysis (xmover zone-analysis ...

(QB_NEW_EN)


[grammar] ~414-~414: There might be a mistake here.
Context: ...alysis** (xmover zone-analysis output) 5. CrateDB version and configuration ###...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
doc/admin/xmover/handbook.md

462-462: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

doc/admin/xmover/troubleshooting.md

40-40: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


49-49: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


58-58: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


83-83: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


92-92: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


101-101: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


125-125: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


134-134: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


143-143: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


168-168: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


177-177: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


186-186: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


214-214: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


220-220: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


232-232: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


403-403: Bare URL used

(MD034, no-bare-urls)


404-404: Bare URL used

(MD034, no-bare-urls)


405-405: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (3)
doc/index.md (1)

33-33: Confirm placement of Admin docs under Diagnostics

Adding admin/index under the Diagnostics toctree looks intentional. Please confirm this is the desired placement versus “Utilities,” given that Admin pages may include operational tooling. If intended, no change needed; otherwise, consider moving it to the Utilities toctree for consistency.

doc/admin/xmover/index.md (1)

1-29: Solid overview and navigation

Clear feature list and a tidy toctree. Reads well and aligns with the new CLI and analyzer modules.

tests/admin/test_cli.py (1)

27-33: Click CliRunner args must be a list, not a string

Passing a string to args will be iterated character-by-character. Provide a list of arguments to avoid parsing issues.

-    result = runner.invoke(
-        cli,
-        args=subcommand,
-        env={"CRATE_CONNECTION_STRING": http_url},
-        catch_exceptions=False,
-    )
+    result = runner.invoke(
+        cli,
+        args=[subcommand],
+        env={"CRATE_CONNECTION_STRING": http_url},
+        catch_exceptions=False,
+    )

Likely an incorrect or invalid review comment.

Comment on lines +111 to +136
**Options:**
- `--table, -t`: Generate recommendations for specific table only
- `--min-size`: Minimum shard size in GB (default: 40)
- `--max-size`: Maximum shard size in GB (default: 60)
- `--zone-tolerance`: Zone balance tolerance percentage (default: 10)
- `--min-free-space`: Minimum free space required on target nodes in GB (default: 100)
- `--max-moves`: Maximum number of move recommendations (default: 10)
- `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 85)
- `--validate/--no-validate`: Validate move safety (default: True)
- `--prioritize-space/--prioritize-zones`: Prioritize available space over zone balancing (default: False)
- `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True)
- `--node`: Only recommend moves from this specific source node (e.g., data-hot-4)


💡 Verification agent

🧩 Analysis chain

Verify documented defaults match CLI option defaults

The defaults listed for recommend flags look reasonable. Please verify they match the actual Click option defaults to prevent drift.

Run this script to extract option definitions and defaults from the CLI module and spot-check against the handbook:


🏁 Script executed:

#!/bin/bash
# Inspect CLI options for `recommend` and others to verify defaults vs. docs.
# Shows the Click option lines with defaults in the xmover CLI.

# Find the xmover CLI module
fd -t f -g 'cli.py' | rg -n 'admin/xmover/cli.py' -n || true

# Dump option definitions with surrounding context
rg -n -C2 --type=py --fixed-strings \
  -e "@click.command(" \
  -e "@click.option(" \
  -e "def recommend(" \
  -e "def find_candidates(" \
  -e "def analyze(" \
  -e "def zone_analysis(" \
  -e "def check_balance(" \
  -e "def validate_move(" \
  -e "def explain_error(" \
  -e "def monitor_recovery(" \
  cratedb_toolkit/admin/xmover/cli.py

Length of output: 5429


Documented defaults mismatch in recommend command

The CLI’s defaults largely align with the handbook, except:

  • --max-disk-usage is defined in code as 90.0 (%) but the doc shows 85.
  • The --auto-execute flag (default False) is present in the CLI but missing from the handbook.

Please update doc/admin/xmover/handbook.md accordingly:

• Change
--max-disk-usage: Maximum disk usage percentage for target nodes (default: 85)
to
--max-disk-usage: Maximum disk usage percentage for target nodes (default: 90)

• Add
--auto-execute: DANGER: Automatically execute the SQL commands (requires --execute, asks for confirmation) (default: False)

🧰 Tools
🪛 LanguageTool

[grammar] ~111-~121: "There might be a mistake here." flagged on each option bullet of the list above (QB_NEW_EN)

🤖 Prompt for AI Agents
In doc/admin/xmover/handbook.md around lines 111 to 123, update the documented
defaults to match the CLI: change the `--max-disk-usage` line to show (default:
90) instead of 85, and add a new bullet for `--auto-execute` reading something
like: `--auto-execute`: DANGER: Automatically execute the SQL commands (requires
--execute, asks for confirmation) (default: False); ensure formatting and
ordering match the other options.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

🧹 Nitpick comments (25)
cratedb_toolkit/admin/xmover/util/error.py (8)

3-3: Prefer get_console() over constructing Console directly

Using Rich’s global console improves integration with global config, test capture, and TTY detection.

 from rich.console import Console
+from rich import get_console
@@
-console = Console()
+console = get_console()

Also applies to: 6-6


9-29: Add return type, make prompting optional, and fix the “press Enter twice” behavior

  • Return type clarifies intent.
  • Optional prompting prevents accidental blocking in non-interactive contexts.
  • Implement the documented “Enter twice” exit condition while still preserving single blank lines within the message.
-def explain_cratedb_error(error_message: Optional[str]):
+def explain_cratedb_error(
+    error_message: Optional[str] = None,
+    *,
+    interactive: bool = True,
+) -> None:
+    """
+    Decode and troubleshoot common CrateDB shard allocation errors.
+
+    Parameters
+    ----------
+    error_message:
+        Raw CrateDB error message. If None and interactive=True, the user is prompted
+        to paste the message (finish with two blank lines).
+    interactive:
+        When False, never prompt for input; return early if no message is provided.
+    """
@@
-    if not error_message:
+    if error_message is None and interactive:
         console.print("Please paste the CrateDB error message (press Enter twice when done):")
         lines: List[str] = []
         while True:
             try:
-                line = input()
-                if line.strip() == "" and lines:
-                    break
-                lines.append(line)
+                line = input()
+                # Implement "press Enter twice" to finish; keep single blank lines.
+                if line.strip() == "":
+                    # Break on two consecutive blank lines
+                    if lines and lines[-1] == "":
+                        break
+                    # Keep a single blank line
+                    lines.append("")
+                    continue
+                lines.append(line)
             except (EOFError, KeyboardInterrupt):
                 break
         error_message = "\n".join(lines)
 
-    if not error_message.strip():
+    if not (error_message or "").strip():
         console.print("[yellow]No error message provided[/yellow]")
         return

85-87: Merge the REST endpoint and payload into a single solution bullet; remove stray trailing space

This reads cleaner and avoids an orphaned list item containing only JSON.

-                "Re-enable allocation: PUT /_cluster/settings "
-                '{"persistent":{"cluster.routing.allocation.enable":"all"}}',
+                "Re-enable allocation via REST: PUT /_cluster/settings with body "
+                "{'persistent': {'cluster.routing.allocation.enable': 'all'}}",

35-92: Define a typed structure for error patterns

Using a TypedDict or dataclass improves readability, editor support, and avoids casts.

Example:

from typing import TypedDict, List

class ErrorPattern(TypedDict):
    pattern: str
    title: str
    explanation: str
    solutions: List[str]
    prevention: str

error_patterns: List[ErrorPattern] = [
    {"pattern": "...", "title": "...", "explanation": "...", "solutions": ["..."], "prevention": "..."},
    # ...
]

98-100: Substring matching is brittle; consider regex with IGNORECASE

Many upstream messages vary slightly. Regex (re.search with re.I) will reduce false negatives (and allow multiple phrasing variants).

If you prefer to keep it simple now, at least normalize whitespace before matching:

error_norm = " ".join(error_message.split()).lower()

Then compare patterns against error_norm.
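
For illustration, a minimal sketch of the regex variant; the pattern text and dict fields below are placeholders, not the module's actual entries:

import re

# Hypothetical pattern entry mirroring the error_patterns structure above.
pattern_info = {
    "pattern": r"a copy of this shard is already allocated to this node",
    "title": "Shard copy already present on target node",
}

error_message = "NO(a copy of this shard is already\n  allocated to this node [abc123])"

# Normalize whitespace so multi-line messages still match single-line patterns.
error_norm = " ".join(error_message.split())

if re.search(pattern_info["pattern"], error_norm, re.IGNORECASE):
    print(f"Matched: {pattern_info['title']}")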


1-1: Remove unnecessary typing.cast usage and import

Keys are known; the cast provides little value and can be dropped.

-from typing import List, Optional, cast
+from typing import List, Optional
@@
-        if cast(str, pattern_info["pattern"]).lower() in error_lower:
+        if pattern_info["pattern"].lower() in error_lower:

Also applies to: 99-99


121-127: Deduplicate general steps; steps 1 and 4 both reference analyze

Consider replacing step 4 with a capacity-focused command or subcommand hint (e.g., “xmover analyze --nodes” if available) to avoid repetition.


131-133: Link to XMover troubleshooting docs rather than generic CrateDB page

Point users directly to the new XMover troubleshooting docs for higher relevance.

Proposed change:

-        "[dim]📚 For more help: https://crate.io/docs/crate/reference/en/latest/admin/system-information.html[/dim]"
+        "[dim]📚 For more help: https://cratedb-toolkit.readthedocs.io/en/latest/admin/xmover/troubleshooting.html[/dim]"

Please verify the final URL matches the published docs path after merge.

cratedb_toolkit/admin/xmover/util/format.py (3)

1-8: Prefer 1024-based units to match upstream size calculations (consistency).

Model.size_gb uses 1024**3, but format_size switches units at 1000 and converts MB with *1000. This mixes IEC and SI units and can lead to user-visible inconsistencies (e.g., 0.99 GiB showing as 990MB). Consider using 1024 thresholds and conversions throughout to stay consistent with CrateDB byte math.

Apply this diff:

 def format_size(size_gb: float) -> str:
     """Format size in GB with appropriate precision"""
-    if size_gb >= 1000:
-        return f"{size_gb / 1000:.1f}TB"
-    elif size_gb >= 1:
+    if size_gb >= 1024:
+        return f"{size_gb / 1024:.1f}TB"
+    elif size_gb >= 1:
         return f"{size_gb:.1f}GB"
     else:
-        return f"{size_gb * 1000:.0f}MB"
+        return f"{size_gb * 1024:.0f}MB"

39-45: Reuse format_size in format_translog_info; remove duplicated unit logic.

This keeps formatting consistent (TB/GB/MB) and reduces code duplication.

Apply this diff:

-    # Format size
-    if tl_gb >= 1.0:
-        size_str = f"{tl_gb:.1f}GB"
-    else:
-        size_str = f"{tl_gb * 1000:.0f}MB"
+    # Format size
+    size_str = format_size(tl_gb)

25-27: Extract the “significance” threshold into a named constant.

Makes it easier to tune and to reference in tests.

Add this near the top of the module:

TL_MIN_BYTES = 10 * 1024 * 1024  # 10MiB threshold for visibility

Then:

-    if tl_bytes < 10 * 1024 * 1024:  # 10MB for production
+    if tl_bytes < TL_MIN_BYTES:
         return ""
cratedb_toolkit/admin/xmover/tune/recover.py (5)

10-10: Import format_size and use it for human-friendly sizes.

You already have a formatter; reuse it for consistency across outputs.

Apply this diff:

-from cratedb_toolkit.admin.xmover.util.format import format_translog_info
+from cratedb_toolkit.admin.xmover.util.format import format_translog_info, format_size

129-146: Use format_size in the table and fix the column header.

Present sizes with proper units (MB/GB/TB) and consistent precision.

Apply this diff:

-        headers = ["Table", "Shard", "Node", "Type", "Stage", "Progress", "Size(GB)", "Time(s)"]
+        headers = ["Table", "Shard", "Node", "Type", "Stage", "Progress", "Size", "Time(s)"]
@@
-            row = [
+            row = [
                 f"{recovery.schema_name}.{recovery.table_name}",
                 str(recovery.shard_id),
                 recovery.node_name,
                 recovery.shard_type,
                 recovery.stage,
                 f"{recovery.overall_progress:.1f}%",
-                f"{recovery.size_gb:.1f}",
+                format_size(recovery.size_gb),
                 f"{recovery.total_time_seconds:.1f}",
             ]

228-233: Use format_size in watch output for consistent units.

Replace manual “.1f GB” formatting with format_size for all change messages.

Apply this diff:

-                                            f"{recovery.overall_progress:.1f}% (+{diff:.1f}%) "
-                                            f"{recovery.size_gb:.1f}GB{translog_info}{node_route}"
+                                            f"{recovery.overall_progress:.1f}% (+{diff:.1f}%) "
+                                            f"{format_size(recovery.size_gb)}{translog_info}{node_route}"
@@
-                                            f"{recovery.overall_progress:.1f}% ({diff:.1f}%) "
-                                            f"{recovery.size_gb:.1f}GB{translog_info}{node_route}"
+                                            f"{recovery.overall_progress:.1f}% ({diff:.1f}%) "
+                                            f"{format_size(recovery.size_gb)}{translog_info}{node_route}"
@@
-                                        f"{prev['stage']}→{recovery.stage} "
-                                        f"{recovery.size_gb:.1f}GB{translog_info}{node_route}"
+                                        f"{prev['stage']}→{recovery.stage} "
+                                        f"{format_size(recovery.size_gb)}{translog_info}{node_route}"
@@
-                                        f"{recovery.stage} {recovery.overall_progress:.1f}% "
-                                        f"{recovery.size_gb:.1f}GB{translog_info}{node_route}"
+                                        f"{recovery.stage} {recovery.overall_progress:.1f}% "
+                                        f"{format_size(recovery.size_gb)}{translog_info}{node_route}"

Also applies to: 235-239, 251-255, 275-278


181-181: Remove unused previous_timestamp to keep the loop clean.

It’s set but never read (already flagged with noqa).

Apply this diff:

-                    previous_timestamp = None
@@
-                        previous_timestamp = current_time  # noqa: F841

Also applies to: 310-310


73-86: Minor: avoid O(n^2) when computing avg per type.

You can accumulate per-type progress sum and count in one pass and avoid recomputing type_recoveries.

If helpful, I can provide a compact refactor.
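
A compact sketch of the single-pass variant, using a plain dataclass stand-in for RecoveryInfo rather than the real model:

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Recovery:  # stand-in for RecoveryInfo with only the fields needed here
    shard_type: str
    overall_progress: float

recoveries = [Recovery("PRIMARY", 40.0), Recovery("REPLICA", 80.0), Recovery("PRIMARY", 60.0)]

# One pass: accumulate progress sum and count per shard type.
progress_sum: defaultdict = defaultdict(float)
count: defaultdict = defaultdict(int)
for r in recoveries:
    progress_sum[r.shard_type] += r.overall_progress
    count[r.shard_type] += 1

avg_progress = {t: progress_sum[t] / count[t] for t in count}
print(avg_progress)  # {'PRIMARY': 50.0, 'REPLICA': 80.0}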

cratedb_toolkit/admin/xmover/analyze/zone.py (3)

33-39: Use fractional target per zone (not integer division).

Using // can misclassify balanced distributions when total_shards isn’t divisible by number of zones. Compute a float target and show it in the title.

Apply this diff:

-        total_shards = sum(stats["TOTAL"] for stats in zone_stats.values())
-        zones = list(zone_stats.keys())
-        target_per_zone = total_shards // len(zones) if zones else 0
-        tolerance_range = (target_per_zone * (1 - tolerance / 100), target_per_zone * (1 + tolerance / 100))
+        total_shards = sum(stats["TOTAL"] for stats in zone_stats.values())
+        zones = list(zone_stats.keys())
+        expected_per_zone = (total_shards / len(zones)) if zones else 0.0
+        tolerance_range = (
+            expected_per_zone * (1 - tolerance / 100),
+            expected_per_zone * (1 + tolerance / 100),
+        )
@@
-        balance_table = Table(title=f"Zone Balance Analysis (Target: {target_per_zone} ±{tolerance}%)", box=box.ROUNDED)
+        balance_table = Table(
+            title=f"Zone Balance Analysis (Target: {expected_per_zone:.1f} ±{tolerance}%)",
+            box=box.ROUNDED,
+        )

48-53: Base under/over deltas on the fractional target.

Current delta compares against integer target. Use the fractional target to avoid off-by-one signaling on small totals.

Apply this diff:

-            if tolerance_range[0] <= total <= tolerance_range[1]:
+            if tolerance_range[0] <= total <= tolerance_range[1]:
                 status = "[green]✓ Balanced[/green]"
             elif total < tolerance_range[0]:
-                status = f"[yellow]⚠ Under ({total - target_per_zone:+})[/yellow]"
+                status = f"[yellow]⚠ Under ({total - expected_per_zone:+.1f})[/yellow]"
             else:
-                status = f"[red]⚠ Over ({total - target_per_zone:+})[/red]"
+                status = f"[red]⚠ Over ({total - expected_per_zone:+.1f})[/red]"

119-121: Replication assumption is hardcoded (may not match table settings).

You flag “under-replicated” when total_copies < 2, implicitly assuming 1 replica. Consider deriving expected replica count per table from sys.tables.number_of_replicas.

If you want, I can wire a helper on CrateDBClient to fetch per-table replicas and adjust this logic.
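
A rough sketch of such a lookup against the HTTP /_sql endpoint; using information_schema.tables.number_of_replicas (which may hold a range such as '0-1') is an assumption to verify against the target CrateDB version:

import requests

def get_expected_replicas(sql_endpoint: str, schema: str, table: str) -> int:
    """Return the lower bound of a table's configured replica count."""
    payload = {
        "stmt": (
            "SELECT number_of_replicas FROM information_schema.tables "
            "WHERE table_schema = ? AND table_name = ?"
        ),
        "args": [schema, table],
    }
    rows = requests.post(sql_endpoint, json=payload, timeout=30).json().get("rows", [])
    if not rows:
        return 1  # keep the current one-replica assumption as a fallback
    return int(str(rows[0][0]).split("-")[0])

# expected_copies = 1 + get_expected_replicas("http://localhost:4200/_sql", "doc", "my_table")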

cratedb_toolkit/admin/xmover/analyze/report.py (2)

65-68: Deterministic output: sort zones for stable display.

Non-deterministic dict iteration can make CLI diffs noisy.

Apply this diff:

-        for zone, count in overview["zone_distribution"].items():
+        for zone, count in sorted(overview["zone_distribution"].items()):
             percentage = (count / total_shards * 100) if total_shards > 0 else 0
             zone_table.add_row(zone, str(count), f"{percentage:.1f}%")

100-105: Optional: colorize disk usage and watermark remaining consistently.

You already use format_percentage; consider adding colors to “Exceeded” and perhaps green/yellow/red bands for remaining capacities for better at-a-glance scanning.

I can propose a small helper if you’d like it in this PR.
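
As a starting point, a small helper along these lines could map a percentage to a Rich color band; the 75/85 thresholds are illustrative, not the cluster's actual watermarks:

def colorize_usage(percent: float, warn: float = 75.0, crit: float = 85.0) -> str:
    """Wrap a disk-usage percentage in Rich markup using green/yellow/red bands."""
    if percent >= crit:
        color = "red"
    elif percent >= warn:
        color = "yellow"
    else:
        color = "green"
    return f"[{color}]{percent:.1f}%[/{color}]"

print(colorize_usage(82.3))  # "[yellow]82.3%[/yellow]"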

cratedb_toolkit/admin/xmover/util/database.py (1)

20-36: Reuse HTTP connections via requests.Session for better performance.

XMover does many sequential queries; a Session improves latency via connection pooling and keeps TLS handshakes down.

Apply this diff:

     def __init__(self, connection_string: Optional[str] = None):
         load_dotenv()
@@
-        # Ensure connection string ends with _sql endpoint
+        # Ensure connection string ends with _sql endpoint
         if not self.connection_string.endswith("/_sql"):
             self.connection_string = self.connection_string.rstrip("/") + "/_sql"
+
+        # Reuse connections across requests
+        self.session = requests.Session()
cratedb_toolkit/admin/xmover/tune/recommend.py (1)

262-274: Consider timeout configuration for user input

The input() calls could hang indefinitely. Consider adding a timeout or allowing configuration of the timeout value.

You could use a timeout mechanism or make the confirmation step configurable:

import signal

def timeout_handler(signum, frame):
    raise TimeoutError("Input timeout")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(60)  # 60 second timeout
try:
    response1 = input("Type 'EXECUTE' to proceed with automatic execution: ").strip()
finally:
    signal.alarm(0)  # Cancel the alarm
cratedb_toolkit/admin/xmover/analyze/shard.py (2)

31-36: Consider cache invalidation strategy

The session-based caches could become stale if the cluster state changes during long-running operations. Consider adding a method to clear caches or implementing TTL-based expiration.

Add a method to clear caches when needed:

def clear_caches(self):
    """Clear all internal caches"""
    self._zone_conflict_cache.clear()
    self._node_lookup_cache.clear() 
    self._target_nodes_cache.clear()
    self._cache_hits = 0
    self._cache_misses = 0
    logger.debug("Cleared all analyzer caches")

826-826: Rough time estimate may be misleading

The estimated time of 0.1 hours (6 minutes) per move is a very rough estimate that doesn't account for shard size, network speed, or cluster load.

Consider making the estimate more realistic based on shard size:

-            "estimated_time_hours": len(move_plan) * 0.1,  # Rough estimate: 6 minutes per move
+            # Rough estimate based on size: ~10GB/min transfer rate
+            "estimated_time_hours": sum(m.size_gb for m in move_plan) / 600.0,
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6ab37b7 and 04abce6.

📒 Files selected for processing (12)
  • cratedb_toolkit/admin/xmover/analyze/report.py (1 hunks)
  • cratedb_toolkit/admin/xmover/analyze/shard.py (1 hunks)
  • cratedb_toolkit/admin/xmover/analyze/zone.py (1 hunks)
  • cratedb_toolkit/admin/xmover/cli.py (1 hunks)
  • cratedb_toolkit/admin/xmover/model.py (1 hunks)
  • cratedb_toolkit/admin/xmover/tune/recommend.py (1 hunks)
  • cratedb_toolkit/admin/xmover/tune/recover.py (1 hunks)
  • cratedb_toolkit/admin/xmover/util/database.py (1 hunks)
  • cratedb_toolkit/admin/xmover/util/error.py (1 hunks)
  • cratedb_toolkit/admin/xmover/util/format.py (1 hunks)
  • pyproject.toml (3 hunks)
  • tests/admin/test_cli.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/admin/test_cli.py
  • pyproject.toml
  • cratedb_toolkit/admin/xmover/cli.py
🧰 Additional context used
🧬 Code Graph Analysis (7)
cratedb_toolkit/admin/xmover/tune/recover.py (3)
cratedb_toolkit/admin/xmover/model.py (4)
  • RecoveryInfo (55-102)
  • size_gb (81-83)
  • overall_progress (76-78)
  • total_time_seconds (90-92)
cratedb_toolkit/admin/xmover/util/database.py (2)
  • CrateDBClient (17-493)
  • get_all_recovering_shards (359-390)
cratedb_toolkit/admin/xmover/util/format.py (1)
  • format_translog_info (21-45)
cratedb_toolkit/admin/xmover/util/format.py (1)
cratedb_toolkit/admin/xmover/model.py (2)
  • size_gb (81-83)
  • translog_size_gb (95-97)
cratedb_toolkit/admin/xmover/analyze/report.py (4)
cratedb_toolkit/admin/xmover/cli.py (2)
  • analyze (56-61)
  • validate_move (204-225)
cratedb_toolkit/admin/xmover/analyze/shard.py (5)
  • ShardAnalyzer (22-828)
  • get_cluster_overview (610-651)
  • analyze_distribution (45-77)
  • find_moveable_shards (98-119)
  • validate_move_safety (365-393)
cratedb_toolkit/admin/xmover/model.py (5)
  • ShardMoveRecommendation (117-153)
  • ShardMoveRequest (106-113)
  • SizeCriteria (169-173)
  • available_space_gb (28-29)
  • size_gb (81-83)
cratedb_toolkit/admin/xmover/util/format.py (2)
  • format_percentage (11-18)
  • format_size (1-8)
cratedb_toolkit/admin/xmover/tune/recommend.py (5)
cratedb_toolkit/admin/xmover/analyze/shard.py (3)
  • ShardAnalyzer (22-828)
  • generate_rebalancing_recommendations (183-363)
  • validate_move_safety (365-393)
cratedb_toolkit/admin/xmover/model.py (2)
  • RecommendationConstraints (177-186)
  • size_gb (81-83)
cratedb_toolkit/admin/xmover/tune/recover.py (4)
  • RecoveryMonitor (24-384)
  • RecoveryOptions (16-21)
  • start (167-384)
  • get_cluster_recovery_status (31-43)
cratedb_toolkit/admin/xmover/util/database.py (2)
  • CrateDBClient (17-493)
  • execute_query (37-55)
cratedb_toolkit/admin/xmover/util/format.py (1)
  • format_size (1-8)
cratedb_toolkit/admin/xmover/analyze/zone.py (3)
cratedb_toolkit/admin/xmover/analyze/shard.py (1)
  • check_zone_balance (121-138)
cratedb_toolkit/admin/xmover/model.py (1)
  • ShardInfo (33-51)
cratedb_toolkit/admin/xmover/util/database.py (1)
  • get_shards_info (93-173)
cratedb_toolkit/admin/xmover/analyze/shard.py (2)
cratedb_toolkit/admin/xmover/model.py (9)
  • DistributionStats (157-165)
  • NodeInfo (7-29)
  • RecommendationConstraints (177-186)
  • ShardInfo (33-51)
  • ShardMoveRecommendation (117-153)
  • size_gb (81-83)
  • available_space_gb (28-29)
  • disk_usage_percent (24-25)
  • heap_usage_percent (20-21)
cratedb_toolkit/admin/xmover/util/database.py (4)
  • get_nodes_info (57-91)
  • get_shards_info (93-173)
  • execute_query (37-55)
  • get_cluster_watermarks (247-266)
cratedb_toolkit/admin/xmover/util/database.py (2)
cratedb_toolkit/admin/xmover/model.py (5)
  • NodeInfo (7-29)
  • RecoveryInfo (55-102)
  • ShardInfo (33-51)
  • size_gb (81-83)
  • overall_progress (76-78)
cratedb_toolkit/admin/xmover/cli.py (1)
  • test_connection (150-172)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: build-and-publish
  • GitHub Check: Kinesis: Python 3.11 on OS ubuntu-latest
  • GitHub Check: CFR for OS windows-latest
  • GitHub Check: Kinesis: Python 3.13 on OS ubuntu-latest
  • GitHub Check: CFR for OS ubuntu-latest
  • GitHub Check: Python 3.13, PostgreSQL latest, OS ubuntu-latest
  • GitHub Check: CFR: Python 3.9 on OS ubuntu-latest
  • GitHub Check: CFR for OS macos-latest
  • GitHub Check: CFR for OS macos-13
  • GitHub Check: Generic: Python 3.9 on OS ubuntu-latest
  • GitHub Check: Generic: Python 3.13 on OS ubuntu-latest
  • GitHub Check: build-and-test
  • GitHub Check: Python 3.13, LocalStack 4.5, OS ubuntu-latest
  • GitHub Check: Cloud: Python 3.13 on OS ubuntu-latest
  • GitHub Check: Python 3.10, LocalStack 4.5, OS ubuntu-latest
🔇 Additional comments (6)
cratedb_toolkit/admin/xmover/util/database.py (1)

319-337: Confirm translog field name across supported CrateDB versions.

The query uses s.translog_stats['size'] which may differ by version. If targeting multiple versions, consider probing sys.shards columns once and falling back (e.g., to s.translog['size_in_bytes']) to avoid breaking on clusters with older schemas.

If needed, I can add a compatibility probe that SELECTs available translog field names and caches the choice.
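
A possible shape for such a probe, issuing one cheap query per candidate expression and letting the caller cache the winner; the candidate column names come from the comment above and should be double-checked per CrateDB version:

import requests

def detect_translog_expr(sql_endpoint: str) -> str:
    """Probe which translog size expression the connected cluster understands."""
    candidates = ["translog_stats['size']", "translog['size_in_bytes']"]
    for expr in candidates:
        response = requests.post(
            sql_endpoint, json={"stmt": f"SELECT {expr} FROM sys.shards LIMIT 1"}, timeout=30
        )
        if response.ok and "error" not in response.json():
            return expr
    raise RuntimeError("No known translog size column found on this cluster")

# translog_expr = detect_translog_expr("http://localhost:4200/_sql")  # cache the result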

cratedb_toolkit/admin/xmover/tune/recommend.py (2)

104-104: LGTM! Good use of node space mapping

Creating a mapping of node names to available space improves efficiency by avoiding repeated lookups in the nodes list when displaying target free space.


199-202: Remove debug print statements

Debug print statements should not be in production code. Use logging instead.

-        print(
-            f"Analyzing {len(moveable_shards)} candidate shards "
-            f"in size range {constraints.min_size}-{constraints.max_size}GB..."
-        )
+        logger.info(
+            f"Analyzing {len(moveable_shards)} candidate shards "
+            f"in size range {constraints.min_size}-{constraints.max_size}GB..."
+        )

Likely an incorrect or invalid review comment.

cratedb_toolkit/admin/xmover/model.py (1)

134-137: LGTM! SQL generation properly escapes identifiers

The to_sql method correctly uses double quotes for schema and table names to handle special characters and reserved keywords.

cratedb_toolkit/admin/xmover/analyze/shard.py (2)

490-495: SQL query construction is safe from injection

The query uses parameterized queries correctly with placeholders, preventing SQL injection attacks.


675-692: Watermark parsing logic handles multiple formats well

The method correctly handles percentage strings, decimal values, and numeric types with appropriate defaults.

Comment on lines 296 to 298
"[dim]# Check with: SELECT * FROM sys.shards "
"WHERE table_name = '{table_name}' AND id = {shard_id};[/dim]"
)

⚠️ Potential issue

Fix placeholder formatting: missing f-string results in literals being printed.

The table_name and shard_id placeholders aren’t expanded.

Apply this diff:

-            console.print(
-                "[dim]# Check with: SELECT * FROM sys.shards "
-                "WHERE table_name = '{table_name}' AND id = {shard_id};[/dim]"
-            )
+            console.print(
+                f"[dim]# Check with: SELECT * FROM sys.shards "
+                f"WHERE table_name = '{table_name}' AND id = {request.shard_id};[/dim]"
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"[dim]# Check with: SELECT * FROM sys.shards "
"WHERE table_name = '{table_name}' AND id = {shard_id};[/dim]"
)
console.print(
f"[dim]# Check with: SELECT * FROM sys.shards "
f"WHERE table_name = '{table_name}' AND id = {request.shard_id};[/dim]"
)
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analyze/report.py around lines 296 to 298, the
string with placeholders for table_name and shard_id is missing an f-string (or
.format), so the placeholders are printed literally; update the string to
interpolate the variables (e.g., convert the multi-line string to an f-string or
call .format(table_name=table_name, shard_id=shard_id)) so the actual table_name
and shard_id values are inserted into the message.

Comment on lines 199 to 202
print(
f"Analyzing {len(moveable_shards)} candidate shards "
f"in size range {constraints.min_size}-{constraints.max_size}GB..."
)

⚠️ Potential issue

Remove print statements from library code

Print statements should not be used in library/analyzer code. Use the logger that's already imported.

-        print(
-            f"Analyzing {len(moveable_shards)} candidate shards "
-            f"in size range {constraints.min_size}-{constraints.max_size}GB..."
-        )
+        logger.info(
+            "Analyzing %d candidate shards in size range %.1f-%.1fGB",
+            len(moveable_shards), constraints.min_size, constraints.max_size
+        )
📝 Committable suggestion


Suggested change
print(
f"Analyzing {len(moveable_shards)} candidate shards "
f"in size range {constraints.min_size}-{constraints.max_size}GB..."
)
logger.info(
"Analyzing %d candidate shards in size range %.1f-%.1fGB",
len(moveable_shards), constraints.min_size, constraints.max_size
)
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analyze/shard.py around lines 199 to 202, remove
the print(...) call and replace it with the module logger (e.g.,
logger.info(...)) using the same formatted message; ensure you call
logger.info(f"Analyzing {len(moveable_shards)} candidate shards in size range
{constraints.min_size}-{constraints.max_size}GB...") so library code uses
structured logging instead of printing to stdout, and remove any remaining stray
print statements in this module if present.

# Optimize processing: if filtering by source node, only process those shards
if constraints.source_node:
    processing_shards = [s for s in moveable_shards if s.node_name == constraints.source_node]
    print(f"Focusing on {len(processing_shards)} shards from node {constraints.source_node}")

⚠️ Potential issue

Remove print statement from library code

-            print(f"Focusing on {len(processing_shards)} shards from node {constraints.source_node}")
+            logger.info("Focusing on %d shards from node %s", len(processing_shards), constraints.source_node)
📝 Committable suggestion


Suggested change
print(f"Focusing on {len(processing_shards)} shards from node {constraints.source_node}")
logger.info("Focusing on %d shards from node %s", len(processing_shards), constraints.source_node)
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analyze/shard.py around line 232, remove the
direct print(...) call and replace it with a proper logger call; import or use
the module-level logger (e.g., logging.getLogger(__name__) or the existing
project logger) and emit this message at an appropriate level (info or debug) so
library code does not print to stdout directly.

processing_shards = moveable_shards

# Generate move recommendations
safe_recommendations = 0 # noqa: F841

⚠️ Potential issue

Remove unused variable

The variable safe_recommendations is assigned but never used.

-        safe_recommendations = 0  # noqa: F841
📝 Committable suggestion


Suggested change
safe_recommendations = 0 # noqa: F841
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analyze/shard.py around line 237, the variable
`safe_recommendations` is assigned but never used; remove the unused assignment
(or, if it was intended to be used, replace it with the correct usage) so the
dead variable is eliminated and linter warnings go away.

Comment on lines 250 to 251
if len(processing_shards) > 100 and i > 0 and i % 50 == 0:
    print(".", end="", flush=True)

⚠️ Potential issue

Remove print statement from library code

-            if len(processing_shards) > 100 and i > 0 and i % 50 == 0:
-                print(".", end="", flush=True)
+            if len(processing_shards) > 100 and i > 0 and i % 50 == 0:
+                logger.debug("Processing shard %d of %d", i, len(processing_shards))
📝 Committable suggestion


Suggested change
if len(processing_shards) > 100 and i > 0 and i % 50 == 0:
    print(".", end="", flush=True)
if len(processing_shards) > 100 and i > 0 and i % 50 == 0:
    logger.debug("Processing shard %d of %d", i, len(processing_shards))
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analyze/shard.py around lines 250-251, remove
the direct print(".", end="", flush=True) used for progress output; replace it
with a no-side-effect logging call (e.g. get a module logger via
logging.getLogger(__name__) and call logger.debug or logger.info with a succinct
progress marker) or invoke a provided progress callback if one exists; add the
import logging at the top if missing and ensure the logging call does not alter
function return values or behavior.

Comment on lines 225 to 232
"""Execute recommendations with extensive safety measures"""

# Filter to only safe recommendations
safe_recommendations = []
if validate:
for rec in recommendations:
is_safe, safety_msg = self.analyzer.validate_move_safety(rec, max_disk_usage_percent=95.0)
if is_safe:

⚠️ Potential issue

Potential division by zero error in validate_move_safety

The method calls self.analyzer.validate_move_safety(rec, max_disk_usage_percent=95.0) with a hardcoded value of 95.0, but it should use the constraints value.

-                is_safe, safety_msg = self.analyzer.validate_move_safety(rec, max_disk_usage_percent=95.0)
+                is_safe, safety_msg = self.analyzer.validate_move_safety(rec, max_disk_usage_percent=self.constraints.max_disk_usage)
📝 Committable suggestion


Suggested change
"""Execute recommendations with extensive safety measures"""
# Filter to only safe recommendations
safe_recommendations = []
if validate:
for rec in recommendations:
is_safe, safety_msg = self.analyzer.validate_move_safety(rec, max_disk_usage_percent=95.0)
if is_safe:
"""Execute recommendations with extensive safety measures"""
# Filter to only safe recommendations
safe_recommendations = []
if validate:
for rec in recommendations:
is_safe, safety_msg = self.analyzer.validate_move_safety(rec, max_disk_usage_percent=self.constraints.max_disk_usage)
if is_safe:
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/tune/recommend.py around lines 225 to 232, the
call to self.analyzer.validate_move_safety(...) uses a hardcoded 95.0 for
max_disk_usage_percent; replace that literal with the appropriate constraints
value (e.g., constraints.max_disk_usage_percent or
self.constraints.max_disk_usage_percent) so the validator uses configured
limits, and add a sensible fallback/default if constraints is missing or the
attribute is None.

Comment on lines 295 to 298
if result.get("rowcount", 0) >= 0:  # Success indicator for ALTER statements
    console.print(" [green]✅ SUCCESS[/green] - Move initiated")
    successful_moves += 1


🛠️ Refactor suggestion

Improve success check for SQL execution

The condition result.get("rowcount", 0) >= 0 will always be true for successful queries since rowcount is 0 or positive. This doesn't properly distinguish between success and failure.

-                if result.get("rowcount", 0) >= 0:  # Success indicator for ALTER statements
+                # ALTER TABLE REROUTE commands don't return rowcount, check for no error instead
+                if "error" not in result:
                     console.print("    [green]✅ SUCCESS[/green] - Move initiated")
                     successful_moves += 1
📝 Committable suggestion


Suggested change
if result.get("rowcount", 0) >= 0:  # Success indicator for ALTER statements
    console.print(" [green]✅ SUCCESS[/green] - Move initiated")
    successful_moves += 1
# ALTER TABLE REROUTE commands don't return rowcount, check for no error instead
if "error" not in result:
    console.print(" [green]✅ SUCCESS[/green] - Move initiated")
    successful_moves += 1

Comment on lines 354 to 355
active_count = len([r for r in recoveries if r.overall_progress < 100.0 or r.stage != "DONE"])
status = f"{active_count}/{max_concurrent_recoveries}"

⚠️ Potential issue

Incorrect condition for checking active recoveries

The condition checks if r.stage != "DONE" but according to RecoveryInfo model, the stage field represents recovery stages like INIT, INDEX, etc. The check should be more comprehensive.

-            active_count = len([r for r in recoveries if r.overall_progress < 100.0 or r.stage != "DONE"])
+            # Count recoveries that are actively running (not completed)
+            active_count = len([r for r in recoveries if r.overall_progress < 100.0])
📝 Committable suggestion


Suggested change
active_count = len([r for r in recoveries if r.overall_progress < 100.0 or r.stage != "DONE"])
status = f"{active_count}/{max_concurrent_recoveries}"
# Count recoveries that are actively running (not completed)
active_count = len([r for r in recoveries if r.overall_progress < 100.0])
status = f"{active_count}/{max_concurrent_recoveries}"
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/tune/recommend.py around lines 354-355, the
active-recovery check uses r.stage != "DONE" which is too narrow; change it to
consider terminal stages explicitly (or use the model's terminal-stage helper if
available). Count a recovery as active when r.overall_progress < 100.0 or
r.stage not in the set of terminal stages (e.g., {"DONE", "COMPLETED", "FAILED",
"CANCELLED"}), using a case-insensitive comparison or the canonical stage
constants from RecoveryInfo if present.

Comment on lines +48 to +60
try:
    response = requests.post(
        self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
    )
    response.raise_for_status()
    return response.json()
except requests.exceptions.RequestException as e:
    raise Exception(f"Failed to execute query: {e}") from e

🛠️ Refactor suggestion

Use the session and surface CrateDB “error” payloads explicitly.

Currently only HTTP errors raise; CrateDB can return 200 with an error field. Also switch to the session for pooling.

Apply this diff:

-            response = requests.post(
-                self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
-            )
+            response = self.session.post(
+                self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
+            )
             response.raise_for_status()
-            return response.json()
+            data = response.json()
+            # CrateDB may include an "error" field even with 200 OK
+            if isinstance(data, dict) and "error" in data and data["error"]:
+                # Best-effort message extraction
+                err = data["error"]
+                msg = err.get("message") if isinstance(err, dict) else str(err)
+                raise Exception(f"CrateDB error: {msg}")
+            return data
📝 Committable suggestion


Suggested change
try:
    response = requests.post(
        self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
    )
    response.raise_for_status()
    return response.json()
except requests.exceptions.RequestException as e:
    raise Exception(f"Failed to execute query: {e}") from e
try:
    response = self.session.post(
        self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
    )
    response.raise_for_status()
    data = response.json()
    # CrateDB may include an "error" field even with 200 OK
    if isinstance(data, dict) and "error" in data and data["error"]:
        # Best-effort message extraction
        err = data["error"]
        msg = err.get("message") if isinstance(err, dict) else str(err)
        raise Exception(f"CrateDB error: {msg}")
    return data
except requests.exceptions.RequestException as e:
    raise Exception(f"Failed to execute query: {e}") from e
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/util/database.py around lines 48-55, replace the
direct requests.post call with the instance session to enable connection pooling
(use self.session or create one on the instance if not present) and after
obtaining the response call response.json() once, check for an "error" key in
the JSON payload and if present raise an Exception that includes the CrateDB
error payload; keep response.raise_for_status() to handle non-2xx HTTP errors,
and pass the same auth, verify and timeout parameters to session.post.

Comment on lines +52 to +55
"Move the shard to a different availability zone",
"Check zone balance with 'xmover check-balance'",
"Ensure target zone doesn't already have copies of this shard",
],

💡 Verification agent

🧩 Analysis chain

Verify CLI command naming: “xmover check-balance” likely doesn’t exist

Elsewhere you reference analyze/zone-analysis/validate-move/recommend. “check-balance” may be a leftover or alias. Please confirm and align the guidance.

Run this to verify command availability and naming:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Searching for XMover CLI commands..."
# Click-decorated commands (common patterns)
rg -nP -C3 --type=py "(?i)@.*command\(\s*['\"](check-balance|validate-move|zone-analysis|recommend|analyze)['\"]" || true

echo
echo "Searching for user-facing mentions..."
rg -nP -C2 "(?i)\bxmover\s+(check-balance|validate-move|zone-analysis|recommend|analyze)\b" -g '!**/tests/**' || true

Length of output: 26205


CLI command “check-balance” not found – update guidance

The search shows no click‐decorated xmover check-balance command in the codebase, yet it’s referenced in cratedb_toolkit/admin/xmover/util/error.py (lines 52–55). Please either implement that command or align the message to use an existing CLI invocation.

• File: cratedb_toolkit/admin/xmover/util/error.py
– Replace
"Check zone balance with 'xmover check-balance'"
with
"Check zone distribution with 'xmover zone-analysis --show-shards'"

Also review and update any other docs or code (e.g. troubleshooting.md, handbook.md) that reference xmover check-balance.

🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/util/error.py around lines 52–55, the guidance
references a non-existent CLI command "xmover check-balance"; replace the string
"Check zone balance with 'xmover check-balance'" with "Check zone distribution
with 'xmover zone-analysis --show-shards'". Additionally, run a repo-wide search
for any other occurrences of "xmover check-balance" (docs, handbook,
troubleshooting.md) and update them to the new invocation or correct
alternative, ensuring help text and tests (if any) are updated to match.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

♻️ Duplicate comments (6)
doc/admin/xmover/queries.md (4)

167-169: Apply the same healthy-shard filter in the summary query

Consistency improves reproducibility and prevents aggregating incomplete shards.

-    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
-    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
+    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
+    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
@@
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0

Also applies to: 174-174


149-156: Healthy-shard filter applies here too

Keep the “operational” filter consistent across examples.

 WHERE s.table_name = 'your_table_name'  -- Replace with your specific table name
     AND s.routing_state = 'STARTED'
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0

Also applies to: 151-151


72-80: Fix heading and recovery predicate; current filter excludes healthy shards

  • Grammar: use “node” (singular) and sentence case for titles.
  • Filtering on recovery['files']['percent'] = 0 excludes completed shards. Use >= 100.0 (with COALESCE) to include healthy shards.
-## List biggest SHARDS on a particular Nodes
+## List biggest shards on a particular node
@@
-select node['name'], table_name, schema_name, id,  sum(size) / 1024^3 from sys.shards
+select node['name'], table_name, schema_name, id, sum(size) / 1024^3 as size_gb
     where node['name'] = 'data-hot-2'
     AND routing_state = 'STARTED'
-    AND recovery['files']['percent'] = 0
+    AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     group by 1,2,3,4  order by 5  desc limit 8;

103-130: CTE: include only completed shards; add alias; consider placeholder consistency

  • Include fully recovered shards (>= 100%) to reflect “operational” sets.
  • Alias total_size_gb for readability.
  • The example uses table_name = 'orderffD' (inconsistent capitalization/name across doc). Prefer a neutral placeholder or a consistent example.
 WITH shard_summary AS (
     SELECT
         node['name'] AS node_name,
         table_name,
         schema_name,
@@
-        COUNT(*) AS shard_count,
-        SUM(size) / 1024^3 AS total_size_gb
+        COUNT(*) AS shard_count,
+        SUM(size) / 1024^3 AS total_size_gb
     FROM sys.shards
-    WHERE table_name = 'orderffD'
+    WHERE table_name = 'orders'   -- example table
         AND routing_state = 'STARTED'
-        AND recovery['files']['percent'] = 0
+        AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     GROUP BY node['name'], table_name, schema_name, "primary"
 )

If you want to keep the real table, use it consistently throughout the page.

doc/admin/xmover/handbook.md (1)

462-469: Respect earlier preference but fix placeholders

Per past guidance, we won’t enforce heading levels for emphasized text in large docs. However, the validate-move placeholders should match the rest of the doc.

-xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM> <TO> --max-disk-usage 95
+xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM_NODE> <TO_NODE> --max-disk-usage 95
cratedb_toolkit/admin/xmover/util/database.py (1)

48-55: Use a Session and surface CrateDB “error” payloads (duplicate of a prior review).

Switch to a single requests.Session for pooling and raise on "error" in the response body even when HTTP is 200.

+        # Initialize a session for connection pooling
+        self.session = requests.Session()
...
-            response = requests.post(
+            response = self.session.post(
                 self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
             )
             response.raise_for_status()
-            return response.json()
+            data = response.json()
+            if isinstance(data, dict) and data.get("error"):
+                err = data["error"]
+                msg = err.get("message") if isinstance(err, dict) else str(err)
+                raise Exception(f"CrateDB error: {msg}")
+            return data
🧹 Nitpick comments (19)
doc/admin/xmover/troubleshooting.md (2)

398-399: Unify validate-move placeholders (FROM_NODE/TO_NODE)

Other sections use FROM_NODE and TO_NODE. Keep placeholders consistent to prevent copy/paste errors.

-xmover validate-move SCHEMA.TABLE SHARD_ID FROM TO
+xmover validate-move SCHEMA.TABLE SHARD_ID FROM_NODE TO_NODE

214-230: Avoid printing full secrets from .env in troubleshooting

Showing the entire .env risks credential leakage in tickets and paste bins. Recommend grepping specific keys or echoing masked values.

-# Verify .env file contents
-cat .env
+# Verify required keys exist without leaking secrets
+grep -E '^(CRATE_CONNECTION_STRING|CRATE_USERNAME|CRATE_SSL_VERIFY)=' .env
+grep -E '^CRATE_PASSWORD=' .env && echo "CRATE_PASSWORD is set (masked)"

I can add a short “Handling Secrets in Logs” appendix if helpful.

doc/admin/xmover/handbook.md (1)

168-184: Minor placeholder consistency in validate-move examples

Keep placeholders uniform (SCHEMA.TABLE + FROM_NODE/TO_NODE) across all sections.

-xmover validate-move CUROV.maddoxxxS 4 data-hot-1 data-hot-3
+xmover validate-move SCHEMA.TABLE 4 FROM_NODE TO_NODE
@@
-xmover validate-move CUROV.tendedero 4 data-hot-1 data-hot-3 --max-disk-usage 90
+xmover validate-move SCHEMA.TABLE 4 FROM_NODE TO_NODE --max-disk-usage 90
cratedb_toolkit/admin/xmover/operational/candidates.py (2)

25-27: Nit: Prefer “Movable” spelling and show units in banner

Cosmetic, but “Movable” is the more common spelling. Including “GB” is good—consider showing the table filter (table or node) also in the panel.

-Panel.fit(f"[bold blue]Finding Moveable Shards ({criteria.min_size}-{criteria.max_size}GB)[/bold blue]")
+Panel.fit(f"[bold blue]Finding Movable Shards ({criteria.min_size}-{criteria.max_size} GB)[/bold blue]")

39-48: Return explicit count to help callers

Instead of returning None, return the total candidate count (and shown count). This allows the CLI to display follow-up hints or paginate.

-            return
+            return 0
@@
-        console.print(candidates_table)
+        console.print(candidates_table)
@@
-            console.print(f"\n[dim]... and {len(candidates) - limit} more candidates[/dim]")
+            console.print(f"\n[dim]... and {len(candidates) - limit} more candidates[/dim]")
+        return len(candidates)
cratedb_toolkit/admin/xmover/analysis/zone.py (2)

132-139: Include shard state in details to aid triage.

You show only routing_state. Adding state (e.g., RECOVERING/STARTED/UNASSIGNED) helps explain conflicts/placement.

-                        console.print(
-                            f"    {health_indicator} {shard_copy.shard_type} "
-                            f"on {shard_copy.node_name} ({shard_copy.zone}) - {shard_copy.routing_state}"
-                        )
+                        console.print(
+                            f"    {health_indicator} {shard_copy.shard_type} "
+                            f"on {shard_copy.node_name} ({shard_copy.zone}) - "
+                            f"{shard_copy.state}/{shard_copy.routing_state}"
+                        )

89-99: The grep didn’t reveal any per-table replica settings being fetched—tables is populated elsewhere (likely a static call), and there’s no lookup of number_of_replicas in zone.py. The hard-coded total_copies < 2 check will indeed flag tables with zero replicas as under-replicated.

Next Steps
• Introduce a helper on CrateDBClient (e.g. get_table_replicas()) that queries sys.tables for each table’s number_of_replicas.
• In zone.py, before iterating shards, fetch replica_map = client.get_table_replicas().
• Replace if total_copies < 2: with:

expected = 1 + replica_map.get(table_name, client.cluster_replicas)
if total_copies < expected:
    under_replicated += 1
    status_parts.append("[yellow]⚠ Under-replicated[/yellow]")

Let me know if you’d like me to draft the get_table_replicas implementation.

cratedb_toolkit/admin/xmover/operational/recommend.py (2)

171-176: Rich markup: simplify dynamic styles to avoid malformed closing tags.

[/bold {'green' ...}] is not valid. Use [/] to close or separate styles.

-                f"[bold blue]Generating Rebalancing Recommendations[/bold blue] - "
-                f"[bold {'green' if dry_run else 'red'}]{mode_text}[/bold {'green' if dry_run else 'red'}]"
+                f"[bold blue]Generating Rebalancing Recommendations[/] - "
+                f"[bold {'green' if dry_run else 'red'}]{mode_text}[/]"

391-401: Blocking input() in non-interactive environments.

Auto-execution prompts will hang in CI/automation. Consider a --yes flag (or detect non-TTY) to bypass double-confirmation safely.
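
One way to keep the safety prompt without hanging automation, sketched with a hypothetical --yes flag and a TTY check (not part of the current CLI):

import sys
import click

@click.command()
@click.option("--yes", is_flag=True, help="Skip interactive confirmation prompts.")
def execute_moves(yes: bool) -> None:
    """Illustrative command body guarding auto-execution."""
    if not yes:
        if not sys.stdin.isatty():
            raise click.ClickException("Refusing to auto-execute in a non-interactive session; pass --yes.")
        click.confirm("Execute the generated ALTER TABLE ... REROUTE statements?", abort=True)
    click.echo("Executing moves ...")

if __name__ == "__main__":
    execute_moves()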

cratedb_toolkit/admin/xmover/cli.py (1)

40-51: Exit on failed connection is fine; consider lazy-init for offline commands.

Eager connection checks can block commands that don’t need DB (e.g., explain-error). Optional: initialize the client per-subcommand or allow --offline mode.
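
A sketch of per-subcommand lazy initialization; it assumes the CrateDBClient(connection_string=None) constructor shown in util/database.py and is not how the current CLI is wired:

import click

class LazyContext:
    """Defer CrateDB client construction until a subcommand actually needs it."""

    def __init__(self, connection_string=None):
        self._connection_string = connection_string
        self._client = None

    @property
    def client(self):
        if self._client is None:
            # Imported lazily so offline commands never touch the database layer.
            from cratedb_toolkit.admin.xmover.util.database import CrateDBClient
            self._client = CrateDBClient(self._connection_string)
        return self._client

@click.group()
@click.option("--connection-string", envvar="CRATE_CONNECTION_STRING", default=None)
@click.pass_context
def cli(ctx, connection_string):
    ctx.obj = LazyContext(connection_string)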

cratedb_toolkit/admin/xmover/operational/monitor.py (2)

128-146: Column heading “Type” shows shard role, not recovery type.

You populate the column with recovery.shard_type (PRIMARY/REPLICA). Rename for clarity or show both.

-        headers = ["Table", "Shard", "Node", "Type", "Stage", "Progress", "Size(GB)", "Time(s)"]
+        headers = ["Table", "Shard", "Node", "ShardType", "Stage", "Progress", "Size(GB)", "Time(s)"]

310-313: Unused variable.

previous_timestamp is assigned but not used (# noqa: F841). Remove it.

cratedb_toolkit/admin/xmover/util/database.py (1)

268-297: Allocations query may be noisy and misses schema.

  • Filtering current_state != 'STARTED' is fine, but consider excluding UNASSIGNED if it isn’t actionable for monitoring contexts.
  • You default schema to "doc"; if the follow-up sys.shards probe updates schema later that’s OK, but document this behavior.
cratedb_toolkit/admin/xmover/analysis/shard.py (6)

245-246: Remove unused variable assignment.

The variable safe_recommendations is assigned but never used.

-        safe_recommendations = 0  # noqa: F841

183-186: Simplify conditional logic.

The else: continue block is unnecessary after the condition already handles the control flow.

         # Check if node has enough free space
         free_space_gb = node.available_space_gb
         if free_space_gb >= (required_space_gb + min_free_space_gb):
             available_nodes.append(node)
-        else:
-            continue

834-834: Clarify time estimation logic.

The comment states "6 minutes per move" but the calculation uses 0.1 hours, which equals 6 minutes. Consider making this more explicit.

-            "estimated_time_hours": len(move_plan) * 0.1,  # Rough estimate: 6 minutes per move
+            "estimated_time_hours": len(move_plan) * 0.1,  # Rough estimate: 0.1 hours (6 minutes) per move

574-578: Consider parameterizing the safety threshold.

The safety check for healthy copies is hardcoded to require more than 1 healthy copy. This could be made configurable for different risk tolerance levels.

Consider adding a parameter to control the minimum required healthy copies:

-    def _check_zone_conflict(self, recommendation: ShardRelocationResponse) -> Optional[str]:
+    def _check_zone_conflict(self, recommendation: ShardRelocationResponse, min_healthy_copies: int = 2) -> Optional[str]:
         """Check if moving this shard would create a zone conflict
 
         Performs comprehensive zone safety analysis:
         - Checks if target node already has a copy of this shard
         - Checks if target zone already has copies
         - Analyzes zone allocation limits and CrateDB's zone awareness rules
         - Ensures move doesn't violate zone-awareness principles
+        
+        Args:
+            recommendation: The shard relocation recommendation to validate
+            min_healthy_copies: Minimum number of healthy copies required for safe move (default: 2)
         """

And update the safety check:

-            elif len(zones_with_copies) <= 1 and healthy_copies <= 1:
+            elif len(zones_with_copies) <= 1 and healthy_copies < min_healthy_copies:
                 return (
-                    f"Safety concern: Only {healthy_copies} healthy copy(ies) exist. "
+                    f"Safety concern: Only {healthy_copies} healthy copy(ies) exist (minimum required: {min_healthy_copies}). "
                     f"Moving might risk data availability."
                 )

406-408: Optimize cache hit tracking.

The cache hit/miss tracking adds overhead to every cache operation. Consider making this optional or only enabling it in debug mode.

 def _get_node_cached(self, node_name: str):
     """Get node by name with caching"""
     if node_name in self._node_lookup_cache:
-        self._cache_hits += 1
+        if logger.isEnabledFor(logging.DEBUG):
+            self._cache_hits += 1
         return self._node_lookup_cache[node_name]
 
     # Find node (cache miss)
-    self._cache_misses += 1
+    if logger.isEnabledFor(logging.DEBUG):
+        self._cache_misses += 1

Apply similar changes to other cache methods for consistency.


818-822: Improve capacity warning calculation.

The 1.2 multiplier (20% safety margin) for remaining capacity check is a magic number and should be configurable or at least documented as a constant.

+        # Safety margin for cluster capacity after decommission
+        CAPACITY_SAFETY_MARGIN = 1.2  # 20% buffer
+        
         # Check if remaining cluster capacity is sufficient after decommission
         remaining_capacity = sum(n.available_space_gb for n in self.nodes if n.name != node_name)
-        if remaining_capacity < total_size_gb * 1.2:  # 20% safety margin
+        if remaining_capacity < total_size_gb * CAPACITY_SAFETY_MARGIN:
             warnings.append(
                 f"Low remaining capacity after decommission. "
                 f"Only {remaining_capacity:.1f}GB available for {total_size_gb:.1f}GB of data"
             )
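
A minimal sketch of the named-constant approach, using made-up node figures:

```python
from typing import Optional

# Safety margin applied to the data that must be re-homed during a decommission.
CAPACITY_SAFETY_MARGIN = 1.2  # require 20% headroom on the remaining nodes


def capacity_warning(remaining_capacity_gb: float, data_to_move_gb: float) -> Optional[str]:
    """Return a warning if the surviving nodes cannot absorb the data comfortably."""
    required_gb = data_to_move_gb * CAPACITY_SAFETY_MARGIN
    if remaining_capacity_gb < required_gb:
        return (
            f"Low remaining capacity after decommission: "
            f"only {remaining_capacity_gb:.1f}GB available for {data_to_move_gb:.1f}GB of data "
            f"({required_gb:.1f}GB needed with safety margin)"
        )
    return None


print(capacity_warning(remaining_capacity_gb=500.0, data_to_move_gb=450.0))  # warns
print(capacity_warning(remaining_capacity_gb=600.0, data_to_move_gb=450.0))  # None
```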
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 04abce6 and 108577c.

📒 Files selected for processing (22)
  • CHANGES.md (1 hunks)
  • cratedb_toolkit/admin/xmover/__init__.py (1 hunks)
  • cratedb_toolkit/admin/xmover/analysis/shard.py (1 hunks)
  • cratedb_toolkit/admin/xmover/analysis/zone.py (1 hunks)
  • cratedb_toolkit/admin/xmover/attic.py (1 hunks)
  • cratedb_toolkit/admin/xmover/cli.py (1 hunks)
  • cratedb_toolkit/admin/xmover/model.py (1 hunks)
  • cratedb_toolkit/admin/xmover/operational/candidates.py (1 hunks)
  • cratedb_toolkit/admin/xmover/operational/monitor.py (1 hunks)
  • cratedb_toolkit/admin/xmover/operational/recommend.py (1 hunks)
  • cratedb_toolkit/admin/xmover/util/database.py (1 hunks)
  • cratedb_toolkit/admin/xmover/util/error.py (1 hunks)
  • cratedb_toolkit/admin/xmover/util/format.py (1 hunks)
  • cratedb_toolkit/cli.py (2 hunks)
  • doc/admin/index.md (1 hunks)
  • doc/admin/xmover/handbook.md (1 hunks)
  • doc/admin/xmover/index.md (1 hunks)
  • doc/admin/xmover/queries.md (1 hunks)
  • doc/admin/xmover/troubleshooting.md (1 hunks)
  • doc/index.md (1 hunks)
  • pyproject.toml (3 hunks)
  • tests/admin/test_cli.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • doc/admin/index.md
🚧 Files skipped from review as they are similar to previous changes (10)
  • cratedb_toolkit/cli.py
  • cratedb_toolkit/admin/xmover/__init__.py
  • doc/index.md
  • tests/admin/test_cli.py
  • pyproject.toml
  • cratedb_toolkit/admin/xmover/util/error.py
  • cratedb_toolkit/admin/xmover/util/format.py
  • CHANGES.md
  • cratedb_toolkit/admin/xmover/attic.py
  • cratedb_toolkit/admin/xmover/model.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, following PostgreSQL conventions. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries. Do not flag this as incorrect.

Applied to files:

  • doc/admin/xmover/queries.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, similar to PostgreSQL. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries.

Applied to files:

  • doc/admin/xmover/queries.md
📚 Learning: 2025-08-20T09:20:33.795Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/handbook.md:462-469
Timestamp: 2025-08-20T09:20:33.795Z
Learning: In large documentation files with many existing headings, be less strict about flagging emphasis-as-heading (MD036) issues, as emphasis can serve valid purposes for minor subsections and clarifications rather than requiring full heading treatment.

Applied to files:

  • doc/admin/xmover/handbook.md
🧬 Code Graph Analysis (7)
cratedb_toolkit/admin/xmover/analysis/zone.py (3)
cratedb_toolkit/admin/xmover/analysis/shard.py (2)
  • ShardAnalyzer (30-836)
  • check_zone_balance (129-146)
cratedb_toolkit/admin/xmover/model.py (1)
  • ShardInfo (33-51)
cratedb_toolkit/admin/xmover/util/database.py (2)
  • CrateDBClient (17-493)
  • get_shards_info (93-173)
cratedb_toolkit/admin/xmover/operational/recommend.py (5)
cratedb_toolkit/admin/xmover/analysis/shard.py (3)
  • ShardAnalyzer (30-836)
  • validate_move_safety (373-401)
  • generate_rebalancing_recommendations (191-371)
cratedb_toolkit/admin/xmover/model.py (8)
  • ShardRelocationConstraints (177-186)
  • ShardRelocationRequest (106-113)
  • ShardRelocationResponse (117-153)
  • shard_type (50-51)
  • shard_type (86-87)
  • size_gb (81-83)
  • to_sql (131-137)
  • available_space_gb (28-29)
cratedb_toolkit/admin/xmover/operational/monitor.py (2)
  • RecoveryOptions (16-21)
  • get_cluster_recovery_status (31-43)
cratedb_toolkit/admin/xmover/util/database.py (2)
  • CrateDBClient (17-493)
  • execute_query (37-55)
cratedb_toolkit/admin/xmover/util/format.py (1)
  • format_size (1-8)
cratedb_toolkit/admin/xmover/operational/monitor.py (3)
cratedb_toolkit/admin/xmover/model.py (4)
  • RecoveryInfo (55-102)
  • size_gb (81-83)
  • overall_progress (76-78)
  • total_time_seconds (90-92)
cratedb_toolkit/admin/xmover/util/database.py (1)
  • get_all_recovering_shards (359-390)
cratedb_toolkit/admin/xmover/util/format.py (1)
  • format_translog_info (21-45)
cratedb_toolkit/admin/xmover/util/database.py (2)
cratedb_toolkit/admin/xmover/model.py (5)
  • NodeInfo (7-29)
  • RecoveryInfo (55-102)
  • ShardInfo (33-51)
  • size_gb (81-83)
  • overall_progress (76-78)
cratedb_toolkit/admin/xmover/cli.py (1)
  • test_connection (152-174)
cratedb_toolkit/admin/xmover/operational/candidates.py (3)
cratedb_toolkit/admin/xmover/analysis/shard.py (1)
  • find_moveable_shards (106-127)
cratedb_toolkit/admin/xmover/model.py (3)
  • SizeCriteria (169-173)
  • available_space_gb (28-29)
  • size_gb (81-83)
cratedb_toolkit/admin/xmover/util/format.py (1)
  • format_size (1-8)
cratedb_toolkit/admin/xmover/analysis/shard.py (3)
cratedb_toolkit/admin/xmover/model.py (11)
  • DistributionStats (157-165)
  • NodeInfo (7-29)
  • ShardInfo (33-51)
  • ShardRelocationConstraints (177-186)
  • ShardRelocationResponse (117-153)
  • size_gb (81-83)
  • available_space_gb (28-29)
  • shard_type (50-51)
  • shard_type (86-87)
  • disk_usage_percent (24-25)
  • heap_usage_percent (20-21)
cratedb_toolkit/admin/xmover/util/database.py (4)
  • get_nodes_info (57-91)
  • get_shards_info (93-173)
  • execute_query (37-55)
  • get_cluster_watermarks (247-266)
cratedb_toolkit/admin/xmover/util/format.py (2)
  • format_percentage (11-18)
  • format_size (1-8)
cratedb_toolkit/admin/xmover/cli.py (8)
cratedb_toolkit/admin/xmover/analysis/shard.py (3)
  • ShardAnalyzer (30-836)
  • ShardReporter (839-949)
  • distribution (843-949)
cratedb_toolkit/admin/xmover/analysis/zone.py (3)
  • ZoneReport (15-159)
  • shard_balance (20-57)
  • distribution_conflicts (59-159)
cratedb_toolkit/admin/xmover/model.py (3)
  • ShardRelocationConstraints (177-186)
  • ShardRelocationRequest (106-113)
  • SizeCriteria (169-173)
cratedb_toolkit/admin/xmover/operational/candidates.py (2)
  • CandidateFinder (13-84)
  • movement_candidates (17-84)
cratedb_toolkit/admin/xmover/operational/recommend.py (3)
  • ShardRelocationRecommender (21-497)
  • validate (26-155)
  • execute (157-353)
cratedb_toolkit/admin/xmover/operational/monitor.py (3)
  • RecoveryMonitor (24-384)
  • RecoveryOptions (16-21)
  • start (167-384)
cratedb_toolkit/admin/xmover/util/database.py (3)
  • CrateDBClient (17-493)
  • test_connection (239-245)
  • get_nodes_info (57-91)
cratedb_toolkit/admin/xmover/util/error.py (1)
  • explain_cratedb_error (9-133)
🪛 LanguageTool
doc/admin/xmover/troubleshooting.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-troubleshooting)= # Troubleshooting CrateDB using XMover Th...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ...ns ### 1. Zone Conflicts #### Symptoms - Error: `NO(a copy of this shard is alrea...

(QB_NEW_EN)


[grammar] ~33-~33: There might be a mistake here.
Context: ...s in safety validation #### Root Causes - Target node already has a copy of the sh...

(QB_NEW_EN)


[grammar] ~64-~64: There might be a mistake here.
Context: ...D FROM_NODE TO_NODE ``` #### Prevention - Always use xmover recommend instead of...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: ...Insufficient Space Issues #### Symptoms - Error: not enough disk space - Safety ...

(QB_NEW_EN)


[grammar] ~72-~72: There might be a mistake here.
Context: ...nt Space Issues #### Symptoms - Error: not enough disk space - Safety validation fails with space warni...

(QB_NEW_EN)


[grammar] ~76-~76: There might be a mistake here.
Context: ...es in cluster analysis #### Root Causes - Target node doesn't have enough free spa...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...max-size 50 ``` Step 3: Free Up Space - Delete old snapshots and unused data - M...

(QB_NEW_EN)


[grammar] ~106-~106: There might be a mistake here.
Context: ...ng nodes to the cluster #### Prevention - Monitor disk usage regularly with `xmove...

(QB_NEW_EN)


[grammar] ~107-~107: There might be a mistake here.
Context: ...ion - Monitor disk usage regularly with xmover analyze - Set conservative --min-free-space valu...

(QB_NEW_EN)


[grammar] ~113-~113: There might be a mistake here.
Context: .... Node Performance Issues #### Symptoms - Error: shard recovery limit - High hea...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...formance Issues #### Symptoms - Error: shard recovery limit - High heap usage warnings - Slow shard mo...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...covery limit` - High heap usage warnings - Slow shard movement operations #### Roo...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...rd movement operations #### Root Causes - Too many concurrent shard movements - Hi...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...es - Too many concurrent shard movements - High heap usage on target nodes (>80%) -...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...- High heap usage on target nodes (>80%) - Resource contention during moves #### S...

(QB_NEW_EN)


[grammar] ~149-~149: There might be a mistake here.
Context: ... --prioritize-space ``` #### Prevention - Move shards gradually (5-10 at a time) -...

(QB_NEW_EN)


[grammar] ~161-~161: There might be a mistake here.
Context: ...nificantly more shards #### Root Causes - Historical data distribution patterns - ...

(QB_NEW_EN)


[grammar] ~162-~162: There might be a mistake here.
Context: ... - Historical data distribution patterns - Node additions/removals without rebalanc...

(QB_NEW_EN)


[grammar] ~163-~163: There might be a mistake here.
Context: ...e additions/removals without rebalancing - Tables created with poor initial distrib...

(QB_NEW_EN)


[grammar] ~194-~194: There might be a mistake here.
Context: ...progress and repeat ``` #### Prevention - Run regular balance checks: `xmover chec...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ...## Symptoms - "Connection failed" errors - Authentication failures - SSL/TLS errors...

(QB_NEW_EN)


[grammar] ~203-~203: There might be a mistake here.
Context: ...failed" errors - Authentication failures - SSL/TLS errors #### Root Causes - Incor...

(QB_NEW_EN)


[grammar] ~206-~206: There might be a mistake here.
Context: ...lures - SSL/TLS errors #### Root Causes - Incorrect connection string in .env - ...

(QB_NEW_EN)


[grammar] ~208-~208: There might be a mistake here.
Context: ...ion string in .env - Wrong credentials - Network connectivity issues - SSL certif...

(QB_NEW_EN)


[grammar] ~209-~209: There might be a mistake here.
Context: ...redentials - Network connectivity issues - SSL certificate problems #### Solutions...

(QB_NEW_EN)


[grammar] ~241-~241: There might be a mistake here.
Context: ..."stmt":"SELECT 1"}' ``` #### Prevention - Use .env.example as a template - Verif...

(QB_NEW_EN)


[grammar] ~244-~244: There might be a mistake here.
Context: ... with CrateDB admin - Test connectivity from deployment environment ## Error Messag...

(QB_NEW_EN)


[grammar] ~262-~262: There might be a mistake here.
Context: ... | Error Pattern | Meaning | Quick Fix | |---------------|---------|-----------| ...

(QB_NEW_EN)


[grammar] ~263-~263: There might be a mistake here.
Context: ... |---------------|---------|-----------| | `copy of this shard is already allocat...

(QB_NEW_EN)


[grammar] ~264-~264: There might be a mistake here.
Context: ...s shard | Choose different target node | | `too many copies...with attribute [zon...

(QB_NEW_EN)


[grammar] ~265-~265: There might be a mistake here.
Context: ...imit exceeded | Move to different zone | | not enough disk space | Insufficient...

(QB_NEW_EN)


[grammar] ~266-~266: There might be a mistake here.
Context: ... | Free space or choose different node | | shard recovery limit | Too many conc...

(QB_NEW_EN)


[grammar] ~267-~267: There might be a mistake here.
Context: ...oves | Wait and retry with fewer moves | | allocation is disabled | Cluster all...

(QB_NEW_EN)


[grammar] ~331-~331: There might be a mistake here.
Context: ... performance** - Query response times - Resource utilization - Error rates ## Emerge...

(QB_NEW_EN)


[grammar] ~332-~332: There might be a mistake here.
Context: ...response times - Resource utilization - Error rates ## Emergency Procedures ### Stu...

(QB_NEW_EN)


[grammar] ~403-~403: There might be a mistake here.
Context: ... Documentation**: https://crate.io/docs/ - Shard Allocation Guide: https://crate....

(QB_NEW_EN)


[grammar] ~404-~404: There might be a mistake here.
Context: .../en/latest/admin/system-information.html - Cluster Settings: https://crate.io/doc...

(QB_NEW_EN)


[grammar] ~411-~411: There might be a mistake here.
Context: ...e: 1. XMover version and command used 2. Complete error message 3. **Cluster in...

(QB_NEW_EN)


[grammar] ~412-~412: There might be a mistake here.
Context: ...mmand used** 2. Complete error message 3. Cluster information (xmover analyze ...

(QB_NEW_EN)


[grammar] ~413-~413: There might be a mistake here.
Context: ... information** (xmover analyze output) 4. Zone analysis (xmover zone-analysis ...

(QB_NEW_EN)


[grammar] ~414-~414: There might be a mistake here.
Context: ...alysis** (xmover zone-analysis output) 5. CrateDB version and configuration ###...

(QB_NEW_EN)

doc/admin/xmover/queries.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-queries)= # XMover Query Gallery ## Shard Distribut...

(QB_NEW_EN)

doc/admin/xmover/handbook.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-handbook)= # XMover Handbook ## Installation Instal...

(QB_NEW_EN)


[grammar] ~82-~82: There might be a mistake here.
Context: ...tion across nodes and zones. Options: - --table, -t: Analyze specific table only **Example...

(QB_NEW_EN)


[grammar] ~93-~93: There might be a mistake here.
Context: ...on size and health criteria. Options: - --table, -t: Find candidates in specific table only...

(QB_NEW_EN)


[grammar] ~94-~94: There might be a mistake here.
Context: ...: Find candidates in specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~95-~95: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~96-~96: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --node: Only show candidates from this specifi...

(QB_NEW_EN)


[grammar] ~111-~111: There might be a mistake here.
Context: ...ons for cluster rebalancing. Options: - --table, -t: Generate recommendations for specific ...

(QB_NEW_EN)


[grammar] ~112-~112: There might be a mistake here.
Context: ... recommendations for specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~113-~113: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --zone-tolerance: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...lance tolerance percentage (default: 10) - --min-free-space: Minimum free space required on target ...

(QB_NEW_EN)


[grammar] ~116-~116: There might be a mistake here.
Context: ...red on target nodes in GB (default: 100) - --max-moves: Maximum number of move recommendations...

(QB_NEW_EN)


[grammar] ~117-~117: There might be a mistake here.
Context: ...er of move recommendations (default: 10) - --max-disk-usage: Maximum disk usage percentage for targ...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...ercentage for target nodes (default: 85) - --validate/--no-validate: Validate move safety (default: True) -...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...e: Validate move safety (default: True) - --prioritize-space/--prioritize-zones`: Prioritize available space over zone b...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...ace over zone balancing (default: False) - --dry-run/--execute: Show what would be done without genera...

(QB_NEW_EN)


[grammar] ~121-~121: There might be a mistake here.
Context: ... generating SQL commands (default: True) - --node: Only recommend moves from this specifi...

(QB_NEW_EN)


[grammar] ~145-~145: There might be a mistake here.
Context: ...ion and potential conflicts. Options: - --table, -t: Analyze zones for specific table only ...

(QB_NEW_EN)


[grammar] ~146-~146: There might be a mistake here.
Context: ...: Analyze zones for specific table only - --show-shards/--no-show-shards`: Show individual shard details (default...

(QB_NEW_EN)


[grammar] ~157-~157: There might be a mistake here.
Context: ...with configurable tolerance. Options: - --table, -t: Check balance for specific table only ...

(QB_NEW_EN)


[grammar] ~158-~158: There might be a mistake here.
Context: ...: Check balance for specific table only - --tolerance`: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~171-~171: There might be a mistake here.
Context: ...ecution to prevent errors. Arguments: - SCHEMA_TABLE: Schema and table name (format: schema....

(QB_NEW_EN)


[grammar] ~172-~172: There might be a mistake here.
Context: ...ma and table name (format: schema.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source...

(QB_NEW_EN)


[grammar] ~173-~173: There might be a mistake here.
Context: ...ma.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source node name - TO_NODE: Target n...

(QB_NEW_EN)


[grammar] ~174-~174: There might be a mistake here.
Context: ... to move - FROM_NODE: Source node name - TO_NODE: Target node name Examples: ```bas...

(QB_NEW_EN)


[grammar] ~189-~189: There might be a mistake here.
Context: ... troubleshooting guidance. Arguments: - ERROR_MESSAGE: The CrateDB error message to analyze (...

(QB_NEW_EN)


[grammar] ~204-~204: There might be a mistake here.
Context: ...y operations on the cluster. Options: - --table, -t: Monitor recovery for specific table on...

(QB_NEW_EN)


[grammar] ~205-~205: There might be a mistake here.
Context: ...Monitor recovery for specific table only - --node, -n: Monitor recovery on specific node only...

(QB_NEW_EN)


[grammar] ~206-~206: There might be a mistake here.
Context: ...: Monitor recovery on specific node only - --watch, -w: Continuously monitor (refresh every 10...

(QB_NEW_EN)


[grammar] ~207-~207: There might be a mistake here.
Context: ...Continuously monitor (refresh every 10s) - --refresh-interval: Refresh interval for watch mode in sec...

(QB_NEW_EN)


[grammar] ~208-~208: There might be a mistake here.
Context: ... for watch mode in seconds (default: 10) - --recovery-type: Filter by recovery type - PEER, DISK, ...

(QB_NEW_EN)


[grammar] ~209-~209: There might be a mistake here.
Context: ...type - PEER, DISK, or all (default: all) - --include-transitioning: Include recently completed recoveries ...

(QB_NEW_EN)


[grammar] ~230-~230: There might be a mistake here.
Context: ...ude-transitioning ``` Recovery Types: - PEER: Copying shard data from another ...

(QB_NEW_EN)


[grammar] ~231-~231: There might be a mistake here.
Context: ...om another node (replication/relocation) - DISK: Rebuilding shard from local data...

(QB_NEW_EN)


[grammar] ~364-~364: There might be a mistake here.
Context: ...TRING: CrateDB HTTP endpoint (required) - CRATE_USERNAME`: Username for authentication (optional)...

(QB_NEW_EN)


[grammar] ~365-~365: There might be a mistake here.
Context: ...: Username for authentication (optional) - CRATE_PASSWORD: Password for authentication (optional)...

(QB_NEW_EN)


[grammar] ~366-~366: There might be a mistake here.
Context: ...: Password for authentication (optional) - CRATE_SSL_VERIFY: Enable SSL certificate verification (d...

(QB_NEW_EN)


[grammar] ~421-~421: There might be a mistake here.
Context: ...zone awareness prevents too many copies in same zone - Solution: Move shard...

(QB_NEW_EN)


[grammar] ~422-~422: There might be a mistake here.
Context: ... copies in same zone - Solution: Move shard to a different availability zone ...

(QB_NEW_EN)


[grammar] ~430-~430: There might be a mistake here.
Context: ...ufficient free space - Solution: Choose node with more capacity or free up spac...

(QB_NEW_EN)


[grammar] ~441-~441: There might be a mistake here.
Context: ...pace` 5. No Recommendations Generated - Cause: Cluster may already be well bal...

(QB_NEW_EN)


[grammar] ~471-~471: There might be a mistake here.
Context: ...age 95 ``` When to Adjust Thresholds: - Emergency situations: Increase to 90-9...

(QB_NEW_EN)


[style] ~472-~472: ‘Emergency situations’ might be wordy. Consider a shorter alternative.
Context: ...``` When to Adjust Thresholds: - Emergency situations: Increase to 90-95% for critical spac...

(EN_WORDINESS_PREMIUM_EMERGENCY_SITUATIONS)


[grammar] ~474-~474: There might be a mistake here.
Context: ...onments**: Can be more aggressive (90%+) - Production: Keep conservative (80-85%)...

(QB_NEW_EN)


[grammar] ~479-~479: There might be a mistake here.
Context: ...f}xmover-troubleshooting which covers: - Step-by-step diagnostic procedures - Eme...

(QB_NEW_EN)


[grammar] ~480-~480: There might be a mistake here.
Context: ...rs: - Step-by-step diagnostic procedures - Emergency recovery procedures - Best pra...

(QB_NEW_EN)


[grammar] ~481-~481: There might be a mistake here.
Context: ...ocedures - Emergency recovery procedures - Best practices for safe operations - Com...

(QB_NEW_EN)


[grammar] ~482-~482: There might be a mistake here.
Context: ...res - Best practices for safe operations - Complete error reference guide ### Debu...

(QB_NEW_EN)

doc/admin/xmover/index.md

[grammar] ~8-~8: There might be a mistake here.
Context: ...nd availability zones. It generates safe SQL commands for shard rebalancing and n...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...hard distribution across nodes and zones - Shard Movement Recommendations: Intell...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...s for rebalancing with safety validation - Recovery Monitoring: Track ongoing sha...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...ecovery operations with progress details - Zone Conflict Detection: Prevents move...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...t would violate CrateDB's zone awareness - Node Decommissioning: Plan safe node r...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ... removal with automated shard relocation - Dry Run Mode: Test recommendations wit...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...s without generating actual SQL commands - Safety Validation: Comprehensive check...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
doc/admin/xmover/troubleshooting.md

40-40: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


49-49: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


58-58: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


83-83: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


92-92: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


101-101: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


125-125: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


134-134: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


143-143: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


168-168: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


177-177: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


186-186: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


214-214: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


220-220: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


232-232: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


403-403: Bare URL used

(MD034, no-bare-urls)


404-404: Bare URL used

(MD034, no-bare-urls)


405-405: Bare URL used

(MD034, no-bare-urls)

doc/admin/xmover/handbook.md

462-462: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: build-and-publish
  • GitHub Check: CFR for OS windows-latest
  • GitHub Check: CFR for OS macos-latest
  • GitHub Check: CFR for OS macos-13
  • GitHub Check: CFR for OS ubuntu-latest
  • GitHub Check: Python 3.13, PostgreSQL latest, OS ubuntu-latest
  • GitHub Check: Kinesis: Python 3.11 on OS ubuntu-latest
  • GitHub Check: Kinesis: Python 3.13 on OS ubuntu-latest
  • GitHub Check: build-and-test
  • GitHub Check: Python 3.10, LocalStack 4.5, OS ubuntu-latest
  • GitHub Check: Python 3.13, LocalStack 4.5, OS ubuntu-latest
  • GitHub Check: Generic: Python 3.13 on OS ubuntu-latest
  • GitHub Check: Generic: Python 3.9 on OS ubuntu-latest
🔇 Additional comments (6)
doc/admin/xmover/index.md (1)

7-10: Clear and concise intro — LGTM

The “what/why” is crisp, and the toctree keeps the doc suite discoverable.

cratedb_toolkit/admin/xmover/cli.py (1)

76-84: SizeCriteria Already a Dataclass – No Changes Needed

I’ve confirmed that in cratedb_toolkit/admin/xmover/model.py, SizeCriteria is indeed annotated with @dataclasses.dataclass, providing a generated __init__ that accepts all the keyword arguments shown in the CLI code (min_size, max_size, table_name, source_node). Instantiating it with kwargs will succeed at runtime, so the suggested change is unnecessary.

Likely an incorrect or invalid review comment.

cratedb_toolkit/admin/xmover/util/database.py (2)

142-146: GB conversion uses XOR operator (^) instead of power/multiplication.

1024.0^3 is bitwise XOR, not exponentiation. This yields wrong sizes and cascades into incorrect filtering and reporting.

-            s.size / 1024.0^3 as size_gb,
+            s.size / power(1024.0, 3) as size_gb,
...
-            SUM(s.size) / 1024.0^3 as total_size_gb,
-            AVG(s.size) / 1024.0^3 as avg_size_gb
+            SUM(s.size) / power(1024.0, 3) as total_size_gb,
+            AVG(s.size) / power(1024.0, 3) as avg_size_gb

If power() isn’t available, use s.size / 1073741824.0.

Also applies to: 188-201

⛔ Skipped due to learnings
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, following PostgreSQL conventions. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries. Do not flag this as incorrect.
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, similar to PostgreSQL. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries.

74-91: Model classes are already dataclasses—no changes needed

Both NodeInfo and ShardInfo in cratedb_toolkit/admin/xmover/model.py are decorated with @dataclass, so their auto-generated __init__ methods support the named-parameter constructions shown in database.py. You can safely ignore the original concern about missing constructors.

NodeInfo at line 7 of cratedb_toolkit/admin/xmover/model.py has @dataclass
ShardInfo at line 33 of cratedb_toolkit/admin/xmover/model.py has @dataclass

Likely an incorrect or invalid review comment.

cratedb_toolkit/admin/xmover/analysis/shard.py (2)

1-4: LGTM!

Good module structure with clear purpose in the docstring.


489-503: Here’s the surrounding context for the SQL snippet in question. Once we see how schema_name is passed in, we can determine if None (or the default schema) is handled correctly:

  • Lines 460–540 of cratedb_toolkit/admin/xmover/analysis/shard.py have been printed.

Let me know once you’ve reviewed the output, and I’ll point out whether any special handling for schema_name is needed.

Comment on lines +249 to +251
if shard is None:
logger.info(f"Shard not found: {i}")
continue

⚠️ Potential issue

Fix incorrect log message and consider removing unnecessary check.

The code logs "Shard not found" but shard is not an index - it's the actual shard object from the list. The None check appears unnecessary since Python's iteration over a list wouldn't produce None values unless they were explicitly in the list.

-        for i, shard in enumerate(processing_shards):
-            if shard is None:
-                logger.info(f"Shard not found: {i}")
-                continue
+        for i, shard in enumerate(processing_shards):

If None values are expected in the processing_shards list, the log message should be more descriptive:

         for i, shard in enumerate(processing_shards):
             if shard is None:
-                logger.info(f"Shard not found: {i}")
+                logger.warning(f"Encountered None shard at index {i} in processing_shards")
                 continue

Comment on lines +668 to +700
# Calculate remaining space to each watermark
total_space_bytes = node.fs_total
current_used_bytes = node.fs_used

# Space that would be used at each watermark
low_watermark_used_bytes = total_space_bytes * (low_watermark / 100.0)
high_watermark_used_bytes = total_space_bytes * (high_watermark / 100.0)

# Remaining space until each watermark (negative if already exceeded)
remaining_to_low_gb = max(0, (low_watermark_used_bytes - current_used_bytes) / (1024**3))
remaining_to_high_gb = max(0, (high_watermark_used_bytes - current_used_bytes) / (1024**3))

return {"remaining_to_low_gb": remaining_to_low_gb, "remaining_to_high_gb": remaining_to_high_gb}

def _parse_watermark_percentage(self, watermark_value: str) -> float:
"""Parse watermark percentage from string like '85%' or '0.85'"""
if isinstance(watermark_value, str):
if watermark_value.endswith("%"):
return float(watermark_value[:-1])
else:
# Handle decimal format like '0.85'
decimal_value = float(watermark_value)
if decimal_value <= 1.0:
return decimal_value * 100
return decimal_value
elif isinstance(watermark_value, (int, float)):
if watermark_value <= 1.0:
return watermark_value * 100
return watermark_value
else:
# Default to common values if parsing fails
return 85.0 # Default low watermark


💡 Verification agent

🧩 Analysis chain

Validate that the watermark defaults match CrateDB's actual defaults.

The _parse_watermark_percentage method returns a default value of 85.0% when parsing fails. This should match CrateDB's actual default watermark values.


🌐 Web query:

What are the default disk watermark values (low, high, flood_stage) in CrateDB?

💡 Result:

The default disk watermark values in CrateDB are:

  • low: 85%
  • high: 90%
  • flood_stage: 95%[4][3]

These percentage values refer to disk usage. When disk usage on a node exceeds:

  • low (85%): New shards will not be allocated to the node.
  • high (90%): Existing shards on the node will be relocated elsewhere if possible.
  • flood_stage (95%): All indices with shards on that node are set to read-only to prevent further data from being written[4][1][5].

Configuration details:

  • These can be set as either percentages or absolute byte values (but not mixed)[4].
  • The settings are:
    • cluster.routing.allocation.disk.watermark.low
    • cluster.routing.allocation.disk.watermark.high
    • cluster.routing.allocation.disk.watermark.flood_stage[4][3]
  • The disk info check runs every 30 seconds by default[4].

Operational impact:

  • Once the flood stage threshold is crossed, tables go read-only and the block is lifted automatically when disk usage falls below flood stage after freeing space[5][9].
  • These thresholds help prevent full disk situations and maintain cluster stability[4][3].

References to configuration:

  • Configuration examples and their defaults are shown in the official documentation and config comments[4][3].
    #cluster.routing.allocation.disk.watermark.low: 85%
    #cluster.routing.allocation.disk.watermark.high: 90%
    #cluster.routing.allocation.disk.watermark.flood_stage: 95%
    

CrateDB thus uses these default values to govern disk-based allocation and cluster safety.


Fix default watermark fallback values

The _parse_watermark_percentage method currently returns a constant fallback of 85.0% for any unparsable input. According to CrateDB’s defaults, however, the three disk watermarks are:

  • low: 85%
  • high: 90%
  • flood_stage: 95%

As written, failed parses for the high or flood_stage watermarks would incorrectly fall back to 85%. Please update the parser so that each threshold uses its proper default—either by passing the intended default into the method or by branching on the watermark type. For example:

     else:
-        return 85.0
+        return {
+            'low':         85.0,
+            'high':        90.0,
+            'flood_stage': 95.0,
+        }.get(watermark_key, 85.0)

Locations to address:

  • cratedb_toolkit/admin/xmover/analysis/shard.py around lines 689–700

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
cratedb_toolkit/admin/xmover/analysis/shard.py around lines 689-700:
_parse_watermark_percentage currently returns a hardcoded 85.0 fallback for any
unparsable input which misassigns defaults for high and flood_stage; update the
function to accept a default value (or a watermark name/type) and return that
default on parse failure, and then update all callers to pass the appropriate
defaults (85.0 for low, 90.0 for high, 95.0 for flood_stage) so each threshold
falls back to its correct CrateDB default.

Comment on lines +682 to +700
def _parse_watermark_percentage(self, watermark_value: str) -> float:
"""Parse watermark percentage from string like '85%' or '0.85'"""
if isinstance(watermark_value, str):
if watermark_value.endswith("%"):
return float(watermark_value[:-1])
else:
# Handle decimal format like '0.85'
decimal_value = float(watermark_value)
if decimal_value <= 1.0:
return decimal_value * 100
return decimal_value
elif isinstance(watermark_value, (int, float)):
if watermark_value <= 1.0:
return watermark_value * 100
return watermark_value
else:
# Default to common values if parsing fails
return 85.0 # Default low watermark


🛠️ Refactor suggestion

Add edge case handling for watermark parsing.

The watermark parsing logic doesn't handle negative values or values greater than 100%, which could lead to incorrect calculations.

 def _parse_watermark_percentage(self, watermark_value: str) -> float:
     """Parse watermark percentage from string like '85%' or '0.85'"""
     if isinstance(watermark_value, str):
         if watermark_value.endswith("%"):
-            return float(watermark_value[:-1])
+            value = float(watermark_value[:-1])
+            if value < 0 or value > 100:
+                logger.warning(f"Invalid watermark percentage {value}%, using default 85%")
+                return 85.0
+            return value
         else:
             # Handle decimal format like '0.85'
             decimal_value = float(watermark_value)
             if decimal_value <= 1.0:
-                return decimal_value * 100
-            return decimal_value
+                value = decimal_value * 100
+            else:
+                value = decimal_value
+            if value < 0 or value > 100:
+                logger.warning(f"Invalid watermark percentage {value}%, using default 85%")
+                return 85.0
+            return value
     elif isinstance(watermark_value, (int, float)):
         if watermark_value <= 1.0:
-            return watermark_value * 100
-        return watermark_value
+            value = watermark_value * 100
+        else:
+            value = watermark_value
+        if value < 0 or value > 100:
+            logger.warning(f"Invalid watermark percentage {value}%, using default 85%")
+            return 85.0
+        return value
     else:
         # Default to common values if parsing fails
         return 85.0  # Default low watermark
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analysis/shard.py around lines 682 to 700, the
watermark parser must handle negative values, values >100, and bad input; update
the function to parse percent strings and decimal strings as currently done,
then normalize the numeric result by clamping it into the valid range 0.0–100.0
(e.g., values <=0 -> 0.0, values >=100 -> 100.0), and catch any ValueError from
float conversion to return the default 85.0; ensure both string and numeric
branches apply the same clamping and default-fallback behavior.
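
Combining both suggestions, here is a self-contained sketch of a parser that accepts a per-watermark default and rejects out-of-range values. The 85/90/95% defaults mirror CrateDB's documented settings; everything else is illustrative and not the PR's actual implementation.

```python
import logging
from typing import Union

logger = logging.getLogger("xmover")

WATERMARK_DEFAULTS = {"low": 85.0, "high": 90.0, "flood_stage": 95.0}


def parse_watermark_percentage(value: Union[str, int, float, None], watermark: str = "low") -> float:
    """Parse '85%', '0.85', 85, or 0.85 into a percentage, falling back to the
    per-watermark default and rejecting values outside the 0-100 range."""
    default = WATERMARK_DEFAULTS.get(watermark, 85.0)
    try:
        if isinstance(value, str):
            number = float(value[:-1]) if value.endswith("%") else float(value)
        elif isinstance(value, (int, float)):
            number = float(value)
        else:
            return default
    except ValueError:
        logger.warning("Unparsable watermark %r, using default %.1f%%", value, default)
        return default
    if number <= 1.0 and not (isinstance(value, str) and value.endswith("%")):
        number *= 100  # decimal form such as 0.9
    if not 0.0 <= number <= 100.0:
        logger.warning("Watermark %r out of range, using default %.1f%%", value, default)
        return default
    return number


print(parse_watermark_percentage("90%", "high"))        # 90.0
print(parse_watermark_percentage(0.95, "flood_stage"))  # 95.0
print(parse_watermark_percentage("bogus", "high"))      # 90.0 (per-watermark default)
```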

Comment on lines +26 to +27
zone_stats = self.analyzer.check_zone_balance(table, tolerance)


🛠️ Refactor suggestion

Avoid double-applying tolerance and fix target rounding bias.

  • You pass tolerance into check_zone_balance(...) and also apply your own tolerance window locally. If ShardAnalyzer.check_zone_balance ever uses that parameter, you’ll be double-counting tolerance. Prefer calling it with named args and compute the window only here.
  • Using integer division for target_per_zone biases status for small totals. Compare against a float expected value instead.

Apply:

-        zone_stats = self.analyzer.check_zone_balance(table, tolerance)
+        zone_stats = self.analyzer.check_zone_balance(table_name=table)
...
-        target_per_zone = total_shards // len(zones) if zones else 0
-        tolerance_range = (target_per_zone * (1 - tolerance / 100), target_per_zone * (1 + tolerance / 100))
+        expected_per_zone = (total_shards / len(zones)) if zones else 0.0
+        tolerance_range = (
+            expected_per_zone * (1 - tolerance / 100.0),
+            expected_per_zone * (1 + tolerance / 100.0),
+        )
...
-            if tolerance_range[0] <= total <= tolerance_range[1]:
+            if tolerance_range[0] <= total <= tolerance_range[1]:
                 status = "[green]✓ Balanced[/green]"
             elif total < tolerance_range[0]:
-                status = f"[yellow]⚠ Under ({total - target_per_zone:+})[/yellow]"
+                status = f"[yellow]⚠ Under ({total - expected_per_zone:+.1f})[/yellow]"
             else:
-                status = f"[red]⚠ Over ({total - target_per_zone:+})[/red]"
+                status = f"[red]⚠ Over ({total - expected_per_zone:+.1f})[/red]"

Also applies to: 32-37, 45-55

🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analysis/zone.py around lines 26-27 (and
similarly at 32-37 and 45-55), you're double-applying tolerance by passing
tolerance into check_zone_balance and then applying a local tolerance window;
change the call to pass tolerance as a named argument (e.g.,
tolerance=tolerance) and remove any local adjustment that re-applies tolerance
so the window is computed only here. Replace integer division when computing
target_per_zone with a float-based expected value (e.g., total /
float(num_zones) or use decimal) and compare actual counts against that float
expected value (using <=/>= with the tolerance window) to avoid rounding bias
for small totals. Ensure all comparisons use the float expected +/- tolerance
and update the same logic in the other referenced blocks.
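
A small sketch of the float-based tolerance window; zone names and counts are made up for illustration:

```python
from typing import Dict


def zone_status(shards_per_zone: Dict[str, int], tolerance_pct: float = 10.0) -> Dict[str, str]:
    """Classify each zone against a float expected value instead of integer division."""
    total = sum(shards_per_zone.values())
    expected = total / len(shards_per_zone) if shards_per_zone else 0.0
    low = expected * (1 - tolerance_pct / 100.0)
    high = expected * (1 + tolerance_pct / 100.0)
    result = {}
    for zone, count in shards_per_zone.items():
        if low <= count <= high:
            result[zone] = "balanced"
        elif count < low:
            result[zone] = f"under ({count - expected:+.1f})"
        else:
            result[zone] = f"over ({count - expected:+.1f})"
    return result


# With 7 shards across 2 zones, integer division yields a target of 3, which would
# report zone "b" as balanced even though it sits below its fair share of 3.5.
print(zone_status({"a": 4, "b": 3}))  # {'a': 'over (+0.5)', 'b': 'under (-0.5)'}
```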

Comment on lines +247 to +317
"--recovery-type", type=click.Choice(["PEER", "DISK", "all"]), default="all", help="Filter by recovery type"
)
@click.option("--include-transitioning", is_flag=True, help="Include completed recoveries still in transitioning state")
@click.pass_context
def monitor_recovery(
ctx, table: str, node: str, watch: bool, refresh_interval: int, recovery_type: str, include_transitioning: bool
):

⚠️ Potential issue

“--recovery-type all” currently filters everything.

You pass "all" through to RecoveryOptions.recovery_type, and RecoveryMonitor filters by equality—resulting in an empty set. Map "all" to None and let the monitor skip filtering.

-    recovery_monitor = RecoveryMonitor(
-        client=ctx.obj["client"],
-        options=RecoveryOptions(
-            table=table,
-            node=node,
-            refresh_interval=refresh_interval,
-            recovery_type=recovery_type,
-            include_transitioning=include_transitioning,
-        ),
-    )
+    effective_type = None if recovery_type == "all" else recovery_type
+    recovery_monitor = RecoveryMonitor(
+        client=ctx.obj["client"],
+        options=RecoveryOptions(
+            table=table,
+            node=node,
+            refresh_interval=refresh_interval,
+            recovery_type=effective_type,
+            include_transitioning=include_transitioning,
+        ),
+    )

Also applies to: 269-279

🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/cli.py around lines 247-253 (and similarly lines
269-279), the option value "all" for --recovery-type is passed verbatim into
RecoveryOptions.recovery_type causing equality-based filtering to exclude all
results; change the CLI handler to map the string "all" to None before
constructing RecoveryOptions (or before passing to RecoveryMonitor) so that the
monitor treats None as "no filter" and skips recovery_type filtering; apply the
same mapping in both affected blocks.
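
For illustration, a tiny sketch of the "all means no filter" convention. Only the `recovery_type` field name is taken from the PR; the rest of the dataclass and the filter helper are simplified stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RecoveryOptions:
    recovery_type: Optional[str] = None  # None means "do not filter"


def filter_recoveries(recoveries: List[Dict[str, str]], options: RecoveryOptions) -> List[Dict[str, str]]:
    if options.recovery_type is None:
        return recoveries
    return [r for r in recoveries if r["recovery_type"] == options.recovery_type]


# CLI layer: translate the user-facing choice into the option object.
recovery_type = "all"  # value coming from click.Choice(["PEER", "DISK", "all"])
options = RecoveryOptions(recovery_type=None if recovery_type == "all" else recovery_type)

recoveries = [{"recovery_type": "PEER"}, {"recovery_type": "DISK"}]
print(len(filter_recoveries(recoveries, options)))  # 2 - nothing is filtered out
```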

Comment on lines +417 to +435
console.print(
f"[{i}/{len(safe_recommendations)}] Executing: {table_display} S{rec.shard_id} ({rec.size_gb:.1f}GB)"
)
console.print(f" {rec.from_node} → {rec.to_node}")

try:
# Execute the SQL command
result = self.client.execute_query(sql_command)

if result.get("rowcount", 0) >= 0: # Success indicator for ALTER statements
console.print(" [green]✅ SUCCESS[/green] - Move initiated")
successful_moves += 1

# Smart delay: check active recoveries before next move
if i < len(safe_recommendations):
self._wait_for_recovery_capacity(max_concurrent_recoveries=5)
else:
console.print(f" [red]❌ FAILED[/red] - Unexpected result: {result}")
failed_moves += 1

💡 Verification agent

🧩 Analysis chain

Weak success check on ALTER statements.

CrateDB may return a payload with an error field on HTTP 200. Relying solely on rowcount >= 0 can misreport failures.

Follow-up: after adopting the improved error handling in CrateDBClient (see util/database.py comment), surface the exact error here for user clarity.


🏁 Script executed:

#!/bin/bash
set -e

# Locate the file(s) defining CrateDBClient
rg -l "class CrateDBClient" -g "*.py"

# Show the CrateDBClient class definition for context
rg -n "class CrateDBClient" -C5 -g "*.py"

# Inspect the execute_query implementation
rg -n "def execute_query" -C5 -g "*.py"

# Search for any error‐handling or 'error' fields referenced
rg -n "error" -C3 -g "*.py"

Length of output: 90035


Ensure SQL errors aren’t misreported as success

The current check in cratedb_toolkit/admin/xmover/operational/recommend.py (lines 417–435) treats any response with rowcount >= 0 as success—even when CrateDB returns an HTTP 200 with an embedded error payload. You’ll need to:

• First, detect and handle a top-level "error" key in the returned JSON.
• Only then fall back to checking rowcount for ALTER statements.
• Surface the exact error message to the user.

Suggested diff in recommend.py:

             result = self.client.execute_query(sql_command)

-            if result.get("rowcount", 0) >= 0:  # Success indicator for ALTER statements
-                console.print("    [green]✅ SUCCESS[/green] - Move initiated")
-                successful_moves += 1
+            if "error" in result:
+                console.print(f"    [red]❌ FAILED[/red] - {result['error']['message']}")
+                failed_moves += 1
+            elif result.get("rowcount", 0) >= 0:
+                console.print("    [green]✅ SUCCESS[/green] - Move initiated")
+                successful_moves += 1
             else:
                 console.print(f"    [red]❌ FAILED[/red] - Unexpected result: {result}")
                 failed_moves += 1

Additionally, update CrateDBClient.execute_query in
cratedb_toolkit/admin/xmover/util/database.py to raise or wrap a Python exception when its parsed JSON contains an error field, so that callers can uniformly handle failures.

🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/operational/recommend.py around lines 417–435,
the code treats any response with rowcount >= 0 as success and can misreport
HTTP 200 responses that contain a top-level "error" as successful; update the
handler to first inspect the returned JSON for an "error" key and, if present,
treat it as a failure and surface the error message to the user (console.print
the exact error and increment failed_moves), otherwise fall back to checking
rowcount for ALTER statements; also modify
cratedb_toolkit/admin/xmover/util/database.py CrateDBClient.execute_query to
detect an "error" field in parsed JSON and raise or wrap a Python exception
containing that error so callers can handle failures uniformly.
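
As a hedged sketch of the error-aware handling around the HTTP /_sql response: the exact failure payload shape is assumed here to carry an "error" object with a "message", as described in the review comment, so treat the wrapper below as illustrative rather than the client's actual API.

```python
from typing import Any, Dict


class CrateDBQueryError(RuntimeError):
    """Raised when the /_sql response body carries an error payload."""


def check_sql_response(result: Dict[str, Any]) -> Dict[str, Any]:
    """Raise on an embedded error; otherwise return the result unchanged."""
    if "error" in result:
        message = result["error"].get("message", str(result["error"]))
        raise CrateDBQueryError(message)
    return result


# Caller side: a failed ALTER no longer counts as a success.
try:
    check_sql_response({"error": {"message": "NO(a copy of this shard is already allocated ...)"}})
except CrateDBQueryError as exc:
    print(f"Move failed: {exc}")

ok = check_sql_response({"cols": [], "rows": [], "rowcount": 0})
print("Move initiated" if ok.get("rowcount", -1) >= 0 else "Unexpected result")
```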

Comment on lines +315 to +348
def get_recovery_details(self, schema_name: str, table_name: str, shard_id: int) -> Optional[Dict[str, Any]]:
"""Get detailed recovery information for a specific shard from sys.shards"""

# Query for shards that are actively recovering (not completed)
query = """
SELECT
s.table_name,
s.schema_name,
s.id as shard_id,
s.node['name'] as node_name,
s.node['id'] as node_id,
s.routing_state,
s.state,
s.recovery,
s.size,
s."primary",
s.translog_stats['size'] as translog_size
FROM sys.shards s
WHERE s.table_name = ? AND s.id = ?
AND (s.state = 'RECOVERING' OR s.routing_state IN ('INITIALIZING', 'RELOCATING'))
ORDER BY s.schema_name
LIMIT 1
"""

result = self.execute_query(query, [table_name, shard_id])

if not result.get("rows"):
return None


🛠️ Refactor suggestion

Recovery details query ignores schema in WHERE clause.

get_recovery_details filters by table_name and id only. In clusters with duplicate table names across schemas, you may pick the wrong shard. Include schema if known.

-        WHERE s.table_name = ? AND s.id = ?
+        WHERE s.schema_name = ? AND s.table_name = ? AND s.id = ?
...
-        result = self.execute_query(query, [table_name, shard_id])
+        result = self.execute_query(query, [schema_name, table_name, shard_id])

And adjust the call site accordingly (you already pass schema_name).

🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/util/database.py around lines 315 to 343, the
SQL WHERE clause filters only by table_name and shard id, which can return the
wrong shard when the same table name exists in multiple schemas; include
schema_name in the WHERE clause (e.g. add AND s.schema_name = ?) and update the
execute_query parameter list to pass schema_name along with table_name and
shard_id in the correct order; also update any call sites if they rely on the
old parameter ordering to ensure schema_name is supplied.

Comment on lines +118 to +152
- `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 85)
- `--validate/--no-validate`: Validate move safety (default: True)
- `--prioritize-space/--prioritize-zones`: Prioritize available space over zone balancing (default: False)
- `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True)
- `--node`: Only recommend moves from this specific source node (e.g., data-hot-4)

**Examples:**
```bash
# Dry run with zone balancing priority
xmover recommend --prioritize-zones

# Generate SQL for space optimization
xmover recommend --prioritize-space --execute

# Focus on specific table with custom parameters
xmover recommend --table events --min-size 10 --max-size 30 --execute

# Target space relief for a specific node
xmover recommend --prioritize-space --min-size 30 --max-size 60 --node data-hot-4

# Allow higher disk usage for urgent moves
xmover recommend --prioritize-space --max-disk-usage 90

💡 Verification agent

🧩 Analysis chain

Confirm documented defaults vs. CLI (max-disk-usage and auto-execute)

Docs state --max-disk-usage default: 85. Verify against the Click defaults in cratedb_toolkit/admin/xmover/cli.py. Also check if --auto-execute exists (and its default). Update the handbook to prevent drift.

If the default is 90 and --auto-execute is present (default False), update the bullets accordingly.


🏁 Script executed:

#!/bin/bash
# Verify defaults for recommend options in the CLI implementation.
rg -n -C2 --type=py \
  -e "@click.command" \
  -e "@click.option(" \
  -e "def recommend(" \
  cratedb_toolkit/admin/xmover/cli.py | sed -n '1,200p'

Length of output: 270


🏁 Script executed:

#!/bin/bash
# Locate the recommend command and its option decorators in the xmover CLI
rg -n -C5 -e "def recommend" -e "click\.option" cratedb_toolkit/admin/xmover/cli.py

Length of output: 5994


Update handbook defaults for xmover recommend command

The documentation in doc/admin/xmover/handbook.md is out of sync with the CLI implementation:

  • In cratedb_toolkit/admin/xmover/cli.py, the --max-disk-usage default is 90.0 (not 85).
  • The CLI also defines a --auto-execute flag (default: False), which isn’t documented.

Please update the snippet at lines 118–139 accordingly:

- `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 85)
+ `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 90)

+ `--auto-execute`: Automatically execute the SQL commands (requires `--execute`, asks for confirmation) (default: False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 85)
- `--validate/--no-validate`: Validate move safety (default: True)
- `--prioritize-space/--prioritize-zones`: Prioritize available space over zone balancing (default: False)
- `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True)
- `--node`: Only recommend moves from this specific source node (e.g., data-hot-4)
**Examples:**
```bash
# Dry run with zone balancing priority
xmover recommend --prioritize-zones
# Generate SQL for space optimization
xmover recommend --prioritize-space --execute
# Focus on specific table with custom parameters
xmover recommend --table events --min-size 10 --max-size 30 --execute
# Target space relief for a specific node
xmover recommend --prioritize-space --min-size 30 --max-size 60 --node data-hot-4
# Allow higher disk usage for urgent moves
xmover recommend --prioritize-space --max-disk-usage 90
- `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 90)
- `--auto-execute`: Automatically execute the SQL commands (requires `--execute`, asks for confirmation) (default: False)
- `--validate/--no-validate`: Validate move safety (default: True)
- `--prioritize-space/--prioritize-zones`: Prioritize available space over zone balancing (default: False)
- `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True)
- `--node`: Only recommend moves from this specific source node (e.g., data-hot-4)
**Examples:**
🧰 Tools
🪛 LanguageTool

[grammar] ~118-~118: There might be a mistake here.
Context: ...ercentage for target nodes (default: 85) - --validate/--no-validate: Validate move safety (default: True) -...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...e: Validate move safety (default: True) - --prioritize-space/--prioritize-zones`: Prioritize available space over zone b...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...ace over zone balancing (default: False) - --dry-run/--execute: Show what would be done without genera...

(QB_NEW_EN)


[grammar] ~121-~121: There might be a mistake here.
Context: ... generating SQL commands (default: True) - --node: Only recommend moves from this specifi...

(QB_NEW_EN)

🤖 Prompt for AI Agents
In doc/admin/xmover/handbook.md around lines 118 to 139, the documented CLI
defaults are out of sync: change the `--max-disk-usage` default from 85 to 90,
add documentation for the `--auto-execute` flag (default: False), and update the
dry-run/execute description/examples to reflect the actual CLI behavior (i.e.,
document `--auto-execute` instead of implying a different execute flag and
ensure examples show using `--auto-execute` where SQL generation/execution is
intended).
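
For reference, a minimal Click sketch of how the options under discussion could be declared; the option names and the 90.0 default follow the review above, while the command body is a hypothetical placeholder rather than the actual xmover implementation.

```python
# Minimal sketch only: option names and the 90.0 default follow the review above;
# the command body is a hypothetical placeholder, not the actual xmover code.
import click


@click.command()
@click.option("--max-disk-usage", type=float, default=90.0, show_default=True,
              help="Maximum disk usage percentage for target nodes.")
@click.option("--auto-execute", is_flag=True, default=False, show_default=True,
              help="Automatically execute the generated SQL (requires --execute, asks for confirmation).")
@click.option("--dry-run/--execute", "dry_run", default=True, show_default=True,
              help="Show what would be done without generating SQL commands.")
def recommend(max_disk_usage: float, auto_execute: bool, dry_run: bool) -> None:
    """Hypothetical skeleton mirroring the documented defaults."""
    click.echo(f"max-disk-usage={max_disk_usage} auto-execute={auto_execute} dry-run={dry_run}")


if __name__ == "__main__":
    recommend()
```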

Comment on lines +51 to +70
SELECT
name,
attributes['zone'] AS zone,
fs['total']['available'] / power(1024, 3) AS available_gb
FROM sys.nodes
ORDER BY name;
```
```text
+------------+--------------------+-----------------------------------------------+
| name | attributes['zone'] | (fs[1]['disks']['available'] / 1.073741824E9) |
+------------+--------------------+-----------------------------------------------+
| data-hot-5 | us-west-2a | 142.3342628479004 |
| data-hot-0 | us-west-2a | 142.03089141845703 |
| data-hot-6 | us-west-2b | 159.68728256225586 |
| data-hot-3 | us-west-2b | 798.8147850036621 |
| data-hot-2 | us-west-2b | 156.79160690307617 |
| data-hot-1 | us-west-2c | 145.73613739013672 |
| data-hot-4 | us-west-2c | 148.39511108398438 |
+------------+--------------------+-----------------------------------------------+
```

🛠️ Refactor suggestion

Example output doesn’t match the preceding SELECT

The query selects available_gb from fs['total']['available'], but the example table header references fs[1]['disks']['available'] and shows a derived expression. Align the sample output headers/labels with the query to avoid confusion.

-```text
-+------------+--------------------+-----------------------------------------------+
-| name       | attributes['zone'] | (fs[1]['disks']['available'] / 1.073741824E9) |
-+------------+--------------------+-----------------------------------------------+
+```text
++------------+--------------------+---------------+
+| name       | zone               | available_gb |
++------------+--------------------+---------------+
 | data-hot-5 | us-west-2a         |                            142.3342628479004  |
 ...
-+------------+--------------------+-----------------------------------------------+
++------------+--------------------+---------------+

🤖 Prompt for AI Agents

In doc/admin/xmover/queries.md around lines 51 to 70, the example output table
headers don't match the SELECT (SELECT returns name, attributes['zone'] AS zone,
fs['total']['available'] / power(1024, 3) AS available_gb); update the sample
output to show columns "name", "zone", and "available_gb" (and remove the old
fs[1]['disks']['available'] expression from the header), keep the numeric values
as-is, and ensure the surrounding code fence and table borders reflect the new
header so the example aligns exactly with the query.



Comment on lines +92 to +99
**Step 2: Adjust Parameters**
```bash
# Increase minimum free space requirement
xmover recommend --min-free-space 200

# Focus on smaller shards
xmover recommend --max-size 50
```

🛠️ Refactor suggestion

Adjust “Insufficient Space” advice: lower min-free-space for urgent relief instead of increasing it

Raising the min-free-space threshold reduces eligible targets and can worsen “not enough space” errors during remediation. For urgent space relief, lower the threshold cautiously (and/or raise max-disk-usage) while preferring smaller shards.

Apply:

-# Increase minimum free space requirement
-xmover recommend --min-free-space 200
+# Temporarily lower minimum free space requirement (use with caution)
+xmover recommend --min-free-space 50
@@
-# Focus on smaller shards
-xmover recommend --max-size 50
+# Focus on smaller shards
+xmover recommend --max-size 50
+# (Optionally) allow higher disk usage during emergencies
+xmover recommend --max-disk-usage 90 --prioritize-space
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**Step 2: Adjust Parameters**
```bash
# Increase minimum free space requirement
xmover recommend --min-free-space 200
# Focus on smaller shards
xmover recommend --max-size 50
```
**Step 2: Adjust Parameters**
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

92-92: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🤖 Prompt for AI Agents
In doc/admin/xmover/troubleshooting.md around lines 92 to 99, the current advice
incorrectly tells users to increase --min-free-space for urgent "Insufficient
Space" remediation; this makes the problem worse by further shrinking eligible
targets. Change the guidance to recommend temporarily lowering --min-free-space
(e.g., reduce the value) and/or raising --max-disk-usage to allow more target
disks, while still preferring smaller shards via --max-size; update the example
commands to show lowering --min-free-space and/or increasing --max-disk-usage
and keep the --max-size suggestion for safer moves.

WalBeh and others added 2 commits August 21, 2025 14:15
- More or less just line-length fixes.
- Only a single type adjustment was needed on the return value of the
  `analyze_distribution` method.
- Ruff recommended using set comprehensions, so here we go.
- At a single spot where an exception had been `pass`ed, we added
  error output. Is it bad?

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (14)
doc/admin/xmover/queries.md (5)

58-70: Align example output headers with the SELECT (zone alias and available_gb).

The example table shows attributes['zone'] and a different expression than the query (which aliases zone and available_gb). Update headers for consistency.

-```text
-+------------+--------------------+-----------------------------------------------+
-| name       | attributes['zone'] | (fs[1]['disks']['available'] / 1.073741824E9) |
-+------------+--------------------+-----------------------------------------------+
+```text
++------------+--------------------+---------------+
+| name       | zone               | available_gb |
++------------+--------------------+---------------+
 | data-hot-5 | us-west-2a         |                            142.3342628479004  |
 ...
-+------------+--------------------+-----------------------------------------------+
++------------+--------------------+---------------+

103-129: CTE filters on incomplete recoveries; use fully recovered shards.

Filter to completed shards and keep unit conversion as-is. Consider guarding division by zero.



```diff
 WITH shard_summary AS (
     SELECT
         node['name'] AS node_name,
         table_name,
         schema_name,
         CASE
             WHEN "primary" = true THEN 'PRIMARY'
             ELSE 'REPLICA'
         END AS shard_type,
         COUNT(*) AS shard_count,
-        SUM(size) / 1024^3 AS total_size_gb
+        SUM(size) / 1024^3 AS total_size_gb
     FROM sys.shards
-    WHERE table_name = 'orderffD'
+    WHERE table_name = 'orderffD'
         AND routing_state = 'STARTED'
-        AND recovery['files']['percent'] = 0
+        AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     GROUP BY node['name'], table_name, schema_name, "primary"
 )
 SELECT
     node_name,
     table_name,
     schema_name,
     shard_type,
     shard_count,
     ROUND(total_size_gb, 2) AS total_size_gb,
-    ROUND(total_size_gb / shard_count, 2) AS avg_shard_size_gb
+    ROUND(total_size_gb / NULLIF(shard_count, 0), 2) AS avg_shard_size_gb
 FROM shard_summary
 ORDER BY node_name, shard_type DESC, total_size_gb DESC;

132-157: Prefer healthy shards in the comprehensive distribution query.

Keep the caret (^) as exponentiation in CrateDB, but filter for fully recovered shards.

 WHERE s.table_name = 'your_table_name'  -- Replace with your specific table name
     AND s.routing_state = 'STARTED'
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0

158-177: Summary query should also filter to fully recovered shards.

Use the same recovery predicate here to avoid skew from in-flight data.

-    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
-    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
+    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
+    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
 ...
 WHERE s.table_name = 'orderffD'  -- Replace with your specific table name
     AND s.routing_state = 'STARTED'
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0

72-79: Fix heading grammar and recovery predicate (selects only healthy shards).

  • Heading: singular “node”, drop shouting “SHARDS”.
  • Predicate: percent = 0 excludes fully recovered shards; use 100 (or >= 100 with COALESCE).
-## List biggest SHARDS on a particular Nodes
+## List biggest shards on a particular node

 ```sql
-select node['name'], table_name, schema_name, id,  sum(size) / 1024^3 from sys.shards
+select node['name'], table_name, schema_name, id, sum(size) / 1024^3 as size_gb
+from sys.shards
     where node['name'] = 'data-hot-2'
     AND routing_state = 'STARTED'
-    AND recovery['files']['percent'] = 0
+    AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     group by 1,2,3,4  order by 5  desc limit 8;

doc/admin/xmover/handbook.md (2)

475-482: Keep threshold consistent and align argument placeholders.

- Threshold header says 85% while defaults are 90%.
- Use the same placeholders as elsewhere: FROM_NODE/TO_NODE.



```diff
-**Disk Usage Threshold (default: 85%)**
+**Disk Usage Threshold (default: 90%)**
 ...
-# For urgent space relief
-xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM> <TO> --max-disk-usage 95
+# For urgent space relief
+xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM_NODE> <TO_NODE> --max-disk-usage 95

131-136: Docs drift: --max-disk-usage default and missing --auto-execute.

The model default is 90.0% (see ShardRelocationConstraints.max_disk_usage), not 85%. Also document the --auto-execute flag used to actually execute the generated SQL in execution mode.

 - `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 85)
+ - `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 90)
  - `--validate/--no-validate`: Validate move safety (default: True)
  - `--prioritize-space/--prioritize-zones`: Prioritize available space over zone balancing (default: False)
- - `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True)
+ - `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True)
+ - `--auto-execute`: Automatically execute the SQL commands (requires `--execute`, asks for confirmation) (default: False)
  - `--node`: Only recommend moves from this specific source node (e.g., data-hot-4)

Run to verify the defaults in code:

#!/bin/bash
# Confirm the default for max_disk_usage and presence of auto-execute in CLI or model.
rg -nP 'class\s+ShardRelocationConstraints\b.*?max_disk_usage\s*:\s*float\s*=\s*([0-9.]+)' -U cratedb_toolkit/admin/xmover/model.py
rg -n 'auto[-_]execute' -n cratedb_toolkit/admin/xmover/cli.py || true
cratedb_toolkit/admin/xmover/operational/recommend.py (4)

326-334: Case-insensitive match for “zone conflict”.

This check is case-sensitive and can miss other capitalizations. Normalize to lower().

-                        if "Zone conflict" in safety_msg:
+                        if "zone conflict" in safety_msg.lower():

126-128: Interpolate variables in the “Check with” hint; include schema for safety.

Placeholders are printed literally. Use an f-string with schema_name/table_name and the actual shard id.

-            console.print(
-                "[dim]# Check with: SELECT * FROM sys.shards "
-                "WHERE table_name = '{table_name}' AND id = {shard_id};[/dim]"
-            )
+            console.print(
+                f"[dim]# Check with: SELECT * FROM sys.shards "
+                f"WHERE schema_name = '{schema_name}' AND table_name = '{table_name}' AND id = {request.shard_id};[/dim]"
+            )

341-344: Honor user-configured max disk usage during auto-execution; avoid hard-coded 95%.

auto-exec re-validates with 95% regardless of constraints, which can run moves the user intended to disallow.

-            if auto_execute:
-                self._execute_recommendations_safely(recommendations, validate)
+            if auto_execute:
+                self._execute_recommendations_safely(
+                    recommendations,
+                    validate,
+                    max_disk_usage_percent=constraints.max_disk_usage,
+                )
@@
-    def _execute_recommendations_safely(self, recommendations, validate: bool):
+    def _execute_recommendations_safely(self, recommendations, validate: bool, max_disk_usage_percent: float):
         """Execute recommendations with extensive safety measures"""
@@
-        if validate:
+        if validate:
             for rec in recommendations:
-                is_safe, safety_msg = self.analyzer.validate_move_safety(rec, max_disk_usage_percent=95.0)
+                is_safe, safety_msg = self.analyzer.validate_move_safety(
+                    rec, max_disk_usage_percent=max_disk_usage_percent
+                )

Also applies to: 355-367


426-435: Treat CrateDB “error” payloads as failures, not successes.

CrateDB can return HTTP 200 with an "error" field. Checking only rowcount can misreport failures.

-                if result.get("rowcount", 0) >= 0:  # Success indicator for ALTER statements
+                if isinstance(result, dict) and result.get("error"):
+                    console.print(f"    [red]❌ FAILED[/red] - {result['error'].get('message', result['error'])}")
+                    failed_moves += 1
+                elif result.get("rowcount", 0) >= 0:  # Success indicator for ALTER statements
                     console.print("    [green]✅ SUCCESS[/green] - Move initiated")
                     successful_moves += 1
                 else:
                     console.print(f"    [red]❌ FAILED[/red] - Unexpected result: {result}")
                     failed_moves += 1
cratedb_toolkit/admin/xmover/util/database.py (3)

147-151: Note: caret (^) is exponentiation in CrateDB; no change needed.

The use of 1024.0^3 is correct in CrateDB SQL (PostgreSQL semantics). Keeping as-is improves consistency with the rest of the docs.


21-41: Use a persistent requests.Session and surface CrateDB “error” fields.

Pooling improves performance, and CrateDB may return 200 with an error payload. Detect and raise on "error".

 class CrateDBClient:
@@
     def __init__(self, connection_string: Optional[str] = None):
         load_dotenv()
@@
+        # Reuse HTTP connections
+        self.session = requests.Session()
@@
-        try:
-            response = requests.post(
-                self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
-            )
+        try:
+            response = self.session.post(
+                self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
+            )
             response.raise_for_status()
-            return response.json()
+            data = response.json()
+            if isinstance(data, dict) and data.get("error"):
+                err = data["error"]
+                msg = err.get("message") if isinstance(err, dict) else str(err)
+                raise Exception(f"CrateDB error: {msg}")
+            return data

Also applies to: 53-60
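
A self-contained sketch of the suggested pattern, assuming CrateDB's HTTP `/_sql` endpoint and its `{"stmt": ..., "args": ...}` payload; `MiniCrateDBClient` is an illustrative name, not the toolkit's actual class.

```python
# Illustrative sketch of the review's suggestion: reuse one requests.Session and
# treat an "error" field in the JSON body as a failure even when HTTP status is 200.
# MiniCrateDBClient is a hypothetical name, not the class from the toolkit.
from typing import Any, Dict, List, Optional

import requests


class MiniCrateDBClient:
    def __init__(self, url: str, username: Optional[str] = None,
                 password: Optional[str] = None, ssl_verify: bool = True) -> None:
        self.url = url  # e.g. "https://localhost:4200/_sql"
        self.auth = (username, password) if username else None
        self.ssl_verify = ssl_verify
        self.session = requests.Session()  # connection pooling across queries

    def execute_query(self, stmt: str, args: Optional[List[Any]] = None) -> Dict[str, Any]:
        payload: Dict[str, Any] = {"stmt": stmt}
        if args:
            payload["args"] = args
        response = self.session.post(
            self.url, json=payload, auth=self.auth, verify=self.ssl_verify, timeout=30
        )
        response.raise_for_status()
        data = response.json()
        # CrateDB can answer 200 OK with an "error" object in the body.
        if isinstance(data, dict) and data.get("error"):
            err = data["error"]
            message = err.get("message") if isinstance(err, dict) else str(err)
            raise RuntimeError(f"CrateDB error: {message}")
        return data
```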


320-345: Include schema in recovery-details filter to disambiguate shards.

Filtering only by table_name and id can select the wrong shard when names are duplicated across schemas.

         FROM sys.shards s
-        WHERE s.table_name = ? AND s.id = ?
+        WHERE s.schema_name = ? AND s.table_name = ? AND s.id = ?
@@
-        result = self.execute_query(query, [table_name, shard_id])
+        result = self.execute_query(query, [schema_name, table_name, shard_id])
🧹 Nitpick comments (4)
doc/admin/xmover/index.md (1)

14-14: Grammar: add missing article (“the largest tables”).

“across largest tables” reads awkwardly; add “the” for clarity.

- - **Shard Distribution Analysis**: Detect and rank distribution anomalies across largest tables
+ - **Shard Distribution Analysis**: Detect and rank distribution anomalies across the largest tables
doc/admin/xmover/queries.md (1)

6-21: Split SQL and sample output into separate code fences; add column aliases.

Mixing SQL and tabular output in one fenced block reduces readability and breaks syntax highlighting. Also, alias computed columns so headers match the output.

 ```sql
-select node['name'], sum(size) / 1024^3, count(id)  from sys.shards  group by 1  order by 1 asc;
-+--------------+-----------------------------+-----------+
-| node['name'] | (sum(size) / 1.073741824E9) | count(id) |
-+--------------+-----------------------------+-----------+
+select
+  node['name'],
+  sum(size) / 1024^3 as total_size_gb,
+  count(id) as shard_count
+from sys.shards
+group by 1
+order by 1 asc;
+```
+```text
++--------------+---------------+-------------+
+| node['name'] | total_size_gb | shard_count |
++--------------+---------------+-------------+
 | data-hot-0   |          1862.5866614403203 |       680 |
 | data-hot-1   |          1866.0331328986213 |       684 |
 | data-hot-2   |          1856.6581886671484 |      1043 |
 | data-hot-3   |          1208.932889252901  |       477 |
 | data-hot-4   |          1861.7727940855548 |       674 |
 | data-hot-5   |          1863.4315695902333 |       744 |
 | data-hot-6   |          1851.3522544233128 |       948 |
 | NULL         |             0.0             |        35 |
-+--------------+-----------------------------+-----------+
++--------------+---------------+-------------+
 SELECT 8 rows in set (0.061 sec)

doc/admin/xmover/handbook.md (1)

59-61: Grammar: “dedicating” → “dedicated to”.

Small wording fix for readability.


```diff
-This view is dedicating a specific focus on large tables.
+This view is dedicated to a specific focus on large tables.
cratedb_toolkit/admin/xmover/analysis/table.py (1)

22-33: Enhance precision for storage size edge cases

The function correctly formats sizes, but for sub-byte values (< 0.001 GB), it displays "0 B" which may mask very small but non-zero values.

Consider preserving precision for very small values:

 def format_storage_size(size_gb: float) -> str:
     """Format storage size with appropriate units and spacing"""
     if size_gb < 0.001:
-        return "0 B"
+        size_bytes = size_gb * 1024 * 1024 * 1024
+        if size_bytes < 1:
+            return f"{size_bytes:.2e} B"
+        else:
+            return f"{size_bytes:.0f} B"
     elif size_gb < 1.0:
         size_mb = size_gb * 1024
         return f"{size_mb:.0f} MB"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 108577c and 5068671.

📒 Files selected for processing (9)
  • cratedb_toolkit/admin/xmover/analysis/table.py (1 hunks)
  • cratedb_toolkit/admin/xmover/cli.py (1 hunks)
  • cratedb_toolkit/admin/xmover/operational/monitor.py (1 hunks)
  • cratedb_toolkit/admin/xmover/operational/recommend.py (1 hunks)
  • cratedb_toolkit/admin/xmover/util/database.py (1 hunks)
  • doc/admin/xmover/handbook.md (1 hunks)
  • doc/admin/xmover/index.md (1 hunks)
  • doc/admin/xmover/queries.md (1 hunks)
  • tests/admin/test_cli.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • cratedb_toolkit/admin/xmover/operational/monitor.py
  • tests/admin/test_cli.py
  • cratedb_toolkit/admin/xmover/cli.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-20T09:20:33.795Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/handbook.md:462-469
Timestamp: 2025-08-20T09:20:33.795Z
Learning: In large documentation files with many existing headings, be less strict about flagging emphasis-as-heading (MD036) issues, as emphasis can serve valid purposes for minor subsections and clarifications rather than requiring full heading treatment.

Applied to files:

  • doc/admin/xmover/handbook.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, following PostgreSQL conventions. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries. Do not flag this as incorrect.

Applied to files:

  • doc/admin/xmover/queries.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, similar to PostgreSQL. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries.

Applied to files:

  • doc/admin/xmover/queries.md
🧬 Code graph analysis (3)
cratedb_toolkit/admin/xmover/util/database.py (2)
cratedb_toolkit/admin/xmover/model.py (5)
  • NodeInfo (7-29)
  • RecoveryInfo (55-102)
  • ShardInfo (33-51)
  • size_gb (81-83)
  • overall_progress (76-78)
cratedb_toolkit/admin/xmover/cli.py (1)
  • test_connection (153-175)
cratedb_toolkit/admin/xmover/operational/recommend.py (5)
cratedb_toolkit/admin/xmover/analysis/shard.py (2)
  • validate_move_safety (373-401)
  • generate_rebalancing_recommendations (191-371)
cratedb_toolkit/admin/xmover/model.py (8)
  • ShardRelocationConstraints (177-186)
  • ShardRelocationRequest (106-113)
  • ShardRelocationResponse (117-153)
  • shard_type (50-51)
  • shard_type (86-87)
  • size_gb (81-83)
  • to_sql (131-137)
  • available_space_gb (28-29)
cratedb_toolkit/admin/xmover/operational/monitor.py (3)
  • RecoveryMonitor (24-384)
  • RecoveryOptions (16-21)
  • get_cluster_recovery_status (31-43)
cratedb_toolkit/admin/xmover/util/database.py (2)
  • CrateDBClient (18-498)
  • execute_query (42-60)
cratedb_toolkit/admin/xmover/util/format.py (1)
  • format_size (1-8)
cratedb_toolkit/admin/xmover/analysis/table.py (1)
cratedb_toolkit/admin/xmover/util/database.py (2)
  • execute_query (42-60)
  • get_nodes_info (62-96)
🪛 LanguageTool
doc/admin/xmover/index.md

[grammar] ~8-~8: There might be a mistake here.
Context: ...nd availability zones. It generates safe SQL commands for shard rebalancing and n...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...hard distribution across nodes and zones - Shard Distribution Analysis: Detect an...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...ribution anomalies across largest tables - Shard Movement Recommendations: Intell...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...s for rebalancing with safety validation - Recovery Monitoring: Track ongoing sha...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...ecovery operations with progress details - Zone Conflict Detection: Prevents move...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...t would violate CrateDB's zone awareness - Node Decommissioning: Plan safe node r...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ... removal with automated shard relocation - Dry Run Mode: Test recommendations wit...

(QB_NEW_EN)


[grammar] ~19-~19: There might be a mistake here.
Context: ...s without generating actual SQL commands - Safety Validation: Comprehensive check...

(QB_NEW_EN)

doc/admin/xmover/handbook.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-handbook)= # XMover Handbook ## Installation Instal...

(QB_NEW_EN)


[grammar] ~59-~59: There might be a mistake here.
Context: ...ace ``` ### Shard Distribution Analysis This view is dedicating a specific focus...

(QB_NEW_EN)


[grammar] ~95-~95: There might be a mistake here.
Context: ...tion across nodes and zones. Options: - --table, -t: Analyze specific table only **Example...

(QB_NEW_EN)


[grammar] ~106-~106: There might be a mistake here.
Context: ...on size and health criteria. Options: - --table, -t: Find candidates in specific table only...

(QB_NEW_EN)


[grammar] ~107-~107: There might be a mistake here.
Context: ...: Find candidates in specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~108-~108: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~109-~109: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --node: Only show candidates from this specifi...

(QB_NEW_EN)


[grammar] ~124-~124: There might be a mistake here.
Context: ...ons for cluster rebalancing. Options: - --table, -t: Generate recommendations for specific ...

(QB_NEW_EN)


[grammar] ~125-~125: There might be a mistake here.
Context: ... recommendations for specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~126-~126: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~127-~127: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --zone-tolerance: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~128-~128: There might be a mistake here.
Context: ...lance tolerance percentage (default: 10) - --min-free-space: Minimum free space required on target ...

(QB_NEW_EN)


[grammar] ~129-~129: There might be a mistake here.
Context: ...red on target nodes in GB (default: 100) - --max-moves: Maximum number of move recommendations...

(QB_NEW_EN)


[grammar] ~130-~130: There might be a mistake here.
Context: ...er of move recommendations (default: 10) - --max-disk-usage: Maximum disk usage percentage for targ...

(QB_NEW_EN)


[grammar] ~131-~131: There might be a mistake here.
Context: ...ercentage for target nodes (default: 85) - --validate/--no-validate: Validate move safety (default: True) -...

(QB_NEW_EN)


[grammar] ~132-~132: There might be a mistake here.
Context: ...e: Validate move safety (default: True) - --prioritize-space/--prioritize-zones`: Prioritize available space over zone b...

(QB_NEW_EN)


[grammar] ~133-~133: There might be a mistake here.
Context: ...ace over zone balancing (default: False) - --dry-run/--execute: Show what would be done without genera...

(QB_NEW_EN)


[grammar] ~134-~134: There might be a mistake here.
Context: ... generating SQL commands (default: True) - --node: Only recommend moves from this specifi...

(QB_NEW_EN)


[grammar] ~158-~158: There might be a mistake here.
Context: ...ion and potential conflicts. Options: - --table, -t: Analyze zones for specific table only ...

(QB_NEW_EN)


[grammar] ~159-~159: There might be a mistake here.
Context: ...: Analyze zones for specific table only - --show-shards/--no-show-shards`: Show individual shard details (default...

(QB_NEW_EN)


[grammar] ~170-~170: There might be a mistake here.
Context: ...with configurable tolerance. Options: - --table, -t: Check balance for specific table only ...

(QB_NEW_EN)


[grammar] ~171-~171: There might be a mistake here.
Context: ...: Check balance for specific table only - --tolerance`: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~184-~184: There might be a mistake here.
Context: ...ecution to prevent errors. Arguments: - SCHEMA_TABLE: Schema and table name (format: schema....

(QB_NEW_EN)


[grammar] ~185-~185: There might be a mistake here.
Context: ...ma and table name (format: schema.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source...

(QB_NEW_EN)


[grammar] ~186-~186: There might be a mistake here.
Context: ...ma.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source node name - TO_NODE: Target n...

(QB_NEW_EN)


[grammar] ~187-~187: There might be a mistake here.
Context: ... to move - FROM_NODE: Source node name - TO_NODE: Target node name Examples: ```bas...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ... troubleshooting guidance. Arguments: - ERROR_MESSAGE: The CrateDB error message to analyze (...

(QB_NEW_EN)


[grammar] ~217-~217: There might be a mistake here.
Context: ...y operations on the cluster. Options: - --table, -t: Monitor recovery for specific table on...

(QB_NEW_EN)


[grammar] ~218-~218: There might be a mistake here.
Context: ...Monitor recovery for specific table only - --node, -n: Monitor recovery on specific node only...

(QB_NEW_EN)


[grammar] ~219-~219: There might be a mistake here.
Context: ...: Monitor recovery on specific node only - --watch, -w: Continuously monitor (refresh every 10...

(QB_NEW_EN)


[grammar] ~220-~220: There might be a mistake here.
Context: ...Continuously monitor (refresh every 10s) - --refresh-interval: Refresh interval for watch mode in sec...

(QB_NEW_EN)


[grammar] ~221-~221: There might be a mistake here.
Context: ... for watch mode in seconds (default: 10) - --recovery-type: Filter by recovery type - PEER, DISK, ...

(QB_NEW_EN)


[grammar] ~222-~222: There might be a mistake here.
Context: ...type - PEER, DISK, or all (default: all) - --include-transitioning: Include recently completed recoveries ...

(QB_NEW_EN)


[grammar] ~243-~243: There might be a mistake here.
Context: ...ude-transitioning ``` Recovery Types: - PEER: Copying shard data from another ...

(QB_NEW_EN)


[grammar] ~244-~244: There might be a mistake here.
Context: ...om another node (replication/relocation) - DISK: Rebuilding shard from local data...

(QB_NEW_EN)


[grammar] ~377-~377: There might be a mistake here.
Context: ...TRING: CrateDB HTTP endpoint (required) - CRATE_USERNAME`: Username for authentication (optional)...

(QB_NEW_EN)


[grammar] ~378-~378: There might be a mistake here.
Context: ...: Username for authentication (optional) - CRATE_PASSWORD: Password for authentication (optional)...

(QB_NEW_EN)


[grammar] ~379-~379: There might be a mistake here.
Context: ...: Password for authentication (optional) - CRATE_SSL_VERIFY: Enable SSL certificate verification (d...

(QB_NEW_EN)


[grammar] ~434-~434: There might be a mistake here.
Context: ...zone awareness prevents too many copies in same zone - Solution: Move shard...

(QB_NEW_EN)


[grammar] ~435-~435: There might be a mistake here.
Context: ... copies in same zone - Solution: Move shard to a different availability zone ...

(QB_NEW_EN)


[grammar] ~443-~443: There might be a mistake here.
Context: ...ufficient free space - Solution: Choose node with more capacity or free up spac...

(QB_NEW_EN)


[grammar] ~454-~454: There might be a mistake here.
Context: ...pace` 5. No Recommendations Generated - Cause: Cluster may already be well bal...

(QB_NEW_EN)


[grammar] ~484-~484: There might be a mistake here.
Context: ...age 95 ``` When to Adjust Thresholds: - Emergency situations: Increase to 90-9...

(QB_NEW_EN)


[style] ~485-~485: ‘Emergency situations’ might be wordy. Consider a shorter alternative.
Context: ...``` When to Adjust Thresholds: - Emergency situations: Increase to 90-95% for critical spac...

(EN_WORDINESS_PREMIUM_EMERGENCY_SITUATIONS)


[grammar] ~487-~487: There might be a mistake here.
Context: ...onments**: Can be more aggressive (90%+) - Production: Keep conservative (80-85%)...

(QB_NEW_EN)


[grammar] ~492-~492: There might be a mistake here.
Context: ...f}xmover-troubleshooting which covers: - Step-by-step diagnostic procedures - Eme...

(QB_NEW_EN)


[grammar] ~493-~493: There might be a mistake here.
Context: ...rs: - Step-by-step diagnostic procedures - Emergency recovery procedures - Best pra...

(QB_NEW_EN)


[grammar] ~494-~494: There might be a mistake here.
Context: ...ocedures - Emergency recovery procedures - Best practices for safe operations - Com...

(QB_NEW_EN)


[grammar] ~495-~495: There might be a mistake here.
Context: ...res - Best practices for safe operations - Complete error reference guide ### Debu...

(QB_NEW_EN)

doc/admin/xmover/queries.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-queries)= # XMover Query Gallery ## Shard Distribut...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
doc/admin/xmover/handbook.md

220-220: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


249-249: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-and-publish
  • GitHub Check: build-and-publish
  • GitHub Check: Cloud: Python 3.13 on OS ubuntu-latest
🔇 Additional comments (6)
cratedb_toolkit/admin/xmover/analysis/table.py (6)

121-140: Well-structured CrateDB query with proper aggregation

The query correctly aggregates shard data and performs the necessary calculations for distribution analysis.


249-251: Robust coefficient of variation calculations

Good use of the helper method for calculating CV values for multiple metrics.


432-445: Solid statistical calculations with proper error handling

The coefficient of variation calculation properly handles edge cases including zero mean and insufficient data points.
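
A minimal standalone sketch of such a coefficient-of-variation helper with guards for zero mean and fewer than two samples; the function name and the sample shard counts are illustrative, not taken from `table.py`.

```python
# Minimal sketch: coefficient of variation with guards for zero mean and
# insufficient data. The helper name and sample values are illustrative only.
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Return stdev/mean as a percentage; 0.0 when it is not meaningful."""
    if len(values) < 2:
        return 0.0  # CV needs at least two samples
    mean = statistics.mean(values)
    if mean == 0:
        return 0.0  # avoid division by zero for all-zero distributions
    return (statistics.stdev(values) / mean) * 100.0


# Example: shard counts per node
print(coefficient_of_variation([680, 684, 1043, 477, 674, 744, 948]))
```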


654-676: Clean separation of concerns in anomaly detection

Excellent design pattern using separate detector methods for different anomaly types, making the code modular and maintainable.
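
A hypothetical sketch of that pattern: one detector method per anomaly type, chained by a thin orchestrator; the class and method names are illustrative, not the ones in `analysis/table.py`.

```python
# Hypothetical sketch of the pattern praised above: each anomaly type gets its own
# detector method, and the orchestrator simply chains them.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Anomaly:
    kind: str
    details: Dict[str, float]


class AnomalyDetector:
    def detect(self, node_sizes_gb: Dict[str, float]) -> List[Anomaly]:
        detectors: List[Callable[[Dict[str, float]], List[Anomaly]]] = [
            self._detect_storage_imbalance,
            self._detect_empty_nodes,
        ]
        anomalies: List[Anomaly] = []
        for detector in detectors:
            anomalies.extend(detector(node_sizes_gb))
        return anomalies

    def _detect_storage_imbalance(self, node_sizes_gb: Dict[str, float]) -> List[Anomaly]:
        if not node_sizes_gb:
            return []
        max_size, min_size = max(node_sizes_gb.values()), min(node_sizes_gb.values())
        if min_size > 0 and max_size > min_size * 2:
            return [Anomaly("storage_imbalance", {"max_gb": max_size, "min_gb": min_size})]
        return []

    def _detect_empty_nodes(self, node_sizes_gb: Dict[str, float]) -> List[Anomaly]:
        return [Anomaly("empty_node", {name: 0.0}) for name, size in node_sizes_gb.items() if size == 0]
```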


637-638: Potential division by zero in document imbalance detection

When min_docs is 0, the condition max_docs > min_docs * 3 would not work as intended.

Add protection against zero document counts:

-        if max_docs > min_docs * 3:  # If difference is > 3x
+        if min_docs > 0 and max_docs > min_docs * 3:  # If difference is > 3x
             recommendations.append(f"Significant document skew detected ({min_docs:,} to {max_docs:,} docs per node)")
+        elif min_docs == 0 and max_docs > 0:
+            recommendations.append(f"Some nodes have no documents while others have {max_docs:,} - check shard routing")

Likely an incorrect or invalid review comment.


532-546: Potential division by zero in storage imbalance detection

When min_size is 0 (which can happen for nodes with no shards), the comparison max_size > min_size * 2 would evaluate incorrectly.

Add a check to prevent issues with zero minimum size:

         if max_size > min_size * 2:  # If difference is > 2x
+            # Skip if min_size is effectively zero
+            if min_size < 0.001:
+                recommendations.append(
+                    f"Some nodes have no storage - consider distributing shards more evenly"
+                )
+            else:
                 overloaded_node = None
                 underloaded_node = None
 
                 for node, data in table.node_distributions.items():
                     if data["total_size_gb"] == max_size:
                         overloaded_node = node
                     elif data["total_size_gb"] == min_size:
                         underloaded_node = node
 
                 if overloaded_node and underloaded_node:
                     recommendations.append(
                         f"Rebalance storage from {overloaded_node} ({format_storage_size(max_size)}) "
                         f"to {underloaded_node} ({format_storage_size(min_size)})"
                     )

Likely an incorrect or invalid review comment.

Comment on lines +71 to +109
def find_table_by_name(self, table_name: str) -> Optional[str]:
"""Find table by name and resolve schema ambiguity"""

query = """
SELECT DISTINCT schema_name, table_name
FROM sys.shards
WHERE table_name = ?
AND schema_name NOT IN ('sys', 'information_schema', 'pg_catalog')
AND routing_state = 'STARTED'
ORDER BY schema_name \
"""

result = self.client.execute_query(query, [table_name])
rows = result.get("rows", [])

if not rows:
return None
elif len(rows) == 1:
schema, table = rows[0]
return f"{schema}.{table}" if schema != "doc" else table
else:
# Multiple schemas have this table - ask user
rprint(f"[yellow]Multiple schemas contain table '{table_name}':[/yellow]")
for i, (schema, table) in enumerate(rows, 1):
full_name = f"{schema}.{table}" if schema != "doc" else table
rprint(f" {i}. {full_name}")

try:
choice = input("\nSelect table (enter number): ").strip()
idx = int(choice) - 1
if 0 <= idx < len(rows):
schema, table = rows[idx]
return f"{schema}.{table}" if schema != "doc" else table
else:
rprint("[red]Invalid selection[/red]")
return None
except (ValueError, KeyboardInterrupt):
rprint("\n[yellow]Selection cancelled[/yellow]")
return None

🛠️ Refactor suggestion

User input validation needed in interactive selection

When prompting for table selection, the function lacks input validation for edge cases.

Add validation for empty input and better error handling:

         try:
             choice = input("\nSelect table (enter number): ").strip()
+            if not choice:
+                rprint("[yellow]No selection made[/yellow]")
+                return None
             idx = int(choice) - 1
             if 0 <= idx < len(rows):
                 schema, table = rows[idx]
                 return f"{schema}.{table}" if schema != "doc" else table
             else:
                 rprint("[red]Invalid selection[/red]")
                 return None
         except (ValueError, KeyboardInterrupt):
             rprint("\n[yellow]Selection cancelled[/yellow]")
             return None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def find_table_by_name(self, table_name: str) -> Optional[str]:
"""Find table by name and resolve schema ambiguity"""
query = """
SELECT DISTINCT schema_name, table_name
FROM sys.shards
WHERE table_name = ?
AND schema_name NOT IN ('sys', 'information_schema', 'pg_catalog')
AND routing_state = 'STARTED'
ORDER BY schema_name \
"""
result = self.client.execute_query(query, [table_name])
rows = result.get("rows", [])
if not rows:
return None
elif len(rows) == 1:
schema, table = rows[0]
return f"{schema}.{table}" if schema != "doc" else table
else:
# Multiple schemas have this table - ask user
rprint(f"[yellow]Multiple schemas contain table '{table_name}':[/yellow]")
for i, (schema, table) in enumerate(rows, 1):
full_name = f"{schema}.{table}" if schema != "doc" else table
rprint(f" {i}. {full_name}")
try:
choice = input("\nSelect table (enter number): ").strip()
idx = int(choice) - 1
if 0 <= idx < len(rows):
schema, table = rows[idx]
return f"{schema}.{table}" if schema != "doc" else table
else:
rprint("[red]Invalid selection[/red]")
return None
except (ValueError, KeyboardInterrupt):
rprint("\n[yellow]Selection cancelled[/yellow]")
return None
try:
choice = input("\nSelect table (enter number): ").strip()
if not choice:
rprint("[yellow]No selection made[/yellow]")
return None
idx = int(choice) - 1
if 0 <= idx < len(rows):
schema, table = rows[idx]
return f"{schema}.{table}" if schema != "doc" else table
else:
rprint("[red]Invalid selection[/red]")
return None
except (ValueError, KeyboardInterrupt):
rprint("\n[yellow]Selection cancelled[/yellow]")
return None
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analysis/table.py around lines 71 to 109, the
interactive selection branch does not validate empty input or handle EOF/blank
entries and gives only a single try before returning; update it to robustly
validate user input: after printing options, loop prompting the user until a
valid numeric selection or explicit cancel (empty input or Ctrl-D/EOF treated as
cancel), strip and check for empty string before int conversion, catch
ValueError and continue the loop with an "Invalid selection" message, handle
KeyboardInterrupt/EOFError by printing "Selection cancelled" and returning None,
and ensure out-of-range integers are rejected and prompt repeats rather than
immediately returning None.
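
A small sketch of the loop-until-valid prompt described in this hint, assuming a list of `(schema, table)` rows like the one built by `find_table_by_name`; it illustrates the suggested behavior rather than the committed implementation.

```python
# Sketch of the prompt loop suggested above: keep asking until the user enters a
# valid number, and treat empty input, Ctrl-C, or Ctrl-D as cancellation.
from typing import List, Optional, Tuple


def prompt_for_table(rows: List[Tuple[str, str]]) -> Optional[str]:
    for i, (schema, table) in enumerate(rows, 1):
        print(f"  {i}. {schema}.{table}")
    while True:
        try:
            choice = input("\nSelect table (enter number, empty to cancel): ").strip()
        except (KeyboardInterrupt, EOFError):
            print("\nSelection cancelled")
            return None
        if not choice:
            print("Selection cancelled")
            return None
        try:
            idx = int(choice) - 1
        except ValueError:
            print("Invalid selection, please enter a number")
            continue
        if 0 <= idx < len(rows):
            schema, table = rows[idx]
            return f"{schema}.{table}" if schema != "doc" else table
        print("Selection out of range, try again")
```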

Comment on lines +292 to +320
zone_distribution = {}
for node_name, node_data in table_dist.node_distributions.items():
# Try to get zone info for each node
node_info = next((n for n in all_nodes_info if n.name == node_name), None)
if (
node_info
and hasattr(node_info, "attributes")
and node_info.attributes
and "zone" in node_info.attributes
):
zone = node_info.attributes["zone"]
if zone not in zone_distribution:
zone_distribution[zone] = {"nodes": 0, "shards": 0, "size": 0}
zone_distribution[zone]["nodes"] += 1
zone_distribution[zone]["shards"] += node_data["total_shards"]
zone_distribution[zone]["size"] += node_data["total_size_gb"]

if zone_distribution:
rprint("\n[bold]🌍 Zone Distribution[/bold]")
for zone in sorted(zone_distribution.keys()):
zone_data = zone_distribution[zone]
rprint(
f"• {zone}: {zone_data['nodes']} nodes, "
f"{zone_data['shards']} shards, {format_storage_size(zone_data['size'])}"
)

except Exception:
# Zone info not available
logger.exception("Zone info not available")

💡 Verification agent

🧩 Analysis chain

Zone analysis implementation has structural issues

The zone analysis attempts to access node_info.attributes directly, but based on the NodeInfo model from the imported database client, zone information is stored as a direct property, not within an attributes dictionary.

The current implementation may not work correctly. Let me verify the NodeInfo structure:

Based on the database.py implementation, the zone is stored directly as node.zone, not in node.attributes['zone']. Apply this fix:

-                if (
-                    node_info
-                    and hasattr(node_info, "attributes")
-                    and node_info.attributes
-                    and "zone" in node_info.attributes
-                ):
-                    zone = node_info.attributes["zone"]
+                if node_info and hasattr(node_info, "zone"):
+                    zone = node_info.zone
                     if zone not in zone_distribution:
                         zone_distribution[zone] = {"nodes": 0, "shards": 0, "size": 0}

🏁 Script executed:

#!/bin/bash
# Check the NodeInfo model structure to understand zone attribute handling
ast-grep --pattern 'class NodeInfo'

Length of output: 1552


🏁 Script executed:

#!/bin/bash
# Also check how zone is accessed in the get_nodes_info method
rg -A 10 "attributes\['zone'\]" cratedb_toolkit/admin/xmover/util/database.py

Length of output: 1595


Fix zone attribute lookup to use NodeInfo.zone

The NodeInfo model (cratedb_toolkit/admin/xmover/model.py:10–13) defines zone as a direct property, not inside an attributes dict. The existing condition in table.py (lines 292–300) will never succeed and thus never record per-zone metrics.

Please update the block in cratedb_toolkit/admin/xmover/analysis/table.py:

• File: cratedb_toolkit/admin/xmover/analysis/table.py
• Lines: ~292–300

-            for node_name, node_data in table_dist.node_distributions.items():
-                # Try to get zone info for each node
-                node_info = next((n for n in all_nodes_info if n.name == node_name), None)
-                if (
-                    node_info
-                    and hasattr(node_info, "attributes")
-                    and node_info.attributes
-                    and "zone" in node_info.attributes
-                ):
-                    zone = node_info.attributes["zone"]
+            for node_name, node_data in table_dist.node_distributions.items():
+                # Try to get zone info for each node
+                node_info = next((n for n in all_nodes_info if n.name == node_name), None)
+                # NodeInfo.zone is a top-level field
+                zone = getattr(node_info, "zone", None) if node_info else None
+                if zone:

This ensures zone_distribution is correctly populated based on the actual zone property on NodeInfo.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
zone_distribution = {}
for node_name, node_data in table_dist.node_distributions.items():
# Try to get zone info for each node
node_info = next((n for n in all_nodes_info if n.name == node_name), None)
if (
node_info
and hasattr(node_info, "attributes")
and node_info.attributes
and "zone" in node_info.attributes
):
zone = node_info.attributes["zone"]
if zone not in zone_distribution:
zone_distribution[zone] = {"nodes": 0, "shards": 0, "size": 0}
zone_distribution[zone]["nodes"] += 1
zone_distribution[zone]["shards"] += node_data["total_shards"]
zone_distribution[zone]["size"] += node_data["total_size_gb"]
if zone_distribution:
rprint("\n[bold]🌍 Zone Distribution[/bold]")
for zone in sorted(zone_distribution.keys()):
zone_data = zone_distribution[zone]
rprint(
f"• {zone}: {zone_data['nodes']} nodes, "
f"{zone_data['shards']} shards, {format_storage_size(zone_data['size'])}"
)
except Exception:
# Zone info not available
logger.exception("Zone info not available")
zone_distribution = {}
for node_name, node_data in table_dist.node_distributions.items():
# Try to get zone info for each node
node_info = next((n for n in all_nodes_info if n.name == node_name), None)
# NodeInfo.zone is a top-level field
zone = getattr(node_info, "zone", None) if node_info else None
if zone:
if zone not in zone_distribution:
zone_distribution[zone] = {"nodes": 0, "shards": 0, "size": 0}
zone_distribution[zone]["nodes"] += 1
zone_distribution[zone]["shards"] += node_data["total_shards"]
zone_distribution[zone]["size"] += node_data["total_size_gb"]
if zone_distribution:
rprint("\n[bold]🌍 Zone Distribution[/bold]")
for zone in sorted(zone_distribution.keys()):
zone_data = zone_distribution[zone]
rprint(
f"• {zone}: {zone_data['nodes']} nodes, "
f"{zone_data['shards']} shards, {format_storage_size(zone_data['size'])}"
)
except Exception:
# Zone info not available
logger.exception("Zone info not available")
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analysis/table.py around lines 292 to 300, the
code attempts to read node_info.attributes["zone"] but NodeInfo exposes zone as
a direct property; update the block to read node_info.zone (guarding for
node_info and non-empty zone) instead of checking attributes dict, then use that
zone value to initialize and increment zone_distribution entries (nodes, shards,
size) as currently done.
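
For illustration, a compact sketch of the corrected per-zone aggregation reading the top-level `zone` attribute; the `NodeInfo` stand-in and the sample figures are hypothetical.

```python
# Sketch of per-zone aggregation reading NodeInfo.zone directly, as recommended
# above. The dataclass and the sample numbers are hypothetical stand-ins.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


@dataclass
class NodeInfo:
    name: str
    zone: str


nodes = [NodeInfo("data-hot-0", "us-west-2a"), NodeInfo("data-hot-1", "us-west-2c")]
node_distributions = {
    "data-hot-0": {"total_shards": 680, "total_size_gb": 1862.6},
    "data-hot-1": {"total_shards": 684, "total_size_gb": 1866.0},
}

zone_distribution: Dict[str, Dict[str, float]] = defaultdict(
    lambda: {"nodes": 0, "shards": 0, "size": 0.0}
)
for node_name, node_data in node_distributions.items():
    node_info = next((n for n in nodes if n.name == node_name), None)
    zone = getattr(node_info, "zone", None) if node_info else None
    if zone:
        zone_distribution[zone]["nodes"] += 1
        zone_distribution[zone]["shards"] += node_data["total_shards"]
        zone_distribution[zone]["size"] += node_data["total_size_gb"]

for zone, data in sorted(zone_distribution.items()):
    print(f"{zone}: {data['nodes']} nodes, {data['shards']} shards, {data['size']:.1f} GB")
```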

Comment on lines +775 to +779
ratio = max_docs / min_docs if min_docs > 0 else float("inf")
rprint(
f" [red]⚠ Issue:[/red] Document counts range "
f"from {min_docs:,} to {max_docs:,} ({ratio:.1f}x difference)"
)

🛠️ Refactor suggestion

Handle potential division by zero in document ratio calculation

The ratio calculation could fail if min_docs is 0, even though there's a check for infinity.

Make the division-by-zero handling more explicit:

                         doc_counts = anomaly.details["document_counts"]
                         min_docs = min(doc_counts.values())
                         max_docs = max(doc_counts.values())
-                        ratio = max_docs / min_docs if min_docs > 0 else float("inf")
+                        if min_docs > 0:
+                            ratio = max_docs / min_docs
+                            ratio_text = f"{ratio:.1f}x difference"
+                        else:
+                            ratio_text = "infinite difference (some nodes have 0 docs)"
                         rprint(
                             f"   [red]⚠  Issue:[/red] Document counts range "
-                            f"from {min_docs:,} to {max_docs:,} ({ratio:.1f}x difference)"
+                            f"from {min_docs:,} to {max_docs:,} ({ratio_text})"
                         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ratio = max_docs / min_docs if min_docs > 0 else float("inf")
rprint(
f" [red]⚠ Issue:[/red] Document counts range "
f"from {min_docs:,} to {max_docs:,} ({ratio:.1f}x difference)"
)
doc_counts = anomaly.details["document_counts"]
min_docs = min(doc_counts.values())
max_docs = max(doc_counts.values())
if min_docs > 0:
ratio = max_docs / min_docs
ratio_text = f"{ratio:.1f}x difference"
else:
ratio_text = "infinite difference (some nodes have 0 docs)"
rprint(
f" [red]⚠ Issue:[/red] Document counts range "
f"from {min_docs:,} to {max_docs:,} ({ratio_text})"
)
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analysis/table.py around lines 775 to 779, the
ratio computation uses max_docs / min_docs with only an inline conditional
leading to potential divide-by-zero and unsafe formatting of infinity; change
the logic to explicitly check if min_docs == 0, set ratio to float("inf") in
that branch, and when building the rprint string guard the formatted output so
you don't try to format infinity with a numeric format (e.g., display "infinite"
or "N/A" instead of using {ratio:.1f} when ratio is infinite); ensure the
printed message remains the same for finite ratios and handles zero min_docs
cleanly.

Comment on lines +222 to +249
´´´sql
WITH largest_tables AS (
SELECT
schema_name,
table_name,
SUM(CASE WHEN "primary" = true THEN size ELSE 0 END) as total_primary_size
FROM sys.shards
WHERE schema_name NOT IN ('sys', 'information_schema', 'pg_catalog')
GROUP BY schema_name, table_name
ORDER BY total_primary_size DESC
LIMIT 10
)
SELECT
s.schema_name,
s.table_name,
s.node['name'] as node_name,
COUNT(CASE WHEN s."primary" = true THEN 1 END) as primary_shards,
COUNT(CASE WHEN s."primary" = false THEN 1 END) as replica_shards,
COUNT(*) as total_shards,
ROUND(SUM(s.size) / 1024.0 / 1024.0 / 1024.0, 2) as total_size_gb,
ROUND(SUM(CASE WHEN s."primary" = true THEN s.size ELSE 0 END) / 1024.0 / 1024.0 / 1024.0, 2) as primary_size_gb,
ROUND(SUM(CASE WHEN s."primary" = false THEN s.size ELSE 0 END) / 1024.0 / 1024.0 / 1024.0, 2) as replica_size_gb,
SUM(s.num_docs) as total_documents
FROM sys.shards s
INNER JOIN largest_tables lt ON (s.schema_name = lt.schema_name AND s.table_name = lt.table_name)
GROUP BY s.schema_name, s.table_name, s.node['name']
ORDER BY s.schema_name, s.table_name, s.node['name'];
```

⚠️ Potential issue

Broken code fence uses accent characters; replace with backticks.

The block starts with ´´´sql instead of ```sql, which breaks rendering.

-´´´sql
+```sql
 ...
-```
+```
🤖 Prompt for AI Agents
In doc/admin/xmover/queries.md around lines 222 to 249, the SQL code fence opens
with accent characters (´´´sql) which breaks rendering; replace the opening
fence with standard backticks (```sql) so the fence matches the closing ``` and
the block renders correctly, ensuring no other accented quotes remain and the
language tag is preserved.

@amotl amotl mentioned this pull request Aug 21, 2025