Skip to content

Conversation

thangckt
Copy link
Contributor

@thangckt thangckt commented Aug 27, 2025

So far, option to control number of resubmit fail jobs is partially implemented and inconsistent in different Machines.
This PR try to address this problem, as also discussed in issue #525

thangckt and others added 30 commits March 26, 2024 16:25
- add option: "strategy": {"customized_script_header_template_file": ""},

- add option: `sge_pe_name`
Copy link
Contributor

coderabbitai bot commented Aug 27, 2025

Warning

Rate limit exceeded

@thangckt has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 1 minutes and 9 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 6b244da and e897d8a.

📒 Files selected for processing (1)
  • tests/test_argcheck.py (1 hunks)
📝 Walkthrough

Walkthrough

Adds a configurable retry_count parameter to Machine, exposes it via arginfo, and stores it as an instance attribute. Removes subclass-specific initialization of retry_count in dp_cloud_server.Bohrium and OpenAPI (commented out). Adds several test fixture files containing a relative path to graph.pb.

Changes

Cohort / File(s) Summary
Core Machine API
dpdispatcher/machine.py
Adds keyword-only retry_count=3 to Machine.__init__, stores self.retry_count, and exposes it via arginfo() with doc_retry_count. No control-flow changes.
Cloud/OpenAPI Machines
dpdispatcher/machines/dp_cloud_server.py, dpdispatcher/machines/openapi.py
Removes assignment of self.retry_count from remote_profile (lines commented). No other initialization changes.
Test fixtures (graph.pb references)
tests/test_context_dir/0_md/graph.pb, tests/test_hdfs_dir/0_md/graph.pb, tests/test_lsf_dir/0_md/graph.pb, tests/test_pbs_dir/0_md/graph.pb, tests/test_shell_trival_dir/parent_dir/graph.pb, tests/test_slurm_dir/0_md/graph.pb
Adds files each containing a single line ../../graph.pb (no trailing newline). Test data only; no code changes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

codecov bot commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 47.37%. Comparing base (8b98b36) to head (e897d8a).

❗ There is a different number of reports uploaded between BASE (8b98b36) and HEAD (e897d8a). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (8b98b36) HEAD (e897d8a)
15 14
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #545       +/-   ##
===========================================
- Coverage   60.15%   47.37%   -12.78%     
===========================================
  Files          39       39               
  Lines        3905     3905               
===========================================
- Hits         2349     1850      -499     
- Misses       1556     2055      +499     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dpdispatcher/machine.py (1)

78-101: Ensure retry_count is persisted and restored across (de)serialization.

retry_count is exposed via arginfo and set in init, but Machine.serialize/load_from_dict neither save nor rehydrate it. This drops user configuration on save/load and undermines centralization.

Apply:

 def load_from_dict(cls, machine_dict):
@@
-        context = BaseContext.load_from_dict(machine_dict)
-        machine = machine_class(context=context)
+        context = BaseContext.load_from_dict(machine_dict)
+        machine = machine_class(context=context)
+        # propagate retry_count even if subclass overrides __init__
+        if "retry_count" in machine_dict:
+            setattr(machine, "retry_count", machine_dict["retry_count"])
         return machine
 def serialize(self, if_empty_remote_profile=False):
     machine_dict = {}
     machine_dict["batch_type"] = self.__class__.__name__
     machine_dict["context_type"] = self.context.__class__.__name__
     machine_dict["local_root"] = self.context.init_local_root
     machine_dict["remote_root"] = self.context.init_remote_root
+    machine_dict["retry_count"] = getattr(self, "retry_count", 3)
     if not if_empty_remote_profile:
         machine_dict["remote_profile"] = self.context.remote_profile
     else:
         machine_dict["remote_profile"] = {}
     # normalize the dict
     base = self.arginfo()
     machine_dict = base.normalize_value(machine_dict, trim_pattern="_*")
     return machine_dict

Optionally clarify docs:

-        doc_retry_count = "Number of retries to resubmit failed jobs."
+        doc_retry_count = "Maximum number of retries to resubmit failed jobs (0 = no retries)."

Also applies to: 401-420, 152-170

🧹 Nitpick comments (6)
dpdispatcher/machines/openapi.py (1)

31-43: Call Machine.init to initialize retry_count; drop stale commented code.

Subclasses overriding init should invoke the base initializer so Machine-level attributes (e.g., retry_count) are consistently set. This also avoids relying on hasattr fallbacks elsewhere and aligns with the PR goal of centralizing this option.

Apply:

 class OpenAPI(Machine):
     def __init__(self, context):
         if not found_bohriumsdk:
             raise ModuleNotFoundError(
                 "bohriumsdk not installed. Install dpdispatcher with `pip install dpdispatcher[bohrium]`"
             )
-        self.context = context
+        super().__init__(context=context)
         self.remote_profile = context.remote_profile.copy()

         self.grouped = self.remote_profile.get("grouped", True)
-        # self.retry_count = self.remote_profile.get("retry_count", 3)
         self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
dpdispatcher/machine.py (1)

321-336: Guard errlog None when constructing last_err_file tail command.

Task.errlog allows None; formatting err_file unconditionally via shlex.quote(task.errlog) risks a TypeError or invalid tail paths. Consider a safe fallback.

Example:

-                err_file=shlex.quote(task.errlog),
+                err_file=shlex.quote(task.errlog or "err"),
dpdispatcher/machines/dp_cloud_server.py (2)

22-36: Initialize base Machine to set retry_count; remove commented legacy assignment.

Align with centralization by calling Machine.init, which binds context and sets retry_count. The commented line can stay removed.

Apply:

 class Bohrium(Machine):
@@
-    def __init__(self, context):
-        self.context = context
+    def __init__(self, context):
+        super().__init__(context=context)
         self.input_data = context.remote_profile["input_data"].copy()
@@
-        # self.retry_count = context.remote_profile.get("retry_count", 3)
         self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)

70-73: Use super() directly for clarity.

-        shell_script = super(DpCloudServer, self).gen_script(job)
+        shell_script = super().gen_script(job)
dpdispatcher/submission.py (1)

840-861: Retry policy simplification looks good; fix log typo and harden edge cases.

  • Typo: "fail_cout" ➜ "fail_count".
  • Clamp negative retry_count to 0 to avoid surprising behavior.
  • Optional: log attempt progress.

Apply:

             dlog.info(
-                f"job: {self.job_hash} {self.job_id} terminated; "
-                f"fail_cout is {self.fail_count}; resubmitting job"
+                f"job: {self.job_hash} {self.job_id} terminated; "
+                f"fail_count is {self.fail_count}; resubmitting job"
             )
             retry_count = 3  # Default retry count
             assert self.machine is not None
             if hasattr(self.machine, "retry_count"):
                 retry_count = self.machine.retry_count

-            dlog.info(f"retry_count: {retry_count}")
+            retry_count = max(0, int(retry_count))
+            dlog.info(f"retry_count: {retry_count}")
+            dlog.info(f"retry attempt: {self.fail_count}/{retry_count}")
 
-            if self.fail_count > retry_count:
+            if self.fail_count > retry_count:
dpdispatcher/machines/batch.py (1)

8-8: Use a proper Windows batch script header.

The current header @echo off is appropriate for Windows batch files, but the class name suggests this might be intended as a generic batch system. Consider adding a comment to clarify this is specifically for Windows batch files.

-shell_script_header_template = """@echo off"""
+shell_script_header_template = """@echo off"""  # Windows batch script header
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8b98b36 and 909e595.

📒 Files selected for processing (6)
  • dpdispatcher/machine.py (4 hunks)
  • dpdispatcher/machines/batch.py (1 hunks)
  • dpdispatcher/machines/dp_cloud_server.py (1 hunks)
  • dpdispatcher/machines/openapi.py (1 hunks)
  • dpdispatcher/machines/pbs.py (1 hunks)
  • dpdispatcher/submission.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dpdispatcher/machines/batch.py (3)
dpdispatcher/utils/job_status.py (1)
  • JobStatus (4-11)
dpdispatcher/utils/utils.py (1)
  • customized_script_header_template (216-221)
dpdispatcher/machines/shell.py (6)
  • gen_script (14-16)
  • gen_script_header (18-30)
  • do_submit (32-61)
  • check_status (63-91)
  • check_finish_tag (104-107)
  • kill (109-119)
🪛 GitHub Check: pyright
dpdispatcher/machines/batch.py

[failure] 57-57:
Method "default_resources" overrides class "Machine" in an incompatible manner
  Parameter 2 name mismatch: base parameter is named "res", override parameter is named "resources" (reportIncompatibleMethodOverride)

🪛 GitHub Actions: Type checker
dpdispatcher/machines/batch.py

[error] 57-57: Pyright error: Method "default_resources" overrides class "Machine" in an incompatible manner. Parameter 2 name mismatch: base parameter is named "res", override parameter is named "resources" (reportIncompatibleMethodOverride). Command: /home/runner/actions-runner/cached/externals/node20/bin/node /opt/hostedtoolcache/pyright/1.1.404/x64/package/index.js --outputjson

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (3.9, windows-latest)
  • GitHub Check: build (slurm)
🔇 Additional comments (3)
dpdispatcher/machines/batch.py (3)

12-14: LGTM!

The script generation properly delegates to the parent class implementation, maintaining consistency with other machine implementations.


16-28: LGTM!

The header generation logic correctly follows the same pattern as the Shell machine, supporting both customized headers and falling back to the default template.


81-83: LGTM!

The finish tag checking follows the same pattern as other machine implementations and correctly uses the context to check file existence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant