revise option to control number of resubmit fail jobs #545

base: master

Conversation
- add option: `"strategy": {"customized_script_header_template_file": ""}`
- add option: `sge_pe_name`
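A hedged sketch of where the two new options might sit in a resources dict. The surrounding keys and values are illustrative assumptions, not taken from the dpdispatcher docs:

```python
import json

# Hypothetical resources configuration showing the two new options.
# All values here (queue name, PE name, template path) are made up.
resources = {
    "number_node": 1,
    "cpu_per_node": 8,
    "queue_name": "all.q",
    "sge_pe_name": "mpi",  # new option: SGE parallel environment name
    "strategy": {
        # new option: path to a user-supplied script header template
        "customized_script_header_template_file": "./header.tpl",
    },
}

print(json.dumps(resources, indent=2))
```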
for more information, see https://pre-commit.ci
Update pbs.py
improve path check (deepmodeling#501)
docs: replace sphinx-rtd-theme with sphinx-book-theme (deepmodeling#505)
build(deps): bump astral-sh/setup-uv from 4 to 5 (deepmodeling#512)
Fixed when a log decoding error caused the process to terminate. (deepmodeling#515)
📝 Walkthrough

Adds a configurable `retry_count` parameter to `Machine`, exposes it via arginfo, and stores it as an instance attribute. Removes subclass-specific initialization of `retry_count` in `dp_cloud_server.Bohrium` and `OpenAPI` (commented out). Adds several test fixture files containing a relative path to graph.pb.
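A minimal sketch of the centralized pattern the walkthrough describes: the base class stores `retry_count` once and subclasses inherit it instead of re-reading it from their remote profile. The class names mirror the walkthrough, but this is an illustration, not dpdispatcher's actual code:

```python
# Toy illustration of centralizing retry_count in the base Machine class.
class Machine:
    def __init__(self, retry_count=3, **kwargs):
        # single source of truth for the resubmission limit
        self.retry_count = retry_count


class Bohrium(Machine):
    def __init__(self, **kwargs):
        # no subclass-specific retry_count handling anymore
        super().__init__(**kwargs)


m = Bohrium(retry_count=5)
print(m.retry_count)  # 5
```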
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #545       +/-   ##
===========================================
- Coverage   60.15%   47.37%   -12.78%
===========================================
  Files          39       39
  Lines        3905     3905
===========================================
- Hits         2349     1850     -499
- Misses       1556     2055     +499
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dpdispatcher/machine.py (1)

Lines 78-101: Ensure `retry_count` is persisted and restored across (de)serialization.

`retry_count` is exposed via arginfo and set in `__init__`, but `Machine.serialize`/`load_from_dict` neither save nor rehydrate it. This drops user configuration on save/load and undermines centralization.

Apply:

```diff
 def load_from_dict(cls, machine_dict):
 @@
-    context = BaseContext.load_from_dict(machine_dict)
-    machine = machine_class(context=context)
+    context = BaseContext.load_from_dict(machine_dict)
+    machine = machine_class(context=context)
+    # propagate retry_count even if subclass overrides __init__
+    if "retry_count" in machine_dict:
+        setattr(machine, "retry_count", machine_dict["retry_count"])
     return machine

 def serialize(self, if_empty_remote_profile=False):
     machine_dict = {}
     machine_dict["batch_type"] = self.__class__.__name__
     machine_dict["context_type"] = self.context.__class__.__name__
     machine_dict["local_root"] = self.context.init_local_root
     machine_dict["remote_root"] = self.context.init_remote_root
+    machine_dict["retry_count"] = getattr(self, "retry_count", 3)
     if not if_empty_remote_profile:
         machine_dict["remote_profile"] = self.context.remote_profile
     else:
         machine_dict["remote_profile"] = {}
     # normalize the dict
     base = self.arginfo()
     machine_dict = base.normalize_value(machine_dict, trim_pattern="_*")
     return machine_dict
```

Optionally clarify the docs:

```diff
-doc_retry_count = "Number of retries to resubmit failed jobs."
+doc_retry_count = "Maximum number of retries to resubmit failed jobs (0 = no retries)."
```

Also applies to: lines 401-420, 152-170
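The failure mode flagged above can be demonstrated with a toy stand-in for `Machine`: when `serialize()` omits `retry_count`, a save/load cycle silently resets it to the default. The classes here are illustrative only:

```python
# Toy reproduction of the (de)serialization gap described in the review.
class ToyMachine:
    def __init__(self, retry_count=3):
        self.retry_count = retry_count

    def serialize(self):
        # the bug being flagged: retry_count is not saved
        return {"batch_type": type(self).__name__}

    @classmethod
    def load_from_dict(cls, d):
        m = cls()
        # the suggested fix: rehydrate retry_count when present
        if "retry_count" in d:
            m.retry_count = d["retry_count"]
        return m


m = ToyMachine(retry_count=7)
restored = ToyMachine.load_from_dict(m.serialize())
print(restored.retry_count)  # 3: the user's setting is lost because serialize() drops it
```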
🧹 Nitpick comments (6)
dpdispatcher/machines/openapi.py (1)

Lines 31-43: Call `Machine.__init__` to initialize `retry_count`; drop the stale commented code.

Subclasses overriding `__init__` should invoke the base initializer so Machine-level attributes (e.g., `retry_count`) are consistently set. This also avoids relying on hasattr fallbacks elsewhere and aligns with the PR goal of centralizing this option.

Apply:

```diff
 class OpenAPI(Machine):
     def __init__(self, context):
         if not found_bohriumsdk:
             raise ModuleNotFoundError(
                 "bohriumsdk not installed. Install dpdispatcher with `pip install dpdispatcher[bohrium]`"
             )
-        self.context = context
+        super().__init__(context=context)
         self.remote_profile = context.remote_profile.copy()
         self.grouped = self.remote_profile.get("grouped", True)
-        # self.retry_count = self.remote_profile.get("retry_count", 3)
         self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
```

dpdispatcher/machine.py (1)
Lines 321-336: Guard against `errlog` being None when constructing the `last_err_file` tail command.

`Task.errlog` allows None; formatting `err_file` unconditionally via `shlex.quote(task.errlog)` risks a TypeError or invalid tail paths. Consider a safe fallback.

Example:

```diff
-err_file=shlex.quote(task.errlog),
+err_file=shlex.quote(task.errlog or "err"),
```

dpdispatcher/machines/dp_cloud_server.py (2)
Lines 22-36: Initialize the base `Machine` to set `retry_count`; remove the commented legacy assignment.

Align with centralization by calling `Machine.__init__`, which binds the context and sets `retry_count`. The commented line can stay removed.

Apply:

```diff
 class Bohrium(Machine):
 @@
-    def __init__(self, context):
-        self.context = context
+    def __init__(self, context):
+        super().__init__(context=context)
         self.input_data = context.remote_profile["input_data"].copy()
 @@
-        # self.retry_count = context.remote_profile.get("retry_count", 3)
         self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
```
Lines 70-73: Use `super()` directly for clarity.

```diff
-shell_script = super(DpCloudServer, self).gen_script(job)
+shell_script = super().gen_script(job)
```

dpdispatcher/submission.py (1)
Lines 840-861: The retry policy simplification looks good; fix a log typo and harden edge cases.

- Typo: "fail_cout" ➜ "fail_count".
- Clamp a negative retry_count to 0 to avoid surprising behavior.
- Optional: log attempt progress.

Apply:

```diff
 dlog.info(
-    f"job: {self.job_hash} {self.job_id} terminated; "
-    f"fail_cout is {self.fail_count}; resubmitting job"
+    f"job: {self.job_hash} {self.job_id} terminated; "
+    f"fail_count is {self.fail_count}; resubmitting job"
 )
 retry_count = 3  # Default retry count
 assert self.machine is not None
 if hasattr(self.machine, "retry_count"):
     retry_count = self.machine.retry_count
-dlog.info(f"retry_count: {retry_count}")
+retry_count = max(0, int(retry_count))
+dlog.info(f"retry_count: {retry_count}")
+dlog.info(f"retry attempt: {self.fail_count}/{retry_count}")
 if self.fail_count > retry_count:
```

dpdispatcher/machines/batch.py (1)
Line 8: Use a proper Windows batch script header.

The current header `@echo off` is appropriate for Windows batch files, but the class name suggests this might be intended as a generic batch system. Consider adding a comment to clarify this is specifically for Windows batch files.

```diff
-shell_script_header_template = """@echo off"""
+shell_script_header_template = """@echo off"""  # Windows batch script header
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)

- dpdispatcher/machine.py (4 hunks)
- dpdispatcher/machines/batch.py (1 hunks)
- dpdispatcher/machines/dp_cloud_server.py (1 hunks)
- dpdispatcher/machines/openapi.py (1 hunks)
- dpdispatcher/machines/pbs.py (1 hunks)
- dpdispatcher/submission.py (1 hunks)
🧰 Additional context used

🧬 Code graph analysis (1)

dpdispatcher/machines/batch.py (3)

- dpdispatcher/utils/job_status.py (1): JobStatus (4-11)
- dpdispatcher/utils/utils.py (1): customized_script_header_template (216-221)
- dpdispatcher/machines/shell.py (6): gen_script (14-16), gen_script_header (18-30), do_submit (32-61), check_status (63-91), check_finish_tag (104-107), kill (109-119)
🪛 GitHub Check: pyright

dpdispatcher/machines/batch.py

[failure] line 57: Method "default_resources" overrides class "Machine" in an incompatible manner. Parameter 2 name mismatch: base parameter is named "res", override parameter is named "resources" (reportIncompatibleMethodOverride)

🪛 GitHub Actions: Type checker

dpdispatcher/machines/batch.py

[error] line 57: Pyright error: Method "default_resources" overrides class "Machine" in an incompatible manner. Parameter 2 name mismatch: base parameter is named "res", override parameter is named "resources" (reportIncompatibleMethodOverride)
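The Pyright failure above can be illustrated with a minimal override pair; renaming a positional parameter in an override breaks callers that pass it by keyword. The class and method names follow the check output, but the bodies are placeholders:

```python
# Illustration of reportIncompatibleMethodOverride: the fix is to keep the
# base parameter name "res" instead of renaming it to "resources", so a call
# like machine.default_resources(res=...) works on every subclass.
class Machine:
    def default_resources(self, res):
        return res


class Batch(Machine):
    def default_resources(self, res):  # parameter name matches the base class
        return res


print(Batch().default_resources(res={"queue_name": "normal"}))
```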
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (3.9, windows-latest)
- GitHub Check: build (slurm)
🔇 Additional comments (3)

dpdispatcher/machines/batch.py (3)

Lines 12-14: LGTM! The script generation properly delegates to the parent class implementation, maintaining consistency with other machine implementations.

Lines 16-28: LGTM! The header generation logic correctly follows the same pattern as the Shell machine, supporting both customized headers and falling back to the default template.

Lines 81-83: LGTM! The finish tag checking follows the same pattern as other machine implementations and correctly uses the context to check file existence.
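The customized-header fallback pattern praised for lines 16-28 can be sketched as follows; the function name mirrors dpdispatcher's machines, but the logic here is a simplified, assumed illustration:

```python
# Hedged sketch: prefer a user-supplied template file when the strategy
# requests one, otherwise fall back to the built-in default header.
DEFAULT_HEADER = "@echo off"


def gen_script_header(strategy):
    template_file = strategy.get("customized_script_header_template_file")
    if template_file:
        with open(template_file) as f:
            return f.read()
    return DEFAULT_HEADER


print(gen_script_header({}))  # @echo off
```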
So far, the option to control the number of times failed jobs are resubmitted is only partially implemented and inconsistent across different Machines.
This PR tries to address this problem, as also discussed in issue #525.
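With `retry_count` centralized on `Machine`, a machine dict might carry it alongside the standard fields; the surrounding keys follow dpdispatcher's documented machine fields, while the values here are made up for illustration:

```python
import json

# Hypothetical machine configuration once retry_count lives on Machine.
machine = {
    "batch_type": "Slurm",
    "context_type": "SSHContext",
    "local_root": "./",
    "remote_root": "/scratch/jobs",
    "retry_count": 5,  # resubmit a failed job at most 5 times
}

print(json.dumps(machine, indent=2))
```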