revise option to control number of resubmit fail jobs #545

base: master

Conversation
- add option: `"strategy": {"customized_script_header_template_file": ""}`
- add option: `sge_pe_name`
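A hedged sketch of where the two new options might sit in a resources dict. The surrounding keys and values are illustrative assumptions, not taken from the dpdispatcher docs:

```python
import json

# Hypothetical resources configuration showing the two new options.
# All values here (queue name, PE name, template path) are made up.
resources = {
    "number_node": 1,
    "cpu_per_node": 8,
    "queue_name": "all.q",
    "sge_pe_name": "mpi",  # new option: SGE parallel environment name
    "strategy": {
        # new option: path to a user-supplied script header template
        "customized_script_header_template_file": "./header.tpl",
    },
}

print(json.dumps(resources, indent=2))
```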
for more information, see https://pre-commit.ci
Update pbs.py
improve path check (deepmodeling#501)
docs: replace sphinx-rtd-theme with sphinx-book-theme (deepmodeling#505)
build(deps): bump astral-sh/setup-uv from 4 to 5 (deepmodeling#512)
Fixed when a log decoding error caused the process to terminate. (deepmodeling#515)
📝 Walkthrough

Adds a configurable `retry_count` parameter to `Machine`, exposes it via arginfo, and stores it as an instance attribute. Removes subclass-specific initialization of `retry_count` in `dp_cloud_server.Bohrium` and `OpenAPI` (commented out). Adds several test fixture files containing a relative path to graph.pb.
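A minimal sketch of the centralized pattern the walkthrough describes: the base class stores `retry_count` once and subclasses inherit it instead of re-reading it from their remote profile. The class names mirror the walkthrough, but this is an illustration, not dpdispatcher's actual code:

```python
# Toy illustration of centralizing retry_count in the base Machine class.
class Machine:
    def __init__(self, retry_count=3, **kwargs):
        # single source of truth for the resubmission limit
        self.retry_count = retry_count


class Bohrium(Machine):
    def __init__(self, **kwargs):
        # no subclass-specific retry_count handling anymore
        super().__init__(**kwargs)


m = Bohrium(retry_count=5)
print(m.retry_count)  # 5
```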
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #545       +/-   ##
===========================================
- Coverage   60.15%   47.37%   -12.78%
===========================================
  Files          39       39
  Lines        3905     3905
===========================================
- Hits         2349     1850     -499
- Misses       1556     2055     +499
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dpdispatcher/machine.py (1)

Lines 78-101: Ensure `retry_count` is persisted and restored across (de)serialization.

`retry_count` is exposed via arginfo and set in `__init__`, but `Machine.serialize`/`load_from_dict` neither save nor rehydrate it. This drops user configuration on save/load and undermines centralization.

Apply:

```diff
 def load_from_dict(cls, machine_dict):
 @@
-    context = BaseContext.load_from_dict(machine_dict)
-    machine = machine_class(context=context)
+    context = BaseContext.load_from_dict(machine_dict)
+    machine = machine_class(context=context)
+    # propagate retry_count even if subclass overrides __init__
+    if "retry_count" in machine_dict:
+        setattr(machine, "retry_count", machine_dict["retry_count"])
     return machine

 def serialize(self, if_empty_remote_profile=False):
     machine_dict = {}
     machine_dict["batch_type"] = self.__class__.__name__
     machine_dict["context_type"] = self.context.__class__.__name__
     machine_dict["local_root"] = self.context.init_local_root
     machine_dict["remote_root"] = self.context.init_remote_root
+    machine_dict["retry_count"] = getattr(self, "retry_count", 3)
     if not if_empty_remote_profile:
         machine_dict["remote_profile"] = self.context.remote_profile
     else:
         machine_dict["remote_profile"] = {}
     # normalize the dict
     base = self.arginfo()
     machine_dict = base.normalize_value(machine_dict, trim_pattern="_*")
     return machine_dict
```

Optionally clarify the docs:

```diff
-doc_retry_count = "Number of retries to resubmit failed jobs."
+doc_retry_count = "Maximum number of retries to resubmit failed jobs (0 = no retries)."
```

Also applies to: lines 401-420, 152-170
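The failure mode flagged above can be demonstrated with a toy stand-in for `Machine`: when `serialize()` omits `retry_count`, a save/load cycle silently resets it to the default. The classes here are illustrative only:

```python
# Toy reproduction of the (de)serialization gap described in the review.
class ToyMachine:
    def __init__(self, retry_count=3):
        self.retry_count = retry_count

    def serialize(self):
        # the bug being flagged: retry_count is not saved
        return {"batch_type": type(self).__name__}

    @classmethod
    def load_from_dict(cls, d):
        m = cls()
        # the suggested fix: rehydrate retry_count when present
        if "retry_count" in d:
            m.retry_count = d["retry_count"]
        return m


m = ToyMachine(retry_count=7)
restored = ToyMachine.load_from_dict(m.serialize())
print(restored.retry_count)  # 3: the user's setting is lost because serialize() drops it
```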
🧹 Nitpick comments (6)
dpdispatcher/machines/openapi.py (1)

Lines 31-43: Call `Machine.__init__` to initialize `retry_count`; drop the stale commented code.

Subclasses overriding `__init__` should invoke the base initializer so Machine-level attributes (e.g., `retry_count`) are consistently set. This also avoids relying on hasattr fallbacks elsewhere and aligns with the PR goal of centralizing this option.

Apply:

```diff
 class OpenAPI(Machine):
     def __init__(self, context):
         if not found_bohriumsdk:
             raise ModuleNotFoundError(
                 "bohriumsdk not installed. Install dpdispatcher with `pip install dpdispatcher[bohrium]`"
             )
-        self.context = context
+        super().__init__(context=context)
         self.remote_profile = context.remote_profile.copy()
         self.grouped = self.remote_profile.get("grouped", True)
-        # self.retry_count = self.remote_profile.get("retry_count", 3)
         self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
```

dpdispatcher/machine.py (1)
Lines 321-336: Guard against `errlog` being None when constructing the `last_err_file` tail command.

`Task.errlog` allows None; formatting `err_file` unconditionally via `shlex.quote(task.errlog)` risks a TypeError or invalid tail paths. Consider a safe fallback.

Example:

```diff
-err_file=shlex.quote(task.errlog),
+err_file=shlex.quote(task.errlog or "err"),
```

dpdispatcher/machines/dp_cloud_server.py (2)
Lines 22-36: Initialize the base `Machine` to set `retry_count`; remove the commented legacy assignment.

Align with centralization by calling `Machine.__init__`, which binds the context and sets `retry_count`. The commented line can stay removed.

Apply:

```diff
 class Bohrium(Machine):
 @@
-    def __init__(self, context):
-        self.context = context
+    def __init__(self, context):
+        super().__init__(context=context)
         self.input_data = context.remote_profile["input_data"].copy()
 @@
-        # self.retry_count = context.remote_profile.get("retry_count", 3)
         self.ignore_exit_code = context.remote_profile.get("ignore_exit_code", True)
```
Lines 70-73: Use `super()` directly for clarity.

```diff
-shell_script = super(DpCloudServer, self).gen_script(job)
+shell_script = super().gen_script(job)
```

dpdispatcher/submission.py (1)
Lines 840-861: The retry policy simplification looks good; fix a log typo and harden edge cases.

- Typo: "fail_cout" ➜ "fail_count".
- Clamp a negative retry_count to 0 to avoid surprising behavior.
- Optional: log attempt progress.

Apply:

```diff
 dlog.info(
-    f"job: {self.job_hash} {self.job_id} terminated; "
-    f"fail_cout is {self.fail_count}; resubmitting job"
+    f"job: {self.job_hash} {self.job_id} terminated; "
+    f"fail_count is {self.fail_count}; resubmitting job"
 )
 retry_count = 3  # Default retry count
 assert self.machine is not None
 if hasattr(self.machine, "retry_count"):
     retry_count = self.machine.retry_count
-dlog.info(f"retry_count: {retry_count}")
+retry_count = max(0, int(retry_count))
+dlog.info(f"retry_count: {retry_count}")
+dlog.info(f"retry attempt: {self.fail_count}/{retry_count}")
 if self.fail_count > retry_count:
```

dpdispatcher/machines/batch.py (1)
Line 8: Use a proper Windows batch script header.

The current header `@echo off` is appropriate for Windows batch files, but the class name suggests this might be intended as a generic batch system. Consider adding a comment to clarify this is specifically for Windows batch files.

```diff
-shell_script_header_template = """@echo off"""
+shell_script_header_template = """@echo off"""  # Windows batch script header
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)

- dpdispatcher/machine.py (4 hunks)
- dpdispatcher/machines/batch.py (1 hunks)
- dpdispatcher/machines/dp_cloud_server.py (1 hunks)
- dpdispatcher/machines/openapi.py (1 hunks)
- dpdispatcher/machines/pbs.py (1 hunks)
- dpdispatcher/submission.py (1 hunks)
🧰 Additional context used

🧬 Code graph analysis (1)

dpdispatcher/machines/batch.py (3)

- dpdispatcher/utils/job_status.py (1): JobStatus (4-11)
- dpdispatcher/utils/utils.py (1): customized_script_header_template (216-221)
- dpdispatcher/machines/shell.py (6): gen_script (14-16), gen_script_header (18-30), do_submit (32-61), check_status (63-91), check_finish_tag (104-107), kill (109-119)
🪛 GitHub Check: pyright

dpdispatcher/machines/batch.py

[failure] line 57: Method "default_resources" overrides class "Machine" in an incompatible manner. Parameter 2 name mismatch: base parameter is named "res", override parameter is named "resources" (reportIncompatibleMethodOverride)

🪛 GitHub Actions: Type checker

dpdispatcher/machines/batch.py

[error] line 57: Pyright error: Method "default_resources" overrides class "Machine" in an incompatible manner. Parameter 2 name mismatch: base parameter is named "res", override parameter is named "resources" (reportIncompatibleMethodOverride)
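The Pyright failure above can be illustrated with a minimal override pair; renaming a positional parameter in an override breaks callers that pass it by keyword. The class and method names follow the check output, but the bodies are placeholders:

```python
# Illustration of reportIncompatibleMethodOverride: the fix is to keep the
# base parameter name "res" instead of renaming it to "resources", so a call
# like machine.default_resources(res=...) works on every subclass.
class Machine:
    def default_resources(self, res):
        return res


class Batch(Machine):
    def default_resources(self, res):  # parameter name matches the base class
        return res


print(Batch().default_resources(res={"queue_name": "normal"}))
```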
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (3.9, windows-latest)
- GitHub Check: build (slurm)
🔇 Additional comments (3)

dpdispatcher/machines/batch.py (3)

Lines 12-14: LGTM! The script generation properly delegates to the parent class implementation, maintaining consistency with other machine implementations.

Lines 16-28: LGTM! The header generation logic correctly follows the same pattern as the Shell machine, supporting both customized headers and falling back to the default template.

Lines 81-83: LGTM! The finish tag checking follows the same pattern as other machine implementations and correctly uses the context to check file existence.
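The customized-header fallback pattern praised for lines 16-28 can be sketched as follows; the function name mirrors dpdispatcher's machines, but the logic here is a simplified, assumed illustration:

```python
# Hedged sketch: prefer a user-supplied template file when the strategy
# requests one, otherwise fall back to the built-in default header.
DEFAULT_HEADER = "@echo off"


def gen_script_header(strategy):
    template_file = strategy.get("customized_script_header_template_file")
    if template_file:
        with open(template_file) as f:
            return f.read()
    return DEFAULT_HEADER


print(gen_script_header({}))  # @echo off
```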
So far, the option to control the number of times failed jobs are resubmitted is only partially implemented and inconsistent across different Machines.
This PR tries to address this problem, as also discussed in issue #525.
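With `retry_count` centralized on `Machine`, a machine dict might carry it alongside the standard fields; the surrounding keys follow dpdispatcher's documented machine fields, while the values here are made up for illustration:

```python
import json

# Hypothetical machine configuration once retry_count lives on Machine.
machine = {
    "batch_type": "Slurm",
    "context_type": "SSHContext",
    "local_root": "./",
    "remote_root": "/scratch/jobs",
    "retry_count": 5,  # resubmit a failed job at most 5 times
}

print(json.dumps(machine, indent=2))
```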