Improve/Redesign ParallelRunner #4602

Open
merelcht opened this issue Mar 21, 2025 · 2 comments

Comments

@merelcht
Member

merelcht commented Mar 21, 2025

Description

We have a growing list of issues related to the ParallelRunner. To effectively address them and improve the runner, we need a clear understanding of each issue, how the ParallelRunner's design contributes to them, and potential solutions.

The task here is to create a summary of the issues and a proposal on how to address them.

@noklam
Contributor

noklam commented Apr 9, 2025

kedro-org/kedro-viz#1801

Not sure if we want to add this to the parent ticket. Other than datasets, it's tricky (or impossible) to make hooks work with ParallelRunner.

@SajidAlamQB
Contributor

SajidAlamQB commented Apr 10, 2025

Overview from Tech Design

This issue proposes a series of changes to Kedro’s ParallelRunner to address long-standing usability and reliability problems.

The main goals are:

  • Fix PartitionedDataset caching conflicts in a parallel environment.
  • Prevent serialization/pickling errors when using cloud-based datasets (e.g., S3).
  • Improve compatibility with OS-specific multiprocessing start methods (fork vs. spawn).
  • Provide a safer integration path for plugins like kedro-viz (hooks in a parallel setting).

Summary

PartitionedDataset Caching
PartitionedDataset uses an internal cache (via @cachedmethod) to list partitions. In a parallel pipeline, newly created partitions won’t appear in another process’s cached listing, causing No partitions found errors.
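
A minimal, self-contained sketch of this failure mode, using cachetools directly. The class and attribute names here are illustrative only and are not Kedro's actual PartitionedDataset implementation; the point is that a per-instance @cachedmethod cache never sees partitions written after the first listing unless it is explicitly cleared:

```python
import operator
from cachetools import Cache, cachedmethod


class PartitionLister:
    """Hypothetical stand-in for a dataset that caches its partition listing."""

    def __init__(self, store: dict):
        self._store = store            # stand-in for the remote/filesystem listing
        self._cache = Cache(maxsize=1)

    @cachedmethod(operator.attrgetter("_cache"))
    def list_partitions(self) -> list[str]:
        # Expensive listing call; the result is memoised per instance.
        return sorted(self._store)


store = {"part-1": "a"}
lister = PartitionLister(store)
print(lister.list_partitions())   # ['part-1']

# Another node/worker writes a new partition...
store["part-2"] = "b"

# ...but the cached listing does not see it, which is the root of the
# "No partitions found" behaviour described above.
print(lister.list_partitions())   # still ['part-1']

# Clearing the cache is what any invalidation fix effectively has to do.
lister._cache.clear()
print(lister.list_partitions())   # ['part-1', 'part-2']
```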

Cloud-Based Dataset Pickling
Cloud-based datasets, such as those backed by S3, often rely on unpicklable objects (e.g., s3fs.core.S3FileSystem._glob). Python’s multiprocessing attempts to pickle these for inter-process communication, leading to PicklingError or AttributeError.
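
One common pattern for making such objects survive the pickling step is to drop the unpicklable handle in __getstate__ and lazily recreate it in __setstate__. The sketch below uses a threading.Lock as a stand-in for an unpicklable filesystem client so it runs without s3fs; it illustrates the general pattern only, not how Kedro's datasets are (or should be) implemented:

```python
import pickle
import threading


class CloudDatasetSketch:
    """Hypothetical dataset holding an unpicklable client (a lock stands in
    for something like s3fs.core.S3FileSystem)."""

    def __init__(self, path: str):
        self._path = path
        self._client = self._connect()

    def _connect(self):
        # A real dataset would create the fsspec/s3fs filesystem here.
        return threading.Lock()

    def __getstate__(self):
        # Drop the unpicklable handle before the object crosses process
        # boundaries (multiprocessing pickles arguments under the hood).
        state = self.__dict__.copy()
        state["_client"] = None
        return state

    def __setstate__(self, state):
        # Lazily re-create the handle in the receiving process.
        self.__dict__.update(state)
        self._client = self._connect()


ds = CloudDatasetSketch("s3://bucket/data.csv")
# Without __getstate__/__setstate__ this round-trip raises
# "TypeError: cannot pickle '_thread.lock' object".
clone = pickle.loads(pickle.dumps(ds))
print(clone._path, type(clone._client))
```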

OS-Specific Multiprocessing Defaults
On Linux, the default multiprocessing start method is fork. Certain libraries, such as the Rust-based Polars, can hang indefinitely after a fork if they hold background threads or other open resources at fork time. Windows uses spawn, which avoids these issues but introduces other overheads.
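
For reference, the start method can be chosen explicitly rather than relying on the OS default. This is plain stdlib usage, shown here only to illustrate the fork/spawn trade-off discussed above, not a proposed Kedro change:

```python
import multiprocessing as mp


def work(x: int) -> int:
    return x * x


if __name__ == "__main__":
    # "spawn" (and "forkserver" on Unix) starts a fresh interpreter per worker,
    # avoiding the post-fork hangs seen with thread-holding libraries such as
    # Polars; "fork" starts workers faster but inherits the parent's state.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(work, range(5)))
```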

kedro-viz Hook Incompatibility
Hooks used by kedro-viz (e.g. DatasetStatsHook) are not multi-process aware. They can’t reliably collect dataset usage data across multiple processes when run under ParallelRunner.
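
A hedged sketch of one direction a multi-process-aware hook could take: aggregate per-process data into a multiprocessing.Manager proxy created in the parent, since the proxy itself is picklable. The class, hook arguments, and stats shape below are illustrative assumptions, not the actual kedro-viz DatasetStatsHook, and this does not claim ParallelRunner would propagate such a hook today:

```python
import multiprocessing as mp
from typing import Any

from kedro.framework.hooks import hook_impl


class SharedStatsHook:
    """Illustrative hook that writes dataset stats into a Manager-backed dict
    so data collected in worker processes is visible to the parent."""

    def __init__(self, shared_stats):
        # shared_stats is a Manager().dict() proxy created in the parent process.
        self._stats = shared_stats

    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data: Any) -> None:
        try:
            rows = len(data)
        except TypeError:
            rows = None
        self._stats[dataset_name] = {"rows": rows}


if __name__ == "__main__":
    manager = mp.Manager()
    stats = manager.dict()
    hook = SharedStatsHook(stats)
    # In a real project the hook would be registered via HOOKS in settings.py;
    # here it is called directly just to show the shared dict being updated.
    hook.after_dataset_saved(dataset_name="model_input", data=[1, 2, 3])
    print(dict(stats))   # {'model_input': {'rows': 3}}
```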

Discussion:

Polars + fork: The discussion highlighted that Polars' internal threading can conflict with the fork method on Unix-like systems, leading to hangs. Switching to spawn (or forkserver) is recommended to avoid these issues, aligning with Polars’ own guidance.

Performance vs. Stability: While fork can be more efficient, it poses risks with certain C-extensions or Rust-based libraries. spawn is safer but slower to initialize new processes.

Usage Data: Historical telemetry indicates ParallelRunner usage is significantly lower than SequentialRunner or ThreadRunner, though exact usage patterns are partially masked in recent data. This raises questions about maintaining or deprecating ParallelRunner.

Potential Deprecation: Some team members asked whether removing ParallelRunner is feasible, given its complexities. However, backward compatibility and the runner’s existing user base must be considered.
