Improve/Redesign ParallelRunner #4602

Open
merelcht opened this issue Mar 21, 2025 · 2 comments

Comments

@merelcht
Member

merelcht commented Mar 21, 2025

Description

We have a growing list of issues related to the ParallelRunner. To effectively address them and improve the runner, we need a clear understanding of each issue, how the ParallelRunner's design contributes to them, and potential solutions.

The task here is to create a summary of the issues and a proposal on how to address them.

@noklam
Contributor

noklam commented Apr 9, 2025

kedro-org/kedro-viz#1801

Not sure if we want to add this to the parent ticket. Other than datasets, it's tricky (or impossible) to make hooks work with ParallelRunner.

@SajidAlamQB
Contributor

SajidAlamQB commented Apr 10, 2025

Overview from Tech Design

This issue proposes a series of changes to Kedro’s ParallelRunner to address long-standing usability and reliability problems.

The main goals are:

  • Fix PartitionedDataset caching conflicts in a parallel environment.
  • Prevent serialization/pickling errors when using cloud-based datasets (e.g., S3).
  • Improve compatibility with OS-specific multiprocessing start methods (fork vs. spawn).
  • Provide a safer integration path for plugins like kedro-viz (hooks in a parallel setting).

Summary

PartitionedDataset Caching
PartitionedDataset uses an internal cache (via @cachedmethod) to list partitions. In a parallel pipeline, newly created partitions won’t appear in another process’s cached listing, causing No partitions found errors.
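
A minimal, self-contained sketch of this failure mode, using cachetools directly. The class and attribute names here are illustrative only and are not Kedro's actual PartitionedDataset implementation; the point is that a per-instance @cachedmethod cache never sees partitions written after the first listing unless it is explicitly cleared:

```python
import operator
from cachetools import Cache, cachedmethod


class PartitionLister:
    """Hypothetical stand-in for a dataset that caches its partition listing."""

    def __init__(self, store: dict):
        self._store = store            # stand-in for the remote/filesystem listing
        self._cache = Cache(maxsize=1)

    @cachedmethod(operator.attrgetter("_cache"))
    def list_partitions(self) -> list[str]:
        # Expensive listing call; the result is memoised per instance.
        return sorted(self._store)


store = {"part-1": "a"}
lister = PartitionLister(store)
print(lister.list_partitions())   # ['part-1']

# Another node/worker writes a new partition...
store["part-2"] = "b"

# ...but the cached listing does not see it, which is the root of the
# "No partitions found" behaviour described above.
print(lister.list_partitions())   # still ['part-1']

# Clearing the cache is what any invalidation fix effectively has to do.
lister._cache.clear()
print(lister.list_partitions())   # ['part-1', 'part-2']
```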

Cloud-Based Dataset Pickling
Cloud-based datasets, such as those backed by S3, often rely on unpicklable objects (e.g., s3fs.core.S3FileSystem._glob). Python’s multiprocessing attempts to pickle these for inter-process communication, leading to PicklingError or AttributeError.
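
One common pattern for making such objects survive the pickling step is to drop the unpicklable handle in __getstate__ and lazily recreate it in __setstate__. The sketch below uses a threading.Lock as a stand-in for an unpicklable filesystem client so it runs without s3fs; it illustrates the general pattern only, not how Kedro's datasets are (or should be) implemented:

```python
import pickle
import threading


class CloudDatasetSketch:
    """Hypothetical dataset holding an unpicklable client (a lock stands in
    for something like s3fs.core.S3FileSystem)."""

    def __init__(self, path: str):
        self._path = path
        self._client = self._connect()

    def _connect(self):
        # A real dataset would create the fsspec/s3fs filesystem here.
        return threading.Lock()

    def __getstate__(self):
        # Drop the unpicklable handle before the object crosses process
        # boundaries (multiprocessing pickles arguments under the hood).
        state = self.__dict__.copy()
        state["_client"] = None
        return state

    def __setstate__(self, state):
        # Lazily re-create the handle in the receiving process.
        self.__dict__.update(state)
        self._client = self._connect()


ds = CloudDatasetSketch("s3://bucket/data.csv")
# Without __getstate__/__setstate__ this round-trip raises
# "TypeError: cannot pickle '_thread.lock' object".
clone = pickle.loads(pickle.dumps(ds))
print(clone._path, type(clone._client))
```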

OS-Specific Multiprocessing Defaults
On Linux, the default multiprocessing start method is fork. Certain libraries, such as the Rust-based Polars, can hang indefinitely after a fork if they hold background threads or other open resources at fork time. Windows uses spawn, which avoids these issues but introduces other overheads.
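
For reference, the start method can be chosen explicitly rather than relying on the OS default. This is plain stdlib usage, shown here only to illustrate the fork/spawn trade-off discussed above, not a proposed Kedro change:

```python
import multiprocessing as mp


def work(x: int) -> int:
    return x * x


if __name__ == "__main__":
    # "spawn" (and "forkserver" on Unix) starts a fresh interpreter per worker,
    # avoiding the post-fork hangs seen with thread-holding libraries such as
    # Polars; "fork" starts workers faster but inherits the parent's state.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(work, range(5)))
```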

kedro-viz Hook Incompatibility
Hooks used by kedro-viz (e.g. DatasetStatsHook) are not multi-process aware. They can’t reliably collect dataset usage data across multiple processes when run under ParallelRunner.
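
A hedged sketch of one direction a multi-process-aware hook could take: aggregate per-process data into a multiprocessing.Manager proxy created in the parent, since the proxy itself is picklable. The class, hook arguments, and stats shape below are illustrative assumptions, not the actual kedro-viz DatasetStatsHook, and this does not claim ParallelRunner would propagate such a hook today:

```python
import multiprocessing as mp
from typing import Any

from kedro.framework.hooks import hook_impl


class SharedStatsHook:
    """Illustrative hook that writes dataset stats into a Manager-backed dict
    so data collected in worker processes is visible to the parent."""

    def __init__(self, shared_stats):
        # shared_stats is a Manager().dict() proxy created in the parent process.
        self._stats = shared_stats

    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data: Any) -> None:
        try:
            rows = len(data)
        except TypeError:
            rows = None
        self._stats[dataset_name] = {"rows": rows}


if __name__ == "__main__":
    manager = mp.Manager()
    stats = manager.dict()
    hook = SharedStatsHook(stats)
    # In a real project the hook would be registered via HOOKS in settings.py;
    # here it is called directly just to show the shared dict being updated.
    hook.after_dataset_saved(dataset_name="model_input", data=[1, 2, 3])
    print(dict(stats))   # {'model_input': {'rows': 3}}
```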

Discussion:

Polars + fork: The discussion highlighted that Polars' internal threading can conflict with the fork method on Unix-like systems, leading to hangs. Switching to spawn (or forkserver) is recommended to avoid these issues, aligning with Polars’ own guidance.

Performance vs. Stability: While fork can be more efficient, it poses risks with certain C-extensions or Rust-based libraries. spawn is safer but slower to initialize new processes.

Usage Data: Historical telemetry indicates ParallelRunner usage is significantly lower than SequentialRunner or ThreadRunner, though exact usage patterns are partially masked in recent data. This raises questions about maintaining or deprecating ParallelRunner.

Potential Deprecation: Some team members asked whether removing ParallelRunner is feasible, given its complexities. However, backward compatibility and the runner’s existing user base must be considered.
