Replies: 1 comment
-
Might be worth trying -> a PR with a proposal is a better way to discuss such things, and by preparing a PR proposal you might find out more limitations/issues, or that it is actually easy.
-
Hello!
I've been trying to help my colleagues set up `SqlToS3Operator` recently, and together we've found that this operator can't handle big tables correctly. It uses the `get_pandas_df` method to read the full table into RAM first, then loads it to S3, optionally splitting it into multiple files if the `max_rows_per_file` argument is provided. The problem is that this logic is not suitable for big tables, but it can be fixed with relatively little effort.

Given that the `SqlToS3Operator._get_hook()` method is designed to return a `DbApiHook` instance, and that the latter has a `get_pandas_df_by_chunks` method, isn't it only natural to use that method instead of `get_pandas_df` when `max_rows_per_file` is specified for the `SqlToS3Operator`?
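To make the proposal concrete, here is a minimal sketch of what the chunked path could look like. It assumes the existing `DbApiHook.get_pandas_df_by_chunks` and `S3Hook.load_file` APIs; the helper name `copy_query_results_to_s3_in_chunks`, the CSV output, and the key naming scheme are illustrative and not the operator's actual implementation.

```python
from __future__ import annotations

from tempfile import NamedTemporaryFile

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.common.sql.hooks.sql import DbApiHook


def copy_query_results_to_s3_in_chunks(
    sql_hook: DbApiHook,
    s3_hook: S3Hook,
    query: str,
    bucket: str,
    key_prefix: str,
    max_rows_per_file: int,
) -> None:
    """Stream query results to S3 chunk by chunk instead of reading the
    whole table into RAM with get_pandas_df()."""
    chunks = sql_hook.get_pandas_df_by_chunks(query, chunksize=max_rows_per_file)
    for file_no, df in enumerate(chunks):
        # Each chunk holds at most max_rows_per_file rows, so peak memory
        # stays bounded by the chunk size rather than the table size.
        with NamedTemporaryFile(suffix=".csv") as tmp_file:
            df.to_csv(tmp_file.name, index=False)
            s3_hook.load_file(
                filename=tmp_file.name,
                key=f"{key_prefix}_{file_no}.csv",  # illustrative key scheme
                bucket_name=bucket,
                replace=True,
            )
```

In the operator itself, this chunked path would presumably only kick in when `max_rows_per_file` is set, with the current `get_pandas_df` behaviour kept as the default otherwise.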