Document the ParquetRecordBatchStream buffering (#6947)
* Document the ParquetRecordBatchStream buffering

* Update parquet/src/arrow/async_reader/mod.rs

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>

---------

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
alamb and tustvold authored Jan 8, 2025
1 parent 4f1f6e5 commit f18dadd
Showing 1 changed file with 14 additions and 2 deletions.
16 changes: 14 additions & 2 deletions parquet/src/arrow/async_reader/mod.rs
@@ -611,11 +611,23 @@ impl<T> std::fmt::Debug for StreamState<T> {
}
}

/// An asynchronous [`Stream`](https://docs.rs/futures/latest/futures/stream/trait.Stream.html) of [`RecordBatch`]
/// for a parquet file that can be constructed using [`ParquetRecordBatchStreamBuilder`].
/// An asynchronous [`Stream`] of [`RecordBatch`] constructed using [`ParquetRecordBatchStreamBuilder`] to read parquet files.
///
/// `ParquetRecordBatchStream` also provides [`ParquetRecordBatchStream::next_row_group`] for fetching row groups,
/// allowing users to decode record batches separately from I/O.
///
/// # I/O Buffering
///
/// `ParquetRecordBatchStream` buffers *all* data pages selected after predicates
/// (projection + filtering, etc) and decodes the rows from those buffered pages.
///
/// For example, if all rows and columns are selected, the entire row group is
/// buffered in memory during decode. This minimizes the number of IO operations
/// required, which is especially important for object stores, where IO operations
/// have latencies of hundreds of milliseconds.
///
/// [`Stream`]: https://docs.rs/futures/latest/futures/stream/trait.Stream.html
pub struct ParquetRecordBatchStream<T> {
metadata: Arc<ParquetMetaData>,

