Setup experimental Parquet reader headers for hybrid scan #18471
* that appear in the filter expression) and the second pass optimally reads the `payload` columns
* (i.e. columns that do not appear in the filter expression)
*/
class hybrid_scan_reader {
The main public `hybrid_scan_reader` class, to be used by Spark and other libcudf users.
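The doc comment above distinguishes `filter` columns (those referenced by the filter expression) from `payload` columns (all others). As a rough illustration of that split only, here is a minimal sketch in plain C++. The names `column_split` and `split_columns` are hypothetical and are not part of libcudf or this PR.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical result type (not part of libcudf): the two column groups
// that a two-pass scan would read separately.
struct column_split {
  std::vector<std::string> filter_columns;   // read in the first pass
  std::vector<std::string> payload_columns;  // read in the second pass
};

// Hypothetical helper: split `all_columns` into columns referenced by the
// filter expression and the remaining payload columns, preserving order.
inline column_split split_columns(std::vector<std::string> const& all_columns,
                                  std::unordered_set<std::string> const& filter_refs)
{
  column_split result;
  for (auto const& name : all_columns) {
    if (filter_refs.count(name) > 0) {
      result.filter_columns.push_back(name);
    } else {
      result.payload_columns.push_back(name);
    }
  }
  return result;
}
```

With a highly selective filter, the second pass would only need to touch payload data for the rows that survive the first pass, which is the motivation stated in the class documentation.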
* @brief Internal experimental Parquet reader optimized for highly selective filters (Hybrid Scan
* operation).
*/
class impl;
Forward-declared to hide implementation details.
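For readers unfamiliar with the pattern, a forward-declared `impl` usually goes together with a `std::unique_ptr` member (the pimpl idiom). The sketch below shows the general shape only. `scan_reader` and its members are hypothetical names, not the classes in this PR.

```cpp
#include <memory>

// Hypothetical public class; the header only needs the forward declaration.
class scan_reader {
 public:
  explicit scan_reader(int num_row_groups);
  ~scan_reader();  // must be defined where `impl` is a complete type

  [[nodiscard]] int num_row_groups() const;

 private:
  class impl;                   // forward declaration only
  std::unique_ptr<impl> _impl;  // owns the hidden implementation
};

// Everything below would normally live in a .cpp file.
class scan_reader::impl {
 public:
  explicit impl(int num_row_groups) : _num_row_groups{num_row_groups} {}
  [[nodiscard]] int num_row_groups() const { return _num_row_groups; }

 private:
  int _num_row_groups;
};

scan_reader::scan_reader(int num_row_groups) : _impl{std::make_unique<impl>(num_row_groups)} {}
scan_reader::~scan_reader() = default;
int scan_reader::num_row_groups() const { return _impl->num_row_groups(); }
```

Because callers only see the forward declaration, the implementation can change without touching the public header.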
/**
* @brief Class for parsing dataset metadata
*/
struct metadata : private metadata_base {
Derived from `metadata` in `reader_impl_helpers.hpp`, allowing file metadata construction from footer bytes.
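To illustrate the idea of a privately derived struct that is populated from raw footer bytes, here is a toy sketch. It assumes a made-up 8-byte little-endian "footer" purely for demonstration, whereas real Parquet footers are Thrift-encoded. `footer_metadata` and the `num_rows` field are hypothetical, and `metadata_base` here is only a stand-in for the existing base.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in for the existing base class (hypothetical contents).
struct metadata_base {
  metadata_base() = default;  // lets a child populate the object another way
  std::int64_t num_rows{0};
};

// Hypothetical child: inherits privately and fills the base from raw bytes.
struct footer_metadata : private metadata_base {
  explicit footer_metadata(std::vector<std::uint8_t> const& footer_bytes)
  {
    // Toy format (an assumption, not Parquet): first 8 bytes hold a
    // little-endian row count.
    if (footer_bytes.size() < 8) { throw std::invalid_argument("footer too short"); }
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < 8; ++i) {
      value |= static_cast<std::uint64_t>(footer_bytes[i]) << (8 * i);
    }
    num_rows = static_cast<std::int64_t>(value);
  }

  // Re-expose only what callers need from the private base.
  using metadata_base::num_rows;
};
```

Private inheritance keeps the base's interface out of the child's public API unless a member is explicitly re-exposed.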
};

class aggregate_reader_metadata : public aggregate_reader_metadata_base {
private:
Derived from `aggregate_reader_metadata` in `reader_impl_helpers.hpp`; expands its functionality.
@@ -134,6 +135,7 @@ struct surviving_row_group_metrics {
};

class aggregate_reader_metadata {
protected:
Changed to `protected` so that the child class can access these members.
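As a small illustration of why the access specifier matters, the sketch below shows a derived class reading base-class state that would be unreachable if it were `private`. All class and member names here are hypothetical stand-ins, not the libcudf definitions.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the base class.
class aggregate_metadata_base {
 public:
  explicit aggregate_metadata_base(std::vector<int> rows_per_row_group)
    : _rows_per_row_group{std::move(rows_per_row_group)}
  {
  }

  [[nodiscard]] std::size_t num_row_groups() const { return _rows_per_row_group.size(); }

 protected:  // with `private` here, the child below would not compile
  std::vector<int> _rows_per_row_group;
};

// Hypothetical child that adds functionality on top of the base state.
class hybrid_aggregate_metadata : public aggregate_metadata_base {
 public:
  using aggregate_metadata_base::aggregate_metadata_base;  // inherit constructors

  [[nodiscard]] long total_rows() const
  {
    long total = 0;
    for (auto const rows : _rows_per_row_group) { total += rows; }
    return total;
  }
};
```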
@@ -113,6 +113,7 @@ struct row_group_info {
* @brief Class for parsing dataset metadata
*/
struct metadata : public FileMetaData {
metadata() = default;
Default constructor, so that the child class can construct the metadata from footer bytes.
/**
* @brief Implementation for Parquet reader
*/
class impl {
Mostly identical to the `reader::impl` class in `reader_impl.hpp`.
*
* @return Parquet file footer metadata
*/
[[nodiscard]] cudf::io::parquet::FileMetaData const& get_parquet_metadata() const;
Returns the materialized Parquet file footer, i.e. the `FileMetaData` struct. The footer also holds the materialized `ColumnIndex` and `OffsetIndex` if this API is called after `setup_page_index()`.
rmm::cuda_stream_view stream) const
{
// Temporary vector with row group indices from the first source
auto const input_row_group_indices =
Simply creates a vector of vectors of row group indices (to be expanded for multiple input sources in the future) and passes it on to `_impl`.
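The wrapping step described here can be sketched as follows. This is a minimal illustration, not the PR's code: `wrap_single_source` is a hypothetical helper, and `index_type` is an assumed 32-bit signed index type standing in for whatever the real interface uses.

```cpp
#include <cstdint>
#include <vector>

// Assumption: a 32-bit signed index type, used here only for illustration.
using index_type = std::int32_t;

// Hypothetical helper: wrap the row group indices of a single source into
// the per-source layout (one inner vector per input source).
inline std::vector<std::vector<index_type>> wrap_single_source(
  std::vector<index_type> const& row_group_indices)
{
  std::vector<std::vector<index_type>> per_source;
  per_source.push_back(row_group_indices);  // only one source for now
  return per_source;
}
```

Keeping the nested layout even for one source means the downstream interface would not need to change when multiple input sources are supported.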
@@ -0,0 +1,196 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.
No need to review. Only contains empty definitions for `impl`'s member functions declared in `reader_impl.hpp`.
@@ -0,0 +1,101 @@
/*
No need to review. Only contains empty definitions for `impl`'s member functions declared in `reader_impl.hpp`.
@@ -0,0 +1,155 @@
/*
No need to review. Only contains empty definitions for member functions of `aggregate_reader_metadata` and `metadata` declared in `reader_impl_helpers.hpp`.
@@ -0,0 +1,79 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.
No need to review. Only contains empty definitions of `impl` functions declared in `reader_impl.hpp`.
Closing this PR, as the same changes are being reviewed in #18480.
Description
Contributes to #17896. Part of #18011.
This PR sets up the header files and basic interfaces for the experimental Parquet reader for highly selective hybrid scan queries. The PR also includes empty member function definitions.