Reduce memory usage in writer with more memory efficient output buffer implementation #24913
Description
Currently, ChunkedSliceOutput is used to store compressed output in the writer. It manages a list of buffers whose sizes are powers of 2 (e.g. 8k, 16k, 32k) and reuses those buffers after flushing. This can lead to extra memory usage and OOM because 1) the compressed output size and the buffer size can be mismatched, and 2) by design, buffers are reused rather than freed, so their memory stays allocated.
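To make the size mismatch concrete, here is a simplified model of a buffer list that grows in power-of-2 steps. It is an illustrative sketch only, not the actual ChunkedSliceOutput code: the class name `PowerOfTwoBufferModel`, the method names, and the 8k/32k bounds are assumptions chosen to match the sizes mentioned above.

```java
// Hypothetical model, not the real ChunkedSliceOutput implementation.
final class PowerOfTwoBufferModel {
    // Total bytes allocated to hold `payload` bytes when each new buffer
    // doubles in size, starting at minChunk and capped at maxChunk.
    static long allocatedBytes(long payload, int minChunk, int maxChunk) {
        long allocated = 0;
        int chunk = minChunk;
        while (allocated < payload) {
            allocated += chunk;
            // Use long arithmetic so doubling cannot overflow an int.
            chunk = (int) Math.min((long) chunk * 2, maxChunk);
        }
        return allocated;
    }

    // Bytes that are allocated but hold no data.
    static long unusedBytes(long payload, int minChunk, int maxChunk) {
        return allocatedBytes(payload, minChunk, maxChunk) - payload;
    }

    public static void main(String[] args) {
        int min = 8 * 1024;
        int max = 32 * 1024;
        // 9000 bytes need an 8k and a 16k buffer: 24576 allocated, 15576 unused.
        System.out.println(allocatedBytes(9000, min, max));
        System.out.println(unusedBytes(9000, min, max));
    }
}
```

In this model, a 9000-byte compressed output occupies 24576 bytes of buffers. If the buffers are kept for reuse after flushing, that allocation stays resident even when later outputs are smaller.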
Common scenarios that lead to OOM include
This PR introduces OrcLazyChunkedOutputBuffer, which focuses on avoiding unused memory.
This behavior is controlled by lazyOutputBuffer in OrcWriterOptions, and it's disabled by default.
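As a rough illustration of the idea, the sketch below stores each compressed chunk in an exactly sized array and releases everything on flush. This is an assumption about how a buffer could avoid holding unused memory, not the PR's OrcLazyChunkedOutputBuffer: the class `LazyChunkedBufferSketch` and all of its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the OrcLazyChunkedOutputBuffer from this PR.
final class LazyChunkedBufferSketch {
    private final List<byte[]> chunks = new ArrayList<>();
    private long retainedBytes;

    // Copy exactly `length` bytes, so no spare capacity is allocated.
    void write(byte[] source, int offset, int length) {
        chunks.add(Arrays.copyOfRange(source, offset, offset + length));
        retainedBytes += length;
    }

    // Bytes currently held by buffered chunks.
    long getRetainedBytes() {
        return retainedBytes;
    }

    // Write all chunks to the sink in order, then drop the references
    // so the arrays can be garbage collected instead of being reused.
    void flushTo(ByteArrayOutputStream sink) {
        for (byte[] chunk : chunks) {
            sink.write(chunk, 0, chunk.length);
        }
        chunks.clear();
        retainedBytes = 0;
    }
}
```

The trade-off in this sketch is one allocation and copy per chunk in exchange for retained memory that tracks the compressed size exactly and drops to zero after each flush.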
Impact
Reduce memory usage in writer.
Test Plan
Tested with a Spark workload with high memory usage.
~10% improvement in run time and resource usage (memory reservation time), and reduced GC time.
Tested with a general Spark workload.
No change in CPU time; slight reduction in run time and GC time.
Release Notes
General change
Reduce memory usage in writer with more memory efficient output buffer implementation