Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf: remove clone on uninitiated_partitions in SortPreservingMergeStream #15562

Merged
merged 4 commits into from
Apr 5, 2025

Conversation

rluvaton
Copy link
Contributor

@rluvaton rluvaton commented Apr 3, 2025

Which issue does this PR close?

Related to:

Rationale for this change

I saw in my code that clone and pop front and push takes some of the time from poll_next_inner

This is from my sampling

Before:
image

After:
image

What changes are included in this PR?

  1. Created custom FixedSizeQueue which avoid cloning the vector (could be avoided even without introducing this)
  2. Not removing items instead just playing with indexes

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

rluvaton added 2 commits April 3, 2025 17:13
… custom fixed size queue

this is done to avoid having to clone or change the actual values
@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

I noticed something similar when profiling some click bench queries locally. I am going to take a look at this one

@alamb alamb added the performance Make DataFusion faster label Apr 3, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rluvaton -- this looks neat. I am running some benchmarks now to see if I can notice any improvement or not

I like the idea of using a fixed size queue rather than reallocating a lot

/// ```plain
/// index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
/// | -------------------------------------------------------- |
/// value: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find these examples easier to follow if index and value didn't have the same values. maybe something like

Suggested change
/// value: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
/// value: | 9 | 8 | 7 | 5 | 5 | 4 | 3 | 2 | 1 | 0 |

@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

FYI @jayzhan211

@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

BTW I filed a ticket to track this as I have seen it too

@rluvaton
Copy link
Contributor Author

rluvaton commented Apr 3, 2025

I removed the custom fixed size queue and replaced with matching vec deque functions that are 0-cost as it's just index manipulation like my prev impl

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rluvaton -- I am going to run some benchmarks on this PR but it is looking good to me

@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comparing remove-clone-in-merge-perf to 190634b using Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

🤖: Benchmark completed

Details

Comparing HEAD and remove-clone-in-merge-perf
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ remove-clone-in-merge-perf ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1994.03ms │                  2114.36ms │ 1.06x slower │
│ QQuery 1     │   714.60ms │                   725.59ms │    no change │
│ QQuery 2     │  1479.60ms │                  1491.36ms │    no change │
│ QQuery 3     │   726.84ms │                   724.85ms │    no change │
│ QQuery 4     │  1528.33ms │                  1471.64ms │    no change │
│ QQuery 5     │ 17332.42ms │                 17252.62ms │    no change │
│ QQuery 6     │  6593.65ms │                  6733.66ms │    no change │
└──────────────┴────────────┴────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 30369.47ms │
│ Total Time (remove-clone-in-merge-perf)   │ 30514.09ms │
│ Average Time (HEAD)                       │  4338.50ms │
│ Average Time (remove-clone-in-merge-perf) │  4359.16ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          1 │
│ Queries with No Change                    │          6 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ remove-clone-in-merge-perf ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     1.91ms │                     1.91ms │     no change │
│ QQuery 1     │    34.64ms │                    36.00ms │     no change │
│ QQuery 2     │    94.23ms │                    96.36ms │     no change │
│ QQuery 3     │   103.02ms │                   105.47ms │     no change │
│ QQuery 4     │   724.85ms │                   769.54ms │  1.06x slower │
│ QQuery 5     │   864.21ms │                   871.60ms │     no change │
│ QQuery 6     │    34.85ms │                    38.09ms │  1.09x slower │
│ QQuery 7     │    41.16ms │                    40.01ms │     no change │
│ QQuery 8     │   931.81ms │                   934.06ms │     no change │
│ QQuery 9     │  1250.91ms │                  1237.64ms │     no change │
│ QQuery 10    │   273.90ms │                   274.68ms │     no change │
│ QQuery 11    │   314.50ms │                   316.47ms │     no change │
│ QQuery 12    │   968.37ms │                   970.30ms │     no change │
│ QQuery 13    │  1444.19ms │                  1311.05ms │ +1.10x faster │
│ QQuery 14    │   887.14ms │                   897.88ms │     no change │
│ QQuery 15    │  1084.27ms │                  1094.64ms │     no change │
│ QQuery 16    │  1784.60ms │                  1783.92ms │     no change │
│ QQuery 17    │  1657.46ms │                  1665.42ms │     no change │
│ QQuery 18    │  3160.85ms │                  3196.08ms │     no change │
│ QQuery 19    │    87.76ms │                    88.33ms │     no change │
│ QQuery 20    │  1170.79ms │                  1170.42ms │     no change │
│ QQuery 21    │  1398.07ms │                  1385.79ms │     no change │
│ QQuery 22    │  2546.82ms │                  2526.59ms │     no change │
│ QQuery 23    │  8891.65ms │                  8881.91ms │     no change │
│ QQuery 24    │   492.37ms │                   490.54ms │     no change │
│ QQuery 25    │   409.47ms │                   402.15ms │     no change │
│ QQuery 26    │   549.76ms │                   555.44ms │     no change │
│ QQuery 27    │  1727.32ms │                  1710.49ms │     no change │
│ QQuery 28    │ 12450.52ms │                 12704.98ms │     no change │
│ QQuery 29    │   522.10ms │                   542.05ms │     no change │
│ QQuery 30    │   873.89ms │                   863.31ms │     no change │
│ QQuery 31    │   913.68ms │                   909.05ms │     no change │
│ QQuery 32    │  2806.87ms │                  2821.48ms │     no change │
│ QQuery 33    │  3467.36ms │                  3467.75ms │     no change │
│ QQuery 34    │  3453.66ms │                  3495.88ms │     no change │
│ QQuery 35    │  1310.54ms │                  1309.88ms │     no change │
│ QQuery 36    │   233.77ms │                   233.57ms │     no change │
│ QQuery 37    │    90.88ms │                    88.76ms │     no change │
│ QQuery 38    │   128.11ms │                   126.09ms │     no change │
│ QQuery 39    │   412.72ms │                   428.64ms │     no change │
│ QQuery 40    │    52.51ms │                    53.87ms │     no change │
│ QQuery 41    │    44.35ms │                    43.72ms │     no change │
│ QQuery 42    │    56.83ms │                    53.23ms │ +1.07x faster │
└──────────────┴────────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 59748.64ms │
│ Total Time (remove-clone-in-merge-perf)   │ 59995.04ms │
│ Average Time (HEAD)                       │  1389.50ms │
│ Average Time (remove-clone-in-merge-perf) │  1395.23ms │
│ Queries Faster                            │          2 │
│ Queries Slower                            │          2 │
│ Queries with No Change                    │         39 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ remove-clone-in-merge-perf ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 126.92ms │                   122.43ms │    no change │
│ QQuery 2     │  25.38ms │                    24.44ms │    no change │
│ QQuery 3     │  36.98ms │                    36.69ms │    no change │
│ QQuery 4     │  21.87ms │                    21.46ms │    no change │
│ QQuery 5     │  60.11ms │                    59.56ms │    no change │
│ QQuery 6     │   8.14ms │                     8.08ms │    no change │
│ QQuery 7     │ 108.25ms │                   106.63ms │    no change │
│ QQuery 8     │  26.90ms │                    27.60ms │    no change │
│ QQuery 9     │  64.05ms │                    62.59ms │    no change │
│ QQuery 10    │  61.00ms │                    59.09ms │    no change │
│ QQuery 11    │  13.21ms │                    12.93ms │    no change │
│ QQuery 12    │  39.17ms │                    38.54ms │    no change │
│ QQuery 13    │  30.77ms │                    30.24ms │    no change │
│ QQuery 14    │  10.20ms │                     9.81ms │    no change │
│ QQuery 15    │  25.45ms │                    25.51ms │    no change │
│ QQuery 16    │  23.12ms │                    23.25ms │    no change │
│ QQuery 17    │  98.87ms │                    99.68ms │    no change │
│ QQuery 18    │ 252.14ms │                   249.58ms │    no change │
│ QQuery 19    │  28.62ms │                    30.49ms │ 1.07x slower │
│ QQuery 20    │  40.49ms │                    39.48ms │    no change │
│ QQuery 21    │ 175.57ms │                   175.44ms │    no change │
│ QQuery 22    │  17.54ms │                    17.30ms │    no change │
└──────────────┴──────────┴────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 1294.77ms │
│ Total Time (remove-clone-in-merge-perf)   │ 1280.86ms │
│ Average Time (HEAD)                       │   58.85ms │
│ Average Time (remove-clone-in-merge-perf) │   58.22ms │
│ Queries Faster                            │         0 │
│ Queries Slower                            │         1 │
│ Queries with No Change                    │        21 │
└───────────────────────────────────────────┴───────────┘

@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Comparing remove-clone-in-merge-perf to 190634b using Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

🤖: Benchmark completed

Details

Comparing HEAD and remove-clone-in-merge-perf
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ remove-clone-in-merge-perf ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2012.59ms │                  2224.54ms │ 1.11x slower │
│ QQuery 1     │   705.65ms │                   704.64ms │    no change │
│ QQuery 2     │  1479.89ms │                  1469.61ms │    no change │
│ QQuery 3     │   725.32ms │                   727.35ms │    no change │
│ QQuery 4     │  1476.68ms │                  1527.84ms │    no change │
│ QQuery 5     │ 17149.05ms │                 16940.93ms │    no change │
│ QQuery 6     │  6820.99ms │                  6536.01ms │    no change │
└──────────────┴────────────┴────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 30370.18ms │
│ Total Time (remove-clone-in-merge-perf)   │ 30130.94ms │
│ Average Time (HEAD)                       │  4338.60ms │
│ Average Time (remove-clone-in-merge-perf) │  4304.42ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          1 │
│ Queries with No Change                    │          6 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ remove-clone-in-merge-perf ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │     1.98ms │                     2.36ms │ 1.20x slower │
│ QQuery 1     │    35.90ms │                    35.63ms │    no change │
│ QQuery 2     │    94.86ms │                    93.34ms │    no change │
│ QQuery 3     │   100.90ms │                    99.66ms │    no change │
│ QQuery 4     │   779.52ms │                   781.17ms │    no change │
│ QQuery 5     │   885.45ms │                   869.92ms │    no change │
│ QQuery 6     │    33.42ms │                    34.61ms │    no change │
│ QQuery 7     │    39.17ms │                    38.60ms │    no change │
│ QQuery 8     │   946.06ms │                   937.39ms │    no change │
│ QQuery 9     │  1231.08ms │                  1218.05ms │    no change │
│ QQuery 10    │   278.58ms │                   284.20ms │    no change │
│ QQuery 11    │   316.25ms │                   329.94ms │    no change │
│ QQuery 12    │   983.88ms │                   949.08ms │    no change │
│ QQuery 13    │  1327.36ms │                  1406.21ms │ 1.06x slower │
│ QQuery 14    │   866.82ms │                   880.91ms │    no change │
│ QQuery 15    │  1085.62ms │                  1064.39ms │    no change │
│ QQuery 16    │  1851.07ms │                  1785.02ms │    no change │
│ QQuery 17    │  1659.18ms │                  1683.25ms │    no change │
│ QQuery 18    │  3161.02ms │                  3172.99ms │    no change │
│ QQuery 19    │    89.02ms │                    87.60ms │    no change │
│ QQuery 20    │  1160.48ms │                  1162.10ms │    no change │
│ QQuery 21    │  1388.23ms │                  1371.16ms │    no change │
│ QQuery 22    │  2528.81ms │                  2501.21ms │    no change │
│ QQuery 23    │  8890.03ms │                  8804.63ms │    no change │
│ QQuery 24    │   487.33ms │                   492.84ms │    no change │
│ QQuery 25    │   411.86ms │                   404.45ms │    no change │
│ QQuery 26    │   556.26ms │                   553.16ms │    no change │
│ QQuery 27    │  1731.10ms │                  1697.32ms │    no change │
│ QQuery 28    │ 12539.36ms │                 12831.34ms │    no change │
│ QQuery 29    │   528.63ms │                   533.84ms │    no change │
│ QQuery 30    │   868.89ms │                   845.91ms │    no change │
│ QQuery 31    │   891.14ms │                   902.47ms │    no change │
│ QQuery 32    │  2817.96ms │                  2819.24ms │    no change │
│ QQuery 33    │  3442.76ms │                  3445.81ms │    no change │
│ QQuery 34    │  3455.33ms │                  3452.43ms │    no change │
│ QQuery 35    │  1299.95ms │                  1338.68ms │    no change │
│ QQuery 36    │   233.57ms │                   225.67ms │    no change │
│ QQuery 37    │    88.45ms │                    89.81ms │    no change │
│ QQuery 38    │   128.21ms │                   132.65ms │    no change │
│ QQuery 39    │   401.18ms │                   401.34ms │    no change │
│ QQuery 40    │    50.02ms │                    49.04ms │    no change │
│ QQuery 41    │    44.33ms │                    44.90ms │    no change │
│ QQuery 42    │    53.24ms │                    54.78ms │    no change │
└──────────────┴────────────┴────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 59764.23ms │
│ Total Time (remove-clone-in-merge-perf)   │ 59909.09ms │
│ Average Time (HEAD)                       │  1389.87ms │
│ Average Time (remove-clone-in-merge-perf) │  1393.23ms │
│ Queries Faster                            │          0 │
│ Queries Slower                            │          2 │
│ Queries with No Change                    │         41 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ remove-clone-in-merge-perf ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 122.26ms │                   121.81ms │     no change │
│ QQuery 2     │  24.99ms │                    23.84ms │     no change │
│ QQuery 3     │  35.84ms │                    36.19ms │     no change │
│ QQuery 4     │  21.82ms │                    21.61ms │     no change │
│ QQuery 5     │  57.49ms │                    56.90ms │     no change │
│ QQuery 6     │   8.25ms │                     8.36ms │     no change │
│ QQuery 7     │ 107.72ms │                   107.71ms │     no change │
│ QQuery 8     │  27.40ms │                    27.66ms │     no change │
│ QQuery 9     │  62.50ms │                    63.66ms │     no change │
│ QQuery 10    │  58.45ms │                    61.71ms │  1.06x slower │
│ QQuery 11    │  13.38ms │                    13.28ms │     no change │
│ QQuery 12    │  39.56ms │                    37.43ms │ +1.06x faster │
│ QQuery 13    │  29.90ms │                    30.27ms │     no change │
│ QQuery 14    │   9.94ms │                    10.14ms │     no change │
│ QQuery 15    │  25.68ms │                    26.16ms │     no change │
│ QQuery 16    │  23.97ms │                    24.27ms │     no change │
│ QQuery 17    │  99.55ms │                    96.84ms │     no change │
│ QQuery 18    │ 252.95ms │                   253.93ms │     no change │
│ QQuery 19    │  30.48ms │                    29.15ms │     no change │
│ QQuery 20    │  40.16ms │                    41.20ms │     no change │
│ QQuery 21    │ 182.38ms │                   179.11ms │     no change │
│ QQuery 22    │  17.33ms │                    17.31ms │     no change │
└──────────────┴──────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 1291.99ms │
│ Total Time (remove-clone-in-merge-perf)   │ 1288.55ms │
│ Average Time (HEAD)                       │   58.73ms │
│ Average Time (remove-clone-in-merge-perf) │   58.57ms │
│ Queries Faster                            │         1 │
│ Queries Slower                            │         1 │
│ Queries with No Change                    │        20 │
└───────────────────────────────────────────┴───────────┘

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @rluvaton

Unfortunately I didn't see much improvement in my local benchmark runs, but I think this is an improvement regardless

@Dandandan
Copy link
Contributor

Interesting - I wonder if this is something related to how sampling is done with samply (which I learned about from @comphead, thanks). I can confirm it shows up as "very heavy" with TPC-H benchmarks as well (just the first queries I test), but I believe in earlier profiling it didn't really show up. I'll compare some profiling with cargo flamegraph to confirm this is the case.

@Dandandan
Copy link
Contributor

Dandandan commented Apr 4, 2025

My current intuition about it is twofold:

  • It doesn't show up as clearly in a normal flamegraph as it doesn't show you the profile per thread but only the "overall" samples, so the percentage will be divided by the number of total samples (e.g. 2% instead of 20%)
  • SortPreservingMergeStream polls the required input streams - so whenever all the child streams don't produce any new batches it will keep on polling - therefore self.uninitiated_partitions.clone() comes up very often in sample as it is a relatively expensive part of the "polling" loop. Whenever there is useful work todo, self.uninitiated_partitions.clone() is not a bottleneck, processing the batches is much more expensive. Optimizing it will mainly move the CPU power / samples to other parts of the polling loop (it can slightly more quickly poll the streams).

@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

My current intuition about it is twofold:

I agree with this assesment -- so we save CPU work with this change but the total query time doesn't really decrease because we are not fully using all of the cores

I also have found Samply super awesome -- thank you @comphead for showing that

Something else I have observed when looking at Samply is that on my laptop at least, it appears there are several times where processing stalls due to parsing parquet metadata:

Screenshot 2025-04-04 at 4 47 14 PM

I was running this

./datafusion-cli-filter-pushdown -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM hits WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

It would be an interesting experiment to see if we can improve performance by caching the metadata.

I will file a ticket to investigate this

@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

I will file a ticket to investigate this

@Dandandan Dandandan changed the title perf: replace merge uninitiated_partitions VecDeque<usize> with custom fixed size queue perf: repl merge uninitiated_partitions VecDeque<usize> Apr 5, 2025
@Dandandan Dandandan changed the title perf: repl merge uninitiated_partitions VecDeque<usize> Perf: remove clone on uninitiated_partitions in SortPreservingMergeStream Apr 5, 2025
@Dandandan Dandandan merged commit 80f0489 into apache:main Apr 5, 2025
29 checks passed
@Dandandan
Copy link
Contributor

Thanks @rluvaton

@rluvaton rluvaton deleted the remove-clone-in-merge-perf branch April 6, 2025 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Make DataFusion faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve time for SortPreservingMerge stream / uninitiated_partitions VecDeque<usize>
4 participants