-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Clickbench Q29 5x faster for datafusion #15524
Comments
Clickhouse also has the optimization: |
take |
Can anyone guide me how to do this in datafusion, i am not familiar with the rewrite now. So need some code reference. cc @alamb @jonahgao @jayzhan211 |
You can implement this in rewrite rule datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs Lines 730 to 748 in 507f6b6
if there is |
Thank you @jayzhan211 for the guide, i will try this! |
Nice idea! |
As long as this optimization is plausibly general purpose (not just for ClickBench) and the code required is isolated / doesn't add a bunch of complexity, I am in support of this optimization Thank you @zhuqi-lucas |
I found a duckdb implementation of a seemingling similar optimization: https://github.com/duckdb/duckdb/blob/7912713493b38b1eda162f29b7759d5024989a5f/src/optimizer/sum_rewriter.cpp#L25 |
Thank you @Dandandan @alamb for double check and confirm! |
I found the rationale behind adding it in DuckDB was also ClickBench.
😆 So we're now officially benchmaxxing! |
I'm wondering if instead of this specialized
We can do instead the more general (distributive property of summation - doesn't need to be a constant):
And have other optimizations follow (transform |
The nice part of that is that addition in |
Thank you @Dandandan for this good idea, i agree with you, it's a more common solution and it will benefit more cases. And meanwhile, may be we can do more cases in future, such as when MAX(x) > 0: MAX(5 * x) => 5 * MAX(x) etc |
I haven't reviewed the PR yet, but I agree with @Dandandan, and I think we can improve this further. We've actually thought about this issue before and sketched out an initial design. Let me share some notes from that: This simplification should be based on the linearity property, not just f(x + y) = f(x) + f(y), if f is a linear function. So, I believe we should define a Consider the same query: SELECT SUM(id), SUM(id + 1), SUM(id + 2), ..., SUM(id + 89) FROM employees; LP:
PP:
We should apply the linearity property here to simplify expressions like SUM(id + n) into SUM(id) + n * COUNT(1), when n is a constant. It doesn't effect the performance of this clickbench query, but we should also properly handle the cases when n is not constant as well. |
Thank you @berkaysynnada, i agree it's a common linearity property, this is a great idea. I will try to address it, and may be we can start from SUM function. And add more cases to extend it in future, such as add linearity enum and add SUM, etc to it. |
It looks like there were several follow up tickets in DuckDB (no test changes which is weird) |
Try to address the comments for sum(3a + 2b) => 3sum(a) + 2sum(b), but it's more complex than i expected, and more corner cases and testing failed, i am looking into and try to fix it. |
Need help, i am not sure if i do the right direction for this ticket: I am still debugging my draft PR, if folks can cooperate, feel free to add a fix to my PR because it will meet more corner cases i believe, and i am still not fixing the existing error: More problems need to be fixed based on my PR, for example:
logical_plan after simplify_expressions
01)Projection: sum(simple_explain_test.a + simple_explain_test.b)
02)--Aggregate: groupBy=[[]], aggr=[[sum(simple_explain_test.a) + sum(simple_explain_test.b) AS sum(simple_explain_test.a + simple_explain_test.b)]]
03)----TableScan: simple_explain_test logical_plan
01)Aggregate: groupBy=[[]], aggr=[[sum(simple_explain_test.a) + sum(simple_explain_test.b) AS sum(simple_explain_test.a + simple_explain_test.b)]]
02)--TableScan: simple_explain_test projection=[a, b]
physical_plan_error
01)Internal error: Invalid aggregate expression 'BinaryExpr(BinaryExpr { left: AggregateFunction(AggregateFunction { func: AggregateUDF { inner: Sum { signature: Signature { type_signature: UserDefined, volatility: Immutable } } }, params: AggregateFunctionParams { args: [Column(Column { relation: Some(Bare { table: "simple_explain_test" }), name: "a" })], distinct: false, filter: None, order_by: None, null_treatment: None } }), op: Plus, right: AggregateFunction(AggregateFunction { func: AggregateUDF { inner: Sum { signature: Signature { type_signature: UserDefined, volatility: Immutable } } }, params: AggregateFunctionParams { args: [Column(Column { relation: Some(Bare { table: "simple_explain_test" }), name: "b" })], distinct: false, filter: None, order_by: None, null_treatment: None } }) })'.
02)This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
@alamb
|
So in my opinion before we get too fancy we should figure out what optimizaton we are really trying to do I understand it may be possible to do a more general rewrite such as @berkaysynnada suggests above #15524 (comment) However, I think it would help to have an actual usecase / real query. The only usecase I know of is the click bench query SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), SUM(ResolutionWidth + 3), SUM(ResolutionWidth + 4), SUM(ResolutionWidth + 5), SUM(ResolutionWidth + 6), SUM(ResolutionWidth + 7), SUM(ResolutionWidth + 8), SUM(ResolutionWidth + 9), SUM(ResolutionWidth + 10), SUM(ResolutionWidth + 11), SUM(ResolutionWidth + 12), SUM(ResolutionWidth + 13), SUM(ResolutionWidth + 14), SUM(ResolutionWidth + 15), SUM(ResolutionWidth + 16), SUM(ResolutionWidth + 17), SUM(ResolutionWidth + 18), SUM(ResolutionWidth + 19), SUM(ResolutionWidth + 20), SUM(ResolutionWidth + 21), SUM(ResolutionWidth + 22), SUM(ResolutionWidth + 23), SUM(ResolutionWidth + 24), SUM(ResolutionWidth + 25), SUM(ResolutionWidth + 26), SUM(ResolutionWidth + 27), SUM(ResolutionWidth + 28), SUM(ResolutionWidth + 29), SUM(ResolutionWidth + 30), SUM(ResolutionWidth + 31), SUM(ResolutionWidth + 32), SUM(ResolutionWidth + 33), SUM(ResolutionWidth + 34), SUM(ResolutionWidth + 35), SUM(ResolutionWidth + 36), SUM(ResolutionWidth + 37), SUM(ResolutionWidth + 38), SUM(ResolutionWidth + 39), SUM(ResolutionWidth + 40), SUM(ResolutionWidth + 41), SUM(ResolutionWidth + 42), SUM(ResolutionWidth + 43), SUM(ResolutionWidth + 44), SUM(ResolutionWidth + 45), SUM(ResolutionWidth + 46), SUM(ResolutionWidth + 47), SUM(ResolutionWidth + 48), SUM(ResolutionWidth + 49), SUM(ResolutionWidth + 50), SUM(ResolutionWidth + 51), SUM(ResolutionWidth + 52), SUM(ResolutionWidth + 53), SUM(ResolutionWidth + 54), SUM(ResolutionWidth + 55), SUM(ResolutionWidth + 56), SUM(ResolutionWidth + 57), SUM(ResolutionWidth + 58), SUM(ResolutionWidth + 59), SUM(ResolutionWidth + 60), SUM(ResolutionWidth + 61), SUM(ResolutionWidth + 62), SUM(ResolutionWidth + 63), SUM(ResolutionWidth + 64), SUM(ResolutionWidth + 65), SUM(ResolutionWidth + 66), SUM(ResolutionWidth + 67), SUM(ResolutionWidth + 68), SUM(ResolutionWidth + 69), SUM(ResolutionWidth + 70), SUM(ResolutionWidth + 71), SUM(ResolutionWidth + 72), SUM(ResolutionWidth + 73), SUM(ResolutionWidth + 74), SUM(ResolutionWidth + 75), SUM(ResolutionWidth + 76), SUM(ResolutionWidth + 77), SUM(ResolutionWidth + 78), SUM(ResolutionWidth + 79), SUM(ResolutionWidth + 80), SUM(ResolutionWidth + 81), SUM(ResolutionWidth + 82), SUM(ResolutionWidth + 83), SUM(ResolutionWidth + 84), SUM(ResolutionWidth + 85), SUM(ResolutionWidth + 86), SUM(ResolutionWidth + 87), SUM(ResolutionWidth + 88), SUM(ResolutionWidth + 89) FROM hits; I have never seen such a query elsewhere and I would struggle to explain to someone in human language what that was calculating and why it was written that way Thus for this one, I would personally suggest either:
Alternately we can spend some time trying to do something more general, but I probably don't have much time to help here as I don't think it will be widely applicable |
Is your feature request related to a problem or challenge?
Our datafusion is 5x slower than duckdb for q29, it's easy for us to optimize to 5x faster, here is the try:
Extraction of Constants in Multiple AGG Calls
In ClickBench, some SQL queries can be optimized using RBO (Rule-Based Optimization) without changing semantics. For example, Q29 computes
SUM(ResolutionWidth + constant)
90 times, requiring 90 columns in execution. Using the distributive property, we can rewrite it as:Before Optimization
After Optimization
This reduces redundant computations and improves execution efficiency.
Testing result:
Before rewrite:
After rewrite:
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: