Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce support for parsing common JSON operators #10981

Closed
wants to merge 1 commit into from

Conversation

dmontagu
Copy link

Opening this PR at this (early) stage primarily for the sake of getting initial feedback on its viability; @samuelcolvin has additional context.

The idea is that we want to experiment with rewriting operators to use the JSON functions in https://github.com/datafusion-contrib/datafusion-functions-json, but it would help if datafusion supported parsing the relevant operators. We can fork datafusion to add this support if necessary, but even if that's the case I was hoping to get some guidance about the implementation, and in particular feedback on whether the path this is going down is "wrong" or has a chance of getting merged eventually as the JSON functions stuff gets more mature.

Which issue does this PR close?

Related to #7845, though it does not close that issue.

Rationale for this change

We are using https://github.com/datafusion-contrib/datafusion-functions-json, but the problem is that, as far as I can tell, there is not a way to add support for new operators without modifying the SQL parsing in the main datafusion crate. We want to add support for the common JSON operators (at least ? and ->/->>, if not the others used by Postgres), and are happy to try those out ourselves, but it would be nice if parsing support was included in the datafusion crate even if the new operators remain completely unused.

The good news is that the various operators can already be parsed by the sqlparser crate so adding support is mostly mechanical, other than determining their signatures (which admittedly may pose a problem/debate).

What changes are included in this PR?

This PR currently just adds support for parsing the ? (Question), -> (Arrow), and ->> (LongArrow) operators from sqlparser::ast::BinaryOperator.

I'll note that Postgres also has #> (HashArrow), #>> (HashLongArrow), #- (HashMinus), @? (AtQuestion), ?& (QuestionAnd), ?| (QuestionPipe), @@ (AtAt), as established JSON operators that it might be nice to add support for at the same time, considering these are already cases in the sqlparser::ast::BinaryOperator enum. I would be happy to also add support for some/all of those as well if desirable.

Are these changes tested?

Not yet, I'll be happy to add some tests if this has a chance of being accepted.

Are there any user-facing changes?

This PR adds the ability to parse these operators. For most/all of them, we'll presumably want any actual use to produce errors if the operator isn't involved in a user-defined rewrite. Right now, given our interest in using them for JSON, I think we'd like to support Utf8/LargeUtf8 as inputs where a JSON value would be expected, but I'm not sure if that makes sense if we want to introduce support for other types.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions substrait labels Jun 18, 2024
@alamb
Copy link
Contributor

alamb commented Jun 18, 2024

We are using https://github.com/datafusion-contrib/datafusion-functions-json, but the problem is that, as far as I can tell, there is not a way to add support for new operators without modifying the SQL parsing in the main datafusion crate. We want to add support for the common JSON operators (at least ? and ->/->>, if not the others used by Postgres), and are happy to try those out ourselves, but it would be nice if parsing support was included in the datafusion crate even if the new operators remain completely unused.

One way might be to use functions (rather than operators) to represent this

So for example, the planner might convert col->foo into a function call like fieldAccess(col, 'foo')

Edit; though I can see it is not obvious how to do this in the existing planner 🤔

@samuelcolvin
Copy link
Contributor

@alamb, yup we'd be very happy to do this via query rewriting (either in datafusion-functions-json, another open source package, or our application).

The problem is AFAIK right now an error is raised when the operator (e.g. ->>) is encountered, before register_function_rewrite gets to replace it with a function.

Hence we just need the operator to survive in the AST long enough for us to replace it.

@dmontagu
Copy link
Author

dmontagu commented Jun 18, 2024

I imagined that the implementation path would be something similar to how ArrayFunctionRewriter in datafusion/functions-array/src/rewrite.rs handles the AtArrow operator.

If there's a way to do the rewriting that doesn't require changes to the SQL parsing to support the additional operators, happy to stick to that for now, we just didn't see a way to do that. (We hit the error produced by this line

_ => not_impl_err!("Unsupported SQL binary operator {op:?}"),
)

(Sorry for confusion if I'm misunderstanding your comment!)

@alamb
Copy link
Contributor

alamb commented Jun 19, 2024

If there's a way to do the rewriting that doesn't require changes to the SQL parsing to support the additional operators, happy to stick to that for now, we just didn't see a way to do that. (We hit the error produced by this line

I agree - If there is no other way to support the syntax feature we should add it to the core.

Another idea would be to add an API to the planner that allows planning custom operators that is called before the built in rules. That API might take a little longer to get in, but it would be much more flexible

@samuelcolvin
Copy link
Contributor

Another idea would be to add an API to the planner that allows planning custom operators that is called before the built in rules. That API might take a little longer to get in, but it would be much more flexible

I hope #11132 is the first step towards implementing this.

@samuelcolvin
Copy link
Contributor

See #11137 which should fully replace this.

@dmontagu
Copy link
Author

Closing in favor of #11137

@dmontagu dmontagu closed this Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions sql SQL Planner substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants