Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48700][SQL] Mode expression for complex types (all collations) #47154

Closed
wants to merge 27 commits into from

Conversation

GideonPotok
Copy link
Contributor

@GideonPotok GideonPotok commented Jun 30, 2024

What changes were proposed in this pull request?

Add support for complex types with subfields that are collated strings, for the mode operator.

Why are the changes needed?

Full support for collations as per SPARK-48700

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Unit tests only, so far.

Was this patch authored or co-authored using generative AI tooling?

No.

@GideonPotok GideonPotok changed the title complex [WIP] Jun 30, 2024
@github-actions github-actions bot added the SQL label Jun 30, 2024
@GideonPotok GideonPotok changed the title [WIP] [WIP] [SPARK-46837] [SPARK-48700] Mode expression for complex types (all collations) Jul 2, 2024
@GideonPotok GideonPotok changed the title [WIP] [SPARK-46837] [SPARK-48700] Mode expression for complex types (all collations) [SPARK-46837] [SPARK-48700] Mode expression for complex types (all collations) Jul 9, 2024
@GideonPotok GideonPotok marked this pull request as ready for review July 10, 2024 02:35
@GideonPotok
Copy link
Contributor Author

@uros-db This is ready for initial review. Let me know what you think. I am working on simplifying the implementation -- suggestions for how to do so are welcome. All suggestions welcome.

@GideonPotok GideonPotok changed the title [SPARK-46837] [SPARK-48700] Mode expression for complex types (all collations) [SPARK-48700] [SQL] Mode expression for complex types (all collations) Jul 10, 2024
@GideonPotok
Copy link
Contributor Author

@uros-db

@GideonPotok GideonPotok requested a review from uros-db July 15, 2024 12:58
@GideonPotok GideonPotok requested a review from uros-db July 23, 2024 18:52
@GideonPotok GideonPotok force-pushed the collationmodecomplex branch 2 times, most recently from 44b78f9 to 9504ba6 Compare July 30, 2024 20:02
@GideonPotok
Copy link
Contributor Author

@MaxGekk so sorry about that! New computer, thus default IntelliJ settings for auto formatting. But it is all fixed now.

Do you have some default IDE settings in a file form I could import into mine, for next time? If not I will adjust the settings manually before creating my next spark PR.

CollationFactory.getCollationKey(data.asInstanceOf[UTF8String], st.collationId)
case _ =>
throw new SparkUnsupportedOperationException(
"UNSUPPORTED_MODE_DATA_TYPE",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNSUPPORTED_MODE_DATA_TYPE is a sub-condition of DATATYPE_MISMATCH. You cannot refers to it in this way. Could you add a test for this case, please.

BTW, if you cannot reproduce the error via public API, we should consider to convert it to an internal error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk For the test, I suppose the situation where a type is not binary stable and is not covered by checkInputDataTypes would include some UDTs? checkInputType just confirms it is not a MapType (a blacklist/blocklist), whereas collationAwareTransform is an allowlist/whitelist).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk @uros-db I am having a lot of trouble with this one! I implemented it as COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.NO_INPUT, and plan to create a different subclass that represents this situation (Maybe BAD_INPUT), once I just have it working. But When I throw the following:

SparkUnsupportedOperationException(
          errorClass = "COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.NO_INPUT",

I end up getting:

// org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot
        // find sub error class 'COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.NO_INPUT' SQLSTATE: XX000

Yet if I do

SparkUnsupportedOperationException(
          errorClass = "COMPLEX_EXPRESSION_UNSUPPORTED_INPUT",

Then org.apache.spark.ErrorClassesJsonReader#getMessageTemplate fails during the assertion assert(errorInfo.subClass.isDefined == subErrorClass.isDefined)

As the subClass is missing.

Would either of you be able to tell me if this is the right pattern (eg. maybe it should be ComplexExpressionException("BAD_INPUT"), not sure what is the "old" pattern and which is the preferred one).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk please re-review

@GideonPotok
Copy link
Contributor Author

@MaxGekk please review (new error condition and test is up and running as was requested -- I can change the name and details of the new error condition, now that it and the test are all set up, just let me know if it needs a rename)

@MaxGekk
Copy link
Member

MaxGekk commented Oct 1, 2024

@GideonPotok Thank you for the ping. I will review this PR soon.

@MaxGekk MaxGekk changed the title [SPARK-48700] [SQL] Mode expression for complex types (all collations) [SPARK-48700][SQL] Mode expression for complex types (all collations) Oct 1, 2024
@MaxGekk
Copy link
Member

MaxGekk commented Oct 1, 2024

+1, LGTM. Merging to master.
Thank you, @GideonPotok and @uros-db for review.

@MaxGekk MaxGekk closed this in 97e9bb3 Oct 1, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

Add support for complex types with subfields that are collated strings, for the mode operator.

### Why are the changes needed?

Full support for collations as per SPARK-48700

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Unit tests only, so far.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47154 from GideonPotok/collationmodecomplex.

Lead-authored-by: Gideon P <[email protected]>
Co-authored-by: Gideon Potok <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants